D-Net: Siamese based Network with Mutual Attention for Volume Alignment

01/25/2021 ∙ by Jian-Qing Zheng, et al. ∙ University of Oxford 0

Alignment of contrast and non-contrast-enhanced imaging is essential for the quantification of changes in several biomedical applications. In particular, the extraction of cartilage shape from contrast-enhanced Computed Tomography (CT) of tibiae requires accurate alignment of the bone, currently performed manually. Existing deep learning-based methods for alignment require a common template or are limited in rotation range. Therefore, we present a novel network, D-net, to estimate arbitrary rotation and translation between 3D CT scans that additionally does not require a prior standard template. D-net is an extension to the branched Siamese encoder-decoder structure connected by new mutual non-local links, which efficiently capture long-range connections of similar features between two branches. The 3D supervised network is trained and validated using preclinical CT scans of mouse tibiae with and without contrast enhancement in cartilage. The presented results show a significant improvement in the estimation of CT alignment, outperforming the current comparable methods.



There are no comments yet.


page 6

page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

It is currently impossible to accurately quantify the damage to cartilage during the progression of disease in small animal models of osteoarthritis. Visualisation of cartilage in Computed Tomography (CT) requires a contrast agent. The preclinical development of such a contrast agent [10] has highlighted the problem of accurate cartilage shape extraction from contrasted images; the partial volume effect and adjacency to bone necessitates the use of pre-and post-contrasted CT images. In preclinical scanners, the animal can be placed in various unsystematic (i.e. arbitrary) positions during the acquisition. Tibial cartilage shape may be extracted from the contrast enhanced image by subtracting the non-contrasted scan but this requires accurate alignment of the tibial bone. However, current semi-accurate manual alignment using ImageJ requires over 1 hour and is prone to error, calling for an automated and accurate method to estimate rigid transformation between 3D volumes acquired in the preclinical setup.

The standardised protocols for image acquisition in clinical scanners means that the range of rotation and translation required to register scans are small and the bigger challenge is to perform deformable registration, especially in between image modalities [15]. In pre-clinical studies, protocols are usually study and machine specific. The limbs of mice are particularly challenging as they may be extended or tucked, dependent on posture (prone, supine and on the side). Post-mortem ex vivo tissue may also be scanned in fixative solution, which increases the variability of orientations and positions of scans. Our initial dataset is comprised of such ex vivo tissue which will later be used to validate the in vivo scans. Thus estimation of large-range rigid transformations is required.

Classic approaches to preclinical image alignment [1] used state-of-the-art, iterative image registration with a similarity measure capturing intensity changes caused by contrast and an appropriate transformation model. However, such approaches are easily trapped in local minimum especially when large translation or rotation is present. More recently, deep learning approaches [16],[9],[11] were employed to improve the performance of iterative image registration algorithms, however, the slow performance and dependency on initialization motivates one-step transformation estimation via regression [5, 6]. For example, two-branch Siamese Encoder (SE) used to learn similarity measure between two images, was applied to 2D brain images alignment [17]

. A convolution neural network called AIRNet

[2] was used for affine registration of 3D brain Magnetic Resonance Imaging (MRI) with dense convolution layers as SE. The SE structure was also used within the framework of deformable image registration in [3, 18] to estimate an initial, affine transformation between two volumes. Alternatively, affine transformation can be estimated using the Global-net [7] with the input images being concatenated and fed into an one-branch encoder. Despite the success of the previous approaches, the capture range of rotation is heavily limited between 15 [17] and 45.84 (0.8 rad) [2], yielding unsatisfactory results in preclinical imaging acquisition setup (shown in Sec. 3

). The 3D pose estimation of arbitrary oriented subject was presented in

[14] but it requires a prior standard template, which is not available for preclinical cartilage imaging.

In this paper, a new architecture, D-net, is proposed for estimation of arbitrary rigid transformation based on a Siamese Encoder Decoder (SED) with novel Mutual Non-local Links (MNL) between two Siamese branches, as described in Sec. 2.1 and Sec. 2.2. Data collection and experiment design are described in Sec. 2.3 and Sec. 2.4 respectively. Experimental results are shown in Sec. 3, discussed and concluded in Sec. 4.

The contributions of this work are as follows. We propose a new network with SED used for first time for rigid registration and we present a concept of MNL showing significantly improved performance on 3D volume alignment. Our network achieves consistent accuracy for wide range of volume orientations apparent in challenging preclinical data set while it does not require prior atlas or template.

2 Methodology

The objective of 3D image registration is to estimate the transformation between a fixed volume and a moving volume , where , and are the thickness, height, and width. For 3D rigid registration, the transformation , consists of rotation and translation , with the parameters including and for translation and rotation. The task of the networks in registration is to estimate from the two preprocessed volumes and by networks’ mapping , where are the parameters estimated by networks.

As is redundant for rotation, the 3D orthogonalization mapping of 6D rotation representation [20] is used as , calculated by:



denotes a column vector consist of

, denotes a Euclidean normalization function, denotes a determinant calculation, e

is a vector of the 3 canonical basis vectors of the 3D Euclidean space. This mapping keeps the continuous representation of 3D rotation and is equivalent to Gram-Schmidt process for a rotation in right handed coordinate system but just requires 6 input values.

Thus the rigid transformation can be estimated by: .

Figure 1: The architecture of (a) D-net, (b) the novel Mutually non-local linked Res-down block, (c) Res-down block, and (d) Res-up block.

2.1 D-net architecture

D-net consists of SE part, decoder part, and regression part, and its schematic architecture is shown in Fig. 1(a). Similar structure of SED was applied to segmentation [8] and tracking [4], but with different connection structure between contracting and expansive parts comparing to D-net. The SE in the D-net includes two branches of six Residual-down-sampling (Res-down) blocks with shared parameters. Four pairs of the Res-down blocks are linked by MNL and detailed in the Fig. 1(b). In MNL, two matching matrices, from left branch to right and the inverse, are computed by dot product of each pair of voxels’ feature vectors. The matrices from two branches are normalized via softmax to correspond and connect the voxels of the feature maps between two branches. MNL, therefore, captures the long-range connection of similar high and low level features between two branches.

The details of the Res-down blocks are illustrated in the Fig. 1(c). The decoder part of D-net includes four Residual-up-sampling (Res-up) blocks receiving skip connections from the corresponding Mutual Non-local Linked Res-down blocks, shown in the Fig. 1

(d). The regression part of D-net includes two fully connected layers with 128 and 12 neurons for 12 transformation parameters. In Fig. 

1, is the block number, , , and denote the sequences of thickness, heights, widths, and channel number of input volume and feature maps for each branch respectively.

2.2 Mutual Non-local Link

An approach with non-local links was presented in a classic image registration, where non-local motion was estimated using a range of spatial scales, naturally captured by graph representation [13]. Similarly, a concept of unique matching between a pair of voxels by weighting function and mutual saliency was previously shown in [12]. Here, the mutual attention mechanism is proposed with the deep-learning design of MNL inspired by the Self Non-local Link (SNL) on one branch proposed in [19]. In this section we provide the general definition of MNL:


where are the indices of the position in a feature map, are the input signals from two branches, are the output signals from this block, is the similarity measurement function, is a unary function.

The instantiated MNL in D-net is based on embedded Gaussian similarity representation with




, where W is a matrix of trainable weights.

2.3 Data collection and pre-processing

A total of 100 ex-vivo micro CT scans of tibiae from 50 mice were acquired using , Quantum FX with a resolution of /vox, volume size of . Scans varied from week to weeks post osteoarthritic surgery. Each tibiae was scanned once pre-contrast and once post-contrast with washout. Due to the handling, the muscles, solution and small broken bone fragments are displaced differently from the tibial bone.

To remove the possible influence of muscle and solution on alignment of tibial bones, we preprocessed the collected data by thresholding and normalizing with mapping by:


, where is the entry of , is the maximum intensity in the dataset and

is the threshold value set as 2000 because of the high density of background solution. Finally, because of data size, the input volumes are sub-sampled with linear interpolation to

/vox. The CT slices of two exemplar subjects are shown in Fig. 2.

Figure 2: The tibial CT slices (at original resolution) of two exemplar subjects show the large ranges of spatial transformation between fixed and moving volumes.

In the training dataset, each CT volume is transformed to synthesize the fixed and the moving volumes with uniformly distributed random translation

mm in 3D, varying by of the whole volume size, and random rotation with angle around random axes uniformly distributed in 3D sphere surface. To enlarge the training dataset, data augmentation including intensity scale transforming and Gaussian noise is applied, where the intensity scale coefficient is

and each voxel is added with random variable


2.4 Training and validation

The loss function in terms of

and is calculated as:


where is Euclidean norm, the both weights of relative translation error and rotation error are set as , and to avoid singularity. Momentum Stochastic Gradient Descend was applied with the learning rate 0.0001 and the learning momentum 0.9.

We split our CT data set into two folders (A and B) each containing pairs of contrast and non contrast-enhanced CTs from the same mouse, and two-fold cross validation was performed on different mouse. Furthermore, we performed four validation strategies: (S1) training on all data from folder A, and testing on B; (S2) training on all data from B, testing on A; (S3) training on non contrast-enhanced data from A, and testing on all data from B; (S4) training on non contrast-enhanced data from B, and testing on all data from A. Training on non contrast-enhanced data was performed to check whether our network can be used for alignment of follow-up contrast-enhanced data, when only the baseline data are available for training, thus modeling real scenario of data acquisition. For the validation strategy, known transformation (as described in Sec 2.3) was applied to volume from the folder to create a pair of CTs in synthetic test, and in real test, each pair of corresponding contrasted and non-contrasted CT was registered. The synthetic test includes rotation test with fixed translation mm and 11 angles rotation uniformly ranges from to around axis as well as translation test with fixed rotation around axis and 11 translation uniformly ranges from mm to mm along the axis .

2.5 Comparison and evaluation

D-net was compared with other relevant image registration approaches:
SITK: Simple ITK with metric Joint Histogram Mutual Information and optimizer Regular Step Gradient Descent with gradient tolerance 0.0001, max iter number 10k and learning rate 1.
ME (Mixed Encoder): The ”Global-net” [7] concatenating the two input volumes together and feeding into one mixed branch. All the architecture settings are set with default values, with and .
SE (Siamese Encoder): An architecture employing two branches of 6 Res-down blocks for SE and two fully connected layers for regression, a similar structure was used in [3] but without residual structure and fewer down-sampling blocks.
SED (Siamese Encoder Decoder): A proposed architecture inserting 4 Res-up blocks into SE between the SE and regression parts, with skip connection from the 4 latter Res-down block of SE.
SNL-SED (Self Non-local Linked - Siamese Encoder Decoder): A proposed SED architecture with the 4 latter Res-down blocks self non-locally linked by the Embedded Gaussian similarity based - non-local block [19] in each branch.

SITK, ME and SE are previously published methods; while SED and SNL-SED are transitional forms towards D-Net and their impact is separately validated; SE, SED SNL-SED and D-net are validated with and .

2.5.1 Criteria

The Euclidean distance of Translation Error (TE) between the predicted and expected translation, , and the Rotation Error (RE) between the predicted and expected rotation, , are calculated for synthetic tests. Since the ground truth for real examples are unknown, the Dice Similarity Coefficient (DSC) between the cortical bone segmented from contrasted and non-contrasted tibial CT is calculated for both synthetic and real tests.

Figure 3: The training curves exemplified with (S1) left: Loss values, middle: Rotation errors (RE) and right: Translation Errors (TE) shows SED, SNL-SED and D-net are trainable across range of translations and rotations.

3 Results

Figure 4: D-net outperforms all other methods exemplified by (S1) in (a) rotation (rot) test, (b) translation test, and (c) real test exemplified by 30 subjects, with Translation Errors (TE), Rotation Errors (RE) and DSC, avgstd; (d) DSC, avgstd in rot test and real test with validation strategies (S1)(S4).
Figure 5: Segmentation surfaces for the two exemplar volumes used in real experiment. D-net achieve the most plausible registration (overlapping red and white surfaces) with highest DSC shown at the right bottom corner (images are shown at original resolution).

All networks were trained for 120k iterations. In all training strategies used, ME and SE failed to converge, whereas SED, SNL-SED and D-net were trainable (exemplified in Fig. 3).

The results of rotation test and translation test with validation strategy (S1&S2) are shown in Fig. 4(a) and Fig. 4(b), where only D-net achieves the sub-voxel average TE in the rotation and translation tests. The performance of ME and especially SE is sensitive to the initial translation and rotation as shown in Fig. 4(a)-middle and Fig. 4(b)-left because they intend to predict a small range transformations for any input volumes. DSCs for 30 subjects in real test with strategy (S1&S2) are shown in Fig. 4(c), where the SED increases DSC by 0.10.5 from ME and D-net further raises DSC by 0.10.4 from SED. The average DSC for strategy (S1)(S4) in rotation test and real test are plotted in Fig. 4(d). It shows average DSC of D-net is higher than all others in both rotation and real test but is slightly lower in real test compared with rotation test.

The tibial bone shapes of two registration examples for ME, SED, and D-net from real test are shown in Fig. 5 with the same subjects previously shown in Fig. 2. The figure illustrates bone fragments and segmentation difference caused by the varying intensity influenced by contrast, decreasing the DSC values and making registration of preclinical tibia data particularly challenging. Visual results in Fig. 5 confirms that D-Net performs robustly and this is further supported by the quantitative results shown in Tab. 1, where TE, RE, and DSC for all methods are presented.

Comparing with others, D-net achieves the lowest TE and RE and highest DSC with consistent performance across range of rotations. Using two-way Analysis of Variance (ANOVA) in rotation and translation tests and one-way ANOVA in real test, D-net significantly outperforms all other approaches with

on TE, RE and DSC in rotation and translation tests and on DSC in real test by strategy (S1&S2) and (S3&S4); SED significantly outperforms SE, ME and SITK with on TE, RE and DSC in rotation and translation test and on DSC in real test by all the strategies.

Variable No. - 0.7M 9.2M 5.2M 4.9M 4.9M
Strategy(S) - 1&2 3&4 1&2 3&4 1&2 3&4 1&2 3&4 1&2 3&4
Transl TE 710.5 406.4 395.1 472.6 472.5 124.3 132.7 163.4 146.2 55.8 77.7
RE 135.9 82.7 83.4 89.8 89.2 7.1 8.0 9.3 9.7 3.2 4.2
DSC 12.98 15.04 14.94 12.86 13.42 52.28 51.08 45.61 47.69 73.57 66.44
Rot TE 696.2 575.4 569.4 692.4 692.4 154.4 144.1 201.5 175.7 65.3 81.0
RE 119.6 92.6 95.1 98.5 98.0 7.4 8.1 9.1 9.5 3.6 4.5
DSC 12.75 14.54 14.54 12.22 12.75 48.28 47.97 40.88 44.69 69.83 64.64
Real DSC 16.91 16.75 15.16 18.86 17.23 43.66 42.52 41.85 39.26 67.92 61.68
Table 1: Average values of Translation Error (TE/m), Rotation Error (RE/) and Dice Similarity Coefficient (DSC/%) for different methods in Translation (Transl), Rotation (Rot) and Real test, with strategies of training on both contrasted and non-contrasted data (S1&S2) and just non-contrasted data (S3&S4)

4 Discussion and Conclusion

In this paper, we proposed a new network, D-net, and a new structure, Mutual Non-local Link (MNL), for estimation of transformation between CT volumes.

The experimental results shows D-net outperforms other methods and achieves state-of-the-art performance for rigid registration of preclinical mouse CT scans with and without contrast. While ME and SE did not converge during training using full range of rotations, we were able to train them using smaller range of rotations (30), similarly as in [2, 5]. This further shows superiority of D-net at consistently estimating full range of rotations.

The average DSCs of D-net in real test are slightly lower than in synthetic tests potentially due to the difference in segmentation for contrast-enhanced volumes. However, D-net is still able to extract the common features of tibial bone and align two volumes plausibly, showing usefulness for shape extraction of cartilage from contrast-enhaced CT of tibiae.

For the rotation representation, the widely used quaternion, Euler angles and Lee algebra were not applied due to the discontinuity of 3D rotation represented in the real Euclidean space with dimension lower than 5D [20].

In future work, a pipeline for cartilage shape extraction will be further validated for morphological analysis and in application for diagnosis and staging of osteoarthritis, and D-net will be generalized to other modalities to explore inter-subject and inter-modality registration.

5 Acknowledgement

This work was supported by a Kennedy Trust for Rheumatology Research Studentship, the Centre for OA Pathogenesis Versus Arthritits (Versus Arthritis grant 21621). The authors acknowledge Patricia das Neves Borges as the researcher who collected the preclinical CT dataset, as part of the National Centre for Replacement, Refinement and Reduction of Animals in Research (NC3R grant NC/M000141/1).B. W. Papież acknowledges Rutherford Fund at Health Data Research UK


  • [1] M. Baiker, M. Staring, C. W. Löwik, J. H. Reiber, and B. P. Lelieveldt (2011) Automated registration of whole-body follow-up microCT data of mice. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 516–523. Cited by: §1.
  • [2] E. Chee and Z. Wu (2018)

    Airnet: self-supervised affine registration for 3d medical images using neural networks

    arXiv preprint arXiv:1810.02583. Cited by: §1, §4.
  • [3] B. D. de Vos, F. F. Berendsen, M. A. Viergever, H. Sokooti, M. Staring, and I. Išgum (2019) A deep learning framework for unsupervised affine and deformable image registration. Medical image analysis 52, pp. 128–143. Cited by: §1, §2.5.
  • [4] M. Dunnhofer, M. Antico, F. Sasazawa, Y. Takeda, S. Camps, N. Martinel, C. Micheloni, G. Carneiro, and D. Fontanarosa (2020) Siam-U-Net: encoder-decoder siamese network for knee cartilage tracking in ultrasound images. Medical Image Analysis 60, pp. 101631. Cited by: §2.1.
  • [5] G. Haskins, J. Kruecker, U. Kruger, S. Xu, P. A. Pinto, B. J. Wood, and P. Yan (2019) Learning deep similarity metric for 3D MR–TRUS image registration. International journal of computer assisted radiology and surgery 14 (3), pp. 417–425. Cited by: §1, §4.
  • [6] G. Haskins, U. Kruger, and P. Yan (2020) Deep learning in medical image registration: a survey. Machine Vision and Applications 31 (1), pp. 8. Cited by: §1.
  • [7] Y. Hu, M. Modat, E. Gibson, N. Ghavami, E. Bonmati, C. M. Moore, M. Emberton, J. A. Noble, D. C. Barratt, and T. Vercauteren (2018)

    Label-driven weakly-supervised learning for multimodal deformable image registration

    In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1070–1074. Cited by: §1, §2.5.
  • [8] D. Kwon, J. Ahn, J. Kim, I. Choi, S. Jeong, Y. Lee, J. Park, and M. Lee (2019) Siamese U-Net with healthy template for accurate segmentation of intracranial hemorrhage. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 848–855. Cited by: §2.1.
  • [9] R. Liao, S. Miao, P. de Tournemire, S. Grbic, A. Kamen, T. Mansi, and D. Comaniciu (2017) An artificial agent for robust image registration. In

    Thirty-First AAAI Conference on Artificial Intelligence

    Cited by: §1.
  • [10] N. H. Lim and M. M. Fowkes (2019-June 5) Radiopaque compound containing diiodotyrosine. Google Patents. Note: EU Patent EP3490614A1 Cited by: §1.
  • [11] K. Ma, J. Wang, V. Singh, B. Tamersoy, Y. Chang, A. Wimmer, and T. Chen (2017)

    Multimodal image registration with deep context reinforcement learning

    In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 240–248. Cited by: §1.
  • [12] Y. Ou, A. Sotiras, N. Paragios, and C. Davatzikos (2011) DRAMMS: Deformable registration via attribute matching and mutual-saliency weighting. Medical image analysis 15 (4), pp. 622–639. Cited by: §2.2.
  • [13] B. W. Papież, A. Szmul, V. Grau, J. M. Brady, and J. A. Schnabel (2016) Non-local graph-based regularization for deformable image registration. In

    Medical Computer Vision and Bayesian and Graphical Models for Biomedical Imaging

    pp. 199–207. Cited by: §2.2.
  • [14] S. S. M. Salehi, S. Khan, D. Erdogmus, and A. Gholipour (2018) Real-time deep pose estimation with geodesic loss for image-to-template rigid registration. IEEE TMI 38 (2), pp. 470–481. Cited by: §1.
  • [15] J. A. Schnabel, M. P. Heinrich, B. W. Papież, and J. M. Brady (2016) Advances and challenges in deformable image registration: from image fusion to complex motion modelling. Medical image analysis 33, pp. 145–148. Cited by: §1.
  • [16] M. Simonovsky, B. Gutiérrez-Becker, D. Mateus, N. Navab, and N. Komodakis (2016) A deep metric for multimodal registration. In International conference on medical image computing and computer-assisted intervention, pp. 10–18. Cited by: §1.
  • [17] J. M. Sloan, K. A. Goatman, and J. P. Siebert (2018) Learning rigid image registration-utilizing convolutional neural networks for medical image registration. In 11th International Joint Conference on Biomedical Engineering Systems and Technologies,, pp. 89–99. Cited by: §1.
  • [18] C. Wang, G. Papanastasiou, A. Chartsias, G. Jacenkow, S. A. Tsaftaris, and H. Zhang (2019) FIRE: unsupervised bi-directional inter-modality registration using deep networks. arXiv preprint arXiv:1907.05062. Cited by: §1.
  • [19] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    pp. 7794–7803. Cited by: §2.2, §2.5.
  • [20] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li (2019) On the continuity of rotation representations in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5745–5753. Cited by: §2, §4.