1 Introduction
Image registration is the process of aligning two or more sources of data to the same coordinate system. Among the different registration methods used in medical applications, deformable registration is the most commonly used due to its richness of description [15]. The goal of deformable registration is to calculate the optimal nonlinear dense transformation that aligns a source (moving) image to a reference (target) image in the best possible way [6, 2]. Existing literature considers the mapping once the local alignment has been performed and is therefore often biased towards the linear component. Furthermore, state-of-the-art methods are sensitive to the application setting, involve multiple hyperparameters (optimization strategy, smoothness term, deformation model, similarity metric), and are computationally expensive.
Recently, deep learning methods have gained a lot of attention due to their state-of-the-art performance on a variety of problems and applications [4, 12]. In computer vision, optical flow estimation, a problem highly similar to deformable registration, has been successfully addressed with numerous deep neural network architectures [9]. In medical imaging, some methods in the literature propose the use of convolutional neural networks (CNNs) as robust methods for image registration [14, 5]. More recently, adversarial losses have been introduced with impressive performance [16]. The majority of these methods share two limitations: (i) dependency on the linear component of the transformation and (ii) dependency on ground-truth displacements used for supervised training.

In this paper, we address the previous limitations of traditional deformable registration methods and at the same time propose an unsupervised method for efficient and accurate registration of 3D medical volumes that determines the linear and deformable parts in a single forward pass. The proposed solution outperforms conventional multi-metric deformable registration methods and demonstrates evidence of clinical relevance: the transformation between the extreme moments of the respiratory cycle can be used for the classification of patients with interstitial lung disease (ILD).
The main contributions of this study are fourfold: (i) coupling linear and deformable registration within a single optimization step and architecture, (ii) creating a modular, parameter-free implementation which is independent of the similarity metric, (iii) reducing considerably the computational time needed for registration, allowing real-time applications, and (iv) associating deformations with clinical information.
2 Methodology
In this study, we propose the use of an unsupervised CNN for the registration of pairs of medical images. A source image $S$ and a reference image $R$ are presented as inputs to the CNN, while the output is the deformation $G$ along with the registered source image $D$. This section presents details of the proposed architecture as well as the dataset that we utilized for our experiments. Please note that henceforth we will use the terms deformation, grid, and transformation interchangeably.
2.1 Linear and Deformable 3D Transformer
One of the main components of the proposed CNN is the 3D transformer layer. This layer is part of the CNN and is used to warp its input under a deformation $G$. The forward pass for this layer is given by

$$D = \mathcal{W}(S, G) \tag{1}$$

where $\mathcal{W}(\cdot, G)$ indicates a sampling operation under the deformation $G$. $G$ is a dense deformation which can be thought of as an image of the same size as the input $S$, and which is constructed by assigning for every output voxel in $D$ a sampling coordinate in the input $S$.
In order to allow gradients to flow backwards through this warping operation and facilitate backpropagation training, the gradients with respect to the input image as well as the deformation should be defined. Similar to [10], such gradients can be calculated for a backward trilinear interpolation sampling. The deformation is hence fed to the transformer layer as sampling coordinates for backward warping. The sampling process is illustrated by

$$D(p) = \sum_{q} S(q) \prod_{d \in \{x, y, z\}} \max\left(0,\; 1 - \left|[G(p)]_d - q_d\right|\right) \tag{2}$$

where $p$ and $q$ denote pixel locations, $d$ denotes an axis, and $[G(p)]_d$ denotes the $d$ component of $G(p)$.
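The sampling of Eq. (2) can be sketched in NumPy as follows. This is an illustrative reference implementation with naive loops (not the actual CNN layer, and all names are of our choosing); it only visits the eight voxels around each sampling point, since all other weights vanish:

```python
import numpy as np

def warp_trilinear(volume, grid):
    """Backward-warp a 3D `volume` (D, H, W) under a dense deformation
    `grid` (D, H, W, 3): grid[p] holds the (z, y, x) sampling coordinate
    in `volume` for output voxel p. Each output value is the trilinearly
    weighted sum over neighbouring input voxels, as in Eq. (2)."""
    D, H, W = volume.shape
    out = np.zeros(volume.shape, dtype=float)
    for idx, (z, y, x) in enumerate(grid.reshape(-1, 3)):
        z0, y0, x0 = int(np.floor(z)), int(np.floor(y)), int(np.floor(x))
        val = 0.0
        # only the 8 voxels around the sampling point have nonzero weight
        for dz in (0, 1):
            for dy in (0, 1):
                for dx in (0, 1):
                    qz, qy, qx = z0 + dz, y0 + dy, x0 + dx
                    if 0 <= qz < D and 0 <= qy < H and 0 <= qx < W:
                        w = (max(0.0, 1.0 - abs(z - qz))
                             * max(0.0, 1.0 - abs(y - qy))
                             * max(0.0, 1.0 - abs(x - qx)))
                        val += volume[qz, qy, qx] * w
        out.flat[idx] = val
    return out
```

Sampling out-of-bounds coordinates simply drops the missing neighbours, which corresponds to zero padding of the input.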
Our modeling of the deformation offers a choice of the type of deformation we wish to use: linear, deformable, or both. The linear (or affine) part of the deformation requires the prediction of a $3 \times 4$ affine transformation matrix $A$ according to the relation $v' = A \tilde{v}$, where $\tilde{v} = [x, y, z, 1]^T$ represents the augmented points to be deformed and $v' = [x', y', z']^T$ represents their locations in the deformed image. The matrix $A$ can then be used to build a grid $G_A$, which is the affine component of the deformation $G$.
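Building the affine grid $G_A$ from $A$ can be sketched in NumPy as below; the function name and the (z, y, x, 1) augmented-coordinate convention are ours, not the paper's:

```python
import numpy as np

def affine_grid(A, shape):
    """Build a dense sampling grid from a 3x4 affine matrix A:
    each output voxel p = (z, y, x) samples A @ [z, y, x, 1]^T
    in the input volume."""
    axes = np.meshgrid(*[np.arange(s) for s in shape], indexing='ij')
    # augmented voxel coordinates, shape (..., 4)
    pts = np.stack(axes + [np.ones(shape)], axis=-1)
    return pts @ A.T  # sampling coordinates, shape (..., 3)

# the identity affine matrix leaves the grid unchanged
A_I = np.hstack([np.eye(3), np.zeros((3, 1))])
```

With `A = A_I`, each voxel samples its own location, i.e. the identity deformation.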
To model the deformable part $G_N$, a simple and straightforward approach is to generate the sampling coordinates for each output voxel directly, letting the network calculate these sampling points itself. Such a choice would however require the network to produce feature maps with large value ranges, which complicates training. Moreover, without appropriate regularization, non-smooth and even disconnected deformations could be produced. In order to circumvent this problem, we adopt the approach proposed by [13] and predict the spatial gradients $\Phi$ of the deformation along each dimension instead of the deformation itself. This quantity measures the displacements of consecutive pixels. By enforcing these displacements to have positive values and subsequently applying an integration operation along each dimension, the spatial sampling coordinates can be retrieved. This integration can be approximated by simply applying a cumulative sum along each dimension of the input (i.e., an integral image). In such a case, when $\Phi_d(p) = 1$ there is no change in the distance between the consecutive pixels $p$ and $p+1$ in the deformed image along the axis $d$. On the other hand, when $\Phi_d(p) < 1$ the distance between these consecutive pixels along $d$ will decrease, while it will increase when $\Phi_d(p) > 1$. Such an approach ensures the generation of smooth deformations that avoid self-crossings, while allowing control of the maximum displacements between consecutive pixels.
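The gradient-integration step can be sketched as follows, assuming (our convention for this sketch) that the predicted gradients are stored with one channel per axis in the last dimension. Because the increments are positive, the cumulative sums are strictly increasing along every axis, which is exactly the no-self-crossing guarantee described above:

```python
import numpy as np

def grid_from_gradients(phi):
    """Integrate predicted positive spatial gradients `phi`
    (shape (D, H, W, 3), one channel per axis) into sampling
    coordinates via a cumulative sum along each axis, i.e. the
    integral-image approximation described in the text. The shift
    by 1 makes all-ones gradients reproduce the identity grid."""
    grid = np.empty(phi.shape, dtype=float)
    for d in range(3):
        grid[..., d] = np.cumsum(phi[..., d], axis=d) - 1.0
    return grid
```

Unit gradients yield the identity grid; values below 1 contract neighbouring voxels, values above 1 spread them apart.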
Finally, to compose the two parts, we apply the deformable component to the moving image first, followed by the linear component. For a source image $S$, this step can be written as

$$D = \mathcal{W}\big(\mathcal{W}(S, G_N),\, G_A\big) \tag{3}$$

where $G_N$ and $G_A$ denote the deformable and affine grids, respectively.
During training, the decoders of the affine and deformable components are optimized jointly, as the network is trained end-to-end. We also impose regularization constraints on both components. We elaborate on the importance of this regularization for the joint training in Section 2.3.
2.2 Architecture
The architecture of the CNN is based on the encoder-decoder framework presented in [1] (Fig. 1). The encoder adopts dilated convolutional kernels along with multi-resolution feature merging, while the decoder employs non-dilated convolutional layers and upsampling operations. A fixed kernel size was used for the convolutional layers, LeakyReLU activations were employed for all convolutional layers except the last two, and instance normalization was included before most of the activation functions. In total, five layers are used in the encoder, and their outputs are merged along with the input pair of images to form a feature map of 290 features. In the decoder, two branches were implemented: one for the spatial deformation gradients and one for the affine matrix. For the former, a squeeze-and-excitation block [8] was added in order to weigh the most important features for the spatial gradient calculation, while for the latter a simple global average operation was used to reduce the spatial dimensions to one. A linear layer produces the affine parameters, and a sigmoid activation produces the spatial deformation gradients. Finally, to retrieve $\Phi$, the output of the sigmoid function is scaled by a factor of 2 so that it falls in the range $(0, 2)$, allowing consecutive pixels to have a larger distance than the initial one.

2.3 Training
The network was trained by minimizing the mean squared error (MSE) between the reference and the deformed image intensities, together with regularization terms on the affine transformation parameters and the spatial deformation gradients, using the Adam optimizer [11]. Our loss is defined as

$$\mathcal{L} = \left\| R - \mathcal{W}(S, G) \right\|_2^2 + \alpha \left\| A - A_I \right\|_1 + \beta \left\| \Phi - \Phi_I \right\|_2^2 \tag{4}$$

where $\mathcal{W}(S, G)$ is the source image $S$ warped under the predicted deformation $G$, $R$ is the reference image, $A$ and $\Phi$ are the predicted affine matrix and spatial deformation gradients, $A_I$ represents the identity affine transformation matrix, $\Phi_I$ is the spatial gradient of the identity deformation, and $\alpha$ and $\beta$ are regularization weights. As mentioned before, regularization is essential to the joint optimization. To elaborate, without the L1 regularization on $A$, the network might get stuck in a local minimum where it aligns only high-level features using the affine transformation, resulting in a high reconstruction error. On the other hand, without the smoothness regularizer on $\Phi$, the spatial gradient decoder can predict very non-smooth grids, which again makes the network prone to falling into a local minimum. Having both linear and deformable components is helpful because the two components share the work. This hypothesis aligns with [13] and is also evaluated in Section 3.
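The loss of Eq. (4) can be sketched in NumPy as below; where the text leaves the norms and reductions unstated, the choices here (mean-squared terms, summed L1) are our assumptions:

```python
import numpy as np

def registration_loss(warped, reference, A, phi, alpha, beta):
    """Sketch of the training loss: MSE between the deformed and
    reference intensities, an L1 penalty pulling the affine matrix A
    towards the identity A_I, and a smoothness penalty pulling the
    predicted spatial gradients phi towards those of the identity
    deformation (all ones). alpha and beta are the regularization
    weights."""
    A_I = np.hstack([np.eye(3), np.zeros((3, 1))])  # identity affine
    phi_I = np.ones_like(phi)                       # identity gradients
    mse = np.mean((warped - reference) ** 2)
    return (mse
            + alpha * np.abs(A - A_I).sum()
            + beta * np.mean((phi - phi_I) ** 2))
```

At the optimum of an already-aligned pair with identity parameters, every term vanishes.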
The initial learning rate is reduced if the performance on the validation set does not improve for 50 epochs, and the training procedure stops when there is no improvement for 100 epochs. The regularization weights $\alpha$ and $\beta$ were set so that neither of the two components has an unreasonably large contribution to the final loss. As training samples, random pairs among all cases were selected, with a small batch size due to the limited memory resources of the GPU. The performance of the network was evaluated at regular intervals, and both proposed models, with and without the affine component, converged. The overall training time was on the order of hours.

2.4 Dataset
MRI exams were acquired as part of a prospective study aiming to evaluate the feasibility of pulmonary fibrosis detection in systemic sclerosis patients using magnetic resonance imaging (MRI) and an elastic-registration-driven biomarker. This study received institutional review board approval and all patients gave their written consent. The study population consisted of 41 patients (29 patients with systemic sclerosis and 12 healthy volunteers). Experienced radiologists annotated the lung field for all 82 images and provided information about the pathology of each patient (healthy or not). Additionally, eleven characteristic landmarks inside the lung area were provided by two experienced radiologists.
All MRI examinations were acquired on a 3T MRI unit (MAGNETOM Skyra, Siemens Healthineers) using an 18-channel phased-array body coil. All subjects were positioned supine with their arms along the body. Inspiratory and expiratory MRI images were acquired using an ultrashort echo time (UTE) sequence, the spiral VIBE sequence, with the same acquisition parameters (repetition time, echo time, flip angle, field of view, matrix size, and in-plane resolution; slice thickness 2.5 mm).
As a preprocessing step, the image intensity values were clipped to a fixed window and mapped to the unit range. Moreover, all images were scaled down along all dimensions with cubic interpolation to compensate for GPU memory constraints. A random split was performed: 28 patients (56 pairs of images) were selected for the training set, while the remaining 13 were used for validation.
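The intensity windowing step can be sketched as below. The window bounds are left as parameters, since their exact values are not reproduced here, and the function name is illustrative:

```python
import numpy as np

def window_and_normalize(volume, lo, hi):
    """Clip intensities to the window [lo, hi] and map them linearly
    to [0, 1], as in the preprocessing described above."""
    v = np.clip(volume.astype(float), lo, hi)
    return (v - lo) / (hi - lo)
```

The subsequent cubic downscaling would typically be done with an image-processing routine such as `scipy.ndimage.zoom(volume, factor, order=3)`.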
3 Experimental Setup and Results
3.1 Evaluation
We evaluated the performance of our method against two different state-of-the-art methods, namely Symmetric Normalization (SyN) [2], using its implementation in the ANTs package [3], and the deformable method presented in [7, 6] with a variety of similarity metrics (normalized cross correlation (NCC), mutual information (MI), the discrete wavelet metric (DWM), and their combination). For the evaluation, we calculated the Dice coefficient on the lung masks after applying the calculated deformation to the lung mask of the moving image. Moreover, we evaluated our method using the provided landmark locations. For comparison, we also report the approximate computational time each of these methods needed to register a pair of images. All implementations used a GeForce GTX 1080 GPU, except for SyN, for which we used a CPU implementation running on the cores of an i7-4700HQ CPU.
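The Dice overlap used in this evaluation is standard; a minimal NumPy version (ours, not the authors' evaluation code) is:

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice coefficient between two binary masks: twice the
    intersection divided by the sum of the two mask volumes."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())
```

Applied here, `mask_a` would be the warped moving lung mask and `mask_b` the reference lung mask.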
3.2 Results and Discussion
Starting with the quantitative evaluation, the mean Dice coefficient values along with their standard deviations are presented for the different methods in Table 1. We performed two different types of tests. In the first set of experiments (Table 1: Inhale-Exhale), we tested the performance of the different methods for the registration of the MRI images between the inhale and exhale volumes of the 13 validation patients. The SyN implementation reports the lowest Dice scores, while at the same time it is computationally quite expensive due to its CPU implementation. Moreover, we tested three different similarity metrics along with their combination using the method proposed in [6], as described earlier. In this specific setup, the MI metric seems to report the best Dice scores among them. However, the scores reported by the proposed architecture are superior to those reported by the other methods. For the proposed method, the addition of a linear component to the transformation layer does not change the performance of the network significantly in this experiment. Finally, we calculated the errors over all axes in the predicted locations of the eleven manually annotated landmark points on the inhale volumes after they were deformed using the decoded deformation for each patient. In Table 2, we compare the performance of our method against the inter-observer distance (between two medical experts) and the method presented in [6]. We observe that both methods perform very well considering the inter-observer variability, with the proposed one reporting slightly better average Euclidean distances.

Table 1: Dice scores (mean ± standard deviation) and runtime for the evaluated methods.

Method | Inhale-Exhale | All Combinations | Time/subject (s)
Unregistered | 75.62 ± 10.89 | 57.22 ± 12.90 | –
Deformable with NCC [6] | 84.25 ± 6.89 | 76.10 ± 7.92 | 1 (GPU)
Deformable with DWM [6] | 88.63 ± 4.67 | 75.92 ± 8.81 | 2 (GPU)
Deformable with MI [6] | 88.86 ± 5.13 | 76.33 ± 8.74 | 2 (GPU)
Deformable with all above [6] | 88.81 ± 5.85 | 78.71 ± 8.56 | 2 (GPU)
SyN [2] | 83.86 ± 6.04 | – | 2500 (CPU)
Proposed w/o Affine | 91.28 ± 2.47 | 81.75 ± 7.88 | 0.5 (GPU)
Proposed | 91.48 ± 2.33 | 82.34 ± 7.68 | 0.5 (GPU)
For the second set of experiments (Table 1: All Combinations), we report the Dice scores for all combinations of the images of the 13 validation patients. Due to the large number of combinations, this problem is more challenging, since the size of the lungs at the extreme moments of the respiratory cycle can vary significantly. Again, the performance of the proposed architecture is superior to the tested baselines, highlighting its very promising results. In this experimental setup, the linear component plays a more important role, boosting the performance of the network.
Concerning the computation time, both [6] and the proposed method report very low inference times due to their GPU implementations, with the proposed method reaching 0.5 seconds per subject. On the other hand, SyN [2] is computationally quite expensive, making it difficult to test on all possible combinations of the validation set.
Table 2: Landmark errors along each axis and average Euclidean distance.

Method | x | y | z | Euclidean
Inter-observer | 1.664 | 2.545 | 1.555 | 3.905
Deformable with NCC, DWM, and MI [6] | 1.855 | 3.169 | 2.229 | 4.699
Proposed w/o Affine | 2.014 | 2.947 | 1.858 | 4.569
Proposed | 1.793 | 2.904 | 1.822 | 4.358
Finally, in Figure 2 we present the deformed images produced by the proposed method in the coronal view for a single patient at the two extreme moments of the respiratory cycle. The grids are superimposed on the images, indicating the displacements calculated by the network. The last column shows the difference between the reference and the deformed image. One can observe that the majority of the errors occur at the boundaries, as the network fails to capture large local displacements.

3.3 Evaluation of the Clinical Relevance of the Deformation
To assess the relevance of the decoded transformations in a clinical setting, we trained a small classifier on top of the obtained residual deformations to classify patients as healthy or unhealthy. The residual deformation associated with a pair of images indicates voxel displacements, written as $G - G_I$, where $G$ is the deduced deformation between the two images and $G_I$ is the identity deformation.

We trained a downsampling convolutional kernel followed by a multi-layer perceptron (MLP) to predict whether a case is healthy or not. The network architecture is shown in Figure 3. The model includes batch normalization layers to avoid overfitting, as we have few training examples at our disposal. Further, a Tanh activation function is used in the MLP. We trained with a binary cross-entropy loss, with an initial learning rate that is halved every fifty epochs. Training five models in parallel was carried out on two GeForce GTX 1080 GPUs. We cross-validated five models on the training set of 28 patients and report the average response of these models on the remaining 13 patients. We conducted the same experiment for deformations obtained using [6] with all similarity measures (NCC, DWM, MI). The results on the test set, obtained by thresholding the predicted probability, are reported in Table 3, suggesting that the deformations between inhale and exhale indeed carry information about lung disease.

Table 3: Classification results on the test set.

Method | Accuracy
Deformable with NCC, DWM, and MI [6] |
Proposed |
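For reference, the residual deformation fed to the classifier is simply the decoded sampling grid minus the identity grid; a NumPy sketch (grid layout conventions are our assumption) is:

```python
import numpy as np

def residual_deformation(grid):
    """Residual deformation: the decoded sampling grid (D, H, W, 3)
    minus the identity grid, so each voxel holds its displacement
    rather than its absolute sampling coordinate."""
    shape = grid.shape[:-1]
    ident = np.stack(np.meshgrid(*[np.arange(s) for s in shape],
                                 indexing='ij'), axis=-1).astype(float)
    return grid - ident
```

An identity deformation therefore yields an all-zero residual, and a pure translation yields a constant one.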
4 Conclusion
In this paper, we propose a novel method which exploits 3D CNNs to calculate the optimal transformation (combining a linear and a deformable component within a coupled framework) between pairs of images, and which is modular with respect to the similarity function and the nature of the transformation. The proposed method generates deformations with no self-crossings due to the way the deformation layer is defined, is efficient due to the GPU implementation of the inference, and reports highly promising results compared to other unsupervised registration methods. The proposed network was tested on the challenging problem of lung registration; its evaluation on the registration of other modalities and other organs is one of the potential future directions of our work.
References
 [1] Anthimopoulos, M., Christodoulidis, S., Ebner, L., Geiser, T., Christe, A., Mougiakakou, S.G.: Semantic segmentation of pathological lung tissue with dilated fully convolutional networks. IEEE Journal of Biomedical and Health Informatics (2018)
 [2] Avants, B., Epstein, C., Grossman, M., Gee, J.: Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain. Medical Image Analysis (2008)
 [3] Avants, B.B., Tustison, N.J., Song, G., Cook, P.A., Klein, A., Gee, J.C.: A reproducible evaluation of ANTs similarity metric performance in brain image registration. NeuroImage (2011)
 [4] Chandra, S., Usunier, N., Kokkinos, I.: Dense and low-rank Gaussian CRFs using deep embeddings. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)
 [5] Cheng, X., Zhang, L., Zheng, Y.: Deep similarity learning for multimodal medical images (2016)
 [6] Ferrante, E., Dokania, P.K., Marini, R., Paragios, N.: Deformable Registration Through Learning of Context-Specific Metric Aggregation, pp. 256–265. Springer International Publishing (2017)
 [7] Glocker, B., Sotiras, A., Komodakis, N., Paragios, N.: Deformable medical image registration: Setting the state of the art with discrete methods. Annual Review of Biomedical Engineering pp. 219–244 (2011)
 [8] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
 [9] Hui, T.W., Tang, X., Loy, C.C.: LiteFlowNet: A lightweight convolutional neural network for optical flow estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
 [10] Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems. pp. 2017–2025 (2015)
 [11] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
 [12] Güler, R.A., Neverova, N., Kokkinos, I.: DensePose: Dense human pose estimation in the wild. arXiv (2018)
 [13] Shu, Z., Sahasrabudhe, M., Güler, R.A., Samaras, D., Paragios, N., Kokkinos, I.: Deforming Autoencoders: Unsupervised disentangling of shape and appearance. In: European Conference on Computer Vision (ECCV) (2018)
 [14] Simonovsky, M., Gutiérrez-Becker, B., Mateus, D., Navab, N., Komodakis, N.: A deep metric for multimodal registration. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016. Springer International Publishing, Cham (2016)
 [15] Sotiras, A., Davatzikos, C., Paragios, N.: Deformable medical image registration: A survey. IEEE Transactions on Medical Imaging pp. 1153–1190 (July 2013)
 [16] Yan, P., Xu, S., Rastinehad, A.R., Wood, B.J.: Adversarial image registration with application for MR and TRUS image fusion (2018)