## 1 Introduction

Image registration is a fundamental task in image processing, whose importance soars with growing number of different types of devices and increasing availability. It also serves as a crucial step in a great variety of biomedical imaging applications. Registration techniques can be generally divided into rigid registration and non-rigid registration [oh2017deformable]. Rigid image registration moves or rotates image pixels uniformly so that the pixelwise relations are kept before and after the transformation. Non-rigid image registration method, also known as deformable registration, changes the pixelwise relation and produces a translation map for each pixel.

The registration task requires user to input a pair of images acquired from different devices, among which one denoted as moving, indicating the image that needs to be aligned, and the other denoted as fixed, indicating the target coordinate system of the alignment, and outputs the aligned image after transformation, commonly denoted as moved. Traditional rigid registration methods often solve an optimization problem for each pair of images. However, solving a pairwise optimization problem can be computationally intensive and slow in practice [balakrishnan2019voxelmorph]

. With the advent of deep learning in recent decades, a number of Convolutional Neural Network (CNN) architectures are proposed for registration. The pipelines of deep learning-based techniques consist of two types. One is to use a CNN to directly model the transformation from the input image pair and the final aligned output. The limitation is that it requires ground truth during the training phase, either the pixelwise translation map (registration field) or the aligned image corresponding to the

moving image. The output moved image is not necessarily the same as the fixed image in all registration scenarios due to the measurement noise and the artifacts in the generation process of the fixedimage. Thus, the ground truth is usually hard to acquire. The other type of the pipeline is to use a CNN to model the registration field and utilize a Spatial Transformer Network

[jaderberg2015spatial] to perform the registration instead of modeling the transformation directly, making the pipeline unsupervised.Although there are a number of approaches proposed for image registration, few of them provide a thorough study for the differences between the state-of-art deep learning-based non-rigid registration trained with rigid and non-rigid data and rigid registration approaches on rigid and non-rigid testing data recently. In our work, we compared several state-of-art non-rigid and rigid registration frameworks and our major contributions are: 1) we generated our data set using images from the Kaggle Dog vs Cat Competition; 2) we reproduced the state-of-art 3D unsupervised non-rigid registration approach Voxelmorph [balakrishnan2019voxelmorph] in 2D and improve the registration results by adding a gaussian layer for registration field compared to the original architecture; 3) we reproduced several state-of-art rigid registration methods including SimpleElastix [marstal2016simpleelastix], Oriented FAST and Rotated BRIEF (ORB) [rublee2011orb] and intensity-based image registration by Matlab.

## 2 Related Work

### 2.1 Rigid image registration

Rigid image registration generally utilizes a linear transformation, which includes translation, rotation, scaling, shearing and other affine transformations. Extensive studies have been conducted on the topic of rigid registration

[eggert1997estimating, letteboer2003rigid, leroy2004rigid, commowick2012block, debayle2016rigid, ourselin2000block, feldmar1996rigid]. Recently, there are three widely used tools for rigid registration. The first is the intensity-based approach [rohde2003adaptive, myronenko2010intensity, klein2009elastix]. Matlab has embedded function called imregister for this method, making it accessible and easy to use. The second is the ORB based approach, which builds on FAST key points detector [rosten2006machine, rosten2008faster] and BRIEF feature descriptor [calonder2010brief]. The third is the famous SimpleElastix model [marstal2016simpleelastix], which is an extension of Elastix [klein2009elastix]. SimpleElastix also contains spline non-rigid transformation function. These rigid registration methods require user to specify the transformation model before the registration, which limits their generalization ability when dealing when unknown transformation.### 2.2 Non-rigid image registration

Several studies propose pairwise optimization methods for non-rigid image registration within displacement vector fields, including elastic type models, free-form deformations with b-splines

[rueckert1999nonrigid], discrete methods [dalca2016patch, glocker2008dense] and Demons [pennec1999understanding, thirion1998image]. There are also several methods proposing diffeomorphic transformation-based methods including Large Diffeomorphic Distance Metric Mapping (LDDMM) [beg2005computing, zhang2017frequency, cao2005large, ceritoglu2009multi, hernandez2009registration, joshi2000landmark, oishi2009atlas], DARTEL [ashburner2007fast] and diffeomorphic demons [vercauteren2009diffeomorphic]. These methods are not learning based and need to be repeated for each pair, which is time-consuming when dealing with large data set.There are also some recent papers proposing using neural networks to learn the registration function, but all of them rely on ground truth translation map [cao2017deformable, krebs2017robust, rohe2017svf, sokooti2017nonrigid, yang2017quicksilver, liao2017artificial, cao2018deformable]. In common registration applications, it is hard to acquire the translation map from two natural images taken by two different camera systems. Hu [hu2018weakly] put forward a weakly supervised deep learning-based registration approach, but it still requires a proportion of ground truth. More recently, several unsupervised methods have been proposed [de2017end, li2017non, li2018non]. They utilize neural networks to model the registration field and then apply the spatial transformer network [jaderberg2015spatial] to warp the image. However, the methods are only tested on a small subsets of volumes, such as 3D regions and have not been compared with other popular models like LDDMM or U-net. Balakrishnan [balakrishnan2018unsupervised]

then propose an unsupervised learning method for deformable image registration. The group extends the method to Voxelmorph

[balakrishnan2019voxelmorph] and demonstrates impressive performance on various data set, which is considered as the state-of-art. In this work, we reproduce this paper and we improve the registration result by adding a gaussian layer after obtaining the registration flow.## 3 Method

### 3.1 2D Voxelmorph

Let be the input pair for the Voxelmorph over 2D spatial domain , where denotes moving image and denotes fixed image. The Voxelmorph models the registration field function (pixelwise translation map) by using a neural network. The denotes the network parameter and

denotes the estimated registration field. Voxelmorph utilizes a Spatial Transformer Network (STN)

[jaderberg2015spatial] to compute the moved image. Stochastic gradient descent is used to find the optimal

.The CNN architecture used in the 2D Voxelmorph is based on U-net. First proposed by Ronneberger [ronneberger2015u] at 2015, U-net architecture has widely been used in registration and segmentation. The architecture implemented in our work is shown in Fig. 1.

The input size of the U-net architecture is as we concatenate the RGB channels of the moving and fixed

image. Convolutions in 2D with kernel size 3 and stride 2 are implemented in the encoder and decoder. Each convolution is followed by a LeakyReLU with parameter

. The original input size is denoted as for simplification and the successive layer utilize max pooling operation, shrinking the size by . The size of the output registration field is . Different from the original 3D Voxelmorph architecture, we add a Gaussian blur layer after registration field to smooth the pixelwise displacement.After obtaining the registration field, we construct a differentiable operation based on STN [jaderberg2015spatial] to compute the

with bilinear interpolation. We utilize the unsupervised loss function in the original 3D Voxelmorph, which penalizes the appearance difference and local spatial variation in

(1) |

The pipeline of 2D Voxelmorph is depicted in Fig. 2.

### 3.2 SimpleElastix

Developed based on Elastix [klein2009elastix], SimpleElastix is one of the favorite tools for rigid image registration. It also contains non-rigid image library using B-spline polynomial. The main idea of SimpleElastix is to solve a pairwise optimization problem by minimizing the cost function . The optimization can be formulated as

(2) |

with cost function defined as

(3) |

where is the transformation matrix, is the similarity measurement and is the penalty term with regularizer parameter

. SimpleElastix is based on the parametric approach to solve the optimization problem, where the number of possible transformation is limited by introducing a parametrization (model) of the transform. The optimization becomes

(4) |

denotes the parametrization model and vector contains the values of the transformation parameters. In our case for 2D rigid transformation, the parameter vector contains one rotation angle and the translation in and direction

(5) |

### 3.3 Orb

ORB-based registration is often called feature-based, since sparse sets of features are detected and matched in two images [rublee2011orb]. The final output of the method is a moved image after calculating the transformation matrix . The ORB registration approach could be divided into 4 stages including image preprocessing, feature detection, feature matching and image warping. The pipeline is shown in Fig. 3. We first read both moving and fixed image and convert them into grayscale using the following empirical function

(6) |

where denotes the original pixel value for three channels in , and denotes the output grayscale image. We then use feature detector, which consists of a locator and descriptor, to extract features from the input image. A locator identifies points on the image that are consistent under image transformations and a detector tells the appearance of the identified points by encoding them into arrays of numbers. In the implementation, we adopt FAST locator and BRIEF descriptor. In the next stage, we match the generated features using hamming distance and sort out the top corresponding points in the two images for the transformation matrix calculation. RANSAC is further utilized for improving the robustness. In the last stage, we warp the image with the to calculate the final output.

### 3.4 Intensity-based Registration

Intensity-based image registration is an iterative optimization process and is widely used in Matlab. The method requires prior information of initial transformation matrix , the metric and an optimizer. In our work, we choose mean square similariy metric and regular step gradient descent optimizer. The initial transformation matrix defines the type of 2-D transformation that aligns with . The metric is used to describe the similarity for evaluating the accuracy of our registration. We need two images as the input and get a scalar result to show how similar these two images are. And in order to reshape the metric, the optimizer is used to define the method for minimizing or maximizing the similarity metric. The pipeline is shown in Fig. 4. The key pairwise optimization objective is to provide accurate estimation on transformation matrix , which is a rigid registration approach.

## 4 Experiment

### 4.1 Data generator

The data set used in this work is generated from the Kaggle Dogs vs Cats competition. We downloaded 1200 images and separate them into two groups: 1000 images for training and 200 images for testing. These downloaded images are considered as moving images. The fixed images in the training and testing set are generated using Spatial Transformer Network [jaderberg2015spatial] with ground truth translation map. The pipeline is shown in Fig. 5.

The transformation matrices and its random entries for each pair used in the generator are listed in Table 1.

Type | Matrix | Random Seed |
---|---|---|

[.5] Translation | ||

[.5] Shearing | ||

[.5] Scaling | ||

[.5] Rotation | ||

[.5] Pixelwise |

Transformation and random matrix used in the data generator.

The rigid transformation matrix is a

matrix. We take the pixel shift in Cartesian coordinate system to calculate the translation map from

. Let denote the homogeneous coordinate in moving image and denote the coordinate in the fixed image, we have(7) |

and pixel shift can be calculated as

(8) |

Thus we could have the ground truth translation map with shape , where the first channel represents the pixel shift in and second represents for each pixel. For non-rigid transformation, we produce a random matrix and upsample it to for and and concatenate the two channels to generate the ground truth translation map. Random seed in Table 1 denotes the random entries generated in the transformation matrices for each image pair. For instance, in translation transformation, the random entries in are and in range .

Two different types of training data are produced using the technique described above. The first type, rigidset, is generated by separating our 1000 downloaded images into 5 categories, each containing 200 images, and conducting spatial transformations mentioned in Table 1 on each category separately. The second type, nonrigidset, is generated with the entire 1000 images using translation map upsampled from the pixelwise random matrix. The testing set is separated into 5 types, which are translation, rotation, scaling, shearing, pixelwise nonrigid, and each contains 40 images. We test the performance of Voxelmorph using rigidset and nonrigidset separately and compare the result. For notation simplicity, we denote Voxelmorph(NN) as trained with nonrigidset without gaussian layer, Voxelmorph(RN) as trained with rigidset without gaussian layer, Voxelmorph(NG) as trained with nonrigidset with gaussian layer and Voxelmorph(RG) as trained with rigidset with gaussian layer in the following article.

### 4.2 Experiment setup

The Voxelmorph is implemented in Python with Keras in Tensorflow backend and CUDA Deep Neural Network (cuDNN) library. The model is trained and tested on NVIDIA GPU GTX 2080 Ti with 11GB memory. The total number of epoch is 1500 and each image is resized to

. The SimpleElastix, ORB are implemented in Python and Intensity-based registration is implemented in Matlab.### 4.3 Evaluation metrics

The quantitative evaluation is conducted by calculating root-mean-square error (RMSE) and mean absolute error (MAE) between the estimated translation map and ground truth translation map, each with a size of . Two channels represent pixel shift in and separately.

#### 4.3.1 Root mean square error

Let denotes the element in estimated translation map and denotes the element in ground truth translation map. The RMSE is calculated as

(9) |

where denotes the total number of points, denotes number of column pixels, denotes number of row pixels. In our case, .

#### 4.3.2 Mean absolute error

The MAE is calculated as

(10) |

## 5 Results

### 5.1 Quantitative assessment

The quantitative assessment using RMSE metric in Cartesian and are reported in Table 2 and Table 3 respectively. The MAE in and are reported in Table 4 and Table 5. The best performance observed in rigid transformation testing is implemented by SimpleElastix. It achieves a high score in both RMSE and MAE with an average of in , in reported in RMSE and in , in y reported in MAE. In non-rigid transformation, Voxelmorph(RN) achieves the best score with in and in using RMSE metric, in and in using MAE metric.

From Table 2, we notice that by introducing a gaussian blur layer, Voxelmorph(RG) improves the RMSE score significantly compared with Voxelmorph(RN) in scaling and shearing, which is the original architecture trained with rigidset. Similar results are also observed in Table 3, Table 4 and Table 5. This demonstrates the effectiveness of the gaussian blur, which smooths the translation map. In our tasks such as translation, rotation and non-rigid pixelwise transformation, Voxelmorph with gaussian blur layer shows relative the same result after training using both nonrigidset and rigidset.

RMSE(px) | SimpleElastix | ORB | Intensity-based | Voxelmorph(NN) | Voxelmorph(RN) | Voxelmorph(NG) | Voxelmorph(RG) |
---|---|---|---|---|---|---|---|

Translation | |||||||

Rotation | |||||||

Scaling | |||||||

Shearing | |||||||

Pixelwise |

RMSE(px) | SimpleElastix | ORB | Intensity-based | Voxelmorph(NN) | Voxelmorph(RN) | Voxelmorph(NG) | Voxelmorph(RG) |
---|---|---|---|---|---|---|---|

Translation | |||||||

Rotation | |||||||

Scaling | |||||||

Shearing | |||||||

Pixelwise |

MAE(px) | SimpleElastix | ORB | Intensity-based | Voxelmorph(NN) | Voxelmorph(RN) | Voxelmorph(NG) | Voxelmorph(RG) |
---|---|---|---|---|---|---|---|

Translation | |||||||

Rotation | |||||||

Scaling | |||||||

Shearing | |||||||

Pixelwise |

MAE(px) | SimpleElastix | ORB | Intensity-based | Voxelmorph(NN) | Voxelmorph(RN) | Voxelmorph(NG) | Voxelmorph(RG) |
---|---|---|---|---|---|---|---|

Translation | |||||||

Rotation | |||||||

Scaling | |||||||

Shearing | |||||||

Pixelwise |

### 5.2 Visual assessment

The visual assessment is demonstrated in Fig. 6 and Fig. 7. We compare the results generated by algorithms mentioned in this paper with the ground truth. In our case, the input is the moving image and the ground truth image is the image warped by the ground truth translation map using spatial transformer network. Pixelwise 1-4 in Fig. 7 denote different types of translation map generated by different random matrices.

From Fig. 6, we can see that rigid transformation methods based on SimpleElastix, ORB and Intensity-based produce a black boundary on moved images and lose some information. The reason is that the warping is performed for the entire image instead of each pixel. In column 5-8, Voxelmorph produces a more consistent result compared with column 1-4. We also notice that the training data demonstrates a difference in Voxelmorph. When trained with rigidset, Voxelmorph preserves the relative pixel relation better compared to the model trained with nonrigidset. For instance, in the shearing row, Voxelmorph(RN) and Voxelmorph(RG) preserve a straight line in a cat body while the Voxelmorph(NN) and Voxelmorph(NG) warp the line into a curve.

From Fig. 7, we can see that ORB and Intensity-based approach fail to produce pixelwise moved image. For instance in Pixelwise 4, the box lines are still straight in these two methods as the rigid transformation considers a linear transformation instead of a pixelwise warping. SimpleElastix demonstrates a impressive result in non-rigid transformation. Voxelmorph trained in nonrigidset and rigidset show comparable performance.

## 6 Conclusion

In this paper, we provide a comparative study for the state-of-art non-rigid image registration method and rigid image registration methods and show that the deep learning-based method doesn’s always have a better performance. We reproduce Voxelmorph and its variations from rigidset and nonrigidset. We also reproduce several rigid transformation approaches including SimpleElastix, ORB and Intensity-based registration. We add a gaussian blur layer and improve the Voxelmorph performance in rigid transformation. Our result is evaluated in terms of RMSE and MAE and it is observed that SimpleElastix demonstrates the best performance in rigid transformation while Voxelmorph(RN) achieves best score in pixelwise transformation. In the future, we intend to evaluate our idea on natural images and combine the advantages of SimpleElastix and Voxelmorph.

## 7 Acknowledgement

The authors would like to thank Prof. Kadambi and TA Guangyuan Zhao for the excellent teaching and service in the entire quarter.