## I Introduction

Computerized phantoms for nuclear medicine imaging research have been built based on anatomical and physiological models of human beings. They have played a crucial part in evaluation and optimization of medical imaging techniques and image reconstruction, processing and analysis methods[Christoffersen2013, Zhang2017, Chen2019, Abdoli2013]. Since the exact structural and physiological properties of the phantom are known, they can serve as a gold standard for the evaluation and optimization process. The 4D extended cardiac-torso (XCAT) phantom [Segars2010] was developed based on anatomical images from the Visible Human Project data. This realistic phantom includes parameterized models for anatomy, which allows the generation of different anatomical variations of the phantom. These phantoms have been used in Nuclear Medicine imaging researches [He2008, Ghaly2016, Li2018], as well as in the various applications of deep learning[Gong2018, Gong2019, Lee2017]. By changing the values of parameters that control organ anatomy, the volumes and shapes of some tissues can be varied. However, the scaling of organs, even when different factors are used in orthogonal directions, does not fully and truly capture the interior anatomical variations within different patients. In [Segars2013], Segars et al. used a deformable image registration technique to map phantom labels to patient segmentation; the resulting deformation fields were then being applied to the phantom, thus creating a population of the new XCAT models that capture the anatomical variability among patients. This method relies on the segmentation of patient images, which is tedious and time consuming. In this work, we propose a Convolutional Neural Networks (ConvNets) based approach to perform patient to patient registration. The resulting deformation field can then be applied to organ label maps to automatically generate a segmentation of the patient image.

Deformable Image registration is a process of transforming two images into a single coordinate system, where one image is often referred to as the moving image, we denote it as , while the other is referred to as the fixed image, denoted as

. Traditional methods formulate registration as a variational problem for estimating a smooth mapping,

, between the points in one image and those in another. They often tend to iteratively minimize the following energy function (eq. 1) on a single image pair [Sotiras2013]:(1) |

where, measures the level of alignment between the transformed moving image, , and the fixed image, . Some common choices for are mean squared error (MSE) or the norm of the difference [Beg2005], sum of squared differences (SSD) [Wolberg2000], cross-correlation (CC) [Avants2008], and mutual information (MI) [Viola1997]. The transformation, , at every point is defined by an identity transformation with the displacement field u, or , where represents the identity transform [Balakrishnan2019]. The second term, , is referred to as the regularization on the deformation, , which enforces the spatial smoothness. It is usually characterized by the gradients of u. One common assumption is that similar structures are presented in both moving and fixed images. Hence, a continuous and invertible deformation field (a diffeomorphism) is more desired, and the regularization term, , was designed for such reason. While diffeomorphisms are essential in some studies, for which the registration field is analyzed further. In the application of registration-based segmentation, the quality of the segmentation propagation is more critical than the diffeomorphic property of the underlying deformation fields [Rueckert2006]. In this study, due to the large interior and exterior shape variability between digital phantoms and patients, we did not impose the registration to be diffeomorphic.

Recently, many deep learning-based methods were proposed to perform the registration tasks, for instance, [Balakrishnan2019, Dalca2019, DeVos2017, Balakrishnan2018, Krebs2019]. Some of the listed methods were introduced as the unsupervised (or more percisely, self-supervised) techniques, but they still require a prior training stage with a large amount of training data. These methods assume that neural networks could learn the universal computation of the displacement field by minimizing the registration energy function over a dataset of images. This is a common assumption to make with the deep learning based approach. Yet, such an assumption could be unreliable according to a recent study from Zhang et al. [Zhang2016]

, where they showed that a well-generalized CNN classifier trained by a large dataset can still easily overfit a random labeling of the training data. More studies on fooling the deep neural networks (DNNs) with adversarial images also suggested that the well-trained networks can be unstable to small or even tiny perturbations of the data

[Su2019, Moosavi-Dezfooli2016, Goodfellow2014, Papernot2016, Szegedy2013]. Whereas, our proposed registration method is fully unsupervised, meaning that no previous training is required. Instead of following the conventional pattern of training a network on a huge dataset of accurately annotated images, we show that CNN is able to estimate an optimal deformation field for a single image pair by minimizing the energy function described in eq. 1 iteratively. This idea was inspired by Lempitsky et al.’s work on the Deep Image Prior[Lempitsky2018] (DIP), where they found that learning from a large amount of data is not necessary for building useful image priors, but the structure of a convolutional generator network itself is sufficient to capture plenty of image statistics. They treated the training of ConvNets with random initialization as a regularization prior, in order to achieve good solutions in their application of image denoising, determining early stopping points are often required. Whereas in image registration, instead of starting from a random initialization (i.e., random noises), it makes logical sense to initialize the ConvNet with moving images. Since one would like to make a moving image as similar to a target image as possible, an early stopping is not desired. In this work, we treat ConvNet as an optimization tool, where it minimizes the energy function via reparametrization in each iteration.## Ii Method

### Ii-a Computerized Phantom Generation

The phantom used in this study was created on the 3D attenuation distributions of the realistic NURBS-based XCAT phantom [Segars2008]. Attenuation values were computed based on the material compositions of the materials and the attenuation coefficients of the constituents at 140 keV, the photon energy of Tc-99m used in Nuclear Medicine. Only a single 3D phantom image was used to be deformed to multiple patient CT image. Simulated attenuation map image can be treated as the template image, and phantom image can then be think of as the atlas in the traditional paradigm of medical image registration. Our aim is to register phantom image to patient CT images for the segmentation of patient scans and creating patient-like phantoms.

### Ii-B Image Registration with ConvNet

Let a moving image be , and a fixed image be , we assume that they are 2D grayscale images and affinely aligned. We model the computation of the displacement field, , given the image pair, and , using a deep ConvNet with its parameters , i.e., . Figure 1 describes the architecture of our proposed method, it consists of a ConvNet that outputs registration field, and a B-spline spatial transformer. First, the ConvNet generates the for the given image pair of and . Second, the deformed moving image is obtained by applying a B-spline spatial transformer that wraps with (i.e., ). Finally, We backpropagate the loss from the similarity measure between and to update in the ConvNet. The steps were repeated iteratively until the loss converges, then the resulting

represents the optimal registration field for the given image pair. The loss function (

) of this problem can be formulated mathematically as:(2) |

Then, the parameters that generate the optimal registration field can be enumerated by the minimizer:

(3) |

and the optimal is given by:

(4) |

The different choices of image similarity metrics and registration field regularizers () were also studied in this work, and they are described in detail in the following subsections. The next subsection describes the design of ConvNet architecture.

#### Ii-B1 ConvNet Architecture

From our experiments, we found that the choice of different network architectures does not have a large impact on the results. Our ConvNet was designed to be a U-Net-like ”hourglass” architecture [UNET]. The network consists of one encoding path, which takes a single input formed by concatenating moving and fixed images into a volume, where represents the shape of one image. Each convolutional layer has 3

3 filter followed by a rectified linear unit (ReLU), and the downsampling was performed by the 2

2 max pooling operations. In the decoding stage, the upsampling was done by ”up-convolution”

[UNET]. Each of the upsampled feature maps in the decoding stage, were concatenated with the corresponding feature maps from the encoding path. The outputting registration field, , was generated by the application of sixteen 33 convolutions followed by two 11 convolution to the 16 feature maps. The network architecture can be visualized in Figure 2.#### Ii-B2 Image Similarity Metrics

Over the years, many research groups put considerable efforts into designing the image similarity metrics. We introduced some of the metrics that are broadly adopted in image registration in the last section. In this paper, we studied the effectiveness of four different loss functions, and later, we propose a new , which takes the advantage from the Pearson’s correlation coefficient and Structural similarity index (SSIM). In the following subsections, we denote the deformed moving image as (i.e., ) for simplicity.

##### Mean Squared Error (MSE)

MSE is a simple measurement of how the intensity values line up between two images, it is applicable when and have similar contrast and intensity distributions. MSE is given by:

(5) |

where is the image domain. Then, the similarity loss function can be defined as .

##### Pearson’s Correlation Coefficient (PCC)

PCC measures the linear correlation between two images. Unlike MSE, PCC is less sensitive to the linear transformations of intensity values from one image to another. Its usage in medical image registration can be found in

[Saad2009]. PCC is defined as the covariance between images divided by the product of their standard deviations:

(6) |

where and represents the mean intensities. PCC has a range from -1 to 1, where 0 implies that there is no linear correlation, and -1 corresponds to the maximum negative correlation between two images. Since a positive correlation is more desired, we can define the loss function to be: .

##### Local Cross Correlation (CC)

Another popular image similarity metric is CC, for its robustness to intensity variations between images, it can be formulated as follows [Balakrishnan2018, Avants2008, Zhu2019]:

(7) |

where is the deformed image (i.e., ), represents the pixel location within a window , and and denote the local mean intensities within the window. Since , we minimize the negative CC. Then, the loss function is .

##### Structural Similarity Index (SSIM)

SSIM was first introduced in [Wang2004] for robust image quality assessments based on the degradation of structural information. Within a given image window, SSIM is defined by:

(8) |

where and are small constants for avoiding instability, and , and and are local means and standard deviations of the image and , respectively. SSIM has a range from -1 to 1, where 1 indicates a perfect structural similarity. Thus, .

##### Pcc + Ssim

While PCC is robust to noises, it was also found to be less sensitive to blurring. A motivating example is shown in Figure 4, where in (b), the ”Shepp-Logan” phantom image [Shepp1974] was corrupted with Gaussian noise, and in (c), the image was blurred by a Gaussian filter. Both (b) and (c) yield a lower SSIM and a higher PCC. If we think of (a) as moving image, and (b) and (c) as fixed images, SSIM would impose the ConvNets to model the details, including noises and artifacts. Whereas, using PCC alone as the loss function might converge to a blurred result. Hence, there is a need to balance those two similarity measures. Fortunately, both PCC and SSIM are bounded with a range from -1 to 1, where 1 means the most similar. Thus, we propose to combine SSIM and PCC by an equal weight:

(9) |

### Ii-C Registration Procedure

The algorithm for the proposed method is shown in Algorithm. 1. For a given pair of moving and fix image, and , an untrained ConvNet () was initialized. First, the untrained produces an initial deformation field, . Second, we deform the moving image by (i.e., ). Then, the loss between and is computed for the use of updating the parameters in the . The above procedure is repeated until we hit the maximum number of iterations.

Since no information other than the given image pair is needed, the proposed method requires no previous training, thus it is fully unsupervised. The ConvNet is capable of learning an ”optimal” deformation from a single pair of image. In the next section, we discuss the performance comparisons between this method and the state-of-the-art unsupervised methods.

## Iii Experiments

We aim to create patient-like phantoms by registering the existing XCAT phantom with patient CT images. There were nine clinical low-dose whole-body CT patient scans used in this study; for those, only the torso part of the scans was extracted, which is 1153 2D-transaxial slices in total. The data was obtained from a publicly available dataset (NaF Prostate, [Kurdziel2012]) in the Cancer imaging archive (TCIA, [Clark2013]). We first compare the performance generated by the ConvNets with different image similarity metrics. Then, we compare the proposed method with a state-of-the-art registration algorithm, the symmetric image normalization method (SyN) [Avants2008] from the ANTs package [Avants2009].

### Iii-a Loss Function Comparisons

Some examples of the registered XCAT phantom image by the five loss functions were shown in Figure. 3. (a) and (h) represent a same moving image, and (b) and (i) are the target images from the same CT slice, where the later was blurred by a low-pass Gaussian filter due to the presence of beam hardening artifacts. The third column to the last column are registration results by PCC, SSIM, PCC+SSIM, MSE, and CC, respectively. MSE and CC are two common loss functions in both traditional and learning-based image registration methods [Beg2005, Avants2008, Dalca2019, Balakrishnan2019, DeVos2017, Balakrishnan2018, Krebs2019, Zhu2019], but they failed to converge to good results (last two columns in Figure. 3). While PCC is robust to beam hardening artifacts, it produced an cartoonish contents around the spine (referring to (c) and (j)). On the other hand, SSIM completely models the noises and artifacts in the target image. The results produced by SSIM+PCC are much more balanced, combining with the Gaussian filter to additionally reduce noises, SSIM+PCC generated the best qualitative results among other loss functions.

Method | SSIM | MSE |
---|---|---|

Affine only | ||

Ours | ||

SyN (MSE) | ||

SyN (MI) | ||

SyN (CC) |

### Iii-B Registration Performance Comparisons

In this experiment, we compared the proposed method with the SyN algorithm [Avants2008]. Figure. 5 shows some qualitative comparisons between the proposed method and the SyN method. The first and the second columns indicate moving and fixed images. The third column shows the results generated by the proposed method. The fourth to the last columns represent the results obtained by SyN with MSE, CC, and MI, respectively. At a glance, our method gave a more accurate deformation than SyN, where the anatomy of bone structures and soft tissues were modeled precisely. Figure. 6 displays some qualitative results from the proposed method. The first row indicates the target images. The second and last rows show the deformed moving image and the deformed bone labels, respectively. Since the gold-standard is not available for the NaF Prostate dataset [Kurdziel2012], the registration performances were evaluated quantitatively based on MSE and SSIM between and . The results are shown in Table. I. The proposed method gave a mean SSIM of 0.947 and a mean MSE of 41.235, which outperformed the SyN method by a significant margin.

### Iii-C SPECT image simulations

Figure. 7 shows the results of mapping the XCAT phantom to a patient CT image. (a) and (b) exhibit the volume renderings of the phantom using 3DSlicer [FEDOROV20121323]. (c) and (d) show a coronal slice of the deformed phantom and the SPECT simulation, respectively. SPECT projections were simulated by an analytic projection algorithm that realistically models attenuation, scatter, and the spatially-varying collimator-detector response [Frey1993, Kadrmas1996]

. SPECT images were reconstructed using a subsets expectation-maximization algorithm (OS-EM)

[Hudson1994] based method [He2005], with 2 iterations and 10 subsets.## Iv Conclusion

This paper proposed to create patient-like phantoms with a ConvNet-based unsupervised and end-to-end registration technique that requires no prior training. Furthermore, we showed that the registration performance was significantly improved by combining SSIM and PCC as a data similarity loss function. The registration method was evaluated on the application of registering XCAT phantom with real patient CT scans and compared the registration performance in terms of SSIM and MSE to a state-of-the-art image registration method. Both quantitative and qualitative analysis indicate that our method provided the best results. Combined with Monte Carlo and CT simulation programs, the phantoms generated by our method are able to be transformed into more realistic human-like simulations.

## Acknowledgment

This work was supported by a grant from the National Cancer Institute, U01-CA140204. The views expressed in written conference materials or publications and by speakers and moderators do not necessarily reflect the official policies of the NIH; nor does mention by trade names, commercial practices, or organizations imply endorsement by the U.S. Government.

We would like to show our gratitude to Dr. Daniel Tward and Shuwen Wei for sharing their pearls of wisdom with us during the course of this research.

Comments

There are no comments yet.