Timely detection and prompt treatment are crucial for modern cancer care to be effective[DBLP:journals/corr/abs-1809-04430]. A recurring problem for many hospitals that hinders the administration of timely radiation therapy arises from the immense workload required for the radiation therapy pipeline[DBLP:journals/corr/abs-1809-04430]. Automating the radiation therapy process, which includes the segmentation of both tumor volumes and organs-at-risk (OARs) in patients receiving treatment, drastically reduces the burden on physicians to contour large numbers of patient images in such a time-sensitive environment. As treatment planning at minimum requires the contouring of OARs surrounding a tumor volume, segmentation of OARs often accounts for the largest proportion of the overall segmentation task. Segmenting these numerous OARs through automatic pipelines thus has the potential to greatly reduce physician workload and expedite treatment planning. For these reasons, developing automatic OAR segmentation tools has significant impact in the field of radiation therapy and could potentially save numerous lives by increasing patient turnover[DBLP:journals/corr/abs-1809-04430]. As a result, automatic segmentation of OARs has spurred significant interest in the medical and deep learning communities. Many recent works in automatic segmentation focus on the supervised paradigm of deep neural network models. For those models to be effective, they require numerous contours of OARs for training, which are manually generated by physicians. One major limitation of this approach for segmenting OARs is that clinical contour data are often incomplete: due to time constraints, physicians commonly contour only the OARs near the tumor volume, so clinical datasets rarely contain contours for all possible OARs.
In addition to automated segmentation, there has also been great progress in developing automated dose prediction methods that produce voxel-wise dose predictions, often by employing an atlas or previously treated patients. The process of registering a currently treated patient's CT scan to that of either an atlas or a previously treated patient is crucial for accurate voxel-wise dose predictions[mcintosh2016voxel].
Similarly, cross-modality registration is a critical component in the segmentation of tumor volumes under conditions of poor contrast. When tumor volumes are difficult to delineate in CT scans, it is common for physicians to contour tumor volumes on positron emission tomography (PET) scans or magnetic resonance imaging (MRI) scans for better visibility. Contouring on these other non-CT modalities is also a critical component for evaluation of patient response to radiotherapy. Transferring contours of tumor volumes onto a CT scan then requires a robust cross-modality registration method[Piert2018].
To address the challenges of segmenting OARs and to produce accurate registrations, we propose an automatic framework for atlas-based segmentation of head and neck OARs using a semi-supervised diffeomorphic registration to the atlas. The proposed framework generates contours of all OARs on the head and neck atlas as well as a registration of each individual patient to the atlas, thus being useful for both automated segmentation and automated treatment planning pipelines.
I-B Related Works
I-B1 Segmentation Methods
There are numerous automatic segmentation software packages that are commercially available, and these segmentation methods can be divided into two categories: atlas-based segmentations and supervised CNN-based segmentations. Conventional atlas-based segmentation methods make use of either rigid or deformable registration techniques to register an atlas to a patient. These methods typically solve the registration optimization problem by searching over the space of deformations. They then apply this deformation to contours made on the atlas to warp those contours into the patient space. For segmentations from supervised CNN models, the general methodology is to train a U-net[DBLP:journals/corr/RonnebergerFB15] to mimic ground truth contours of OARs provided by physicians.
Although supervised CNN models provide the current state-of-the-art segmentation performance, they do have some limitations compared to atlas-based methods for many OARs. For instance, supervised CNN models often rely on incomplete OAR datasets. Training a supervised model requires providing the model with ground truth contours. These OAR contours are often taken from clinical data or created manually by physicians specifically for the purpose of training these models. Due to the amount of labor and time required to manually contour every possible head and neck OAR, the datasets prepared for training these supervised models are often incomplete, as they do not contain contours of every possible OAR. A second limitation of these supervised CNN models is their sensitivity to visual artifacts in patient images. Distortions of the input CT image may arise from patient-based artifacts (i.e. implants, clothing, jewelry, motion, etc.), physics-based artifacts (i.e. beam hardening, aliasing, etc.), or reconstruction-based artifacts (i.e. ring artifacts, helical artifacts, etc.), which may degrade the performance of supervised models, particularly because these models rely on visual information in the image. In contrast, atlas-based segmentation models provide segmentations for all possible OARs (assuming those OARs are included in the atlas) and are more robust to artifacts, as they solve a registration problem instead of learning a function that outputs contours from CT image inputs. Nevertheless, conventional atlas-based segmentation models yield poor performance compared to neural networks for complex segmentation problems, particularly for head and neck images, which have numerous degrees of freedom (e.g. head/neck shape, head/neck rotation, head/neck bending, opening of the jaw, etc.).
I-B2 Conventional Registration Methods
Volume registration can be characterized as the problem of aligning a moving image ($M$) with a fixed image ($F$). The transformation ($\phi$) that warps $M$ onto $F$ can be computed by solving an optimization problem where the target transformation minimizes a loss function. The optimization problem has the following form:

$$\hat{\phi} = \operatorname*{arg\,min}_{\phi} \; \mathcal{L}_{sim}(F, M \circ \phi) + \lambda \, \mathcal{L}_{smooth}(\phi)$$

where $M \circ \phi$ is the warping of image $M$ by deformation field $\phi$, $\mathcal{L}_{sim}$ is typically the mean squared error or normalized cross-correlation between images $F$ and $M \circ \phi$, $\mathcal{L}_{smooth}$ is typically a spatial smoothness loss to preserve topology, and $\lambda$ is the regularization hyperparameter.
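As a concrete illustration, the generic registration objective above can be sketched in a few lines of Python. This is a minimal NumPy sketch; the function names and the choice of MSE for the similarity term are illustrative assumptions, not any specific package's API:

```python
import numpy as np

def similarity_loss(fixed, warped):
    # L_sim: mean squared error between the fixed and warped images
    return np.mean((fixed - warped) ** 2)

def smoothness_loss(field):
    # L_smooth: penalize spatial differences of the deformation field.
    # `field` has shape (X, Y, Z, 3): one displacement vector per voxel.
    loss = 0.0
    for axis in range(3):
        diffs = np.diff(field, axis=axis)  # forward finite differences
        loss += np.mean(diffs ** 2)
    return loss

def registration_loss(fixed, warped, field, lam=0.01):
    # Overall objective: L_sim(F, M ∘ phi) + lambda * L_smooth(phi)
    return similarity_loss(fixed, warped) + lam * smoothness_loss(field)
```

Conventional methods minimize this objective per image pair by iterative search; learning-based methods instead train a network to predict the field that minimizes it.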
Conventional registration methods solve the optimization problem by searching the space of deformations[Sims2009, Teguh2011, HoangDuc2015, Haq2019]. These methods can be categorized into elastic deformation models[Thirion1998, Bajcsy1989], deformations using b-splines[Li2017, Heinrich2013, Nakano2017], statistical parametric mapping[Ashburner2000], Demons[Nithiananthan2009], and Markov random field based discrete optimization[Li2017, Glocker2008].
The allowable transformations can also be constrained to diffeomorphisms in order to preserve topology and maintain invertibility of the transformation[Ashburner2007]. Diffeomorphic registration algorithms have seen considerable development over the years, resulting in publicly available tools such as ANTs[Avants2008], Large Diffeomorphic Distance Metric Mapping (LDDMM)[Cao2005, Beg2005], diffeomorphic demons[Vercauteren2009, Pukala2016], and DARTEL[Ashburner2007]. Variations of these algorithms have been adapted into commercial packages made available by vendors such as MIM, Varian, RaySearch, and Phillips[Pukala2016].
Under a probabilistic formulation, priors can also be specified on the deformation field[Ashburner2007, Simpson2012], and the underlying cost function can be minimized using an iterative optimization approach to find a deformation field distribution that resembles the prior. Our proposed method improves on a deep learning-based formulation proposed in Voxelmorph[DBLP:journals/corr/abs-1809-05231, DBLP:journals/corr/abs-1903-03545] and will be discussed in subsequent sections. We also provide further background on learning-based registration in Supplemental Materials sections A-C.
II Materials and Methods
II-A Problem Formulation
The goal of this paper is to find a deformation field that solves an atlas registration problem and then use the deformation field solution to warp atlas-space contours of OARs to the patient space. The inputs to our registration problem are head and neck CT scans of individual patients (which we call the fixed image, $F$), the Brouwer head and neck atlas (which we call the moving image, $M$)[Brouwer2015], and OAR contours defined on the Brouwer head and neck atlas ($S_M$). Both $F$ and $M$ are intensity functions defined over $\mathbb{R}^3$, and the proposed model attempts to generate the moved image $M \circ \phi$ such that $M \circ \phi$ is similar to $F$.
Here, $\phi_a$ and $\phi_d$ denote the deformations for the affine transform and dense diffeomorphic transform, respectively. Under a generative model, $\phi_d$ is parametrized by the latent variable $z$ that either defines the velocity field (in the case of probabilistic Voxelmorph[DBLP:journals/corr/abs-1903-03545]) or a low-dimensional embedding (as in a variational autoencoder[Krebs2019]). Clearly, the definition of "similar" changes with the particular registration problem, and we further define "similar" in terms of a training objective and evaluation metric in upcoming subsections. The proposed approach learns network weights to minimize the training objective in either an unsupervised or semi-supervised manner (i.e. unsupervised if no ground truth deformation fields or OAR segmentations are provided, and semi-supervised if only the OAR segmentations are provided), and we begin by describing the network and its building blocks below.
II-B Network Overview
Our proposed network, presented in Figure 1, consists of a cascade of affine and dense diffeomorphic deformation blocks. The localization network, depicted in Figure 2, learns the deformation fields $\phi_a$ and $\phi_d$ given inputs $M$ and $F$. Warping of images is performed using a spatial transformer layer[DBLP:journals/corr/JaderbergSZK15], which takes as input an image and a deformation field (see Figure 5 in the Supplemental Materials). Based on a training objective function, the network uses stochastic gradient descent methods to find the 12 parameters that specify an affine transform, as well as the voxel-wise velocity field. As head and neck registration typically involves large displacements, our model leverages a cascade of both affine and dense transforms. This cascade is made possible by constraining the transform to a diffeomorphism, requiring the transform to be smooth and invertible. To enforce smoothness, we incorporate Gaussian smoothing of the learned velocity field directly into the network and add a KL divergence term between the approximate posterior and prior (described later in the section on losses, as well as in Supplemental Materials section C on diffeomorphic transforms). The network then uses scaling and squaring integration layers (with the default step size of 8) on the velocity field, as described in various implementations of Voxelmorph[DBLP:journals/corr/abs-1809-05231, DBLP:journals/corr/abs-1903-03545], to constrain the transformation to a diffeomorphism. In general, the network takes the moving image $M$, first warps it with an affine displacement field, and then warps the affine-transformed moving image with the dense displacement field cascade to get the warped image $M \circ \phi$. The overall warping can be described with the following equation:

$$M \circ \phi = (M \circ \phi_a) \circ \phi_d$$

Then, to warp contours of the OARs (which we will call $S$ from here on out) from the atlas space to the patient space, we leverage the cascading property of diffeomorphisms as follows:

$$S \circ \phi = (S \circ \phi_a) \circ \phi_d$$
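The cascaded warping described above amounts to resampling the atlas image and its contours through the affine field first and the dense field second. A minimal NumPy sketch follows, using nearest-neighbour resampling for brevity (the actual network uses trilinear interpolation inside spatial transformer layers, and all names here are illustrative):

```python
import numpy as np

def warp_nn(volume, disp):
    # Warp `volume` by displacement field `disp` (shape (X, Y, Z, 3))
    # using nearest-neighbour resampling.
    idx = np.indices(volume.shape).transpose(1, 2, 3, 0)  # voxel coords
    src = np.rint(idx + disp).astype(int)                  # p + u(p)
    for d, size in enumerate(volume.shape):
        src[..., d] = np.clip(src[..., d], 0, size - 1)    # stay in bounds
    return volume[src[..., 0], src[..., 1], src[..., 2]]

def cascade_warp(volume, affine_disp, dense_disp):
    # Apply the affine warp first, then the dense warp; the same
    # cascade moves both the atlas image M and the atlas contours S.
    return warp_nn(warp_nn(volume, affine_disp), dense_disp)
```

Because each stage is applied to the output of the previous one, the composite stays diffeomorphic whenever each individual field is.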
We selected 8 integration steps for scaling and squaring to satisfy the trade-off between voxel folding and computation time (i.e. increasing the number of integration steps reduces the number of folding voxels but increases the computation time)[DBLP:journals/corr/abs-1809-05231, DBLP:journals/corr/abs-1903-03545].
II-C Localization Network
The proposed localization network utilizes a U-net architecture for the dense transformations and a traditional CNN for the affine transformation. As we want to incorporate as many dense transformations into the cascade as memory permits, we must compromise by limiting the localization network sizes, which provides the added benefit of preventing overfitting. In comparison to the localization networks of previously proposed frameworks like Voxelmorph and Microsoft's Volume Tweening Network (VTN)[DBLP:journals/corr/abs-1902-05020], our localization network incorporates dense blocks of convolutional layers, which we found to improve training convergence and testing performance (Figure 2). The implementation details of the localization network, such as convolution filter dimensions, number of convolution filters, and feature sizes, are shown in Figure 2. Each localization network is tasked with extracting deformation field parameters that are used in the deformation blocks to warp the moving image. The model was implemented using Keras[chollet2015keras] with a Tensorflow[Abadi:2016:TSL:3026877.3026899] backend.
II-D Objective Function
Based on our assumptions, we formulate the objective function to be minimized through a deep learning approach. Letting the localization network be parametrized by $\theta$, we can minimize the following loss using stochastic gradient descent methods:

$$\mathcal{L}(\theta; F, M) = \mathcal{L}_{1} + \mathcal{L}_{2} + \mathcal{L}_{3} + \mathcal{L}_{KL}$$

To improve readability, we include only the overall objective function here. A more detailed explanation of the overall objective function and each component of Equation 5 can be found in Supplemental Materials section D.
III-A Experimental Setup and Evaluation
The dataset used in our experiment consists of 500 CT scans of head and neck patients taken from the National Cancer Institute's Quantitative Imaging Network dataset[Fedorov2016], McGill's head and neck PET-CT dataset[Vallieres2017], Ibragimov's head and neck OAR dataset[ibragimov2017], and Stanford Radiation Oncology Clinic (SROC) data. Scanner details, acquisition dates, age, and sex varied across the datasets used. All scans were reoriented to a standard orientation, cropped to a head and neck window above the fourth thoracic vertebra, downsampled to a size of 128x128x128, thresholded to a soft tissue interval between -170 and 230 HU[Hoang2010], and normalized to between 0 and 1.
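The intensity preprocessing described above (soft-tissue thresholding followed by normalization) can be sketched as follows. The function name is illustrative; reorientation, cropping, and downsampling are assumed to happen separately:

```python
import numpy as np

def preprocess_ct(scan_hu, lo=-170.0, hi=230.0):
    # Clip the CT volume to a soft-tissue window of [-170, 230] HU,
    # then linearly rescale intensities to the [0, 1] interval.
    clipped = np.clip(scan_hu, lo, hi)
    return (clipped - lo) / (hi - lo)
```

Soft-tissue windowing removes most metal and air intensities before the network ever sees them, which also limits the influence of bright artifacts.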
Training of our registration model involves matching the input CT images and OAR contours between each patient and an atlas. While OAR contours on the patient CT images are unnecessary for training in an unsupervised manner, we incorporate them into our objective function following typical semi-supervised training protocol. As we only had access to OAR contours for the 40 SROC patients, the remaining 460 patient images were unlabeled. To mitigate the discrepancy between the number of labeled and unlabeled images in our training set, we generated pseudo-labels for the 460 originally unlabeled images using a separately trained supervised CNN.
The set of ground truth contours for our data consists of 8 OARs, including the mandible (M), left optic nerve (lON), right optic nerve (rON), left parotid (lP), right parotid (rP), spinal cord (SC), left submandibular gland (lSG), and right submandibular gland (rSG). Our experiment used the Brouwer head and neck atlas[Brouwer2015], which defines a set of 36 OARs that encompass the 8 OARs mentioned above. The dataset was split into a training set of 475 scans, a validation set of 5 scans, and a testing set of 20 scans. To ensure that the generated pseudo-labels do not confound our results, all 5 validation scans and all 20 test set scans contained segmentations that were manually contoured by physicians as part of the radiotherapy pipeline (i.e. SROC data).
To evaluate the performance of the proposed model, we calculate the segmentation overlap, measured by the Dice similarity coefficient, between the warped atlas segmentations ($S \circ \phi$) and the segmentations annotated on each patient ($S_F$). As our test set only has ground truth contours for 8 OARs, our evaluation pertains only to those 8 OARs, but all 36 OARs can be warped to the patient space (as shown in Figure 1 and Figure 4).
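For reference, the Dice similarity coefficient between two binary masks can be computed as below (a standard formulation; the convention of returning 1.0 for two empty masks is our choice for this sketch):

```python
import numpy as np

def dice_score(pred, truth):
    # Dice similarity coefficient: DSC = 2|A ∩ B| / (|A| + |B|)
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, truth).sum() / denom
```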
III-B Comparison to Other Methods
For all comparisons, we used the same learning rate and a batch size of 1 (due to memory constraints), and trained all models until convergence. Table 2 in the Supplemental Materials summarizes the number of network parameters and the values of the regularization parameters used. Numerous works have already compared the performance of unsupervised learning-based models to conventional non-learning-based registration models (e.g. SyN, Elastix), showing that learning-based models perform comparably to or exceed conventional models[DBLP:journals/corr/abs-1809-05231, DBLP:journals/corr/abs-1903-03545, DBLP:journals/corr/abs-1902-05020, Krebs2019]. For clarity, we compare our proposed model to implementations of current state-of-the-art learning-based models like Voxelmorph, VTN, and VAE-like networks. We decompose these other models into 4 baselines (an Info VAE, a non-generative cascaded model with 1 affine and 1 dense block, a non-generative cascaded model with 1 affine and 2 dense blocks, and a generative cascaded model with 1 affine and 2 dense blocks). For atlas-based segmentation of head and neck patients, our method outperforms other state-of-the-art registration methods, with key results summarized in Figure 3. Compared to the other learning-based algorithms we tested, our method achieves the best OAR Dice score on this dataset. Examining the results in Figure 3 reveals that the non-generative models tend to overfit to the training data, which we mitigate in our proposed network by incorporating Gaussian smoothing and regularization terms. For our task, VAE-like networks tend to underfit the training data. Our initial comparisons used a VAE-like model similar to Krebs et al.[Krebs2019], but due to poor convergence we chose not to include those comparisons in our results.
We instead developed an Info VAE network[DBLP:journals/corr/ZhaoSE17b] to mitigate underfitting, but even that did not fully resolve the issue.
Registration of head and neck patients often involves large deformations due to differences in body geometry, position, rotation, and bending angle. Cases that require these large deformations can be better fit by breaking down the overall deformation into a cascade of smaller, more manageable ones. As our framework cascades deformations (i.e. Equations 3 and 4), it maintains a diffeomorphic property if each individual deformation in the cascade remains diffeomorphic[Ashburner2007], which we can enforce by assuming a stationary velocity field and integrating that velocity field using a scaling and squaring method (see section C in the Supplementary Materials). As depicted in Figure 3, incorporating more deformation blocks into the cascade allowed for improved training-time registration performance, which can lead to overfitting, as is the case with Baselines 1 and 2. Using more deformation blocks in the cascade improves training efficiency because it allows the network to perform a coarse-to-fine alignment, with each alignment involving smaller displacements than if the network had used only one deformation block. Our proposed method uses a variational approach while leveraging multiple dense deformation blocks in a cascade. The variational regularization terms, along with built-in Gaussian smoothing of the velocity field, help to reduce overfitting for our proposed method. Moreover, our method utilizes an improved localization network composed of dense convolution blocks. Along with the semi-supervised pseudo-labelling of our training data, these improvements contribute to the improved performance of our proposed method compared to other state-of-the-art learning-based registration algorithms.
As with any method intended for clinical use, it is natural to question performance under less than ideal situations. Other segmentation methods, such as supervised deep learning ones, may underperform in the presence of large image artifacts. In head and neck data, the presence of metal artifacts from dental implants, for instance, can obscure surrounding OARs and degrade the performance of segmentation methods applied to those images. Under similar conditions presented in Figure 4a-b, we can appreciate the robustness of our proposed methods to these image artifacts.
There are a few potential limitations to our methods. As our current study only uses the Brouwer atlas, performance is largely capped by the similarity between the Brouwer atlas and patient images. In edge cases where there are large differences between the atlas and patient, it may be better to first merge multiple atlases or retrain a model using a single, more representative atlas.
Our comparisons use ground truth OAR contours acquired from routine clinical workflow. While this does improve the relevance of our results to clinical practice, it also introduces biases that may not be as present had the ground truth contours come from multiple expert raters following a specific atlas. To further determine the usefulness of our proposed algorithm for clinical applications, we compare it to current state-of-the-art supervised learning segmentation algorithms and traditional multi-atlas-based auto segmentation algorithms (multi-ABAS). Though we cannot feasibly test all of these algorithms on our particular dataset, we follow the precedent of previous works and present a rough comparison. Table 1 demonstrates that the performance of our algorithm compares very favorably against traditional multi-ABAS algorithms and matches the performance of current state-of-the-art supervised segmentation algorithms, making our algorithm highly relevant to the clinic.
Our results demonstrate the clinical applicability of atlas-based segmentation through semi-supervised diffeomorphic registration. We show that our algorithm exceeds the performance of other learning-based registration algorithms and traditional atlas-based auto segmentation algorithms while providing comparable performance to that of current state of the art supervised segmentation algorithms. This work presents the approach behind learning-based registration frameworks and can be further extended to other clinically relevant registration problems (i.e. multimodal registration, atlas-based dose prediction, etc.) or atlas-based segmentation of other regions of the body (i.e. lungs, prostate, etc.).
V-A Deep Learning Based Registration Methods
Conceptually, registration methods that use deep learning require methods for feature extraction and spatial transformation of images. Feature extractors are tasked with transforming high dimensional inputs into meaningful low-dimensional features, and since the inputs to the registration model are images (i.e. intensity matrices M and F), various CNN based architectures are typically used for feature extraction. These extracted features can then provide useful information on how best to warp the moving image to the fixed image. The mechanism typically used for warping images is some variant of a spatial transformer[DBLP:journals/corr/JaderbergSZK15], though there is a mechanism for aligning images using CNNs to perform patch-wise matching that does not require a spatial transformer[Dalca2016]. These patch-wise methods, however, are computationally prohibitive.
Spatial warping in deep learning registration is typically accomplished through a spatial transformer layer[DBLP:journals/corr/JaderbergSZK15], which takes an image and some transformation parameters as inputs and generates a warped version of the input image. The spatial transformer layer in our proposed framework performs the following steps:
1. warp every voxel $p$ to a new off-grid location $p' = p + u(p)$, where $u(p)$ is the voxel-wise shift at voxel $p$
2. compute a linear interpolation of the image at the new location $p'$
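The two steps above can be sketched in one dimension for clarity. This is a NumPy illustration of the resampling logic only, not the actual spatial transformer layer, which operates on 3D volumes:

```python
import numpy as np

def spatial_transform_1d(image, disp):
    # Step 1: warp every voxel p to the off-grid location p' = p + u(p).
    coords = np.arange(len(image)) + disp
    coords = np.clip(coords, 0, len(image) - 1)  # stay inside the image
    # Step 2: linearly interpolate the image at the new locations.
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, len(image) - 1)
    frac = coords - lo
    return (1 - frac) * image[lo] + frac * image[hi]
```

Because linear interpolation is differentiable in both the image values and the sampling coordinates, gradients can flow back through the warp to the network that predicted the displacement field.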
The first category of deep learning-based registration methods involves training a neural network to map a pair of input images to a ground truth deformation field. These supervised registration methods rely on ground truth deformations that are usually obtained from conventional registration methods. Due to the reliance on a ground truth deformation field, the utility of training these supervised models may be more limited[Krebs2017, Sokooti2017]. Based on the notion of learning deformation fields to perform registration tasks, several works have used neural networks to learn deformation fields in an unsupervised, end-to-end fashion[DBLP:journals/corr/abs-1809-05231, DBLP:journals/corr/abs-1903-03545, DBLP:journals/corr/abs-1902-05020, Krebs2019]. Instead of learning from ground truth deformation fields, these unsupervised approaches learn deformation fields that minimize a registration objective function. Similar to conventional registration methods, deep learning-based methods can constrain the deformation field to be diffeomorphic. Further, recent work by Dalca et al. uses a variational inference approach that tries to minimize the Kullback–Leibler (KL) divergence between their predicted deformation field distribution (posterior) and a gaussian deformation prior[DBLP:journals/corr/abs-1903-03545]. Other works may not use variational inference but still constrain the learned deformation to a diffeomorphism and regularize on the deformation field directly[Krebs2019].
V-B Generative Model
Under a generative model formulation, the network learns the mean $\mu$ and covariance $\Sigma$ of the velocity field distribution instead of the velocity field directly. The velocity field is sampled from the predicted mean $\mu$ and covariance $\Sigma$ using the reparameterization trick:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
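A minimal sketch of the reparameterization step follows (NumPy for illustration; in the network this sampling happens inside a layer so that gradients flow through the predicted mean and variance, and the function name is ours):

```python
import numpy as np

def sample_velocity(mu, log_sigma, rng):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    # Sampling is expressed as a deterministic function of (mu, sigma)
    # plus external noise, so it is differentiable in mu and sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps
```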
There are also attempts in the literature to use a VAE-like approach where the latent variable represents a low-dimensional embedding instead of the velocity field distribution[Krebs2019]. A key assumption made, as done in reference[DBLP:journals/corr/abs-1903-03545], is to model the prior stationary velocity field distribution as a multivariate Gaussian:

$$p(z) = \mathcal{N}(z; 0, \Sigma_z)$$

where $z$ is the latent variable of voxel-wise velocities that parametrize the warping function $\phi_z$, $\mathcal{N}(\cdot; \mu, \Sigma)$ is the multivariate Gaussian with mean $\mu$ and covariance $\Sigma$, and $p(z)$ is the prior probability. In the case of the VAE-like approach, the prior is instead a unit Gaussian:

$$p(z) = \mathcal{N}(z; 0, I)$$

The assumption that the posterior distribution can be approximated with a multivariate Gaussian typically applies after the moving image $M$ has already been warped by an affine transform.
V-C Diffeomorphic Transforms
The diffeomorphism is specified by integrating the following ODE using the scaling and squaring method:

$$\frac{d\phi^{(t)}}{dt} = v\left(\phi^{(t)}\right)$$

Diffeomorphisms are generated by initializing to the identity ($\phi^{(0)} = Id$) and integrating over unit time to compute $\phi^{(1)}$[Ashburner2007, DBLP:journals/corr/abs-1903-03545]. Following our generative model approach, we define the velocity field as the latent variable $z$, but the diffeomorphism can be specified generally for a velocity field without taking a generative model approach. We will subsequently notate $\phi_z$ as being parameterized by the latent variable $z$. The integration is then computed using the scaling and squaring method, where a large number of small deformations is used to maintain accuracy. The scaling and squaring approach assumes that the number of time steps $2^T$ is a power of two and computes

$$\phi^{(1/2^T)} = p + \frac{v(p)}{2^T}$$

We then can derive the recurrence as follows:

$$\phi^{(1/2^{t-1})} = \phi^{(1/2^t)} \circ \phi^{(1/2^t)}$$

In practice, we set $T$ to be large so that each deformation is small. Equation 10 then ensures the mapping is diffeomorphic based on the intuition that the Jacobian determinant of a deformation that conforms to an exponential is always positive.
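The scaling and squaring recurrence can be illustrated in one dimension. This NumPy sketch composes the displacement field with itself via linear interpolation; the network performs the analogous operation on 3D fields via spatial transformer layers:

```python
import numpy as np

def integrate_velocity_1d(velocity, steps=8):
    # Scaling and squaring: scale the stationary velocity field down by
    # 2**steps, then self-compose ("square") the small deformation
    # `steps` times to approximate the ODE flow at unit time.
    n = len(velocity)
    disp = velocity / (2.0 ** steps)   # phi^(1/2^T): p + v(p)/2^T
    grid = np.arange(n, dtype=float)
    for _ in range(steps):
        # Composition: u_new(p) = u(p) + u(p + u(p)), with u sampled
        # at the off-grid point p + u(p) by linear interpolation.
        coords = np.clip(grid + disp, 0, n - 1)
        lo = np.floor(coords).astype(int)
        hi = np.minimum(lo + 1, n - 1)
        frac = coords - lo
        disp = disp + (1 - frac) * disp[lo] + frac * disp[hi]
    return disp
```

With 8 steps, each intermediate deformation moves voxels by at most 1/256 of the full displacement, which keeps the composition accurate and the mapping invertible.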
We aim to estimate the posterior probability of $z$ given the observed images ($F$ and $M$) and find the most probable estimate of the values for $z$, which is known as the maximum a posteriori (MAP) estimate. If we approximate the likelihood as a multivariate Gaussian,

$$p(F \mid z; M) = \mathcal{N}\left(F; M \circ \phi_z, \sigma^2 I\right)$$

a variational learning approach can then be followed, where we minimize the Evidence Lower Bound (ELBO) loss:

$$\min_{\psi} \; KL\left[q_\psi(z \mid F; M) \,\|\, p(z)\right] - \mathbb{E}_{q_\psi}\left[\log p(F \mid z; M)\right]$$
The equation above follows from a derivation in the recent Voxelmorph paper[DBLP:journals/corr/abs-1903-03545]. For our objective function, we define the corresponding KL divergence term $\mathcal{L}_{KL}$, which is further expanded on in the section below on the objective function. In the VAE approach[Krebs2019], the main difference is that the prior is assumed to be a unit Gaussian, so the KL divergence term in the above equation is simpler to compute:

$$KL\left[q_\psi(z \mid F; M) \,\|\, \mathcal{N}(0, I)\right]$$
As we found vanilla VAE implementations to underfit the training data set, we decided to implement an Info VAE instead. In the Info VAE, we replace the KL divergence term with a maximum mean discrepancy loss, where $k(\cdot,\cdot)$ is any positive definite kernel[DBLP:journals/corr/ZhaoSE17b]:

$$\mathrm{MMD}(q \,\|\, p) = \mathbb{E}_{p(z),\,p(z')}\left[k(z, z')\right] - 2\,\mathbb{E}_{q(z),\,p(z')}\left[k(z, z')\right] + \mathbb{E}_{q(z),\,q(z')}\left[k(z, z')\right]$$
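A sketch of a sample-based MMD estimator with a Gaussian RBF kernel follows (the RBF kernel is our illustrative choice; the Info VAE formulation permits any positive definite kernel):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Positive definite RBF kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd_loss(z_q, z_p, sigma=1.0):
    # MMD between samples z_q ~ q(z) and z_p ~ p(z):
    # E[k(zp, zp')] - 2 E[k(zq, zp)] + E[k(zq, zq')]
    return (gaussian_kernel(z_p, z_p, sigma).mean()
            - 2.0 * gaussian_kernel(z_q, z_p, sigma).mean()
            + gaussian_kernel(z_q, z_q, sigma).mean())
```

Unlike the KL term, MMD only matches moments of the aggregate posterior to the prior, which is what makes it less prone to over-penalizing individual latent codes.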
V-D Objective Function (cont.)
For convenience, we reproduce Equation 5 (the objective function) below:

$$\mathcal{L}(\theta; F, M) = \mathcal{L}_{1} + \mathcal{L}_{2} + \mathcal{L}_{3} + \mathcal{L}_{KL}$$

For all reconstruction losses ($\mathcal{L}_{1}$ and $\mathcal{L}_{2}$), we compute the mutual information as follows:

$$MI(X, Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

In practice, we compute the mutual information using 32 bins, where the standard deviation is calculated as half the width of each bin. We then define each component of the overall loss function below. The first component ($\mathcal{L}_{1}$) captures the similarity between the fixed image $F$ and the final moved image $M \circ \phi$:

$$\mathcal{L}_{1} = -MI(F, M \circ \phi)$$
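A hard-binned version of this mutual information estimate can be sketched as follows. The training loss uses soft Parzen-window binning so that it remains differentiable; this NumPy version illustrates only the underlying histogram computation:

```python
import numpy as np

def mutual_information(x, y, bins=32):
    # Histogram-based estimate of MI between the intensities of two
    # images normalized to [0, 1]:
    #   I(X;Y) = sum_{x,y} p(x,y) * log[ p(x,y) / (p(x) p(y)) ]
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(),
                                 bins=bins, range=[[0, 1], [0, 1]])
    pxy = joint / joint.sum()               # joint distribution
    px = pxy.sum(axis=1, keepdims=True)     # marginal of X
    py = pxy.sum(axis=0, keepdims=True)     # marginal of Y
    mask = pxy > 0                          # avoid log(0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))
```

An image carries maximal mutual information with itself and close to none with an independent image, which is why the negated MI serves as a similarity loss.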
The second component ($\mathcal{L}_{2}$) captures the similarity between the fixed image $F$ and the moving image warped using only the affine transform:

$$\mathcal{L}_{2} = -MI(F, M \circ \phi_a)$$

The third component ($\mathcal{L}_{3}$) captures the similarity between segmentations $S_F$ and $S_M \circ \phi$. Recall that $S_M \circ \phi$ is the set of warped atlas segmentations (where we use 8 out of 36 OARs to generate the segmentation mask) and $S_F$ is the 8-OAR segmentation of image $F$, either manually contoured or generated by deploying a supervised CNN on image $F$ (we describe this process in detail in the experimental setup). We define this component using the MSE as follows:

$$\mathcal{L}_{3} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left( S_F(p) - (S_M \circ \phi)(p) \right)^2$$
The fourth component ($\mathcal{L}_{KL}$) represents the KL divergence between the approximate posterior $q_\psi(z \mid F; M)$ and our assumed multivariate Gaussian prior $p(z)$. The derivation for our overall loss function is shown in the Diffeomorphic Transforms section of our Supplemental Materials, and part of the derivation is reproduced here:

$$\mathcal{L}_{KL} = KL\left[q_\psi(z \mid F; M) \,\|\, p(z)\right]$$
Finally, we note one last component ($\mathcal{L}_{grad}$) that is not a part of our proposed method but acts as a regularization term for the non-generative models used in our baseline comparisons. $\mathcal{L}_{grad}$ is a gradient loss that ensures smoothness of the displacement field $u$:

$$\mathcal{L}_{grad} = \sum_{p \in \Omega} \left\| \nabla u(p) \right\|^2$$

where we approximate $\nabla u$ with finite differences along $(e_1, e_2, e_3)$, forming the natural basis for a 3D image.
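The finite-difference gradient loss can be sketched as below (a NumPy illustration; `disp` holds one 3-vector of displacements per voxel, and the mean-based reduction is our normalization choice):

```python
import numpy as np

def gradient_loss(disp):
    # Smoothness penalty: mean squared spatial gradient of the
    # displacement field, approximated with forward finite differences
    # along each image axis (the natural basis e1, e2, e3 of a 3D image).
    # `disp` has shape (X, Y, Z, 3).
    loss = 0.0
    for axis in range(3):
        loss += np.mean(np.diff(disp, axis=axis) ** 2)
    return loss
```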
Finally, we summarize hyperparameter values used in our experiments in Table 2.