A great deal of recent work in image segmentation has been based on deep neural networks, which require a large amount of accurately annotated training data. However, generating annotated data is a time-consuming process that can take several months to a year to complete [Segars2013]. By contrast, conventional methods, such as clustering [Chen2019] and level-set [Chan2001] based techniques, rely solely on the statistics of intensities in a given image. Although they do not require any training data, the intensity statistics have to be remodeled for every input image, resulting in a significant computational burden. In this work, we combine the strengths of both approaches and propose a learning-based method that can be trained in a semi-supervised or unsupervised manner. In the unsupervised framework, the proposed ConvNet model minimizes an energy function based on active contours without edges (ACWE) [Chan2001], which depends solely on the intensity statistics of the given image. The parameters of the ConvNet are optimized using a training set of images. The network learns a universal representation that enables the segmentation of an unseen image based on its intensity statistics. When ground truth data are available, we leverage the segmentation labels during network training to incorporate structural information. We evaluate the proposed method on the task of segmenting bone structures in 2D slices of simulated SPECT images. However, the proposed model can be easily extended to 3D, and it can also be readily applied to other imaging modalities.
Fig. 1 shows an overview of the proposed method. The ConvNet takes a 2D transaxial slice of a 3D SPECT image as the input and outputs a mask. In the unsupervised setting, only the ACWE loss between the predicted mask and the input image is backpropagated to update the parameters. When ground truth labels exist, the labeling loss can additionally be used to find the optimal parameters of the ConvNet.
ConvNet Architecture. The network is built on the recurrent convolutional neural network (RCNN) proposed by Liang et al. [Liang2015]. It is a small network that consists of three recurrent and two regular convolutional layers. Each recurrent layer is unrolled over a fixed number of time-steps, resulting in a feed-forward subnetwork [Liang2015] [He2015]. Notice that we continue to use PReLU in the final layer. Instead of assigning probabilities to pixel locations using Softmax, we classify any location with a value larger than 0 as foreground and all other locations as background. In our experiments, this worked better than using Softmax in the final layer for binary segmentation. Finally, the input and output of the network share the same dimensions.
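The thresholding rule described above can be illustrated with a minimal NumPy sketch. The function names `prelu` and `binarize`, the slope value of 0.25, and the toy input are hypothetical choices made for illustration; they are not taken from the original implementation.

```python
import numpy as np

def prelu(x, slope=0.25):
    """PReLU: identity for positive inputs, scaled by `slope` for the rest.

    In a real network the slope is a learned parameter; 0.25 is an
    assumed placeholder value.
    """
    return np.where(x > 0, x, slope * x)

def binarize(activation):
    """Label a pixel as foreground (1) when its activation is larger than 0."""
    return (activation > 0).astype(np.uint8)

# Toy pre-activation map standing in for the output of the last convolution.
pre_activation = np.array([[-1.2, 0.3],
                           [0.0, 2.5]])
mask = binarize(prelu(pre_activation))
print(mask)
```

With a positive slope, PReLU preserves the sign of its input, so thresholding the activated output at 0 selects the same pixels as thresholding the pre-activation values.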
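Loss Function. We propose to use the well-known Chan-Vese model of active contours without edges (ACWE) [Chan2001] as an unsupervised loss for ConvNet-based image segmentation. This model relies solely on the intensity statistics, which are independent of ground truth labels. It is expressed as:
$$E_{ACWE}(c_1, c_2, C) = \mu \cdot \mathrm{Length}(C) + \nu \cdot \mathrm{Area}(\mathrm{inside}(C)) + \lambda_1 \sum_{i \in \mathrm{inside}(C)} |g(i) - c_1|^2 + \lambda_2 \sum_{i \in \mathrm{outside}(C)} |g(i) - c_2|^2,$$
where $i$ is the index of pixel locations, $c_1$ and $c_2$ are the averages of the image inside and outside the contour, $C$, and $\mu$, $\nu$, $\lambda_1$, and $\lambda_2$ are weighting parameters for each term. Since $\mathrm{Area}(\mathrm{inside}(C))$ is in some sense comparable to $\mathrm{Length}(C)$ [Chan2001], and for simplicity, we choose $\nu$ to be 0. Both $\lambda_1$ and $\lambda_2$ are chosen to be 1. The contour/segmentation is modeled by a ConvNet, $f$, with parameters $\theta$ and the input image g, i.e., $f_\theta(g)$. The loss function, $\mathcal{L}_{ACWE}$, is this energy evaluated with $f_\theta(g)$ in place of the contour $C$.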
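As a rough illustration of how an ACWE-style energy can be evaluated on a predicted mask, the following NumPy sketch may help. It is not the authors' implementation and rests on several assumptions: the mask is relaxed to values in [0, 1], the length term is approximated by the total variation of the mask, the area term is dropped (its weight is set to 0 in the text), and the function name `acwe_loss`, its argument names, and the default weights are hypothetical.

```python
import numpy as np

def acwe_loss(image, mask, mu=1.0, lam1=1.0, lam2=1.0, eps=1e-8):
    """ACWE-style energy of a soft mask (values in [0, 1]) for an image.

    Assumed sketch: `mu` weights the length term, `lam1` and `lam2`
    weight the inside and outside region terms.
    """
    image = np.asarray(image, dtype=float)
    mask = np.asarray(mask, dtype=float)
    inside, outside = mask, 1.0 - mask

    # Mask-weighted mean intensities inside and outside the contour.
    c1 = (image * inside).sum() / (inside.sum() + eps)
    c2 = (image * outside).sum() / (outside.sum() + eps)

    # Length term: total variation from forward differences,
    # cropped so both difference maps share the same shape.
    dy = np.diff(mask, axis=0)[:, :-1]
    dx = np.diff(mask, axis=1)[:-1, :]
    length = np.sqrt(dx ** 2 + dy ** 2 + eps).sum()

    # Region terms: squared deviation from the two mean intensities.
    region = (lam1 * (inside * (image - c1) ** 2).sum()
              + lam2 * (outside * (image - c2) ** 2).sum())
    return mu * length + region

# Toy image: a bright 2x2 square on a dark background.
image = np.zeros((6, 6))
image[2:4, 2:4] = 1.0
good_mask = image.copy()          # matches the bright square
bad_mask = np.zeros((6, 6))
bad_mask[0:2, 0:2] = 1.0          # covers only background
print(acwe_loss(image, good_mask, mu=0.0), acwe_loss(image, bad_mask, mu=0.0))
```

With the length weight set to 0, the mask that matches the bright square has a region energy close to zero, while the misplaced mask is penalized because the outside average no longer describes a homogeneous region.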
When segmentation labels are available for training, an optional supervised loss can be incorporated to further improve the accuracy, as shown in Fig. 1. This loss has a form similar to the ACWE loss, but it is evaluated between the predicted and ground truth segmentation labels over the image domain [ChenACWE2019], where u represents the ground truth labels. A weighting parameter was used to combine the two losses. Its value, and that of the remaining weighting parameter in the ACWE loss, were set based on empirical experiments.
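One way to picture a supervised term of this kind, and its weighted combination with the unsupervised term, is sketched below. The exact form used in the paper is not recoverable from the text, so this is only an assumed instance: the region terms of an ACWE-style energy with the inside and outside averages fixed to 1 and 0, evaluated on the labels u. The names `label_loss`, `combined_loss`, and `alpha`, and the choice to place the weight on the supervised term, are hypothetical.

```python
import numpy as np

def label_loss(pred, labels):
    """Region-style loss between a soft prediction and binary labels.

    Assumed form: ACWE region terms with the inside average fixed to 1
    and the outside average fixed to 0.
    """
    pred = np.asarray(pred, dtype=float)
    labels = np.asarray(labels, dtype=float)
    inside = (pred * (labels - 1.0) ** 2).sum()    # predicted foreground on background labels
    outside = ((1.0 - pred) * labels ** 2).sum()   # predicted background on foreground labels
    return inside + outside

def combined_loss(unsupervised, supervised, alpha):
    """Weighted sum of the two loss values; `alpha` is a hypothetical weight."""
    return unsupervised + alpha * supervised

labels = np.array([[0, 1],
                   [1, 1]])
perfect = labels.copy()
inverted = 1 - labels
print(label_loss(perfect, labels), label_loss(inverted, labels))
```

In this sketch a prediction that matches the labels incurs no supervised penalty, and every mislabeled pixel adds one unit, so the supervised term only affects slices for which labels exist.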
3 Results & Conclusions
We tested the proposed model on the task of bone segmentation using highly realistic simulated SPECT images. The object was generated based on the realistic XCAT phantom [Segars2010]. We used an analytic projection algorithm that realistically models attenuation, scatter, and collimator-detector response [Frey1993, Kadrmas1996]. Tomographic image reconstruction was done using the ordered subsets expectation-maximization algorithm (OS-EM) with 2 and 5 iterations. A total of 140 3D volumes were simulated from 50 noise realizations. We sampled 8000 2D slices from those volumes for training the ConvNet and 4000 slices for testing and evaluation. We evaluated four settings of the proposed algorithm:
1. Unsupervised (self-supervised) training with the ACWE loss.
2. Setting 1 + fine-tuning with 10 ground truth (GT) labels.
3. Setting 1 + fine-tuning with 80 GT labels.
4. Training with the combined ACWE and supervised losses.
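Settings such as these are compared with the Dice coefficient (DSC), defined for two binary masks A and B as 2|A ∩ B| / (|A| + |B|). The following is a minimal NumPy sketch of that definition; the function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denominator = pred.sum() + target.sum()
    if denominator == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / denominator

pred = np.array([[1, 1],
                 [0, 0]])
target = np.array([[1, 0],
                   [0, 0]])
print(dice_coefficient(pred, target))
```

The score ranges from 0 (no overlap) to 1 (identical masks), so higher values indicate better agreement with the ground truth.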
Although segmenting bone from SPECT images is challenging, the proposed algorithm performed well. Dice coefficient (DSC) evaluations and qualitative results are shown in Table 1 and Fig. 2. As the results show, fine-tuning the pre-trained unsupervised model with only 80 GT labels leads to a significant improvement in performance.
In conclusion, we present an unsupervised/semi-supervised ConvNet-based model for image segmentation that can be trained with or without ground truth labels. The reported DSC values demonstrate the effectiveness of the proposed method.
This work was supported by a grant from the National Cancer Institute, U01-CA140204. The views expressed in written conference materials or publications and by speakers and moderators do not necessarily reflect the official policies of the NIH; nor does mention of trade names, commercial practices, or organizations imply endorsement by the U.S. Government.