Organ segmentation has wide applications in disease diagnosis, treatment planning, intervention, radiation therapy and other clinical workflows [gibson2018automatic]. Thanks to deep learning, remarkable progress has been achieved in organ segmentation tasks. However, existing deep models usually focus on a predefined set of organs given abundant annotations for training (e.g., heart, aorta, trachea, and esophagus in [trullo2019multiorgan]) but fail to generalize to unseen abdominal organs with only limited annotations (in the extreme case, only one annotation is available), which limits their clinical usage in practice.
A potential solution to this problem is the one-shot segmentation methodology [zhao2019data, wang2020lt], which attempts to learn knowledge from only one labeled sample. Nonetheless, these approaches lack the ability to handle large variations among different organ types and thus cannot be directly applied to one-shot organ segmentation. On the other hand, we find that human radiologists naturally maintain the ability to effectively learn unfamiliar organ concepts given limited annotated data, which we think can be attributed to their use of anatomical similarity to segment both seen and unseen organs. Meanwhile, it is also a reasonable setting to transfer the learned knowledge from richly annotated organs to less annotated ones. Inspired by these observations, we propose a new one-shot segmentation paradigm where we assume that a generalized organ concept can be learned from a set of sufficiently annotated organs and applied for effective one-shot learning to segment some previously unseen abdominal organs. We illustrate the proposed method in Fig. 1, where we call it one-shot reasoning as it imitates the reasoning process of human radiologists by making use of anatomical similarity.
Anatomical similarity has been widely used in medical image segmentation [dinsdale2019spatial, liang2019comparenet]. Compared to these methods, our work mainly exploits anatomical similarity within each one-shot pair of images to perform reasoning. In this sense, our work provides a new segmentation paradigm by utilizing learned organ priors, with a focus on the one-shot scenario. In this paper, we propose OrganNet to implement the concept of generalized organ learning for one-shot organ segmentation in medical images. Our contributions can be summarized as follows:
We propose a new organ segmentation paradigm which learns a generalized organ concept from seen organ classes and then generalizes to unseen classes using one-shot pairs as supervision.
A reasoning module is developed to exploit the anatomical correlation between adjacent spatial regions of anchor and target computerized tomography (CT) volumes, which can be utilized to enhance the representations of the anchor volume and its segmentation annotation.
We introduce OrganNet, which adds two encoders to the basic 3D U-Net architecture to jointly learn representations from the target volume, the anchor volume and its corresponding segmentation mask.
We conduct comprehensive experiments to evaluate OrganNet. The experimental results on both organ and non-organ segmentation tasks demonstrate the effectiveness of OrganNet.
2 Related Work
Utilizing the anatomical correlation is one of the key designs of the proposed OrganNet. In this section, we first review existing works related to the utilization of anatomical priors and then list the most related works in one-shot medical segmentation.
2.0.1 Anatomical correlation in medical image segmentation.
A large body of literature [bentaieb2016topology, ravishankar2017joint, ravishankar2017learning] exploits anatomical correlation for medical image segmentation within the deep learning framework. The anatomical correlation also serves as the foundation of atlas-based segmentation [iglesias2015multi, liang2019comparenet], where one or several labeled reference images (i.e., atlases) are non-rigidly registered to a target image based on the anatomical similarity, and the labels of the atlases are propagated to the target image as the segmentation output. Different from these methods, OrganNet has the ability to learn anatomical similarity between images by employing the reasoning process. With this design, the proposed approach is able to learn a generalized organ concept for one-shot organ segmentation.
2.0.2 One-shot medical segmentation.
Zhao et al. [zhao2019data] presented an automated data augmentation method (DataAug) for synthesizing labeled medical images on magnetic resonance imaging (MRI) brain scans. However, DataAug is restricted to segmenting objects when only small changes exist between anchor and target images. Based on DataAug, Dalca et al. [dalca2019unsupervised] further extended it to an alternative strategy that combines a conventional probabilistic atlas-based segmentation with deep learning. Roy et al. [roy2020squeeze] proposed a few-shot volumetric segmenter by optimally pairing a few slices of the support volume to all the slices of the query volume. Similar to DataAug, Wang et al. [wang2020lt] introduced cycle consistency to learn reversible voxel-wise correspondences for one-shot medical image segmentation. Lu et al. [lu2020learning] proposed a one-shot anatomy segmentor which is based on a naturally built-in human-in-the-loop mechanism. Different from these approaches, this paper focuses on a more realistic setting: we use richly annotated organs to assist the segmentation of less annotated ones. Also, our method is partially related to siamese learning [zhou2020comparing, zhou2017sunrise, bertinetto2016fully, zhou2018semi], which often takes a pair of images as inputs.
3 Proposed Method
In this section, we introduce OrganNet, which is able to perform one-shot reasoning by exploiting anatomical correlation within input pairs. Particularly, we propose to explore such correlation from multiple scales where we use different sizes of neighbour regions as shown in Fig. 2. More details will be presented in the following.
Before we send anchor and target images (in practice, CT volumes) to OrganNet, registration should be conducted in order to align their image spaces. Since most of the data come from the abdomen, we apply DEEDS (DEnsE Displacement Sampling) [heinrich2013mrf], as it yielded the best performance on most CT-based datasets [xu2016evaluation]. Note that the organ mask is also aligned according to its corresponding anchor image.
3.1 One-shot Reasoning using OrganNet
As shown in Fig. 2, OrganNet has three encoders which learn representations for the target volume, the anchor volume and the anchor mask, respectively. Moreover, we propose pyramid reasoning modules (PRMs) to connect the different encoders, which decouple OrganNet into two parts: one part for learning representations from the anchor volume and its paired organ mask (top two branches) and the other for exploiting anatomical similarity between the anchor and the target volumes (bottom two branches). The motivation behind this design can be summarized as follows: the generalized organ concept can be learned from a set of sufficiently annotated organs, and then generalized to previously unseen abdominal organs by utilizing only a single one-shot pair.
OrganNet is built upon the classic 3D U-Net structure [cciccek20163d], and we extend it to a tri-encoder version to include extra supervision from the anchor image and its annotation. In practice, we design OrganNet to be light-weight in order to alleviate the overfitting problem caused by small datasets in medical image analysis. Specifically, for each layer, we only employ two bottlenecks for both encoders and only one bottleneck for the decoder branch. Since all operations are in 3D, we use a relatively small number of channels for each convolutional kernel to reduce computational cost.
Imitation is a powerful cognitive mechanism that humans use to make inferences and learn new abstractions [gentner1997reasoning]. We propose to model the anatomical similarity between images to provide strong knowledge prior for learning one-shot segmentation. However, the variations in organ location, shape, size, and appearance among individuals can be regarded as an obstacle for a network to make reasonable decisions. To address these challenges, we propose PRMs to effectively encapsulate representations from multiple encoders.
3.2 Pyramid Reasoning Modules
These modules are designed to address the situation in which the organ morphologies and structures of target and anchor volumes show different levels of variations. Since features in different feature pyramids capture multi-scale information, we propose to aggregate information at each pyramid level with a reasoning function. To account for the displacement between the two images, large sizes of neighbour regions are employed in shallow layers, whereas small region sizes are employed in deep layers. The underlying reason of such allocation is that the receptive fields of shallow layers are smaller than those of deep ones. Concretely, we first compute the correlation matrix between feature maps of target and anchor volumes. Then, we apply this matrix to transform feature representations of anchor input and its segmentation mask, respectively. Finally, we concatenate representations of three inputs and treat them as the input to next layer.
Figure 5: The reasoning module. In (a), the module takes the current layer's input tensors for the target volume, anchor volume and anchor mask, respectively. Softmax is applied to the tensor's first dimension, and a sum operation is also conducted after the Hadamard product. In (b), we provide a simplified 2D illustration of the reasoning module.
As shown in Fig. 5, at a given layer, each reasoning module has three input tensors, corresponding to the target volume, the anchor volume and the anchor mask, respectively. We first apply three convolutional operators to these three input tensors in order to normalize them to the same scale, so that the three outputs have the same size.
To model the anatomical correlation between the normalized target and anchor features, we apply an inner product operation to neighbour regions from both tensors. Particularly, given a specific vector at one spatial position of the target feature tensor, we compute its inner product with the vectors in the corresponding neighbour region of the anchor feature tensor.
Here, the size of the neighbour region changes with layer depth, and the positions inside a neighbour region are addressed by a single total index. For better understanding, we provide an illustration of the computation process of anatomical correlation in a 2D network, which can be found in the supplementary material.
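The resulting correlation tensor represents the anatomical similarity between the target and anchor features. Then, we apply softmax normalization along its first dimension and expand its dimensions so that it matches the size of the anchor representations it will be multiplied with.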
Considering the efficiency of matrix multiplication, we introduce an im2col operation to convert the anchor volume and anchor mask features to tensors of the same size as the expanded correlation tensor. In this way, we can multiply them with the normalized correlation tensor using the Hadamard product and apply summation to aggregate the contribution of adjacent filters.
Here, the sum is applied to the first dimension. Finally, we concatenate the three outputs to form the next layer's input.
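The computation described in this subsection can be written compactly as follows. This is only a sketch in assumed notation, since the symbols used here are not taken from the source: $F_t$, $F_a$ and $F_m$ denote the normalized features of the target volume, anchor volume and anchor mask, $p = (x, y, z)$ is a spatial position, $k$ is the neighbour size at the current layer, and $\delta_j$ is the $j$-th offset inside the $k \times k \times k$ neighbour region.

```latex
% Assumed notation (hypothetical, for illustration only):
%   F_t, F_a, F_m : normalized features of target volume, anchor volume, anchor mask
%   p = (x, y, z) : spatial position;  k : neighbour size at this layer
%   \delta_j      : j-th offset in the k x k x k neighbour region
\begin{align}
S_j(p) &= \big\langle F_t(p),\, F_a(p + \delta_j) \big\rangle,
  \qquad j = 1, \dots, k^3, \\
A_j(p) &= \frac{\exp S_j(p)}{\sum_{j'=1}^{k^3} \exp S_{j'}(p)}, \\
\hat{F}_a(p) &= \sum_{j=1}^{k^3} A_j(p)\, F_a(p + \delta_j), \qquad
\hat{F}_m(p) = \sum_{j=1}^{k^3} A_j(p)\, F_m(p + \delta_j), \\
I^{\mathrm{next}} &= \mathrm{concat}\big(F_t,\ \hat{F}_a,\ \hat{F}_m\big).
\end{align}
```

Under this reading, the Hadamard product followed by a sum over the first dimension is the weighted sum over the $k^3$ neighbours in the third line, and the softmax in the second line makes the weights at each position sum to one.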
Generally speaking, the reasoning module learns how to align the semantic representations of anchor and target volumes from seen organs. During the inference stage, the learned rule can be well applied to unseen classes using one-shot pair as supervision signals.
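As a rough illustration of one reasoning step, the NumPy sketch below works on 2D feature maps rather than 3D volumes. The function names `neighbourhood` and `reasoning_step`, the zero padding at the borders, the restriction to odd neighbour sizes, and the omission of the initial normalizing convolutions are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def neighbourhood(feat, k):
    """im2col-style gather: (C, H, W) -> (k*k, C, H, W).

    Entry j holds the feature map shifted by the j-th offset of a k x k
    window (k odd). Borders are zero-padded, which is an assumption.
    """
    _, h, w = feat.shape
    r = k // 2
    padded = np.pad(feat, ((0, 0), (r, r), (r, r)))
    shifts = [padded[:, dy:dy + h, dx:dx + w]
              for dy in range(k) for dx in range(k)]
    return np.stack(shifts, axis=0)


def reasoning_step(f_t, f_a, f_m, k):
    """Hypothetical 2D reasoning step on target, anchor and mask features."""
    nb_a = neighbourhood(f_a, k)                 # (k*k, C, H, W)
    nb_m = neighbourhood(f_m, k)                 # (k*k, C, H, W)

    # Inner product over channels between each target vector and the
    # anchor vectors in its neighbour region.
    corr = (nb_a * f_t[None]).sum(axis=1)        # (k*k, H, W)

    # Softmax along the first (neighbour) dimension, computed stably.
    corr = corr - corr.max(axis=0, keepdims=True)
    weights = np.exp(corr)
    weights /= weights.sum(axis=0, keepdims=True)
    weights = weights[:, None]                   # (k*k, 1, H, W)

    # Hadamard product, then sum over the first dimension.
    a_out = (weights * nb_a).sum(axis=0)         # (C, H, W)
    m_out = (weights * nb_m).sum(axis=0)         # (C, H, W)

    # Concatenate the three outputs along channels for the next layer.
    return np.concatenate([f_t, a_out, m_out], axis=0)


rng = np.random.default_rng(0)
f_t, f_a, f_m = (rng.standard_normal((4, 5, 6)) for _ in range(3))
out = reasoning_step(f_t, f_a, f_m, k=3)         # shape (12, 5, 6)
```

With `k=1` the softmax has a single entry per position, so the step reduces to a plain concatenation of the three inputs; larger `k` lets each target position draw on a wider anchor region.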
3.2.1 Training and Inference
During the training stage, we first build a pool of annotated images for each organ class. In each training iteration, we randomly pick anchor and target samples from the same class pool (thus it is a binary segmentation task). The anchor input is fed to the top two encoders together with its annotation. Meanwhile, the target image is passed to the bottom encoder after registration, and its annotation is used as the ground truth for training. In practice, we use image batches as inputs for training efficiency. Particularly, we manually compose each batch so that its samples carry different class annotations, which helps to produce better segmentation results in our experiments. We train OrganNet with a combination of Dice and cross entropy losses with equal weights. With the large training pool, the organ concept is learned under full supervision. In the inference phase, when a previously unseen abdominal organ needs to be segmented, only one anchor image and its annotation are needed. It is worth noting that, following [zhao2019data], we pick the example most similar to the anatomical average computed for each organ class as the anchor image during the inference stage.
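The batch composition described above can be sketched as follows. This is a minimal illustration under stated assumptions: the pool layout (a dict from class name to case IDs), the placeholder class and case names, the function name `sample_batch`, and the choice to draw anchor and target as two distinct cases are all hypothetical and not specified by the paper.

```python
import random


def sample_batch(pools, batch_size, rng):
    """Draw one (class, anchor, target) triple per class for a training batch.

    Each batch uses `batch_size` distinct organ classes, and the anchor and
    target of a triple come from the same class pool.
    """
    classes = rng.sample(sorted(pools), batch_size)
    batch = []
    for organ in classes:
        # Assumption: anchor and target are two different cases.
        anchor, target = rng.sample(pools[organ], 2)
        batch.append((organ, anchor, target))
    return batch


# Hypothetical pools: 9 seen classes with 10 annotated cases each.
pools = {
    f"seen_organ_{i}": [f"case_{i}_{j}" for j in range(10)]
    for i in range(1, 10)
}
batch = sample_batch(pools, batch_size=8, rng=random.Random(0))
```

Each triple defines one binary segmentation task: the anchor case and its mask feed the top two encoders, and the target case feeds the bottom encoder.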
4 Experiments and Results
In this section, we conduct experiments together with ablation studies to demonstrate the strength of OrganNet. First, we briefly introduce the dataset and evaluation metric used for experiments. Then, we present the implementation details of OrganNet and display the experimental results under various settings.
4.1 Dataset and Evaluation Metric
We evaluate our method on 90 abdominal CT images collected from two publicly available datasets: 43 subjects from The Cancer Image Archive (TCIA) Pancreas CT dataset [clark2013cancer] and 47 subjects from the Beyond the Cranial Vault (BTCV) Abdomen dataset [landman2015miccai], with segmentations of 14 classes which include spleen, left/right kidneys, gallbladder, esophagus, liver, stomach, aorta, inferior vena cava, portal vein and splenic vein, pancreas, and left/right adrenal glands (https://zenodo.org/record/1169361). In practice, we test the effectiveness of OrganNet on 5 kinds of unseen abdominal organs (spleen, right kidney, aorta, pancreas and stomach), which present great challenges because of their variations in size and shape, and use the remaining 9 organs for training. We employ the Dice coefficient as our evaluation metric.
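For reference, the Dice coefficient between a predicted mask P and a ground-truth mask T is 2|P ∩ T| / (|P| + |T|). The short sketch below computes it for flattened binary masks; the function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary (0/1) masks."""
    pred, target = list(pred), list(target)
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# One overlapping voxel, two predicted and one true: 2 * 1 / (2 + 1).
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

The tables in this paper report the score as a percentage, i.e. the value above multiplied by 100.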
4.2 Implementation Details
We build OrganNet based on 3D U-Net. To be specific, each encoder branch and the decoder (cf. Fig. 2) share the same architecture as those of 3D U-Net. The initial channel number of the encoder is 8, which is doubled each time the feature maps are downsampled. Moreover, as mentioned in Sec. 3.1 and Fig. 2, we use three different neighbour sizes in PRMs, decreasing from shallow layers to deep layers. Following [zhao2019data], for each organ class, we pick from the test set the example most similar to the anatomical average computed for that class, and treat it as the anchor image during the inference stage. Moreover, we repeat each experiment three times and report the standard deviation. We conduct all experiments with the PyTorch framework [paszke2019pytorch]. We use the Adam optimizer [kingma2014adam] and train our model for fifty epochs. The batch size is 8 (one per GPU) with 8 NVIDIA GTX 1080 Ti GPUs. The cosine annealing technique [loshchilov2016sgdr] is adopted to adjust the learning rate, and weight decay is applied. For organs used to train OrganNet, we use 80% of the data for training, and the remaining 20% are used for validation. For organs used to test OrganNet, we randomly select 20% of the data for evaluation. The other 80% are used to train a fully-supervised 3D U-Net.
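To make the learning-rate schedule concrete, the sketch below evaluates a single cosine annealing cycle over the fifty training epochs. The start and end learning rates (`LR_MAX`, `LR_MIN`) are placeholder values chosen for illustration, not the values used in the paper, and the absence of warm restarts is also an assumption.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min):
    """Learning rate at `epoch` for one cosine cycle without restarts."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


# Hypothetical endpoints; the paper's actual values are not given here.
LR_MAX, LR_MIN = 1e-3, 1e-6
TOTAL_EPOCHS = 50  # matches the fifty training epochs stated above

schedule = [
    cosine_annealing_lr(epoch, TOTAL_EPOCHS, LR_MAX, LR_MIN)
    for epoch in range(TOTAL_EPOCHS + 1)
]
```

The rate starts at `LR_MAX`, passes through the midpoint of the two endpoints halfway through training, and reaches `LR_MIN` at the final epoch.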
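Table 1 (recovered rows): Dice (%) on the 5 unseen organs, one column per organ, with the mean in the last column.
|Method|Unseen organ 1|Unseen organ 2|Unseen organ 3|Unseen organ 4|Unseen organ 5|Mean|
|Squeeze & Excitation|70.6|75.2|47.7|52.3|57.5|60.7|
|Squeeze & Excitation*|79.2|80.7|67.4|67.8|78.1|74.6|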
4.3 Comparison with One-shot Segmentation Methods: Better Performance
We compare our OrganNet against DataAug [zhao2019data], Squeeze & Excitation [roy2020squeeze] and LT-Net [wang2020lt]. We do not compare with CTN [lu2020learning] as it is based on human intervention. As we have mentioned above, a prerequisite of using atlas-based methods is that the differences between anchor and target inputs should be small enough to learn an appropriate transformation matrix. To enable DataAug and LT-Net to segment all 5 test organs, we propose two settings. The first setting is to use the original implementations, which are based on only one annotated sample for each class. The second setting is to pretrain DataAug and LT-Net using the 9 seen organ classes and then retrain 5 independent models using a number of samples of each unseen class (denoted as * in Table 1). In contrast, our OrganNet only needs one network to be trained once to segment all 5 organs. We report the results in Table 1, from which we observe that our OrganNet outperforms the naive DataAug, Squeeze & Excitation and LT-Net (without "*") by a significant margin. Even after these three models are pretrained using the other 9 classes' data, OrganNet is still able to surpass them by at least 3.8 percent. We believe the poor performance of DataAug and LT-Net may be explained by the fact that explicit transformation functions are difficult to learn in abdominal CT images, where large displacements usually happen. For Squeeze & Excitation, we believe it may achieve better performance after adding its human-in-the-loop mechanism.
4.4 Comparison with Supervised 3D U-Nets: Less Labeling Cost
Lastly, we compare our OrganNet with 3D U-Net [cciccek20163d], which is trained in a supervised manner. For the sake of fairness, we first pretrain 3D U-Net on the 9 seen organs which are used to train OrganNet, and then fine-tune the 3D U-Net for each unseen class using different amounts of labeled samples. Results are displayed in Table 2.
One obvious observation is that using one training sample for unseen classes is far from enough for the supervised baseline, because the supervised training process may lead to severe overfitting. Even if we increase the labeled samples to 30%, our OrganNet can still achieve competitive results compared with the supervised baseline. It is worth noting that annotating CT volumes is intensive work which may take several well-trained radiologists several days. Thus, our OrganNet can greatly reduce the annotation cost. Finally, we offer a fully-supervised model which utilizes all training samples of unseen abdominal organs.
4.5 Ablation Study on Network Design
4.5.1 Number of reasoning modules and different sizes of neighbour regions.
In Fig. 6, we display the experimental results of using different numbers of PRMs and sizes of neighbour regions. We can find that adding more reasoning modules consistently improves the model results, and that different sizes of neighbour regions behave similarly in this respect. If we compare the results of using different sizes of neighbour regions, it is easy to find that the proposed pyramid strategy works the best (red curve), even surpassing a larger fixed neighbour size which considers more adjacent regions. Such comparison verifies our hypothesis that deep layers require smaller neighbour sizes because their receptive fields are much larger, while large kernel sizes may import additional noise. Interestingly, we can find that increasing the number of PRMs from 2 to 3 does not bring obvious improvements, suggesting that adding more PRMs may not benefit a lot.
4.5.2 Shared encoders or not?
Since the encoder of a normal 3D U-Net is heavy, we study if it is possible to perform weight sharing across the three different encoders. In Table 6, we report the experimental results of using different weight sharing strategies. It is obvious that sharing weights across all three encoders performs the worst. This phenomenon implies that different inputs may need different encoders. When making the bottom encoder (1) and the middle encoder (2) share weights, the average performance is slightly improved by approximately 0.5 percent. Somewhat surprisingly, building independent encoders (1+2+3) helps to improve the overall performance a lot, showing that learning specific features for each input is workable. Finally, when the top encoder shares the same weights with the middle encoder, our OrganNet is able to achieve the best average performance. Such results suggest that the top two encoders may complement each other.
4.6 Visual Analysis
In this part, we conduct visual analyses on the proposed OrganNet. Firstly, in Fig. 7, we visualize the most related regions learned by OrganNet. We can find that, given a specific position in the target slice, OrganNet can automatically discover the most related regions in the anchor image based on computed anatomical similarity scores. In practice, OrganNet incorporates segmentation labels from these regions and produces a final prediction for the target region. From Fig. 8, we can see that OrganNet is able to produce results comparable with those of a 3D U-Net trained with 100% labeled data.
4.7 Generalization to Non-organ Segmentation
|Kidney Tumor Dice (%)|
To better demonstrate the effectiveness of OrganNet, we conduct experiments on LiTS [bilic2019liver] and KiTS [heller2019kits19], where we use LiTS to train OrganNet and test it on KiTS with only one labeled sample (following the one-shot segmentation scenario). From Table 3, we can see that OrganNet still maintains its advantages over 3D U-Net trained with 30% labeled data, demonstrating the generalization ability of OrganNet to non-organ segmentation problems.
5 Conclusions and Future Work
In this paper, we present a novel one-shot medical segmentation approach that enables us to learn a generalized organ concept by employing a reasoning process between anchor and target CT volumes. The proposed OrganNet models the key component, i.e., the anatomical similarity between images, within a single deep learning framework. Extensive experiments demonstrate the effectiveness of the proposed OrganNet.