PAN: Projective Adversarial Network for Medical Image Segmentation

06/11/2019 ∙ by Naji Khosravan, et al.

Adversarial learning has been proven to be effective for capturing long-range and high-level label consistencies in semantic segmentation. Unique to medical imaging, capturing 3D semantics in an effective yet computationally efficient way remains an open problem. In this study, we address this computational burden by proposing a novel projective adversarial network, called PAN, which incorporates high-level 3D information through 2D projections. Furthermore, we introduce an attention module into our framework that enables the selective integration of global information directly from our segmentor into our adversarial network. As the clinical application, we chose pancreas segmentation from CT scans. Our proposed framework achieved state-of-the-art performance without adding to the complexity of the segmentor.




1 Introduction

Segmentation has been a major area of interest within the fields of computer vision and medical imaging for years. Owing to their success, deep learning based algorithms have become the standard choice for semantic segmentation in the literature. Most state-of-the-art studies model segmentation as a pixel-level classification problem [2, 3, 4]. Pixel-level loss is a promising direction, but it fails to incorporate global semantics and relations. To address this issue, researchers have proposed a variety of strategies. A great deal of previous research uses a post-processing step to capture pairwise or higher-level relations. A Conditional Random Field (CRF) was used in [2] as an offline post-processing step to refine object edges and remove false positives in the CNN output. In other studies, to avoid offline post-processing and provide an end-to-end framework for segmentation, mean-field approximate inference for a CRF with Gaussian pairwise potentials was modeled as a Recurrent Neural Network (RNN).


In parallel to post-processing attempts, another branch of research tried to capture this global context through multi-scale or pyramid frameworks. In [2, 3, 4], spatial pyramid pooling at several scales, with both conventional convolution layers and atrous convolution layers, was used to keep both contextual and pixel-level information. Despite such efforts, combining local and global information in an optimal manner is not yet a solved problem.

Following the seminal work of Goodfellow et al. [7], a great deal of research has been done on adversarial learning [15, 8, 14, 10]. Specific to segmentation, Luc et al. [8] were the first to propose the use of a discriminator along with a segmentor in an adversarial min-max game to capture long-range label consistencies. In another study, SegAN was introduced, in which the segmentor plays the role of the generator in a min-max game with a discriminator that uses a multi-scale L1 loss [14]. A similar approach was taken for structure correction in chest X-ray segmentation in [5]. A conditional GAN approach was taken in [10] for brain tumor segmentation.

In this paper, we focused on the challenging problem of pancreas segmentation from CT images, although our framework is generic and can be applied to any 3D object segmentation problem. This particular application has unique challenges due to the complex shape and orientation of the pancreas, its low contrast with neighbouring tissues, and its relatively small and varying size. Pancreas segmentation has been studied widely in the literature. Yu et al. introduced a recurrent saliency transformation network, which uses the information from the previous iteration as a spatial weight for the current iteration [16]. In another attempt, a U-Net with an attention gate was proposed in [9]. Similarly, a two-stage cascaded method was used to localize and segment the pancreas from CT scans in [13]. A prediction-segmentation mask was used in [18] for constraining the segmentation with a coarse-to-fine strategy. Furthermore, a segmentation network with an RNN was proposed in [1] to capture the spatial information among slices. The unique challenges of pancreas segmentation (complex shape and small organ) shifted the literature towards coarse-to-fine and multi-stage frameworks, which are promising but computationally expensive.

Summary of our contributions: The current literature on segmentation fails to capture 3D high-level shape and semantics with a low-computation and effective framework. In this paper, for the first time in the literature, we propose a projective adversarial network (PAN) for segmentation to fill this research gap. Our method is able to capture 3D relations through 2D projections of objects, without relying on 3D images or adding to the complexity of the segmentor. Furthermore, we introduce an attention module to selectively integrate high-level, whole-image features from the segmentor into our adversarial network. With comprehensive evaluations, we show that our proposed framework achieves state-of-the-art performance on a publicly available CT pancreas segmentation dataset [11], even when a simple encoder-decoder network is used as the segmentor.

2 Method

Our proposed method is built upon adversarial networks. An overview of the proposed framework is illustrated in Figure 1. We have three networks: a segmentor (S in Figure 1), which is our main network and the only one used during the test phase, and two adversarial networks, each with a specific task. The first adversarial network, the spatial semantics network, captures high-level spatial label contiguity, while the second, the projective network, enforces 3D semantics through a 2D projection learning strategy. The adversarial networks were used only during the training phase to boost the performance of the segmentor without adding to its complexity.

Figure 1: The proposed framework consists of a segmentor (S) and two adversarial networks. The segmentor was trained with a hybrid loss from the two adversarial networks and the ground truth.

2.1 Segmentor (S)

Our base network is a simple fully convolutional network with an encoder-decoder architecture. The input to the segmentor is a 2D grey-scale image and the output is a pixel-level probability map, which gives the probability that the object is present at each pixel. We use a hybrid loss function (explained in detail in Section 2.3) to update the parameters of our segmentor (S). This loss function is composed of three terms enforcing: (1) pixel-level high-resolution details, (2) spatial and long-range label continuity, (3) 3D shape and semantics, through our novel projective learning strategy.

As can be seen in Figure 1, the segmentor consists of conv layers in the encoder, conv layers in the decoder, and conv layers in the bottleneck. The last conv layer combines channel-wise information at the highest scale. This layer is followed by a sigmoid function so that the output can be read as a probability.

2.2 Adversarial Networks

Our adversarial networks are designed to compensate for the missing global relations and to correct the higher-order inconsistencies produced by a single pixel-level loss. Each of these networks produces an adversarial signal and applies it to the segmentor as a term in the overall loss function (Equation 2). The details of each network are described below:

Spatial semantics network: This network is designed to capture spatial consistencies within each frame. Its input is the object segmented either by the ground truth or by the segmentor's prediction. The spatial semantics network is trained to discriminate between these two inputs with a binary cross-entropy loss, formulated in Equation 4. The adversarial signal, produced by feeding the negative of this network's loss to S, forces S to produce predictions closer to the ground truth in terms of spatial semantics.

As illustrated in the top right of Figure 1, the spatial semantics network has a two-branch architecture with late fusion. The top branch processes the object segmented by the ground truth or by the segmentor's prediction. We propose an extra processing branch that takes the bottleneck features corresponding to the original grey-scale input image and passes them to an attention module for information selection. The processed features are then concatenated with the output of the first branch and passed through the shared layers. We believe that having high-level features of the whole image along with the segmentations improves the performance of this network.

Our attention module learns where to attend in the feature space to have a more discriminative information selection and processing. The details of the attention module are described in the following.

Figure 2: Attention module assigns a weight to each feature allowing for a soft selection of information.

Attention module (A): We feed the high-level features from the segmentor's bottleneck to A. These features contain global information about the whole frame. We use a soft-attention mechanism, in which our attention module assigns a weight to each feature based on its importance for discrimination. The attention module takes the features as input and outputs a set of weights, one per feature. A is composed of two convolution layers followed by a softmax layer (Figure 2). The softmax layer introduces the notion of soft selection to this module. The output of A is then multiplied with the features before they are passed to the rest of the network.
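The soft selection described above can be illustrated with a minimal sketch. It assumes the two convolution layers have already produced one raw score per feature; the names `softmax`, `soft_attention`, `features`, and `scores` are hypothetical and not taken from the paper, and the flat list stands in for a real feature tensor.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, scores):
    """Weight each feature by its softmax-normalized score.

    `scores` stands in for the output of the attention module's
    convolution layers (an assumption made for this sketch).
    """
    weights = softmax(scores)
    attended = [f * w for f, w in zip(features, weights)]
    return attended, weights


# Toy bottleneck features and the scores assigned to them.
features = [0.5, -1.0, 2.0, 0.0]
scores = [1.0, 0.0, 3.0, 0.0]
attended, weights = soft_attention(features, scores)
```

Because the weights sum to one, no feature is discarded outright; features with low scores are attenuated rather than removed, which is what makes the selection soft.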

Projective network: Any 3D object can be projected onto 2D planes from specific viewpoints, resulting in multiple 2D images. These 2D projections contain 3D semantic information that can be retrieved. In this section, we introduce our projective network, whose main task is to capture 3D semantics from 2D projections without relying on 3D data. Inducing 3D shapes from 2D images has previously been done for 3D shape generation [6]. Unlike existing work, however, in this paper we propose 3D semantics induction from 2D projections to benefit segmentation, for the first time in the literature.

The projection module projects a 3D volume (V) onto a 2D plane, where each pixel in the 2D projection gets a value based on the number of occupied voxels along the third dimension of the corresponding volume. We pass each 3D image through our segmentor (S) slice by slice and stack the corresponding prediction maps. These maps are then fed to the projection module and projected in the axial view.
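As a rough illustration of the projection step, the sketch below accumulates occupancy along the slice axis and squashes it into a bounded range. The squashing function `1 - exp(-sum)` is an assumption borrowed from projection-based shape generation work; the text above only states that each pixel depends on the voxel occupancy along the third dimension. The function name `project_axial` and the nested-list layout are hypothetical.

```python
import math


def project_axial(volume):
    """Project a stack of 2D maps onto a single 2D plane.

    volume[k][i][j] is the value at slice k, row i, column j.
    Each output pixel is 1 - exp(-sum over slices), an assumed
    squashing function that keeps values in [0, 1).
    """
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])
    projection = []
    for i in range(height):
        row = []
        for j in range(width):
            occupancy = sum(volume[k][i][j] for k in range(depth))
            row.append(1.0 - math.exp(-occupancy))
        projection.append(row)
    return projection


# Three stacked 2x2 prediction maps (toy values).
stacked_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.0]],
]
projected = project_axial(stacked_maps)
```

In this toy volume the top-left pixel is occupied in all three slices and therefore receives the largest projected value, while the bottom row, empty in every slice, projects to zero.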

The input to the projective network is either the projected ground truth or the projected prediction map produced by S. The network is trained to discriminate between these inputs using the loss function defined in Equation 5. The adversarial term it contributes to Equation 2 forces S to create predictions that are closer to the ground truth in terms of 3D semantics. Incorporating the projective network as an adversarial network in our segmentation framework helps to capture 3D information through a very simple 2D architecture, without adding to the segmentor's complexity at test time.

2.3 Adversarial training

To train our framework, we use a hybrid loss function, which is a weighted sum of three terms defined over the training images and their ground truths (Equation 2). The first term is a weighted binary cross-entropy loss between the segmentor's prediction and the ground truth. The second and third terms are the losses corresponding to the spatial semantics network and the projective network, respectively.

The weighted binary cross-entropy loss (Equation 3) is the state-of-the-art loss function for semantic segmentation. For a grey-scale image, it is computed over all pixels from the ground-truth label and the network's prediction at each pixel, with a weight applied to positive samples. Equation 3 encourages S to produce predictions similar to the ground truth and penalizes each pixel independently. High-order relations and semantics cannot be captured by this term.
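A minimal sketch of a weighted binary cross-entropy of this kind is given below. It uses the common form in which the positive-class term is scaled by a weight and the loss is averaged over pixels; the averaging, the clamping constant `eps`, and the names `weighted_bce` and `pos_weight` are assumptions of this sketch rather than details taken from the paper.

```python
import math


def weighted_bce(pred, target, pos_weight, eps=1e-7):
    """Mean weighted binary cross-entropy over a 2D probability map.

    pred[i][j] is the predicted probability, target[i][j] is 0 or 1,
    and pos_weight scales the contribution of positive pixels.
    """
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, y in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count


# Toy 2x2 example: one positive pixel, three background pixels.
prediction = [[0.8, 0.1], [0.2, 0.1]]
ground_truth = [[1, 0], [0, 0]]
loss = weighted_bce(prediction, ground_truth, pos_weight=5.0)
```

Raising `pos_weight` increases the penalty for missing foreground pixels, which matters when the object of interest occupies only a small fraction of each slice. Each pixel still contributes independently, so the term carries no information about shape or long-range consistency.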

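To account for this drawback, the second and third terms are added through our auxiliary networks. The losses of the spatial semantics network and the projective network are defined in Equations 4 and 5, respectively. In Equation 5, the projection module is applied to the inputs of the projective network. Both losses use the binary cross-entropy of Equation 3, applied to a single number (0 or 1) as the output rather than to a per-pixel map.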
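The interplay between the adversarial losses and the segmentor's objective can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: the label convention (1 for ground-truth inputs, 0 for predictions), the subtraction of the adversarial losses, and the weights `w_spatial` and `w_projective` are all hypothetical choices made for the sketch.

```python
import math


def bce(output, label, eps=1e-7):
    """Binary cross-entropy for a single scalar output in (0, 1)."""
    output = min(max(output, eps), 1.0 - eps)
    return -(label * math.log(output) + (1 - label) * math.log(1.0 - output))


def discriminator_loss(score_real, score_fake):
    """Loss of one adversarial network.

    score_real: its output on a ground-truth input (target 1, assumed).
    score_fake: its output on a segmentor prediction (target 0, assumed).
    """
    return bce(score_real, 1) + bce(score_fake, 0)


def segmentor_loss(pixel_loss, loss_spatial, loss_projective,
                   w_spatial=0.1, w_projective=0.1):
    """Hybrid objective for the segmentor (weights are hypothetical).

    The adversarial losses enter with a negative sign, so the segmentor
    is rewarded when the adversarial networks fail to tell predictions
    from ground truth.
    """
    return pixel_loss - w_spatial * loss_spatial - w_projective * loss_projective


# Toy scores from the two adversarial networks.
loss_spatial = discriminator_loss(score_real=0.9, score_fake=0.2)
loss_projective = discriminator_loss(score_real=0.8, score_fake=0.3)
total = segmentor_loss(pixel_loss=0.4, loss_spatial=loss_spatial,
                       loss_projective=loss_projective)
```

In training, the adversarial networks would be updated to lower their own losses while the segmentor is updated to lower the hybrid objective, which alternates the two sides of the min-max game.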

3 Experiments and Results

We evaluated the efficacy of our proposed system on the challenging problem of pancreas segmentation. This particular problem was selected because of the complex and varying shape of the pancreas, which makes segmentation more difficult than for other abdominal organs. In our experiments, we show that our proposed framework outperforms other state-of-the-art methods and captures complex 3D semantics with a simple encoder-decoder. Furthermore, we provide an extensive comparison with baselines designed specifically to show the effect of each block of our framework.

Data and evaluation: We used the publicly available TCIA CT dataset from NIH [11]. Each scan is a 3D volume made up of slices in the axial plane. We randomly split the scans into training and test sets and performed 4-fold cross-validation. The Dice Similarity Coefficient (DSC) is used as the evaluation metric.
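For reference, the Dice Similarity Coefficient between a predicted and a ground-truth binary mask is twice the size of their overlap divided by the sum of their sizes. The sketch below computes it on flattened masks; the function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions of this sketch.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice Similarity Coefficient for two flat binary masks (0/1 values)."""
    intersection = 0
    pred_size = 0
    true_size = 0
    for p, t in zip(pred_mask, true_mask):
        intersection += p * t
        pred_size += p
        true_size += t
    if pred_size + true_size == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed)
    return 2.0 * intersection / (pred_size + true_size)


# Toy masks: the prediction covers the true pixel plus one false positive.
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
score = dice_coefficient(predicted, reference)
```

For a whole scan, the masks would be the thresholded predictions and labels of all slices flattened together, and the score is usually reported as a percentage.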

Comparison to baselines:

Model DSC%
Encoder-decoder (S) 57.7
Atrous pyramid 48.2


Table 1: Comparison with baselines.

To show the effect of each building block of our framework, we designed an extensive set of experiments. We start by training only a single segmentor (S) and build up to our final proposed framework. Furthermore, we compare the encoder-decoder architecture with other state-of-the-art semantic segmentation architectures.

Table 1 shows the results of adding each building block of our framework. The encoder-decoder architecture is the one shown in Figure 1 as S, while the atrous pyramid architecture is similar to the recent work of [4], which is currently state-of-the-art for semantic segmentation and uses an atrous pyramid to capture global context. We added an atrous pyramid with atrous convolutions at several rates, together with global image pooling. We also replaced the decoder with simple upsampling and conv layers, as in [4]. Due to space limitations, we refer readers to [4] for more details about this architecture. We found that extensive processing in the decoder improves the results compared with the atrous pyramid architecture, which is possibly a better choice for segmenting objects at multiple scales. This is because our object of interest is relatively small.

Moreover, we showed that adding a spatial adversarial network can boost the performance of S dramatically in our task. Introducing attention (A) enables better information selection (as described in Section 2.2) and boosts the performance further. Finally, our best results are achieved by adding the projective adversarial network, which integrates 3D semantics into the framework. This supports our hypothesis that our segmentor has enough capacity in terms of parameters to capture all this information and, with proper and explicit supervision, can achieve state-of-the-art results.

Comparison to the state-of-the-art: We compare our method's performance with the current state-of-the-art literature on the same TCIA CT dataset for pancreas segmentation. As can be seen from the experimental validation, our method outperforms the state of the art in terms of Dice score and provides better efficiency (less computational burden). Of note, the lowest score achieved by the proposed algorithm is consistently higher than those of the state-of-the-art methods.

Approach                          Average DSC%   Max DSC%   Min DSC%
Roth et al. [11]                                 86.29      23.99
Roth et al. [12]                                 88.65      34.11
Roth et al. [13]                                 88.96      50.69
Zhou et al. [18]                                 90.85      62.43
Cai et al. [1]                                   90.10      60.00
Yu et al. [16]                                   91.02      62.81
Ours (4-fold cross-validation)    85.53          88.71      83.20

Table 2: Comparison with state-of-the-art on TCIA dataset.

4 Conclusion

In this paper, we proposed a novel adversarial framework for 3D object segmentation. We introduced a novel projective adversarial network that infers 3D shape and semantics from 2D projections. The motivation behind our idea is that integrating 3D information through a fully 3D network, with all slices as input, is computationally infeasible. Possible workarounds are (1) down-sampling the data or (2) reducing the number of parameters, which sacrifice information or model capacity, respectively. We also introduced an attention module to selectively pass whole-frame high-level features from the segmentor's bottleneck to the adversarial network for better information processing. We showed that, with proper and guided supervision through adversarial signals, a simple encoder-decoder architecture with enough parameters achieves state-of-the-art performance on the challenging problem of pancreas segmentation. We achieved a Dice score of 85.53%, which is state-of-the-art performance on the pancreas segmentation task, outperforming previous methods. Furthermore, we argue that our framework is general and can be applied to any 3D object segmentation problem, not only to a single application.