1 Introduction
Segmentation has been a major area of interest within the fields of computer vision and medical imaging for years. Owing to their success, deep learning based algorithms have become the standard choice for semantic segmentation in the literature. Most state-of-the-art studies model segmentation as a pixel-level classification problem [2, 3, 4]. Pixel-level loss is a promising direction, but it fails to incorporate global semantics and relations. To address this issue, researchers have proposed a variety of strategies. A great deal of previous research uses a post-processing step to capture pairwise or higher-level relations. A Conditional Random Field (CRF) was used in [2] as an offline post-processing step to refine object edges and remove false positives in the CNN output. In other studies, to avoid offline post-processing and provide an end-to-end framework for segmentation, mean-field approximate inference for a CRF with Gaussian pairwise potentials was modeled through a Recurrent Neural Network (RNN) [17]. In parallel to post-processing attempts, another branch of research tried to capture this global context through multi-scale or pyramid frameworks. In [2, 3, 4], spatial pyramid pooling at several scales, with both conventional and atrous convolution layers, was used to retain both contextual and pixel-level information. Despite such efforts, combining local and global information in an optimal manner is not yet a solved problem.
Following the seminal work by Goodfellow et al. [7], a great deal of research has been done on adversarial learning [15, 8, 14, 10]. Specific to segmentation, Luc et al. [8] were the first to propose the use of a discriminator along with a segmentor in an adversarial min-max game to capture long-range label consistencies. In another study, SegAN was introduced, in which the segmentor plays the role of the generator in a min-max game against a discriminator with a multi-scale L1 loss [14]. A similar approach was taken for structure correction in chest X-ray segmentation in [5]. A conditional GAN approach was taken in [10] for brain tumor segmentation.
In this paper, we focus on the challenging problem of pancreas segmentation from CT images, although our framework is generic and can be applied to any 3D object segmentation problem. This particular application has unique challenges due to the complex shape and orientation of the pancreas, its low contrast with neighbouring tissues, and its relatively small and varying size. Pancreas segmentation has been studied widely in the literature. Yu et al. introduced a recurrent saliency transformation network, which uses information from the previous iteration as a spatial weight for the current iteration [16]. In another attempt, a U-Net with an attention gate was proposed in [9]. Similarly, a two-cascaded-stage method was used to localize and segment the pancreas from CT scans in [13]. A prediction-segmentation mask was used in [18] to constrain the segmentation with a coarse-to-fine strategy. Furthermore, a segmentation network with an RNN was proposed in [1] to capture the spatial information among slices. The unique challenges of pancreas segmentation (complex shape and small organ size) have shifted the literature towards coarse-to-fine and multi-stage frameworks, which are promising but computationally expensive.

Summary of our contributions: The current literature on segmentation fails to capture 3D high-level shape and semantics within a low-computation, effective framework. In this paper, for the first time in the literature, we propose a projective adversarial network (PAN) for segmentation to fill this research gap. Our method captures 3D relations through 2D projections of objects, without relying on 3D images or adding to the complexity of the segmentor. Furthermore, we introduce an attention module to selectively integrate high-level, whole-image features from the segmentor into our adversarial network. With comprehensive evaluations, we show that our proposed framework achieves state-of-the-art performance on a publicly available CT pancreas segmentation dataset [11], even when a simple encoder-decoder network is used as the segmentor.
2 Method
Our proposed method is built upon adversarial networks. An overview of the proposed framework is illustrated in Figure 1. We have three networks: a segmentor (S in Figure 1), which is our main network and the one used during the test phase, and two adversarial networks, each with a specific task. The first adversarial network captures high-level spatial label contiguity, while the second enforces 3D semantics through a 2D projection learning strategy. The adversarial networks are used only during the training phase, boosting the performance of the segmentor without adding to its complexity.
2.1 Segmentor (S)
Our base network is a simple fully convolutional network with an encoder-decoder architecture. The input to the segmentor is a 2D greyscale image and the output is a pixel-level probability map, indicating the probability of the object's presence at each pixel. We use a hybrid loss function (explained in detail in Section 2.3) to update the parameters of our segmentor (S). This loss function is composed of three terms enforcing: (1) pixel-level high-resolution details, (2) spatial and high-range label continuity, and (3) 3D shape and semantics, through our novel projective learning strategy. As can be seen in Figure 1, the segmentor contains conv layers in the encoder, conv layers in the decoder, and conv layers as the bottleneck. The last conv layer combines channel-wise information at the highest scale and is followed by a sigmoid function to create the notion of probability.
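The output head can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the single-channel output, and the 1x1 kernel size are our assumptions, consistent with the channel-wise combination and sigmoid described above.

```python
import numpy as np

def conv1x1_sigmoid(features, weights, bias=0.0):
    """Final output head: a 1x1 convolution mixes the channels at each
    pixel, and a sigmoid turns the result into a probability map.
    features: (C, H, W); weights: (C,); returns an (H, W) map in (0, 1)."""
    logits = np.tensordot(weights, features, axes=([0], [0])) + bias
    return 1.0 / (1.0 + np.exp(-logits))
```

With zero logits every pixel maps to probability 0.5, which makes the sigmoid's role easy to verify.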
2.2 Adversarial Networks
Our adversarial networks are designed to compensate for the missing global relations and to correct the higher-order inconsistencies produced by a single pixel-level loss. Each of these networks produces an adversarial signal and applies it to the segmentor as a term in the overall loss function (Equation 2). The details of each network are described below:
Spatial semantics network: This network is designed to capture spatial consistencies within each frame. Its input is the object segmented either by the ground-truth mask or by the segmentor's prediction. The spatial semantics network is trained to discriminate between these two inputs with a binary cross-entropy loss, formulated as in Equation 4. The adversarial signal, produced by the negative of this loss, forces the segmentor to produce predictions closer to the ground truth in terms of spatial semantics.
As illustrated in Figure 1 (top right), the spatial semantics network has a two-branch architecture with late fusion. The top branch processes the objects segmented by the ground truth or by the segmentor's prediction. We propose an extra processing branch, which takes the bottleneck features corresponding to the original grayscale input image and passes them to an attention module for information selection. The processed features are then concatenated with those of the first branch and passed through the shared layers. We believe that having high-level features of the whole image along with the segmentations improves the discriminator's performance.
Our attention module learns where to attend in the feature space, enabling more discriminative information selection and processing. The details of this module are described below.
Attention module (A): We feed the high-level features from the segmentor's bottleneck to the attention module. These features contain global information about the whole frame. We use a soft-attention mechanism, in which the attention module assigns a weight to each feature based on its importance for discrimination. The module takes the features as input and outputs a corresponding set of weights. It is composed of two convolution layers followed by a softmax layer (Figure 2), which introduces the notion of soft selection. The output of the attention module is then multiplied with the features before they are passed to the rest of the network.

Projective network: Any 3D object can be projected onto 2D planes from specific viewpoints, resulting in multiple 2D images. Such a 2D projection contains retrievable 3D semantic information. In this section, we introduce our projective network, whose main task is to capture 3D semantics from 2D projections, without relying on 3D data. Inducing 3D shape from 2D images has previously been done for 3D shape generation [6]. Unlike existing notions, however, in this paper we propose 3D semantics induction from 2D projections to benefit segmentation, for the first time in the literature.
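The soft-attention mechanism of the attention module can be sketched as below. This is only an illustrative NumPy version: the two 1x1 convolutions, the ReLU between them, the spatial (rather than channel-wise) softmax, and all layer widths are our assumptions; the paper specifies only two convolution layers followed by a softmax.

```python
import numpy as np

def soft_attention(features, w1, w2):
    """Soft attention over bottleneck features.
    features: (C, H, W); w1: (K, C) and w2: (1, K) act as 1x1 convolutions.
    Returns the re-weighted features and the spatial weight map."""
    h = np.maximum(0.0, np.einsum('kc,chw->khw', w1, features))  # 1x1 conv + ReLU
    scores = np.einsum('ok,khw->ohw', w2, h)[0]                  # 1x1 conv -> (H, W)
    e = np.exp(scores - scores.max())                            # stable softmax
    weights = e / e.sum()                                        # soft selection
    return features * weights, weights                           # broadcast over C
```

Because the softmax normalizes over spatial positions, the weight map sums to one and acts as a soft selector of locations.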
The projection module P projects a 3D volume V, with third-dimension size d, onto a 2D plane as:

P(V)(i, j) = (1/d) Σ_{k=1}^{d} V(i, j, k)    (1)
where each pixel in the 2D projection gets a value in the range [0, 1], determined by the number of occupied voxels along the third dimension of the corresponding volume V. For the sake of simplicity, we refer to the projection of a 3D volume as P(V). We pass each 3D image through our segmentor S slice by slice and stack the corresponding prediction maps. These maps are then fed to the projection module P and projected in the axial view.
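A minimal NumPy sketch of such a projection module, assuming the occupancy count is normalized by the depth of the volume so that pixel values fall in [0, 1] (the function name and the normalization convention are ours):

```python
import numpy as np

def project_axial(volume):
    """Project a stack of 2D masks onto a single plane: each output pixel
    is the fraction of occupied voxels along the stacking (axial) axis,
    giving a value in [0, 1]. volume: (D, H, W)."""
    return volume.sum(axis=0) / volume.shape[0]
```

For example, a column occupied in 2 of 4 slices projects to 0.5, while empty columns project to 0.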
The input to the projective network is either the projected ground truth or the projected prediction map produced by the segmentor. The network is trained to discriminate between these inputs using the loss function defined in Equation 5. The corresponding adversarial term in Equation 2 forces the segmentor to create predictions that are closer to the ground truth in terms of 3D semantics. Incorporating the projective network as an adversarial component of our segmentation framework helps capture 3D information through a very simple 2D architecture, without adding to the segmentor's complexity at test time.
2.3 Adversarial training
To train our framework, we use a hybrid loss function, which is a weighted sum of three terms. For a dataset of training images x_n with corresponding ground truths y_n, we define our hybrid loss function as:
L = l_s(S(x_n), y_n) − λ_1 l_adv1 − λ_2 l_adv2    (2)
where l_adv1 and l_adv2 are the losses corresponding to the two adversarial networks, and ŷ_n = S(x_n) is the segmentor's prediction. The first term in Equation 2 is a weighted binary cross-entropy loss. This is the state-of-the-art loss function for semantic segmentation and, for a greyscale image of size H × W, is defined as:
l_s(ŷ, y) = − Σ_{i=1}^{H×W} [ w y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]    (3)
where w is the weight for positive samples, y_i is the ground-truth label, and ŷ_i is the network's prediction. Equation 3 encourages the segmentor to produce predictions similar to the ground truth but penalizes each pixel independently; high-order relations and semantics cannot be captured by this term.
To compensate for this drawback, the second and third terms are added to train our auxiliary networks. l_adv1 and l_adv2 are defined below, respectively:
l_adv1 = l_bce(D_1(y_n), 1) + l_bce(D_1(S(x_n)), 0)    (4)
l_adv2 = l_bce(D_2(P(Y_n)), 1) + l_bce(D_2(P(S(X_n))), 0)    (5)
Here, P is the projection module, D_1 and D_2 denote the spatial semantics and projective networks, X_n and Y_n denote the stacked 3D image and ground truth, and l_bce is the binary cross-entropy loss of Equation 3 applied to a single number (0 or 1) as the output.
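A minimal NumPy sketch of the loss terms above. The function names, the reduction (a sum over pixels for the weighted cross-entropy), and the weights lam1 and lam2 are our assumptions; l_adv1 and l_adv2 stand for the already-negated adversarial signals coming from the two adversarial networks.

```python
import numpy as np

def weighted_bce(y_true, y_pred, w_pos=1.0, eps=1e-7):
    """Pixel-wise weighted binary cross-entropy: positive (object)
    pixels are scaled by w_pos; predictions are clipped for stability."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(w_pos * y_true * np.log(p)
                   + (1.0 - y_true) * np.log(1.0 - p))

def hybrid_loss(l_pixel, l_adv1, l_adv2, lam1=0.01, lam2=0.01):
    """Weighted combination of the pixel-level term and the two
    adversarial terms (weights lam1, lam2 are illustrative)."""
    return l_pixel + lam1 * l_adv1 + lam2 * l_adv2
```

A perfect prediction drives the cross-entropy term to (numerically) zero, while the adversarial terms are simply scaled and added.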
3 Experiments and Results
We evaluated the efficacy of our proposed system on the challenging problem of pancreas segmentation. This particular problem was selected due to the complex and varying shape of the pancreas and the relatively more difficult nature of its segmentation compared to other abdominal organs. In our experiments, we show that our proposed framework outperforms other state-of-the-art methods and captures complex 3D semantics with a simple encoder-decoder. Furthermore, we provide an extensive comparison to baselines designed specifically to show the effect of each block of our framework.
Data and evaluation: We used the publicly available TCIA CT dataset from NIH [11]. The in-plane resolution of the scans is fixed, while the number of axial slices and the voxel spacing vary across scans. We randomly split the images into training and testing sets to perform 4-fold cross-validation. The Dice Similarity Coefficient (DSC) is used as the evaluation metric.
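The evaluation metric can be computed as follows; this is a standard DSC implementation for binary masks (the convention for two empty masks is our choice):

```python
import numpy as np

def dice_score(pred, target):
    """Dice Similarity Coefficient between two binary masks:
    DSC = 2 * |A intersect B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, target).sum() / denom
```

Two half-overlapping masks of equal size yield a DSC of 0.5, and identical masks yield 1.0.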
Comparison to baselines:

Model  DSC%
Encoder-decoder (S)  57.7
Atrous pyramid  48.2
S + spatial adversarial network  85.0
S + spatial adversarial network + attention  85.9
Full framework, 1-fold (S + both adversarial networks + attention)  86.8

Table 1: Effect of each building block of our framework.
To show the effect of each building block of our framework, we designed an extensive set of experiments, starting from training a single segmentor (S) alone and building up to our final proposed framework. Furthermore, we compare the encoder-decoder architecture with other state-of-the-art semantic segmentation architectures.
Table 1 shows the results of adding each building block of our framework. The encoder-decoder architecture is the one shown in Figure 1 as the segmentor, while the Atrous pyramid architecture is similar to the recent work of [4], currently the state of the art for semantic segmentation, in which an atrous pyramid is used to capture global context. We added an atrous pyramid at different scales, combining atrous convolutions at multiple rates with global image pooling, and replaced the decoder with simple upsampling and conv layers, similar to the original paper [4]; we refer readers to that paper for further architectural details, which we omit due to space limitations. We found that extensive processing in the decoder improves results compared to the Atrous pyramid architecture (possibly a better choice for segmenting objects at multiple scales), because our object of interest is relatively small.
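To make the atrous-pyramid baseline concrete, the following 1-D NumPy sketch illustrates what an atrous (dilated) convolution does; the function and its interface are ours, purely for illustration. Kernel taps are spaced `rate` samples apart, enlarging the receptive field at no extra parameter cost.

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """1-D atrous (dilated) convolution without padding: the k kernel
    taps cover a span of (k - 1) * rate + 1 input samples."""
    k = len(kernel)
    span = (k - 1) * rate + 1
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out
```

A two-tap kernel at rate 2 sums samples two positions apart, so its receptive field spans three samples even though it has only two weights; an atrous pyramid applies several such rates in parallel.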
Moreover, we showed that adding a spatial adversarial network dramatically boosts the segmentor's performance on our task. Introducing attention helps with better information selection (as described in Section 2.2) and boosts the performance further. Finally, our best result is achieved by adding the projective adversarial network, which integrates 3D semantics into the framework. This supports our hypothesis that our segmentor has enough capacity, in terms of parameters, to capture all of this information and, with proper and explicit supervision, can achieve state-of-the-art results.
Comparison to the state of the art: We compare our method's performance with the current state-of-the-art literature on the same TCIA CT dataset for pancreas segmentation. As the experimental validation shows, our method outperforms the state of the art in Dice score while providing better efficiency (less computational burden). Of note, the proposed algorithm's minimum score is consistently higher than those of the state-of-the-art methods.
Approach  Average DSC%  Max DSC%  Min DSC%
Roth et al. [11]  -  86.29  23.99
Roth et al. [12]  -  88.65  34.11
Roth et al. [13]  -  88.96  50.69
Zhou et al. [18]  -  90.85  62.43
Cai et al. [1]  -  90.10  60.00
Yu et al. [16]  -  91.02  62.81
Ours (4-fold cross-validation)  85.53  88.71  83.20

Table 2: Comparison with state-of-the-art methods on the TCIA CT pancreas dataset.
4 Conclusion
In this paper, we proposed a novel adversarial framework for 3D object segmentation. We introduced a novel projective adversarial network, inferring 3D shape and semantics from 2D projections. The motivation behind our idea is that integrating 3D information through a fully 3D network, taking all slices as input, is computationally infeasible. Possible workarounds are (1) downsampling the data or (2) reducing the number of parameters, which sacrifice information and computational capacity, respectively. We also introduced an attention module to selectively pass whole-frame, high-level features from the segmentor's bottleneck to the adversarial network for better information processing. We showed that, with proper and guided supervision through adversarial signals, a simple encoder-decoder architecture with enough parameters achieves state-of-the-art performance on the challenging problem of pancreas segmentation. We achieved a Dice score of 85.53%, outperforming previous methods on this task. Furthermore, our framework is general and can be applied to any 3D object segmentation problem; it is not specific to a single application.
References
 [1] Cai, J., Lu, L., Xie, Y., Xing, F., Yang, L.: Improving deep pancreas segmentation in CT and MRI images via recurrent neural contextual learning and direct loss function. arXiv preprint arXiv:1707.04912 (2017)
 [2] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848 (2018)
 [3] Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
 [4] Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 801–818 (2018)
 [5] Dai, W., Dong, N., Wang, Z., Liang, X., Zhang, H., Xing, E.P.: SCAN: Structure correcting adversarial network for organ segmentation in chest X-rays. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 263–273. Springer (2018)
 [6] Gadelha, M., Maji, S., Wang, R.: 3D shape induction from 2D views of multiple objects. In: 2017 International Conference on 3D Vision (3DV). pp. 402–411. IEEE (2017)
 [7] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. pp. 2672–2680 (2014)
 [8] Luc, P., Couprie, C., Chintala, S., Verbeek, J.: Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408 (2016)
 [9] Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., et al.: Attention U-Net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018)

 [10] Rezaei, M., Harmuth, K., Gierke, W., Kellermeier, T., Fischer, M., Yang, H., Meinel, C.: A conditional adversarial network for semantic segmentation of brain tumor. In: International MICCAI Brainlesion Workshop. pp. 241–252. Springer (2017)
 [11] Roth, H.R., Lu, L., Farag, A., Shin, H.C., Liu, J., Turkbey, E.B., Summers, R.M.: DeepOrgan: Multi-level deep convolutional networks for automated pancreas segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 556–564. Springer (2015)
 [12] Roth, H.R., Lu, L., Farag, A., Sohn, A., Summers, R.M.: Spatial aggregation of holistically-nested networks for automated pancreas segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 451–459. Springer (2016)

 [13] Roth, H.R., Lu, L., Lay, N., Harrison, A.P., Farag, A., Sohn, A., Summers, R.M.: Spatial aggregation of holistically-nested convolutional neural networks for automated pancreas localization and segmentation. Medical Image Analysis 45, 94–107 (2018)
 [14] Xue, Y., Xu, T., Zhang, H., Long, L.R., Huang, X.: SegAN: Adversarial network with multi-scale L1 loss for medical image segmentation. Neuroinformatics 16(3-4), 383–392 (2018)
 [15] Yi, X., Walia, E., Babyn, P.: Generative adversarial network in medical imaging: A review. arXiv preprint arXiv:1809.07294 (2018)

 [16] Yu, Q., Xie, L., Wang, Y., Zhou, Y., Fishman, E.K., Yuille, A.L.: Recurrent saliency transformation network: Incorporating multi-stage visual cues for small organ segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8280–8289 (2018)
 [17] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1529–1537 (2015)
 [18] Zhou, Y., Xie, L., Shen, W., Wang, Y., Fishman, E.K., Yuille, A.L.: A fixed-point model for pancreas segmentation in abdominal CT scans. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 693–701. Springer (2017)