1 Introduction
This work focuses on multi-organ segmentation in the abdominal region, which contains multiple organs such as the liver, pancreas and kidneys. The segmentation of internal structures on medical images, e.g., CT scans, is an essential prerequisite for many clinical applications such as computer-aided diagnosis, computer-aided intervention and radiation therapy. Compared with other internal structures such as the heart or brain, abdominal organs are much more difficult to segment due to their morphological and structural complexity and the low contrast of soft tissues.
With the development of deep convolutional neural networks (CNNs), many medical image segmentation problems have achieved satisfactory results, but only when full supervision is available [33, 32, 45, 41, 30, 4]. Despite the recent progress, the annotation of medical radiology images is extremely expensive, as it must be handled by experienced radiologists and carefully checked by additional experts. This results in a lack of high-quality labeled training data. More critically, how to efficiently incorporate domain-specific expertise (e.g., anatomical priors such as organ shape and size) into segmentation models [10, 25] remains an open issue. Our key observation is that, in the medical image analysis domain, instead of scribbles [17, 36, 37], points [3] and image-level tags [26, 27, 40], there exists a considerable number of datasets in the form of abdominal CT scans [31, 33, 34]. To meet different research goals or practical usages, these datasets are annotated to target different organs (a subset of the abdominal organs); e.g., pancreas datasets [31] only have the pancreas labeled while leaving the rest marked as background.
The aim of this work is to fully leverage these existing partially-annotated datasets to assist multi-organ segmentation, which we refer to as partial supervision. To address the challenge of partial supervision, an intuitive solution is to simply train the segmentation model on both the labeled data and the partially-labeled data in a semi-supervised manner [29, 2, 26]. However, this 1) fails to take advantage of the fact that medical images are naturally more constrained than natural images [24]; and 2) is misleading, as it treats the unlabeled pixels/voxels as background. To overcome these issues, we propose the Prior-aware Neural Network (PaNN), which handles such background ambiguity by incorporating prior knowledge on organ size distributions. We achieve this via a prior-aware loss, which acts as an auxiliary, soft constraint regularizing the average output size distributions of different organs to approximate their prior proportions. The overall pipeline is illustrated in Fig. 2. Based on the anatomical similarities (Fig. 1) across different patient scans [10, 25, 15], the prior proportions are estimated by statistics from the fully-labeled data. It is important to note that the training objective is hard to optimize directly using stochastic gradient descent. To address this issue, we propose to formulate our objective in a min-max form, which can be well optimized via the stochastic primal-dual gradient algorithm [20]. To summarize, our contributions are threefold:
1) We propose the Prior-aware Neural Network, which incorporates domain-specific knowledge from medical images, to facilitate multi-organ segmentation using partially-annotated datasets.
2) As the training objective is difficult to optimize directly using stochastic gradient descent, we reformulate it in a min-max form and optimize it via the stochastic primal-dual gradient algorithm [20].
3) PaNN significantly outperforms previous state-of-the-art methods even when using fewer annotations. It achieves state-of-the-art results on the MICCAI 2015 challenge "Multi-Atlas Labeling Beyond the Cranial Vault" in the free competition for organ segmentation in the abdomen.
2 Related Work
Currently, the most successful deep learning techniques for semantic segmentation stem from a common forerunner, i.e., the Fully Convolutional Network (FCN) [21]. Based on FCN, many advanced techniques have been proposed, such as DeepLab [5, 6, 7], SegNet [1], PSPNet [43] and RefineNet [18]. Most of these methods are based on supervised learning, hence requiring a sufficient number of labeled training images. To cope with scenarios where supervision is limited, researchers have begun to investigate the weakly-supervised setting [26, 27, 9], where, e.g., only bounding boxes or image-level labels are available, and the semi-supervised setting [26, 35], where unlabeled data are used to enlarge the training set. Papandreou et al. proposed EM-Adapt [26], where the pseudo-labels of the unknown pixels are estimated in the expectation step and standard SGD is performed in the maximization step. Souly et al. [35] demonstrated the usefulness of generative adversarial networks for semi-supervised segmentation. In the medical imaging domain, it is even harder to acquire sufficient labeled data, as the annotation has to be done by experts. Although fully-supervised methods (e.g., U-Net [30], VoxResNet [4], DeepMedic [14], 3D-DSN [11], HNN [32]) have achieved remarkable performance in tasks such as brain MR segmentation, abdominal single-organ segmentation and multi-organ segmentation, semi- or weakly-supervised learning remains a far more realistic solution. For example, Bai et al. [2] proposed an EM-based iterative method, where a CNN is alternately trained on labeled and post-processed unlabeled sets. In [42], supervised and unsupervised adversarial costs are combined to address semi-supervised gland segmentation. DeepCut [29] shows that weak annotations such as bounding boxes can also be utilized in medical image segmentation by performing an iterative optimization scheme similar to [26].
However, these methods fail to capture anatomical priors [19]. Including priors in medical imaging could have far more impact than in natural images, since anatomical objects in medical images are naturally more constrained in shape, location and size. Some recent works [10, 25] demonstrate that these priors can be learnt by a generative model, but such methods induce heavy computational overhead. Kervadec et al. [15] showed that directly imposing inequality constraints on organ sizes is also an effective way of incorporating anatomical priors. Unlike these methods, we propose to learn from partial annotations by embedding abdominal region statistics in the training objective, which requires no additional training budget.
3 Prior-aware Neural Network
Our work aims to address the multi-organ segmentation problem with the help of multiple existing partially-labeled datasets. Given a CT scan in which each element indicates the Hounsfield unit (HU) of a voxel, the goal is to predict the label of each pixel/voxel. The target structures are restricted to organs that do not overlap with each other.
3.1 Partial Supervision
We consider a new supervision paradigm, i.e., partial supervision, for multi-organ segmentation. This is motivated by the fact that, in medical image analysis, there exist a considerable number of abdominal CT datasets with only one or a few organs labeled [31, 33, 34], which can serve as partial supervision for multi-organ segmentation (see the list in the Appendix). Based on domain knowledge, our approach assumes the following characteristics of the datasets, which are common in medical image analysis. First, the scanning protocols of medical images are well standardized, e.g., brain, head and neck, chest, abdomen, and pelvis in CT scans, which means that the internal structures are consistent within a limited range according to the scanning protocol (see Fig. 1). Second, internal organs have anatomical and spatial relationships; e.g., the gastrointestinal tract, i.e., the stomach, duodenum, small intestine, and colon, is connected in a fixed order.
The partially-supervised setting can be formally defined as below. Given a fully-labeled dataset $\mathcal{S}_F$ with the annotations known and $T$ partially-labeled datasets $\{\mathcal{S}_P^t\}_{t=1}^{T}$, let $\mathcal{I}_F$ and $\mathcal{I}_t$ denote the image indices of $\mathcal{S}_F$ and $\mathcal{S}_P^t$, respectively. For the fully-labeled dataset, the annotation $y_j^i$ of the $j$-th pixel in the $i$-th image is selected from the complete label space $\{0\} \cup \mathcal{O}$, where $\mathcal{O}$ denotes the abdominal organ space and $0$ the background. For the $t$-th partially-labeled dataset, $y_j^i$ is selected from $\{0\} \cup \mathcal{O}_t$ with $\mathcal{O}_t \subset \mathcal{O}$. In 2D-based segmentation models, the $i$-th input is a sliced 2D image from either the axial, coronal or sagittal view of the whole CT scan [45, 32, 44, 39]. In 3D-based segmentation models, the input is a cropped 3D patch from the whole CT volume [8, 22]. Note that semi-supervision and full supervision are two extreme cases of partial supervision, where the set of partial labels is the empty set ($\mathcal{O}_t = \emptyset$) or the complete set ($\mathcal{O}_t = \mathcal{O}$), respectively.
A naive solution is to simply train a segmentation network on both the fully-labeled data and the partially-labeled data, alternately updating the network parameters and the segmentations (pseudo-labels) of the partially-labeled data [44, 2]. While such EM-like approaches achieve significant improvement over fully-supervised methods, they require high-quality pseudo-labels and fail to explicitly incorporate anatomical priors on shape or size.
To address this issue, we propose the Prior-aware Neural Network (PaNN), aiming at explicitly embedding anatomical priors without incurring any additional budget. More specifically, the anatomical priors are enforced by introducing an additional penalty, which acts as a soft constraint regularizing the average output distribution of organ sizes to mimic the empirical proportions. This prior is obtained by calculating the organ size statistics of the fully-labeled dataset. An overview of the overall framework is shown in Fig. 2, and the detailed training procedure is introduced in the following sections.
3.2 Prior-aware Loss
Consider a segmentation network parameterized by weights $\mathbf{w}$, which outputs softmax probabilities $p_{j,l}^i$. Let $\mathbf{q}$ be the label distribution of the fully-labeled dataset, with $q_l$ describing the proportion of the $l$-th label (organ). Then, we estimate the average predicted distribution over the pixels of the partially-labeled datasets as

$\bar{p}_l(\mathbf{w}) = \frac{1}{N} \sum_{i \in \mathcal{I}_P} \sum_j p_{j,l}^i,$   (1)

where $\mathbf{p}_j^i = (p_{j,0}^i, \dots, p_{j,L}^i)$ denotes the probability vector of the $j$-th pixel in the $i$-th input slice, $\mathcal{I}_P = \mathcal{I}_1 \cup \dots \cup \mathcal{I}_T$, and $N$ is the total number of pixels/voxels. Recall that $T$ is the total number of partially-labeled datasets. To embed the prior knowledge, the prior-aware loss is defined as

$\mathcal{L}_{\text{prior}}(\mathbf{w}) = \mathrm{KL}\big(\mathbf{q}\,\|\,\bar{\mathbf{p}}(\mathbf{w})\big) = H\big(\mathbf{q}, \bar{\mathbf{p}}(\mathbf{w})\big) - H(\mathbf{q}),$   (2)

which measures the mismatch between the two distributions $\mathbf{q}$ and $\bar{\mathbf{p}}(\mathbf{w})$ via the Kullback-Leibler divergence. Therein, $H(\mathbf{q}, \bar{\mathbf{p}}(\mathbf{w}))$ is the cross entropy between $\mathbf{q}$ and $\bar{\mathbf{p}}(\mathbf{w})$, and $H(\mathbf{q})$ is the information entropy of $\mathbf{q}$. As emphasized above, the rationale of Eq. (2) is that the output distribution of organ sizes should approximate the empirical proportions $\mathbf{q}$, which reflect domain-specific knowledge. Note that $\mathbf{q}$ is a global estimation of the label distribution of the fully-labeled training data, which remains unchanged. Consequently, $H(\mathbf{q})$ is a constant and can be omitted during network training. Nevertheless, we observe that it is still problematic to directly apply stochastic gradient descent, as we will detail in Sec. 3.3.
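The quantities in Eq. (1) and Eq. (2) are simple to compute once per-pixel softmax outputs are available. The following numpy sketch (array shapes and names are our own illustration, not the paper's implementation) averages the predicted distributions over all pixels of the partially-labeled data and measures the KL divergence from the prior proportions:

```python
import numpy as np

def prior_aware_loss(probs, q):
    """Sketch of the prior-aware loss KL(q || p_bar): `probs` is a
    (num_pixels, num_labels) array of softmax outputs over the
    partially-labeled data, and `q` is the prior label-proportion
    vector estimated from the fully-labeled set."""
    p_bar = probs.mean(axis=0)                  # average predicted distribution, Eq. (1)
    cross_entropy = -np.sum(q * np.log(p_bar))  # H(q, p_bar)
    entropy = -np.sum(q * np.log(q))            # H(q), constant w.r.t. the network
    return cross_entropy - entropy              # KL(q || p_bar) >= 0
```

The loss vanishes exactly when the average prediction matches the prior, and the entropy term can be dropped during training since it does not depend on the network weights.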
Specifically, our final training objective is

$\min_{\mathbf{w}}\; \mathcal{L}_F(\mathbf{w}) + \lambda_1 \mathcal{L}_P(\mathbf{w}) + \lambda_2 \mathcal{L}_{\text{prior}}(\mathbf{w}),$   (3)

where $\mathcal{L}_F$ and $\mathcal{L}_P$ are the cross-entropy losses on the fully-labeled data and the partially-labeled data, respectively, and $\hat{\mathbf{y}}$ denotes the computed pseudo-labels together with the existing partial labels of the partially-labeled dataset(s). Note that the prior-aware loss $\mathcal{L}_{\text{prior}}$ is used as a soft global constraint to stabilize the training process. Concretely, $\mathcal{L}_F$ is defined as

$\mathcal{L}_F(\mathbf{w}) = -\frac{1}{N_F} \sum_{i \in \mathcal{I}_F} \sum_j \sum_l \mathbb{1}\big(y_j^i = l\big) \log p_{j,l}^i,$   (4)

where $p_{j,l}^i$ denotes the softmax probability of the $j$-th pixel in the $i$-th image belonging to the $l$-th category. $\mathcal{L}_P$ is given by

$\mathcal{L}_P(\mathbf{w}) = -\frac{1}{N_P} \sum_{i \in \mathcal{I}_P} \sum_j \Big[\, \mathbb{1}\big(y_j^i \in \mathcal{O}_t\big) \log p_{j, y_j^i}^i + \mathbb{1}\big(\hat{y}_j^i = 0\big) \log p_{j,0}^i \,\Big],$   (5)

where the first term corresponds to the pixels whose labels are given, i.e., $y_j^i \in \mathcal{O}_t$. The second term corresponds to unlabeled background pixels, whose pseudo-labels $\hat{y}_j^i$ need to be estimated during model training as a kind of pseudo-supervision, i.e., $\hat{y}_j^i = 0$.
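A minimal numpy sketch of the two terms of this partially-supervised loss, using hypothetical conventions of our own (labels ≥ 0 are given, -1 marks unlabeled pixels, and class 0 is the background):

```python
import numpy as np

def labeled_term(probs, labels):
    """First term (sketch): standard cross-entropy over pixels whose
    partial labels are given. `probs` is (num_pixels, num_labels);
    a label of -1 denotes an unlabeled pixel."""
    mask = labels >= 0
    return -np.mean(np.log(probs[mask, labels[mask]]))

def pseudo_background_term(probs, labels, bg=0):
    """Second term (sketch): unlabeled pixels currently predicted as
    background are treated as pseudo-labeled background."""
    unlabeled = probs[labels < 0]
    pseudo = unlabeled.argmax(axis=1) == bg
    return -np.mean(np.log(unlabeled[pseudo, bg])) if pseudo.any() else 0.0
```

The pseudo-label assignment here (argmax equals background) is one simple estimation rule; the point is only that the background term is driven by estimated rather than given labels.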
3.3 Derivation
By substituting Eq. (1) into Eq. (2) and expanding into scalars, we rewrite Eq. (2) as

$\mathcal{L}_{\text{prior}}(\mathbf{w}) = \sum_l q_l \log q_l - \sum_l q_l \log \Big( \frac{1}{N} \sum_{i \in \mathcal{I}_P} \sum_j p_{j,l}^i \Big).$   (6)

From Eq. (2) and Eq. (6) we can see that the average distribution $\bar{\mathbf{p}}(\mathbf{w})$ of organ sizes sits inside the logarithmic loss, which is very different from standard machine learning losses such as Eq. (4) and Eq. (5), where the average is outside the logarithmic loss. Directly minimizing $\mathcal{L}_{\text{prior}}$ by stochastic gradient descent is therefore difficult: as the true gradient induced by Eq. (2) is not a summation of independent terms, the stochastic gradients would be intrinsically biased [20]. To remedy this, we propose to optimize the KL-divergence term using the stochastic primal-dual gradient method [20]. Our goal here is to transform the prior-aware loss into an equivalent min-max problem by taking the sample average out of the logarithmic loss. We introduce an auxiliary dual variable $\nu_l > 0$ for each primal quantity $\bar{p}_l(\mathbf{w})$, collected in the vector $\boldsymbol{\nu}$, to assist the optimization. First, for any $x > 0$ the following identity holds

$\log x = \min_{\nu > 0}\, \big( \nu x - \log \nu - 1 \big),$   (7)

due to the concavity of the log function, with the minimum attained at $\nu = 1/x$. Based on Eq. (7), taking $\nu_l$ as the dual variable associated with the primal quantity $\bar{p}_l(\mathbf{w})$, we have

$-\log \bar{p}_l(\mathbf{w}) = \max_{\nu_l > 0}\, \big( \log \nu_l - \nu_l\, \bar{p}_l(\mathbf{w}) + 1 \big),$   (8)

where $\bar{p}_l(\mathbf{w})$ (or $\nu_l$) denotes the $l$-th element of $\bar{\mathbf{p}}(\mathbf{w})$ (or $\boldsymbol{\nu}$). Substituting Eq. (8) into Eq. (2)/Eq. (6), minimizing the KL divergence is equivalent to the following min-max optimization problem:

$\min_{\mathbf{w}} \max_{\boldsymbol{\nu} > 0}\; \sum_l q_l \Big( \log \nu_l - \frac{\nu_l}{N} \sum_{i \in \mathcal{I}_P} \sum_j p_{j,l}^i \Big),$   (9)

which brings the sample average out of the logarithmic loss. Note that we ignore the constants in the above formulas.
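The variational form of the logarithm that this primal-dual trick [20] relies on, log x = min over ν > 0 of (νx − log ν − 1) with minimizer ν = 1/x (our reading of Eq. (7)), can be verified numerically with a simple grid search; this is a didactic check, not part of the training code:

```python
import numpy as np

def log_via_min(x, num=200_000):
    """Evaluate min over a grid of nu > 0 of (nu*x - log(nu) - 1),
    which should recover log(x); the minimizer is nu = 1/x."""
    nu = np.linspace(1e-4, 50.0, num)
    return np.min(nu * x - np.log(nu) - 1.0)
```

Once the log is replaced by this minimization, the sample average enters the objective linearly, so an unbiased stochastic gradient can be formed from mini-batches.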
3.4 Model Training
We consider training a fully convolutional network [21, 6, 30] for multi-organ segmentation, where the input images are either 2D slices [39, 32, 45] or 3D cropped patches [8, 22]. The training procedure is divided into two stages.
In the first stage, we train only on the fully-labeled dataset by optimizing Eq. (4) via stochastic gradient descent (i.e., with the partially-labeled and prior-aware loss terms in Eq. (3) disabled). The goal of this stage is to find a proper initialization for the network weights, which stabilizes the training procedure in the second stage.
In the second stage, we train the model on the union of the fully-labeled dataset and the partially-labeled dataset(s) via Eq. (3). We thus have two groups of variables, i.e., the network weights $\mathbf{w}$ and the auxiliary variables (the pseudo-labels $\hat{\mathbf{y}}$ and the dual variables $\boldsymbol{\nu}$). We adopt an alternating optimization, which can be decomposed into two subproblems:

Fixing $\mathbf{w}$, updating $\hat{\mathbf{y}}$ and $\boldsymbol{\nu}$. With the network weights given, we first estimate the pseudo-labels $\hat{\mathbf{y}}$ of the background pixels in the partially-labeled dataset(s). Meanwhile, the optimization of $\boldsymbol{\nu}$ is a maximization problem, hence we perform stochastic gradient ascent to learn $\boldsymbol{\nu}$, starting from a fixed initialization.

Fixing $\hat{\mathbf{y}}$ and $\boldsymbol{\nu}$, updating $\mathbf{w}$. With the auxiliary variables fixed, we then update the network weights via standard stochastic gradient descent.
As can be seen, our algorithm is formulated as a min-max optimization. We summarize the detailed optimization procedure in Algorithm 1.
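The flavor of this alternation can be illustrated on a toy saddle-point problem (a stand-in of our own, not the objective of Eq. (9)): one gradient-ascent step on the dual variable followed by one gradient-descent step on the primal variable converges to the saddle point.

```python
def gradient_descent_ascent(w=2.0, v=0.0, lr=0.1, steps=2000):
    """Toy min-max optimization of L(w, v) = v*w - v**2/2, whose unique
    saddle point is (0, 0). Mirrors the alternation of Sec. 3.4:
    first an ascent step on the dual v, then a descent step on w."""
    for _ in range(steps):
        v += lr * (w - v)   # maximization step: dL/dv = w - v
        w -= lr * v         # minimization step: dL/dw = v
    return w, v
```

In PaNN the descent step acts on the network weights via backpropagation and the ascent step on the dual variables, but the alternation pattern is the same.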
4 Experiments
4.1 Experiment Setup
Datasets and Evaluation Metric.
We use the training set released in the MICCAI 2015 Multi-Atlas Abdomen Labeling Challenge as the fully-labeled dataset, which contains 30 abdominal CT scans with axial contrast-enhanced abdominal clinical CT images. For each case, 13 anatomical structures are annotated: spleen, right kidney, left kidney, gallbladder, esophagus, liver, stomach, aorta, inferior vena cava (IVC), portal vein & splenic vein, pancreas, left adrenal gland, and right adrenal gland. As for the partially-labeled dataset(s), we use a spleen segmentation dataset (available at http://medicaldecathlon.com; referred to as A), a pancreas segmentation dataset (available at https://wiki.cancerimagingarchive.net/display/Public/PancreasCT; referred to as B) and a liver segmentation dataset (referred to as C). To keep these partially-labeled datasets balanced, 40 cases are evenly selected from each dataset to constitute the partial supervision.
Following the standard cross-validation evaluation [33, 32, 23, 45, 39], we randomly partition the fully-labeled dataset into 5 complementary folds, each of which contains 6 cases, and apply standard 5-fold cross-validation. For each split, we use 4 folds (i.e., 24 cases) as full supervision and test on the remaining fold.
The evaluation metric we use is the Dice-Sørensen coefficient (DSC), which measures the similarity between the prediction voxel set $\mathcal{P}$ and the ground-truth voxel set $\mathcal{G}$, defined as $\mathrm{DSC}(\mathcal{P}, \mathcal{G}) = \frac{2|\mathcal{P} \cap \mathcal{G}|}{|\mathcal{P}| + |\mathcal{G}|}$. We report the average DSC over all testing cases and labeled anatomical structures for performance evaluation.
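For reference, the metric amounts to the following straightforward computation on binary masks (one mask per structure):

```python
import numpy as np

def dice(pred, gt):
    """Dice-Sørensen coefficient 2|P ∩ G| / (|P| + |G|) between binary
    prediction and ground-truth voxel masks; 1.0 for two empty masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * (pred & gt).sum() / denom if denom else 1.0
```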
Implementation details. Similar to [45, 32, 33, 39], we clip the intensities of each slice to a soft-tissue CT window (in HU) and rescale them before feeding them to the network. Random rotation is used as an online data augmentation. Our implementations are based on current state-of-the-art 2D models (https://github.com/tensorflow/models/tree/master/research/deeplab) [7, 6] and 3D models (https://github.com/DLTK/DLTK) [30, 28]. We provide an extensive study of how partially-labeled datasets facilitate the multi-organ segmentation task and list thorough comparisons under different settings.
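As a concrete illustration of this preprocessing step (the exact window bounds and rescaling range used in the paper are not recoverable here, so the values below are a common soft-tissue choice and purely an assumption):

```python
import numpy as np

def window_and_rescale(volume_hu, lo=-125, hi=275):
    """Clip a CT volume (in Hounsfield units) to a soft-tissue window
    and linearly rescale intensities to [0, 1]. The (lo, hi) bounds
    are illustrative assumptions, not the paper's exact values."""
    clipped = np.clip(volume_hu, lo, hi)
    return (clipped - lo) / float(hi - lo)
```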
As described in Sec. 3.4, the whole training procedure is divided into two stages. The first stage is the same as fully-supervised training, i.e., we train exclusively on the fully-labeled dataset for a certain number of iterations.

In the second stage, we switch to the min-max optimization on the union of the fully-labeled dataset and the partially-labeled datasets. In each mini-batch, labeled and partially-labeled data are sampled at a fixed ratio. It has been suggested [2] that updating the pseudo-labels at every iteration is unnecessary; hence, the pseudo-labels are updated only periodically in practice. The two loss weights in Eq. (3) are fixed, and the same learning rate decay policy as in the first stage is utilized, with separate initial learning rates for the minimization and the maximization steps.

For the 2D implementations, a poly learning rate policy is employed in the first stage. Following [33, 7, 14], we apply multi-scale inputs in both the training and testing phases. For the 3D implementations, a fixed learning rate policy is employed in the first stage.
4.2 Experimental Comparison
Table 1. Comparison under different supervision settings. "✓" marks the partially-labeled dataset(s) used (A: spleen, B: pancreas, C: liver).

Model            Supervision      A   B   C   Average Dice
ResNet-50 [12]   Full                          0.7535
                 Semi [2]         ✓            0.7593
                                      ✓        0.7632
                                          ✓    0.7596
                                  ✓   ✓   ✓    0.7669
                 Partial (ours)   ✓            0.7650
                                      ✓        0.7662
                                          ✓    0.7631
                                  ✓   ✓   ✓    0.7705
                 PaNN (ours)      ✓            0.7716
                                      ✓        0.7712
                                          ✓    0.7705
                                  ✓   ✓   ✓    0.7833
ResNet-101 [12]  Full                          0.7614
                 Semi [2]         ✓            0.7637
                                      ✓        0.7649
                                          ✓    0.7647
                                  ✓   ✓   ✓    0.7719
                 Partial (ours)   ✓            0.7714
                                      ✓        0.7695
                                          ✓    0.7684
                                  ✓   ✓   ✓    0.7735
                 PaNN (ours)      ✓            0.7770
                                      ✓        0.7819
                                          ✓    0.7748
                                  ✓   ✓   ✓    0.7904
3D-UNet [8]      Full                          0.7066
                 Semi [2]         ✓   ✓   ✓    0.7193
                 Partial (ours)   ✓   ✓   ✓    0.7163
                 PaNN (ours)      ✓   ✓   ✓    0.7208
We compare the proposed PaNN with a series of baselines: 1) the fully-supervised approach (denoted "fully-sup"), where we train exclusively on the fully-labeled dataset; 2) the semi-supervised approach (denoted "semi-sup"), where we train the network on both the fully-labeled dataset and the partially-labeled dataset(s) while treating the latter as unlabeled, following the representative method [2]; and 3) the naive partially-supervised approach (denoted "partial-sup"), where we also train the network on both while using the partial labels as they are. Unlike PaNN, the latter sets the weight of the prior-aware loss in Eq. (3) to zero, which allows us to verify the efficacy of the prior-aware loss.
Benefit of Partial Supervision. As shown in Table 1, among the three kinds of supervision, partial supervision obtains the best performance, followed by semi-supervision and then full supervision. This phenomenon is unsurprising for two reasons. First, compared with full supervision, semi-supervision has more training data, though part of it is unannotated. Second, compared with semi-supervision, partial supervision involves more annotated pixels in the organs of interest.
Effect of PaNN. From Table 1, PaNN consistently achieves better performance than the naive partially-supervised method, which demonstrates its effectiveness. For example, when setting the partial dataset as the union of A, B and C, PaNN achieves the best result with either 2D or 3D models. 2D models generally show better performance than 3D models in each setting, probably because current 3D models only act on local patches, which results in a lack of holistic information [38]; a detailed discussion of 2D versus 3D models is given in [16]. More specifically, PaNN outperforms the naive partially-supervised method with both ResNet-50 and ResNet-101 as the backbone. Additionally, we observe a convincing performance gain when using 3D-UNet [8, 30] as the backbone model.
Table 4. Comparison with the top submissions on the MICCAI 2015 Multi-Atlas Labeling challenge leaderboard (per-organ Dice, average Dice, mean surface distance and Hausdorff distance).

Name                      Spleen  Kidney(R)  Kidney(L)  Gallbladder  Esophagus  Liver  Aorta  IVC    Average  Mean Surface  Hausdorff
                                                                                                     Dice     Distance      Distance
AutoContext3DFCN [33]     0.926   0.866      0.897      0.629        0.727      0.948  0.852  0.791  0.782    1.936         26.095
deedsJointCL [13]         0.920   0.894      0.915      0.604        0.692      0.948  0.857  0.828  0.790    2.262         25.504
dltk0.1_unet_sub2 [28]    0.939   0.895      0.915      0.711        0.743      0.962  0.891  0.826  0.815    1.861         62.872
results_13organs_p0.7     0.890   0.898      0.883      0.685        0.754      0.936  0.870  0.819  0.817    4.559         38.661
PaNN* (ours)              0.961   0.901      0.943      0.704        0.783      0.972  0.913  0.835  0.832    1.641         25.176
PaNN (ours)               0.968   0.920      0.953      0.729        0.790      0.974  0.925  0.847  0.850    1.450         18.468
Meanwhile, as the number of partially-labeled datasets increases (from using only A, B or C to the union of all three), the performance improvements of the different methods also differ. With ResNet-101 as the backbone, the improvements obtained under semi-supervision and naive partial supervision are comparatively small, whereas PaNN obtains a much more remarkable improvement. This observation suggests that PaNN is capable of exploiting more partially-labeled training data and is less susceptible to background ambiguity.
Organ-by-organ Analysis. To reveal the detailed effect of PaNN, we present an organ-by-organ analysis in Fig. 3, using ResNet-50 as the backbone model (ResNet-101 shows a similar trend) and the partially-labeled dataset C (i.e., the liver is the labeled organ).

In Fig. 3, we observe clear statistical improvements over the fully-supervised method for almost every organ (significant for 11 of the 13 abdominal organs). Great improvements are observed in particular for difficult organs, i.e., organs that are small or have complex geometric characteristics, such as the gallbladder, esophagus, stomach, IVC, portal vein & splenic vein, pancreas, and the right and left adrenal glands. This promising result indicates that our method distills a reasonable amount of knowledge from the additional partially-labeled data and that the regularization loss helps the network enhance discriminative information to a certain degree.

Meanwhile, we also observe a distinct performance improvement for organs other than the partially-labeled structure (i.e., the liver): for instance, the gallbladder, stomach, IVC and pancreas are all boosted. This suggests that the superiority of PaNN originates not only from more training data, but also from the fact that PaNN effectively incorporates anatomical priors on organ sizes in the abdominal region, which is helpful for multi-organ segmentation.
Qualitative Evaluation. We also show a set of qualitative examples in Fig. 4, where we zoom in to visualize the finer details of the improved regions.

In these samples, PaNN is the only method that successfully detects the pancreatic tail in Fig. 4(a). In Fig. 4(b), all other methods fail to detect the portal vein and splenic vein, while PaNN demonstrates an almost perfect detection of these veins. From Fig. 4(c) to Fig. 4(e), apart from the evident improvements on the pancreas, the left adrenal gland, one of the smallest abdominal organs, is also clearly segmented by PaNN.
4.3 MICCAI 2015 Multi-Atlas Labeling Challenge
To further demonstrate the effectiveness of PaNN, we test our model in the 2015 MICCAI Multi-Atlas Abdomen Labeling challenge. The top model we submit (denoted "PaNN" in Table 4) is based on ResNet-101 and trained on all 30 cases of the fully-labeled dataset together with the union of the three partially-labeled datasets A, B and C. The evaluation metrics employed in this challenge are the Dice score, the average surface distance [32] and the Hausdorff distance [22]. We compare PaNN with the other top submissions on the challenge leaderboard in Table 4. As shown, the proposed PaNN achieves the best performance under all three evaluation metrics, surpassing the prior best results by a large margin. Without using any additional data, and even when randomly removing partial labels from the challenge data, our method (denoted "PaNN*" in Table 4) still obtains a state-of-the-art result, outperforming the previous best result of DLTK U-Net [28] in average Dice. It is noteworthy that our method is still far from its maximum potential, as we only use 2D single-view algorithms; it has been suggested [45, 38, 44] that using multi-view algorithms or model ensembles can further boost performance.
4.4 Generalization to Other Datasets
To further demonstrate the effectiveness of PaNN, we also apply our algorithm to a different set of abdominal clinical CT images, where 20 cases are used for training and 15 cases for testing. A total of 9 structures (spleen, right kidney, left kidney, gallbladder, liver, stomach, aorta, IVC, pancreas) are manually labeled. Each case was segmented by four experienced radiologists and confirmed by an independent senior expert. We use the union of all three datasets A, B and C as partial supervision. The results are summarized in Table 3, where the proposed PaNN again achieves the best results compared with existing methods.
Table 3. Results on the additional abdominal CT dataset (Dice).

Organ        Fully       Semi        Partially          PaNN
             Supervised  Supervised  Supervised (ours)  (ours)
Gallbladder  0.8225      0.8399      0.8465             0.8467
Aorta        0.9110      0.9096      0.9121             0.9133
IVC          0.8083      0.8175      0.7995             0.8266
Pancreas     0.7831      0.7994      0.8079             0.8193
avg. Dice    0.9008      0.9060      0.9063             0.9103
5 Conclusion
In this work, we have presented PaNN for multi-organ segmentation, as a way to better utilize existing partially-labeled datasets. In several applications such as radiation therapy and computer-aided surgery, physicians and surgeons already routinely segment target structures, producing such partially-labeled data. To handle the background ambiguity brought by partially-labeled data, the proposed PaNN exploits anatomical priors by regularizing the organ size distributions of the network output to approximate their prior statistics in the abdominal region. PaNN shows promising results with current state-of-the-art 2D/3D models. As an additional benefit, PaNN also demonstrates great potential for applications such as computational anatomy, by combining these datasets to build a consistent dataset with multi-organ segmentations.
References
 [1] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoderdecoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
 [2] W. Bai, O. Oktay, M. Sinclair, H. Suzuki, M. Rajchl, G. Tarroni, B. Glocker, A. King, P. M. Matthews, and D. Rueckert. Semi-supervised learning for network-based cardiac MR image segmentation. In MICCAI, 2017.

 [3] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What's the point: Semantic segmentation with point supervision. In European Conference on Computer Vision, pages 549–565, 2016.
 [4] H. Chen, Q. Dou, L. Yu, J. Qin, and P.-A. Heng. VoxResNet: Deep voxelwise residual networks for brain segmentation from 3D MR images. NeuroImage, 2017.
 [5] L.C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with taskspecific edge detection using cnns and a discriminatively trained domain transform. In CVPR, 2016.
 [6] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
 [7] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
 [8] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3d unet: learning dense volumetric segmentation from sparse annotation. In MICCAI, 2016.
 [9] J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.

 [10] A. V. Dalca, J. Guttag, and M. R. Sabuncu. Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9290–9299, 2018.
 [11] Q. Dou, H. Chen, Y. Jin, L. Yu, J. Qin, and P.-A. Heng. 3D deeply supervised network for automatic liver segmentation from CT volumes. In MICCAI, 2016.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [13] M. P. Heinrich. Multi-organ segmentation using deeds, self-similarity context and joint fusion. 2015.
 [14] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker. Efficient multiscale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical image analysis, 36:61–78, 2017.
 [15] H. Kervadec, J. Dolz, M. Tang, E. Granger, Y. Boykov, and I. B. Ayed. Constrainedcnn losses for weakly supervised segmentation. Medical Image Analysis, 2019.
 [16] M. Lai. Deep learning for medical image segmentation. arXiv preprint arXiv:1505.02000, 2015.
 [17] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribblesup: Scribblesupervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3159–3167, 2016.
 [18] G. Lin, A. Milan, C. Shen, and I. D. Reid. Refinenet: Multipath refinement networks for highresolution semantic segmentation. In CVPR, 2017.
 [19] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.

 [20] Y. Liu, J. Chen, and L. Deng. An unsupervised learning method exploiting sequential output statistics. In NIPS, 2017.
 [21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [22] F. Milletari, N. Navab, and S.-A. Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In International Conference on 3D Vision, 2016.
 [23] I. Nogues, L. Lu, X. Wang, H. Roth, G. Bertasius, N. Lay, J. Shi, Y. Tsehay, and R. M. Summers. Automatic lymph node cluster segmentation using holisticallynested neural networks and structured optimization in ct images. In MICCAI, 2016.
 [24] M. S. Nosrati and G. Hamarneh. Incorporating prior knowledge in medical image segmentation: a survey. arXiv preprint arXiv:1607.01092, 2016.
 [25] O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. A. Cook, A. De Marvao, T. Dawes, D. P. O‘Regan, et al. Anatomically constrained neural networks (acnns): application to cardiac image enhancement and segmentation. IEEE transactions on medical imaging, 37(2):384–395, 2018.
 [26] G. Papandreou, L.C. Chen, K. Murphy, and A. L. Yuille. Weakly and semisupervised learning of a dcnn for semantic image segmentation. In ICCV, 2015.
 [27] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multi-class multiple instance learning. 2015.
 [28] N. Pawlowski, S. I. Ktena, M. C. Lee, B. Kainz, D. Rueckert, B. Glocker, and M. Rajchl. Dltk: State of the art reference implementations for deep learning on medical images. arXiv preprint arXiv:1711.06853, 2017.
 [29] M. Rajchl, M. C. Lee, O. Oktay, K. Kamnitsas, J. PasseratPalmbach, W. Bai, M. Damodaram, M. A. Rutherford, J. V. Hajnal, B. Kainz, et al. Deepcut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE transactions on medical imaging, 36(2):674–683, 2017.
 [30] O. Ronneberger, P. Fischer, and T. Brox. Unet: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
 [31] H. R. Roth, L. Lu, A. Farag, H.C. Shin, J. Liu, E. B. Turkbey, and R. M. Summers. Deeporgan: Multilevel deep convolutional networks for automated pancreas segmentation. In MICCAI, 2015.
 [32] H. R. Roth, L. Lu, N. Lay, A. P. Harrison, A. Farag, A. Sohn, and R. M. Summers. Spatial aggregation of holisticallynested convolutional neural networks for automated pancreas localization and segmentation. Medical image analysis, 45:94–107, 2018.
 [33] H. R. Roth, C. Shen, H. Oda, T. Sugino, M. Oda, Y. Hayashi, K. Misawa, and K. Mori. A multiscale pyramid of 3d fully convolutional networks for abdominal multiorgan segmentation. In MICCAI, 2018.
 [34] A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. van Ginneken, A. KoppSchneider, B. A. Landman, G. Litjens, B. Menze, et al. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063, 2019.
 [35] N. Souly, C. Spampinato, and M. Shah. Semi supervised semantic segmentation using generative adversarial network. In ICCV, 2017.
 [36] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers. Normalized cut loss for weaklysupervised cnn segmentation. In CVPR, pages 1818–1827, 2018.
 [37] M. Tang, F. Perazzi, A. Djelouah, I. Ben Ayed, C. Schroers, and Y. Boykov. On regularized losses for weaklysupervised cnn segmentation. In ECCV, pages 507–522, 2018.
 [38] Y. Wang, Y. Zhou, W. Shen, S. Park, E. K. Fishman, and A. L. Yuille. Abdominal multiorgan segmentation with organattention networks and statistical fusion. arXiv preprint arXiv:1804.08414, 2018.
 [39] Y. Wang, Y. Zhou, P. Tang, W. Shen, E. K. Fishman, and A. L. Yuille. Training multiorgan segmentation networks with sample selection by relaxed upper confident bound. In MICCAI, 2018.
 [40] J. Xu, A. G. Schwing, and R. Urtasun. Tell me what you see and i will show you where it is. In CVPR, pages 3190–3197, 2014.

 [41] Q. Yu, L. Xie, Y. Wang, Y. Zhou, E. K. Fishman, and A. L. Yuille. Recurrent saliency transformation network: Incorporating multi-stage visual cues for small organ segmentation. In CVPR, pages 8280–8289, 2018.
 [42] Y. Zhang, L. Yang, J. Chen, M. Fredericksen, D. P. Hughes, and D. Z. Chen. Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In MICCAI, 2017.
 [43] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
 [44] Y. Zhou, Y. Wang, P. Tang, W. Shen, E. K. Fishman, and A. L. Yuille. Semi-supervised multi-organ segmentation via multi-planar co-training. In WACV, 2019.
 [45] Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L. Yuille. A fixed-point model for pancreas segmentation in abdominal CT scans. In MICCAI, 2017.