Abstract
Accurate and robust segmentation of abdominal organs on CT is essential for many clinical applications such as computer-aided diagnosis and computer-aided surgery. But this task is challenging due to the weak boundaries of organs, the complexity of the background, and the variable sizes of different organs. To address these challenges, we introduce a novel framework for multi-organ segmentation of abdominal regions using organ-attention networks with reverse connections (OAN-RCs), which are applied to 2D views of the 3D CT volume and output estimates that are combined by statistical fusion exploiting structural similarity. More specifically, the OAN is a two-stage deep convolutional network, where deep network features from the first stage are combined with the original image in a second stage to reduce the complex background and enhance the discriminative information for the target organs. Intuitively, the OAN reduces the effect of the complex background by focusing attention so that each organ only needs to be discriminated from its local background. RCs are added to the first stage to give the lower layers more semantic information, thereby enabling them to adapt to the sizes of different organs. Our networks are trained on 2D views (slices), enabling us to use holistic information and allowing efficient computation (compared to using 3D patches). To compensate for the limited cross-sectional information of the original 3D volumetric CT, e.g., the connectivity between neighboring slices, multi-sectional images are reconstructed from the three different 2D view directions. Then we combine the segmentation results from the different views using statistical fusion, with a novel term relating the structural similarity of the 2D views to the original 3D structure. To train the network and evaluate results, structures were manually annotated by four human raters and confirmed by a senior expert on normal cases.
We tested our algorithm by four-fold cross-validation and computed Dice-Sørensen similarity coefficients (DSC) and surface distances to evaluate our estimates of the structures. Our experiments show that the proposed approach gives strong results and outperforms 2D- and 3D-patch-based state-of-the-art methods in terms of DSC and mean surface distances.
1 Introduction
Segmentation of internal structures, like body organs, in medical images is an essential task for many clinical applications such as computer-aided diagnosis (CAD), computer-aided surgery (CAS) and radiation therapy (RT). However, despite intensive study of automatic and semi-automatic segmentation methods, there remain challenges which need to be overcome before these methods can be applied in clinical environments. In particular, detailed abdominal organ segmentation on CT is a challenging task both for manual human annotation and for automatic segmentation algorithms, for various reasons including the morphological complexity of the structures, large inter- and intra-subject variations, and image characteristics such as the low contrast of soft tissues.
Early studies of abdominal organ segmentation focused on specific single organs, for example relatively large isolated structures such as the liver [12, 23, 20] or critical structures such as blood vessels [17, 19]. However, most of these algorithms were based on specific features of the target organ, so extensibility to the simultaneous segmentation of multiple organs was limited. For multi-organ segmentation, atlas-based approaches were adopted in many applications [13, 2, 7, 37, 15, 40, 16]. The general framework of atlas-based segmentation is to deformably register selected atlas images, with segmented structures, to the target image. Critical issues for this approach, which affect performance accuracy, include proper atlas selection, accurate deformable image registration, and label fusion. In particular, for the abdominal region, inter-subject variations are relatively large compared with other parts of the body (e.g., the brain), so the segmentation results depend on deformable registration between subjects from the limited set of atlases, which is a challenging problem that critically affects the final accuracy. In addition, computational time is strongly dependent on the number of atlases. Therefore, selection of the proper number and types of atlases is a critical factor for both accuracy and efficiency.
Recently, learning-based approaches exploiting large datasets have been applied to the segmentation of medical images [9, 8, 24, 25, 33, 4, 26, 11, 14, 41]. In particular, deep convolutional neural networks (CNNs) have been very successful [9, 8, 28, 29, 24, 33, 4, 26, 11, 14]. Targets include regions in the brain [4, 11, 14], chest [33], and abdomen [9, 28, 29]. The performance of CNNs for organs (and even tumors) reaches, or outperforms, alternative state-of-the-art methods. Unlike multi-atlas-based approaches, deep networks do not require selecting a specific atlas or deformable registration from training sets to a target image. In this study, we apply deep network approaches to abdominal organ segmentation. Most studies based on deep networks, however, focused on single-structure segmentation, particularly for abdominal regions, and there are few studies of multi-organ segmentation, partly due to technical challenges discussed later. We note that fully convolutional networks (FCNs) [21] have been generally accepted for organ segmentation on CT scans [8, 39, 30], partly because they give state-of-the-art performance for semantic segmentation of natural images [21, 5]. But there are three major characteristics of abdominal CT which we must address in order to obtain strong performance on multi-organ segmentation.
Firstly, many abdominal organs have weak boundaries between spatially adjacent structures on CT, e.g. between the head of the pancreas and the duodenum. In addition, the entire CT volume includes a large variety of complex structures. Morphological and topological complexity includes anatomically connected structures such as the gastrointestinal (GI) tract (stomach, duodenum, small bowel and colon) and vascular structures. The correct anatomical borders between connected structures may not always be visible in CT, especially in sectional images (i.e., 2D slices), and may be indicated only by subtle texture and shape changes, which causes uncertainty even for human experts. This makes it hard for deep networks to distinguish the target organs from the complex background.
Secondly, there are large variations in the relative sizes of different target organs, e.g. the liver compared to the gallbladder. This causes problems when applying deep networks to multi-organ segmentation because lower layers typically lack semantic information when segmenting small structures. The same problem has been observed in semantic segmentation of natural images, where the segmentation performance on small regions is typically much worse than on large regions, motivating the need to introduce mechanisms which attend to scale [6].
Thirdly, although CT scans are high-resolution three-dimensional volumes, most current deep network methods were designed for 2D images. To overcome the limitations of using 2D CNNs for 3D images, Setio et al. [33] used multiple 2D patches reconstructed from different directions around the target region for the task of pulmonary nodule detection. Zhuang et al. [40] used 2D axial, coronal, and sagittal slices for pancreas detection at the coarse level and also for segmentation at the finer level. More recently, there are studies which use 3D deep networks [8, 24, 30, 14, 27]. These, however, are not networks that act on the entire 3D CT volume but instead are local patch-based approaches (due to the challenges of 3D deep networks discussed later in this paragraph). To address the problems caused by restricting to image patches, [30, 14] used a hierarchical approach with multiple resolutions, which reduces the dimension of the whole volume for initial detection and focuses on smaller regions at the finer resolution. But this strategy is best suited to a single target structure. Roth et al. [27] applied a bigger patch size to deal with the whole dense pancreatic volume, but this was also for single-organ pancreas segmentation and is hard to extend to the whole abdominal region. In general, 3D deep networks face far greater challenges than 2D deep networks. Both approaches rely heavily on graphics processing units (GPUs), but GPUs have limited memory, which makes it difficult to deal with full 3D CT volumes compared to 2D CT slices (which require much less memory). In addition, 3D deep networks typically require many more parameters than 2D deep networks and hence much more training data, unless they are restricted to patches.
But there is limited training data for abdominal CT images, because annotating them is challenging and requires expert human radiologists; this makes it particularly difficult to apply 3D deep networks to abdominal multi-organ segmentation. We have, however, implemented a 3D patch-based approach for comparison.
To deal with the technical difficulties of abdominal multi-organ segmentation on CT, we introduce a novel framework of organ-attention 2D deep networks with reverse connections (OAN-RC), followed by statistical fusion that combines the information from the three different views, exploiting structural similarity using local isotropic 3D patches. The OAN is a two-stage deep network, which computes an organ-attention map (OAM) from a typical probability map of labels for input images in the first stage and combines the OAM with the original input image for the second stage. This two-stage strategy effectively reduces the complexity of the background while enhancing the discriminative information of the target structures (by concentrating attention close to the target structures). By training the OAM with an additional deep network, uncertainties and errors from the first stage are adjusted and the fidelity of the final probability map is improved. In this procedure, we apply reverse connections [18] to the first stage so that we can localize organ information at different scales by assisting the lower layers with semantic information. More specifically, we apply the OAN-RC to each sectional slice, which is an extreme form of an anisotropic local patch but includes the whole semantic (i.e. volume) information from one viewing direction. This yields segmentation information from separate sets of multi-sectional images (axial, coronal, and sagittal planes in this study, similar to most medical image platforms for 2D visualization). We statistically fuse the three sources of information using local isotropic 3D patches based on direction-dependent local structural similarity. The basic fusion framework uses expectation-maximization (EM), similar to [36, 2]. But, unlike typical statistical fusion methods used for atlas-based segmentation, the input volumes and the target volumes for segmentation in our problem are the same. Different structures and texture patterns, seen from different viewing directions, will often generate non-identical segmentations in 3D. Our strategy is to exploit structural similarity by computing a direction-dependent local property at each voxel. This models the structural similarity from the 2D images to the original 3D structure (in the 3D volume) by local weights. This structural statistical fusion improves our overall performance by combining the information from the three different views in a principled manner while also imposing local structure. Figure 1 describes the graphical concept of our framework. Our proposed algorithm was tested on abdominal CT scans of normal cases collected as part of the FELIX project for pancreatic cancer research [22]. In experiments, our method showed robust, high fidelity to the ground truth for all target structures, with smooth boundaries. It outperformed 3D-patch-based algorithms as well as 2D-based ones in terms of Dice similarity coefficient and average surface distance, with memory and computational efficiency.
2 Organ-Attention Networks with Reverse Connections
Given a 3D volume of interest (VOI) of a scanned CT image , our goal is to find the label of each voxel . The target structures (i.e., the labeled structures) are restricted to organs which do not overlap with each other, so every voxel should be assigned a label in a finite set . In this section we introduce our proposed organ-attention networks with reverse connections (denoted OAN-RC), which are run separately on three different views; in the next section we describe our novel structural-similarity statistical fusion method, which combines the segmentation results obtained from the OAN-RCs on the three different views.
2.1 Two-stage Organ-Attention Network
We first introduce the OAN, which is composed of two jointly optimized stages. The first stage (stageI) transforms the organ segmentation probability map to provide spatial attention to the second stage (stageII), so that the segmentation network trained in stageII is more discriminative for segmenting organs (because it only has to deal with local context). To assist the lower layers in stageI with more semantic information, we employ reverse connections (Sec. 2.2), which pass semantic information down from high layers to low layers. The OAN is trained in an endtoend fashion to enhance the learning ability of all stages.
The input images to our OAN are reconstructed 2D slices from axial, sagittal and coronal directions. Based on the normal vector directions of the sagittal (
), coronal () and axial () planes, we denote the 2D images by , and respectively, where and are the numbers of slices for the three directions, respectively, and . Following the work of [39], we train an individual OAN for each direction. Fig. 2 illustrates our organ-attention network architecture. The network consists of two stages, where each stage is a segmentation network. For notational simplicity, we denote an input 2D slice by and its corresponding label map by . Stage-I outputs a probability map
for each label at every pixel, where the probability density function
is a segmentation network parameterized by . We use FCN [21] with reverse connections, explained in Sec. 2.2, as ; FCN is the backbone network throughout the paper. Each element is the probability that the th pixel in the input slice belongs to label , where is the background and are target organs. We define , where is the activation value of the th pixel on the th channel dimension. Let be the activation map. The objective function to minimize for is given by (1)
where is an indicator function.
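The multi-view slicing described above, producing separate axial, coronal, and sagittal slice stacks from one CT volume, can be sketched as follows; the (z, y, x) axis ordering is an assumption of this sketch, not stated in the paper:

```python
import numpy as np

def extract_view_slices(volume):
    """Split a 3D CT volume into 2D slice stacks along the three
    anatomical axes. Assumed axis order (z, y, x):
    axis 0 -> axial, axis 1 -> coronal, axis 2 -> sagittal."""
    axial = [volume[k, :, :] for k in range(volume.shape[0])]
    coronal = [volume[:, k, :] for k in range(volume.shape[1])]
    sagittal = [volume[:, :, k] for k in range(volume.shape[2])]
    return axial, coronal, sagittal

vol = np.zeros((4, 5, 6))  # toy volume: 4 axial, 5 coronal, 6 sagittal slices
ax, co, sa = extract_view_slices(vol)
```

Each stack then feeds its own OAN, one network per viewing direction.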
Using a preliminary organ segmentation map to guide the computation of a better organ segmentation can be thought of as employing an attentional mechanism. To this end, we propose an organ-attention module given by
(2) 
where denotes the convolution operator, denotes the convolutional filters, and is the bias. Eq. (2) embeds cross-organ information into a single organ-attention map, , which learns discriminative spatial attention for different organs automatically. By combining it with the original input , we get an image which emphasizes each organ by
(3) 
where is the element-wise product operator. We apply as the input to stage-II, and the probability map of stage-II then becomes .
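A minimal sketch of the organ-attention module of Eqs. (2)-(3): for illustration, the learned convolution is replaced by a 1x1 convolution (a weighted sum over label channels), and the weights `w` and bias `b` are stand-ins for the learned parameters:

```python
import numpy as np

def organ_attention(prob_map, image, w, b):
    """Sketch of Eqs. (2)-(3). `prob_map` is the C-channel stage-I
    probability map (C, H, W); `w` (shape (C,)) and `b` are stand-ins
    for the learned filters and bias."""
    # Eq. (2): collapse cross-organ information into one attention map
    # (a 1x1 convolution reduces to a weighted channel sum).
    attention = np.tensordot(w, prob_map, axes=([0], [0])) + b
    # Eq. (3): gate the original input image element-wise.
    return attention * image

C, H, W = 3, 8, 8
P = np.random.rand(C, H, W)          # toy stage-I probability map
I = np.random.rand(H, W)             # toy input slice
I_att = organ_attention(P, I, w=np.ones(C) / C, b=0.0)
```

The gated image `I_att` is what stage-II receives in place of the raw slice.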
In order to drive stage-II to focus on organ regions without needing to deal with the complicated non-local background, we define a selection function, where is the probability map provided by stage-I. In stage-II, we only accept a region if
and do not backpropagate it to stage-I. The loss function for stage-II is formulated as
(4) 
To jointly optimize stage-I and stage-II, we define a loss function aimed at estimating the parameters , , , and by optimizing
(5) 
where and are the fusion weights.
2.2 Reverse Connections
FCNs [21] have shown good segmentation results in recent studies, especially for single-organ segmentation. However, for multi-organ segmentation, lower layers typically lack semantic information, which may lead to inaccurate segmentation, particularly for smaller structures. Therefore, we propose reverse connections, inspired by [18], which feed coarse-scale (high) layer information backward to fine-scale (low) layers for semantic segmentation of multi-scale structures. This enables us to connect abstract high-level semantic information to the more detailed lower layers so that all the target organs have similar levels of detail and abstract information at the same layer. The reverse connections framework for stage-I is shown in Fig. 3, and Fig. 4 illustrates a reverse connection block. Let denote the reverse connection map of the th convolutional layer in the backbone network, i.e. FCN in this study, where is the output of the th convolutional layer. A convolutional layer (with channels by kernels) is added after , and a deconvolutional layer (with channels by kernels) is applied after . is then obtained via an element-wise summation of these two maps. is the output of a convolutional layer (with channels by kernels) grafted onto . Let denote the corresponding weights for obtaining . Following [18], we add reverse connections from to .
With these learnable reverse connections, the semantic information of the lower layers can be enriched. In order to drive the learned reverse connection maps to produce segmentation results approaching the ground truth, we associate each reverse connection map with a classifier. As the side-output layers proposed in [18] are designed for detection purposes, they are not suitable for our task. Instead we follow the side-outputs used in [38]. More specifically, a convolutional layer (with channels by kernels) is added on top of , whose output is denoted as , followed by a deconvolutional layer (with channels). We denote the weights of the th side-output layer by . The loss function for the side-output layers is defined as (6)
where , and is the probability output of the th side-output layer.
In order to combine the learned reverse connection maps of fine and coarse layers, we add up the predictions (i.e., ) of the reverse connection maps gradually from high layers to low layers. First, is fused with an upsampling of by element-wise addition. Then we follow the same strategy and gradually merge and , as shown in Fig. 5. To obtain a fused activation map from the activation maps of both the side-outputs (i.e., ) and the convolutional layers in the backbone network (i.e., ), a scale function is adopted, followed by element-wise addition:
(7) 
where indicates the th channel of the activation map, and and are fusion weights. The fused probability map, , can then be obtained by . The final objective function for stage-I is defined by
(8) 
where , and are fusion weights, and
(9) 
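The top-down merging of the reverse-connection side outputs can be sketched as follows; the 2x scale step between adjacent maps and the nearest-neighbour upsampling (standing in for the learned deconvolution layer) are assumptions of this sketch:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling (a stand-in for the learned
    deconvolution layer in the paper)."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def fuse_top_down(side_maps):
    """Merge side-output maps from coarse (last) to fine (first):
    each coarser map is upsampled 2x and added element-wise to the
    next finer one, as described for the reverse connections."""
    fused = side_maps[-1]
    for finer in reversed(side_maps[:-1]):
        fused = finer + upsample2x(fused)
    return fused

# Toy maps at three scales, finest first.
maps = [np.ones((16, 16)), np.ones((8, 8)), np.ones((4, 4))]
out = fuse_top_down(maps)
```

After two merge steps every position has accumulated all three scales, so `out` has the finest resolution.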
Note that in our full system, with the two-stage organ-attention network and reverse connections, all the parameters are optimized simultaneously by standard backpropagation:
(10) 
2.3 Testing Phase
In the testing stage, given a slice , we obtain the stage-I and stage-II probability maps by
(11) 
where denotes the network functions defined in Sec. 2.1. A fused probability map of and is then given by
(12) 
The final label map is determined by .
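A sketch of this test-time step; the weighted average below is an illustrative stand-in for the fusion of Eq. (12), whose exact form (and weights) is defined in the paper:

```python
import numpy as np

def fuse_and_label(p_stage1, p_stage2, w1=0.5, w2=0.5):
    """Fuse the stage-I and stage-II probability maps (C, H, W) and
    take the per-pixel argmax over the C label channels. The simple
    weighted average is an assumption, not the paper's exact Eq. (12)."""
    p_fused = w1 * p_stage1 + w2 * p_stage2
    return np.argmax(p_fused, axis=0)   # label map, shape (H, W)

C, H, W = 4, 8, 8
p1 = np.random.rand(C, H, W)
p2 = np.random.rand(C, H, W)
labels = fuse_and_label(p1, p2)
```

Running this per slice, per viewing direction, yields the three candidate segmentations that the statistical fusion of Sec. 3 combines.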
3 Statistical Label Fusion Based on Local Structural Similarity
As described in Sec. 1, our OAN-RC is based on 2D images, which are an extreme case of 3D anisotropic patches. In this section, we propose to fuse the anisotropic information obtained from different viewing directions using isotropic 3D local patches to estimate the final segmentation. Let us denote the segmentation results by , which are obtained as described in Sec. 2.3 from the axial (Z), sagittal (X), and coronal (Y) OAN-RCs. Depending on the viewing direction, sectional images contain different structures and may show different texture patterns in the same organs. These differences can cause non-identical segmentations by the deep networks in 3D, as shown in Fig. 6. In addition, there is no guarantee of connectivity between neighboring slices, since slices are used independently for training and testing. Possible naïve approaches for determining the final 3D segmentation from the OAN-RC results are boolean operations such as union or intersection. Majority voting (MV) is another candidate for efficient fusion; however, these approaches assume the same global weights for all OAN-RC results. From the observation that segmentation performance levels, e.g. sensitivity, can differ across viewing directions for each organ, we treat the performance level as an unknown variable when computing the labeling probability. This concept is similar to label fusion algorithms using the expectation-maximization (EM) framework, such as STAPLE (simultaneous truth and performance level estimation) and its extensions [36, 1, 2].
Let us denote the true label of the voxel by , which is unknown, and the unknown performance level parameter of segmentation by . The segmentations from the deep networks are observed values. Under this condition, the basic EM framework iterates two steps: 1) compute , the expected value of the log likelihood, , under the current estimate of the parameters at iteration ; and 2) find the parameters which maximize .
The maximization step can be written as
(13) 
By assuming independence between and in our problem, the second term in (13) becomes free of and the maximization step can be written as
(14) 
Therefore, we redefine as .
The performance level parameter in this framework is a global property representing the overall confidence of the deep network segmentation for the whole volume. However, it can also vary according to voxel spatial location via the local and neighboring structures, since we use 2D slices for the initial segmentation. Therefore, we propose to combine the local structural similarity, seen from a specific viewing direction relative to the original 3D volume, with the global performance level, conceptually similar to local weighted voting [31]. We compute the probability of correspondence between the 2D images and the 3D volume by the structural similarity (SSIM) [35]:
(15)  SSIM_d(v) = (2 μ_a μ_b + C1)(2 σ_ab + C2) / ((μ_a² + μ_b² + C1)(σ_a² + σ_b² + C2))
where SSIM_d(v) is the SSIM from viewing direction d at voxel v, C1 and C2 are user-defined constants, and a and b represent the local 2D and 3D patches centered at the voxel, respectively. μ_a and σ_a are the average and standard deviation of patch a (likewise μ_b and σ_b for patch b), and σ_ab is the covariance of a and b. Fig. 7 shows an example of the structural similarity computed from different viewing directions as a color map. Considering the local image properties, the expectation of the log likelihood function in our problem becomes
(16) 
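The directional SSIM term of Eq. (15) follows the standard SSIM definition; in this sketch the constants `c1`, `c2` and the flattened-patch interface are illustrative choices:

```python
import numpy as np

def patch_ssim(a, b, c1=1e-4, c2=9e-4):
    """Standard SSIM between two flattened local patches. In the paper,
    `a` would be a local 2D patch from one viewing direction and `b`
    the matching (resampled) 3D patch; c1 and c2 are the user-defined
    stabilising constants."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

p = np.linspace(0.0, 1.0, 27)   # e.g. a flattened 3x3x3 toy patch
s_same = patch_ssim(p, p)       # identical patches give SSIM of 1
```

A high SSIM at a voxel means the 2D view locally resembles the true 3D structure, so that view's label receives more weight in the fusion.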
The global underlying performance level parameters of the deep network segmentations is defined as
(17) 
where is the probability of the voxel labeled as from the deep network with the current estimated performance value , when the true label is .
To make the problem simple, we assume conditional independence between labeling and the original volume intensities. The labeling probability with the target image intensity then becomes
(18) 
3.1 Estep
In the expectation step (E-step), we estimate the probability of voxelwise labels. Let us denote by the probability that the true label of voxel is at the iteration. When the deep network segmentations and the performance level parameters at the iteration are given, can then be described as
(19) 
where is the vector of all . Given the independence between , and , we apply Bayes' theorem to (19).
(20) 
where is the prior for the voxel. By applying (18) to (20), we then obtain the probability of voxelwise labeling as
(21) 
3.2 Mstep
In the maximization step (M-step), the goal is to find the performance parameters, , which maximize (16) given the current parameters. Considering each and independently, the expectation of the log likelihood function in (16) can be expressed with the voxelwise probability estimated in the E-step. The performance parameter of each segmentation is then found by maximizing the summation of voxelwise probabilities as
(22) 
where at voxel. By applying (19) and (18), (22) becomes
(23) 
From the definition of in (17), the summation of the probability mass function, , must be 1, and (22) becomes a constrained optimization problem which can be solved by introducing a Lagrange multiplier, . We then obtain the optimal solution by setting the first gradient to zero as
(25) 
By substituting the constraint of , we can obtain the final optimal solution as
(26) 
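A simplified sketch of the E-step/M-step loop above, using only the global confusion-matrix performance parameters (the paper additionally weights each voxel by the local structural similarity of Eq. (15), which this sketch omits):

```python
import numpy as np

def em_label_fusion(segs, n_labels, n_iter=10):
    """Simplified STAPLE-style EM fusion of K hard segmentations,
    each given as a flat integer label array over the same voxels."""
    K, V = len(segs), segs[0].size
    prior = np.full(n_labels, 1.0 / n_labels)   # flat label prior
    # theta[k, true, observed]: near-identity initial confusion matrices.
    theta = np.stack([np.eye(n_labels) * 0.9 + 0.1 / n_labels
                      for _ in range(K)])
    for _ in range(n_iter):
        # E-step: posterior W[v, l] that the true label of voxel v is l.
        W = np.tile(prior, (V, 1))
        for k in range(K):
            W *= theta[k][:, segs[k]].T         # theta[k, l, observed label]
        W /= W.sum(axis=1, keepdims=True)
        # M-step: re-estimate each network's confusion matrix.
        for k in range(K):
            for obs in range(n_labels):
                theta[k][:, obs] = W[segs[k] == obs].sum(axis=0)
            theta[k] /= theta[k].sum(axis=1, keepdims=True)
    return W.argmax(axis=1)

# Three toy "networks" that happen to agree on six voxels.
segs = [np.array([0, 0, 1, 1, 2, 2])] * 3
fused = em_label_fusion(segs, n_labels=3, n_iter=5)
```

With agreeing inputs the confusion matrices converge toward identity and the fused labels reproduce the consensus; disagreeing inputs are arbitrated by the estimated per-network performance.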
3.3 Parallel Computing Using GPUs
The fusion step can be computed efficiently in parallel on a GPU. The local structural similarity of the th voxel in the th deep network and the prior can be computed for each voxel and saved as a preprocessing step. In the EM iterations, as shown in (21), the probability can be computed and updated for each structure at each voxel. In our implementation, a GPU thread is logically allocated to each voxel. However, to reduce memory use and computation cost, the target volume of interest (VOI) for each structure is computed in a region extended by voxels in each direction in our implementation. For parallel computing, one CPU thread is allocated to each structure and launches a GPU kernel to compute the EM iteration for that structure.
4 Experimental Results
We evaluated our methods on abdominal CT images of normal cases under an IRB (Institutional Review Board)-approved protocol at Johns Hopkins Hospital as part of the FELIX project for pancreatic cancer research [22]. CT images were obtained on Siemens Healthineers (Erlangen, Germany) SOMATOM Sensation and Definition CT scanners. CT scans are composed of slices of images and have voxel spatial resolution of . All CT scans are contrast-enhanced and were obtained in the portal venous phase.
A total of structures for each case were segmented by four human annotators/raters, one case per person, and confirmed by an independent senior expert. The structures include the aorta, colon, duodenum, gallbladder, inferior vena cava (IVC), kidneys (left, right), liver, pancreas, small bowel, spleen, stomach, and large veins. Vascular structures were segmented only outside of the organs in order to make the structures mutually exclusive (i.e. no overlaps).
As explained in Sec. 2, we used OAN-RCs for multi-organ segmentation whose backbone FCNs had been pretrained on the dataset of [10]. From the possible variants of FCNs (e.g., FCN-32s, FCN-16s, and FCN-8s), which differ in how they combine the fine detailed predictions [34], we selected FCN-8s in this study because it captures very fine details in the and pooling layers and keeps high-level semantic contextual information from the final layer. Our algorithm was implemented and tested on a workstation with an Intel i7-6850K CPU and an NVIDIA TITAN X (Pascal) GPU. With cases, the initial segmentations using OAN-RCs were tested by four-fold cross-validation. All input images to the OAN-RCs are enlarged times by upsampling, which led to improved performance in our experiments.
In the fusion step, the average probabilities of are taken as priors in (21), and the initial performance levels were computed by randomly selecting 5 cases and comparing them to the ground truth. To compute the local patch-based structural similarity in (15), patches of size cubes were used for the 3D volume. Since CT voxels are not always isotropic and spatial resolutions can differ between scan volumes, we resampled the 3D patches to cubes of side voxels so that the same sizes of 3D patches and 2D patches from all directions could be used for all cases in our experiments.
The final segmentation results using OAN-RC with local structural-similarity-based statistical fusion (LSSF) were compared with the 3D-patch-based state-of-the-art approaches, 3D U-Net [8] and hierarchical 3D FCN (HFCN) [30], as well as 2D-based FCN, OAN, and OAN-RC with majority voting (MV). For a quantitative comparison, we computed the well-known Dice-Sørensen similarity coefficient (DSC) and surface distances, using the manual annotations as ground truth. For a structure, DSC is computed as DSC(Y, Z) = 2|Y ∩ Z| / (|Y| + |Z|), where Y is the estimated segmentation and Z is the ground truth, i.e. the manual annotation in this study. The surface distance was computed from each vertex of the ground truth to the estimates of our algorithms. Fig. 8 shows comparison results as box plots, while Tables 1 and 2 report the means and standard deviations for all the cases.
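The DSC used for evaluation can be computed directly from its definition:

```python
import numpy as np

def dice(seg, gt):
    """Dice-Sørensen coefficient between a binary segmentation estimate
    and the ground truth: DSC = 2|Y ∩ Z| / (|Y| + |Z|)."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(seg, gt).sum() / (seg.sum() + gt.sum())

a = np.array([1, 1, 0, 0])   # toy estimate
b = np.array([1, 0, 0, 0])   # toy ground truth
# |Y ∩ Z| = 1, |Y| = 2, |Z| = 1  ->  DSC = 2/3
```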
As shown in Fig. 8, the basic OAN-RC outperforms the other state-of-the-art approaches, and our local structural-similarity-based fusion improves the results even more. We note that although DSC shows the relative overall volume similarity, it does not quantify the boundary smoothness or the boundary noise of the results. But evaluating the surface distances, see below, shows that our method works effectively for both the whole volumes and the boundaries of the organs.
Structure  3D U-Net  HFCN  FCN MV  OAN MV  OAN-RC MV  OAN-RC LSSF

Aorta  87.0 ± 12.3  88.3 ± 8.8  85.0 ± 4.2  85.5 ± 4.2  85.3 ± 4.1  91.8 ± 3.5
Colon  77.0 ± 11.0  79.3 ± 9.2  80.3 ± 9.1  81.5 ± 9.4  82.0 ± 8.8  83.0 ± 7.4
Duodenum  66.8 ± 12.8  70.3 ± 10.4  70.2 ± 11.3  72.6 ± 11.4  73.4 ± 11.1  75.4 ± 9.1
Gallbladder  85.4 ± 10.3  87.9 ± 7.5  87.8 ± 8.3  88.9 ± 6.2  89.4 ± 6.1  90.5 ± 5.3
IVC  80.8 ± 10.2  84.7 ± 5.9  84.0 ± 6.0  85.6 ± 5.8  86.0 ± 5.5  87.0 ± 4.2
Kidney (L)  83.9 ± 22.4  95.2 ± 2.6  96.1 ± 2.0  96.2 ± 2.2  95.9 ± 2.3  96.8 ± 1.9
Kidney (R)  88.0 ± 14.4  95.6 ± 4.5  95.8 ± 4.9  95.9 ± 4.9  96.0 ± 2.5  98.4 ± 2.1
Liver  91.4 ± 9.9  95.7 ± 1.8  96.8 ± 0.8  97.0 ± 0.9  97.0 ± 0.8  98.0 ± 0.7
Pancreas  79.3 ± 11.7  81.4 ± 10.8  84.3 ± 4.9  86.2 ± 4.5  86.6 ± 4.3  87.8 ± 3.1
Small bowel  69.9 ± 17.3  71.1 ± 15.0  76.9 ± 14.0  78.0 ± 13.8  79.0 ± 13.4  80.1 ± 10.2
Spleen  89.6 ± 9.5  93.1 ± 2.1  96.3 ± 1.9  96.4 ± 1.9  96.4 ± 1.7  97.1 ± 1.5
Stomach  90.1 ± 7.2  93.2 ± 5.4  93.9 ± 3.2  94.2 ± 2.9  94.2 ± 3.0  95.2 ± 2.6
Veins  60.7 ± 23.7  74.5 ± 10.5  74.8 ± 10.7  76.8 ± 11.2  77.4 ± 12.1  80.7 ± 9.3
Structure  3D U-Net  HFCN  FCN MV  OAN MV  OAN-RC MV  OAN-RC LSSF

Aorta  0.44 ± 1.01  0.42 ± 0.58  0.56 ± 0.47  0.47 ± 0.42  0.44 ± 0.28  0.39 ± 0.21
Colon  6.75 ± 9.01  6.35 ± 8.12  6.27 ± 7.44  5.65 ± 7.25  4.07 ± 5.72  3.59 ± 4.17
Duodenum  2.01 ± 2.46  1.70 ± 2.18  1.71 ± 2.25  1.49 ± 1.87  1.54 ± 1.43  1.36 ± 1.31
Gallbladder  1.31 ± 0.76  1.21 ± 0.50  1.22 ± 0.52  1.12 ± 0.50  1.05 ± 0.41  0.95 ± 0.37
IVC  1.57 ± 1.53  1.15 ± 1.05  1.26 ± 1.08  1.16 ± 1.38  1.12 ± 1.24  1.08 ± 1.03
Kidney (L)  0.77 ± 1.04  0.41 ± 0.42  0.36 ± 0.47  0.34 ± 0.47  0.30 ± 0.33  0.30 ± 0.30
Kidney (R)  1.39 ± 2.01  1.03 ± 1.68  1.05 ± 1.74  0.74 ± 1.32  0.54 ± 1.09  0.45 ± 0.89
Liver  1.89 ± 3.21  1.60 0  1.61 ± 2.98  1.39 ± 2.64  1.32 ± 1.74  1.23 ± 1.52
Pancreas  1.78 ± 1.05  1.51 ± 0.80  1.41 ± 0.88  1.19 ± 0.82  1.17 ± 0.72  1.05 ± 0.65
Small bowel  4.21 ± 5.78  4.01 ± 6.01  3.91 ± 6.05  3.20 ± 4.05  3.37 ± 5.48  3.01 ± 3.35
Spleen  0.98 ± 0.56  0.59 ± 0.37  0.60 ± 0.36  0.56 ± 0.40  0.47 ± 0.27  0.42 ± 0.25
Stomach  2.78 ± 5.89  2.50 ± 5.02  2.51 ± 5.13  2.36 ± 5.65  1.88 ± 1.64  1.68 ± 1.55
Veins  2.31 ± 4.51  1.75 ± 3.51  1.69 ± 3.61  1.92 ± 6.48  1.40 ± 3.61  1.21 ± 3.05
Tables 1 and 2 report the means and standard deviations of performance measures for 13 critical organs. Similar to the box plots, they show that our OAN-RCs with statistical fusion improve the overall mean performance and also reduce the standard deviations significantly.
The OAN-RC training and testing can be computed in parallel for each view direction. In our experiments, training took hours for iterations on training cases, and the average testing time per volume was seconds. The fusion time depended on the volume of the target structure; the average computation time for organs was 6.87 seconds.
5 Discussion
Multi-organ segmentation using OAN-RCs alone, without the statistical fusion, gave similar or better performance compared with the state-of-the-art approaches summarized in [16]. In the specific case of the pancreas, state-of-the-art methods reported segmentation accuracies (mean ± standard deviation) of on 140 cases [32], on 150 cases [16], on cases [29], and (on the whole slice) versus (reduced region of interest) on cases [39] in terms of DSC. We cannot make a direct comparison because the CT images and manual segmentations (i.e. annotations) used as ground truth in these datasets differ from each other. But our OAN-RC segmentations on our larger dataset show similar or better performance in terms of DSC. Among the target organs, our performance on structures such as the gallbladder and pancreas, which are relatively small and have particularly weak boundaries, improves significantly over basic FCNs and over OANs without reverse connections.
Moreover, as shown in Sec. 4, our statistical fusion based on local structural similarity improves the overall segmentation accuracy in terms of both DSC and average surface distance. In particular, there are significant improvements in the minimum values, as shown in Fig. 8, which helps explain the robustness of the algorithm. The differences can be seen more clearly in the 3D surface visualizations in Figs. 10 and 11. The noise in the deep-network segmentations is distributed over large regions with little connectivity, and occasionally the views show significantly different patterns. Our fusion step exploits structural similarity and outputs clean, smooth boundaries by effectively combining the different views based on the local structure of the original 3D volume.
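The idea of similarity-weighted multi-view fusion can be illustrated as follows. This is a simplified sketch, not the paper's exact formulation: an SSIM-style local similarity is computed over a small window between each view's reconstructed volume and the original volume, and the three view-wise probability maps are then fused by a per-voxel weighted average. All function names are hypothetical:

```python
import numpy as np
from scipy import ndimage

def local_similarity(a, b, win=3, eps=1e-6):
    """SSIM-style local structural similarity between two volumes,
    computed from local means, variances, and covariance (simplified)."""
    f = lambda x: ndimage.uniform_filter(x, size=win)
    mu_a, mu_b = f(a), f(b)
    var_a = f(a * a) - mu_a ** 2
    var_b = f(b * b) - mu_b ** 2
    cov = f(a * b) - mu_a * mu_b
    return (2 * cov + eps) / (var_a + var_b + eps)

def fuse_views(prob_maps, sim_maps):
    """Per-voxel weighted average of view-wise probability maps,
    weighted by each view's local similarity to the 3D volume."""
    w = np.clip(np.stack(sim_maps), 0, None) + 1e-6  # keep weights positive
    w /= w.sum(axis=0)                               # normalize across views
    return (w * np.stack(prob_maps)).sum(axis=0)
```

When all views agree equally well with the local 3D structure, this reduces to a plain average; where one view matches the local structure better, its prediction dominates.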
Several considerations must be addressed when applying the proposed method and interpreting the evaluation results.
As shown in our experiments, our proposed algorithm also outperforms 3D patch-based approaches. 3D (isotropic) patch-based approaches have several issues that make them hard to apply to this problem. Larger patch sizes require more parameters and hence more training data or, if this is not available, significant data augmentation (e.g., by scaling, rotation, and elastic deformation). In addition, practical GPU memory limitations restrict how far the patch size can be expanded. The limited patch size means that the deep network's receptive fields contain only limited local information, which is problematic for multi-organ segmentation, and the discontinuities between patches also raise problems. Solutions to these three problems may make 3D patch-based methods work better in the future. Unlike 3D approaches, the local structural similarity used in our fusion method effectively combines the information from anisotropic patches into 3D at each voxel. Fig. 9 shows an example generated by our proposed algorithm, which is visually indistinguishable from manual segmentation for almost all target structures.
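The receptive-field argument above can be made concrete with the standard recurrence for stacked convolution/pooling layers; this helper is illustrative, not from the paper:

```python
def receptive_field(layers):
    """Receptive field (in input voxels) of a stack of conv/pool layers,
    each given as a (kernel_size, stride) pair, via the recurrence
    rf += (k - 1) * jump; jump *= s."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) strides
        jump *= s             # accumulated stride between output positions
    return rf
```

For example, three 3×3×3 convolutions of stride 1 yield a receptive field of only 7 voxels, so a network trained on small 3D patches cannot see context beyond the patch, whatever its depth in whole-volume terms.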
The ground truth used in this study for training and evaluation was specified by manual annotations from human observers. It is well known that there can be significant inter- and intra-observer variation in manual segmentation. As explained before, the ground truth was created by four human observers and visually checked by experts, and we randomly divided the testing groups in our 4-fold cross-validation to avoid biased comparison. However, it is still possible that inaccuracies due to human variability affect the evaluation as well as the training. This could be explored further in separate experiments.
Another consideration when applying the proposed approach is image quality, which can affect both the manual annotations and the deep-network segmentation results. Various factors such as spatial resolution, level of artifacts, and reconstruction kernels should be considered. The dataset used in this study was collected between and at the same institution, with control over the scanning parameters. As explained in Sec. 4, the CT protocol is the portal venous phase and the spatial resolution is almost isotropic. Different scanning parameters and artifacts may therefore affect our algorithm's performance when it is applied to other datasets.
The same issues regarding manual segmentation and image quality arise in segmentation and evaluation generally. Specific to our proposed approach, especially the fusion step, the way the prior used in (21) is computed can in practice affect the final segmentation. Since the deep-network segmentation results from the different viewing directions are obtained independently, taking the mean is generally acceptable. However, if the deep-network segmentations show clear tendencies towards over- or under-estimation, then different types of prior models may be needed in order to improve the final result for practical applications.
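A minimal sketch of this choice of prior, under the assumption that the prior is simply the voxel-wise mean of the independent view-direction probability maps; the `gamma` exponent is a hypothetical calibration knob, not part of the paper's formulation:

```python
import numpy as np

def view_mean_prior(view_probs, gamma=1.0):
    """Voxel-wise prior as the mean of independently obtained
    view-direction probability maps (e.g., axial/coronal/sagittal).
    gamma > 1 suppresses a systematic over-estimation tendency;
    gamma < 1 compensates for under-estimation (both hypothetical)."""
    prior = np.mean(np.stack(view_probs), axis=0)
    return np.clip(prior, 0.0, 1.0) ** gamma
```

With `gamma=1.0` this is the plain mean prior discussed above; a non-unit `gamma` is one simple way a differently shaped prior model could be substituted if the view-wise segmentations were found to be biased.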
One of the main advantages of our algorithm is its efficient computation time. Segmenting the organs of a whole volume takes about 1 minute or less, with better performance than the state-of-the-art methods reported in [16]. Hence our approach can be practically useful in clinical environments.
6 Conclusion
In this paper, we proposed a novel framework for multi-organ segmentation using OAN-RCs with statistical fusion exploiting structural similarity. Our two-stage organ-attention network reduces uncertainties at weak boundaries, focuses attention on organ regions with simple context, and adjusts FCN errors by training on the combination of the original images and OAMs. Reverse connections deliver abstract-level semantic information to lower layers, so that the hidden layers contain more semantic information and give good results even for small organs. The results are further improved by the statistical fusion based on local structural similarity, which smooths out noise and removes biases, leading to better overall segmentation performance in terms of DSC and surface distances. We showed that our performance is better than previous state-of-the-art algorithms. Our framework is not specific to any particular body region, but gives high-quality, robust results for abdominal CT, a typically challenging region due to low contrast, large intra-/inter-subject variation, and different organ scales. In addition, the efficient computation time of our algorithm makes our approach practical for clinical environments such as CAD, CAS, or RT.
References
 1. A. J. Asman and B. A. Landman. Formulating spatially varying performance in the statistical fusion framework. IEEE Transactions on Medical Imaging, 31(6):1326–1336, 2012.
 2. A. J. Asman and B. A. Landman. Non-local statistical label fusion for multi-atlas segmentation. Medical Image Analysis, 17(2):194–208, 2013.
 3. Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.
 4. H. Chen, Q. Dou, L. Yu, J. Qin, and P.-A. Heng. VoxResNet: Deep voxel-wise residual networks for brain segmentation from 3D MR images. NeuroImage, 2017.
 5. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. 2016.
 6. L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3640–3649, 2016.
 7. C. Chu, M. Oda, T. Kitasaka, K. Misawa, M. Fujiwara, Y. Hayashi, Y. Nimura, D. Rueckert, and K. Mori. Multi-organ segmentation based on spatially-divided probabilistic atlas from 3D abdominal CT images. Lecture Notes in Computer Science, 8150 LNCS(PART 2):165–172, 2013.
 8. Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. Lecture Notes in Computer Science, 9901 LNCS:424–432, 2016.
 9. Q. Dou, H. Chen, Y. Jin, L. Yu, J. Qin, and P. A. Heng. 3D deeply supervised network for automatic liver segmentation from CT volumes. Lecture Notes in Computer Science, 9901 LNCS:149–157, 2016.
 10. M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
 11. M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.-M. Jodoin, and H. Larochelle. Brain tumor segmentation with deep neural networks. Medical Image Analysis, 35:18–31, 2017.
 12. T. Heimann, B. van Ginneken, M. Styner, Y. Arzhaeva, V. Aurich, C. Bauer, A. Beck, C. Becker, R. Beichel, G. Bekes, F. Bello, G. Binnig, H. Bischof, A. Bornik, P. Cashman, Y. Chi, A. Cordova, B. Dawant, M. Fidrich, J. Furst, D. Furukawa, L. Grenacher, J. Hornegger, D. Kainmüller, R. Kitney, H. Kobatake, H. Lamecker, T. Lange, J. Lee, B. Lennon, R. Li, S. Li, H. Meinzer, G. Nemeth, D. Raicu, A. Rau, E. van Rikxoort, M. Rousson, L. Rusko, K. Saddi, G. Schmidt, D. Seghers, A. Shimizu, P. Slagmolen, E. Sorantin, G. Soza, R. Susomboon, J. Waite, A. Wimmer, and I. Wolf. Comparison and evaluation of methods for liver segmentation from CT datasets. IEEE Transactions on Medical Imaging, 28:1251–1265, 2009.
 13. J. E. Iglesias and M. R. Sabuncu. Multi-atlas segmentation of biomedical images: A survey. Medical Image Analysis, 24(1):205–219, 2015.
 14. K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis, 36:61–78, 2017.
 15. T. Okada, M. G. Linguraru, M. Hori, R. M. Summers, N. Tomiyama, and Y. Sato. Abdominal multi-organ segmentation from CT images using conditional shape-location and unsupervised intensity priors. Medical Image Analysis, 26(1):1–18, 2015.
 16. K. Karasawa, M. Oda, T. Kitasaka, K. Misawa, M. Fujiwara, C. Chu, G. Zheng, D. Rueckert, and K. Mori. Multi-atlas pancreas segmentation: Atlas selection based on vessel structure. Medical Image Analysis, 39:18–28, 2017.
 17. C. Kirbas and F. Quek. A review of vessel extraction techniques and algorithms. ACM Computing Surveys, 36:81–121, 2004.
 18. T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen. RON: reverse connection with objectness prior networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 19. D. Lesage, E. D. Angelini, I. Bloch, and G. Funka-Lea. A review of 3D vessel lumen segmentation techniques: Models, features and extraction schemes. Medical Image Analysis, 13:819–845, 2009.
 20. G. Li, X. Chen, F. Shi, W. Zhu, J. Tian, and D. Xiang. Automatic liver segmentation based on shape constraints and deformable graph cut in CT images. IEEE Transactions on Image Processing, 24:5315–5329, 2015.
 21. J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
 22. C. Lugo-Fagundo, B. Vogelstein, A. Yuille, and E. K. Fishman. Deep learning in radiology: Now the real work begins. Journal of the American College of Radiology, 15:364–367, 2018.
 23. A. M. Mharib, A. R. Ramli, S. Mashohor, and R. B. Mahmood. Survey on liver CT image segmentation methods. Artificial Intelligence Review, 37, 2012.
 24. F. Milletari, N. Navab, and S. A. Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571, 2016.
 25. J. Nascimento and G. Carneiro. Multi-atlas segmentation using manifold learning with deep belief networks. In Proceedings of the International Symposium on Biomedical Imaging (ISBI), 2016.
 26. M. Rajchl, M. C. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. A. Rutherford, J. V. Hajnal, B. Kainz, and D. Rueckert. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE Transactions on Medical Imaging, 36(2):674–683, 2017.
 27. H. Roth, M. Oda, N. Shimizu, H. Oda, Y. Hayashi, T. Kitasaka, M. Fujiwara, K. Misawa, and K. Mori. Towards dense volumetric pancreas segmentation in CT using 3D fully convolutional networks. 2017.
 28. H. R. Roth, A. Farag, L. Lu, E. B. Turkbey, and R. M. Summers. Deep convolutional networks for pancreas segmentation in CT imaging. 2015.
 29. H. R. Roth, L. Lu, A. Farag, A. Sohn, and R. M. Summers. Spatial aggregation of holistically-nested networks for automated pancreas segmentation. Lecture Notes in Computer Science, 9901 LNCS:451–459, 2016.
 30. H. R. Roth, H. Oda, Y. Hayashi, M. Oda, N. Shimizu, M. Fujiwara, K. Misawa, and K. Mori. Hierarchical 3D fully convolutional networks for multi-organ segmentation. 2017.
 31. M. Sabuncu, B. Yeo, K. van Leemput, B. Fischl, and P. Golland. A generative model for image segmentation based on label fusion. IEEE Transactions on Medical Imaging, 29:1714–1729, 2010.
 32. A. Saito, S. Nawano, and A. Shimizu. Joint optimization of segmentation and shape prior from level-set-based statistical shape model, and its application to the automated segmentation of abdominal organs. Medical Image Analysis, 2015.
 33. A. A. A. Setio, F. Ciompi, G. Litjens, P. Gerke, C. Jacobs, S. J. Van Riel, M. M. W. Wille, M. Naqibullah, C. I. Sanchez, and B. Van Ginneken. Pulmonary nodule detection in CT images: False positive reduction using multi-view convolutional networks. IEEE Transactions on Medical Imaging, 35(5):1160–1169, 2016.
 34. E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, April 2017.
 35. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
 36. S. K. Warfield, K. H. Zou, and W. M. Wells. Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging, 23(7):903–921, 2004.
 37. R. Wolz, C. Chu, K. Misawa, M. Fujiwara, K. Mori, and D. Rueckert. Automated abdominal multi-organ segmentation with subject-specific atlas generation. IEEE Transactions on Medical Imaging, 32(9):1723–1730, 2013.
 38. S. Xie and Z. Tu. Holistically-nested edge detection. In IEEE International Conference on Computer Vision (ICCV), 2015.
 39. Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L. Yuille. A fixed-point model for pancreas segmentation in abdominal CT scans. In Medical Image Computing and Computer Assisted Intervention (MICCAI), 2017.
 40. X. Zhuang and J. Shen. Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Medical Image Analysis, 31:77–87, 2016.
 41. C. Zu, Z. Wang, D. Zhang, P. Liang, Y. Shi, D. Shen, and G. Wu. Robust multi-atlas label propagation by deep sparse representation. Pattern Recognition, 63:511–517, 2017.