Elastic Boundary Projection for 3D Medical Imaging Segmentation

12/03/2018 ∙ by Tianwei Ni, et al. ∙ Johns Hopkins Medicine 0

We focus on an important yet challenging problem: using a 2D deep network to deal with 3D segmentation for medical imaging analysis. Existing approaches either applied multi-view planar (2D) networks or directly used volumetric (3D) networks for this purpose, but both of them are not ideal: 2D networks cannot capture 3D contexts effectively, and 3D networks are both memory-consuming and less stable arguably due to the lack of pre-trained models. In this paper, we bridge the gap between 2D and 3D using a novel approach named Elastic Boundary Projection (EBP). The key observation is that, although the object is a 3D volume, what we really need in segmentation is to find its boundary which is a 2D surface. Therefore, we place a number of pivot points in the 3D space, and for each pivot, we determine its distance to the object boundary along a dense set of directions. This creates an elastic shell around each pivot which is initialized as a perfect sphere. We train a 2D deep network to determine whether each ending point falls within the object, and gradually adjust the shell so that it gradually converges to the actual shape of the boundary and thus achieves the goal of segmentation. EBP allows 3D segmentation without cutting the volume into slices or small patches, which stands out from conventional 2D and 3D approaches. EBP achieves promising accuracy in segmenting several abdominal organs from CT scans.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 3

page 6

page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Medical imaging analysis (MedIA), in particular 3D organ segmentation, is an important prerequisite of computer-assisted diagnosis (CAD), which implies a broad range of applications. Recent years, with the blooming development of deep learning, convolutional neural networks have been widely applied to this area 

[21][20], which largely boosts the performance of conventional segmentation approaches based on handcrafted features [15][16], and even surpasses human-level accuracy in many organs and soft tissues.

Existing deep neural networks for medical imaging segmentation can be categorized into two types, differing from each other in the dimensionality of the processed object. The first type cuts the 3D volume into 2D slices, and trains a 2D network to deal with each slice either individually [29] or sequentially [5]. The second one instead trains a 3D network to deal with volumetric data directly [20][18]. Although the latter was believed to have potentially a stronger ability to consider 3D contextual information, it suffers from two weaknesses: (1) the lack of pre-trained models makes the training process unstable and the parameters tuned in one organ less transferrable to others, and (2) the large memory consumption makes it difficult to receive the entire volume as input, yet fusing patch-wise prediction into the final volume remains non-trivial yet tricky.

2D-Net [21][29] 3D-Net [20][31] AH-Net [18] EBP (ours)
Pure 2D network?
Pure 3D network?
Working on 3D data?
3D data not cropped?
3D data not rescaled?
Can be pre-trained?
Table 1: A comparison between EBP and previous approaches in network dimensionality, data dimensionality, and the ways of pre-processing data and network weights. Due to space limit, we do not cite all related work here – see Section 2 for details.

In this paper, we present a novel approach to bridge the gap between 2D networks and 3D segmentation. Our idea comes from the observation that an organ is often single-connected and locally smooth, so, instead of performing voxel-wise prediction, segmentation can be done by finding its boundary which is actually a 2D surface. Our approach is named Elastic Boundary Projection (EBP

), which uses the spherical coordinate system to project the irregular boundary into a rectangle, on which 2D networks can be applied. EBP starts with a pivot point within or without the target organ and a elastic shell around it. This shell, parameterized by the radius along different directions, is initialized as a perfect sphere (all radii are the same). The goal is to adjust the shell so that it eventually converges to the boundary of the target organ, for which we train a 2D network to predict whether each ending point lies inside or outside the organ, and correspondingly increase or decrease the radius at that direction. This is an iterative process, which terminates when the change of the shell is sufficiently small. In practice, we place a number of pivots in the 3D space, and summarize all the converged shells for outlier removal and 3D reconstruction.

Table 1 shows a comparison between EBP and previous 2D and 3D approaches. EBP enjoys three-fold advantages. First, EBP allows using a 2D network to perform volumetric segmentation, which absorbs both training stability and contextual information. Second, with small memory usage, EBP processes a 3D object entirely without cutting it into slices or patches, and thus prevents the trouble in fusing predictions. Third, EBP can sample abundant training cases by placing a number of pivots, which is especially useful in the scenarios of limited data annotations. We evaluate EBP in segmenting several organs in abdominal CT scans, and demonstrate its promising performance.

The remainder of this paper is organized as follows. Section 2 briefly reviews related work, and Section 3 describes the proposed EBP algorithm. After experiments are shown in Section 4, we draw our conclusions in Section 5.

2 Related Work

Computer aided diagnosis (CAD) is a research area which aims at helping human doctors in clinics. Currently, a lot of CAD approaches start from medical imaging analysis to obtain accurate descriptions of the scanned organs, soft tissues, etc.. One of the most popular topics in this area is object segmentation, i.e., determining which voxels belong to the target in 3D data, such as abdominal CT scans studied in this paper. Recently, the success of deep convolutional neural networks for image classification [13][26][10][11] has been transferred to object segmentation in both natural images [25][6] and medical images [21][20].

One of the most significant differences between natural and medical images lies in data dimensionality: natural images are planar (2D) while medical data such as CT and MRI scans are volumetric (3D). To deal with it, researchers proposed two major pipelines. The first one cut each 3D volume into 2D slices, and trained 2D networks to process each of them individually [21]. Such methods often suffer from missing 3D contextual information, for which various techniques were adopted, such as using 2.5D data (stacking a few 2D images as different input channels) [22][23], training deep networks from different viewpoints and fusing multi-view information at the final stage [30][28][29], and applying a recurrent network to process sequential data [5][3]. The second one instead trained a 3D network to deal with volumetric data [7][20]. These approaches, while being able to see more information, often require much larger memory consumption, and so most existing methods worked on small patches [9][31], which left a final stage to fuse the output of all patches. In addition, unlike 2D networks that can borrow pre-trained models from natural image datasets [8], 3D networks were often trained from scratch, which often led to unstable convergence properties [27]. One possible solution is to decompose each 3D convolution into a 2D-followed-by-1D convolution [18]. A discussion on 2D vs. 3D models for medical imaging segmentation is available in [14].

Prior to the deep learning era, planar image segmentation algorithms were often designed to detect the boundary of a 2D object [2][24][17]. Although these approaches have been significantly outperformed by deep neural networks in the area of medical imaging analysis [15][16], we borrow the idea of finding the 2D boundary instead of the 3D volume and design our approach.

3 Elastic Boundary Projection

Figure 1: The overall flowchart of EBP (best viewed in color). We show the elastic shell after some specific numbers of iterations (green voxels in the second row) generated by a pivot (the red center voxel in the second row) within an boundary of the organ (blue voxels in 2nd row). The data generation process starts from a perfect sphere initialized by , and then we obtain the pairs (the third and fourth row) by in the training stage. In the testing stage, is predicted by our model given . After that, one iteration is completed by the adjustment of to by the addition of . Finally, the elastic shell converges to .

3.1 Problem, Existing Methods and Drawbacks

The problem we are interested in is to segment an organ from abdominal CT scans. Let an input image be , a 3D volume with voxels, and each voxel indicates the intensity at the specified position measured by the Haunsfield unit (HU). The label shares the same dimension with , and indicates the class annotation of . Without loss of generality, we assume that where indicates the target organ and the background. Suppose our model predicts a volume , and and are the foreground voxels in ground-truth and prediction, respectively, we can compute segmentation accuracy using the Dice-Sørensen coefficient (DSC): , which has a range of with implying a perfect prediction.

Let us denote the goal as . Thus, there are two typical ways of designing . The first one trains a 3D model to deal with volumetric data directly [7][20], while the second one works by cutting the 3D volume into slices and using 2D networks for segmentation [22][29]

. Both 2D and 3D approaches have their advantages and disadvantages. We appreciate the ability of 3D networks to take volumetric cues into consideration (radiologists also exploit 3D information to make decisions), however, 3D networks are sometimes less stable, arguably because we need to train all weights from scratch, while the 2D networks can be initialized using pre-trained models from the computer vision literature (

e.g., RSTN [29] borrowed FCN [19] as initialization). On the other hand, processing volumetric data (e.g., 3D convolution) often requires heavier computation in both training and testing. We aim at designing an algorithm which takes both benefits of 2D and 3D approaches.

3.2 EBP: the Overall Framework

Our algorithm is named Elastic Boundary Projection (EBP). As the name shows, our core idea is to predict the boundary of an organ instead of every pixel in it.

Consider a binary volume with indicating the foreground voxel set. We define its boundary as a set of (continuous) coordinates that are located between the foreground and background voxels111The actual definition used in implementation is slightly different – see Section 3.4 for details.. Since is a 2D surface, we can parameterize it using a rectangle, and then apply a 2D deep network to solve it. We first define a set of pivots which are randomly sampled from the region-of-interest (ROI), e.g., the 3D bounding-box of the object. Then, in the spherical coordinate system, we define a fixed set of directions , in which each

is a unit vector

, i.e., , for . For each pair of and , there is a radius indicating how far the boundary is along this direction, i.e., 222If is located outside the boundary, there may exist some directions that the ray does not intersect with . In this case, we define , i.e., along these directions, the boundary collapses to the pivot itself. In all other situations (including is located within the boundary), there may be more than one ’s that satisfy this condition, in which cases we take the maximal . When there is a sufficient number of pivots, the algorithm often reconstructs the entire boundary as expected. See Sections 3.4 and 3.5 for implementation details.. When is not convex, it is possible that a single pivot cannot see the entire boundary, so we need multiple pivots to provide complementary information. Provided a sufficiently large number of pivots as well as a densely distributed direction set , we can approximate the boundary and thus recover the volume which achieves the goal of segmentation.

Therefore, volumetric segmentation reduces to the following problem: given a pivot and a set of directions , determine all so that . This task is difficult to solve directly, which motivates us to consider the following counter problem: given , and a group of values, determine whether these values correctly describe the boundary, i.e., whether each falls on the boundary. We train a model to achieve this goal. Here, the input is a generated image , where is the intensity value of at position

, interpolated by neighboring voxels if necessary. Note that

appears in a 2D rectangle. The output is a map of the same size, with each value indicating whether is located within, and how far it is from the boundary.

The overall flowchart of EBP is illustrated in Figure 1. In the training stage, we sample and generate pairs to optimize . In the testing stage, we randomly sample and initialize all ’s with a constant value, and use the trained model to iterate on each until convergence, i.e., all entries in are close to (as we shall see later, convergence is required because one-time prediction can be inaccurate). Finally, we perform 3D reconstruction using all ’s to recover the volume . We will elaborate the details in the following subsections.

3.3 Data Preparation: Distance to Boundary

In the preparation stage, based on a binary annotation , we aim at defining a relabeled matrix , with its each entry storing the signed distance between each integer coordinate and . The sign of indicates whether is located within the boundary (positive: inside; negative: outside; : on), and the absolute value indicates the distance between this point and the boundary (a point set). We follow the convention to define

(1)

where we use the -distance (the Euclidean distance) while a generalized -distance can also be used. We apply the KD-tree algorithm for fast search. If other distances are used, e.g., -distance, we can apply other efficient algorithms, e.g., floodfill, for constructing matrix . The overall computational cost is , where is the number of voxels and is the size of the boundary set.

Here are some technical details. The KD-tree is built on the set of boundary voxels, i.e., the integer coordinates with at least one (out of six) neighborhood voxels having a different label (foreground vs. background) from itself. There are in average such voxels for each case, and performing individual searches on this KD-tree takes around minutes. To accelerate, we limit which implies that all coordinates with a sufficiently large distance are truncated (this is actually more reasonable for training – see the next subsection). We filter all pixels with an -distance not smaller than 333Mathematically, -distance (the minimal coordinate difference among three axes) is smaller than or equal to -distance., which runs very fast444Using a modified version of the floodfill algorithm, we can find all these voxels in time. and typically reduces the number of searches to less than of . Thus, data preparation takes less than minute for each case.

After is computed, we multiply by for all background voxels, so that the sign of distinguishes inner voxels from outer voxels. In the following parts, can be a floating point coordinate, in which case we interpolate to obtain .

3.4 Training: Data Generation and Optimization

To optimize the model , we need a set of training pairs . To maximally reduce the gap between training and testing data distributions, we simulate the iteration process in the training stage and sample data on the way.

We first define the direction set . We use the spherical coordinate system, which means that each direction has an azimuth angle and a polar angle . To organize these directions into a rectangle, we represent as the Cartesian product of an azimuth angle set of elements and a polar angle set of elements where . The

azimuth angles are uniformly distributed,

i.e., , but the polar angles have a denser distribution near the equator, i.e., , so that the unit vectors are approximately uniformly distributed over the sphere. Thus, for each , we can find the corresponing and , and the unit direction vector satisfies , and , respectively. . In practice, we fix which is a tradeoff between sampling density (closely related to accuracy) and computational costs.

We then sample a set of pivots . At each , we construct a unit sphere with a radius of , i.e., for all , where the superscript indicates the number of undergone iterations. After the -th iteration, the coordinate of each ending point is computed by:

(2)

Using the coordinates of all , we look up the ground-truth to obtain an input-output data pair:

(3)

and then adjust accordingly555Eqn 4 is not strictly correct, because is not guaranteed to be the fastest direction along which goes to the nearest boundary. However, since is the shortest distance to the boundary, Eqn (4) does not change the inner-outer property of .:

(4)

until convergence is achieved and thus all ending points fall on the boundary or collapse to itself666If is located within the boundary, then all ending points will eventually converge onto the boundary. Otherwise, along all directions with being outside the boundary, will be gradually reduced to and thus the ending point collapses to itself. These collapsed ending points will not be considered in 3D reconstruction (see Section 3.6)..

When the 3D target is non-convex, there is a possibility that a ray has more than one intersections with the boundary. In this case, the algorithm will converge to the one that is closest to the initial sphere. We do not treat this issue specially in both training and testing, because we assume a good boundary can be recovered if (i) most ending points are close to the boundary and (ii) pivots are sufficiently dense.

Here we make an assumption: by looking at the projected image at the boundary, it is not accurate to predict that the radius along any direction should be increased or decreased by a distance larger than (we use in experiments). So, we constrain . This brings three-fold benefits. First, the data generation process becomes much faster (see the previous subsection); second, iteration allows to generate more training data; third and the most important, this makes prediction easier and more reasonable, as we can only expect accurate prediction within a small neighborhood of the boundary.

After the training set is constructed, we optimize with regular methods, e.g.

, stochastic gradient descent is used in this paper. As a side comment, our approach can generate abundant training data by increasing

and thus the sampling density of pivots, which is especially useful when the labeled training set is very small.

3.5 Testing: Iteration and Inference

The testing stage is mostly similar to the training stage, which starts with a set of randomly placed pivots and a unit sphere around each of them. We fix the parameters and iterate until convergence or the maximal number of rounds is reached (unlike training in which ground-truth is provided, iteration may not converge in testing). After that, all ending points of all pivots, except for those collapsed to the corresponding pivot, are collected and fed into the next stage, i.e., 3D reconstruction. The following techniques are applied to improve testing accuracy.

First, the input image at each training/testing round only contains intensity values at the current shell defined by . However, such information is often insufficient to accurately predict , so we complement it by adding more channels to . The -th channel is defined by radius values . There are two types of channels, with of them being used to sample the boundary and of them to sample the inner volume:

(5)

When and , it degenerates to using one single slice at the boundary. With relatively large and (e.g., in our experiments), we benefit from seeing more contexts which is similar to volumetric segmentation but the network is still 2D. The number of channels in remains to be regardless of and .

Second, we make use of the spatial consistency of distance prediction to improve accuracy. When the radius values at the current iteration are provided, we can randomly sample numbers where is small, add them to , and feed the noisy input to . By spatial consistency we mean the following approximation is always satisfied for each direction :

(6)

where is the angle between and the normal direction at . Although this angle is often difficult to compute, we can take the left-hand side of Eqn (6) as a linear function of

and estimate its value at

using multiple samples of . This technique behaves like data augmentation and improves the stability of testing.

Figure 2:

An example of 3D reconstruction (best viewed in color). We start with all pivots (green and blue points indicate ground-truth and predicted inner pivots, respectively) predicted to be located inside the target. In Step 1, all converged ending points generated by these pivots form the point clouds. In Step 2, a kernel density estimator (KDE) is applied to remove outliers (marked in a red oval in the second figure). In Step 3, we adopt a graphics algorithm for 3D reconstruction and finally we voxelize the point cloud.

Third, we shrink the gap between training and testing data distributions. Note that in the training stage, all values are generated using ground-truth, while in the testing stage, they are accumulated by network predictions. Therefore, inaccuracy may accumulate with iteration if the model is never trained on such “real” data. To alleviate this issue, in the training stage, we gradually replace the added term in Eqn (4) with prediction, following the idea of curriculum learning [1]. In Figure 1, we show that this strategy indeed improves accuracy in validation.

Last but not least, we note that most false positives in the testing stage are generated by outer pivots, especially those pivots located within another organ with similar physical properties. In this case, the shell may converge to an unexpected boundary which harms segmentation. To alleviate this issue, we introduce an extra stage to determine which pivots are located within the target organ. This is achieved by constructing a graph with all pivots being nodes and edges being connected between neighboring pivots. The weight of each edge is the the intersection-over-union (IOU) rate between the two volumes defined by the elastic shells. To this end, so we randomly sample several thousand points in the region-of-interest (ROI) and compute whether they fall within each of the shells, based on which we can estimate the IOU of any pivot pairs. Then, we find the minimum cut which partitions the entire graph to two parts, and the inner part is considered the set of inner pivots. Only the ending points produced by inner pivots are considered in 3D reconstruction.

3.6 3D Reconstruction

The final step is to collect all ending points and reconstruct the surface of the 3D volume. Since model is not perfect, there always exist many false positives (i.e., predicted ending points that do not fall on the actual boundary), so we adopt kernel density estimation (KDE) to remove them, based on the assumption that with a sufficient number of pivots, the density of ending points around the boundary is much larger than that in other regions. We use the Epanechnikov kernel with a bandwidth of , and preserve all integer coordinates with a log-likelihood not smaller than .

Finally, we apply a basic graphics framework to accomplish this goal, which works as follows. We first use the Delaunay triangulation to build the mesh structure upon the survived ending points, and then remove improper tetrahedrons with a circumradius larger than . After we obtain the alpha shape, we use the subdivide algorithm to voxelize it into volumes with hole filling. Finally, we apply surface thinning to the volumes by slices. This guarantees a closed boundary, filling which obtains final segmentation. We illustrate an example of 3D reconstruction in Figure 2.

3.7 Discussions and Relationship to Prior Work

The core contribution of EBP is to provide a 2D-based approach for 3D segmentation. To the best of our knowledge, this idea was not studied in the deep learning literature. Conventional segmentation approaches such as GraphCut [2] and GrabCut [24]

converted 2D segmentation to find the minimal cut, a 1D contour that minimizes an objective function, which shared a similar idea with us. Instead of manually defining the loss function by using voxel-wise or patch-wise difference, EBP directly measures the loss with a guess and iteratively approaches the correct boundary. This is related to the methods based on the

active contour [12][4].

In the perspective of dimension reduction, EBP adds a different solution to a large corpus of 2D segmentation approaches [21][22][23][30][29] which cut 3D volumes into 2D slices without considering image semantics. Our solution enjoys the ability of extracting abundant training data, i.e., we can sample from an infinite number of pivots (no need to have integer coordinates). This makes EBP stand out especially in the scenarios of fewer training data (see experiments). Also, compared to pure 3D approaches [7][20], we provide a more efficient way of sampling voxels which reduces computational overheads as well as the number of parameters, and thus the risk of over-fitting.

However, EBP also has some limitations. First, annotations of medical images can be noisy – sometimes, even the professional radiologists/doctors cannot find the accurate position of boundaries. Such inaccuracy only changes the loss function a little bit for voxel-wise (either 2D or 3D) segmentation approaches, but can largely impact EBP since we are only sampling the boundary voxels. This explains why we need Eqn (5) which provides more inner-volume information. Second, due to its working mechanism, it is still difficult for EBP to deal with 3D targets with heavily irregular geometric shapes, e.g., tiny blood vessels.

4 Experiments

4.1 Datasets and Evaluation Metric

We evaluate EBP in a dataset with high-resolution CT scans. The width and height of each volume are both , and the number of slices along the axial axis varies from to . These data were collected from some potential renal donors, and annotated by four expert radiologists in our team. Four abdominal organs were labeled, including left kidney, right kidney and spleen. Around hour is required for each scan. All annotations were later verified by an experienced board certified Abdominal Radiologist. We randomly choose half of these volumes for training, and use the remaining half for testing. The data split is identical for different organs. We compute DSC for each case individually, i.e., where and are ground-truth and prediction, respectively.

For the second dataset, we refer to the spleen subset in the Medical Segmentation Decathlon (MSD) dataset777http://medicaldecathlon.com/. This is a public dataset with cases, in which we randomly choose for training and the remaining are used for testing. This dataset has quite a different property from ours, as the spatial resolution varies a lot. Although the width and height are still both , the length can vary from to . DSC is also used for accuracy computation.

Two recenty published baselines named RSTN [29] and VNet [20] are used for comparison. RSTN is a 2D-based network, which uses a coarse-to-fine pipeline with a saliency transformation module to deal with each 2D slice. We follow the code provided by the authors to feed neighboring slices to each network, and fuses three (coronal, sagittal and axial) views into final prediction. VNet is a 3D-based network, which randomly crops into patches from the original patch for training, in order not to exceed the GPU memory limit (). In the testing stage, a sliding window of the same size is moved regularly along three axes, and each voxel is averaged over all predictions it gets. Though RSTN does not require a 3D bounding-box (ROI) while EBP and VNet do, this is considered a fair comparison in 3D approaches, because a 3D bounding-box is relatively easy to obtain.

Approach left kidney Approach right kidney
Average Max Min Average Max Min
RSTN [29] RSTN [29]
VNet [20] VNet [20]
EBP EBP
Approach spleen Approach MSD spleen
Average Max Min Average Max Min
RSTN [29] RSTN [29]
VNet [20] VNet [20]
EBP EBP
Table 2: Comparison of segmentation accuracy (DSC, ) on our multi-organ dataset and the spleen

class in the MSD benchmark. Within each group, average (with standard deviation), max and min accuracies are reported.

4.2 Quantitative Results

Results are summarized in Table 2. One can see that in all these organs, EBP achieves comparable segmentation accuracy with RSTN, and usually significantly outperforms VNet.

On our own data, EBP works slightly worse than RSTN, but on the spleen set, the worst case reported by RSTN has a much lower DSC () than that of EBP (). After diagnosis, we find that RSTN fails to detect a part this organ in a few continuous 2D slices, but EBP, by detecting the boundary, successfully recovers this case. This suggests that in many cases, EBP and RSTN can provide supplementary information to organ segmentation. Moreover, on the MSD spleen dataset, a challenging public dataset, EBP outperforms RSTN by more than . In addition, (i) the worst case in MSD spleen reported by EBP is , much higher than reported by RSTN; (ii) all standard deviations reported by RSTN are significantly larger. Both the above phenomena suggest that EBP enjoys higher stability.

We can observe that VNet often suffers even lower stability in terms of both standard deviation and worst accuracy. In addition, the training process is not guaranteed to converge to a good model, e.g., in right kidney of our own dataset, we trained two VNet models – one of them, as shown in Table 2, is slightly worse than both RSTN and EBP; while the other, reports even worse results: average, max and min DSCs, respectively. Similar phenomena, which are mainly due to the difficulty of optimizing a 3D-based network, were also observed in [7][18][28].

We also observe the impact of spatial resolution in the MSD spleen dataset. This dataset has a relatively low spatial resolution (i.e., voxels along the long axis), which raises extra difficulties to VNet (it requires patches to be sampled). To deal with this issue, we normalize all volumes so as to increase the number of slices along the long axis. The results of VNet shown in Table 2 are computed in this normalized dataset (in DSC computation, all volumes are normalized back to the original resolution for fair comparison), while both RSTN and EBP directly work on the original non-normalized dataset. In addition, VNet reports an average DSC on the non-normalized dataset, with a few cases suffering severe false negatives. This implies that VNet heavily depends on data homogeneity, while EBP does not.

4.3 Qualitative Results

Figure 3: 2D visualization of segmentation results (best viewed in color). In each row, from left to right: ground-truth, EBP, RSTN, VNet. The top part shows one special case in our multi-organ segmentation dataset. In this case, the image looks differently compared to most training images, due to some unusual situations during the CT scan. In this case, kidney segmentation results of both RSTN and VNet are heavily impacted whereas EBP works reasonably well. EBP produces unsatisfactory results on spleen segmentation, mainly because a part of pivots are not recognized as inner pivots. By simply tuning down the threshold by a little bit, EBP reports on spleen segmentation, which surpasses both RSTN and VNet. The bottom part shows a case in the MSD spleen dataset, which we can observe how imperfect annotation affects DSC evaluations. In both rows, the ground-truth annotations do not cover the entire spleen. RSTN and VNet somehow miss a small margin close to the boundary, while EBP produces obviously better results but gets lower DSC scores. This tells us (i) ground-truth annotations in medical images are often imperfect; (ii) DSC values above the human level (e.g., it can be defined as the average DSC between two individual human labelers, but such numbers are not available in most datasets) do not accurately reflect the absolute quality of segmentation, and in this scenario, a higher DSC does not guarantee better segmentation.

In this part, we compare the segmentation results by RSTN, VNet and EBP. We choose one case from our dataset and one case from the MSD spleen dataset, respectively.

To compare the different behaviors between EBP and two baselines, we display some slice-wise segmentation results in Figure 3. We can see that EBP often produces results in a good shape, even when the image is impacted by some unusual conditions in CT scan. In comparison, RSTN and VNet produce segmentation by merging several parts (RSTN: slices, VNet: patches), therefore, in such extreme situations, some parts can be missing and thus segmentation accuracy can be low. On the other hand, the most common issue that harms the accuracy of EBP is the inaccuracy in distinguishing inner pivots from outer pivots. Under regular conditions, EBP is often more sensitive to the boundary of the targets, as it is especially trained to handle these cases – an example comes from the visualization results in the MSD spleen dataset, which demonstrates that EBP sometimes produces better results than the ground-truth especially near the boundary areas.

4.4 How Does EBP Find the Boundary?

Now, we discuss an important and interesting problem, namely, how EBP finds the boundary and how confident it is. Two aspects are studied, namely, convergence and consistency, which are computed on one case in the subset of right kidney with a medium segmentation accuracy.

4.4.1 EBP Converges to the Boundary

We start with investigating convergence, by which we refer to whether each pivot , after a sufficient number of iterations, can converge to a boundary. We use the -norm of to measure convergence, the output of which indicates the amount of revision along the radius. With its value reaching at a low level (positive but smaller than ), perfect convergence is achieved. Results are shown in Figure 4. We can see that, starting from most inner pivots, the elastic shell can eventually converge to the boundary. In the figure, we show iterations, but in practice, for acceleration, we only perform iterations before sending all “ending points” to 3D reconstruction. This is to say, although convergence is not achieved and many ending points are not indeed located at the boundary, it is possible for 3D reconstruction algorithm to filter these outliers. This is because we have sampled a large number of pivots. Therefore, an ending point located near the boundary will be accompanied by a lot of others, while one located inside or even outside the target will be isolated. By applying kernel density estimation (KDE), we can filter out those isolated points so that 3D reconstruction is not impacted.

Figure 4: The -norm of during the first iterations. The thick curve is averaged over pivots, each of which appears as a thin curve.

4.4.2 EBP Discriminates Inner and Outer Pivots

Next, we investigate consistency, for which we take some pivot pairs and compute the DSC between the converged shells centered at them. This criterion was introduced in Section 3.5 to distinguish inner pivots from outer pivots. The assumption is that the shells generated by a pair of inner pivots should have a large similarity, while those generated by an inner pivot and an outer pivot should not. To maximally make fair comparison, we take all the boundary pivots, defined as the inner pivots with at least one neighbor being outside. Then, we sample all pivot pairs in which at least one of them is a boundary pivot, and make statistics. For those inner-inner pivot pairs, the average DSC () is much larger than that () of inner-outer pivot pairs. This experiment suggests that, two neighboring pivots are more likely to agree with each other if both of them are located within the target, otherwise the chance of getting a low DSC becomes large888There is a side note here. Theoretically, for an inner pivot and an outer pivot, if both elastic shells are perfectly generated, they should have a DSC of . However, it is not often the case, because the elastic shell of the outer pivot is also initialized as a sphere, which may intersect with the boundary. In this case, all ending points that are initially located within the target will start growing until they reach the other border of the target. Consequently, it has a non-zero DSC with some of the inner pivots..

Figure 5: The inter-pivot DSC recorded when a pivot keeps going along a fixed direction until it goes out of the target (we do not plot the curve beyond this point).

Last, we perform an interesting experiments to further reveal how inter-pivot DSC changes with the relative position of a pivot to the boundary. Starting from an inner pivot, we keep going along a fixed direction until being outside, and on the way, we record the DSC between the elastic shells generated by every neighboring pivot pairs. Some statistics are provided in Figure 5

. On all these curves, we observe a sudden drop at some place, which often indicates the moment that the pivot goes from inside to outside.

4.4.3 Summary

Summarizing all the information above, we conclude that EBP is indeed equipped with the ability to locate the boundary. However, since the distance between neighboring pivots is , we cannot find the accurate boundary from merely the pivots – they serve as starting points and EBP eventually finds the elastic shells that converge to the boundary.

5 Conclusions

This paper presents EBP, a novel approach that trains 2D deep networks for 3D object segmentation. The core idea is to build up an elastic shell and adjust it until it converges to the actual boundary of the target. Since the shell is parameterized in the spherical coordinate system, we can apply 2D networks (low computational overhead, fewer parameters, pre-trained models, etc.) to deal with volumetric data (richer contextual information). Experiments are performed on several organs in abdominal CT scans, and EBP achieves comparable performance to both 2D and 3D competitors. In addition, EBP can sample sufficient training data from few annotated examples, which claims its advantage in medical imaging analysis.

We learn from this work that high-dimensional data often contain redundancy (

e.g., not every voxel is useful in a 3D volume), and mining the discriminative part, though being challenging, often leads to a more efficient model. In the future, we will continue investigating this topic and try to cope with the weaknesses of EBP, so that it can be applied to a wider range of targets in the 3D world. Besides, we will continue investigating the possibility to overcome the weaknesses of EBP, especially for those observed in visualization.

Acknowledgments

This paper was supported by the Lustgarten Foundation for Pancreatic Cancer Research. We thank Prof. Zhouchen Lin for supporting our research. We thank Prof. Wei Shen, Dr. Yan Wang, Weichao Qiu, Zhuotun Zhu, Yuyin Zhou, Yingda Xia, Qihang Yu, Runtao Liu and Angtian Wang for instructive discussions.

References

  • [1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In

    International Conference on Machine Learning

    , 2009.
  • [2] Y. Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in nd images. In International Conference on Computer Vision, 2001.
  • [3] J. Cai, L. Lu, Y. Xie, F. Xing, and L. Yang. Pancreas segmentation in mri using graph-based decision fusion on convolutional neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2017.
  • [4] T. F. Chan and L. A. Vese. Active contours without edges. IEEE Transactions on Image Processing, 10(2):266–277, 2001.
  • [5] J. Chen, L. Yang, Y. Zhang, M. Alber, and D. Z. Chen.

    Combining fully convolutional and recurrent neural networks for 3d biomedical image segmentation.

    In Advances in Neural Information Processing Systems, 2016.
  • [6] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
  • [7] O. Cicek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016.
  • [8] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In

    Computer Vision and Pattern Recognition

    , 2009.
  • [9] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P. M. Jodoin, and H. Larochelle. Brain tumor segmentation with deep neural networks. Medical Image Analysis, 35:18–31, 2017.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Patter Recognition, 2016.
  • [11] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Computer Vision and Patter Recognition, 2017.
  • [12] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, 1988.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
  • [14] M. Lai. Deep learning for medical image segmentation. arXiv preprint arXiv:1505.02000, 2015.
  • [15] D. T. Lin, C. C. Lei, and S. W. Hung. Computer-aided kidney segmentation on abdominal ct images. IEEE Transactions on Information Technology in Biomedicine, 10(1):59–65, 2006.
  • [16] H. Ling, S. K. Zhou, Y. Zheng, B. Georgescu, M. Suehling, and D. Comaniciu. Hierarchical, learning-based automatic liver segmentation. In Computer Vision and Pattern Recognition, 2008.
  • [17] J. Liu, J. Sun, and H. Y. Shum. Paint selection. ACM Transactions on Graphics, 28(3):69, 2009.
  • [18] S. Liu, D. Xu, S. K. Zhou, O. Pauly, S. Grbic, T. Mertelmeier, J. Wicklein, A. Jerebko, W. Cai, and D. Comaniciu. 3d anisotropic hybrid network: Transferring convolutional features from 2d images to 3d anisotropic volumes. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2018.
  • [19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, 2015.
  • [20] F. Milletari, N. Navab, and S. A. Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In International Conference on 3D Vision, 2016.
  • [21] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, 2015.
  • [22] H. R. Roth, L. Lu, A. Farag, H. Shin, J. Liu, E. B. Turkbey, and R. M. Summers. Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
  • [23] H. R. Roth, L. Lu, A. Farag, A. Sohn, and R. M. Summers. Spatial aggregation of holistically-nested networks for automated pancreas segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016.
  • [24] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics, volume 23, pages 309–314, 2004.
  • [25] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640, 2017.
  • [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
  • [27] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299–1312, 2016.
  • [28] Y. Xia, L. Xie, F. Liu, Z. Zhu, E. K. Fishman, and A. L. Yuille. Bridging the gap between 2d and 3d organ segmentation with volumetric fusion net. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2018.
  • [29] Q. Yu, L. Xie, Y. Wang, Y. Zhou, E. K. Fishman, and A. L. Yuille.

    Recurrent saliency transformation network: Incorporating multi-stage visual cues for small organ segmentation.

    In Computer Vision and Patter Recognition, 2018.
  • [30] Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L. Yuille. A fixed-point model for pancreas segmentation in abdominal ct scans. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2017.
  • [31] Z. Zhu, Y. Xia, W. Shen, E. K. Fishman, and A. L. Yuille. A 3d coarse-to-fine framework for automatic pancreas segmentation. In International Conference on 3D Vision, 2018.