I Introduction
Automated image segmentation plays a very important role in quantitative image analysis. In recent years, semantic segmentation methods based on convolutional neural networks (CNNs) have become very popular in computer vision and then in medical imaging research communities, e.g. the fully convolutional networks (FCNs) [1] with applications in natural images, and then Unet [2] and its 3D variant Vnet [3] for medical image segmentation. In natural and medical images, as pixels or voxels usually exhibit strong correlation, jointly modeling the label distribution globally and/or locally is desirable. To capture this contextual information, conditional random fields (CRFs) [4] are commonly utilized for semantic segmentation. The model consists of a unary potential term and a pairwise potential term. The unary potential specifies the per-pixel or per-voxel confidence of assigning a label, while the pairwise potential regularizes the label smoothness between neighboring voxels. In computer vision, the CRFs model has been integrated with CNNs for end-to-end training to take advantage of both the modeling power of CRFs and the representation-learning ability of CNNs [5].
Most deep learning based semantic segmentation methods are region-based [1, 2, 3, 5], in which each pixel is labeled as either target object or background. On the other hand, one can also formulate semantic segmentation with a surface-based model, in which the boundary surface of the target object is computed directly. Apparently these two types of approaches are equivalent, as the boundary surface can be computed from the labeled target volume, and vice versa. As one of the prominent surface-based methods, Graph-Search (GS) [6, 7] has achieved great success, especially in the medical imaging field, e.g. [8, 9, 10, 11, 12]. This method is capable of simultaneously detecting multiple interacting surfaces with global optimality with respect to the energy function designed for individual surfaces, with several geometric constraints defining the surface smoothness and interrelations. The method solves the surface segmentation problem by transforming it into computing a minimum s-t cut in a derived arc-weighted directed graph, which can be solved with global optimality and has a low-order polynomial time complexity.
Inspired by the graph search method, Shah et al. [13, 14] first modeled terrain-like surface segmentation as direct surface identification using a CNN-based regression network. The network only models the unary potentials. As the prediction was made directly on surface positions, a surface monotonicity constraint was realized in a straightforward way. The network used was a very lightweight CNN and no post-processing was required. Surprisingly, the results were very promising.
It would be of high interest to extend Shah et al.'s method to 3D for segmenting general non-terrain-like surfaces. To achieve that goal, two major obstacles need to be overcome: 1) how to generate patches with a regular neighborhood in 3D, such that a traditional CNN can be applied; 2) it is generally hard to train a 3D network, especially when it contains giant fully connected (FC) layers, the size of which is closely related to the inference/patch size. There is a trade-off between the amount of contextual information within a patch and the number of parameters in a network architecture, i.e. a bigger patch size comes along with more contextual information, but more parameters need to be trained.
Contributions: To overcome those technical barriers, we propose to build a framework of surface-based CNN+CRFs for surface segmentation in medical images. The contributions are twofold: 1) a surface-based CNN+CRFs framework is proposed, including proper modeling of the unary and pairwise terms and a compatibility matrix customized for surface segmentation; 2) a novel shape-aware patch generation method, based on harmonic mapping, is also proposed to make efficient training of surface segmentation possible.
II Method
The pipeline starts with a pre-segmentation (Preseg), which provides the coarse surface position and topology that the final segmentation should comply with. The triangular (Tri) mesh of the pre-segmented surface is then converted to a quadrilateral (Quad) mesh, which is friendly to convolution operations. Based on the Quad mesh, image patches containing terrain-like boundary surfaces of parts of the target object can be generated and fed into the proposed neural network to predict the voxels on the desired surface. The flowchart of the proposed method is illustrated in Fig. 1.
In the following sections, we first define the surface-based segmentation problem rigorously. The CRFs model is then reviewed, followed by the modeling of the unary and pairwise terms. Finally, a novel shape-aware patch generation method and the network architecture are explained.
II-A Surface-Based Segmentation
A 3D image $\mathcal{I}$ can be viewed as a 3D tensor. A terrain-like surface in $\mathcal{I}$ is oriented as shown in Fig. 2. Let $X$, $Y$ and $Z$ denote the image sizes in the $x$, $y$ and $z$ dimensions, respectively. The surface $S$ is defined by a function $z = s(x, y)$, where $x \in \{0, \dots, X-1\}$, $y \in \{0, \dots, Y-1\}$, and $s(x, y) \in \{0, \dots, Z-1\}$. Thus any surface in $\mathcal{I}$ intersects exactly one voxel of each column (Col) parallel to the $z$ direction, and it consists of exactly $X \cdot Y$ voxels. In surface segmentation, we generally define the energy or cost of one feasible surface $S$ to be
$$E(S) = \sum_{(x,y)} \psi_u\big(s(x,y)\big) + \lambda \sum_{\{(x,y),(x',y')\} \in \mathcal{N}} \psi_p\big(s(x,y), s(x',y')\big), \quad (1)$$
and the "optimal" surface is computed by minimizing this energy. Generally, the unary term $\psi_u$ is the energy when considering each column independently, the pairwise term $\psi_p$ penalizes discontinuity of the surface position between adjacent columns (the neighborhood $\mathcal{N}$), and $\lambda$ balances the contributions of the two terms.
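As a concrete illustration, the energy of Eq. (1) can be evaluated for a candidate surface in a few lines of NumPy. This is a minimal sketch with an absolute-difference pairwise penalty and hypothetical array names, not the actual cost design used in the experiments:

```python
import numpy as np

def surface_energy(unary, surface, lam=1.0):
    """Energy of a terrain-like surface over an X-by-Y grid of columns.

    unary:   (X, Y, Z) array; unary[x, y, z] is the cost of placing the
             surface at height z in column (x, y).
    surface: (X, Y) integer array of surface positions s(x, y).
    lam:     weight balancing the unary and pairwise terms.
    """
    X, Y = surface.shape
    xs, ys = np.meshgrid(np.arange(X), np.arange(Y), indexing="ij")
    u = unary[xs, ys, surface].sum()
    # Pairwise term: absolute height difference between 4-adjacent columns.
    p = np.abs(np.diff(surface, axis=0)).sum() + np.abs(np.diff(surface, axis=1)).sum()
    return u + lam * p
```

Minimizing this energy over all feasible surfaces is what the graph-search and CRFs machinery is for; the sketch only evaluates it for one given surface.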
II-B CRFs Model
In this section the Conditional Random Fields (CRFs) model is first briefly reviewed and then the modeling of unary and pairwise terms is explained.
II-B1 Review
A CRF is defined on observations $\mathbf{I}$ and random variables $\mathbf{X} = \{X_1, \dots, X_N\}$ as follows: Let $G = (V, E)$ be a graph such that $\mathbf{X}$ is indexed by the vertices of $G$. Then $(\mathbf{I}, \mathbf{X})$ is a conditional random field when the random variables $\mathbf{X}$, conditioned on $\mathbf{I}$, obey the Markov property. The pair can be characterized by a Gibbs distribution of the form
$$P(\mathbf{X} = \mathbf{x} \mid \mathbf{I}) = \frac{1}{Z(\mathbf{I})} \exp\big(-E(\mathbf{x} \mid \mathbf{I})\big). \quad (2)$$
Here $E(\mathbf{x})$ is called the energy of the configuration $\mathbf{x}$ and $Z(\mathbf{I})$ is the partition function. For convenience, the conditioning on $\mathbf{I}$ is dropped below. In the fully connected pairwise CRFs model of [15], the energy of a label assignment $\mathbf{x}$ is given by
$$E(\mathbf{x}) = \sum_i \psi_u(x_i) + \lambda \sum_{i < j} \psi_p(x_i, x_j), \quad (3)$$
where the unary energy components $\psi_u(x_i)$ measure the energy of $X_i$ taking the value $x_i$, the pairwise energy components $\psi_p(x_i, x_j)$ measure the energy of assigning $x_i$, $x_j$ to $X_i$, $X_j$ simultaneously, and $\lambda$ balances the two terms.
II-C Modeling the Surface Segmentation as CRFs
It is natural to model the surface-based segmentation with a CRFs model. In the surface-based segmentation scenario, the surface positions $s(x, y)$ are the random variables and the image $\mathcal{I}$ is the observation. The unary potentials correspond to the energy of assigning a surface position to a column without explicitly considering the column interactions, and the pairwise potentials represent the energy of simultaneously assigning surface positions to two neighboring columns.
II-C1 Modeling Unary and Pairwise Potentials
Commonly, unary energies are obtained from a CNN, which, roughly speaking, predicts labels for columns without explicitly considering the smoothness and consistency of the label assignments. The pairwise energies provide the smoothing term that encourages assigning similar labels to columns with similar properties. In [15], the pairwise potentials are modeled as weighted Gaussians:
$$\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{K} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j), \quad (4)$$
where each $k^{(m)}$, for $m = 1, \dots, K$, is a Gaussian kernel applied on feature vectors $\mathbf{f}_i$ and $\mathbf{f}_j$. The feature vector of pixel $i$, denoted by $\mathbf{f}_i$, is derived from image features such as the spatial location $\mathbf{p}_i$ and visual features like the pixel/voxel intensity $I_i$. The function $\mu$, called the label compatibility function, captures the compatibility between different pairs of labels. In [15], the kernel term is defined as:
$$k(\mathbf{f}_i, \mathbf{f}_j) = w^{(1)} \exp\Big(-\frac{\|\mathbf{p}_i - \mathbf{p}_j\|^2}{2\theta_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\theta_\beta^2}\Big) + w^{(2)} \exp\Big(-\frac{\|\mathbf{p}_i - \mathbf{p}_j\|^2}{2\theta_\gamma^2}\Big), \quad (5)$$
where the first and second terms on the RHS are called the appearance kernel and the smoothness kernel, respectively, and $\theta_\alpha$, $\theta_\beta$ and $\theta_\gamma$ control the shapes of the corresponding Gaussian kernels.
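Evaluated for a single pair of pixels, Eq. (5) is just a weighted sum of two Gaussians. A minimal sketch (parameter defaults are illustrative, not the values used in the paper):

```python
import numpy as np

def dense_crf_kernel(p_i, p_j, I_i, I_j, w1=1.0, w2=1.0,
                     theta_alpha=3.0, theta_beta=5.0, theta_gamma=3.0):
    """Pairwise kernel k(f_i, f_j) of Eq. (5): appearance + smoothness."""
    d_pos = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    d_int = np.sum((np.asarray(I_i, float) - np.asarray(I_j, float)) ** 2)
    # Appearance kernel: nearby pixels with similar intensities interact.
    appearance = w1 * np.exp(-d_pos / (2 * theta_alpha ** 2)
                             - d_int / (2 * theta_beta ** 2))
    # Smoothness kernel: nearby pixels interact regardless of intensity.
    smoothness = w2 * np.exp(-d_pos / (2 * theta_gamma ** 2))
    return appearance + smoothness
```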
II-C2 Customized Pairwise Terms
Next we will explain the CRFs pairwise term modeling in the proposed setting.
Customized Visual Feature
One main difference from common semantic segmentation is the meaning of the visual feature $I_i$. In 2D region-based segmentation, $I_i$ simply reduces to a 1D (gray images) or 3D (color images) pixel intensity value. In that setting, it is reasonable that a larger difference $\|I_i - I_j\|$ may indicate a possible label difference. However, in our surface-based segmentation setting, $I_i$ represents one column of voxels in the 3D image $\mathcal{I}$. One should notice that it is no longer proper to use the term $\|I_i - I_j\|$ as a measure indicating possible labeling differences for corresponding column pairs. The reason is that a column may also contain voxels that are not significantly related to the labeling of the current column and may have large variance, e.g. the two voxels outlined by blue dashed ovals in Fig. 3(b). To remedy this problem, we propose to use the probability map (or logits) $Q$ output by the CNN as the visual feature. The observation is that a CNN with a large enough receptive field can, to some extent, suppress these unrelated voxels in the probability map, as illustrated in Fig. 3(c). The proposed kernel term has the new form
$$k(\mathbf{f}_i, \mathbf{f}_j) = w^{(1)} \exp\Big(-\frac{\|\mathbf{p}_i - \mathbf{p}_j\|^2}{2\theta_\alpha^2} - \frac{\|Q_i - Q_j\|^2}{2\theta_\beta^2}\Big) + w^{(2)} \exp\Big(-\frac{\|\mathbf{p}_i - \mathbf{p}_j\|^2}{2\theta_\gamma^2}\Big). \quad (6)$$
Customized Compatibility Matrix
The other difference is the physical meaning of label differences between different columns. In 2D region-based segmentation, the numerical difference between class labels carries no meaning in itself. Suppose the classes cat, car and building have labels 0, 1 and 2, respectively. Generally there is no way to give any reasonable meaning to the label difference; for example, one cannot say that cat is more compatible with car than with building. However, in our 3D surface-based segmentation, the label difference has the explicit meaning of surface smoothness, i.e., the smaller the label difference, the smoother the surface.
The naive way to learn the compatibility matrix would be to learn a full $L \times L$ matrix, where $L$ is the number of labels. In our scenario, this is ill-posed, since some label pairs may not exist, or at least are very rare, in the training data. To tackle this, we propose to parameterize the compatibility matrix $\mu$ with a parameter function $f$ of the label difference:
$$\mu(x_i, x_j) = f(|x_i - x_j|). \quad (7)$$
The intuition behind this is that the compatibility penalty should be monotonically related to the label difference. In this way, the number of training parameters for the compatibility matrix is reduced from quadratic in the number of labels to linear.
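Under this parameterization the full compatibility matrix is materialized from a one-dimensional function of the label difference. A small sketch, with an identity function standing in for the learned $f$:

```python
import numpy as np

def compatibility_matrix(num_labels, f=lambda d: d):
    """Build mu[l_i, l_j] = f(|l_i - l_j|) from a 1D penalty function f."""
    labels = np.arange(num_labels)
    diff = np.abs(labels[:, None] - labels[None, :])  # |l_i - l_j|
    return f(diff.astype(float))
```

Only the parameters of `f` need to be learned, regardless of how many labels (column positions) there are.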
II-D Shape-Aware Patch Generation
For real applications, two obstacles need to be solved first so that we can model the surface-based segmentation as terrain-like surface segmentation using a CNN: 1) unfolding the surface into a terrain-like surface, on which our surface-based segmentation is defined; 2) the unfolded image or patch volumes should have a rectangular cuboid grid structure in 3D, such that a traditional CNN can be applied.
II-D1 To Terrain-like Surfaces
For a cylindrical surface, it is trivial to use a cylindrical coordinate transform. For the simplest closed surface, i.e. a sphere, it is also trivial to utilize a predefined sampling and dividing pattern on the sphere to unfold it. But how to deal with more complex closed surfaces, e.g. non-convex objects/surfaces? We propose to tackle 1) by first harmonically mapping the surface to the unit sphere and then using a predefined sampling and dividing pattern to realize the unfolding.
II-D2 To Grid-Structure Patches
Given a pre-segmented surface, a triangular (Tri) mesh of the pre-segmentation can be computed by the marching cubes method. However, the Tri mesh may contain variable neighborhood schemes and does not have a regular grid structure. We propose to tackle 2) by resampling the Tri mesh of the pre-segmentation into a quadrilateral (Quad) mesh.
The detailed explanation of the proposed pipeline for the proposed shapeaware patch generation is as follows.
Harmonic Mapping
The triangulated pre-segmentation mesh, which should be a genus-0 closed surface (otherwise it needs to be closed artificially), is harmonically mapped to the unit sphere to obtain a triangulated sphere using the algorithm in [16].
Quad Parameterization of the Unit Sphere
The unit sphere can be parameterized by a Quad mesh (except for 8 grid points, which have only 3 neighbors), denoted by $S_q$. This parameterization proceeds recursively. The base Quad mesh is an inscribed cube of the unit sphere. In each recursion, the middle point of each edge (and the center of each face) is computed and moved outwards exactly onto the unit sphere. This process is demonstrated in Fig. 4. Notice that more recursive iterations produce a higher grid resolution on the surface. In our experiments, the recursion number is chosen as 5; in other words, for each face of the base Quad mesh, the Quad grid size increases by a factor of four with each recursion.
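For a single face of the inscribed cube, the midpoint-subdivide-and-project construction can be sketched as follows (illustrative code working on a dense grid rather than a general mesh data structure):

```python
import numpy as np

def subdivide_face(corners, levels):
    """Recursively subdivide one quad face of the inscribed cube, pushing
    every new midpoint outwards onto the unit sphere.

    corners: (2, 2, 3) array holding the four unit-norm face corners.
    Returns an (n, n, 3) grid on the unit sphere, n = 2**levels + 1.
    """
    grid = np.asarray(corners, dtype=float)
    for _ in range(levels):
        n = grid.shape[0]
        new = np.zeros((2 * n - 1, 2 * n - 1, 3))
        new[::2, ::2] = grid                                       # old points
        new[1::2, ::2] = 0.5 * (grid[:-1, :] + grid[1:, :])        # row-edge midpoints
        new[::2, 1::2] = 0.5 * (grid[:, :-1] + grid[:, 1:])        # col-edge midpoints
        new[1::2, 1::2] = 0.25 * (grid[:-1, :-1] + grid[1:, :-1]
                                  + grid[:-1, 1:] + grid[1:, 1:])  # face centers
        grid = new / np.linalg.norm(new, axis=-1, keepdims=True)   # project to sphere
    return grid
```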
Quad Remeshing of Preseg Surfaces
Both the mapped pre-segmentation surface and the Quad-parameterized sphere $S_q$ are unit-sphere manifolds in the same space, and can be overlaid on each other. For each grid point on the Quad mesh sphere $S_q$, the corresponding triangular face in the mapped pre-segmentation surface can be found, and the barycentric coordinates of this grid point can then be computed. Between the vertices of the mapped surface and those of the original triangulated pre-segmentation surface there exists a one-to-one mapping. Using the barycentric coordinates computed on the unit-sphere manifold, for each grid point of $S_q$ we can get its approximate corresponding point on the original triangulated pre-segmentation surface. The normal direction for each such point can be computed in a similar fashion. In this way, a Quad remeshing of the triangulated pre-segmentation surface is realized, denoted by $S_{tq}$. This Quad remeshing process works for all genus-0 closed surfaces, as illustrated in Fig. 5(b-c).
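The barycentric pull-back at the heart of this remeshing step can be sketched as follows; the helper names are hypothetical and the code assumes the containing triangle on the mapped sphere has already been located:

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates of point p in triangle (a, b, c); p is
    assumed to lie in the triangle's plane."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - v - w, v, w])

def pull_back(p_sphere, tri_sphere, tri_orig):
    """Map a sphere grid point back onto the original pre-segmentation
    surface via the barycentric coordinates of its containing triangle."""
    weights = barycentric(p_sphere, *tri_sphere)
    return weights @ np.asarray(tri_orig, float)
```

Vertex normals can be interpolated with the same barycentric weights.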
Sampling Columns to Generate Patches
After the remeshing, for each vertex of $S_{tq}$, we sample a column of voxels with a certain length and resolution in the normal direction, which is treated as the image feature for this vertex and corresponds to one column in our problem definition (Fig. 5(d)). The Quad surface mesh is a 2D manifold; hence, after extending along the image-feature column dimension, a 3D volume corresponding to $S_{tq}$ is generated. In addition, our Quad parameterization of the unit sphere can easily be split into 6 pieces, which correspond to the 6 faces of the inscribed cube. If we split these 6 pieces and treat the image feature column as the third dimension, six 3D volumes / patches can be generated.
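Sampling one feature column along a vertex normal can be sketched as below; nearest-neighbour lookup is used for brevity (trilinear interpolation would normally be preferred), and all names are illustrative:

```python
import numpy as np

def sample_column(volume, point, normal, length, spacing=1.0):
    """Sample `length` voxels centred at `point` along the direction `normal`."""
    normal = np.asarray(normal, float)
    normal = normal / np.linalg.norm(normal)
    offsets = (np.arange(length) - length // 2) * spacing
    coords = np.asarray(point, float) + offsets[:, None] * normal
    # Nearest-neighbour lookup, clamped to the volume bounds.
    idx = np.clip(np.rint(coords).astype(int), 0, np.array(volume.shape) - 1)
    return volume[idx[:, 0], idx[:, 1], idx[:, 2]]
```

Stacking one such column per Quad vertex yields the 3D patch volume described above.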
Ground Truth Generation
Once $S_{tq}$ is derived, the ground-truth surface voxel for each vertex or column can be defined as the voxel nearest to the manual segmentation mesh along the normal direction.
II-E Network Architecture
The network is designed for direct surface segmentation. The architecture consists of two main parts: a 3D encoder-decoder CNN for surface probability map generation, and a trainable CRFs layer modeling the unary and pairwise terms simultaneously, as demonstrated in Fig. 6.
II-E1 CNN
In medical image segmentation, the encoder-decoder CNN has been popular. We take a similar architecture to generate the surface probability map, i.e. the unary term. Global skip connections are built as in [2], and short or local skip connections are utilized as in He et al. [17], where a unit block is called a residual block. These connections are used to mitigate the gradient vanishing problem. As the output of the encoder-decoder CNN, a two-channel probability map for the target surface is generated. During pretraining of the CNN, supervision is applied with a weighted binary cross-entropy (WBCE) loss. The ground truth here is a binary mask of the same size as the input patch. The difference from region-based segmentation networks is that 1 and 0 represent the target surface and the background, respectively; the resulting classification problem is thus highly imbalanced, and we introduce the WBCE loss to alleviate this.
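The WBCE loss simply up-weights the rare surface class. A minimal NumPy sketch on probabilities (the function name and form are ours; in practice a logits-based PyTorch loss with a positive-class weight would be used):

```python
import numpy as np

def weighted_bce(p, t, w_pos):
    """Weighted binary cross-entropy: predictions p in (0, 1), binary
    targets t; surface voxels (t == 1) are up-weighted by w_pos."""
    p = np.clip(p, 1e-7, 1 - 1e-7)  # numerical stability
    return -np.mean(w_pos * t * np.log(p) + (1 - t) * np.log(1 - p))
```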
II-E2 CRFs
To explicitly model the unary and pairwise terms simultaneously, the CRFs layer is introduced. To the best of our knowledge, CRFs have so far been used in region-based segmentation networks, and this is the first time they are applied to surface-based segmentation. In Shah et al.'s method, an FC layer was utilized to directly regress the surface positions, but from feature maps at low spatial resolution.
The fully connected CRFs model was first introduced to semantic segmentation by Krähenbühl and Koltun [15] and is known as DenseCRF. Although DenseCRF uses a mean-field approximation for inference, it achieved significantly improved results with efficient inference, and it has become the backbone of most CRFs models. The mean-field inference of a DenseCRF model can be incorporated into a neural network, as developed in [5]. This enables the joint training of CNN and CRFs by simple back propagation; the method was named CRF-as-RNN. In CRF-as-RNN, the message-passing step is the bottleneck: the exact computation is quadratic in the number of pixels and therefore infeasible. To alleviate this, a permutohedral lattice approximation was utilized. However, computing it efficiently on GPU is non-trivial or impossible to realize, and an efficient gradient computation through the permutohedral lattice approximation is also a non-trivial problem. This may hinder the learning of some parameters, e.g. $\theta_\alpha$, $\theta_\beta$ and $\theta_\gamma$. In the convolutional CRFs [18], the message passing is reformulated as a convolution with a truncated Gaussian kernel and can be implemented similarly to the regular convolutions in a CNN. Therefore, the convolutional CRFs formulation is utilized in the proposed method.
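To make the idea concrete, one truncated-Gaussian message-passing step can be written as a windowed local operation in plain NumPy. This is an illustrative single-kernel sketch of the principle behind convolutional CRFs, not the implementation of [18]:

```python
import numpy as np

def conv_message_pass(Q, feats, radius=2, theta=1.0):
    """One truncated-Gaussian message-passing step on a 2D label map.

    Q:     (H, W, L) current label distribution.
    feats: (H, W, F) per-pixel features entering the Gaussian kernel.
    Only neighbours within `radius` contribute (the kernel is truncated).
    """
    H, W, _ = Q.shape
    out = np.zeros_like(Q)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # no self-message
            ty = slice(max(0, dy), H + min(0, dy))    # target rows
            tx = slice(max(0, dx), W + min(0, dx))    # target cols
            sy = slice(max(0, -dy), H + min(0, -dy))  # source rows
            sx = slice(max(0, -dx), W + min(0, -dx))  # source cols
            d2 = np.sum((feats[ty, tx] - feats[sy, sx]) ** 2, axis=-1)
            k = np.exp(-d2 / (2 * theta ** 2))        # Gaussian weight
            out[ty, tx] += k[..., None] * Q[sy, sx]
    return out
```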
II-E3 Loss
The cross-entropy (CE) loss is utilized both for the pretraining of the CNN and for the fine-tuning of the CNN+CRFs network. For pretraining, it is a binary cross-entropy (BCE) loss, since the encoder-decoder is meant to output a probability map of being surface or non-surface. As the surface and non-surface voxel classes are highly imbalanced, a weighted binary cross-entropy (WBCE) loss is used. For the CNN+CRFs fine-tuning, the problem is modeled as multinomial classification and therefore a multinomial cross-entropy (MCE) loss is chosen; the fine-tuning is end-to-end. The losses and training strategies for the proposed method are summarized in Table I.
Method | CNN Loss | CRFs Loss | Training Strategy
proposed CNN | WBCE | - | -
proposed CNN+CRFs | WBCE | MCE | Pretrain CNN, then fine-tune CNN+CRFs
In the following two sections, the proposed method is applied to prostate MRI segmentation and spleen CT segmentation.
III Application to Prostate MRI Segmentation
III-A Data, Patch Generation and Augmentation, Hyper-Parameters
III-A1 Data
The dataset is provided by the NCI-ISBI 2013 Challenge: Automated Segmentation of Prostate Structures [19]. This dataset has two labels: peripheral zone (PZ) and central gland (CG). We treat them both as prostate, since only single-surface segmentation is considered in this work. The challenge dataset has 3 parts: the training set (60 cases), the leaderboard set (10 cases) and the test set (10 cases). As the challenge is closed, only the training and leaderboard data with annotations, 70 cases in total, were used for our experiments. 10-fold cross validation was applied on this dataset.
III-A2 Patch Generation
Our method needs a pre-segmentation to set the desired topology and also to give the base plane for the 3D patches, such that feature columns are sampled in the normal directions. To test the robustness of the proposed method to the pre-segmentation and the column length (the resolution is fixed), two pre-segmentation methods were explored. The first was fitting a fixed-size ellipsoid. The other was coarsely fitting a mean shape to a user-defined bounding box. Apparently the second should produce more accurate pre-segmentations, and obviously shorter feature columns can be sampled with a better pre-segmentation. All volumes were resampled to be isotropic in voxel resolution and normalized to have zero mean and unit variance.
Ellipsoid Pre-segmentation:
For simplicity, an ellipsoid with three fixed principal semi-axis lengths was used as the pre-segmentation. The centers of the ellipsoids were picked by users. As this pre-segmentation is far from perfect (average Dice similarity coefficient (DSC) around 0.7), longer columns should be sampled such that at least all surface voxels are included. The column length for ellipsoid pre-segmentations was set to 128 and the resolution was 0.625 mm.
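Such an ellipsoid pre-segmentation mask is straightforward to generate; the sketch below uses hypothetical names and axis-aligned semi-axes:

```python
import numpy as np

def ellipsoid_mask(shape, center, semi_axes):
    """Binary mask of an axis-aligned ellipsoid pre-segmentation.

    shape:     (X, Y, Z) volume size
    center:    user-picked ellipsoid center
    semi_axes: the three principal semi-axis lengths
    """
    grids = np.ogrid[tuple(slice(0, s) for s in shape)]
    r2 = sum(((g - c) / a) ** 2 for g, c, a in zip(grids, center, semi_axes))
    return r2 <= 1.0
```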
Mean Shape Pre-segmentation:
For the training data, we aligned all images to one randomly picked reference image based on the ground-truth centers, such that all training prostates were coarsely aligned. The zero level set of the average surface distance map was then taken as the mean shape. For the test data, based on the bounding boxes the users picked, we fitted the mean shape into the bounding boxes by only changing the level-set value, i.e. the mean shape was only allowed a scaling transformation. The column length under this setting could be reduced to 64 (resolution = 0.625 mm), as the pre-segmentations were more accurate (average DSC around 0.78).
III-A3 Data Augmentation
Rotation, flipping in the two in-plane directions, combinations of the two, and simple random translation in the column direction were applied. In total, the number of training patches was enlarged by a factor of 14.
III-A4 Hyper-Parameters
The proposed network was implemented with PyTorch [20]. The network was initialized with Xavier normal initialization [21]. Two patch sizes were used, one for each pre-segmentation setting, where the in-plane size equals the number of columns in each patch; the resolution in the column direction was 0.625 mm.

The Proposed CNN only:
The Adam optimizer [22] was chosen for the training, which ran for 50 epochs. The weights within the WBCE for the two patch sizes were set roughly inversely proportional to the ratio between the non-surface voxel count and the surface voxel count in the ground truth annotation.

The Proposed CNN+CRFs:
The settings for the CNN pretraining were the same as those of the CNN only. During fine-tuning, Adam was used again and the training ran for 50 epochs. Only the MCE loss following the CRFs layer was utilized. The initialization of the parameters in the CRFs layer is detailed in Table II.
0.5  1  3  5  0.2  5  5 
III-B Evaluation Metrics
Three metrics were used to evaluate the segmentation results: the Dice similarity coefficient (DSC); the Hausdorff distance (HD), i.e. the greatest of all distances from each point on the computed surface to the closest point on the reference surface; and the average surface distance (ASD), i.e. the average of the shortest distances between points on the computed surface and the reference surface. The DSC is defined as
$$\mathrm{DSC} = \frac{2\,|G \cap P|}{|G| + |P|}, \quad (8)$$
where $|G|$ is the number of prostate voxels in the ground truth segmentation, $|P|$ is the number of prostate voxels in the prediction, and $|G \cap P|$ is the number of overlapping prostate voxels between the two. The distance from a voxel $v$ to a surface $S$ is first defined as
$$d(v, S) = \min_{u \in S} \|v - u\|. \quad (9)$$
Then the HD between two surfaces $S_1$ and $S_2$ is computed as
$$\mathrm{HD}(S_1, S_2) = \max\Big\{\max_{v \in S_1} d(v, S_2),\ \max_{v \in S_2} d(v, S_1)\Big\}. \quad (10)$$
The ASD is defined as
$$\mathrm{ASD}(S_1, S_2) = \frac{1}{|S_1| + |S_2|}\Big(\sum_{v \in S_1} d(v, S_2) + \sum_{v \in S_2} d(v, S_1)\Big), \quad (11)$$
where $|S_1|$ and $|S_2|$ are the numbers of voxels in surfaces $S_1$ and $S_2$, respectively.
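For reference, the three metrics can be computed directly from these definitions. The brute-force sketch below operates on binary masks and point sets (helper names are ours; for large surfaces a KD-tree would be used for the nearest-neighbour queries):

```python
import numpy as np

def dsc(gt, pred):
    """Dice similarity coefficient of two binary masks (Eq. 8)."""
    inter = np.logical_and(gt, pred).sum()
    return 2.0 * inter / (gt.sum() + pred.sum())

def nn_distances(s1, s2):
    """Distance from each point of s1 to its nearest point in s2 (Eq. 9)."""
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    return np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1).min(axis=1)

def hausdorff(s1, s2):
    """Symmetric Hausdorff distance between two point sets (Eq. 10)."""
    return max(nn_distances(s1, s2).max(), nn_distances(s2, s1).max())

def asd(s1, s2):
    """Average surface distance between two point sets (Eq. 11)."""
    return (nn_distances(s1, s2).sum() + nn_distances(s2, s1).sum()) / (len(s1) + len(s2))
```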
III-C Comparison to Other Methods
The quantitative segmentation results of the different methods are listed in Table III.
In Table III, our results were derived using only the NCI-ISBI data, while the compared methods FCN [1], Vnet [3], Unet [2] and PSNet [23] utilized additional in-house data and Promise12 data [24] for their network training. For all other methods, only the NCI-ISBI dataset was used. In other words, the results of the first four methods were derived using roughly twice as many training cases and a similar number of validation and test cases. With respect to the DSC metric, our method outperforms FCN, Vnet, Unet and PSNet, and is comparable to the deep learning state of the art GCANet [25] and another state-of-the-art traditional method [26] combining supervoxels, Graph Cut and the Active Contour Model (ACM). With respect to the surface-distance-related metrics (i.e., HD and ASD), the proposed method significantly outperforms all the compared methods. Another advantage of the proposed method is that post-processing (e.g. morphological operations to remove holes, to which surface metrics are very sensitive), which is required for most region-based deep learning methods such as FCN, Vnet and Unet, is no longer necessary. The GS method shares the same merit. However, due to the need to manually design the cost function (i.e., how to generate the probability maps), its performance is inferior to the proposed method and the other deep learning based methods, even though its solution to the energy minimization problem is globally optimal.
Method | DSC | ASD (mm) | HD (mm)
FCN [1] | 0.79±0.06 | 4.8±1.1 | 11.9±4.8
Vnet [3] | 0.83±0.05 | 3.4±1.2 | 9.5±3.9
Unet [2] | 0.84±0.05 | 3.3±1.0 | 10.1±3.2
PSNet [23] | 0.85±0.04 | 3.0±0.9 | 9.3±3.5
GS | 0.80±0.04 | 2.7±0.6 | 13.9±1.8
Supervoxel+GraphCut+ACM [26] | 0.88±0.02 | - | -
GCANet [25] | 0.88 | 2.2 | -
proposed CNN+CRFs | 0.88±0.03 | 1.4±0.3 | 8.2±3.6
III-D Robustness to Different Pre-segmentations
The results with the two different pre-segmentations are shown in Table IV. A better pre-segmentation and shorter columns improve the DSC and HD performance consistently; the ASD performances are comparable. The results indicate that although better pre-segmentations help, our method is not sensitive to the pre-segmentation as long as the correct topology is provided.
Preseg, Col length | Method | DSC | ASD (mm) | HD (mm)
Ellipsoid, 128 | proposed CNN+CRFs | 0.86±0.05 | 1.4±0.5 | 9.6±5.2
Mean shape, 64 | proposed CNN+CRFs | 0.88±0.03 | 1.4±0.3 | 8.2±3.6
III-E Ablation Study
We also conducted an ablation study to verify that the CRFs layer improves the surface segmentation. The results, with and without the CRFs layer, are shown in Table V. It can be noticed that the CRFs layer does not improve the DSC performance, but it does help on the surface-related metrics. One example of the improvement is illustrated in Fig. 7.
Preseg, Col length | Method | DSC | ASD (mm) | HD (mm)
Ellipsoid, 128 | proposed CNN | 0.86±0.05 | 1.5±0.5 | 11.3±5.9
Ellipsoid, 128 | proposed CNN+CRFs | 0.86±0.05 | 1.4±0.5 | 9.6±5.2
Mean shape, 64 | proposed CNN | 0.88±0.03 | 1.4±0.3 | 8.3±2.9
Mean shape, 64 | proposed CNN+CRFs | 0.88±0.03 | 1.4±0.3 | 8.2±3.6
In the next section, the proposed method is tested on spleen CT segmentation.
IV Application to Spleen CT Segmentation
IV-A Data, Patch Generation and Augmentation, Hyper-Parameters
IV-A1 Data
The dataset is provided by Task09 of the Medical Segmentation Decathlon (MSD) challenge (https://decathlon.grandchallenge.org/Home/). Only the training set with annotations was utilized; there are 41 cases in total. All experiments were conducted with 4-fold cross validation.
IV-A2 Patch Generation
All volumes were resampled to be isotropic in voxel resolution and normalized to have zero mean and unit variance. For simplicity, a 3D Vnet was trained as the baseline model. A 3D active contour model [27] was utilized to provide coarse segmentations. We then smoothed the coarse predictions and treated the smoothed segmentations as our pre-segmentations.
IV-A3 Data Augmentation
The same augmentation strategy was applied to this task.
IV-A4 Hyper-Parameters
The proposed network was implemented with PyTorch and initialized with Xavier normal initialization. The in-plane patch size equals the number of columns in each patch, and the column-direction resolution follows the same convention as in the prostate experiments.
Vnet:
A public implementation (https://github.com/mattmacy/vnet.pytorch) was used. BCE was chosen as the loss function, and the network was trained with Adam.

The Proposed CNN+CRFs:
For the CNN pretraining, the Adam optimizer was chosen with the WBCE loss, and we let it run for 200 epochs. During fine tuning, the MCE loss was utilized and the training ran for 100 epochs. The initialization of the parameters in the CRFs layer is detailed in Table VI.
0.7  1  0.2  5  0.2  5  5 
IV-B Evaluation Metrics
DSC, ASD and HD were used to quantify the segmentation results.
IV-C Performance Comparison
Method | DSC | ASD (mm) | HD (mm)
Vnet [3] | 0.94±0.03 | 1.2±1.0 | 16.3±11.2
proposed CNN+CRFs | 0.95±0.02 | 0.86±0.74 | 13.6±12.7
p-value | 0.007 | 0.047 | 0.160
The quantitative results of the proposed method applied to the MSD Spleen dataset are shown in Table VII. As the task is relatively easy, even the baseline Vnet achieves promising results. However, the proposed method still gains additional improvement, especially in terms of the surface-related metrics. From Table VII, considering the p-values of the t-test, the proposed CNN+CRFs significantly outperforms Vnet in terms of DSC and ASD. Sample segmentation results are shown in Fig. 8. Illustrations of how the CRFs layer improves the segmentation results can be found in Fig. 9.
V Discussion
V-A Training Efficiency
Patch Consistency
One advantage of the proposed patch generation method is that it generates more consistent patches, i.e. all the patches' ground truths contain monotonic surfaces (each column has exactly one surface voxel). For a region-based network, the ground truths of image patches generally have much more variance, e.g. all zeros (background), all ones (desired object), and anything in between. The proposed method may thus make the task easier for the network. Also, since the proposed method focuses on surfaces and segments them directly, the proposed network may predict boundaries/surfaces more accurately. One may view this as a kind of attention mechanism [28].
Patch Amount
In terms of the total patch number per volume, compared to a common region-based 3D segmentation network, the proposed method may be more efficient (we extract only 6 patches per volume), as we only sample around the pre-segmentation surface rather than the whole volume. Surely it can be argued that if the pre-segmentation or an ROI bounding box is given, a region- or subvolume-based segmentation network may have a similar number of patches.
Based on all this, assuming similar augmentation is utilized, the proposed method may converge faster. In practice, the training of the proposed CNN on the prostate data converges in around 3 hours on an Nvidia Titan X GPU.
V-B Pre-segmentation with Correct Topology
The proposed method relies on the pre-segmentation having the correct topology. Otherwise, there is no chance for our method to generate the correct prediction, since the prediction always complies with the topology of the pre-segmentation. As verified by the experiments, the proposed method is not sensitive to the pre-segmentation accuracy as long as the topology is correct. In this sense, for applications with simple topologies, model-based pre-segmentation methods may work better than advanced methods without any topology guarantees. For example, a simple Unet/Vnet may not be appropriate for the pre-segmentation generation, although the DSC of its prediction may be significantly higher than that generated by a simple model-based method. Indeed, for the spleen dataset, we had to apply recursive Gaussian mask smoothing and windowed-sinc mesh smoothing to get proper pre-segmentations.
V-C Inference with Overlapping Patches
In our current implementation, each case produces only 6 patches, overlapping merely on the patch boundaries. In region-based CNN segmentation work, it is commonly known that taking overlapping patches and averaging the predictions on the overlapped regions during inference can improve the segmentation results. Theoretically, one can adopt a similar strategy in surface-based segmentation, and it would probably help improve inference quality.
V-D Possible Drawbacks of the Proposed Method
In the region-based CNN+CRFs framework, the visual feature is the pixel or voxel intensity of the original image, which helps the CRFs recover the true object boundary accurately and thus compensates for the coarseness of CNN-only semantic segmentation. In our current surface-based CNN+CRFs framework, however, the original image information cannot be used in the same way to aid accurate boundary recovery. In the GS framework, this problem is remedied by a carefully designed unary cost term that incorporates sufficient low-level information from the original image. We leave this problem as possible future work.
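To make concrete what "low-level image information in the unary cost" means, here is a hypothetical sketch of the kind of gradient-based on-surface cost commonly used in graph-search formulations: positions along a column with a strong intensity transition receive low cost. This is an illustrative example, not the exact cost term of [6, 7].

```python
import numpy as np

def column_unary_cost(intensities):
    """Gradient-based on-surface cost along one sampled image column:
    low cost where the intensity changes sharply (a likely boundary),
    high cost in homogeneous regions."""
    grad = np.abs(np.gradient(intensities.astype(float)))
    return grad.max() - grad  # invert: strong edge -> low cost

col = np.array([10, 10, 10, 80, 80, 80], dtype=float)  # step edge
cost = column_unary_cost(col)
# the minimum cost sits at the step transition (indices 2 and 3)
```

A cost of this kind could, in principle, be injected into the CRFs energy of the surface-based framework to restore the boundary-recovery role that raw intensities play in the region-based counterpart.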
V-E Extension to Surfaces with Complex Topologies
Applying the proposed method to single-surface or single-object segmentation with complex topology, e.g. brain gray/white matter segmentation, remains an open problem. For such applications, a more carefully designed pre-segmentation method must be considered. The sampling directions for the image feature columns also need careful handling so that no two columns intersect. Possible options include the electric-field-line based method [29] and the generalized gradient vector flow based method [10].
V-F Extension to Multiple-Surface Segmentation
For multiple-surface or multiple-object segmentation, more effort is needed to apply the proposed method. 1) If all surfaces can share one pre-segmentation and one set of image feature columns, extending the proposed CNN from binary to multinomial classification is straightforward. However, the problem then becomes multi-label classification (each column carries multiple labels, one per desired surface) rather than multi-class classification (each column carries exactly one label). The easier multi-class case is properly modeled by the current CRFs framework, which is not true for the harder multi-label case. 2) If the surfaces cannot share a pre-segmentation and feature columns, preserving the topology conveyed by the pre-segmentations becomes nontrivial, e.g. guaranteeing that two surfaces do not cross each other, which is very common prior knowledge in medical imaging. One could control the column lengths so that different surfaces cannot intersect, but this merely shifts the burden to choosing proper column lengths. All of these can be considered future work.
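The multi-class versus multi-label distinction above can be illustrated on a single column of logits. In the multi-class case a softmax picks exactly one surface position per column; in the multi-label case an independent sigmoid per position can switch several positions on (one per surface). The logit values below are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# logits over L = 5 candidate positions along one column
logits = np.array([0.1, 2.0, 0.3, -1.0, 0.5])

# multi-class (single surface): exactly one position is chosen
single_surface = softmax(logits).argmax()

# multi-label (multiple surfaces): each position is independently
# on/off, so several positions can be selected on the same column
multi_logits = np.array([2.0, -3.0, -0.1, 1.5, -2.0])  # hypothetical
multi_surface = sigmoid(multi_logits) > 0.5
```

The CRFs layer in the current framework models a single categorical variable per column, which matches the softmax case; the sigmoid case would require a different structured model, which is why we flag it as the harder extension.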
V-G Loss Investigation
The MCE loss may not be the best option for our surface-based segmentation network, since the possible labels for each column have an intrinsic ordering. Therefore, in the future we may consider a weighted MCE, in which the weight of each label for a column increases monotonically with the distance between that label and the ground-truth label. Another possibility is to find methods that directly optimize surface position errors, e.g. the mean squared error.
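One way such a distance-aware loss could look is sketched below: standard cross-entropy on the ground-truth position, plus a penalty equal to the expected |label - truth| distance under the predicted distribution, so probability mass far from the true surface costs more. The function names and the trade-off weight `alpha` are our assumptions for this sketch, not a loss evaluated in the paper.

```python
import numpy as np

def expected_position_error(probs, target):
    """Expected |predicted - true| surface position under the predicted
    distribution over the L column positions; a differentiable
    surrogate for the surface position error."""
    positions = np.arange(len(probs))
    return np.sum(probs * np.abs(positions - target))

def weighted_ce(probs, target, alpha=0.5):
    """Cross-entropy plus a penalty that grows monotonically with the
    distance between a label and the ground-truth position."""
    ce = -np.log(probs[target] + 1e-12)
    return ce + alpha * expected_position_error(probs, target)
```

Unlike plain MCE, this loss distinguishes a prediction that is off by one position from one that is off by the full column length, which is exactly the ordering information the current loss ignores.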
VI Conclusion
We propose a novel direct surface segmentation method in 3D using deep learning. With the proposed patch generation method, surface monotonicity with respect to the pre-segmentation is enforced in our segmentation neural network. Even with an encoder-decoder network alone, the segmentation results on the NCI-ISBI 2013 Prostate dataset and the MSD Spleen dataset are promising with respect to the surface-distance related metrics (i.e., HD and ASD). With the introduction of the CRFs layer into our deep network, the performance of the proposed surface segmentation method improves further.
Acknowledgment
The authors would like to thank…
References

[1] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[2] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[3] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-net: Fully convolutional neural networks for volumetric medical image segmentation," in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 565–571.
[4] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
[5] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, "Conditional random fields as recurrent neural networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1529–1537.
[6] X. Wu and D. Z. Chen, "Optimal net surface problems with applications," in International Colloquium on Automata, Languages, and Programming. Springer, 2002, pp. 1029–1042.
[7] K. Li, X. Wu, D. Z. Chen, and M. Sonka, "Optimal surface segmentation in volumetric images: a graph-theoretic approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 1, pp. 119–134, 2006.
[8] M. K. Garvin, M. D. Abràmoff, X. Wu, S. R. Russell, T. L. Burns, and M. Sonka, "Automated 3-D intraretinal layer segmentation of macular spectral-domain optical coherence tomography images," IEEE Transactions on Medical Imaging, vol. 28, no. 9, pp. 1436–1447, 2009.
[9] Y. Yin, X. Zhang, R. Williams, X. Wu, D. D. Anderson, and M. Sonka, "LOGISMOS: layered optimal graph image segmentation of multiple objects and surfaces: cartilage segmentation in the knee joint," IEEE Transactions on Medical Imaging, vol. 29, no. 12, pp. 2023–2037, 2010.
[10] I. Oguz and M. Sonka, "LOGISMOS-B: layered optimal graph image segmentation of multiple objects and surfaces for the brain," IEEE Transactions on Medical Imaging, vol. 33, no. 6, pp. 1220–1235, 2014.
[11] M. K. Garvin, M. D. Abràmoff, R. Kardon, S. R. Russell, X. Wu, and M. Sonka, "Intraretinal layer segmentation of macular optical coherence tomography images using optimal 3-D graph search," IEEE Transactions on Medical Imaging, vol. 27, no. 10, pp. 1495–1505, 2008.
[12] Q. Song, J. Bai, M. K. Garvin, M. Sonka, J. M. Buatti, and X. Wu, "Optimal multiple surface segmentation with shape and context priors," IEEE Transactions on Medical Imaging, vol. 32, no. 2, pp. 376–386, 2013.
[13] A. Shah, M. D. Abràmoff, and X. Wu, "Simultaneous multiple surface segmentation using deep learning," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2017, pp. 3–11.
[14] A. Shah, L. Zhou, M. D. Abràmoff, and X. Wu, "Multiple surface segmentation using convolution neural nets: application to retinal layer segmentation in OCT images," Biomedical Optics Express, vol. 9, no. 9, pp. 4509–4526, 2018.
[15] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems, 2011, pp. 109–117.
[16] P. T. Choi, K. C. Lam, and L. M. Lui, "FLASH: Fast landmark aligned spherical harmonic parameterization for genus-0 closed brain surfaces," SIAM Journal on Imaging Sciences, vol. 8, no. 1, pp. 67–94, 2015.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[18] M. T. Teichmann and R. Cipolla, "Convolutional CRFs for semantic segmentation," arXiv preprint arXiv:1805.04777, 2018.
[19] N. Bloch, A. Madabhushi, H. Huisman et al., "NCI-ISBI 2013 challenge: automated segmentation of prostate structures," 2015.
[20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in Advances in Neural Information Processing Systems, 2017.
[21] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[23] Z. Tian, L. Liu, Z. Zhang, and B. Fei, "PSNet: prostate segmentation on MRI based on a convolutional neural network," Journal of Medical Imaging, vol. 5, no. 2, p. 021208, 2018.
[24] G. Litjens, R. Toth, W. van de Ven, C. Hoeks, S. Kerkstra, B. van Ginneken, G. Vincent, G. Guillard, N. Birbeck, J. Zhang et al., "Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge," Medical Image Analysis, vol. 18, no. 2, pp. 359–373, 2014.
[25] H. Jia, Y. Song, D. Zhang, H. Huang, D. Feng, M. Fulham, Y. Xia, and W. Cai, "3D global convolutional adversarial network for prostate MR volume segmentation," arXiv preprint arXiv:1807.06742, 2018.
[26] Z. Tian, L. Liu, Z. Zhang, J. Xue, and B. Fei, "A supervoxel-based segmentation method for prostate MR images," Medical Physics, vol. 44, no. 2, pp. 558–569, 2017.
[27] Y. Gao, R. Kikinis, S. Bouix, M. Shenton, and A. Tannenbaum, "A 3D interactive multi-object segmentation tool using local robust statistics driven active contours," Medical Image Analysis, vol. 16, no. 6, pp. 1216–1227, 2012.
[28] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, "Attention to scale: Scale-aware semantic image segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3640–3649.
[29] Y. Yin, Q. Song, and M. Sonka, "Electric field theory motivated graph construction for optimal medical image segmentation," in International Workshop on Graph-Based Representations in Pattern Recognition. Springer, 2009, pp. 334–342.