The task of optimally delineating 3D surfaces representing object boundaries is important in segmentation and quantitative analysis of volumetric medical images. OCT image segmentation to detect and localize the intra-retinal boundaries is a necessary basis for ophthalmologists in diagnosis and treatment of retinal pathologies , e.g glaucoma. IVUS image segmentation produces cross-sectional images of blood vessels that provide quantitative assessment of the vascular wall, information about the nature of atherosclerotic lesions as well as plaque shape and size . Interactive surfaces segmentation technologies are applied tissues layers segmentation in medical images.
In medical imaging, many surfaces that need to be identified appear in mutual interactions. These surfaces are “coupled” in a way that their topology and relative positions are usually known already (at least in a general sense), and the distances between them are within some specific range. Clearly, incorporating these surface-interrelations into the segmentation can further improve its accuracy and robustness, especially when insufficient image-derived information is available for defining some object boundaries or surfaces. Such insufficiency can be remedied by using clues from other related boundaries or surfaces. Simultaneous optimal detection of multiple coupled surfaces thus yields superior results compared to the traditional single-surface detection approaches. Simultaneous segmentation of coupled surfaces in volumetric medical images is an under-explored topic, especially when more than two surfaces are involved.
Several approaches for detecting coupled surfaces have been proposed in past years. The graph-based methods [12, 24, 21] have been proven one of the state-of-the-art traditional approaches for surface segmentation in medical images. The great success of the methods is mainly due to their capability of modeling the boundary surfaces of multiple interacting objects, as well as a prior knowledge reflecting anatomic information in a complex multi-layered graph model, enabling the segmentation of all desired surfaces to be performed simultaneously in a single optimization process with guaranteed global optimality. The essence of the graph model is to encode the surface cost, which measures the “goodness” of a feasible surface based on a set of derived image features, as well as the surface interacting constraints, into a graph structure. The major drawback is associated with the need for handcrafted features to define the surface cost of the underlying graphical model.
Armed with superior data representation learning capability, deep learning (DL) methods are emerging as powerful alternatives to traditional segmentation algorithms for many medical image segmentation tasks [13, 23]. The state-of-the-art DL segmentation methods in medical imaging include fully convolutional networks (FCNs)  and U-net 
, which model the segmentation problem as a pixel-wise or voxel-wise classification problem. However, due to the scarcity of training data in medical imaging, it is at least nontrivial for the convolutional neural networks (CNNs) toimplicitly learn the global structures of the target objects, such as shape, boundary smoothness and interaction. Shah et al.  first formulated the single surface segmentation as a regression problem using an FCN followed by fully-connected layers to enforce the monotonicity of the target surface. More recently, Yufan He et al. 
utilized a U-net as a backbone network to model the multiple surface segmentation with regression by a fully differentiable soft argmax, in which the ordering of those surfaces is adjusted to be guaranteed by a sequence of ReLU operations.
In this work, we propose to unify the powerful feature learning capability of DL with the successful graph-based surface segmentation model in a single deep neural network for end-to-end training to achieve globally optimal segmentation of multiple interacting surfaces. In the proposed segmentation framework, the surface costs are parameterized and the DL network is leveraged to learn the model from the training data to determine the parameters for the input image. The multi-surface inference by minimizing the total surface cost while satisfying the surface interacting constraints is realized by the primal-dual Internal Point Method (IPM) for constrained convex optimization, which can be implemented by a layer of neural networks enabling efficient back-propagation of gradients with virtually no additional cost . Thus, the DL network for surface cost parameterization can be seamlessly integrated with the multi-surface inference to achieve the end-to-end training.
To clearly present the essence of the proposed surface segmentation framework, we consider the simultaneous segmentation of multiple terrain-like surfaces. For the objects with complex shapes, the unfolding techniques  developed for the graph-based surface segmentation methods as well as the convolution-friendly resampling approach [27, 15], can be applied.
2.1 Problem Formulation
Let of size be a given 3-D volumetric image. For each pair, the voxel subset forms a column parallel to the -axis, denoted by , which is relaxed as a line segment from to . Our target is to find terrain-like surfaces , each of which intersects every column at exactly one point.
In the graph-based surface segmentation model [12, 24, 21], each voxel is associated with an on-surface cost for each sought surface , which is inversely related to the likelihood that the desired surface contains the voxel, and is computed based on handcrafted image features. The surface cost of is the total on-surface cost of all voxels on . The on-surface cost function for the column can be an arbitrary function in the graph model. However, an ideal cost function should express a type of convexity: as we aim to minimize the surface cost, should be low at the surface location; while the distance increases from the surface location along the column
, the cost should increase proportionally. We propose to leverage DL networks to learn a Gaussian distributionto model the on-surface cost function for each column , that is, . Thus, the surface cost of is parameterized with .
For multiple surfaces segmentation, a surface interacting constraint is added to every column for each pair of the sought surfaces and . For each , we have , where and are two specified minimum and maximum distance between and , respectively, with on top of . The multi-surface segmentation is formulated as an optimization problem, where the parameterized surface costs are derived using deep CNNs:
2.2 The Surface Segmentation Network Architecture
As shown in Fig. 1, our segmentation network consists of two integrative components: One aim to learn the surface cost parameterization in Eqn. (1); the other strikes to solve the the optimal surface interference by optimizing Eqn. (1) with the IPM optimization module. The surface cost is parameterized with (,
), which models the Gaussian distribution of the surface locations along each image column. The assumption behind this learning Gaussian surface ground truth is that the predicted surface locations in the H dimension should has the maximum probability, while locations deviating from this predicted surface location have smaller, instead of zero, probability as the predicted result. Bigger variance means most difficult recognizing surface, while small variance means easy recognizing surface. RegionConv is a convolution module to output-region segmentation, while SurfaceConv is a convolution module to output
-surface segmentation probability distribution. IPM Optimization indicates primal-dual Internal Point Method for constrained convex optimization. Input includes raw image, gradient of a raw image along H, W dimension, and magnitude and direction of the gradient, total 5 channels. GDiceLoss is an-class Generalized Dice Loss. Weighed DivLoss is an image-gradient weighted divergence loss. GT denotes ground truth. Dashed line indicates optional for different experiments. Thus, the whole network can then be trained in an end-to-end fashion and outputs globally optimal solutions for the multiple surface segmentation.
Surface Cost Parameterization. We utilize U-net 
as the backbone of our deep network for the feature extraction. The implemented U-net has seven layers with long skip connections between the corresponding blocks of its encoder and decoder. Each block has three convolution layers with a residual connection. The output feature maps of the U-net module is then fed into the following RegionConv and SurfaceConv modules (Fig. 1). The RegionConv module is implemented with three-layer convolutions followed by a convolution and softmax to obtain the probability maps for the regions divided by the sought surfaces. The SurfaceConv module is implemented with the same module structure of RegionConv to compute the location probability distribution along every image column for each surface. Note that each sought surface intersects every image column exactly once.
The RegionConv module directly makes use of the region information, which may help direct the U-net learning robust features for surface segmentation. In addition, the output region segmentation is used to estimate the surface locations. For each sought surfaceand every image column , the estimated surface location is the average envelop of the -th region on column , as there is no guarantee that each of the predicted regions is consecutive along the column based on voxel-wise classification by RegionConv, especially in the early stage of the training process. We also calculate a confidence index () for the surface location estimation based on the number of region disordering with for no disordering.
For each surface , based on the surface location probability on every image column from the SurfaceConv module, the expected surface location . Combined with the RegionConv module, the surface location distribution of on column is modeled with a Gaussian , as follows.
where is used to balance the fidelity of information from RegionConv and SurfaceConv. Thus, the surface cost of surface is parameterized with .
Globally Optimal Multiple Surface Inference. Given the surface cost parameterization , the inference of optimal multiple surfaces can be solved by optimizing Eqn. (1), which is a constrained convex optimization problem. In order to achieve an end-to-end training, the optimization inference needs to be able to provide gradient back-propagation, which impedes the use of traditional convex optimization techniques. We exploit the OptNet technique  to integrate a primal-dual interior point method (IPM) for solving Eqn. (1) as an individual layer in our surface segmentation network (Fig. 1). Based on Amos and Kolter’s theorem , the residual equation to Eqn. (1) derived from the Karush-Kuhn-Tucker conditions at the optimal solution can be converted into a full differential equation , where is a Jacobian of with respect to , is the input to the IPM optimization module including , and defines the surface locations of all surfaces. We thus can deduce partial differentials , which can be used to compute the back-propagation gradients , where
is the training loss. IPM method is a 2nd order Newton method with complicated matrix inversion, but which just needs less than 10 iterations to get converge in our context, so it still supports high-epoch training form. In test,it just needs about 1 minute for each volume image. Please refer to the Appendix for a detailed IPM algorithm, and all codes including preprocessing, network, optimization, and config file are publice at.
2.3 Network Training Strategy
Multiple loss functions are introduced to focus on the training of different modules in the proposed multiple surface segmentation network (Fig.1
). In the proposed SurfaceConv module, the softmax layer works on each imagecolumn, not on each voxel. The rational is that we assume each target surface intersects with each column by exactly once, and so the probabilities are normalized within each column. We assume SurfaceConv should output a Gaussian shaped probability map for each column, which mimics the Bayesian learning for each column and shares merits with knowledge distillation  and distillation defense .
To encourage SurfaceConv outputs reasonable probability maps, an innovative weighted divergence loss
is utilized for SurfaceConv training. It inherits from KLDLoss (Kullback–Leibler divergence). It also measures distribution distance between 2 distribution, but it more emphasizes probability consistence of some weighed critical points between 2 distributions.where indicates all pixels in classes, and is ground truth probability at pixel , is predicted probability at pixel , is a pixel-wise weight from raw image gradient magnitude: ,where as an experience parameter. In our applications, we hope the better probability consistence at pixels of bigger image gradients between the prediction and ground truth. We use the surface location of each reference surface on each column as and use either fixed or dynamically from the the / computation module to form the ground truth Gaussian distribution.
For the RegionConv module, a generalized Dice loss  is introduced to counter the possible high unbalance in region sizes.
For the predicted surface locations, in addition to using -loss to measure the difference between the prediction and the surface ground truth, we introduce a novel SmoothLoss to regularize the smoothness and mutual interaction of sought surfaces. More precisely, is the total sum of the mean-squared-errors (MSEs) of the surface location changes between any two adjacent image columns to the ground truth, plus the total sum of the MSEs of thickness on every column of each region divided by the sought surfaces.
The whole network loss , where is a weight coefficient for countering weak gradient when the prediction is close to the ground truth.
The proposed method was validated on two Spectral Domain Optical Coherence Tomography (SD-OCT) datasets for segmenting 9 retinal surfaces, and on one public Intravascular Ultrasound (IVUS) data set for the segmentation of lumen and media of vessel walls.
3.1 SD-OCT Retinal Surface Segmentation
Mean Absolute Surface Distance (MASD) and standard deviation inevaluated on Beijing Eye Study Dataset for segmenting 9 retinal surfaces. Below OE is OCT-Explorer graph search method, ours is the proposed method. Depth resolution is 3.87.
Beijing Eye Study OCT Data set. Beijing Eye Study 2011 has 3468 participants of aged 50+ years, but all of them have no segmentation ground truth. 47 health subjects without explicit eye diseases were randomly chosen, from which the graph-search based OCT-explorer 3.8  generated initial segmentation result, and then an experienced ophthalmologist manually corrected all 47 segmentation result as the ground truth for experiments. Choosing 47 subjects only is to save expensive ophthalmologist’s correcting cost. All participants have scans on macula and optic nerve head by SD-OCT (Heidelberg Engineering, Inc., Germany) with a pixel resolution of 3.87 in the height (-axis) direction. Each volume has scan composing of 31 single lines on the * field centered on the macula. Horizontal area of scan was reduced to centered on the macula to remove the optic disc region. This experiment used a fixed to generate the Gaussian ground truth, and used gaussian and pepper&salt noises for data augmentation. A 10-fold cross-validation were performed to evaluate our method: 8 folds for training, 1 fold for validation, and 1 fold for testing. The mean absolute surface distances (MASDs) for each sought surface over the testing results on all 47 scans are shown in Table 1. Sample segmentation results are illustrated in Fig. 1(a).
Public JHU OCT Dataset. The public JHU retinal OCT data set  includes 35 human retina scans acquired on a Heidelberg Spectralis SD-OCT system, of which 14 are healthy controls (HC) and 21 have a diagnosis of multiple sclerosis (MS). Each patient has 49 B-scans with pixel size 4961024, and 9 ground truth surfaces on each B-Scan. The -axial resolution in each A-scan is 3.9
. The original images were manually delineated with 21 control points on each surface, and then a cubic interpolation was performed on each B-scan to obtain the ground truth by a Matlab script. Each B-scan was cropped to keep the center 128 rows to from a 1281024 image.
The same data configuration and image input as in  for training (6 HCs and 9 MS subjects) and testing (the remaining 20 subjects) were adopted in our experiment. A fixed was used to generate Gaussian ground truth. Gaussian and pepper&salt noises were used for data augmentation. The MASDs for the proposed and He et al.’s methods are shown in Table 2. While marginally improving the MASDs, our method demonstrates to be much more robust over the state-of-the-art He et al.’s method  with an improvement of 11.5% on the standard deviation. Please refer to the supplementary material for the ablation experiments on this data set.
3.2 IVUS Vessel Wall Segmentation
The data used for this experiment was obtained from the standardized evaluation of IVUS image segmentation database . In this experiment, the data set B was used. This dataset consists of 435 images with a size of , as well as the respective expert manual tracings of lumen and media surfaces. The pixel size is 0.0260.026. It comprises two groups - a training set (109 slices) and a testing set (326 slices). The experiment with the proposed method was conducted in conformance with the directives provided for the IVUS challenge. In our experiment, we randomly split the 109 training slices into 100 slices for training and 9 slices for validation. Each slice was transformed to be represented in the polar coordinate system with a size of . Jaccard Measure (JM), Percentage of Area Difference (PAD) and Hausdroff Distance (HD) are utilized to evaluate segmentation accuracy, which are calculated using a Matlab script published in IVUS Challenge . The results are summarized in Table 3 comparing to the state-of-the-art automated methods. Sample segmentation results are illustrated in Fig. 1(b).
In this paper, a novel DL segmentation framework for multiple interacting surfaces is proposed with end-to-end training. The globally optimal solutions are achieved by seamlessly integrating two DL networks: one for surface cost parameterization with a Gaussian model and the other for total surface cost minimization while explicitly enforcing the surface mutual interaction constrains. The effectiveness of the proposed method was demonstrated on SD-OCT retinal layer segmentation and IVUS vessel wall segmentation. Though all our experiments were conducted on 2D, the method is ready for applications in 3D.
-  Optimal surface segmentation with convex priors in irregularly sampled space. Medical Image Analysis volume 54, May 2019.Elsevier.. External Links: Cited by: Table 3.
-  (2017) OptNet: differentiable optimization as a layer in neural networks. CoRR abs/1703.00443. External Links: Cited by: §0.A.3, §1, §2.2.
-  (2014) Standardized evaluation methodology and reference database for evaluating ivus image segmentation. Comput Med Imaging Graph 38 (2), pp. 70–90. Cited by: §3.2.
-  Intravascular ultrasound image segmentation: a three-dimensional fast-marching method based on gray level distributions. IEEE Transactions on Medical Imaging vol. 25, no. 5, pp. 590-601, May 2006. External Links: Cited by: §1.
-  Topology guaranteed segmentation of the human retina from oct using convolutional neural networks. arXiv:1803.05120.. External Links: Cited by: Table 2.
-  (2016) Identity mappings in deep residual networks. pp. 630–645. External Links: Cited by: §2.2.
-  (2018) Retinal layer parcellation of optical coherence tomography images: data resource for multiple sclerosis and healthy controls.. Data Brief. External Links: Cited by: §3.1.
-  (2019) Fully convolutional boundary regression for retina oct segmentation. pp. 120–128. External Links: Cited by: Appendix 0.B, §1, §3.1, §3.1, Table 2, Table 3.
-  (2015) Distilling the knowledge in a neural network. In Deep Learning and Representation Learning Workshop at NIPS 2014, Cited by: §2.3.
-  The iowa institute for biomedial imaging team External Links: Cited by: §3.1, Table 1.
-  Retinal layer segmentation of macular oct images using boundary classification. Biomed. Opt. Express 4(7), 1133–1152 (2013). External Links: Cited by: Table 2.
-  (2006) Optimal surface segmentation in volumetric images-a graph-theoretic approach. IEEE Trans Pattern Anal Mach Intell 28 (1), pp. 119–134. Cited by: §1, §2.1.
-  (2017) A survey on deep learning in medical image analysis. Medical image analysis 42, pp. 60–88. Cited by: §1.
-  (2015) Fully convolutional networks for semantic segmentation. In CVPR 2015, pp. 3431–3440. Cited by: §1.
-  (2014) LOGISMOS-b: layered optimal graph image segmentation of multiple objects and surfaces for the brain. IEEE transactions on medical imaging 33 (6), pp. 1220–1235. Cited by: §2.
-  (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. Cited by: §2.3.
-  (NEURIPS2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32. External Links: Cited by: §0.A.2.
-  A review of algorithms for segmentation of optical coherence tomography from retina. Journal of medical signals and sensors 3(1), 45–60, 2013. External Links: Cited by: §1.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI 2015, pp. 234–241. Cited by: §1, §2.2.
-  Relaynet: retinal layer and fluid segmentation of macular optical coherence tomography using fully convolutional networks. Biomed. Opt. Express 8(8), 3627–3642 (2017). External Links: Cited by: Table 2.
-  (2019) Optimal surface segmentation with convex priors in irregularly sampled space. Med. Image Anal.. Cited by: §1, §2.1.
-  (2018) Multiple surface segmentation using convolution neural nets: application to retinal layer segmentation in oct images. Biomed. Opt. Express 9 (9), pp. 4509–4526. Cited by: §1.
-  (2017) Deep learning in medical image analysis. Annual review of biomedical engineering 19, pp. 221–248. Cited by: §1.
-  (2012) Optimal multiple surface segmentation with shape and context priors. IEEE transactions on medical imaging 32 (2), pp. 376–386. Cited by: §1, §2.1.
-  (2017) Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. CoRR abs/1707.03237. External Links: Cited by: §2.3.
-  The uninversity of iowaThe uninversity of iowa External Links: Cited by: Globally Optimal Segmentation of Mutually Interacting Surfaces using Deep Learning, §0.A.2, §2.2.
-  (2010) LOGISMOS—layered optimal graph image segmentation of multiple objects and surfaces: cartilage segmentation in the knee joint. IEEE transactions on medical imaging 29 (12), pp. 2023–2037. Cited by: §2.
-  (2019) 3-d surface segmentation meets conditional random fields. arXiv preprint arXiv:1906.04714. Cited by: §2.
Appendix 0.A IPM Optimization
The 3D theoretical model in the previous problem formulation section needs to convert into a column model to implement. In this 3D theoretical model, the , , and all have a same size of
, a 3D tensor. The implementation of this proposed segmentation framework in 3D is computationally demanding and memory hungry. In our experiments, we implemented it with 2D input slice. Multi-surfaces in a 2-D image can be viewed a series of smooth curves. A feasible surface intersects with each image column (-axis in 3D, or H dimension in 2D) exactly once. The space between 2 adjacent surfaces and boundary is defined as region, and surfaces should not cross each other. No surface should cross each other is, in mathematics, where indicating different surface index along the H dimension, and N is the number surface, is the optimized surfaces location. An implemented math model is to build an optimization model on each column along -axis, and then to parallelize this column model on GPU for all columns to achieve the original 3D model.
In our following notation, bold characters or numbers indicate vector or matrix, Diag means diagonal matrix.expresses N surfaces along a column , similar convention also for , , etc.
in previous problem formulation section expresses gap width between any 2 surfaces. It is enough in implementation to consider a simpler constraint of gap width between any 2 adjacent surfaces, like the A in formula 4, which reduces the number of constraints from to .
where A is a constant matrix with dimension . We also define a matrix , where B is also a constant matrix with dimension of .
In general image format of implementation, -axis points downward, so a constraint condition can express as , where both and are a non-positive value vector of size . can further express as , where is a vector of size .
Now we define our implemented column model. We expect final surfaces location satisfying constraints, and has minimal deviation sum from initial prediction with confidence index which are come from image feature analysis. In other words, under the constraint of , may deviate a little bigger from when is big; while can deviate a little smaller from when is small. Therefore, an matrix form constrained optimization model (column model) is constructed like below:
where , , , expresses the region gap range along a column, and expresses final optimized solution.
is a special constraint case of more general case. Following subsections A.1 and A.2 first consider above simple constrained case , and subsection A.3 considers more general constraint case .
0.a.1 IPM Forward Optimization for
In order to solve above constrained convex optimization problem formula 5a with constraint , we write it into a Lagrangian form:
Its corresponding perturbed KKT conditions are like below:
where is a perturbed slackness scalar and , and in equation 8. Bigger means smaller dual gap between original model function 5 and the Lagrangian formula 6 ; and and indicate the optimal solution for the Lagrangian . Equations 7 and 8 give very important relations between and , and between and , which can be utilized in the back-propagation of the big deep learning optimization when and are optimal solutions.
where . When , and get their optimal solutions , .
Using Newton iteration method to find the root of , we can get the iterative optimization formula in the small IPM forward optimization like below:
where is an iterative step length. And let
where is a Jacobian matrix of with respect to . After IPM forward iteration ends, will save for reuse in backward propagation of big deep learning optimization, which saves expensive inverse computation of matrix of a size of .
Therefore, the IPM iterative formula 12 may further express as:
where and express the improving direction of and , and .
Newton iterative method guarantees the stationary of the solution, but it does not guarantee the feasibility of solution. The another core idea of IPM is that finding optimal solution starts from an interior feasible point, uses Newton method find iterative improving direction, and then uses linear search to find a proper step to make sure a new iterative point is still in the feasible domain and at same time reduces the norm of residual . Therefore, in this each step of linear search process, algorithm needs to make sure equations 9 and 10 hold. The detailed algorithm of forward IPM Iteration is illustrated in Algorithm 1.
As the Newton iterative method requires an initial point nearby its final goal root, in Algorithm 1 a parallel (Largest Increasing Sub-sequence algorithm) is used to find most matching initial surface locations from the initial prediction , and then fills the non-largest increasing sub-sequence points with its neighbor value to make an initial is feasible. As the dual gap between original cost function and Lagrangian is less than , and deduced from the perturbed complementary slackness equation 8, this algorithm gradually enlarges t by using to reduce dual gap, in order to get more accurate optimal solution to original cost function. In order to avoid when , choose to make sure .
0.a.2 IPM Backward Propagation for
When IPM forward iteration converges in the small IPM optimization process, we get . Its matrix form is
Using total differential with respect to variables , , , and , we get:
Using formula 13 to replace above left-most matrix, we get
Above full differential equation continue to convert to partial differential equations, we get
In back-propagation of the big deep learning optimization, the backward input to IPM optimization module is , where L means loss in the deep learning network. We define 2 variables and like below (Notes: , and ):
Transpose above equation, we get
Equations 22 and 23 are exactly what we want for back propagation loss gradient with respect to and in the big deep learning optimization. Naturally, back propagation algorithm for IPM module in the big deep learning optimization is like algorithm 2:
We used Pytorch  1.3.1 implementing this IPM forward and backward propagation algorithm with batch-supported GPU parallelism in Ubuntu Linux. In our practice, we found at most 7 IPM iterations can achieve enough accuracy, with , , , . In our implementation, we use pseudo inverse to replace normal inverse in computing only when is a singular matrix sometimes.
In order to facilitate further research and comparison on surface segmentation field, all our code including data pre-processing, core parallel IPM code, framework code, experiment configure(yaml) are public at .
0.a.3 IPM Optimization for
In order to solve above constrained convex optimization problem formula 5a with constraint , we write it into a Lagrangian form:
where , , .
Its corresponding perturbed KKT conditions are like below:
where is a perturbed slackness scalar and , and in equation 26. Bigger means smaller dual gap between original model function 5 and the Lagrangian formula 24 ; and and indicate the optimal solution for the Lagrangian . Equations 25 and 26 also give very important relations between and , between and , and between and , which can be utilized in the back-propagation of the big deep learning optimization when and are optimal solutions.
where . When , and get their optimal solutions , .
Following same deduction process in section A.1 and A.2, we can get IPM forward and backward formula. In application is different in each column , and it is general a learning vector from previous layers in a network, so following formula deduction needs to compute with a similar deduction process of , and then maps back to and . Readers also can refer OptNet  for its following deduction.
Appendix 0.B Ablation Experiments
We did 7 ablation experiments on public JHU OCT data to verify our architecture design choices. Ablation experiment results are reported at table LABEL:AblationExpResult. Ablation experiment 1 shows SmoothLoss improved both mean error and variance, and Fig. 2(a) shows the explicit visual difference between with and without SmoothLoss. Ablation experiment 2 shows that adding gradient channels in input improved both mean error and variance, and Fig. 2(b) is its visual result. Ablation 3 shows the value of dynamic sigma, which increases error variance, but does not hurt mean performance; dynamic sigma may reduce the training effort of finding a proper fixed . Ablation 4 shows weighted surface cross entropy is a little better than paper  of general cross entropy, but not better than our new suggested weighted divergence Loss. Ablation 5 shows that the referred surface from RegionConv will reduce error variance at 13%, which makes sense that when network training enters stable stage, its region segmentation information is very helpful to prevent big surface errors. Ablation experiment 6 shows that general KLDiv Loss is worse than weighted divergence loss as weighted divergence loss more cares explicit gradient change points; a visual example is at Fig 2(c). Ablation experiment 7 and its visual Fig. 2(d) shows IPM model can improve surface dislocations. The prediction result of ablation experiment 7 without IPM optimization module has 361 pixels distributed in 10 slices of test set of violating separation constraints, which shows that network without IPM module can not learn implicit constraints in the ground truth, and IPM module has the capability of conforming constraints, reducing error, and reducing standard deviation (20%).