1 Introduction
Biomedical semantic segmentation, which predicts the class of each pixel in an image, is important for both preoperative surgical planning and intraoperative surgical navigation. For example, 3D prostate segmentation from Magnetic Resonance Imaging (MRI) volumes is important for diagnosis through volume assessment and for treatment planning through boundary estimation [1]. Intraoperative 2D segmentation of stent graft markers from fluoroscopy images is important for instantiating the 3D shape of a stent graft and hence for navigating Fenestrated Endovascular Aortic Repair (FEVAR) [2]. Conventional methods are based on ad hoc, expert-designed feature extractors and classifiers, which are insufficient. Recently, most segmentation methods have been based on Deep Convolutional Neural Networks (DCNNs), which have shown promising performance on many vision tasks including image classification [3], object detection [4], and semantic segmentation [5]. In a DCNN, features are extracted and classified automatically by training multiple non-linear modules [6]. Unlike traditional fully connected neural networks, where each output node is linked to all input nodes, an output node of a DCNN links only to regional input nodes, known as the receptive field. Multiple convolutional layers, as shown in Fig. 1a, and downsampling layers, i.e. the pooling layer shown in Fig. 1b, are necessarily cascaded to achieve a large receptive field, which is essential for extracting and classifying abstract features and semantic information. However, as a result, the feature map is downsampled as well. A DCNN with downsampling layers is suitable for image-level tasks, e.g. image classification, but is insufficient for pixel-level tasks, e.g. image semantic segmentation.
There has been previous research on compensating for the decreasing dimension of feature maps. Upsampling is a widely-used method. For example, deconvolutional layers and non-linear upsampling are used respectively in the Fully Convolutional Neural Network (FCNN) [7] and SegNet [8] to recover the downsampled feature map to the input image size. An alternative solution is to use atrous convolution [5], also known as dilated convolution [9], to replace the downsampling layer for increasing the receptive field. Atrous convolution inserts zeros between non-zero filter taps to sample the feature map, as shown in Fig. 1c. It increases the receptive field with the atrous rate but maintains the spatial dimension, without increasing the computational complexity. However, applying atrous convolution causes two accompanying problems, high memory consumption and node missing, whose details are introduced below; these problems prevent the wide application of atrous convolution. To the authors' knowledge, there has not yet been a DCNN with pure atrous convolution for high-resolution biomedical semantic segmentation.
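The zero-insertion described above can be illustrated with a minimal 1D sketch (an illustrative aid, not the paper's implementation): zeros are inserted between the kernel taps, and 'same' padding keeps the output at the input's spatial size while the receptive field grows with the atrous rate.

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1D atrous convolution: insert (rate - 1) zeros between kernel taps,
    then convolve with 'same' padding so the spatial size is preserved."""
    w = np.asarray(w, dtype=float)
    k = np.zeros((len(w) - 1) * rate + 1)
    k[::rate] = w  # non-zero taps spaced by the atrous rate
    return np.convolve(x, k, mode="same")
```

For example, a kernel of size 3 with rate 2 spans 5 input nodes (receptive field 5), yet the output length equals the input length.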
Memory shortage is the first challenge for applying atrous convolution, as high-resolution feature map propagation consumes a large amount of memory while the memory of a worker, i.e. a GPU, is limited. In previous work, atrous convolution was usually applied jointly with downsampling layers as a trade-off between accuracy and memory. For example, in Deeplab [5], firstly, a feature map at 1/8 the size of the input image was extracted by multiple convolutional and downsampling layers; secondly, feature maps with a larger receptive field but the same spatial size were calculated by multiple atrous convolutional layers; lastly, bilinear interpolation was utilized to recover the spatial information of the downsampled feature maps, while a conditional random field was used to refine the predicted pixel-level probability. In multi-scale context aggregation [9], a downsampled feature map was firstly extracted from the input image by a front-end module, then a context module with seven atrous convolutional layers was applied to extract features with a larger receptive field at the same dimensional size. Similar joint usage can also be found in [10].
Setting the atrous rates is another challenge when applying atrous convolution, as the output node only links to input nodes which align with non-zero filter taps, as shown in Fig. 1c. The input nodes which align with zero filter taps are not considered. There is no standard for setting the atrous rates yet; for example, atrous rates of (1, 1, 2, 4, 8, 16, 1) were allocated in [9]
following the strides of the max-pooling layers in FCNN. Wang et al. observed that an atrous rate setting of (2, 4, 8) would cause gridding effects (some input nodes are missed) and proposed hybrid atrous rates, e.g. (1, 2, 5, 9), to guarantee covering all input nodes
[10]. Atrous rates of (6, 12, 18) were set across blocks and atrous rates of (1, 2, 4) were set inside each block in [11] based on experimental indications.
In this paper, we propose a dimensionally lossless DCNN in which the spatial dimension of the intermediate feature maps is the same as that of the input image. A similar work can be found in [12]; however, the spatial dimension of the intermediate feature maps at its residual stream is still smaller than that of the input image. To be dimensionally lossless, the proposed network needs to: 1) achieve the largest receptive field with as few atrous convolutional layers as possible, to save memory; 2) fully cover the receptive field without missing any input node. In the following sections, firstly, we prove in Sec. 2.1 that an atrous rate setting of $K^{l-1}$ at the $l$-th atrous convolutional layer, where $K$ is the kernel size and $l$ is the sequence number of the atrous convolutional layer, achieves the largest and fully-covered receptive field with a minimum number of atrous convolutional layers. Then six atrous blocks, three shortcut connections and four normalization methods are explored in Sec. 2.2.1, Sec. 2.2.2 and Sec. 2.2.3 respectively to select the optimal atrous block, shortcut connection and normalization method through experimental indications. Finally, a dimensionally lossless DCNN, the Atrous Convolutional Neural Network (ACNN), is proposed in Sec. 2.2.4, using multiple cascaded atrous II-blocks, residual learning and Fine Group Normalization (FGN). The RV, LV and aorta are used to validate the proposed method, with data collection shown in Sec. 2.3 and results shown in Sec. 3. UNet [13], optimized UNet [14] and a hybrid network similar to [5] are used as the comparison methods in this paper.
Using far fewer trainable parameters, the proposed ACNN achieves comparable segmentation Dice Similarity Coefficients (DSCs) on the three validation datasets, which indicates the benefit of dimensionally lossless feature maps. Discussions and conclusions are stated in Sec. 4 and Sec. 5 respectively.
2 Methodology
The proposed atrous rate setting is proved in Sec. 2.1. The six atrous blocks, three shortcut connections and four normalization methods are stated in Sec. 2.2.1, Sec. 2.2.2 and Sec. 2.2.3 respectively. Details about the proposed ACNN are illustrated in Sec. 2.2.4. Data collection of the RV, LV and aorta and the experimental setup are illustrated in Sec. 2.3.
2.1 Atrous Rate Setting
In this section, we focus on finding the atrous rate setting which achieves the largest and fully-covered receptive field with a minimum number of atrous convolutional layers. Before the mathematical derivation, three 1D receptive field examples with three different atrous rate settings are shown intuitively in Fig. 2. In this three-layer network with kernel size 3, atrous rates of (1, 2, 4) achieve a receptive field of 15, while atrous rates of (1, 2, 9) achieve a receptive field of 25 with a coverage ratio (the ratio of linked nodes over all nodes in the receptive field) of 0.84. With the proposed rates of (1, 3, 9), the largest receptive field of 27 is achieved with full coverage, where the coverage ratio is 1.0. Detailed mathematical proofs are presented below.
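The three 1D examples above can be verified numerically. The sketch below (an illustrative aid, assuming kernel size 3) counts the number of paths from the output node to each input node by convolving a unit impulse with all-ones dilated kernels; the support width gives the receptive field size and the fraction of non-zero entries gives the coverage ratio.

```python
import numpy as np

def dilated_ones_kernel(K, r):
    # kernel of size K with atrous rate r: ones at the taps, zeros between
    kern = np.zeros((K - 1) * r + 1, dtype=int)
    kern[::r] = 1
    return kern

def receptive_field_stats(rates, K=3):
    # propagate a unit impulse through all-ones dilated kernels; the result
    # counts paths from the output node back to each input node
    p = np.array([1], dtype=int)
    for r in rates:
        p = np.convolve(p, dilated_ones_kernel(K, r))
    rf_size = len(p)                                  # receptive field size
    coverage = np.count_nonzero(p) / rf_size          # coverage ratio
    return rf_size, coverage
```

Running this for the rates (1, 2, 4), (1, 2, 9) and (1, 3, 9) reproduces the receptive fields 15, 25 and 27 with coverage ratios 1.0, 0.84 and 1.0.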
With an input feature map $F^{l-1}$ of size $H \times W$, an output feature map $F^{l}$ of size $H \times W$ is calculated by the $l$-th convolutional layer with an atrous rate $r_l$, where $l \in \{1, 2, \dots, N\}$ and $N$ is the total number of atrous convolutional layers. Here $H$ is the feature height and $W$ is the feature width, though these two values are usually equal. The channel number of the feature maps is denoted as $C$, and $F^{0}$ is the input image. If we ignore the non-linear modules, i.e. ReLU, and the biases, an equivalent 2D atrous convolution achieves a backward propagation from $F^{N}$ to $F^{0}$, which can be decomposed into two 1D atrous convolutions [15]. The 1D atrous kernel $\hat{w}^{l}$, indexed by $t$, is the kernel $w^{l}$ with zeros inserted between its taps:
$$\hat{w}^{l}[t] = \sum_{k=-(K-1)/2}^{(K-1)/2} w^{l}[k]\, \mathbb{1}(t = k\, r_l) \tag{1}$$
Here, $K$ is an odd number which represents the kernel size, i.e. 3, 5, or 7, and $t$ is the pixel index. $w^{l}[k]$, an element of the weight vector $w^{l}$, is a trainable variable. $\mathbb{1}(\cdot)$ is an indicator function defined as:
$$\mathbb{1}(x) = \begin{cases} 1, & \text{if } x \text{ is true} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
Denote the vectors $f^{0}$ and $f^{l}$ as the 1D input image and the 1D feature map, both indexed by $i$. $f^{l}$ can be calculated from $f^{l-1}$ by:
$$f^{l}[i] = \sum_{k=-(K-1)/2}^{(K-1)/2} w^{l}[k]\, f^{l-1}[i + k\, r_l] \tag{3}$$
Define the unit impulse $\delta$, in which only the central pixel has a non-zero value (=1). It is calculated as:
$$\delta[i] = \mathbb{1}(i = 0) \tag{4}$$
Set $w^{l} = \mathbf{1}$, vectors consisting of 1; then $p = \delta * \hat{w}^{1} * \cdots * \hat{w}^{N}$, with the element $p[i]$ indexed by $i$, is the path number from the output node to the input image's pixel or node $i$. Thus, the support of $p$ represents the receptive field of the output node, where the receptive field coverage $c$ is the number of non-zero elements in the vector $p$:
$$c = \sum_{i} \mathbb{1}(p[i] \neq 0) \tag{5}$$
and the receptive field size $s$ is calculated as:
$$s = 1 + \sum_{l=1}^{N} (K-1)\, r_l \tag{6}$$
The receptive field coverage ratio, denoted by $\rho$, is then defined as:
$$\rho = \frac{c}{s} \tag{7}$$
For guaranteeing a fully-covered receptive field from an output pixel or node, our target is to maximize the receptive field size under a constraint on the coverage ratio:
$$\max_{r_1, \dots, r_N}\; s \quad \text{s.t.} \quad \rho = 1 \tag{8}$$
Substituting (6) and (7) into (8), the optimization problem can be converted to (9):
$$\max_{r_1, \dots, r_N}\; 1 + \sum_{l=1}^{N} (K-1)\, r_l \quad \text{s.t.} \quad c = 1 + \sum_{l=1}^{N} (K-1)\, r_l \tag{9}$$
The total path number from the output node to the input image is represented by:
$$\sum_{i} p[i] = K^{N} \tag{10}$$
where $K^{N}$ represents an exponent calculation. It is the upper bound of $c$ because:
$$c = \sum_{i} \mathbb{1}(p[i] \neq 0) \le \sum_{i} p[i] = K^{N} \tag{11}$$
where equality holds if and only if:
$$p[i] \in \{0, 1\}, \quad \forall i \tag{12}$$
We assume that (12) holds and substitute it into the constraint of (9):
$$1 + \sum_{l=1}^{N} (K-1)\, r_l = K^{N} \tag{13}$$
This is a sum of geometric progression, and one solution can be obtained as:
$$r_l = K^{l-1} \tag{14}$$
It satisfies a uniformly covered receptive field, $p[i] = 1$ for every node $i$ in the receptive field in 1D (and the same holds in 2D), which satisfies the equivalent condition in (12) and is thus a solution to (9). Therefore, the atrous rate setting of $K^{l-1}$ at the $l$-th layer leads to the largest and fully-covered receptive field when the same number of atrous convolutional layers is used.
2.2 Atrous Convolutional Neural Network
With the proof in Sec. 2.1, a receptive field of $K^{N}$ can be achieved by a block of $N$ atrous convolutional layers, and each node in the receptive field is linked evenly. In this paper, the kernel size of the atrous convolutional layers is 3, following the setting in [16]; a block of $N$ atrous convolutional layers thus has a receptive field of $3^{N}$. We call this block an atrous block, and call the one with $N$ atrous convolutional layers an N-block, where N is expressed in Roman numerals.
The proposed ACNN is designed as multiple cascaded atrous blocks, which increase the receptive field linearly by $3^{N} - 1$ per block. For achieving whole-image coverage (the image size is usually 256 or 512), multiple blocks need to be cascaded. For solving the gradient vanishing/exploding problems and facilitating back-propagation, shortcut connections, i.e. residual learning, identity mapping and dense connections, and normalization methods, i.e. Batch Normalization (BN), Layer Normalization (LN), Instance Normalization (IN) and Group Normalization (GN), are explored. More details are discussed below.
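The linear growth of the receptive field can be sanity-checked with a short calculation (a sketch under the stated assumptions: kernel size 3, and each cascaded block contributing $3^{N} - 1$ on top of the initial node):

```python
def cascaded_receptive_field(num_blocks, layers_per_block, kernel=3):
    # each atrous block has receptive field kernel**layers_per_block;
    # cascading B such blocks yields B * (kernel**layers_per_block - 1) + 1
    return num_blocks * (kernel ** layers_per_block - 1) + 1
```

For example, 32 cascaded II-blocks give a receptive field of 257, covering a 256-wide image, while 64 II-blocks give 513, covering a 512-wide image.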
2.2.1 Atrous Block
As the test image size in this paper is 512 or 256, six atrous blocks (I-block, II-block, III-block, IV-block, V-block, VI-block) are explored, with receptive fields of 3, 9, 27, 81, 243 and 729 respectively, as shown in Fig. 3. The feature channel number of each intermediate feature map is the same.
The optimal atrous block is determined by experiments, as shown in Sec. 3.1. Here, we state and use the conclusion first: the II-block is the optimal atrous block and is used in the following context.
2.2.2 Shortcut Connection
Shortcut connections are essential for DCNNs due to the degradation and gradient vanishing/exploding problems [17]. Three popular shortcut connections are explored in this paper: 1) residual learning [17], 2) identity mapping [18], 3) dense connection [19], as shown in Fig. 4. In residual learning, the normalization and ReLU are placed after the atrous convolution, and the input $x$ is added onto the output $F(x)$. In identity mapping, the normalization and ReLU are placed before the atrous convolution, and $x$ is added onto $F(x)$. In the dense connection, the normalization and ReLU are placed before the atrous convolution, and $x$ is concatenated onto $F(x)$. As the feature maps are at high resolution and the layer number is very large in this paper (64 layers for the RV and LV experiments and 128 layers for the aorta experiments), a fully dense connection could not be applied due to its extremely high memory consumption. Instead, a dense connection is placed after every M identity II-blocks (M is 16 for the RV and LV experiments and 32 for the aorta experiments), as shown in Fig. 4c. This grouped dense connection is called the dense4 connection, as the atrous convolutional layers are divided into 4 groups.
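The three orderings can be summarized structurally (a simplified numpy sketch for illustration: `conv` is a placeholder for the atrous convolution and `norm` is a stand-in whole-tensor normalization, both assumptions rather than the paper's exact layers):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def norm(x, eps=1e-5):
    # stand-in normalization over the whole tensor
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def residual_block(x, conv):
    # residual learning: normalization and ReLU after the convolution
    return x + relu(norm(conv(x)))

def identity_block(x, conv):
    # identity mapping: normalization and ReLU before the convolution
    return x + conv(relu(norm(x)))

def dense_block(x, conv):
    # dense connection: concatenate the input with the transformed features
    return np.concatenate([x, conv(relu(norm(x)))], axis=0)
```

Note that the additive shortcuts preserve the feature map shape, while the dense connection grows it, which is why the fully dense variant is the most memory-hungry.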
The optimal shortcut connection is determined by experiments, as shown in Sec. 3.2. Here, we state and use the conclusion first: residual learning is the optimal shortcut connection and is used in the following context.
2.2.3 Normalization Method
Normalization is essential for solving the internal covariate shift and gradient vanishing/exploding problems [20]. Firstly, the feature map $F$ is divided into multiple groups. Based on the different division methods, the four most popular normalization methods in biomedical semantic segmentation are BN [20], which takes each channel (across the batch) as a group; IN [21], which takes each channel of each batch element as a group; LN [22], which takes each batch element (across all channels) as a group; and GN [23], which takes multiple channels of each batch element as a group. The mean and variance of each group are calculated as:
$$\mu_g = \frac{1}{m} \sum_{x \in \mathcal{G}_g} x \tag{15}$$
$$\sigma_g^2 = \frac{1}{m} \sum_{x \in \mathcal{G}_g} (x - \mu_g)^2 \tag{16}$$
Here, $x$ is an element in the feature map group $\mathcal{G}_g$ and $m$ is the total number of elements in the group. A systematic review and detailed group subdivision of these four normalization methods in biomedical semantic segmentation with the UNet structure can be found in [14].
These grouped means and variances are then used to normalize the feature groups to a mean of 0.0 and a variance of 1.0:
$$\hat{x} = \frac{x - \mu_g}{\sqrt{\sigma_g^2 + \epsilon}} \tag{17}$$
Here $\epsilon$ is a small value for numerical stability in the division. Finally, two additional parameters $\gamma$ and $\beta$ are applied on each feature channel to recover the representation ability of the DCNN:
$$y = \gamma \hat{x} + \beta \tag{18}$$
In this paper, a batch size of 1 is mainly explored, as it was shown that a batch size of 1 outperformed larger batch sizes for biomedical semantic segmentation [14]. For the proposed ACNN, where the feature channel number is the same for all intermediate feature maps, BN and IN are the same as Fine Group Normalization (FGN), which sets the group number of GN equal to the feature channel number, when the batch size is 1. Hence FGN (which here also represents BN and IN), GN4 (which sets the group number of GN as 4) and LN are explored for the subdivision of the feature maps, as shown in Fig. 5.
During inference, one way to apply BN, IN, LN and GN is to use the mean and variance of the current testing batch to normalize the testing feature maps; BN in this mode is called BN-train in this paper. There is an additional way to apply BN: use the moving-average mean and variance of the training feature maps to normalize the testing feature maps. BN in this mode is called BN-infer and is also explored in this paper.
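The batch-size-1 equivalence can be checked directly. Below is an illustrative numpy sketch of group normalization (the per-channel $\gamma$/$\beta$ of (18) are omitted for brevity): setting the group number equal to the channel number, as FGN does, reproduces per-channel, per-sample normalization, i.e. IN.

```python
import numpy as np

def group_norm(x, groups, eps=1e-5):
    # x: (N, C, H, W); each group of C // groups channels is normalized
    # per sample (gamma/beta omitted for brevity)
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

def instance_norm(x, eps=1e-5):
    # per-channel, per-sample normalization
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

With a batch size of 1 and 16 channels, `group_norm(x, 16)` and `instance_norm(x)` coincide, which is the equivalence exploited by FGN in this paper.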
The optimal normalization method is selected based on the experimental results, as shown in Sec. 3.3. Here, we state and use the conclusion first: FGN is the optimal normalization method and is used in the following context. FGN was also shown to be the optimal normalization method when using the UNet structure for biomedical semantic segmentation [14]. For the UNet structure, FGN is different from BN and IN, as the feature channel number changes inside the DCNN. For linking to the work in [14] and for better generalizability, the name FGN is used in this paper rather than IN or BN.
2.2.4 ACNN Architecture
The final proposed ACNN architecture is shown in Fig. 6. Multiple atrous II-blocks with residual learning and FGN normalization are stacked. The number of residual II-blocks is determined by the input image size, i.e. 32 for a 256×256 image and 64 for a 512×512 image.
2.3 Data Collection and Experimental Setup
Three datasets: the RV, LV and aorta are collected and used for the validation.
Right ventricle
37 subjects with different levels of Hypertrophic Cardiomyopathy (HCM) were scanned with a 1.5T MRI scanner (Sonata, Siemens, Erlangen, Germany) [24], collecting 6082 images. Analyze (AnalyzeDirect, Inc., Overland Park, KS, USA) was used to label the ground truth. Rotation was used to augment the images. Three groups, with 12, 12, and 13 subjects respectively, were split randomly from the 37 subjects for cross validation.
Left ventricle
45 subjects from the Sunnybrook dataset [25] were scanned with MRI, collecting 805 images. Rotation was used to augment the images. Three groups, with 15 subjects each, were split randomly from the 45 subjects for cross validation.
Aorta
20 subjects from the VISCERAL dataset [26] were scanned with Computed Tomography (CT), collecting 4631 images. Rotation was used to augment the images. Three groups, with 7, 7, and 6 subjects respectively, were split randomly from the 20 subjects for cross validation.
Image intensities were normalized. Evaluation images were not split. For cross validation, two groups were used in the training stage while the remaining group was used in the testing stage. The kernel size of the last atrous convolutional layer is 1, while the kernel size of all other atrous convolutional layers is 3. All intermediate feature maps have a feature channel number of 16. The momentum was set to 0.9. We follow the learning schedule in [14]: two epochs were trained and the learning rate was divided by 5 at the second epoch. Five initial learning rates (1.5, 1.0, 0.5, 0.1, 0.05) were tested and the highest accuracy is shown in this paper as the experimental result. Stochastic Gradient Descent (SGD) was utilized as the optimizer.
The Dice Similarity Coefficient (DSC) was used to evaluate the segmentation accuracy:
$$\mathrm{DSC} = \frac{2\,|G \cap P|}{|G| + |P|} \tag{19}$$
where $G$ is the ground truth and $P$ is the class prediction after softmax. The DSC of the foreground is selected to represent the segmentation accuracy, as the DSC of the background follows the same trend as that of the foreground. The workers used were a Titan Xp (12G memory) and a 1080Ti (11G memory), with Xeon 1650 and Xeon 1620 CPUs.
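Equation (19) can be sketched as follows for binary masks (a minimal illustration; the empty-mask convention is an assumption, not specified in the paper):

```python
import numpy as np

def dice(gt, pred):
    # Dice Similarity Coefficient between two binary masks
    gt = np.asarray(gt, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    denom = gt.sum() + pred.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(gt, pred).sum() / denom
```

Identical masks score 1.0, disjoint masks score 0.0, and partial overlap falls in between.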
The code was programmed with the Tensorflow Estimator Application Programming Interface (API), tf.layers, tf.contrib.layers and tf.data. The Tensorflow version is 1.8.0. The process status of the CPU and GPU both influence the training speed, and training all models under exactly the same process status is impossible. For a fair speed comparison, the time recorded in this paper is for 100 iterations under a clear environment where all other processes are ended. The consumed memory is recorded from the output of the "watch nvidia-smi" command. The parameter amount counts the weights and biases in the atrous convolutional layers and is recorded based on model.summary() in Keras.
ACNN  Test  mean±std DSC  Mem.  Time  OLR
128 I-blocks  RV1  0.6588±0.3333  2.55G  14.5s  0.5
128 I-blocks  RV2  0.7253±0.2846  2.55G  14.5s  1.5
128 I-blocks  RV3  0.6653±0.3261  2.55G  14.5s  0.5
32 II-blocks  RV1  0.6425±0.3476  1.65G  9.4s  0.5
32 II-blocks  RV2  0.7169±0.2812  1.65G  9.4s  0.5
32 II-blocks  RV3  0.6825±0.3265  1.65G  9.4s  0.5
10 III-blocks  RV1  0.6688±0.3284  1.67G  6.3s  0.05
10 III-blocks  RV2  0.7167±0.2831  1.67G  6.3s  0.05
10 III-blocks  RV3  0.6462±0.3344  1.67G  6.3s  0.5
3 IV-blocks  RV1  0.6470±0.3265  1.96G  4.6s  0.05
3 IV-blocks  RV2  0.6685±0.3174  1.96G  4.6s  0.05
3 IV-blocks  RV3  0.6208±0.3281  1.96G  4.6s  0.1
1 V-block  RV1  0.6007±0.3278  8.59G  3.5s  0.1
1 V-block  RV2  0.6442±0.3205  8.59G  3.5s  0.05
1 V-block  RV3  0.5863±0.3556  8.59G  3.5s  1.0
ACNN  Test  mean±std DSC  Mem.  Time  OLR
128 I-blocks  LV1  0.8807±0.1831  2.55G  14.5s  1.5
128 I-blocks  LV2  0.8056±0.2467  2.55G  14.5s  0.1
128 I-blocks  LV3  0.7909±0.2451  2.55G  14.5s  1.5
32 II-blocks  LV1  0.9155±0.1107  1.65G  9.4s  0.1
32 II-blocks  LV2  0.8590±0.1627  1.65G  9.4s  1.0
32 II-blocks  LV3  0.8186±0.2310  1.65G  9.4s  0.5
10 III-blocks  LV1  0.9118±0.1172  1.67G  6.3s  0.5
10 III-blocks  LV2  0.8721±0.1743  1.67G  6.3s  0.05
10 III-blocks  LV3  0.8008±0.2493  1.67G  6.3s  0.5
3 IV-blocks  LV1  0.8857±0.1452  1.96G  4.6s  0.1
3 IV-blocks  LV2  0.8580±0.1513  1.96G  4.6s  1.5
3 IV-blocks  LV3  0.7921±0.2407  1.96G  4.6s  0.5
1 V-block  LV1  0.8554±0.1501  8.59G  3.5s  0.1
1 V-block  LV2  0.7844±0.2204  8.59G  3.5s  0.05
1 V-block  LV3  0.7806±0.2175  8.59G  3.5s  0.1
ACNN  Test  mean±std DSC  Mem.  Time  OLR
256 I-blocks  Aorta1  0.8255±0.1833  11.72G  94.4s  0.1
256 I-blocks  Aorta2  0.7787±0.2019  11.72G  94.4s  0.05
256 I-blocks  Aorta3  0.7973±0.2185  11.72G  94.4s  1.5
64 II-blocks  Aorta1  0.8491±0.1543  9.64G  62.5s  0.1
64 II-blocks  Aorta2  0.7871±0.2095  9.64G  62.5s  0.5
64 II-blocks  Aorta3  0.8365±0.1726  9.64G  62.5s  0.1
20 III-blocks  Aorta1  0.8388±0.1731  9.72G  43.1s  0.05
20 III-blocks  Aorta2  0.7575±0.2528  9.72G  43.1s  0.5
20 III-blocks  Aorta3  0.8337±0.1915  9.72G  43.1s  0.1
6 IV-blocks  Aorta1  0.8006±0.1691  6.77G  31.8s  0.05
6 IV-blocks  Aorta2  0.7235±0.2794  6.77G  31.8s  0.05
6 IV-blocks  Aorta3  0.7828±0.2182  6.77G  31.8s  0.1
2 V-blocks  Aorta1  0.7937±0.1580  8.82G  22.2s  0.1
2 V-blocks  Aorta2  0.6998±0.2619  8.82G  22.2s  0.5
2 V-blocks  Aorta3  0.7677±0.2568  8.82G  22.2s  0.1
1 VI-block  Aorta1  0.7026±0.2256  4.98G  24.1s  1.5
1 VI-block  Aorta2  0.6564±0.2629  4.98G  24.1s  0.5
1 VI-block  Aorta3  0.7427±0.2507  4.98G  24.1s  0.05
Shortcut  Test  mean±std DSC  Mem.  Time  OLR
Residual learning  RV1  0.6755±0.3226  1.66G  9.8s  0.5
Residual learning  RV2  0.7267±0.2839  1.66G  9.8s  1.5
Residual learning  RV3  0.6823±0.3297  1.66G  9.8s  0.05
Residual learning  LV1  0.9133±0.1185  1.66G  9.8s  0.05
Residual learning  LV2  0.8712±0.1691  1.66G  9.8s  0.5
Residual learning  LV3  0.8153±0.2346  1.66G  9.8s  0.5
Residual learning  Aorta1  0.8449±0.1457  9.65G  62.6s  0.5
Residual learning  Aorta2  0.7820±0.2318  9.65G  62.6s  0.1
Residual learning  Aorta3  0.8493±0.1554  9.65G  62.6s  0.05
Identity mapping  RV1  0.6425±0.3476  1.65G  9.4s  0.5
Identity mapping  RV2  0.7169±0.2812  1.65G  9.4s  0.5
Identity mapping  RV3  0.6825±0.3265  1.65G  9.4s  0.5
Identity mapping  LV1  0.9155±0.1107  1.65G  9.4s  0.1
Identity mapping  LV2  0.8590±0.1627  1.65G  9.4s  1.0
Identity mapping  LV3  0.8186±0.2310  1.65G  9.4s  0.5
Identity mapping  Aorta1  0.8491±0.1543  9.64G  62.5s  0.1
Identity mapping  Aorta2  0.7871±0.2095  9.64G  62.5s  0.5
Identity mapping  Aorta3  0.8365±0.1726  9.64G  62.5s  0.1
Dense4 connection  RV1  0.6752±0.3256  2.68G  10.5s  0.05
Dense4 connection  RV2  0.7080±0.2940  2.68G  10.5s  0.1
Dense4 connection  RV3  0.6559±0.3388  2.68G  10.5s  0.5
Dense4 connection  LV1  0.9190±0.0668  2.68G  10.5s  0.05
Dense4 connection  LV2  0.8726±0.1627  2.68G  10.5s  0.5
Dense4 connection  LV3  0.8032±0.2392  2.68G  10.5s  0.5
Dense4 connection  Aorta1  0.8335±0.1743  11.89G  65.4s  0.01
Dense4 connection  Aorta2  0.7809±0.2094  11.89G  65.4s  0.05
Dense4 connection  Aorta3  0.8344±0.1750  11.89G  65.4s  0.1
3 Results
Six atrous blocks are explored and validated on the three datasets to select the optimal atrous block. For the RV and LV data, with an image size of 256×256, 128 I-blocks, 32 II-blocks, 10 III-blocks, 3 IV-blocks and 1 V-block are cascaded respectively for a whole-image receptive field. The feature channel number is set as 12, 16, 24, 38 and 64 respectively to maintain a similar number of parameters in each ACNN, which guarantees a fair comparison between the five ACNNs. For the aorta data, with an image size of 512×512, 256 I-blocks, 64 II-blocks, 20 III-blocks, 6 IV-blocks, 2 V-blocks and 1 VI-block are cascaded respectively for a whole-image receptive field. The feature channel number is set as 12, 16, 24, 38, 64 and 80 respectively to maintain a similar number of parameters in each ACNN. Before confirming the optimal shortcut connection and normalization method, identity mapping and FGN are used as the shortcut connection and normalization method in this section's experiments. Detailed results are shown in Sec. 3.1.
Three shortcut connections, residual learning, identity mapping and the dense4 connection, are explored and validated on the three datasets to select the optimal shortcut connection; details are illustrated in Sec. 3.2. Before confirming the optimal normalization method, FGN is used as the normalization method in this section's experiments. Four normalization methods, BN-infer, LN, FGN (the same as BN-train and IN in this paper) and GN4, are explored and validated on the three datasets to select the optimal normalization method; details are stated in Sec. 3.3. It is known that slight differences exist even when training exactly the same model multiple times [27]. This variance is given in Sec. 3.4.
Three popular DCNNs are used for comparison: (1) UNet proposed in [13], with five max-pooling layers; (2) optimized UNet with FGN proposed in [14], with seven max-pooling layers to achieve the largest receptive field; (3) a hybrid DCNN similar to Deeplab [5], where the intermediate four max-pooling and deconvolutional layers in the optimized UNet are replaced with two convolutional layers plus one atrous convolutional layer with atrous rates of (2, 4, 8, 16) respectively. The feature root is set as 16 for all methods. The detailed comparison regarding accuracy, consumed memory and speed is given in Sec. 3.5. Examples of the segmentation results are shown in Sec. 3.6.
In the following paragraphs, RV1 refers to the first cross validation of RV segmentation (the first group is used for testing and the second and third groups for training); this notation also applies to RV2, RV3, LV1, LV2, LV3, Aorta1, Aorta2 and Aorta3.
3.1 Atrous Block
The segmentation accuracy, consumed memory, training time for 100 iterations, and optimal learning rate of the five ACNNs for RV segmentation, the five ACNNs for LV segmentation, and the six ACNNs for aorta segmentation are shown in Tab. 1, Tab. 2 and Tab. 3 respectively. The mean DSC for each patient in the RV, LV and aorta data with the five or six ACNNs is shown in Fig. 7. We can see that for most of the tests, including RV3, LV1, LV3, Aorta1, Aorta2 and Aorta3, the atrous II-block achieves the highest accuracy. For the tests where the atrous II-block underperforms, including RV1, RV2 and LV2, it achieves comparable accuracy. For the RV and LV tests, the atrous II-block ACNN also consumes the least memory. However, for the aorta tests, this advantage no longer exists; this is because the aorta data has a large image size of 512×512, where the high-resolution feature maps consume a lot of memory. The training time decreases as the number of atrous convolutional layers in each block increases, for all three datasets. The optimal learning rates show no conspicuous pattern. Some patients, e.g. patient 31 in the RV data, patients 29 and 44 in the LV data, and patients 10 and 15 in the aorta data, show a clear accuracy ordering of the five or six atrous blocks: the atrous II-block outperforms the others. The atrous II-block is used in all experiments below.
3.2 Shortcut Connection
The segmentation accuracy, consumed memory, training time for 100 iterations, and optimal learning rate of the atrous II-block ACNN for segmenting the RV, LV and aorta with different shortcut connections (residual learning, identity mapping and the dense4 connection) are shown in Tab. 4. The mean DSC for each patient in the RV, LV and aorta data with the three shortcut connections is shown in Fig. 8. We can see that, even though residual learning does not achieve the highest accuracy in most tests, it achieves accuracy very close to the highest value in the tests where it underperforms, i.e. RV3, LV1, LV2, LV3, Aorta1 and Aorta2. The dense4 connection consumes the largest memory and takes the longest time to train; its consumed memory for the aorta test is an approximate value, as the real value is larger than 12G and the displayed value is an optimized one. Residual learning takes almost the same memory and training time as identity mapping. The optimal learning rates again show no conspicuous pattern. Some patients, e.g. patients 6, 12 and 31 in the RV data, patient 20 in the LV data and patient 19 in the aorta data, show that residual learning obviously outperforms the other shortcut connection methods, though there are also some underperforming examples, e.g. patient 27 in the RV data and patient 29 in the LV data. Residual learning is concluded to be the optimal shortcut connection method and is used in the later experiments.
Normalization  Test  mean±std DSC  Mem.  Time  OLR
BN-infer  RV1  0.6474±0.3449  1.65G  10.0s  0.05
BN-infer  RV2  0.6817±0.3138  1.65G  10.0s  0.05
BN-infer  RV3  0.6162±0.3505  1.65G  10.0s  0.05
BN-infer  LV1  0.8313±0.2146  1.65G  10.0s  0.05
BN-infer  LV2  0.8302±0.1990  1.65G  10.0s  0.05
BN-infer  LV3  0.6976±0.3507  1.65G  10.0s  0.05
BN-infer  Aorta1  0.5990±0.3094  9.64G  57.9s  0.05
BN-infer  Aorta2  0.6605±0.2657  9.64G  57.9s  0.05
BN-infer  Aorta3  0.6510±0.3351  9.64G  57.9s  0.1
LN  RV1  0.6801±0.3334  1.65G  10.3s  1.5
LN  RV2  0.6917±0.3065  1.65G  10.3s  0.05
LN  RV3  0.6569±0.3309  1.65G  10.3s  0.1
LN  LV1  0.9004±0.1225  1.65G  10.3s  0.1
LN  LV2  0.8171±0.2381  1.65G  10.3s  0.5
LN  LV3  0.7320±0.3226  1.65G  10.3s  0.05
LN  Aorta1  0.7797±0.2422  9.64G  66.5s  0.05
LN  Aorta2  0.7899±0.2105  9.64G  66.5s  0.05
LN  Aorta3  0.7856±0.2554  9.64G  66.5s  0.1
FGN  RV1  0.6755±0.3226  1.66G  9.8s  0.5
FGN  RV2  0.7267±0.2839  1.66G  9.8s  1.5
FGN  RV3  0.6823±0.3297  1.66G  9.8s  0.05
FGN  LV1  0.9133±0.1185  1.66G  9.8s  0.05
FGN  LV2  0.8712±0.1691  1.66G  9.8s  0.5
FGN  LV3  0.8153±0.2346  1.66G  9.8s  0.5
FGN  Aorta1  0.8449±0.1457  9.65G  62.6s  0.5
FGN  Aorta2  0.7820±0.2318  9.65G  62.6s  0.1
FGN  Aorta3  0.8493±0.1554  9.65G  62.6s  0.05
GN4  RV1  0.6918±0.3133  1.65G  14.5s  1.5
GN4  RV2  0.6653±0.3257  1.65G  14.5s  1.5
GN4  RV3  0.6588±0.3046  1.65G  14.5s  0.05
GN4  LV1  0.9027±0.1082  1.65G  14.5s  0.5
GN4  LV2  0.8586±0.1822  1.65G  14.5s  0.5
GN4  LV3  0.7459±0.2937  1.65G  14.5s  0.05
GN4  Aorta1  0.8365±0.1716  9.64G  95.5s  1.0
GN4  Aorta2  0.7817±0.2029  9.64G  95.5s  0.05
GN4  Aorta3  0.8164±0.2007  9.64G  95.5s  0.05
3.3 Normalization Method
The segmentation accuracy, consumed memory, training time for 100 iterations, and optimal learning rates of the atrous II-block ACNN for segmenting the RV, LV and aorta with different normalization methods (BN-infer, LN, FGN and GN4) are shown in Tab. 5. The mean DSC for each patient in the RV, LV and aorta data with the four normalization methods is shown in Fig. 8. We can see that FGN achieves the highest accuracy in most tests, except RV1 and Aorta2. There is not much difference in consumed memory. In terms of training speed, FGN is similar to BN-infer and LN, while GN4 is the slowest. The optimal learning rates of BN-infer are smaller than those of LN, FGN and GN4. For some patients, e.g. patients 4, 14 and 23 in the RV data, patients 39 and 40 in the LV data, and patients 17 and 19 in the aorta data, FGN achieves obviously the highest accuracy. FGN is selected as the optimal normalization method and is used in the following experiments.
Test  RV1  RV2  RV3  LV1  LV2  LV3  Aorta1  Aorta2  Aorta3
Mean  0.6508  0.7045  0.6543  0.9117  0.8611  0.8069  0.8341  0.7830  0.8451
Variance  0.0187  0.0145  0.0171  0.0035  0.0191  0.0091  0.0156  0.0164  0.0083
Optimal learning rate  0.5  1.5  0.05  0.05  0.5  0.5  0.5  0.1  0.05
3.4 Multiple Runs
The models, with parameter settings exactly the same as those of the FGN rows in Tab. 5, were trained an additional five times (plus the one in Tab. 5, six times in total). The mean and variance of the mean DSC over the six trainings are shown in Tab. 6. We can see that the DSC variance is within the normal range stated in [27] and is comparable to the DSC variance when training UNet multiple times [14].
It is worth noting that the average DSC in Tab. 6 is slightly lower than that of the FGN rows in Tab. 5; we think this is because the DSC in Tab. 5 is selected from five trainings with five different learning rates. This selection process may yield a slightly optimistic accuracy, as a less optimized training would not outperform the other learning rates and be selected for display. This selection process does not cause unfairness, as it is the same for all other tests as well.
Tab. 7: Segmentation accuracy (mean±std DSC), consumed memory (Mem.), training time for 100 iterations (Time), optimal learning rate (OLR) and parameter number of the four compared methods.

The proposed ACNN
Test    mean±std DSC   Mem.   Time   OLR   Parameters
RV1     0.6755±0.3226  1.66G  9.8s   0.5   1.46M
RV2     0.7267±0.2839  1.66G  9.8s   1.5   1.46M
RV3     0.6823±0.3297  1.66G  9.8s   0.05  1.46M
LV1     0.9133±0.1185  1.66G  9.8s   0.05  1.46M
LV2     0.8712±0.1691  1.66G  9.8s   0.5   1.46M
LV3     0.8153±0.2346  1.66G  9.8s   0.5   1.46M
Aorta1  0.8449±0.1457  9.65G  62.6s  0.5   2.95M
Aorta2  0.7820±0.2318  9.65G  62.6s  0.1   2.95M
Aorta3  0.8493±0.1554  9.65G  62.6s  0.05  2.95M

Hybrid network [5]
Test    mean±std DSC   Mem.   Time   OLR   Parameters
RV1     0.7101±0.2875  4.26G  3.3s   1.0   23.1M
RV2     0.7175±0.2600  4.26G  3.3s   1.0   23.1M
RV3     0.6907±0.2862  4.26G  3.3s   0.5   23.1M
LV1     0.9205±0.0995  4.26G  3.3s   1.0   23.1M
LV2     0.8930±0.1300  4.26G  3.3s   0.5   23.1M
LV3     0.8306±0.2030  4.26G  3.3s   0.1   23.1M
Aorta1  0.8197±0.2018  7.65G  6.6s   1.0   23.1M
Aorta2  0.7846±0.2145  7.65G  6.6s   1.5   23.1M
Aorta3  0.7483±0.3072  7.65G  6.6s   1.0   23.1M

Optimized U-Net with FGN [14]
Test    mean±std DSC   Mem.   Time   OLR   Parameters
RV1     0.7204±0.2795  8.80G  15.1s  0.5   1384.2M
RV2     0.7002±0.2900  8.80G  15.1s  0.05  1384.2M
RV3     0.6636±0.3047  8.80G  15.1s  0.1   1384.2M
LV1     0.9241±0.0965  8.80G  15.1s  0.05  1384.2M
LV2     0.8932±0.1211  8.80G  15.1s  0.05  1384.2M
LV3     0.8434±0.1912  8.80G  15.1s  1.5   1384.2M
Aorta1  0.8302±0.1652  8.80G  18.4s  0.5   1384.2M
Aorta2  0.8102±0.1764  8.80G  18.4s  0.5   1384.2M
Aorta3  0.8419±0.1737  8.80G  18.4s  1.0   1384.2M

U-Net [13]
Test    mean±std DSC   Mem.   Time   OLR   Parameters
RV1     0.6944±0.2428  4.34G  1.8s   0.5   86.5M
RV2     0.6452±0.3297  4.34G  1.8s   0.1   86.5M
RV3     0.6117±0.3455  4.34G  1.8s   0.05  86.5M
LV1     0.9240±0.0678  4.34G  1.8s   0.1   86.5M
LV2     0.8874±0.1592  4.34G  1.8s   0.1   86.5M
LV3     0.8081±0.2345  4.34G  1.8s   0.1   86.5M
Aorta1  0.8165±0.1843  7.79G  3.7s   0.05  86.5M
Aorta2  0.7938±0.2081  7.79G  3.7s   0.05  86.5M
Aorta3  0.7718±0.2712  7.79G  3.7s   0.05  86.5M
3.5 Comparison with Other Methods
The segmentation accuracy, consumed memory, training time for 100 iterations, optimal learning rates and parameter numbers of the proposed ACNN, the hybrid network [5], the optimized U-Net [14] and U-Net [13] are shown in Tab. 7. The mean DSC for each patient in the RV, LV and aorta data with the four segmentation methods is shown in Fig. 8. Ranked by how often each method achieves the highest accuracy, the order is: optimized U-Net, the proposed ACNN, the hybrid network and U-Net. However, the proposed ACNN uses far fewer parameters than the other three methods. We think this advantage comes from the dimensionally lossless feature maps inside the proposed ACNN. The proposed ACNN also consumes much less memory and training time than the optimized U-Net for the RV and LV tests; however, this advantage disappears for the aorta tests, because the large image size of the aorta data increases the memory and training time significantly. The consumed memory of the optimized U-Net for the RV or LV tests is almost the same as for the aorta tests, as the trainable parameters occupy most of the consumed memory and the difference caused by the different image sizes is negligible. It is possible that we did not achieve the best performance of the other three methods; however, the same sub-optimality applies to the proposed ACNN as well, so the comparison is considered fair.
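The parameter gap can be sanity-checked from the standard convolutional layer parameter count, k·k·C_in·C_out + C_out. The channel widths below are illustrative assumptions, not the exact architectures compared above; the sketch only shows why constant, narrow widths (as in a dimensionally lossless cascade) need far fewer parameters than U-Net-style deep widths:

```python
def conv2d_params(k, c_in, c_out):
    # weights (k*k*c_in per output channel) plus one bias per output channel
    return k * k * c_in * c_out + c_out

# A 3x3 layer at an assumed constant width of 64 channels versus a 3x3 layer
# at an assumed U-Net-style deep width of 1024 channels:
narrow = conv2d_params(3, 64, 64)     # 36,928 parameters
wide = conv2d_params(3, 1024, 1024)   # 9,438,208 parameters
ratio = wide / narrow                 # roughly 256x more parameters per layer
```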
3.6 Segmentation Details
Four examples of the RV, LV and aorta segmentation results, selected randomly, are shown in Fig. 11. As the RV and LV data are discontinuous MRI images, only 2D segmentation slices are shown. It can be seen that reasonable segmentation results are achieved.
4 Discussion
An atrous rate setting is proposed which sets the atrous rate of the n-th atrous convolutional layer to k^(n−1), where k is the convolutional kernel size. It achieves the largest and fully-covered receptive field with a minimum number of atrous convolutional layers. Comparison experiments with traditional atrous rate settings, e.g. (1, 2, 4, 8, ...) and (1, 2, 5, 9, ...), are not conducted because: 1) the smaller receptive field resulting from traditional settings does not necessarily indicate lower segmentation accuracy, and the larger receptive field resulting from the proposed setting does not necessarily indicate higher accuracy; 2) besides the receptive field, the number of paths to each input node also influences the segmentation accuracy. The hybrid and complex causes of a good segmentation result make it difficult to judge an atrous rate setting from the segmentation accuracy alone. Hence, in this paper, detailed proof and derivation are given instead.
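The rate rule can be checked numerically: for stacked stride-1 atrous convolutions, the receptive field is 1 + Σ(k−1)·r_n, and full coverage means every input offset inside that field is reached through at least one path. A small sketch (not the authors' code) comparing the proposed setting with a traditional one:

```python
def receptive_field(kernel, rates):
    # RF of stacked stride-1 atrous convolutions: 1 + sum_n (k - 1) * r_n
    return 1 + sum((kernel - 1) * r for r in rates)

def covered_offsets(kernel, rates):
    # Input offsets reachable from one output node through at least one path.
    half = kernel // 2
    offsets = {0}
    for r in rates:
        taps = [i * r for i in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return offsets

# Proposed setting: rate k^(n-1) at the n-th layer; with k = 3 and three
# layers the rates are (1, 3, 9), giving RF 27 with no missing nodes.
proposed = [3 ** i for i in range(3)]
rf = receptive_field(3, proposed)                         # 27
fully_covered = len(covered_offsets(3, proposed)) == rf   # True

# The traditional setting (1, 2, 4) is also hole-free here, but its RF is
# only 15 for the same number of layers.
rf_traditional = receptive_field(3, [1, 2, 4])
```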
Six atrous blocks, the I-block, II-block, III-block, IV-block, V-block and VI-block, with receptive fields of 3, 9, 27, 81, 243 and 729 respectively, are proposed and explored. For an atrous block with a larger receptive field, e.g. the VI-block, fewer blocks and fewer total atrous convolutional layers are needed to cover the whole input image. Under the network framework in this paper (atrous block cascade, identity mapping, FGN), the experiments indicate that the atrous II-block is optimal for biomedical semantic segmentation. However, if the network framework, settings or task change, the optimal atrous block may differ; this requires further exploration.
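Assuming the receptive fields of cascaded stride-1 blocks compose additively (each extra block adds RF − 1 to the total, as for stacked convolutions), the trade-off between block size and block count can be sketched as follows; the 512-pixel image size is only an example:

```python
import math

def blocks_to_cover(image_size, block_rf):
    # Cascading b blocks yields a total RF of 1 + b * (block_rf - 1);
    # pick the smallest b whose RF reaches the image size.
    return math.ceil((image_size - 1) / (block_rf - 1))

# (receptive field, atrous layers per block) for the I- to VI-blocks
blocks = [(3, 1), (9, 2), (27, 3), (81, 4), (243, 5), (729, 6)]
for rf, layers in blocks:
    b = blocks_to_cover(512, rf)
    total_layers = b * layers  # II-block: 64 blocks, 128 layers; VI-block: 1 block, 6 layers
```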
Dense connection was proved to be efficient in [19]; it is not adopted in this paper due to its undistinguished segmentation accuracy and high memory consumption. Identity mapping was proved to improve residual learning in [18]; it is not used here due to its slightly lower robustness and stability. Residual learning is therefore used as the shortcut connection. BN, IN, LN and GN are the four most popular normalization methods in biomedical semantic segmentation. It was proved in [14] that FGN is the optimal normalization for U-Net; in this paper, FGN also outperforms the other normalization methods and is used. Except for Sec. 3.4, all reported accuracies are recorded from the first training only.
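For reference, LN, IN and GN4 differ only in the channel grouping of a group-normalization computation (LN uses one group, IN one group per channel, GN4 four groups); the exact grouping that defines FGN follows [14] and is not reproduced here. A minimal NumPy sketch of the shared computation:

```python
import numpy as np

def group_norm(x, groups, eps=1e-5):
    """Normalize x of shape (N, C, H, W) within channel groups.
    groups=1 gives LN, groups=C gives IN, groups=4 gives GN4."""
    n, c, h, w = x.shape
    assert c % groups == 0
    g = x.reshape(n, groups, c // groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

x = np.random.randn(2, 8, 4, 4)
y = group_norm(x, groups=4)  # per-group mean ~0, variance ~1
```

Unlike BN, none of these statistics depend on the batch dimension, which is why they behave identically at training and inference time.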
The proposed ACNN achieves segmentation accuracy comparable to the hybrid network, the optimized U-Net and U-Net, while using far fewer parameters. We think this comes from the efficient information contained in the dimensionally lossless feature maps. This advantage is very useful when deploying the trained model on mobile devices, as the model occupies much less memory. For images with a smaller size, the proposed ACNN also consumes less memory and training time; however, the consumed memory and training time increase significantly with image size, which could be optimized in future work. Segmentation DCNNs specific to one anatomy, e.g. the Omega-Net proposed for cardiac MRI segmentation [28] and the Focal U-Net proposed for class-imbalanced stent graft marker segmentation [29], are not compared, as they usually apply additional algorithms tied to the anatomy that are not generalizable to all datasets.
The training times shown are for 100 iterations only and were measured on an otherwise unloaded machine; they could be much longer when the computer and GPU are shared with other processes. In practice, training one model takes up to 16 hours. For a fair comparison, five learning rates are explored for each experiment, so producing one DSC in the above tables takes up to 4 days.
5 Conclusion
A new dimensionally lossless DCNN, the ACNN, is proposed, using cascaded atrous II-blocks, residual learning and FGN. A new atrous rate setting is proposed to achieve the largest and fully-covered receptive field with a minimum number of atrous convolutional layers. Six atrous blocks (I-block, II-block, III-block, IV-block, V-block, VI-block), three shortcut connections (residual learning, identity mapping, dense4 connection) and four normalization methods (BN, IN, LN, GN) are explored with extensive experiments to select the optimal atrous block, shortcut connection and normalization layer. With far fewer trainable parameters than the hybrid network, the optimized U-Net and U-Net, comparable accuracy is achieved by the proposed ACNN. Code will be made available online.
6 Acknowledgement
The authors thank QingBiao Li for the collection and processing of the data. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References
 [1] F. Milletari, N. Navab, and S.A. Ahmadi, “Vnet: Fully convolutional neural networks for volumetric medical image segmentation,” in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 565–571.
 [2] X.Y. Zhou, J. Lin, C. Riga, G.Z. Yang, and S.L. Lee, “Realtime 3D shape instantiation from single fluoroscopy projection for fenestrated stent graft deployment,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 1314–1321, 2018.
 [3] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in Advances in Neural Information Processing Systems, 2017, pp. 4470–4478.

 [4] T.Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2980–2988.
 [5] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018.

 [6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
 [8] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoderdecoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 12, pp. 2481–2495, 2017.
 [9] F. Yu and V. Koltun, “Multiscale context aggregation by dilated convolutions,” in ICLR, 2016.
 [10] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” arXiv preprint arXiv:1702.08502, 2017.
 [11] L.C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
 [12] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, “Fullresolution residual networks for semantic segmentation in street scenes,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 3309–3318.
 [13] O. Ronneberger, P. Fischer, and T. Brox, “Unet: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computerassisted intervention. Springer, 2015, pp. 234–241.
 [14] X.Y. Zhou and G.Z. Yang, “Normalization in training deep convolutional neural networks for 2d biomedical semantic segmentation,” arXiv preprint arXiv:1809.03783, 2018.
 [15] W. Luo, Y. Li, R. Urtasun, and R. Zemel, “Understanding the effective receptive field in deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4898–4906.
 [16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [18] ——, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
 [19] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, vol. 1, no. 2, 2017, p. 3.

 [20] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
 [21] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” stat, vol. 1050, p. 21, 2016.
 [22] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
 [23] Y. Wu and K. He, “Group normalization,” arXiv preprint arXiv:1803.08494, 2018.
 [24] X.Y. Zhou, G.Z. Yang, and S.L. Lee, “A realtime and registrationfree framework for dynamic shape instantiation,” Medical image analysis, vol. 44, pp. 86–97, 2018.
 [25] P. Radau, Y. Lu, K. Connelly, G. Paul, A. Dick, and G. Wright, “Evaluation framework for algorithms segmenting short axis cardiac mri,” The MIDAS JournalCardiac MR Left Ventricle Segmentation Challenge, vol. 49, 2009.
 [26] O. Jimenez-del Toro, H. Müller, M. Krenn, K. Gruenberg, A. A. Taha, M. Winterstein, I. Eggel, A. Foncubierta-Rodríguez, O. Goksel, A. Jakab et al., “Cloudbased evaluation of anatomical structure segmentation and landmark detection algorithms: Visceral anatomy benchmarks,” IEEE transactions on medical imaging, vol. 35, no. 11, pp. 2459–2475, 2016.
 [27] V. Sze, Y.H. Chen, T.J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
 [28] D. M. Vigneault, W. Xie, C. Y. Ho, D. A. Bluemke, and J. A. Noble, “Ω-net (omega-net): Fully automatic, multiview cardiac MR detection, orientation, and segmentation with deep neural networks,” Medical image analysis, vol. 48, pp. 95–106, 2018.
 [29] X.Y. Zhou, C. Riga, S.L. Lee, and G.Z. Yang, “Towards automatic 3d shape instantiation for deployed stent grafts: 2d multipleclass and classimbalance marker segmentation with equallyweighted focal unet,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1261–1267.