Biomedical semantic segmentation, which predicts the class of each pixel in an image, is important for both pre-operative surgical planning and intra-operative surgical navigation. For example, 3D prostate segmentation from Magnetic Resonance Imaging (MRI) volumes supports diagnosis through volume assessment and treatment planning through boundary estimation. Intra-operative 2D segmentation of stent graft markers from fluoroscopy images is important for instantiating the 3D shape of a stent graft and hence for navigating Fenestrated Endovascular Aortic Repair (FEVAR).
Conventional methods are based on ad hoc expert-designed feature extractors and classifiers, which are insufficient. Recently, most segmentation methods have been based on Deep Convolutional Neural Networks (DCNNs), which have shown promising performance in many vision tasks including image classification, object detection, and semantic segmentation. In a DCNN, features are extracted and classified automatically by training multiple non-linear modules. Unlike traditional fully connected neural networks, where each output node is linked to all input nodes, an output node of a DCNN links only to a local region of input nodes, known as the receptive field. Multiple convolutional layers, as shown in Fig. 1a, and down-sampling layers, e.g. the pooling layer shown in Fig. 1b, are cascaded to achieve a large receptive field, which is essential for extracting and classifying abstract features and semantic information. As a result, however, the feature map is down-sampled as well. A DCNN with down-sampling layers is well suited to image-level tasks, e.g. image classification, but is insufficient for pixel-level tasks, e.g. semantic segmentation.
Previous research has sought to compensate for the reduced dimension of feature maps. Up-sampling is widely used. For example, deconvolutional layers and non-linear up-sampling are used in the Fully Convolutional Neural Network (FCNN) and SegNet respectively to recover the down-sampled feature map to the input image size. An alternative is to use atrous convolution, also known as dilated convolution, in place of the down-sampling layer to increase the receptive field. Atrous convolution inserts zeros between non-zero filter taps to sample the feature map, as shown in Fig. 1c. It increases the receptive field with the atrous rate while maintaining the spatial dimension and without increasing the computational complexity. However, applying atrous convolution causes two problems, high memory consumption and missed input nodes, which are detailed below. These two problems have prevented the wide application of atrous convolution. To the authors' knowledge, there has not been a DCNN with pure atrous convolution for high-resolution biomedical semantic segmentation.
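To make the zero-insertion description above concrete, the following is a minimal 1D sketch in plain Python. It is an illustration only, not the implementation used in this paper: the function names `atrous_conv1d` and `dilate_kernel`, the zero padding at the borders and the toy signal are all assumptions introduced here.

```python
def atrous_conv1d(signal, kernel, rate):
    """1D atrous (dilated) convolution, stride 1, zero padding (assumed).

    The output has the same length as the input. Taps are applied `rate`
    samples apart. As in most deep learning frameworks, this is a
    cross-correlation (the kernel is not flipped).
    """
    k = len(kernel)
    assert k % 2 == 1, "an odd kernel size is assumed"
    half = k // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for t, w in enumerate(kernel):
            j = i + (t - half) * rate      # input index aligned with tap t
            if 0 <= j < len(signal):       # outside the signal counts as zero
                acc += w * signal[j]
        out.append(acc)
    return out


def dilate_kernel(kernel, rate):
    """Dense kernel obtained by inserting rate-1 zeros between taps."""
    dense = []
    for t, w in enumerate(kernel):
        dense.append(w)
        if t < len(kernel) - 1:
            dense.extend([0.0] * (rate - 1))
    return dense


signal = [float(v) for v in range(10)]
kernel = [1.0, 1.0, 1.0]
y = atrous_conv1d(signal, kernel, rate=2)
```

Applying the three-tap kernel at rate 2 gives the same result as applying the five-tap zero-inserted kernel at rate 1, while only three multiplications per output are needed and the output length equals the input length.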
Memory shortage is the first challenge in applying atrous convolution, as propagating high-resolution feature maps consumes a large amount of memory while the memory of a worker, e.g. a GPU, is limited. In previous work, atrous convolution was usually applied jointly with down-sampling layers as a trade-off between accuracy and memory. For example, in Deeplab , firstly, a feature map at size of the input image was extracted by multiple convolutional and down-sampling layers; secondly, feature maps with a larger receptive field but with the same spatial size were calculated by multiple atrous convolutional layers; lastly, bilinear interpolation was used to recover the spatial information of the down-sampled feature maps while a conditional random field was used to refine the predicted pixel-level probability. In multi-scale context aggregation, a feature map with resolution was first down-sampled from the input image by a front-end module, then a context module with seven atrous convolutional layers was applied to extract features with a larger receptive field at the same dimensional size. Similar joint usage can also be found in .
Setting the atrous rates is another challenge in applying atrous convolution, as an output node links only to the input nodes that align with non-zero filter taps, as shown in Fig. 1c. The input nodes that align with zero filter taps are not considered. There is no standard for setting the atrous rates yet. For example, atrous rates of (1, 1, 2, 4, 8, 16, 1) were allocated for achieving a receptive field of in following the strides of max-pooling layers in FCNN. Wang et al. argued that an atrous rate setting of (2, 4, 8) would cause gridding effects (some input nodes are missed) and proposed hybrid atrous rates, i.e. (1, 2, 5, 9), to guarantee covering all input nodes. Atrous rates of (6, 12, 18) were set for each block and atrous rates of (1, 2, 4) were set inside each block in based on experimental indications.
In this paper, we propose a dimensionally lossless DCNN in which the spatial dimension of intermediate feature maps is the same as that of the input image. A similar work can be found in ; however, the spatial dimension of intermediate feature maps in the residual stream is still smaller than that of the input image. To be dimensionally lossless, the proposed network needs to: 1) achieve the largest receptive field with as few atrous convolutional layers as possible to save memory; 2) fully cover the receptive field without missing any input node. In the following, we first prove in Sec. 2.1 that an atrous rate setting of $K^{n-1}$ at the $n$-th atrous convolutional layer, where $K$ is the kernel size and $n$ is the sequence number of the atrous convolutional layer, achieves the largest and fully-covered receptive field with a minimum number of atrous convolutional layers. Then six atrous blocks, three shortcut connections and four normalization methods are explored in Sec. 2.2.1, Sec. 2.2.2 and Sec. 2.2.3 respectively to select the optimal atrous block, shortcut connection and normalization method experimentally. Finally, a dimensionally lossless DCNN, the Atrous Convolutional Neural Network (ACNN), is proposed in Sec. 2.2.4 using multiple cascaded atrous II-blocks, residual learning and Fine Group Normalization (FGN). The RV, LV and aorta are used to validate the proposed method, with data collection described in Sec. 2.3 and results shown in Sec. 3. U-Net , optimized U-Net and a hybrid network similar to are used as the comparison methods. Using far fewer trainable parameters, the proposed ACNN achieved comparable segmentation Dice Similarity Coefficients (DSCs) on the three validation datasets, which indicates the benefit of dimensionally lossless feature maps. Discussion and conclusions are given in Sec. 4 and Sec. 5 respectively.
The proposed atrous rate setting is proved in Sec. 2.1. The six atrous blocks, three shortcut connections and four normalization methods are stated in Sec. 2.2.1, Sec. 2.2.2 and Sec. 2.2.3 respectively. Details about the proposed ACNN are illustrated in Sec. 2.2.4. Data collection of the RV, LV and aorta and the experimental setup are illustrated in Sec. 2.3.
2.1 Atrous Rate Setting
In this section, we focus on finding the atrous rate setting that achieves the largest and fully-covered receptive field with a minimum number of atrous convolutional layers. Before the mathematical derivation, three 1D receptive field examples with three different atrous rate settings are shown in Fig. 2. In this three-layer network, atrous rates of (1, 2, 4) achieve a receptive field of 15, while atrous rates of (1, 2, 9) achieve a receptive field of 25 with a coverage ratio (the ratio of linked nodes to all nodes in the receptive field) of 0.84. With the proposed rates of (1, 3, 9), the largest receptive field of 27 is achieved with full coverage, i.e. a coverage ratio of 1.0. Detailed mathematical proofs are presented below.
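The three examples above can be checked numerically. The sketch below is an illustration under stated assumptions (1D, kernel size 3, stride 1, biases and non-linearities ignored, every weight treated as 1); the helper names `receptive_field_paths` and `summarize` are introduced here and are not from the paper.

```python
def receptive_field_paths(rates, kernel_size=3):
    """Count paths from one output node to each 1D input offset.

    With every weight treated as 1, the value stored per offset is the
    number of distinct paths linking the output node to that input node.
    """
    half = kernel_size // 2
    paths = {0: 1}  # start from the single output node
    for rate in rates:
        nxt = {}
        for offset, count in paths.items():
            for tap in range(-half, half + 1):
                key = offset + tap * rate
                nxt[key] = nxt.get(key, 0) + count
        paths = nxt
    return paths


def summarize(rates, kernel_size=3):
    """Return (receptive field size, coverage ratio, set of path counts)."""
    paths = receptive_field_paths(rates, kernel_size)
    size = max(paths) - min(paths) + 1   # extent of the receptive field
    covered = len(paths)                 # offsets reached by at least one path
    return size, covered / size, set(paths.values())


for rates in [(1, 2, 4), (1, 2, 9), (1, 3, 9)]:
    size, ratio, counts = summarize(rates)
    print(rates, size, round(ratio, 2), sorted(counts))
```

Under these assumptions the sketch gives receptive fields of 15, 25 (coverage ratio 0.84) and 27 (coverage ratio 1.0) for the three settings. For (1, 3, 9) every input node is reached by exactly one path, whereas for (1, 2, 4) the path counts are uneven.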
With an input feature map of size , an output feature map of size is calculated by the convolutional layer with an atrous rate , where , , , and , where is the total number of atrous convolutional layers. Here is the feature height and is the feature width, though these two values are usually equal. The channel numbers of feature maps are denoted as , and
is the input image. If the non-linear modules, e.g. ReLU, and the biases are ignored, an equivalent 2D atrous convolution could achieve a backward propagation from to , which can be decomposed into two 1D atrous convolutions , with a kernel indexed by :
is an odd number which represents the kernel size, e.g. 3, 5, or 7. is the pixel index. , an element of the weight matrix , is a trainable variable. is an indicator function defined as:
Denote vectors, as the 1D input image and the 1D feature map, both indexed by . could be calculated from by:
Define , in which indicates that only the central pixel of has a non-zero value (=1). It is calculated as:
Set , vectors consisting of 1, then , the element indexed by , is the path number from to the input image’s pixel or node. Thus, represents the receptive field of , where its receptive field coverage could be represented by the non-zero element number in the vector :
and its receptive field size is calculated as:
The receptive field coverage ratio of , denoted by , is then defined as:
To guarantee a fully-covered receptive field (RF) from an output pixel or node, our target is to maximize the receptive field size under a constraint on the RF coverage ratio:
The total path number from to is represented by:
where represents an exponent calculation. It is the upper bound of because:
This is the sum of a geometric progression; one solution can be obtained as:
It satisfies a uniformly covered RF: , where in 1D and the same in 2D, which satisfies the equivalent condition in (12) and thus is a solution to (9). Therefore, an atrous rate setting of $K^{n-1}$ at the $n$-th layer leads to the largest and fully-covered receptive field for a given number of atrous convolutional layers.
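As a worked check of the geometric-progression argument, the 1D receptive field size can be written out explicitly. The notation is an assumption made here for illustration: $S_N$ denotes the receptive field size after $N$ stride-1 layers, $r_n = K^{n-1}$ the atrous rate of the $n$-th layer, and $P_N$ the total number of paths from one output node. It uses only the standard fact that a stride-1 layer with kernel size $K$ and rate $r$ widens the receptive field by $(K-1)\,r$.

```latex
\begin{aligned}
S_N &= 1 + (K-1)\sum_{n=1}^{N} r_n
     = 1 + (K-1)\sum_{n=1}^{N} K^{\,n-1} \\
    &= 1 + (K-1)\,\frac{K^{N}-1}{K-1}
     = K^{N}, \\
P_N &= \prod_{n=1}^{N} K = K^{N}.
\end{aligned}
```

Since the number of paths equals the number of nodes in the receptive field, full coverage is equivalent to every input node being reached by exactly one path. For $K = 3$ and $N = 3$ this gives $S_3 = 27$, matching the (1, 3, 9) example.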
2.2 Atrous Convolutional Neural Network
Following the proof in Sec. 2.1, a receptive field of $K^N$ can be achieved by a block of N atrous convolutional layers, with each node in the receptive field linked evenly. In this paper, the kernel size of atrous convolutional layers is 3, following the setting in . A block of N atrous convolutional layers therefore has a receptive field of $3^N$. We call such a block an atrous block, and a block with N atrous convolutional layers an N-block, where N is written as a Roman numeral.
The proposed ACNN is built from multiple cascaded atrous blocks to increase the receptive field linearly by . For achieving a (usually ) whole-image coverage, blocks need to be cascaded. To address the gradient vanishing/exploding problems and facilitate back-propagation, shortcut connections, namely residual learning, identity mapping and dense connections, and normalization methods, namely Batch Normalization (BN), Layer Normalization (LN), Instance Normalization (IN) and Group Normalization (GN), are explored. More details are discussed below.
2.2.1 Atrous Block
As the test image size in this paper is 512 or 256, six atrous blocks (I-block, II-block, III-block, IV-block, V-block, VI-block) are explored, with receptive fields of 3, 9, 27, 81, 243 and 729 respectively, as shown in Fig. 3. The number of feature channels is the same for every intermediate feature map.
The optimal atrous block is determined by experiments, as shown in Sec. 3.1. Here, we state the conclusion first: the II-block is the optimal atrous block and is used in the following.
2.2.2 Shortcut Connection
Shortcut connections are essential for DCNNs because of the degradation and gradient vanishing/exploding problems . Three popular shortcut connections are explored in this paper: 1) residual learning , 2) identity mapping , 3) dense connection , as shown in Fig. 4. In residual learning, the normalization and ReLU are placed after the atrous convolution and is added onto . In identity mapping, the normalization and ReLU are placed before the atrous convolution and is added onto . In the dense connection, the normalization and ReLU are placed before the atrous convolution and is concatenated onto . As the feature maps are high-resolution and the number of layers is very large (64 layers for the RV and LV experiments and 128 layers for the aorta experiments), a fully dense connection could not be applied owing to its extremely high memory consumption. Instead, a dense connection is placed after M identity II-blocks (M is 16 for the RV and LV experiments and 32 for the aorta experiments), as shown in Fig. 4c. This grouped dense connection is called the dense4 connection, as the atrous convolutional layers are divided into 4 groups.
The optimal shortcut connection is determined by experiments, as shown in Sec. 3.2. Here, we state the conclusion first: residual learning is the optimal shortcut connection and is used in the following.
2.2.3 Normalization Method
Normalization is essential for solving the internal covariate shift and gradient vanishing/exploding problems . Firstly, the feature map F is divided into multiple groups. According to the division method, the four most popular normalization methods in biomedical semantic segmentation are BN , which treats each channel as a group; IN , which treats each channel of each batch as a group; LN , which treats each batch as a group; and GN , which treats multiple channels of each batch as a group. The mean and variance of each group are calculated as:
Here, is the element in each feature map group, is the total number of elements in each feature group. A systematic review and detailed group subdivision of these four normalization methods in biomedical semantic segmentation with U-Net structure could be found in .
In this paper, a batch size of 1 is mainly explored, as it was shown that a batch size of 1 out-performed larger batch sizes for biomedical semantic segmentation . For the proposed ACNN, where the number of feature channels is the same for all intermediate feature maps, BN and IN are the same as Fine Group Normalization (FGN) (which sets the group number of GN to the feature channel root) when the batch size is 1. Hence FGN, which also represents BN and IN; GN4, which sets the group number of GN to 4; and LN are explored for the subdivision of the feature maps, as shown in Fig. 5.
Then the mean and variance of each group are used to normalize the feature group to a mean of 0.0 and a variance of 1.0:
Here is a small value for the numerical stability of the division. Finally, two additional parameters and are applied on each feature channel to recover the representation ability of the DCNN:
During inference, one way to apply BN, IN, LN and GN is to use the mean and variance of the current testing batch to normalize the testing feature maps. BN in this mode is called BN-train in this paper. There is an additional way to apply BN: use the moving-average mean and variance of the training feature maps to normalize the testing feature maps. BN in this mode is called BN-infer and is also explored in this paper.
The optimal normalization method is selected based on the experimental results, as shown in Sec. 3.3. Here, we state the conclusion first: FGN is the optimal normalization method and is used in the following. FGN was also shown to be the optimal normalization method when using the U-Net structure for biomedical semantic segmentation . For the U-Net structure, FGN is different from BN and IN, as the number of feature channels changes inside the DCNN. To link to the work in and for better generalizability, the name FGN is used in this paper rather than IN or BN.
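The grouping, normalization and per-channel scale and shift described in this section can be sketched for a single sample (batch size 1) in plain Python. This is an illustrative sketch rather than the paper's code: the function name `group_normalize`, the placement of the small constant as `var + eps` inside the square root, the default `eps` value and the toy feature map are assumptions. FGN is modelled here as one group per channel, which is how the text above equates it with BN and IN when the channel number is constant and the batch size is 1.

```python
def group_normalize(feature, num_groups, eps=1e-5, gamma=None, beta=None):
    """Normalize one sample given as C channels, each a flat list of pixels.

    Channels are split into `num_groups` equal groups. Each group is
    normalized with its own mean and variance, then every channel c is
    scaled by gamma[c] and shifted by beta[c].
    """
    channels = len(feature)
    assert channels % num_groups == 0, "channels must divide evenly"
    per_group = channels // num_groups
    gamma = gamma or [1.0] * channels
    beta = beta or [0.0] * channels
    out = [None] * channels
    for g in range(num_groups):
        idx = range(g * per_group, (g + 1) * per_group)
        values = [v for c in idx for v in feature[c]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = (var + eps) ** -0.5          # eps placement is an assumption
        for c in idx:
            out[c] = [gamma[c] * (v - mean) * scale + beta[c]
                      for v in feature[c]]
    return out


# Toy feature map: 4 channels with 4 pixels each (hypothetical values).
feature = [[0.0, 1.0, 2.0, 3.0],
           [10.0, 11.0, 12.0, 13.0],
           [-1.0, 0.0, 1.0, 2.0],
           [5.0, 5.5, 6.0, 6.5]]

ln = group_normalize(feature, num_groups=1)               # LN: a single group
gn2 = group_normalize(feature, num_groups=2)              # GN with 2 groups
fgn = group_normalize(feature, num_groups=len(feature))   # one group per channel
```

With one group per channel, every output channel has approximately zero mean and unit variance, whereas with a single group (LN) only the sample as a whole does, so channels with large offsets keep a non-zero mean.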
2.2.4 ACNN Architecture
The final proposed ACNN architecture is shown in Fig. 6. Multiple atrous II-blocks with residual learning and FGN normalization are stacked. The number of the residual II-blocks is determined by the input image size, i.e. 32 for a image while 64 for a image.
2.3 Data Collection and Experimental Setup
Three datasets, the right ventricle (RV), left ventricle (LV) and aorta, are collected and used for validation.
For the RV, 37 subjects with different levels of Hypertrophic Cardiomyopathy (HCM) were scanned with a 1.5T MRI scanner (Sonata, Siemens, Erlangen, Germany) , collecting 6082 images with mm slice gap, mm pixel spacing, times frames, and image size. Analyze (AnalyzeDirect, Inc, Overland Park, KS, USA) was used to label the ground truth. Rotation from to with as the interval was used to augment the images. The 37 subjects were split randomly into three groups of 12, 12, and 13 subjects for cross validation.
For the LV, 45 subjects from the SunnyBrook data set were scanned with MRI, collecting 805 images with image size. Rotation from to with as the interval was used to augment the images. The 45 subjects were split randomly into three groups of 15 subjects each for cross validation.
For the aorta, 20 subjects from the VISCERAL data set were scanned with Computed Tomography (CT), collecting 4631 images with image size. Rotation from to with as the interval was used to augment the images. The 20 subjects were split randomly into three groups of 7, 7, and 6 subjects for cross validation.
Image intensities were normalized to . Evaluation images were not split. For cross validation, two groups were used in the training stage while the remaining group was used in the testing stage. The kernel size of the last atrous convolutional layer is 1 while the kernel size of all other atrous convolutional layers is 3. All intermediate feature maps have 16 feature channels. The momentum was set to 0.9. Following the learning schedule in , two epochs were trained and the learning rate was divided by 5 at the second epoch. Five initial learning rates (1.5, 1.0, 0.5, 0.1, 0.05) were tested and the highest accuracy is reported in this paper as the experimental result. Stochastic Gradient Descent (SGD) was used as the optimizer.
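The subject-level three-group cross validation and the two-epoch learning schedule described above can be sketched as follows. This is a sketch under assumptions, not the paper's pipeline: the helper names, the fixed random seed, the use of integer subject IDs and the zero-based epoch index are all hypothetical choices made for illustration.

```python
import random


def split_into_groups(subject_ids, group_sizes, seed=0):
    """Randomly split subjects into groups of the given sizes."""
    ids = list(subject_ids)
    assert sum(group_sizes) == len(ids)
    random.Random(seed).shuffle(ids)       # seed is a hypothetical choice
    groups, start = [], 0
    for size in group_sizes:
        groups.append(ids[start:start + size])
        start += size
    return groups


def cross_validation_folds(groups):
    """Yield (train_subjects, test_subjects); each group is tested once."""
    for k, test in enumerate(groups):
        train = [s for j, g in enumerate(groups) if j != k for s in g]
        yield train, test


def learning_rate(initial_lr, epoch):
    """Two-epoch schedule: the rate is divided by 5 at the second epoch."""
    return initial_lr if epoch == 0 else initial_lr / 5.0


# RV data: 37 subjects split into groups of 12, 12 and 13.
groups = split_into_groups(range(37), group_sizes=(12, 12, 13))
folds = list(cross_validation_folds(groups))
initial_lrs = [1.5, 1.0, 0.5, 0.1, 0.05]   # the five rates that were tested
```

Splitting by subject rather than by image keeps all slices of one subject in the same group, so no subject contributes to both training and testing within a fold.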
Dice Similarity Coefficient (DSC) was used to evaluate the segmentation accuracy:
where is the ground truth and is the class prediction after softmax. The DSC of the foreground is selected to represent the segmentation accuracy, as the DSC of the background follows the same trend as that of the foreground. The workers used were Titan Xp (12 GB memory) and 1080Ti (11 GB memory) GPUs with Xeon 1650 and Xeon 1620 CPUs.
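For reference, the DSC used as the accuracy measure above can be computed as in the sketch below. It assumes the standard definition, twice the overlap divided by the sum of the two mask sizes, applied to hard class labels. The function name `dice_similarity`, the flat label lists and the convention for two empty masks are assumptions made here.

```python
def dice_similarity(ground_truth, prediction, label=1):
    """Dice Similarity Coefficient of one class on flat label sequences."""
    assert len(ground_truth) == len(prediction)
    gt = [g == label for g in ground_truth]
    pr = [p == label for p in prediction]
    intersection = sum(1 for g, p in zip(gt, pr) if g and p)
    total = sum(gt) + sum(pr)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


truth = [0, 1, 1, 1, 0, 0]   # hypothetical flattened ground-truth labels
pred = [0, 1, 1, 0, 1, 0]    # hypothetical flattened predicted labels
dsc = dice_similarity(truth, pred)              # foreground DSC
dsc_bg = dice_similarity(truth, pred, label=0)  # background DSC
```

In this toy case two of the three foreground pixels overlap, so the foreground DSC is 2·2/(3+3) = 2/3.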
The code was written with the TensorFlow Estimator Application Programming Interface (API), tf.layers, tf.contrib.layers and tf.data. The TensorFlow version is 1.8.0. The process status of the CPU and GPU both influence the training speed, and training all models under exactly the same process status is impossible. For a fair speed comparison, the time recorded in this paper is for 100 iterations in a clean environment where all other processes are ended. The consumed memory is recorded from the output of the "watch nvidia-smi" command. The parameter amount counts the weights and biases in the atrous convolutional layers and is recorded from model.summary() in Keras.
Six atrous blocks are explored and validated on the three datasets to select the optimal atrous block. For the RV and LV data with an image size of , 128 I-blocks, 32 II-blocks, 10 III-blocks, 3 IV-blocks and 1 V-block are cascaded respectively for a whole-image receptive field. The number of feature channels is set to 12, 16, 24, 38 and 64 respectively to maintain a similar number of parameters in each ACNN, which guarantees a fair comparison between the five ACNNs. The parameter number in each ACNN is , , , , and respectively. For the aorta data with an image size of , 256 I-blocks, 64 II-blocks, 20 III-blocks, 6 IV-blocks, 2 V-blocks, and 1 VI-block are cascaded respectively for a whole-image receptive field. The number of feature channels is set to 12, 16, 24, 38, 64 and 80 respectively to maintain a similar number of parameters in each ACNN. The parameter number is , , , , , and respectively. Before the optimal shortcut connection and normalization method are confirmed, identity mapping and FGN are used as the shortcut connection and normalization method in this set of experiments. Detailed results are shown in Sec. 3.1.
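The block counts listed above can be related to the image size with a short receptive field calculation. The sketch below is illustrative only. It assumes kernel size 3, stride 1 and rates of 3^0 to 3^(N-1) inside an N-block, and uses the standard rule that each stride-1 layer widens the receptive field by (K-1) times its rate. The image sizes of 256 for the RV and LV data and 512 for the aorta data are also assumptions, inferred from the test image sizes mentioned in Sec. 2.2.1 and the block counts.

```python
def block_receptive_field(num_layers, kernel_size=3):
    """Receptive field of one atrous block with rates K**0 ... K**(N-1)."""
    rates = [kernel_size ** n for n in range(num_layers)]
    return 1 + (kernel_size - 1) * sum(rates)


def cascade_receptive_field(num_blocks, num_layers, kernel_size=3):
    """Receptive field of `num_blocks` identical blocks stacked in series."""
    growth = block_receptive_field(num_layers, kernel_size) - 1
    return 1 + num_blocks * growth


# (layers per block, number of blocks) as listed in the text.
rv_lv = [(1, 128), (2, 32), (3, 10), (4, 3), (5, 1)]
aorta = [(1, 256), (2, 64), (3, 20), (4, 6), (5, 2), (6, 1)]

# Image sizes 256 and 512 are assumptions (see the lead-in).
for name, size, configs in [("RV/LV", 256, rv_lv), ("aorta", 512, aorta)]:
    for layers, blocks in configs:
        rf = cascade_receptive_field(blocks, layers)
        print(name, "layers/block:", layers, "blocks:", blocks,
              "receptive field:", rf, "image size:", size)
```

Under these assumptions, 32 II-blocks give a receptive field of 257 and 64 II-blocks give 513, just above the assumed image sizes. Some of the configurations with few blocks, such as 3 IV-blocks at 241, fall slightly short of 256 under this count, so whole-image coverage is approximate there.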
Three shortcut connections, residual learning, identity mapping and dense4 connection, are explored and validated on the three datasets to select the optimal shortcut connection. Details are given in Sec. 3.2. Before the optimal normalization method is confirmed, FGN is used as the normalization method in this set of experiments. Four normalization methods, BN-infer, LN, FGN (the same as BN-train and IN in this paper) and GN4, are explored and validated on the three datasets to select the optimal normalization method. Details are given in Sec. 3.3. It is known that slight differences exist even when exactly the same model is trained multiple times . In this paper, this variance is given in Sec. 3.4.
Three popular DCNNs are used for the comparison: (1) U-Net proposed in with five max-pooling layers; (2) optimized U-Net with FGN proposed in with seven max-pooling layers to achieve the largest receptive field; (3) a hybrid DCNN similar to Deeplab proposed in . In the hybrid DCNN, the intermediate four max-pooling and deconvolutional layers in the optimized U-Net are replaced with two convolutional layers plus one atrous convolutional layer with atrous rates of (2, 4, 8, 16) respectively. The feature root is set to 16 for all methods. The detailed comparison of accuracy, consumed memory, and speed is given in Sec. 3.5. Examples of the segmentation results are shown in Sec. 3.6.
In the following paragraphs, RV-1 refers to the first cross validation of RV segmentation (the first group is used for testing and the second and third groups for training); this notation also applies to RV-2, RV-3, LV-1, LV-2, LV-3, Aorta-1, Aorta-2, and Aorta-3.
3.1 Atrous Block
The segmentation accuracy, consumed memory, training time for 100 iterations, and optimal learning rate of the five ACNNs for RV segmentation, the five ACNNs for LV segmentation, and the six ACNNs for aorta segmentation are shown in Tab. 1, Tab. 2, and Tab. 3 respectively. The mean DSC for each patient in the RV, LV and aorta data with the five or six ACNNs is shown in Fig. 7. For most of the tests, including RV-3, LV-1, LV-3, Aorta-1, Aorta-2 and Aorta-3, the atrous II-block achieves the highest accuracy. For the tests in which the atrous II-block under-performs, namely RV-1, RV-2 and LV-2, it achieves comparable accuracy. For the RV and LV tests, the atrous II-block ACNN also consumes the least memory. For the aorta tests, however, this advantage no longer exists, because the aorta data has a large image size of where the high-resolution feature map consumes a lot of memory. For all three datasets, the training time decreases as the number of atrous convolutional layers in each block increases. The optimal learning rates show no clear pattern. Some patients, e.g. patient 31 in the RV data, patients 29 and 44 in the LV data, and patients 10 and 15 in the aorta data, show a clear accuracy ordering of the five or six atrous blocks in which the atrous II-block out-performs the other atrous blocks. The atrous II-block is used in all experiments below.
3.2 Shortcut Connection
The segmentation accuracy, consumed memory, training time for 100 iterations, and optimal learning rate of the atrous II-block ACNN for segmenting the RV, LV and aorta with different shortcut connections (residual learning, identity mapping and dense4 connection) are shown in Tab. 4. The mean DSC for each patient in the RV, LV and aorta data with the three shortcut connections is shown in Fig. 8. Although residual learning does not achieve the highest accuracy in most tests, its accuracy is very close to the highest value in the tests where it under-performs, i.e. RV-3, LV-1, LV-2, LV-3, Aorta-1, and Aorta-2. The dense4 connection consumes the most memory and takes the longest time to train. The consumed memory of the dense4 connection for the aorta test is an approximate value, as the real value is larger than 12 GB and the displayed value is an optimized value. Residual learning requires almost the same memory and training time as identity mapping. The optimal learning rates again show no clear pattern. Some patients, e.g. patients 6, 12 and 31 in the RV data, patient 20 in the LV data, and patient 19 in the aorta data, show that residual learning clearly out-performs the other shortcut connection methods. However, there are also some examples where it under-performs, e.g. patient 27 in the RV data and patient 29 in the LV data. Residual learning is taken as the optimal shortcut connection method and is used in later experiments.
3.3 Normalization Method
The segmentation accuracy, consumed memory, training time for 100 iterations, and optimal learning rates of the atrous II-block ACNN for segmenting the RV, LV and aorta with different normalization methods (BN-infer, LN, FGN, and GN4) are shown in Tab. 5. The mean DSC for each patient in the RV, LV and aorta data with the four normalization methods is shown in Fig. 9. FGN achieves the highest accuracy in most tests, except RV-1 and Aorta-2. The consumed memory differs little among the methods. In terms of training speed, FGN is similar to BN-infer and LN while GN4 is the slowest. The optimal learning rates of BN-infer are smaller than those of LN, FGN and GN4. For some patients, e.g. patients 4, 14 and 23 in the RV data, patients 39 and 40 in the LV data, and patients 17 and 19 in the aorta data, FGN clearly achieves the highest accuracy. FGN is selected as the optimal normalization method and is used in the following experiments.
|Optimal learning rate||0.5||1.5||0.05||0.05||0.5||0.5||0.5||0.1||0.05|
3.4 Multiple Runs
The models with parameter settings exactly the same as those of the FGN row in Tab. 5 were trained five additional times (six times in total, including the run in Tab. 5). The mean and variance of the mean DSC of the six trainings are shown in Tab. 6. We can see that the DSC variance is , which is in the normal range - stated in and is comparable to the DSC variance - when training U-Net multiple times .
It is worth noting that the average DSC in Tab. 6 is slightly lower than that in the FGN row in Tab. 5. We think this is because the DSC in Tab. 5 is selected from five trainings with five different learning rates. This selection process may yield a slightly optimistic accuracy, as a less well-optimized training would not out-perform the other learning rates and so would not be selected for reporting. This selection process does not cause unfairness, as it is the same for all other tests as well.
3.5 Comparison with Other Methods
The segmentation accuracy, consumed memory, training time for 100 iterations, optimal learning rates and parameter number of the proposed ACNN, hybrid network , optimized U-Net and U-Net are given in Tab. 7. The mean DSC for each patient in the RV, LV and aorta data with the four segmentation methods is shown in Fig. 10. Ranked by the number of tests on which each method achieves the highest accuracy, the order is: optimized U-Net, the proposed ACNN, hybrid network and U-Net. However, the proposed ACNN uses far fewer parameters than the other three methods. We think this advantage comes from the dimensionally lossless feature maps inside the proposed ACNN. The proposed ACNN also consumes much less memory and training time than the optimized U-Net for the RV and LV tests; however, this advantage disappears for the aorta tests, because the large image size of the aorta data increases the memory and training time significantly. The consumed memory of the optimized U-Net for the RV or LV tests is almost the same as that for the aorta tests, as the trainable parameters occupy most of the consumed memory and the difference caused by the image size is negligible. It is possible that we did not achieve the best performance of the other three methods; however, the proposed ACNN was likewise not fully optimized. Hence, this comparison is considered to be fair.
3.6 Segmentation Details
Four examples of the RV, LV and aorta segmentation results are selected randomly and shown in Fig. 11. As the RV and LV data are discontinuous MRI images, only 2D segmentation slices are shown. It can be seen that reasonable segmentation results are achieved.
An atrous rate setting is proposed: the atrous rate of the $n$-th atrous convolutional layer is set to $K^{n-1}$, where $K$ is the convolutional kernel size. It achieves the largest and fully-covered receptive field with a minimum number of atrous convolutional layers. Comparison experiments with traditional atrous rate settings, i.e. (1, 2, 4, 8, …), (1, 2, 5, 9, …), are not conducted because: 1) the smaller receptive field produced by traditional atrous rate settings does not necessarily indicate lower segmentation accuracy, while the larger receptive field produced by the proposed atrous rate setting does not necessarily indicate higher segmentation accuracy; 2) besides the receptive field, the path number of each input node also influences the segmentation accuracy. The mixed and complex reasons behind a good segmentation result make it difficult to judge the atrous rate setting from the segmentation accuracy. Hence, in this paper, a detailed proof and derivation are given instead.
Six atrous blocks (I-block, II-block, III-block, IV-block, V-block, VI-block) with receptive fields of 3, 9, 27, 81, 243 and 729 respectively are proposed and explored. For an atrous block with a larger receptive field, e.g. the VI-block, fewer blocks and fewer atrous convolutional layers in total are needed to cover the whole input image. Under the network framework in this paper, i.e. atrous block cascade, identity mapping and FGN, the experiments indicate that the atrous II-block is optimal for biomedical semantic segmentation. However, if the network framework, the settings or the task is changed, the optimal atrous block may be different. This requires further exploration.
Dense connection was shown to be efficient in . In this paper, it is not adopted owing to its unremarkable segmentation accuracy and high memory consumption. Identity mapping was shown to be an improvement on residual learning in . In this paper, it is not used owing to its slightly lower robustness and stability. Residual learning is therefore used as the shortcut connection. BN, IN, LN and GN are the four most popular normalization methods used in biomedical semantic segmentation. It was shown in that FGN is the optimal normalization for U-Net. In this paper, FGN also out-performs the other normalization methods and is used. Except in Sec. 3.4, all reported accuracies are recorded from the first training only.
The proposed ACNN achieves segmentation accuracy comparable to that of the hybrid network, optimized U-Net and U-Net, while using far fewer parameters. We think this comes from the efficient information contained in dimensionally lossless feature maps. This advantage is very useful when deploying the trained model on mobile devices, as the trained model occupies much less memory. For images with a smaller image size, the proposed ACNN also consumes less memory and training time. However, the consumed memory and training time increase significantly with the image size. This could be further optimized in future work. Segmentation DCNNs specific to an anatomy, e.g. Omega-Net proposed for cardiac MRI segmentation and Focal U-Net proposed for class-imbalanced stent graft marker segmentation , are not compared, as they usually apply additional algorithms related to the characteristics of the anatomy and are not generalizable to all datasets.
The reported training time is only for 100 iterations in a clean environment. This time could be much longer when the computer and GPU are occupied by other processes. In practice, training one model takes up to 16 hours. For a fair comparison, five learning rates are explored for each experiment, so producing one DSC value in the tables above takes up to 4 days.
A new dimensionally lossless DCNN, ACNN, is proposed using cascaded atrous II-blocks, residual learning and FGN. A new atrous rate setting is proposed to achieve the largest and fully-covered receptive field with a minimum number of atrous convolutional layers. Six atrous blocks (I-block, II-block, III-block, IV-block, V-block and VI-block), three shortcut connections (residual learning, identity mapping and dense connection), and four normalization methods (BN, IN, LN and GN) are explored through extensive experiments to select the optimal atrous block, shortcut connection and normalization layer. With far fewer trainable parameters than the hybrid network, optimized U-Net and U-Net, the proposed ACNN achieves comparable accuracy. Code will be made available online.
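The notion of a largest and fully-covered receptive field can be checked numerically along one image axis. The sketch below is an illustration rather than the authors' implementation: it assumes stride-1 convolutions with an odd kernel size, and the rate lists used (such as 1, 3, 9) are illustrative choices, not values quoted from this section. A receptive field is treated as fully covered when every input position inside it is reachable through some combination of filter taps, i.e. no node is missing.

```python
from itertools import product


def receptive_field(rates, kernel=3):
    """Width of the receptive field of cascaded stride-1 atrous convolutions."""
    return 1 + (kernel - 1) * sum(rates)


def covered_offsets(rates, kernel=3):
    """Input offsets (relative to the output node) reachable through the cascade."""
    if kernel % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size")
    half = kernel // 2
    taps = range(-half, half + 1)
    # Each layer contributes (tap index * atrous rate) to the total offset.
    return {
        sum(rate * tap for rate, tap in zip(rates, combo))
        for combo in product(taps, repeat=len(rates))
    }


def is_fully_covered(rates, kernel=3):
    """True if every position inside the receptive field is reachable."""
    return len(covered_offsets(rates, kernel)) == receptive_field(rates, kernel)


if __name__ == "__main__":
    for rates in ([1, 2, 4], [2, 2, 2], [1, 3, 9]):
        print(rates, receptive_field(rates), is_fully_covered(rates))
```

With three 3×3 layers, the rates (1, 2, 4) give a fully covered field of width 15, (2, 2, 2) leave gaps, and (1, 3, 9) give a fully covered field of width 27, which shows how the rate setting determines both the size and the coverage of the receptive field.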
The authors thank Qing-Biao Li for the collection and processing of the data. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
-  F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 565–571.
-  X.-Y. Zhou, J. Lin, C. Riga, G.-Z. Yang, and S.-L. Lee, “Real-time 3-D shape instantiation from single fluoroscopy projection for fenestrated stent graft deployment,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 1314–1321, 2018.
-  Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in Advances in Neural Information Processing Systems, 2017, pp. 4470–4478.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 12, pp. 2481–2495, 2017.
-  F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.
-  P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” arXiv preprint arXiv:1702.08502, 2017.
-  L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
-  T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, “Full-resolution residual networks for semantic segmentation in street scenes,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 3309–3318.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  X.-Y. Zhou and G.-Z. Yang, “Normalization in training deep convolutional neural networks for 2d bio-medical semantic segmentation,” arXiv preprint arXiv:1809.03783, 2018.
-  W. Luo, Y. Li, R. Urtasun, and R. Zemel, “Understanding the effective receptive field in deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4898–4906.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  ——, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
-  G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, vol. 1, no. 2, 2017, p. 3.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” stat, vol. 1050, p. 21, 2016.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
-  Y. Wu and K. He, “Group normalization,” arXiv preprint arXiv:1803.08494, 2018.
-  X.-Y. Zhou, G.-Z. Yang, and S.-L. Lee, “A real-time and registration-free framework for dynamic shape instantiation,” Medical image analysis, vol. 44, pp. 86–97, 2018.
-  P. Radau, Y. Lu, K. Connelly, G. Paul, A. Dick, and G. Wright, “Evaluation framework for algorithms segmenting short axis cardiac mri,” The MIDAS Journal-Cardiac MR Left Ventricle Segmentation Challenge, vol. 49, 2009.
-  O. Jimenez-del Toro, H. Müller, M. Krenn, K. Gruenberg, A. A. Taha, M. Winterstein, I. Eggel, A. Foncubierta-Rodríguez, O. Goksel, A. Jakab et al., “Cloud-based evaluation of anatomical structure segmentation and landmark detection algorithms: Visceral anatomy benchmarks,” IEEE transactions on medical imaging, vol. 35, no. 11, pp. 2459–2475, 2016.
-  V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
-  D. M. Vigneault, W. Xie, C. Y. Ho, D. A. Bluemke, and J. A. Noble, “Ω-net (omega-net): Fully automatic, multi-view cardiac mr detection, orientation, and segmentation with deep neural networks,” Medical image analysis, vol. 48, pp. 95–106, 2018.
-  X.-Y. Zhou, C. Riga, S.-L. Lee, and G.-Z. Yang, “Towards automatic 3d shape instantiation for deployed stent grafts: 2d multiple-class and class-imbalance marker segmentation with equally-weighted focal u-net,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1261–1267.