1 Introduction
Medical image segmentation remains challenging despite the substantial advances in deep convolutional neural networks (CNNs). To segment objects of different sizes, CNNs must be able to extract multiscale information. However, due to the limited kernel size, a classic convolutional layer can only extract image information within a certain range (called the receptive field). Traditionally, two types of methods have been proposed to enlarge the receptive field. One is to expand the convolutional kernel, at a substantially increased computational cost and time [8]. The other introduces the max pooling operation, which roughly doubles the receptive field but loses part of the image information [6].
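The receptive-field arithmetic behind this trade-off can be illustrated with the standard recurrence for stacked layers (a back-of-envelope sketch; the layer configurations below are hypothetical, not an architecture from the cited works):

```python
# Sketch: how the receptive field (RF) grows through a CNN, using the
# standard recurrence RF_out = RF_in + (k - 1) * jump, jump_out = jump_in * stride.
# Each layer is given as (kernel_size, stride).

def receptive_field(layers):
    rf, jump = 1, 1
    for k, stride in layers:
        rf += (k - 1) * jump
        jump *= stride
    return rf

# Three 3x3 stride-1 convolutions: RF grows only linearly (3 -> 5 -> 7).
print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # 7
# A 2x2 max-pool (stride 2) doubles the growth rate of later layers,
# at the cost of spatial resolution.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # 8
```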
Recent works aim to find better solutions to this problem. Yu et al. [11, 12] first proposed to use the dilated convolution to increase the size of the receptive field without losing information. Two types of studies have further improved the method. The first is to manually assign the dilation rate for each convolutional layer. For example, in order to solve the problem of small and dense target segmentation, Hamaguchi et al. [2] and Li et al. [5] use different configurations of dilation rates in deep segmentation models and achieve high accuracies. The second is to use a pyramid structure of dilated convolutions to fuse information from multiple scales. Chen et al. develop a module called Atrous Spatial Pyramid Pooling (ASPP), in which multiscale information is extracted by parallel convolutions with different dilation rates [1]. Mehta et al. [7] and Wang et al. [10] also report similar pyramid (hybrid) dilated convolutional modules that improve semantic segmentation.
Although the aforementioned methods show that the dilated convolution is effective, it has two major limitations: 1) training a dilated CNN requires extra effort on tuning the dilation rates; 2) a manually designed dilation rate only provides a fixed-size receptive field, which may not be suitable for different objects in the same image. Following the idea of dilated convolution, we therefore propose a new convolutional network architecture, referred to as the Adaptive-Scale Convolutional Network (ASCNet). By introducing a small 3-layer convolutional structure (called the additional network), ASCNet adaptively learns the dilation rate for each pixel on the input feature map. Such pixel-level dilation rates form a dilation rate field, which is applied to all convolutional layers. The proposed network is optimized in an end-to-end training process, in which the convolutions extract image information of different objects within optimal receptive fields.
We compare the classic CNN, the dilated CNN, and the proposed ASCNet on both the Herlev [3] and RBC [13] datasets. The experimental results show that ASCNet improves the accuracy of image segmentation on both datasets with only a slight increase in computational cost. Furthermore, the dilation rates learned by the proposed ASCNet are positively correlated with the sizes of the objects in both datasets. This confirms that the proposed method effectively finds the optimal dilation rates for objects of different sizes.
2 Method
2.1 Classic and dilated convolutions
The classic convolutional kernel samples the input feature map on an integer grid; for example, the grid for a 3×3 convolutional kernel is R = {(−1, −1), (−1, 0), …, (1, 0), (1, 1)}. Classic convolution can be expressed as:

y(p0) = Σ_{pn ∈ R} w(pn) · x(p0 + pn),    (1)
where x(p0 + pn) denotes the input value at pixel p0 + pn, y(p0) denotes the output value at pixel p0, and w(pn) is the weight at offset pn in the kernel. The classic convolution has a small receptive field because of the limited grid size, so it usually works with a successive downsampling process to enlarge the receptive field at a cost of spatial resolution.
Dilated convolution [11, 9] provides an alternative way to enlarge the receptive field by inserting “holes” into the convolutional kernel with a dilation rate d, and can be defined as:

y(p0) = Σ_{pn ∈ R} w(pn) · x(p0 + d · pn),    (2)
where d is an integer, so dilated convolution is a discrete operation and it degenerates into classic convolution when d = 1. Compared to a successive downsampling process, dilated convolution is able to systematically integrate multiscale contextual features without losing spatial information.
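As an illustration, Eq. (2) can be sketched as a naive single-channel implementation in numpy (the helper name `dilated_conv2d` and the loop form are ours, chosen for clarity rather than efficiency):

```python
import numpy as np

# Sketch of Eq. (2): single-channel 2-D dilated convolution, written as
# explicit loops for clarity (illustrative helper, not code from the paper).
def dilated_conv2d(x, w, d):
    k = w.shape[0]                      # square kernel, e.g. 3x3
    r = (k // 2) * d                    # reach of the dilated kernel
    y = np.zeros_like(x)
    for i in range(r, x.shape[0] - r):
        for j in range(r, x.shape[1] - r):
            for m in range(k):
                for n in range(k):
                    # offset p_n scaled by the dilation rate d
                    y[i, j] += w[m, n] * x[i + (m - k // 2) * d,
                                           j + (n - k // 2) * d]
    return y

x = np.zeros((9, 9)); x[4, 4] = 1.0     # unit impulse
w = np.ones((3, 3))
y1 = dilated_conv2d(x, w, 1)            # classic convolution (d = 1)
y2 = dilated_conv2d(x, w, 2)            # samples a 5x5 neighbourhood
# With d = 2 the responses sit two pixels apart: the receptive field
# grows to d*(k-1)+1 = 5 without adding any parameters.
```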
2.2 Adaptive-scale convolution (ASC)
Following the idea of the dilated convolution, we design a novel convolutional module called adaptive-scale convolution (ASC). ASC differs from the dilated convolution in three ways: 1) the dilation rate is learned from raw data instead of manually assigned; 2) different pixels on a given feature map employ adaptive dilation rates rather than the same one; 3) the dilation rate is a floating-point value rather than an integer.
More specifically, we use an additional network to learn the pixel-level dilation rates, which form a rate field with one channel and the same size as the input feature map. Different dilation rates can thus be used on the feature map simultaneously, providing optimal receptive fields for objects of different sizes. The ASC can be expressed as:

y(p0) = Σ_{pn ∈ R} w(pn) · x(p0 + d(p0) · pn),    (3)

where the pixel-level dilation rate d(p0) is learned by the additional network, which takes the raw image as input and has its own trainable parameters.
As shown in Fig. 1, the learned dilation rate in our proposed ASC is usually a floating-point value, so the sampling location p = p0 + d(p0) · pn may not lie exactly on the regular grid. We therefore introduce a bilinear interpolation to compute the value at p from the integer coordinates on the input feature map [4]:

x(p) = Σ_q G(q, p) · x(q),    (4)

where q enumerates all integer locations on the input feature map and G(q, p) is the bilinear interpolation kernel, defined as:

G(q, p) = max(0, 1 − |qx − px|) · max(0, 1 − |qy − py|),    (5)

where (px, py) and (qx, qy) denote the (x, y)-coordinates of p and q, respectively. In fact, only the four integer locations adjacent to p contribute to the interpolated value, because G(q, p) equals 0 at all other locations. The right panel of Fig. 1 demonstrates an example of p falling inside the grid. Eq. (3) is thus further defined as:

y(p0) = Σ_{pn ∈ R} w(pn) · Σ_q G(q, p0 + d(p0) · pn) · x(q).    (6)
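A minimal numpy sketch of Eqs. (3)-(6) is given below: a per-pixel floating-point rate field drives the sampling offsets, and the fractional sampling locations are resolved by the four-neighbour bilinear kernel. The helper names and the loop form are illustrative, not the paper's implementation:

```python
import numpy as np

# Sketch of Eqs. (3)-(6): adaptive-scale convolution with a per-pixel,
# floating-point dilation rate field (illustrative, single channel).

def bilinear(x, px, py):
    # Eqs. (4)-(5): only the four integer neighbours of (px, py) contribute;
    # out-of-range neighbours are skipped (implicit zero padding).
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    val = 0.0
    for qx in (x0, x0 + 1):
        for qy in (y0, y0 + 1):
            if 0 <= qx < x.shape[0] and 0 <= qy < x.shape[1]:
                g = max(0.0, 1 - abs(qx - px)) * max(0.0, 1 - abs(qy - py))
                val += g * x[qx, qy]
    return val

def adaptive_scale_conv2d(x, w, rate_field):
    k = w.shape[0]
    y = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            d = rate_field[i, j]        # pixel-level dilation rate d(p0)
            for m in range(k):
                for n in range(k):
                    y[i, j] += w[m, n] * bilinear(
                        x, i + (m - k // 2) * d, j + (n - k // 2) * d)
    return y

x = np.random.rand(8, 8)
w = np.ones((3, 3)) / 9.0
rates = np.full((8, 8), 1.5)            # a float rate, impossible for
y = adaptive_scale_conv2d(x, w, rates)  # an ordinary dilated convolution
```

With a rate field of all ones, the bilinear kernel collapses onto the integer grid and the module reduces to the classic convolution of Eq. (1), as the text requires.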
2.3 Adaptive-scale convolutional neural network (ASCNet)
Based on the ASC module, we present the adaptive-scale convolutional neural network (ASCNet) to deal with pixel-wise semantic segmentation. As shown in Fig. 2, ASCNet contains an additional network and a series of ASC modules connected in sequential order. First, the dilation rate field is learned from the input image by the additional network, which is formed by 3 standard convolutions with ReLU activation. We set the channel numbers of the 3 convolutions to 8, 4, and 1, respectively. Second, the learned rate field is transmitted to each ASC module, providing pixel-level dilation rates. The ASCs thus sample multiscale information with the optimal receptive fields as described in Section 2.2. The proposed ASCNet is trained in an end-to-end fashion, in which the rate field and the segmentation output are optimized simultaneously.
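The additional network's shape can be sketched as follows. The paper specifies only the channel counts (8, 4, 1) and ReLU activations; the 'same' zero-padding, the random weights, and the helper names here are our assumptions for illustration:

```python
import numpy as np

# Sketch of the additional network: three 3x3 convolutions with ReLU and
# channel counts 8 -> 4 -> 1, producing a one-channel rate field with the
# same spatial size as the input image.

def conv2d_same(x, w):                  # x: (H, W, Cin), w: (3, 3, Cin, Cout)
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + 3, j:j + 3, :]
            out[i, j, :] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def additional_network(img, weights):
    h = img
    for w in weights:                   # channels: 8, then 4, then 1
        h = np.maximum(conv2d_same(h, w), 0.0)   # ReLU
    return h                            # the pixel-level dilation rate field

rng = np.random.default_rng(0)
img = rng.random((16, 16, 3))
weights = [rng.standard_normal((3, 3, c_in, c_out)) * 0.1
           for c_in, c_out in [(3, 8), (8, 4), (4, 1)]]
rate_field = additional_network(img, weights)
# One rate per pixel, shared by every ASC module in the network.
```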
3 Experiments
Data Description. We evaluate the proposed method on two public datasets (the Herlev dataset [3] and the SCD RBC dataset [13]): 1) The Herlev dataset consists of 917 cervical cell images (562 for training, 171 for validation, and 184 for testing) from Pap smear tests. All images are normalized to zero mean and unit variance intensity and are resized to a fixed resolution; 2) The SCD RBC dataset consists of 314 microscopy images (250 for training and 64 for testing) from 5 Sickle Cell Disease (SCD) patients. Binary annotations (red blood cells vs. background) are provided. We remove the blank margins of the raw images and resize them to a fixed resolution. More information about the two datasets can be found in [3, 13].
Experiments. To evaluate the effectiveness of the proposed ASCNet, we compare five models: 1) the classic CNN, which consists of 7 standard convolutions; the channel numbers of the first 6 convolutions are set to 8 and that of the last layer equals the number of classes to generate the segmentation result; 2) the dilated CNN, which contains 7 dilated convolutions with dilation rates of 1, 1, 2, 4, 8, 16, and 1, respectively; this setting follows the “exponential expansion” scheme reported in [11]; 3) U-Net, the previous state of the art on these two datasets; 4) ASCNet-7, the proposed ASCNet with 7 adaptive-scale convolutions, whose channel numbers are set in the same way as the classic CNN; 5) ASCNet-14, the proposed ASCNet with 14 adaptive-scale convolutions, where the channel numbers of the first 13 convolutions are set to 32 and that of the last layer is set to the number of classes. The kernel size is 3 for all convolutions in the experiments.
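For the dilated-CNN baseline, the accumulated receptive field of the rate schedule above is a small arithmetic exercise: for stride-1 3×3 convolutions, each layer with dilation rate d adds 2·d pixels to the receptive field (our back-of-envelope calculation, not a figure from the paper):

```python
# Sketch: receptive field of the dilated-CNN baseline. For stride-1 3x3
# convolutions, a layer with dilation rate d adds 2*d to the receptive field.

rates = [1, 1, 2, 4, 8, 16, 1]          # "exponential expansion" scheme [11]
rf = 1 + sum(2 * d for d in rates)
print(rf)                                # 67-pixel receptive field
```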
Implementation Details.
We train the models on a single NVIDIA GTX 1080 Ti GPU. To minimize the softmax cross-entropy loss, the networks are trained with the Adam optimizer for 50,000 epochs. Limited by GPU memory, each batch contains only one sample. Furthermore, we employ ReLU as the activation function in all convolutional layers (including the classic convolutions, the dilated convolutions, and the ASCs). The networks are implemented in TensorFlow 1.2.1.
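The per-pixel softmax cross-entropy loss being minimized can be written out in numpy for reference (the paper uses TensorFlow's built-in op; this reference formulation and its helper name are ours):

```python
import numpy as np

# Sketch: softmax cross-entropy over flattened pixels.
# logits: (N, C) class scores, labels: (N,) integer class ids.
def softmax_cross_entropy(logits, labels):
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.0], [0.0, 3.0]])
labels = np.array([0, 1])
loss = softmax_cross_entropy(logits, labels)   # small, since both are correct
```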
4 Results
Table 1. Cell segmentation performance of the five models on the two datasets.

                 |     The Herlev dataset      |      SCD RBC dataset
                 | Dice  | Precision | Recall  | Dice  | Precision | Recall
Classic CNN      | 0.745 | 0.765     | 0.767   | 0.958 | 0.967     | 0.949
Dilated CNN      | 0.824 | 0.817     | 0.890   | 0.956 | 0.960     | 0.953
ASCNet-7         | 0.857 | 0.863     | 0.891   | 0.959 | 0.960     | 0.958
U-Net [14, 13]   | 0.869 | 0.897     | 0.879   | 0.957 | 0.955     | –
ASCNet-14        | 0.906 | 0.909     | 0.925   | 0.967 | 0.973     | 0.961
Evaluation of the performance of ASCNet. Table 1 reports the results of cell segmentation on the two datasets, where we use Dice, Precision, and Recall to measure the performance of the five approaches. The first three approaches, the classic CNN, the dilated CNN, and the proposed ASCNet-7, share the same network backbone of 7 convolutions with a kernel size of 3. As shown in Table 1, ASCNet-7 outperforms the other two, suggesting that the proposed ASC module is effective. Moreover, among the last two approaches, ASCNet-14 outperforms U-Net on both datasets, indicating that the proposed ASCNet can achieve accuracy comparable to that of state-of-the-art methods. Lastly, compared to ASCNet-7, ASCNet-14 benefits from its deeper (more convolutions) and wider (more channels) network architecture, which is important for extracting image information. Fig. 3 illustrates two examples of the segmentations produced by the different models on the Herlev dataset. It clearly shows that segmentation errors are reduced significantly after we introduce the ASC modules.
Discussion about the learned dilation rate. To better understand the mechanism of the ASC module, we display examples of the rate fields generated by ASCNet-14 on both datasets in Fig. 4 (a). Additionally, in Fig. 4 (b), we display the corresponding distributions of the learned dilation rates. For the Herlev dataset (top row in Fig. 4 (a)), the cervical cells are imaged at high magnification. The ASCNet thus produces larger dilation rates accordingly, which are more capable of aggregating global information from the large objects. For the SCD RBC dataset (bottom row in Fig. 4 (a)), the red blood cells are imaged at low magnification, so the ASCNet chooses to capture more local and fine details by producing smaller dilation rates. Another interesting finding is that the ASCNet can produce dilation rates less than 1, whereas the dilated convolution is incapable of doing so because its dilation rate must be an integer. In summary, the proposed ASCNet can adaptively learn appropriate dilation rates for targets of different sizes, allowing the convolutions to better collect multiscale features systematically.
The dilation rates produced by ASCNet-14 are smaller than those produced by ASCNet-7 (see Fig. 4 (b)). One possible explanation is that the final size of the receptive field (accumulated over successive convolutional layers) required to extract information about a given object is roughly constant regardless of model depth. As all convolutions in ASCNet share the same dilation rate field, a deeper ASCNet magnifies the receptive field more times, and each convolutional layer then requires smaller dilation rates. Such behavior is worth further investigation with quantitative analysis in future work.
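This explanation can be made concrete with a back-of-envelope calculation: for L stride-1 3×3 layers that all share one dilation rate d, the accumulated receptive field is roughly 1 + 2·d·L, so doubling the depth halves the rate needed to cover the same field (our arithmetic, not a measurement from the paper):

```python
# Back-of-envelope sketch: accumulated receptive field of L stride-1 3x3
# layers sharing a single dilation rate d. Doubling the depth halves the
# per-layer rate needed for the same receptive field.

def rf(num_layers, d):
    return 1 + 2 * d * num_layers

assert rf(14, 1.0) == rf(7, 2.0)        # deeper net, smaller per-layer rate
```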
5 Conclusion
In this work, we have proposed an adaptive-scale convolutional neural network (ASCNet) for image segmentation, which adaptively learns pixel-level dilation rates to form a dilation rate field and, ultimately, produces appropriate receptive fields for objects of different sizes. Experimental results demonstrate that the proposed ASCNet is effective in extracting multiscale information and achieves state-of-the-art accuracy on cell segmentation. Based on the visualization of the learned rate fields on the Herlev and SCD RBC datasets, we discover a positive correlation between the optimal dilation rates and the sizes of the segmentation targets. This finding confirms that a key to improving multiscale information extraction is to set the most appropriate receptive field for each object in the image. In addition, compared to its plain counterparts, ASCNet only introduces an additional 3-layer network with a slightly increased computational cost. This allows us to easily replace the classic and dilated convolutions with ASCs in existing deep models. In future work, we will test the proposed method on other computer vision tasks and further study the mechanism of multiscale feature learning with ASC modules.
References
 [1] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848 (2018)
 [2] Hamaguchi, R., Fujita, A., Nemoto, K., Imaizumi, T., Hikosaka, S.: Effective use of dilated convolutions for segmenting small object instances in remote sensing imagery. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1442–1450. IEEE (2018)
 [3] Jantzen, J., Norup, J., Dounias, G., Bjerregaard, B.: Pap-smear benchmark data for pattern classification. Nature inspired Smart Information Systems (NiSIS 2005) pp. 1–9 (2005)
 [4] Li, J., Chen, Y., Cai, L., Davidson, I., Ji, S.: Dense transformer networks. arXiv preprint arXiv:1705.08881 (2017)
 [5] Li, Y., Zhang, X., Chen, D.: CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1091–1100 (2018)
 [6] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)
 [7] Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 552–568 (2018)
 [8] Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J.: Large kernel matters–improve semantic segmentation by global convolutional network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4353–4361 (2017)
 [9] Shi, W., Jiang, F., Zhao, D.: Single image super-resolution with dilated convolution based multi-scale information learning inception module. In: 2017 IEEE International Conference on Image Processing (ICIP). pp. 977–981. IEEE (2017)
 [10] Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G.: Understanding convolution for semantic segmentation. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1451–1460. IEEE (2018)
 [11] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
 [12] Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 472–480 (2017)
 [13] Zhang, M., Li, X., Xu, M., Li, Q.: RBC semantic segmentation for sickle cell disease based on deformable U-Net. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 695–702. Springer (2018)
 [14] Zhao, J., Li, Q., Li, H., Zhang, L.: Automated segmentation of cervical nuclei in Pap smear images using deformable multi-path ensemble model. arXiv preprint arXiv:1812.00527 (2018)