In recent years, the rapid development of deep learning (DL), represented by convolutional neural networks (CNNs), has greatly advanced computer vision research in areas such as classification, detection, segmentation and tracking. Many successful CNN models, including AlexNet, VGG, GoogLeNet and ResNet, have been proposed since 2012. Researchers have also established large datasets such as ImageNet and COCO, which have further promoted related research, and initializing models with parameters pre-trained on these datasets greatly improves training efficiency. Given the success of DL in computer vision, researchers have applied it to medical images such as Computed Tomography (CT), ultrasound, X-ray and Magnetic Resonance Imaging (MRI). In this setting, automatic segmentation has effectively reduced the time and cost of manual labeling, and it is the subject of this study.
A major difference between medical images and natural scene images is that medical images are difficult to obtain. The scarcity of data makes it hard to train deeper neural networks, and the domain gap between medical and natural images means that models pre-trained on ImageNet, COCO and other natural scene datasets perform poorly. To address this, U-Net and various improved versions of it have been proposed, achieving good segmentation performance with relatively little data. These variants focus on improvements in network architecture, such as integrating recurrent neural networks into U-Net. In this paper, we study the effect of convolution kernel size on model performance and propose a new module named MixModule. We expect convolution kernels of different sizes to capture different levels of information, since they have different receptive fields, and we expect the fusion of this information to play an important role in improving network performance.
2 Related Work
Semantic segmentation, which classifies each pixel in an image individually, is an important research area in computer vision. Before the DL revolution, traditional methods mostly relied on hand-crafted features to predict the category of each pixel. Even in the early days of DL, researchers mainly used patch-wise training, which classifies a pixel by feeding an image block around it into a CNN. This approach is inefficient, since adjacent blocks overlap almost entirely, and its receptive field is limited by the block size, so it is difficult to achieve good results. The FCN method proposed by Long et al., which uses a fully convolutional network structure and fully convolutional training, changed this situation and became the basis of subsequent research. The DeepLab series, built on FCN, introduced atrous convolution and other operations to further improve the accuracy of semantic segmentation. Instance segmentation extends semantic segmentation by predicting not only the class of each pixel but also the individual object instance to which it belongs. Mask R-CNN, based on Faster R-CNN, and PANet, based on FPN, achieved state-of-the-art performance on the instance segmentation task. Although these methods have achieved impressive results, they rely on features pre-trained on public datasets such as ImageNet. Because of the domain gap between medical and natural scene images and the scarcity of medical images, they do not transfer well to medical image segmentation. In response to the characteristics of medical images, Ronneberger et al. proposed U-Net, which achieves competitive performance using a relatively small number of medical images.
On the basis of U-Net, researchers have successively proposed R2U-Net, Attention U-Net, etc., which further promote the development of medical image segmentation research. This paper proposes MixModule from the perspective of convolution kernel size and demonstrates its contribution to the performance of U-Net and its variants.
3 Proposed Method
MixModule contains convolution kernels of multiple sizes to capture different ranges of semantic information, which is crucial for medical images, where low-level image details are important. Let $W_i$ denote the $i$-th convolutional kernel, whose kernel size is $k_i \times k_i$, input channel size is $C$ and filter number is $C_i$. Let $X \in \mathbb{R}^{H \times W \times C}$ denote the input tensor with height $H$, width $W$ and $C$ channels. Let $Y_i$ denote the $i$-th output tensor, computed with $W_i$ as
$$Y_i = W_i * X, \qquad i = 1, \dots, n, \tag{1}$$
and let the output tensor $Y$ be obtained by concatenating the $Y_i$ along the channel dimension,
$$Y = \mathrm{Concat}(Y_1, Y_2, \dots, Y_n), \tag{2}$$
where $n$ is the total number of kernels used.
In this paper, we set n = 4 and use three different convolution kernel sizes. Figure 1 and Figure 2 show the details of the modules: Figure 1 shows the basic modules used in U-Net and its variants, and Figure 2 shows the corresponding MixModule versions.
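The parallel-kernel idea described above can be sketched in PyTorch as follows. This is a minimal illustration and not the authors' implementation: the class arguments `in_channels`, `out_channels` and `kernel_sizes`, the placeholder kernel sizes (3, 5, 7), the even split of output channels across branches and the size-preserving padding are all assumptions, and any normalization or activation layers used in the actual modules are omitted.

```python
import torch
import torch.nn as nn


class MixModule(nn.Module):
    """Parallel convolutions with different kernel sizes, concatenated on channels."""

    def __init__(self, in_channels, out_channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        n = len(kernel_sizes)
        # Assumption: split the output channels as evenly as possible across branches.
        base, extra = divmod(out_channels, n)
        branch_channels = [base + (1 if i < extra else 0) for i in range(n)]
        self.branches = nn.ModuleList(
            [
                # padding = k // 2 keeps height and width unchanged for odd k.
                nn.Conv2d(in_channels, c, kernel_size=k, padding=k // 2)
                for k, c in zip(kernel_sizes, branch_channels)
            ]
        )

    def forward(self, x):
        # One output per kernel, then concatenation along the channel dimension.
        return torch.cat([branch(x) for branch in self.branches], dim=1)


if __name__ == "__main__":
    module = MixModule(in_channels=3, out_channels=64)
    x = torch.randn(2, 3, 48, 48)
    print(module(x).shape)  # torch.Size([2, 64, 48, 48])
```

Because every branch preserves the spatial size, a module like this could stand in for a single-size convolution layer with the same number of output channels.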
3.2 Neural Network Architecture
We use MixModule to replace the single-size convolution kernels in the original U-Net and its variants. In this paper, we experiment with U-Net, R2U-Net and Attention U-Net (AttU-Net for short), whose network structures are shown in Figure 3; the brown modules indicate where the replacement is made.
4 Experiment and Results
To demonstrate the effect of MixModule, we perform experiments on two medical image segmentation tasks with 2D images: skin lesion segmentation and retina blood vessel segmentation (DRIVE and CHASE_DB1). We implement all experiments in PyTorch on a single-GPU machine with an NVIDIA Quadro P6000.
4.1.1 Skin Lesion Segmentation
This dataset comes from the ISIC Skin Image Analysis Workshop and Challenge at MICCAI 2018 and contains 2594 samples in total. The dataset was split into a training set (70%), a validation set (10%) and a test set (20%), which gives 1815 images for training, 259 for validation and 520 for testing. The original samples differed slightly in size and were resized to 192×256.
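The reported split sizes can be reproduced with a short sketch. The paper does not say how the split was drawn, so the shuffle, the seed and the helper name `split_indices` are assumptions. Rounding the training and validation counts down and giving the remainder to the test set happens to match the reported 1815/259/520.

```python
import random


def split_indices(n_samples, train_pct=70, val_pct=10, seed=0):
    """Shuffle sample indices and split them into train/validation/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the test set takes the remainder
    return train, val, test


train, val, test = split_indices(2594)
print(len(train), len(val), len(test))  # 1815 259 520
```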
4.1.2 Retina Blood Vessel Segmentation
We perform retina blood vessel segmentation experiments on two datasets, DRIVE and CHASE_DB1. The DRIVE dataset consists of 40 retinal images in total, of which 20 are used for training and the remaining 20 for testing. Each original image is 565×584 pixels, and all images are cropped and zero-padded to 576×576 to obtain square images. We randomly select 531,265 patches of size 48×48 from the 20 training images in DRIVE, and 10% of them are used for validation. The other dataset, CHASE_DB1, contains 28 color retina images of 999×960 pixels, collected from both the left and right eyes of 14 school children. Twenty samples are randomly selected as the training set and the remaining 8 are used for testing. As with DRIVE, we crop all samples to 960×960 pixels and randomly select 412,400 patches of 48×48 pixels from the training set, of which 10% are used for validation and the remainder for training.
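The crop-and-pad step and the random patch sampling described above could look roughly like the NumPy sketch below. The helper names `crop_or_pad_to_square` and `sample_patches`, the centered crop and padding, the uniform sampling of patch corners and the reading of 565×584 as width×height are assumptions for illustration; the paper does not specify these details.

```python
import numpy as np


def crop_or_pad_to_square(image, size):
    """Center-crop dimensions larger than `size` and zero-pad smaller ones."""
    out = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    h, w = image.shape[:2]
    ch, cw = min(h, size), min(w, size)
    # Source offsets (crop) and destination offsets (pad), both centered.
    sy, sx = (h - ch) // 2, (w - cw) // 2
    dy, dx = (size - ch) // 2, (size - cw) // 2
    out[dy:dy + ch, dx:dx + cw] = image[sy:sy + ch, sx:sx + cw]
    return out


def sample_patches(image, n_patches, patch_size=48, seed=0):
    """Draw `n_patches` random square patches (top-left corners chosen uniformly)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    ys = rng.integers(0, h - patch_size + 1, size=n_patches)
    xs = rng.integers(0, w - patch_size + 1, size=n_patches)
    return np.stack(
        [image[y:y + patch_size, x:x + patch_size] for y, x in zip(ys, xs)]
    )


# A DRIVE-sized dummy image: 584 rows (height) by 565 columns (width).
image = np.ones((584, 565, 3), dtype=np.float32)
square = crop_or_pad_to_square(image, 576)
patches = sample_patches(square, n_patches=8)
print(square.shape, patches.shape)  # (576, 576, 3) (8, 48, 48, 3)
```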
4.2 Quantitative Analysis
To make a detailed comparison and analysis of model performance, we consider several quantitative metrics: accuracy (AC) (3), sensitivity (SE) (4), specificity (SP) (5), precision (PC) (6), Jaccard similarity (JS) (7) and F1-score (F1) (8), which is also known as the Dice coefficient (DC). The variables involved in these formulas are True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), Ground Truth (GT) and Segmentation Result (SR):
$$AC = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
$$SE = \frac{TP}{TP + FN} \tag{4}$$
$$SP = \frac{TN}{TN + FP} \tag{5}$$
$$PC = \frac{TP}{TP + FP} \tag{6}$$
$$JS = \frac{|GT \cap SR|}{|GT \cup SR|} \tag{7}$$
$$F1 = \frac{2 \cdot PC \cdot SE}{PC + SE} = \frac{2\,|GT \cap SR|}{|GT| + |SR|} \tag{8}$$
In the experiments, we use these metrics to evaluate the proposed approaches against existing ones.
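As an illustration, the metrics listed above can be computed from a pair of binary masks using their standard confusion-matrix definitions. The function name `segmentation_metrics` and the choice to return 0.0 for an empty denominator are assumptions of this sketch, not details from the paper.

```python
import numpy as np


def segmentation_metrics(sr, gt):
    """Compute AC, SE, SP, PC, JS and F1 from binary masks of equal shape."""
    sr = np.asarray(sr, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = int(np.sum(sr & gt))
    tn = int(np.sum(~sr & ~gt))
    fp = int(np.sum(sr & ~gt))
    fn = int(np.sum(~sr & gt))

    def ratio(num, den):
        # Assumption: report 0.0 when a denominator is empty.
        return num / den if den else 0.0

    return {
        "AC": ratio(tp + tn, tp + tn + fp + fn),
        "SE": ratio(tp, tp + fn),
        "SP": ratio(tn, tn + fp),
        "PC": ratio(tp, tp + fp),
        "JS": ratio(tp, tp + fp + fn),  # |GT ∩ SR| / |GT ∪ SR|
        "F1": ratio(2 * tp, 2 * tp + fp + fn),  # equals the Dice coefficient
    }


gt = np.array([[1, 1, 0, 0], [0, 0, 0, 0]])
sr = np.array([[1, 0, 1, 0], [0, 0, 0, 0]])
print(segmentation_metrics(sr, gt))
```

In this toy example TP = 1, FP = 1, FN = 1 and TN = 5, so AC is 0.75 while SE, PC and F1 are all 0.5 and JS is 1/3.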
All three datasets are preprocessed by subtracting the mean and dividing by the standard deviation. We use the Adam optimizer with an initial learning rate of 0.001, which is reduced by a factor of ten if the training loss does not decrease for 10 consecutive epochs. We augment the data using rotation, cropping, flipping, shifting, and changes in contrast, brightness and hue. We set the batch size to 4 for the skin dataset and 32 for DRIVE and CHASE_DB1, whose patches are much smaller. Each model is trained for 50 epochs, and the results are shown in Table 1. Models with MixModule outperform their counterparts without it, and the best result on every metric comes from a MixModule-based model. Some network outputs are shown in Figure 4.
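The learning-rate rule described above can be sketched in plain Python as follows. The class name `PlateauLR` and the exact counting convention (reduce as soon as 10 consecutive epochs show no improvement over the best training loss so far) are assumptions; the paper does not state which scheduler implementation was used. PyTorch's built-in `torch.optim.lr_scheduler.ReduceLROnPlateau` offers similar behaviour, with its own convention for `patience`.

```python
class PlateauLR:
    """Divide the learning rate by ten after `patience` epochs without improvement."""

    def __init__(self, lr=1e-3, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, train_loss):
        """Record one epoch's training loss and return the learning rate to use next."""
        if train_loss < self.best:
            self.best = train_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0  # start counting again after a reduction
        return self.lr


scheduler = PlateauLR()
# One improving epoch followed by ten epochs with a flat training loss.
for epoch, loss in enumerate([1.0] + [1.0] * 10, start=1):
    lr = scheduler.step(loss)
    print(f"epoch {epoch:2d}  loss {loss:.2f}  lr {lr:.0e}")
```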
5 Conclusion
In this paper, we propose a new module named MixModule that combines features of different ranges and can be embedded into different network structures for medical image segmentation. We apply MixModule to U-Net and two of its variants, R2U-Net and Attention U-Net, obtaining MixU-Net, MixR2U-Net and MixAttU-Net. These models are evaluated on three datasets covering skin lesion segmentation and retina blood vessel segmentation. Experimental results show that network models with MixModule perform better than the original ones on all three datasets, which indicates that MixModule has considerable potential for development and application in medical image segmentation.
References
- (2018) Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation. arXiv preprint arXiv:1802.06955.
- (2014) Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062.
- (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848.
- (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
- (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818.
- (2012) Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in neural information processing systems, pp. 2843–2851.
- (2019) Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1902.03368.
- (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
- (2012) Learning hierarchical features for scene labeling. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1915–1929.
- (2012) Blood vessel segmentation methodologies in retinal images: a survey. Computer methods and programs in biomedicine 108 (1), pp. 407–433.
- (2017) Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
- (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125.
- (2014) Microsoft COCO: common objects in context. In European conference on computer vision, pp. 740–755.
- (2018) Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8759–8768.
- (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440.
- (2018) Attention U-Net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999.
- (2017) Automatic differentiation in PyTorch.
- (2014) Recurrent convolutional neural networks for scene labeling. In 31st International Conference on Machine Learning (ICML).
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2004) Ridge-based vessel segmentation in color images of the retina. IEEE transactions on medical imaging 23 (4), pp. 501–509.
- (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9.
- (2018) The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data 5, pp. 180161.