Multi-Level Contextual Network for Biomedical Image Segmentation

by   Amirhossein Dadashzadeh, et al.

Accurate and reliable image segmentation is an essential part of biomedical image analysis. In this paper, we consider the problem of biomedical image segmentation using deep convolutional neural networks. We propose a new end-to-end network architecture that effectively integrates local and global contextual patterns of histologic primitives to obtain a more reliable segmentation result. Specifically, we introduce a deep fully convolution residual network with a new skip connection strategy to control the contextual information passed forward. Moreover, our trained model is also computationally inexpensive due to its small number of network parameters. We evaluate our method on two public datasets for epithelium segmentation and tubule segmentation tasks. Our experimental results show that the proposed method provides a fast and effective way of producing a pixel-wise dense prediction of biomedical images.



There are no comments yet.


page 1

page 5


Label Refinement Network for Coarse-to-Fine Semantic Segmentation

We consider the problem of semantic image segmentation using deep convol...

Task Decomposition and Synchronization for Semantic Biomedical Image Segmentation

Semantic segmentation is essentially important to biomedical image analy...

Image Segmentation to Distinguish Between Overlapping Human Chromosomes

In medicine, visualizing chromosomes is important for medical diagnostic...

Non-Local Context Encoder: Robust Biomedical Image Segmentation against Adversarial Attacks

Recent progress in biomedical image segmentation based on deep convoluti...

FANet: A Feedback Attention Network for Improved Biomedical Image Segmentation

With the increase in available large clinical and experimental datasets,...

Sparse Coding Driven Deep Decision Tree Ensembles for Nuclear Segmentation in Digital Pathology Images

In this paper, we propose an easily trained yet powerful representation ...

Convolution Neural Networks for Semantic Segmentation: Application to Small Datasets of Biomedical Images

This thesis studies how the segmentation results, produced by convolutio...

Code Repositories


Multi-Level Contextual Network for Biomedical Image Segmentation

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Figure 1: Illustration of different contextual information. In this case, the short-range context focuses more on the local information of pixel P and could not capture useful features. While the long-range context is responsible for global region consistency. To aggregate rich contexts, this method combines contextual information from different scales.

Analysis and interpretation of stained tissue on microscope slide images is one of the main tools for diagnosis and grading of cancer, which is done manually by a pathologist. With the progress of machine learning and computer vision techniques, computer-assisted diagnosis (CAD) has been widely used to build tools for the accurate and automatic analysis of these complex and informative image data.

Recently, deep convolutional neural networks (CNNs), have been successfully used to a wide variety of visual recognition tasks such as image classification [1, 2], and image segmentation [3, 8]

. Unlike the traditional approach of hand-crafted feature extraction methods

[21, 22, 23]

, CNNs learn useful features directly by the composition of multiple linear and non-linear transformations of the training data. Hence, CNN-based methods have gained great popularity for medical image analysis and in particular on biomedical image segmentation tasks

[14, 24, 25]. In this way, one of the earliest paper in biomedical image segmentation based on CNN was published by Ciresan et al. [4]. They trained a CNN using the sliding-window technique to predict the class label of each pixel. The main drawback of this model is that the network could lead to storage overhead and is quite slow if we process a high-resolution image. Some studies also used patch-based classification approaches [5, 6]. For instance, Janowczyk et al. [6] utilized AlexNet architecture [2] for patch-based segmentation for histologic primitives (e.g., nuclei, mitosis, tubules, epithelium, etc.). However, most recent papers now are based on the Fully Convolution Network (FCN) architecture [3]. The FCN-like networks remove fully-connected layers to create an efficient end-to-end learning algorithm without extracting the region proposals. These kinds of networks significantly reduce the redundant computations in the original CNN, which can be an advantage in contrast to previous approaches.

U-net [7], an FCN based network, is the most well-known architecture in biomedical image segmentation. The architecture of the U-net consists of the contracting encoder part to capture context and a successive expanding decoder part to enable precise localization. Segnet [8] is another popular network proposed for natural image semantic segmentation. Segnet has a similar architecture to U-net but with some differences. This model reuses the pooling indices from the encoder and learns extra convolutional layers, while U-net adds skip connections from the encoder features to the corresponding decoder activations. These encoder-decoder based architectures directly concatenate feature maps from earlier layers to recover image detail. However, this strategy is not able to model long-range context, which can be crucial for tasks such as segmentation of tubules in histology images (see Fig. 1). Moreover, these models perform complicate decoder module which usually needs substantial computing resources. To address this problem, in this paper we propose a deep multi-level contextual network, a CNN-based architecture that effectively integrates multi-level long-range (global) and short-range (local) contextual information to achieve a more reliable segmentation result without causing too much computation cost.

In summary, there are three main contributions in our paper which are as follows:

We take advantage of residual learning technique, proposed by [9], to build a deeper FCN-like network with fewer parameters and powerful representational ability.

We propose a new skip connection strategy using pyramid dilated convolution (PDC) to encode multi-scale and multi-level contextual information passed forward.

We perform extensive experiments on two digital pathology tasks including epithelium and tubule segmentation [6] to demonstrate the effectiveness of the proposed model.

The rest of this paper is organized in the following manner. In Section 2 we explain the methodology of our model. The experimental results are discussed in Section 3 and finally the conclusion is given in Section 4.

2 Method

In this section, we start by introducing our fully convolutional residual neural network, and the pyramid dilated convolution (PDC) module. Then, we describe the overall framework of our deep multi-level contextual network for accurate biomedical image segmentation.

2.1 Fully Convolutional Residual Network

Recently, lots of works demonstrated that network depth is of crucial importance in neural network architectures [11]. However, it is difficult to train deeper networks due to problems such as the risk of overfitting, and difficulty in optimization. Also, the gradient would vanish more easily in such networks. To overcome these problems, He et al. [9] proposed residual learning technique to ease the training of networks and enables them to be substantially deeper. This technique allows gradients to flow across multiple layers during the training process by using an identity mappings as the skip connections. In this work, we take advantage of the residual learning technique to make a deep encoder network with 46 convolution layers. For this purpose, as shown in Fig. 4, we consider 5 residual groups that each group contains 3 residual units which are stacked together. The structure of these units is shown in Fig. 2. Each unit can be defined as a general form:

Figure 2: The structure of residual unit with bottleneck used in the proposed method.
Figure 3: Pyramid Dilated Convolution (PDC) module structure.

Here and represent the input and output of the unit, and is the residual function. As it is clear in Fig. 2, we use a bottleneck architecture for each the residual unit, using two convolution layers. This architecture is responsible for reducing the dimensionality and then restoring it [9]. Also, the downsampling operation is performed by the first

convolution with a stride of 2.

2.2 Pyramid Dilated Convolution

Contextual information has been shown to be extremely important for semantic segmentaton tasks [12, 13]. In a CNN, the size of the receptive field can roughly determine how much we need to capture context. Although the wider receptive filed allows us to gather more context, Zhou et al. [15] showed that the actual size of the receptive fields in a CNN is much smaller than the theoretical size, especially on high-level layers. To solve this problem, Fisher et al. [16] propose dilated convolution which can exponentially enlarge receptive field to capture long-range context without losing spatial resolution and increasing the number of parameters.

In Fig. 1 we show the importance of long-range context in biomedical image segmentation. As it is evidence, the local features only collect information around pixel P (Fig. 1(b)). However, capturing more context with a long-range receptive field will bring stronger features which can help to eliminate ambiguity and improve the classification performance.

In this work, we combine different scales of contextual information using a module called pyramid dilated convolution (PDC). This kind of module has been employed successfully in state-of-the-art DeepLabv3 [10] for semantic segmentation. The PDC module used in this paper consists of four parallel dilated convolutions with different dilated rates. Also, for the first layer of this module, we apply a convolution layer with stride to control the resolution of the input feature maps. The refined feature maps of different dilated convolutions are finally concatenated together. Furthermore, The output size of the final feature maps from PDC is 1/16 of the input image. The final module structure is shown in Fig. 3.

Figure 4: Overview of our network framework. A fully convolutional residual network with one convolution layer and five residual groups is used to compute low and high-level features from the input image. The extracted feature maps () are then fed to the PDC module to generate multi-level and multi-scale contextual information. Finally, after a concatenation unit, we applied convolution with a standard 16x bilinear upsampling to build an end-to-end network for pixel-wise dense prediction.

2.3 Overall Framework

With fully convolutional residual network and proposed pyramid dilated convolution (PDC) module, we introduce the overall framework of our network as Fig 4. In this way, we follow the prior works based on an encoder-decoder architecture [3, 7, 8] which applied skip connections in the decoder layers to recover the finer information lost in the downsampling layers.

Different from previous works that directly link an encoder layer to a decoder layer, we utilize a PDC module to control the information being passed via the skip connections. This strategy helps to aggregate multi-scale contextual features from multiple levels. As illustrated in Fig. 4 in the encoder network, for an input image the 5 feature maps () produce by the 5 residual groups. These feature maps except feed to the PDC module to generate multi-scale context maps. Then a fusion unit is used to concatenate these context maps extracted from hierarchical layers.

Since runtime is of concern, we consider a simple decoder architecture with only one convolution and one upsampling layer to produce the final segmentation result using sigmoid based classification.

3 Experimental

In this section, we first briefly introduce the datasets (Sec. 3.1). Then we explain the details for training the network (Sec. 3.2). Finally, we present experimental results for both tasks of tubule and epithelium segmentation (Sec. 3.3).

3.1 Dataset

To show the effectiveness of the proposed method, we carry out comprehensive experiments on two publicly available datasets [6].

The first dataset is the colorectal cancer histopathology image dataset which consists of 85 images of size scanned at 40x and contains a total of 795 delineated tubules.

The second dataset is the estrogen receptor positive (ER+) breast cancer (BCa) histopathology image dataset which consists of 42 images of size scanned at 20x and contains a total of 1735 epithelium regions. To make better use of this dataset, we split each image into four non-overlapping image tiles of .

We use the above datasets for the tasks of tubule segmentation and epithelium segmentation respectively. The tubules and epithelium regions are manually annotated across all images by an expert pathologist and utilized to generate a binary segmentation mask.

We perform all our experiments using 5-fold cross-validation. In each fold, we consider 80% images for training and 20% images for testing. Moreover, to improve generalization and prevent overfitting, we apply the strategy of data augmentation to generate more training data. The augmentation transformations include rescaling and flipping horizontally and vertically.

3.2 Training Network

The Keras open-source deep learning library


with the backend of tensorflow

[18] is used to implement the proposed method.

In the training phase the cross-entropy loss is applied as the cost functions:


where and corresponding to prediction and target. We performed optimization via Adam optimization algorithm [19]. The learning rate is initialized as 0.001, and and . We also applied dropout [2] (with

) before the sigmoid classification layer. The proposed framework has been trained using an Nvidia GTX 1080TI GPU with 3.00GHz 4-core CPU and 16GB RAM. The training phase takes around 2 hours for each fold. The maximum number of epochs were fixed at 200, and the model with the best validation loss was selected and evaluated. Our trained model takes roughly 35ms for a

input RGB image.

Figure 5: Visual comparison on sub-images taken from the test images, (from top to bottom): input image, ground truth, proposed model, baseline model, Segnet [8] and U-net [7]. Green pixels show true positive, red false positive and blue false negative. Note that we perfomed data agumentation for all of these models.

3.3 Results and Comparisons

To carry out a quantitative comparison of accuracy of the proposed method, we measure F1-score which is the harmonic mean of the Precision and Recall

[20] with a value in the range of , defined bellow:


where tp is true positive, fp denotes false positive and fn denotes false negative.

Method F1-score Std.Dev
Original Paper [6]
U-net [7]
Segnet [8]
Baseline+DA+PDC 0.9066
Table 1: Quantitative comparison of epithelium segmentation performance on ER+BCa dataset. ‘DA’ refers to data augmentation we performed.
Method F1-score Std.Dev
Original Paper [6]
U-net [7]
Segnet [8]
Baseline+DA+PDC 0.8950
Table 2: Quantitative comparison of tubule segmentation performance on colorectal cancer dataset. ‘DA’ refers to data augmentation we performed.

The evaluation results are obtained for the both tasks of tubule segmentation and epithelium segmentation with the same settings. Our baseline network is adapted from a fully convolutional residual network with 46 convolution layers which described in Section 2.1. To illustrate the performance of our multi-level contextual network, we compared the proposed method with our baseline network. To evaluate the baseline, we applied a convolution with 16x upsampling layer after the last residual group. We also compared the performance of our method with three state-of-the-art methods including 1) a patch-based segmentation method proposed in the original paper [6] and, 2) the original U-net which won several grand challenges recently and, 3) the original SegNet architecture which is a popular network for natural image semantic segmentation. The quantitative segmentation results are reported in Table 1 and Table 2. All our results except the first row are trained with data augmentation technique.

It is clear that the proposed method achieved the best performance on both of the tasks and outperformed all the other methods, which demonstrates the significance of the multi-level contextual features. Note that we did not do any post-processing of the resulting segmentation.

A visualization of the results of the proposed method and comparison with the other methods is shown in Fig. 5, As shown in the samples, our segmentation results have fewer false negatives with higher accuracy. In addition, our method is not only more accurate but also is faster due to its fewer parameters as Table 3. Fig. 6 shows the overall segmentation results with better visualization on both datasets.

Model Parameters (Million) Model size (MB)
U-net [7] 7.76 93.3
Segnet [8] 29.45 353.7
Proposed 3.4 41.8
Table 3: Comparison of model size and number of parameters for various deep models.

4 Conclusion

In this paper, we address the problem of biomedical image segmentation using a novel end-to-end deep learning framework. Since the contextual information is crucial to obtain good segmentation results, we have presented a deep multi-level contextual network that effectively integrates multi-level contextual features to accurately segment histologic primitives (e.g., nuclei and tubules) from histology images without causing too much computation cost. In this way, we used a deep fully convolution residual network as the encoder network. Then we applied a new skip connection strategy using pyramid dilated convolution modules to modulate multi-scale contextual information passed forward. We have evaluated our method on two public available datasets for epithelium segmentation and tubule segmentation tasks. Our experimental results show that our proposed model is superior to several state-of-the-art methods.

Figure 6: Segmentation results (Mean Std.Dev) of different methods on both datasets. ‘BN’ denotes our baseline network and ‘DA’ refers to data augmentation we performed.


  • [1]

    Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., … & Chen, T. (2018). Recent advances in convolutional neural networks. Pattern Recognition, 77, 354-377.

  • [2]

    Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

  • [3] Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).
  • [4]

    Ciresan, D., Giusti, A., Gambardella, L. M., & Schmidhuber, J. (2012). Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in neural information processing systems (pp. 2843-2851).

  • [5] Pan, X., Li, L., Yang, H., Liu, Z., Yang, J., Zhao, L., & Fan, Y. (2017). Accurate segmentation of nuclei in pathological images via sparse reconstruction and deep convolutional networks. Neurocomputing, 229, 88-99.
  • [6] Janowczyk, A., & Madabhushi, A. (2016). Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of pathology informatics, 7.
  • [7] Ronneberger, O., Fischer, P., & Brox, T. (2015, October). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234-241). Springer, Cham.
  • [8] Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Transactions on Pattern Analysis & Machine Intelligence, (12), 2481-2495.
  • [9] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
  • [10] Chen, L. C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
  • [11] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
  • [12] Galleguillos, C., & Belongie, S. (2010). Context based object categorization: A critical survey. Computer vision and image understanding, 114(6), 712-722.
  • [13] Liu, Z., Li, X., Luo, P., Loy, C. C., & Tang, X. (2015). Semantic image segmentation via deep parsing network. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1377-1385).
  • [14] Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., … & Sánchez, C. I. (2017). A survey on deep learning in medical image analysis. Medical image analysis, 42, 60-88.
  • [15] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2014). Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856.
  • [16] Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
  • [17]

    Keras: Deep learning library for theano and tensorflow,2015.
  • [18]
  • [19] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [20] Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.
  • [21] Jacobs, J. G., Panagiotaki, E., & Alexander, D. C. (2014, September). Gleason grading of prostate tumours with max-margin conditional random fields. In International Workshop on Machine Learning in Medical Imaging (pp. 85-92). Springer, Cham.
  • [22] Sirinukunwattana, K., Snead, D. R., & Rajpoot, N. M. (2015, March). A novel texture descriptor for detection of glandular structures in colon histology images. In Medical Imaging 2015: Digital Pathology (Vol. 9420, p. 94200S). International Society for Optics and Photonics.
  • [23] Nguyen, K., Sarkar, A., & Jain, A. K. (2012, October). Structure and context in prostatic gland segmentation and classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 115-123). Springer, Berlin, Heidelberg.
  • [24] Roth, H. R., Lu, L., Farag, A., Shin, H. C., Liu, J., Turkbey, E. B., & Summers, R. M. (2015, October). Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation. In International conference on medical image computing and computer-assisted intervention (pp. 556-564). Springer, Cham.
  • [25] Dhungel, N., Carneiro, G., & Bradley, A. P. (2015, October). Deep learning and structured prediction for the segmentation of mass in mammograms. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 605-612). Springer, Cham.