
EM-NET: Centerline-Aware Mitochondria Segmentation in EM Images via Hierarchical View-Ensemble Convolutional Network

by   Zhimin Yuan, et al.
NetEase, Inc

Although deep encoder-decoder networks have achieved astonishing performance for mitochondria segmentation from electron microscopy (EM) images, they still produce coarse segmentations with many discontinuities and false positives. In addition, the need for labor-intensive annotation of large 3D datasets and the huge memory overhead of 3D models are major limitations. To address these problems, we introduce a multi-task network named EM-Net, which includes an auxiliary centerline detection task to account for the shape information of mitochondria represented by their centerlines. The centerline detection sub-network enhances the accuracy and robustness of the segmentation task, especially when only a small set of annotated data is available. To achieve a light-weight 3D network, we introduce a novel hierarchical view-ensemble convolution module that reduces the number of parameters and facilitates multi-view information aggregation. Validation on a public benchmark showed state-of-the-art performance by EM-Net. Even with significantly reduced training data, our method still showed quite promising results.



1 Introduction

Mitochondria segmentation is an essential task for neuroscientists investigating the function of the brain. With the advancement of electron microscopy (EM) imaging technology, massive amounts of unlabeled EM images can now be acquired easily. Nevertheless, manually delineating high-resolution data at this unprecedented scale is time-consuming and tedious, and its reproducibility is limited. Most existing automated segmentation methods, in turn, rely on massive labeled data, which are not available in most circumstances. Consequently, a fully automated mitochondria segmentation algorithm that requires only limited annotated data is urgently needed to help neurologists analyze connectome images.

Because of the irregular shapes and varying sizes of mitochondria and the complex background in EM images, fully automated mitochondria segmentation has proven to be a challenging task. Previous studies on mitochondria segmentation mainly focused on designing hand-crafted features and classical machine learning classifiers. Lucchi et al. [1] extracted ray descriptors (i.e. shape cues) and histogram features (i.e. low-level intensity and texture cues) to segment mitochondria. Building upon that work, Lucchi et al. [2] presented an automated graph partitioning framework that used an approximate sub-gradient descent algorithm to further improve segmentation performance. These methods achieved promising results, but their performance saturates quickly as the size of the data increases, because of the limited representational power of hand-crafted features.

With the advent of deep convolutional neural networks (CNNs), the mitochondria segmentation task has been pushed to a new level. Casser et al. [3] proposed a modified 2D U-Net with an on-the-fly data augmentation pipeline and fewer down-sampling stages to segment mitochondria. Their method, however, did not take full advantage of 3D spatial information. Xiao et al. [4] proposed a fully residual convolutional network that uses 3D convolution kernels to encode 3D spatial context and applies common data augmentation methods, such as flipping and rotation, to generate sufficient training data; it achieved state-of-the-art segmentation performance. Cheng and Varshney [5] introduced CNN-based 2D and 3D algorithms with factorized convolutions and online feature-level augmentations to segment mitochondria when annotated training data are scarce. However, 3D convolutional networks not only incur a huge computation cost but also need massive labeled data. Unfortunately, neither online feature-level augmentation nor common data augmentation is effective enough when only limited annotated data are available.

To address the shortcomings mentioned above, we propose a light-weight multi-task network, named EM-Net, for EM image segmentation. Specifically, we integrate two closely related tasks, i.e. segmentation and centerline detection, into a single network. The objective is to exploit the geometrical information of mitochondria represented by their centerlines to improve the generalization performance and robustness of segmentation, especially when annotated training samples are scarce. Moreover, a novel hierarchical view-ensemble convolution (HVEC) module is introduced to reduce the number of learnable parameters and the computation cost, and to ensemble information from multiple 2D views of a 3D volume. The experiments showed that the proposed approach yielded superior results even with quite limited training data.

2 Method

Our method comprises a main task for semantic 3D segmentation and an auxiliary centerline detection task that accounts for the shape information of mitochondria. We formulate the segmentation task as voxel-wise labeling and the centerline detection task as regression. The ground truth for regression is a proximity score map generated from the centerline annotations of the mitochondria.

The architecture of our proposed EM-Net is shown in Fig. 1. More specifically, EM-Net consists of one shared encoder path and two task-specific decoder paths, where each decoder path accounts for one task. The total loss of our model is a weighted combination of the segmentation loss and the centerline detection loss, where the weighting hyperparameter that trades off the two tasks is set to 0.7 in all experiments.

Rather than using 3D convolutions, we propose a novel hierarchical view-ensemble convolution (HVEC) as the building block for both the encoder and the decoders. The encoder has three down-sampling stages and each decoder has three up-sampling stages. The encoder generates multidimensional feature maps that contain abundant context information. In the decoder, low-resolution feature maps are progressively restored to the input patch size. We replace the deconvolution operation with trilinear interpolation to restore the feature maps without additional parameters.

Following the idea of U-Net [6, 7], we use skip connections to integrate low-level cues from the encoder layers into the corresponding decoder layers. Instead of concatenation as in U-Net, we use summation to achieve long-range residual learning. Concatenation inevitably increases the number of feature channels, which in turn restricts the input patch size.
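The trade-off between summation and concatenation in the skip connections can be illustrated with simple bookkeeping. The sketch below is only an illustration under assumed numbers: the patch size, the channel count, and the helper names `merged_channels` and `feature_map_mib` are hypothetical and do not come from the EM-Net implementation.

```python
def merged_channels(decoder_ch, encoder_ch, mode):
    """Channel count after merging a skip connection into the decoder."""
    if mode == "concat":
        return decoder_ch + encoder_ch
    if mode == "sum":
        if decoder_ch != encoder_ch:
            raise ValueError("summation needs matching channel counts")
        return decoder_ch
    raise ValueError(f"unknown mode: {mode}")


def feature_map_mib(channels, patch, bytes_per_value=4):
    """Memory of one float32 feature map in MiB for a (depth, height, width) patch."""
    depth, height, width = patch
    return channels * depth * height * width * bytes_per_value / 2**20


patch = (32, 256, 256)  # hypothetical input patch size
stage_channels = 32     # hypothetical channel count at full resolution
for mode in ("sum", "concat"):
    ch = merged_channels(stage_channels, stage_channels, mode)
    print(mode, ch, feature_map_mib(ch, patch), "MiB")
```

With these assumed numbers the merged feature map is twice as large under concatenation, which is the memory pressure that limits the patch size.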

Figure 2: One sub-block of the HVEC module. An HVEC module consists of two sub-blocks with a shortcut connection across them for residual learning.

HVEC module. Inspired by the Inception module [8], as shown in Fig. 2, we first partition the input features into 4 groups; each group produces its own output, and the outputs are finally fused by concatenation. Information across the different branches is then integrated with a 1×1×1 convolution. Compared with a standard 3D convolution layer, which attempts to learn filters in all 3 spatial dimensions and 1 channel dimension simultaneously, this factored scheme is more parameter efficient. However, our HVEC module differs from the Inception module in four aspects: 1) instead of conducting 3D convolutions on each feature group, we perform different 2D convolutions (i.e. 1×3×3, 3×1×3, 3×3×1) on the first three groups to encode information from three separable orthogonal views of a 3D volume; 2) on the fourth group of features, 1×3×3 convolutions are performed on down-sampled features to capture large-scale context information on a focal view; 3) to capture multi-scale contexts and multiple fields-of-view, the four branches are convolved serially, and the feature maps produced by the previous branch are also added to the input of the next branch, resulting in hierarchical connections; 4) a whole HVEC module consists of two sub-blocks, and a shortcut connection across the two sub-blocks reformulates the module as learning a residual function over a medium range. In this way, multi-scale and long-range context information, which is critical to the representational strength of the network, can be encoded in a single module with fewer parameters.
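To make the parameter saving concrete, the following back-of-the-envelope count compares a standard 3×3×3 convolution with one HVEC-style sub-block. It is a sketch under stated assumptions, not the authors' implementation: the width `channels = 64` is hypothetical, input and output widths are assumed equal, biases and normalization layers are ignored, and the hierarchical additions and the down-sampling in the fourth branch are assumed to add no weights.

```python
def conv3d_params(c_in, c_out, kernel):
    """Weights of a 3D convolution with kernel (kd, kh, kw); biases ignored."""
    kd, kh, kw = kernel
    return c_in * c_out * kd * kh * kw


def hvec_subblock_params(channels, groups=4):
    """Weights of one HVEC-style sub-block under the assumptions stated above."""
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    g = channels // groups
    # Three orthogonal 2D views plus one 1x3x3 branch on down-sampled features.
    branch_kernels = [(1, 3, 3), (3, 1, 3), (3, 3, 1), (1, 3, 3)]
    branches = sum(conv3d_params(g, g, k) for k in branch_kernels)
    # 1x1x1 convolution that fuses the concatenated branch outputs.
    fusion = conv3d_params(channels, channels, (1, 1, 1))
    return branches + fusion


channels = 64  # hypothetical width
standard = conv3d_params(channels, channels, (3, 3, 3))
hvec = hvec_subblock_params(channels)
print(standard, hvec, round(standard / hvec, 2))
```

Under these assumptions the factored block needs 3.25·C² weights against 27·C² for the full 3D convolution, roughly an eightfold reduction per layer.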

Centerline detection task. Instead of classifying a pixel as centerline or background, we formulate mitochondrial centerline detection as a regression problem [9]. The ground truth for regression is a proximity score map, a distance-transform function that peaks at the mitochondria centerline and is zero on the background. It is defined as an exponential function of the nearest Euclidean distance between a voxel and the mitochondria centerline, with two hyperparameters governing the shape of the exponential. The proximity score map is used to train the centerline detection task. At the end of the detection path, the final feature map is followed by a Sigmoid layer that outputs the predicted proximity score map. The detection path is trained by minimizing the mean squared error between the predicted and ground-truth proximity score maps.
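As an illustration of such a proximity score map, the sketch below uses an exponential form that is common in regression-based detection: the score decays from the centerline and is cut to zero beyond a maximum distance. The exact formula, the parameter names `alpha` and `d_max`, the rescaling to [0, 1], and the 2D toy grid are all assumptions made for this example; the paper works on 3D voxels and its precise definition is not reproduced here.

```python
import math


def proximity_score(distance, alpha=3.0, d_max=5.0):
    """Exponentially decaying score: 1 on the centerline, 0 at or beyond d_max.

    alpha and d_max are assumed names for the two shape hyperparameters.
    """
    if distance >= d_max:
        return 0.0
    raw = math.exp(alpha * (1.0 - distance / d_max)) - 1.0
    return raw / (math.exp(alpha) - 1.0)  # assumed rescaling to [0, 1]


def proximity_map(shape, centerline, alpha=3.0, d_max=5.0):
    """Score map for a 2D grid given centerline pixel coordinates (row, col)."""
    rows, cols = shape
    score_map = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Nearest Euclidean distance from this pixel to the centerline.
            nearest = min(math.hypot(r - cr, c - cc) for cr, cc in centerline)
            row.append(proximity_score(nearest, alpha, d_max))
        score_map.append(row)
    return score_map


centerline = [(2, c) for c in range(1, 6)]  # toy horizontal centerline
scores = proximity_map((5, 7), centerline, alpha=3.0, d_max=2.0)
for row in scores:
    print(" ".join(f"{v:.2f}" for v in row))
```

A smooth target of this kind gives the regression branch a graded signal around the centerline instead of a one-voxel-wide positive class.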

Note, however, that the centerline detection path has fewer feature channels than the segmentation path. There are two reasons for this design: a) there are few mitochondria in an EM volume and few positive voxels in the proximity score map, so deepening the network may result in over-fitting; b) enlarging the number of feature channels consumes too much GPU memory, which in turn reduces the feasible input size. Therefore, a centerline detection path with fewer feature channels is preferred.

Segmentation task. Compared with the centerline detection path, the segmentation path has an additional HVEC module, as shown in Fig. 1. Moreover, the feature maps produced by the last HVEC module in the detection path are concatenated to the segmentation path. The goal is to take full advantage of the context information contained in the detection path.

We use a Jaccard-based loss function, which is insensitive to the severe class imbalance in EM data. It is computed from the predicted probability map at each voxel and the corresponding ground truth with one-hot coded labels. A small constant is added to prevent division by zero.
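A minimal sketch of a soft Jaccard loss is given below, assuming the widely used formulation of one minus the ratio of soft intersection to soft union for the foreground class. The function name `soft_jaccard_loss`, the value `eps = 1e-6`, and the restriction to a single foreground channel on flattened inputs are assumptions for illustration and may differ from the exact loss used in the paper.

```python
def soft_jaccard_loss(pred, target, eps=1e-6):
    """1 - soft IoU between predicted probabilities and binary labels.

    pred:   flattened foreground probabilities in [0, 1]
    target: flattened binary ground-truth labels (0 or 1)
    eps:    small constant that prevents division by zero (assumed value)
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - intersection
    return 1.0 - (intersection + eps) / (union + eps)


# One foreground voxel among 1000: predicting all background is 99.9%
# accurate voxel-wise, yet the Jaccard loss stays close to its maximum.
target = [1] + [0] * 999
all_background = [0.0] * 1000
perfect = [1.0] + [0.0] * 999
print(soft_jaccard_loss(all_background, target))
print(soft_jaccard_loss(perfect, target))
```

Because the loss is normalized by the union of prediction and ground truth rather than by the total number of voxels, the large background class does not dominate it.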

Figure 3: The visual comparison of segmentation results.

Methods              DSC (%)   JAC (%)
Lucchi [2]           86.0      75.5
Cetina [10]          86.4      76.0
2D U-Net [6]         91.5      84.4
Cheng (2D) [5]       92.8      86.5
3D U-Net [7]         93.5      87.8
Cheng (3D) [5]       94.1      88.9
Xiao [4]             94.7      90.0
Ours (single task)   94.1      88.8
Ours (multi-task)    94.6      89.7
Table 1: Comparison of different methods for mitochondria segmentation on the FIB-SEM dataset.

3 Results

We evaluate our proposed method on a public benchmark that consists of two stacks [1]. Each stack contains 165 slices of 768 × 1024 EM images; the two stacks are used for training and testing, respectively. The images were acquired by focused ion beam scanning electron microscopy (FIB-SEM). This is the most commonly used benchmark for validating mitochondria segmentation approaches.

We implement our model using PyTorch on a workstation with an NVIDIA 1080Ti GPU. The model is optimized with the Adam optimizer. The learning rate starts from 0.0001 and follows a step-wise decay scheme, where the step and decay rate are set to 15 and 0.9, respectively.
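The step-wise decay described above can be written in a few lines. This sketch assumes the step of 15 is counted in epochs, which the text does not state explicitly; the function name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(epoch, base_lr=1e-4, step=15, decay=0.9):
    """Learning rate after `epoch` epochs, multiplied by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


for epoch in (0, 14, 15, 30, 45):
    print(epoch, step_decay_lr(epoch))
```

In PyTorch the same behaviour is provided by `torch.optim.lr_scheduler.StepLR` with `step_size=15` and `gamma=0.9`, when the scheduler is stepped once per epoch.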

As shown in Table 1, we compare our method against state-of-the-art approaches, including both traditional machine learning and deep learning based methods. Segmentation performance is measured by the Jaccard index (JAC) and the Dice similarity coefficient (DSC). In general, the deep learning based algorithms significantly outperform the traditional methods. Moreover, our method reaches 94.6% in Dice and 89.7% in Jaccard index, which is superior to most other approaches. Fig. 3 shows one segmentation result obtained by different methods.
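For reference, the two evaluation metrics can be computed from a pair of binary masks as sketched below. The helper `dice_and_jaccard`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not the evaluation code behind Table 1.

```python
def dice_and_jaccard(pred, target):
    """Return (Dice, Jaccard) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    pred_pos = sum(1 for p in pred if p)
    target_pos = sum(1 for t in target if t)
    union = pred_pos + target_pos - tp
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement
    dice = 2.0 * tp / (pred_pos + target_pos)
    jaccard = tp / union
    return dice, jaccard


pred = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 0, 1, 1]
dice, jaccard = dice_and_jaccard(pred, target)
print(dice, jaccard)
```

For a single pair of masks the two scores are linked by J = D / (2 − D); a Dice of 0.946, for example, corresponds to a Jaccard index of about 0.898, close to the values reported for our multi-task model.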

As shown in Table 1, both our method and 3D U-Net outperform the 2D networks, which further confirms the importance of inter-slice context information. Our model without the centerline detection path (single task) also performs better than 3D U-Net, and adding the auxiliary detection decoder improves it further. These results validate the effectiveness of the HVEC module and show that the auxiliary detection task helps improve the robustness and generalization performance of the segmentation task. Because GPU memory is limited, the input size of a 3D segmentation model cannot be too large. With the help of the HVEC module, our model has significantly fewer parameters than 2D U-Net (3.2M vs. 31M) and 3D U-Net (3.2M vs. 19M). Note that we follow the settings in [6, 7] and use four down-sampling stages for 2D U-Net and three down-sampling stages for 3D U-Net. These ablation results indicate that both the HVEC module and the multi-task learning strategy are effective in improving performance.

We further investigate the segmentation performance of our method when annotated data are limited. As shown in Fig. 4, the performance of all methods degrades, as expected, as the size of the annotated training set is progressively decreased. However, EM-Net is more robust to the reduction in training data and consistently scores higher than the baseline methods. Note that when only 30% of the training data is used, the performance of the baseline methods drops sharply to below 80%. In contrast, our method still achieves sound performance (86.7% in JAC), which confirms that EM-Net is an effective solution for mitochondria segmentation even when annotated training data are scarce.

Figure 4: The comparative mitochondria segmentation performance on various fractions of training samples.

4 Conclusion

In this paper, we proposed EM-Net, a novel network based on a multi-task learning strategy for mitochondria segmentation with either sufficient or scarce annotated training data. Specifically, we joined the segmentation and centerline detection tasks in a single network, which is more robust and outperformed our baseline methods by a large margin. Furthermore, we proposed a hierarchical view-ensemble convolution module that encodes multi-scale, long-range context cues with fewer learnable parameters and low computational complexity, and that can be easily applied to any 3D convolutional network. Results on a public benchmark showed that our method achieved state-of-the-art segmentation performance.


  • [1] A. Lucchi, K. Smith, R. Achanta, et al, “Supervoxel-based Segmentation of Mitochondria in EM Image Stacks with Learned Shape Features,” IEEE Transactions on Medical Imaging, vol. 31, no. 2, pp. 474-486, 2012.
  • [2] A. Lucchi, Y. Li, and P. Fua, “Learning for Structured Prediction using Approximate Subgradient Descent with Working Sets,” in Computer Vision and Pattern Recognition, pp. 1987-1994, 2013.
  • [3] V. Casser, K. Kang, H. Pfister, et al, “Fast Mitochondria Segmentation for Connectomics,” arXiv:1812.06024, 2018.
  • [4] C. Xiao, X. Chen, W. Li, et al, “Automatic Mitochondria Segmentation for EM Data Using a 3D Supervised Convolutional Network,” Frontiers in Neuroanatomy, vol. 12, pp. 92, 2018.
  • [5] H. Cheng and A. Varshney, “Volume Segmentation using Convolutional Neural Networks with Limited Training Data,” in International Conference on Image Processing, pp. 590-594, 2017.
  • [6] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention, pp. 234-241, 2015.
  • [7] O. Cicek, A. Abdulkadir, S.S. Lienkamp, et al, “3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,” in Medical Image Computing and Computer-Assisted Intervention, pp. 424-432, 2016.
  • [8] F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” in Computer Vision and Pattern Recognition, pp. 1251-1258, 2017.
  • [9] P. Kainz, M. Urschler, S. Schulter, et al, “You Should use Regression to Detect Cells,” in Medical Image Computing and Computer-Assisted Intervention, pp. 276-283, 2015.
  • [10] K. Cetina, J. M. Buenaposada, and L. Baumela, “Multi-class Segmentation of Neuronal Structures in Electron Microscopy Images,” BMC Bioinformatics, vol. 19, no. 1, pp. 298, 2018.