Contrastive Learning for Mitochondria Segmentation

by Zhili Li, et al.

Mitochondria segmentation in electron microscopy images is essential in neuroscience. However, due to image degradation during the imaging process, the large variety of mitochondrial structures, and the presence of noise, artifacts and other sub-cellular structures, mitochondria segmentation is very challenging. In this paper, we propose a novel and effective contrastive learning framework that learns a better feature representation from hard examples to improve segmentation. Specifically, we adopt a point sampling strategy to pick out representative pixels from hard examples in the training phase. Based on these sampled pixels, we introduce a pixel-wise label-based contrastive loss which consists of a similarity loss term and a consistency loss term. The similarity term increases the similarity of pixels from the same class and the separability of pixels from different classes in feature space, while the consistency term enhances the sensitivity of the 3D model to changes in image content from frame to frame. We demonstrate the effectiveness of our method on the MitoEM dataset as well as the FIB-SEM dataset, and show results that are better than or on par with the state of the art.




I Introduction

Known as the powerhouses of cells, mitochondria play a crucial role in the regulation of cellular life and death, because they carry out many important cellular functions by producing the overwhelming majority of cellular adenosine triphosphate (ATP). Many clinical studies have revealed the correlation between the biological function of mitochondria and their size and geometry [1, 2]. In the last few years, with the development of imaging technology, electron microscopy (EM) images have been widely used for the analysis of nanometer-level structures including mitochondria. Knott et al. [3] deliver a dataset of adult rodent brain tissue acquired with focused ion beam scanning electron microscopy (FIB-SEM) to show axons, dendrites and their synaptic contacts. More recently, Wei et al. [4] publish a large-scale EM dataset (MitoEM) for 3D mitochondria instance segmentation. The MitoEM dataset is larger than the FIB-SEM dataset [3] and contains more mitochondria with sophisticated shapes, as well as ultra-structures that are similar in appearance to mitochondria. Therefore, this dataset poses new challenges to the segmentation of mitochondria.

Many earlier approaches combine traditional image processing and machine learning techniques for mitochondria segmentation. Jorstad and Fua [5] propose an active surface-based method to refine the boundary surfaces by exploiting the thick and dark membranes of mitochondria. Lucchi et al. [6] utilize an approximate subgradient descent algorithm to minimize the hinge loss in margin-sensitive frameworks. However, these methods rely on hand-crafted features to build the classifiers, which limits their efficiency and generalizability. Recently, deep learning-based approaches [7, 8, 9, 10, 11, 12] show promising results in mitochondria segmentation. Liu et al. [7] adopt a modified Mask R-CNN for mitochondria instance segmentation on the FIB-SEM dataset [3]. Oztel et al. [8] propose a fully convolutional network (FCN) to segment mitochondria and then use various post-processing steps, such as 2D spurious detection filtering, boundary refinement and 3D filtering, to improve the segmentation. Xiao et al. [9] propose an effective 3D residual FCN that averts the vanishing gradient problem for mitochondria segmentation. Most of the methods listed above improve the segmentation by carefully designing the network architecture or by adding various post-processing algorithms. However, these methods rarely pay extra attention to hard examples, such as noise, imaging artifacts and sub-cellular ultra-structures, which are widespread in EM images. As a result, these approaches can scarcely segment the hard examples correctly, which prevents further improvement in mitochondria segmentation accuracy.

In the last few years, contrastive learning has been widely applied in computer vision tasks [13, 14, 15, 16, 17]. Contrastive learning trains a deep learning model by formulating the task of telling similar things from dissimilar ones. The works in [13, 14] adopt contrastive learning for unsupervised visual representation learning. For image classification, Khosla et al. [15] extend the self-supervised batch contrastive approach to the fully-supervised setting, which effectively leverages label information. More recently, Zhao et al. [16] propose to first pretrain the CNN feature extractor using a label-based contrastive loss for the semantic segmentation task. Li et al. [17] adopt point-wise contrastive learning to improve boundary estimation for ultrasound image segmentation. Inspired by [17], we extend contrastive learning to handle hard examples in mitochondria segmentation, based on the intra-class similarity of mitochondria textures and the inter-class discrepancy of textures between mitochondria and hard examples.

In this work, we propose a contrastive learning-based framework for the mitochondria segmentation task. In the training phase, we first sample representative points (i.e., pixels) from hard examples. Then we adopt a contrastive loss, which is computed in the feature space of these points, to facilitate model training. Our contrastive loss consists of two terms, a similarity term and a consistency term. The similarity loss term is designed to increase the feature similarity of points from the same class and decrease that of points from different classes. The consistency loss term focuses on the feature similarity of points in adjacent frames to enhance the robustness of the model to changes in image content. We conduct experiments on both the MitoEM dataset and the FIB-SEM dataset to verify the effectiveness of the proposed method.

Fig. 1: The pipeline of our proposed framework.

II Our Method

Our method consists of three components: a backbone network in a symmetric encoder-decoder structure, a point selection block to sample representative points, and a contrastive learning head to compute the contrastive loss during the training stage, as shown in Fig. 1.

II-A Network Architecture

We build a semantic segmentation backbone network based on the 3D U-Net [18]. Following [19], we generate a mask probability map and a boundary map from the backbone segmentation network. At the end of our framework, a marker-controlled watershed post-processing algorithm [20] is applied to isolate each instance from the semantic segmentation map. In this paper, we denote our backbone network by U3D-BC following [4].

Training the segmentation network relies heavily on labeled data, which is expected to be diverse and representative. For each input image, we have its dense mask label as well as its boundary label, which can be generated from the mask label in the same way as [4]. In previous methods, a common way to train the segmentation network is to minimize the overall cross-entropy loss, which is computed over every pixel of the image from the predicted probability map of mitochondria masks and, similarly, from the predicted probability map of boundaries.
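The cross-entropy objective described above can be sketched in a few lines. This is only an illustration under stated assumptions: the per-pixel terms are taken to be binary cross-entropy, the mask and boundary terms are weighted equally, and the sum is averaged over pixels. The function names and the flat-list layout are hypothetical and not taken from the paper.

```python
import math

def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy of one predicted probability against a 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def overall_cross_entropy(mask_prob, mask_label, boundary_prob, boundary_label):
    """Average, over all pixels, the mask term plus the boundary term.

    Every argument is a flat list with one entry per pixel.
    Equal weighting and averaging are assumptions of this sketch.
    """
    total = 0.0
    for pm, ym, pb, yb in zip(mask_prob, mask_label, boundary_prob, boundary_label):
        total += binary_cross_entropy(pm, ym) + binary_cross_entropy(pb, yb)
    return total / len(mask_prob)
```

Note that every pixel contributes in the same way here, which is exactly the property the following paragraph identifies as a limitation for hard examples.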

The cross-entropy loss penalizes the predicted probability of all pixels without distinction. Moreover, there are usually far more easy examples than hard examples in EM images. Therefore, it is challenging for existing methods to learn good representations of hard examples, which makes the predictions for them inaccurate. To solve this problem, we propose to first select representative points from both hard examples and easy examples, and then to additionally penalize the predictions of hard points in a contrastive manner. In this way, hard examples receive extra attention, and consequently their segmentation results may improve.

II-B Point Selection Block

The point selection block is designed to sample representative points during the training phase. We sample two types of points from the input image. The first type is uncertain points, which are sampled from hard examples, and the second type is certain points, which have high prediction confidence in the mask probability map. Given a fixed number of points to sample, the specific process of point selection is as follows. Based on the mask probability map, we sample a representative point set, which consists of uncertain points and certain points. The uncertain points are sampled based on the prediction error, which is the absolute value of the difference between the predicted probability and the ground truth label. The points with the highest prediction error make up the uncertain point set. Likewise, we choose the points with the highest mask probability to constitute the certain point set.
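The selection rule described above can be illustrated with a small sketch. It assumes a flat list of per-pixel mask probabilities and 0/1 labels, ranks all pixels deterministically rather than drawing a random candidate set first, and keeps the two sets disjoint; these choices, together with the names `select_points`, `num_points` and `uncertain_ratio`, are assumptions made for illustration.

```python
def select_points(mask_prob, mask_label, num_points, uncertain_ratio):
    """Return (uncertain, certain) lists of pixel indices.

    uncertain: pixels with the largest |probability - label|.
    certain:   remaining pixels with the largest mask probability.
    """
    num_uncertain = int(num_points * uncertain_ratio)
    num_certain = num_points - num_uncertain
    indices = range(len(mask_prob))

    # Rank by prediction error, largest first.
    by_error = sorted(indices,
                      key=lambda i: abs(mask_prob[i] - mask_label[i]),
                      reverse=True)
    uncertain = by_error[:num_uncertain]

    # Rank the remaining pixels by mask probability, largest first.
    taken = set(uncertain)
    by_prob = sorted((i for i in indices if i not in taken),
                     key=lambda i: mask_prob[i],
                     reverse=True)
    certain = by_prob[:num_certain]
    return uncertain, certain
```

For example, with probabilities `[0.9, 0.1, 0.6, 0.95, 0.4]`, labels `[1, 1, 0, 1, 0]`, four points and a ratio of 0.5, the two pixels with the largest errors (indices 1 and 2) become uncertain points, and the two most confident remaining pixels (indices 3 and 0) become certain points.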

II-C Contrastive Learning

With representative points, we adopt contrastive learning to mine hard examples in the training phase. Contrastive learning is mostly applied in feature space to facilitate model training. The point features are derived from the backbone network. In the backbone segmentation network, we extract a mask probability map and a feature map from the last two layers. We use trilinear interpolation to upsample the feature map to the spatial size of the probability map and concatenate the two into a hybrid feature map. The feature vectors of representative points can then be extracted from the hybrid feature map according to their point index.

With the feature vectors of representative points, we compute a contrastive loss to “weight” hard points. Our contrastive loss consists of two terms: a similarity loss term and a consistency loss term. The similarity term increases the feature similarity of points from the same class, while reducing that of points from different classes, which can aid the segmentation of hard examples. Complementary to the similarity term, the consistency term is designed to enhance the feature similarity between points of the same class at the same position in adjacent frames, and contrastively decrease the similarity of those from different classes. As a result, the inter-frame continuity of segmentation predictions is improved. Both loss terms contribute to a more discriminative and robust model.

The similarity loss term is computed as follows. First, we split the sampled point set into three sets, a certain foreground set, an uncertain foreground set and a background set, according to the predicted probability and class label of each point. Every point labeled as foreground in the certain point set is assigned to the certain foreground set. A similar rule is applied to the uncertain point set: each of its points that belongs to the foreground is assigned to the uncertain foreground set. The remaining points in the sampled point set are assigned to the background set. Afterwards, the certain foreground set and the uncertain foreground set form the positive pair, while the certain foreground set and the background set form the negative pair. The similarity loss is then computed from the feature vectors and the normalized coordinates of these points. We use the cosine similarity to compare two feature vectors, and the number of elements in each set to normalize the loss. Each pair of points is weighted by a similarity weight based on the distance along the axis between the two points. A constant term adjusts the correlation of features between adjacent frames, and a further constant is used to keep the associated term strictly greater than zero.
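To make the grouping and pairing above concrete, the sketch below splits the sampled points into the three sets and evaluates one plausible similarity loss. The exact formula is not reproduced here, so the loss form is an assumption: positive pairs are penalized by one minus their cosine similarity, negative pairs by their positive cosine similarity, and each pair is weighted by an exponential decay of the depth distance. The names `split_points`, `similarity_loss` and the `decay` parameter are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def split_points(certain, uncertain, labels):
    """Partition point indices into certain foreground, uncertain
    foreground, and background (all remaining sampled points)."""
    certain_fg = [i for i in certain if labels[i] == 1]
    uncertain_fg = [i for i in uncertain if labels[i] == 1]
    foreground = set(certain_fg) | set(uncertain_fg)
    background = [i for i in list(certain) + list(uncertain)
                  if i not in foreground]
    return certain_fg, uncertain_fg, background

def similarity_loss(features, depth, certain_fg, uncertain_fg, background,
                    decay=4.0):
    """Assumed form: distance-weighted penalties averaged per pair type."""
    def pair_term(i, j, positive):
        # Pairs that are far apart along the depth axis count less.
        weight = math.exp(-decay * abs(depth[i] - depth[j]))
        sim = cosine_similarity(features[i], features[j])
        penalty = 1.0 - sim if positive else max(sim, 0.0)
        return weight * penalty

    pos = [pair_term(i, j, True) for i in certain_fg for j in uncertain_fg]
    neg = [pair_term(i, j, False) for i in certain_fg for j in background]
    pos_loss = sum(pos) / len(pos) if pos else 0.0
    neg_loss = sum(neg) / len(neg) if neg else 0.0
    return pos_loss + neg_loss
```

Under these assumptions the loss is zero when uncertain foreground features align with certain foreground features and background features are orthogonal to them, and it grows when a hard foreground point looks like background in feature space.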

The consistency loss term is computed as follows. For each point in the sampled point set, we sample another point at the same in-plane position from an adjacent frame. If the two points have the same class label, they are viewed as a positive pair; otherwise they are regarded as a negative pair. All positive pairs are collected into a positive pair set and all negative pairs into a negative pair set, and the consistency loss is computed over these two sets.
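The pairing step for the consistency term can be sketched as follows. The sketch assumes the labels are stored as a nested list indexed as `labels[z][y][x]`, that the adjacent frame is the next frame when it exists and the previous one otherwise, and that points without any adjacent frame are skipped. The function name and the `offset` parameter are hypothetical.

```python
def build_consistency_pairs(points, labels, offset=1):
    """Pair each sampled point with the point at the same (y, x)
    position in an adjacent frame, grouped by label agreement."""
    depth = len(labels)
    positive, negative = [], []
    for (z, y, x) in points:
        z_adj = z + offset
        if not 0 <= z_adj < depth:
            z_adj = z - offset  # fall back to the previous frame
        if not 0 <= z_adj < depth:
            continue  # single-frame volume: no adjacent frame exists
        pair = ((z, y, x), (z_adj, y, x))
        if labels[z][y][x] == labels[z_adj][y][x]:
            positive.append(pair)
        else:
            negative.append(pair)
    return positive, negative
```

A loss over these pairs would then reward similar features for positive pairs and dissimilar features for negative pairs, in the same spirit as the similarity term.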


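The overall loss function of our framework is:

$$L = L_{ce} + \lambda_1 L_{sim} + \lambda_2 L_{con}$$

where $L_{ce}$ denotes the cross-entropy loss on the output mask probability map and boundary map, $L_{sim}$ and $L_{con}$ denote the similarity loss and the consistency loss, and $\lambda_1$ and $\lambda_2$ are two constant terms that control the weight of $L_{sim}$ and $L_{con}$, respectively.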

III Experiments and Results

In this section, we first introduce some details related to the experimental datasets and implementation. Then the results of our method and the quantitative and qualitative comparisons with other approaches are presented.

III-A Datasets

We evaluate our method on two datasets (MitoEM dataset and FIB-SEM dataset). The related details are depicted as follows:

MitoEM dataset [4]: This dataset consists of two volumes, each of 1000 × 4096 × 4096 voxels at a resolution of 30 × 8 × 8 nm. The image volumes are acquired from rat (MitoEM-R) and human (MitoEM-H) tissue, respectively. For each image stack, only the labels of the first 500 slices have been published, and the labels of the remaining 500 slices are unreleased. In our experiments, we use the first 400 slices of the published data for training and evaluate the segmentation performance on the last 100 slices.

FIB-SEM dataset [3]: This dataset is obtained from mouse hippocampus and is composed of a training volume and a testing volume. Each volume has a resolution of 5 × 5 × 5 nm and consists of 165 slices, and the size of each slice is 1024 × 768 pixels.

III-B Implementation Details

We implement the proposed method using the PyTorch open-source deep learning library [21]. In our experiments, the same input data size is used for training on both the MitoEM and FIB-SEM datasets. The network is trained using stochastic gradient descent [22]. We follow the same data augmentation and learning schedule settings as [23]. We set the hyper-parameters in this work as follows: =1024, =0.75, =-4.0, ==0.2, =.

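TABLE I: Segmentation results (AP-75) on the MitoEM dataset.

| Dataset  | Method     | Small | Medium | Large | All   |
|----------|------------|-------|--------|-------|-------|
| MitoEM-R | U3D-BC     | 0.139 | 0.724  | 0.895 | 0.844 |
| MitoEM-R | Our method | 0.203 | 0.743  | 0.913 | 0.870 |
| MitoEM-H | U3D-BC     | 0.232 | 0.747  | 0.825 | 0.773 |
| MitoEM-H | Our method | 0.296 | 0.778  | 0.830 | 0.787 |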
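TABLE II: Effect of the contrastive loss terms (AP-75) on the MitoEM dataset.

| Dataset  | Contrastive loss terms | Small | Medium | Large | All   |
|----------|------------------------|-------|--------|-------|-------|
| MitoEM-R | None (baseline)        | 0.139 | 0.724  | 0.895 | 0.844 |
| MitoEM-R | One term only          | 0.166 | 0.679  | 0.909 | 0.858 |
| MitoEM-R | One term only          | 0.185 | 0.747  | 0.909 | 0.866 |
| MitoEM-R | Both terms             | 0.203 | 0.743  | 0.913 | 0.870 |
| MitoEM-H | None (baseline)        | 0.232 | 0.747  | 0.825 | 0.773 |
| MitoEM-H | One term only          | 0.216 | 0.755  | 0.829 | 0.778 |
| MitoEM-H | One term only          | 0.292 | 0.777  | 0.825 | 0.781 |
| MitoEM-H | Both terms             | 0.296 | 0.778  | 0.830 | 0.787 |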

III-C Evaluation Metrics and Results

For segmentation evaluation, AP-75 and the Jaccard index are the two main metrics. AP-75 evaluates segmentation performance at the instance level: a predicted instance is counted as a true positive (TP) only if the voxel-wise overlap between the prediction and the corresponding ground truth reaches at least 75%. The Jaccard index is a common criterion for measuring segmentation accuracy at the pixel level, and is defined as

$$\text{Jaccard} = \frac{TP}{TP + FP + FN}$$

where TP, FP and FN denote the numbers of true positive, false positive and false negative pixels, respectively.
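Both metrics rest on the same overlap computation, which the following sketch illustrates on flat binary masks. It assumes that the 75% overlap criterion is measured as intersection over union between one predicted instance and its matched ground-truth instance; the function names are hypothetical, and matching predictions to ground-truth instances is outside the scope of this sketch.

```python
def jaccard_index(pred, truth):
    """Jaccard index TP / (TP + FP + FN) of two flat binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    union = tp + fp + fn
    # Two empty masks agree perfectly by convention.
    return tp / union if union else 1.0

def is_true_positive(pred_instance, truth_instance, threshold=0.75):
    """AP-75 style check: the overlap must reach the threshold."""
    return jaccard_index(pred_instance, truth_instance) >= threshold
```

For instance, a prediction covering three voxels of which two belong to a two-voxel ground-truth instance has a Jaccard index of 2/3 and would not count as a true positive under the 75% criterion.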


Following [4], we report the AP-75 results for small, medium, large and all instances separately to evaluate the segmentation results on the MitoEM dataset. Table I shows the performance comparison of U3D-BC and our method on the MitoEM-H and MitoEM-R validation datasets. Our method brings consistent improvement over the baseline on all indicators for both datasets, which demonstrates the effectiveness of our approach. We also conduct an ablation study on the effect of the two contrastive loss terms on the MitoEM dataset, as shown in Table II. When using either the similarity loss term or the consistency loss term alone, our method achieves a higher overall AP-75 than the baseline. After combining them, the segmentation performance is further improved. This suggests that both contrastive loss terms contribute to training a good segmentor.

Fig. 2 illustrates the qualitative comparison of segmentation results between the baseline and our method on the MitoEM dataset. Our method can correctly distinguish mitochondria-like organelles from real mitochondria, which indicates that it handles hard examples better. To clearly show the 3D mitochondria structure, we import our segmentation results into VAST [24] for visualization, as shown in Fig. 3.

(a) Raw image (b) GT (c) U3D-BC (d) Ours
Fig. 2: Visualization results on MitoEM dataset. From left to right: raw images, ground truth, results of U3D-BC, and results of our method.

On the FIB-SEM dataset, we use the Jaccard index (foreground IoU) to measure segmentation accuracy following [8, 9, 10, 11]. The quantitative comparisons with previous methods are presented in Table III, where the top three values are marked in bold for distinction. The Jaccard index of our approach is higher than that of most previous algorithms and ranks third overall, showing that our method is competitive with the state of the art. To compare the qualitative segmentation differences between our method and others, we display three examples in Fig. 4. Note that our approach segments most of the mitochondria and effectively reduces FPs and FNs, achieving better qualitative results than [8, 11] and comparable results to [9].

(a) MitoEM-R (b) MitoEM-H
Fig. 3: Instance segmentation results on MitoEM-R and MitoEM-H.
TABLE III: Segmentation results on the FIB-SEM dataset.

| Method                     | Jaccard index |
|----------------------------|---------------|
| Lucchi et al. 2013 [6]     | 0.867         |
| 3D U-Net [18]              | 0.874         |
| Oztel et al. 2017 [8]      | **0.907**     |
| Cheng et al. 2017 [10]     | 0.889         |
| Improved Mask R-CNN [7]    | 0.849         |
| Xiao et al. 2018 [9]       | **0.900**     |
| Casser et al. 2020 [11]    | 0.890         |
| Liu et al. 2020 [12]       | 0.864         |
| Ours                       | **0.895**     |

IV Conclusion

In electron microscopy images, there are many hard examples which limit the improvement of mitochondria segmentation accuracy. To address this issue, we proposed an effective contrastive learning-based framework in which we first pick out representative points, then build positive and negative pairs based on their labels and prediction probabilities, and finally compute a contrastive loss to facilitate model training. Our contrastive loss increases intra-class compactness, inter-class separability and feature consistency between adjacent frames, thereby enabling a better feature representation. The experimental results on both EM datasets demonstrate the effectiveness of our method in mitochondria segmentation. In the future, we plan to explore the application of contrastive learning to the semi-supervised mitochondria segmentation task.

(a) Raw image
(b) Oztel et al. 2017 [8]
(c) Xiao et al. 2018 [9]
(d) Casser et al. 2020 [11]
(e) Ours
Fig. 4: The qualitative comparisons between different methods on FIB-SEM dataset. Green pixels denote true positives (TP), red pixels denote false negatives (FN), blue pixels denote false positives (FP), and black pixels denote true negatives (TN).

V Acknowledgment

This work was supported in part by the National Key R&D Program of China under Grant 2020AAA0105702, in part by the National Natural Science Foundation of China (NSFC) under Grant 62076230, and in part by the University Synergy Innovation Program of Anhui Province under Grant GXXT-2019-025.


  • [1] R. Roychaudhuri, M. Yang, M. M. Hoshi, and D. B. Teplow. Amyloid β-protein assembly and Alzheimer disease. Journal of Biological Chemistry, 284:4749–4753, 2009.
  • [2] S. Campello and L. Scorrano. Mitochondrial shape changes: orchestrating cell pathophysiology. EMBO Reports, 11:678–684, 2010.
  • [3] G. Knott, H. Marchman, D. Wall, and B. Lich. Serial section scanning electron microscopy of adult brain tissue using focused ion beam milling. Journal of Neuroscience, 28(12):2959–2964, 2008.
  • [4] D. Wei, Z. Lin, D. Barranco, N. Wendt, X. Liu, W. Yin, X. Huang, A. Gupta, W. Jang, X. Wang, I. Arganda-Carreras, J. Lichtman, and H. Pfister. MitoEM dataset: Large-scale 3D mitochondria instance segmentation from EM images. In MICCAI, 2020.
  • [5] A. Jorstad and P. Fua. Refining mitochondria segmentation in electron microscopy imagery with active surfaces. In Computer Vision - ECCV 2014 Workshops, pages 367–379, 2015.
  • [6] A. Lucchi, Y. Li, and P. Fua. Learning for structured prediction using approximate subgradient descent with working sets. In CVPR, pages 1987–1994, 2013.
  • [7] J. Liu, W. Li, C. Xiao, B. Hong, Q. Xie, and H. Han. Automatic detection and segmentation of mitochondria from SEM images using deep neural network. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 628–631, 2018.
  • [8] I. Oztel, G. Yolcu, I. Ersoy, T. White, and F. Bunyak. Mitochondria segmentation in electron microscopy volumes using deep convolutional neural network. In BIBM, pages 1195–1200, 2017.
  • [9] C. Xiao, X. Chen, W. Li, L. Li, L. Wang, Q. Xie, and H. Han. Automatic mitochondria segmentation for EM data using a 3D supervised convolutional network. Frontiers in Neuroanatomy, 12:92, 11 2018.
  • [10] H. Cheng and A. Varshney. Volume segmentation using convolutional neural networks with limited training data. In ICIP, pages 590–594, 2017.
  • [11] V. Casser, K. Kang, H. Pfister and D. Haehn. Fast Mitochondria Detection for Connectomics. MIDL, pages 111–120, 2020.
  • [12] J. Liu, L. Li, Y. Yang, B. Hong, X. Chen, Q. Xie, and H. Han. Automatic reconstruction of mitochondria and endoplasmic reticulum in electron microscopy volumes by deep learning. Frontiers in neuroscience, 14, 2020.
  • [13] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9726–9735, 2020.
  • [14] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In ICML, volume 119, pages 1597–1607, 2020.
  • [15] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan. Supervised contrastive learning. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • [16] X. Zhao, R. Vemulapalli, P. Mansfield, B. Gong, B. Green, L. Shapira, and Y. Wu. Contrastive learning for label-efficient semantic segmentation. ArXiv, abs/2012.06985, 2020.
  • [17] H. Li, X. Yang, J. Liang, W. Shi, C. Chen, H. Dou, R. Li, R. Gao, G. Zhou, J. Fang, X. Liang, R. Huang, A. Frangi, Z. Chen, and D. Ni. Contrastive rendering for ultrasound image segmentation. In MICCAI, pages 563–572, 2020.
  • [18] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In MICCAI, pages 424–432, 2016.
  • [19] H. Chen, X. Qi, L. Yu, and P. Heng. DCAN: Deep contour-aware networks for accurate gland segmentation. In CVPR, pages 2487–2496, 2016.
  • [20] F. Meyer and S. Beucher. Morphological segmentation. Journal of Visual Communication and Image Representation, pages 21–46, 1990.
  • [21] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. CoRR, 2019.
  • [22] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436, 2012.
  • [23] K. Lee, J. Zung, P. Li, V. Jain, and H. Seung. Superhuman accuracy on the SNEMI3D connectomics challenge. CoRR, 2017.
  • [24] D. R. Berger, H. S. Seung, and J. W. Lichtman. VAST (volume annotation and segmentation tool): Efficient manual and semi-automatic labeling of large 3D image stacks. Frontiers in Neural Circuits, 12:88, 2018.