I. Introduction
Known as the powerhouses of cells, mitochondria play a crucial role in the regulation of cellular life and death, as they produce the overwhelming majority of cellular adenosine triphosphate (ATP) and thereby support many essential cellular functions. Many clinical studies have revealed the correlation between the biological function of mitochondria and their size and geometry [1, 2]. In recent years, with the development of imaging technology, electron microscopy (EM) images have been widely used for the analysis of nanometer-level structures, including mitochondria. Knott et al. [3] deliver a dataset of adult rodent brain tissue acquired with focused ion beam scanning electron microscopy (FIB-SEM) to examine axons, dendrites and their synaptic contacts. More recently, Wei et al. [4] publish a large-scale EM dataset (MitoEM) for 3D mitochondria instance segmentation. The MitoEM dataset is larger than the FIB-SEM dataset [3] and contains more mitochondria with sophisticated shapes, as well as subcellular ultrastructures that are similar in appearance to mitochondria. Therefore, this dataset poses new challenges to the segmentation of mitochondria.
Many earlier approaches combine traditional image processing and machine learning techniques for mitochondria segmentation. Jorstad and Fua [5] propose an active surface-based method that refines boundary surfaces by exploiting the thick and dark membranes of mitochondria. Lucchi et al. [6] utilize an approximate subgradient descent algorithm to minimize the hinge loss in a margin-sensitive framework. However, these methods rely on handcrafted features to build their classifiers, which limits their efficiency and generalizability. Recently, deep learning-based approaches [7, 8, 9, 10, 11, 12] show promising results in mitochondria segmentation. Liu et al. [7] adopt a modified Mask R-CNN for mitochondria instance segmentation on the FIB-SEM dataset [3]. Oztel et al. [8] propose a fully convolutional network (FCN) to segment mitochondria and then apply various post-processing steps, such as 2D spurious detection filtering, boundary refinement and 3D filtering, to improve the segmentation. Xiao et al. [9] propose an effective 3D residual FCN that averts the vanishing gradient problem for mitochondria segmentation. Most of the methods listed above improve segmentation by carefully designing the network architecture or by adding various post-processing algorithms. However, they rarely pay extra attention to hard examples, such as noise, imaging artifacts and subcellular ultrastructures, which are widespread in EM images. As a result, these approaches often fail to segment hard examples correctly, preventing further improvement in mitochondria segmentation accuracy.
In recent years, contrastive learning has been widely applied in computer vision tasks [13, 14, 15, 16, 17]. Contrastive learning formulates a task in which a deep learning model learns to tell similar and dissimilar things apart. The works in [13, 14] adopt contrastive learning for unsupervised visual representation learning. For image classification, [15] extends the self-supervised batch contrastive approach to the fully-supervised setting, which effectively leverages label information. More recently, [16] proposes to first pre-train the CNN feature extractor using a label-based contrastive loss for the semantic segmentation task, and [17] adopts point-wise contrastive learning to improve boundary estimation in ultrasound image segmentation. Inspired by [17], we extend contrastive learning to handle hard examples in mitochondria segmentation, based on the intra-class similarity of mitochondria textures and the inter-class discrepancy between the textures of mitochondria and those of hard examples.

In this work, we propose a contrastive learning-based framework for the mitochondria segmentation task. In the training phase, we first sample representative points (i.e., pixels) from hard examples. Then we adopt a contrastive loss, computed on the feature space of these points, to facilitate model training. Our contrastive loss consists of two terms, a similarity term and a consistency term. The similarity loss term is designed to increase the feature similarity of points from the same class and decrease that of points from different classes. The consistency loss term focuses on the feature similarity of points in adjacent frames to enhance model robustness to image content changes. We conduct extensive experiments on both the MitoEM dataset and the FIB-SEM dataset to verify the effectiveness of the proposed method.
II. Our Method
Our method consists of three components: a backbone network in a symmetric encoder-decoder structure, a point selection block to sample representative points, and a contrastive learning head to compute the contrastive loss during the training stage, as shown in Fig. 1.
II-A Network Architecture
We build a semantic segmentation backbone network based on the 3D U-Net [18]. Following [19], we generate a mask probability map and a boundary map from the backbone segmentation network. At the end of our framework, a marker-controlled watershed post-processing algorithm [20] is applied to isolate each instance from the semantic segmentation map. In this paper, we denote our backbone network by U3D-BC, following [4].

Training the segmentation network relies heavily on labeled data that is expected to be diverse and representative. For each input image I, we have its dense mask label y_m as well as the boundary label y_b, which can be generated from the mask label in the same way as [4]. In previous methods, a common way to train the segmentation network is to minimize the overall cross-entropy loss defined as
$$\mathcal{L}_{ce} = -\sum_{x \in I}\Big[y_m(x)\log p_m(x) + \big(1-y_m(x)\big)\log\big(1-p_m(x)\big) + y_b(x)\log p_b(x) + \big(1-y_b(x)\big)\log\big(1-p_b(x)\big)\Big] \quad (1)$$

where x is a pixel of the image I, p_m denotes the predicted probability map of mitochondria masks, and similarly p_b is the predicted probability map of boundaries.
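For concreteness, this joint objective can be sketched in PyTorch as follows. This is an illustrative sketch, not the authors' released code: tensor names are our assumptions, and the built-in mean-reduced binary cross-entropy is used in place of the explicit per-pixel sum.

```python
import torch
import torch.nn.functional as F

def overall_ce_loss(mask_prob, bnd_prob, mask_label, bnd_label):
    """Joint binary cross-entropy on the mask and boundary maps.

    mask_prob / bnd_prob: predicted probabilities in [0, 1], shape (B, D, H, W).
    mask_label / bnd_label: binary ground-truth maps of the same shape.
    """
    loss_mask = F.binary_cross_entropy(mask_prob, mask_label.float())
    loss_bnd = F.binary_cross_entropy(bnd_prob, bnd_label.float())
    return loss_mask + loss_bnd
```

Both heads share the same supervision pattern, so in practice only the labels differ between the two terms.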
The cross-entropy loss penalizes the predicted probability of all pixels without distinction. Moreover, there are usually far more easy examples than hard examples in EM images. Therefore, it is challenging for existing methods to learn good representations of hard examples, making their predictions inaccurate. To solve this problem, we propose to first select representative points from both hard and easy examples, and then additionally penalize the predictions of hard points in a contrastive manner. In this way, hard examples receive extra attention and consequently their segmentation can be improved.
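The marker-controlled watershed post-processing mentioned in this subsection can be sketched as follows. This is a minimal illustration built on scikit-image; the seeding rule and the 0.5 thresholds are our assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

def instances_from_semantic(mask_prob, bnd_prob, thr=0.5):
    """Marker-controlled watershed: seed markers are connected components
    of confidently-foreground, non-boundary voxels; each seed is then
    grown over the predicted foreground region."""
    foreground = mask_prob > thr
    seeds = foreground & (bnd_prob < thr)
    markers, _ = ndimage.label(seeds)
    # Flood from the markers over the inverted mask probability,
    # restricted to the predicted foreground.
    return watershed(-mask_prob, markers, mask=foreground)
```

Because flooding is restricted by the foreground mask, instances touching along a predicted boundary are separated while background voxels stay unlabeled.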
II-B Point Selection Block
The point selection block is designed to sample representative points during the training phase. We sample two types of points from the input image: uncertain points, which are sampled from hard examples, and certain points, which have high prediction confidence in the mask probability map. Let N denote the number of sampled points; the point selection then proceeds as follows. Based on the mask probability map, we sample a representative point set S, which consists of uncertain points and certain points. The uncertain points are sampled based on the prediction error, i.e., the absolute difference between the predicted probability and the ground truth label. The top-ranked points with the highest prediction error make up the uncertain point set S_u. Likewise, we choose the top-ranked points with the highest mask probability to constitute the certain point set S_c.
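A minimal sketch of this selection rule follows; function and variable names are ours, and the split between uncertain and certain counts is left as parameters.

```python
import torch

def select_points(mask_prob, gt_mask, n_uncertain, n_certain):
    """Pick the n_uncertain voxels with the largest prediction error
    |p - y| and the n_certain voxels with the highest mask probability.
    Returns flat voxel indices into the input volume."""
    p = mask_prob.flatten()
    y = gt_mask.flatten().float()
    error = (p - y).abs()
    uncertain_idx = torch.topk(error, n_uncertain).indices
    certain_idx = torch.topk(p, n_certain).indices
    return uncertain_idx, certain_idx
```

Note that the uncertain set requires ground truth and is therefore only available at training time, which matches the paper's training-phase use of the block.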
II-C Contrastive Learning
With representative points, we adopt contrastive learning to mine hard examples in the training phase. Contrastive learning is mostly applied in feature space to facilitate model training. The point features are derived from the backbone network: we extract a mask probability map and a feature map from its last two layers. We use trilinear interpolation to upsample the feature map to the size of the probability map and concatenate the two into a hybrid feature map. The feature vectors of the representative points can then be extracted from the hybrid feature map according to their point indices.
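The upsample-concatenate-gather procedure can be sketched as follows; shapes and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gather_point_features(feat, mask_prob, point_idx):
    """Build the hybrid feature map and read out point features.

    feat:      (B, C, d, h, w) coarse feature map from the decoder.
    mask_prob: (B, 1, D, H, W) full-resolution mask probability map.
    point_idx: (P,) flat indices into the D*H*W voxel grid.
    """
    feat_up = F.interpolate(feat, size=mask_prob.shape[2:],
                            mode="trilinear", align_corners=False)
    hybrid = torch.cat([feat_up, mask_prob], dim=1)  # (B, C+1, D, H, W)
    flat = hybrid.flatten(start_dim=2)               # (B, C+1, D*H*W)
    return flat[:, :, point_idx]                     # (B, C+1, P)
```

Appending the probability channel lets the contrastive head see both the learned texture features and the current confidence at each sampled point.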
With the feature vectors of the representative points, we compute a contrastive loss to "weight" hard points. Our contrastive loss consists of two terms: a similarity loss term and a consistency loss term. The similarity term increases the feature similarity of points from the same class while reducing that of points from different classes, so the segmentation of hard examples can be improved. Complementary to the similarity term, the consistency term is designed to enhance the feature similarity between same-class points at the same position in adjacent frames, and contrastively decrease the similarity of different-class points. As a result, the inter-frame continuity of the segmentation predictions is improved. Both loss terms contribute to a more discriminative and robust model.
The similarity loss term is computed as follows. First, we split the point set S into three sets, the certain foreground set S_cf, the uncertain foreground set S_uf and the background set S_b, according to the predicted probabilities and class labels of the points. Every point labeled as foreground in the certain point set S_c is assigned to S_cf. A similar rule is applied to the uncertain point set S_u: each point in S_u that belongs to the foreground is assigned to S_uf. The remaining points in S are assigned to S_b. Afterwards, the certain foreground set S_cf and the uncertain foreground set S_uf form the positive pair, while the certain foreground set S_cf and the background set S_b build the negative pair. Let f_i represent the feature vector of point i and z_i be its normalized coordinate along the slice axis. Then the similarity loss is computed by the following formulas:
(2)  
(3)  
(4)  
(5)  
(6) 
We adopt the cosine similarity sim(u, v) = (u · v)/(‖u‖‖v‖) to compare two feature vectors u and v, and |·| denotes the number of elements in a set. The similarity weight between two points is based on their distance along the slice axis, with a constant term adjusting the correlation of features between adjacent frames, and another constant ensures that the loss is always greater than zero.

The consistency loss term is computed as follows. For each point in the sampled point set S, we sample another point at the same planar position in the adjacent frame. If the two points have the same class label, they are viewed as a positive pair, and otherwise as a negative pair, which yields a positive pair set and a negative pair set. The loss term can then be written as follows:
(7)  
(8)  
(9) 
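Since the exact forms of Eqs. (2)-(9) involve the distance-based weight and constants defined above, the following is only a simplified sketch of the behaviour described in the text: a similarity term over set-level positive and negative pairs, and a per-point consistency term across adjacent frames. The z-distance weight is omitted, and the offsetting constants here merely keep each term non-negative; none of this is the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mean_pair_sim(a, b):
    """Mean pairwise cosine similarity between two point-feature sets (P, C)."""
    a = F.normalize(a, dim=1)
    b = F.normalize(b, dim=1)
    return (a @ b.t()).mean()

def similarity_loss(f_cert_fg, f_unc_fg, f_bg):
    """Pull uncertain-foreground features toward certain-foreground ones
    (positive pair) and push background features away (negative pair)."""
    pos = mean_pair_sim(f_cert_fg, f_unc_fg)  # should be high
    neg = mean_pair_sim(f_cert_fg, f_bg)      # should be low
    return (1.0 - pos) + (1.0 + neg)          # each term stays >= 0

def consistency_loss(f_a, f_b, same_class):
    """Matched points in adjacent frames: encourage feature agreement for
    same-class pairs and disagreement for different-class pairs."""
    sim = F.cosine_similarity(f_a, f_b, dim=1)  # (P,)
    sign = same_class.float() * 2.0 - 1.0       # +1 for positive pairs, -1 otherwise
    return (1.0 - sign * sim).mean()
```

Both terms reach zero only when positive pairs are perfectly aligned and negative pairs perfectly opposed, which mirrors the pull/push behaviour the text describes.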
The overall loss function of our framework is:
$$\mathcal{L} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_{sim} + \lambda_2 \mathcal{L}_{con} \quad (10)$$

where L_ce denotes the cross-entropy loss on the output mask probability map and boundary map, L_sim and L_con are the similarity and consistency loss terms, and λ_1 and λ_2 are two constant weights that control the contribution of L_sim and L_con, respectively.
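The combination itself is a one-liner; the 0.2 defaults below reflect our reading of the hyper-parameter list in Sec. III-B and are assumptions rather than confirmed values.

```python
def total_loss(l_ce, l_sim, l_con, lam1=0.2, lam2=0.2):
    """Weighted combination of the three training objectives."""
    return l_ce + lam1 * l_sim + lam2 * l_con
```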
III. Experiments and Results
In this section, we first introduce some details related to the experimental datasets and implementation. Then the results of our method and the quantitative and qualitative comparisons with other approaches are presented.
III-A Datasets
We evaluate our method on two datasets, the MitoEM dataset and the FIB-SEM dataset, which are described as follows:
MitoEM dataset [4]: This dataset consists of two image volumes of the same size and voxel resolution, acquired from rat (MitoEM-R) and human (MitoEM-H) tissue, respectively. For each image stack, only the labels of the first 500 slices have been published, while those of the remaining 500 slices are unreleased. In our experiments, we use the first 400 slices of the published data for training and evaluate the segmentation performance on the last 100 slices.
FIB-SEM dataset [3]: This dataset is obtained from mouse hippocampus and is composed of a training volume and a testing volume, each consisting of 165 slices of identical size and resolution.
III-B Implementation Details
We implement the proposed method using the PyTorch open-source deep learning library [21]. In our experiments, the same input volume size is used for training on both the MitoEM and FIB-SEM datasets. The network is trained using stochastic gradient descent [22]. We follow the same data augmentation and learning schedule settings as [23]. We set the hyper-parameters in this work as follows: the number of sampled points N = 1024, the loss weights λ_1 = λ_2 = 0.2, and the remaining constants 0.75 and 4.0.

TABLE I: AP-75 of U3D-BC and our method on the MitoEM validation sets.

Dataset  | Method     | Small | Medium | Large | All
---------|------------|-------|--------|-------|------
MitoEM-R | U3D-BC     | 0.139 | 0.724  | 0.895 | 0.844
MitoEM-R | Our method | 0.203 | 0.743  | 0.913 | 0.870
MitoEM-H | U3D-BC     | 0.232 | 0.747  | 0.825 | 0.773
MitoEM-H | Our method | 0.296 | 0.778  | 0.830 | 0.787
TABLE II: Ablation study of the two contrastive loss terms on the MitoEM validation sets (AP-75); rows list the baseline, each loss term added individually, and both combined.

Dataset  | L_sim | L_con | Small | Medium | Large | All
---------|-------|-------|-------|--------|-------|------
MitoEM-R |       |       | 0.139 | 0.724  | 0.895 | 0.844
MitoEM-R | ✓     |       | 0.166 | 0.679  | 0.909 | 0.858
MitoEM-R |       | ✓     | 0.185 | 0.747  | 0.909 | 0.866
MitoEM-R | ✓     | ✓     | 0.203 | 0.743  | 0.913 | 0.870
MitoEM-H |       |       | 0.232 | 0.747  | 0.825 | 0.773
MitoEM-H | ✓     |       | 0.216 | 0.755  | 0.829 | 0.778
MitoEM-H |       | ✓     | 0.292 | 0.777  | 0.825 | 0.781
MitoEM-H | ✓     | ✓     | 0.296 | 0.778  | 0.830 | 0.787
III-C Evaluation Metrics and Results
For segmentation evaluation, AP-75 and the Jaccard index are the two main metrics used to measure the results. AP-75 evaluates segmentation performance at the instance level: a predicted instance is counted as a true positive (TP) only if its voxel-wise overlap with the corresponding ground truth reaches at least 75%. The Jaccard index is a common segmentation criterion that measures accuracy at the pixel level, and its formula is as follows:
$$\mathrm{Jaccard} = \frac{|P \cap G|}{|P \cup G|} \quad (11)$$

where P and G denote the predicted and ground-truth foreground voxel sets, respectively.
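On binary volumes the foreground Jaccard index reduces to a small NumPy utility; the convention for an empty union is our choice.

```python
import numpy as np

def jaccard_index(pred, gt):
    """Foreground IoU between two binary volumes: |P ∩ G| / |P ∪ G|."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)
```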
Following [4], we report AP-75 for small, medium, large and all instances separately to evaluate the segmentation results on the MitoEM dataset. Table I shows the performance comparison of U3D-BC and our method on the MitoEM-R and MitoEM-H validation sets. Our method brings consistent improvements over the baseline on all metrics for both datasets, which demonstrates its effectiveness. We also conduct an ablation study on the effect of the two contrastive loss terms on the MitoEM dataset, as shown in Table II. When utilizing either the similarity loss term or the consistency loss term, our method always achieves better results than the baseline, and combining them improves the segmentation performance further. This suggests that the two contrastive loss terms are essential and complementary for training a good segmentation model.
Fig. 2 illustrates the qualitative comparison of segmentation results between the baseline and our method on the MitoEM dataset. Our method correctly distinguishes mitochondria-like organelles from real mitochondria, indicating that it handles hard examples better. To clearly show the 3D mitochondria structure, we import our segmentation results into VAST [24] for visualization, as shown in Fig. 3.
On the FIB-SEM dataset, we use the Jaccard index (foreground IoU) to measure segmentation accuracy, following [8, 9, 10, 11]. The quantitative comparisons with previous methods are presented in Table III, where the top three values are marked in bold. The Jaccard index of our approach is higher than that of most previous algorithms, demonstrating that our method achieves state-of-the-art results. To compare the qualitative segmentation differences between our method and others, we display three examples in Fig. 4. Note that our approach segments most of the mitochondria and effectively reduces false positives (FPs) and false negatives (FNs), achieving better qualitative results than [8, 11] and comparable results to [9].
TABLE III: Jaccard index comparison on the FIB-SEM dataset.

Method                  | Jaccard index
------------------------|--------------
Lucchi et al. 2013 [6]  | 0.867
3D U-Net [18]           | 0.874
Oztel et al. 2017 [8]   | 0.907
Cheng et al. 2017 [10]  | 0.889
Improved Mask R-CNN [7] | 0.849
Xiao et al. 2018 [9]    | 0.900
Casser et al. 2020 [11] | 0.890
Liu et al. 2020 [12]    | 0.864
Ours                    | 0.895
IV. Conclusion
In electron microscopy images, there are many hard examples that limit the improvement of mitochondria segmentation accuracy. To address this issue, we proposed an effective contrastive learning-based framework in which we first pick out representative points, then build positive and negative pairs based on their labels and prediction probabilities, and finally compute a contrastive loss to facilitate model training. Our contrastive loss increases intra-class compactness, inter-class separability and feature consistency between adjacent frames, thereby enabling a better feature representation. The experimental results on two EM datasets demonstrate the effectiveness of our method for mitochondria segmentation. In the future, we plan to explore the application of contrastive learning to the semi-supervised mitochondria segmentation task.
V. Acknowledgment
This work was supported in part by the National Key R&D Program of China under Grant 2020AAA0105702, in part by the National Natural Science Foundation of China (NSFC) under Grant 62076230, and in part by the University Synergy Innovation Program of Anhui Province under Grant GXXT-2019-025.
References
 [1] R. Roychaudhuri, M. Yang, M. M. Hoshi, and D. B. Teplow. Amyloid β-protein assembly and Alzheimer disease. Journal of Biological Chemistry, 284:4749–4753, 2009.
 [2] S. Campello and L. Scorrano. Mitochondrial shape changes: orchestrating cell pathophysiology. EMBO Reports, 11:678–684, 2010.
 [3] G. Knott, H. Marchman, D. Wall, and B. Lich. Serial section scanning electron microscopy of adult brain tissue using focused ion beam milling. Journal of Neuroscience, 28(12):2959–2964, 2008.
 [4] D. Wei, Z. Lin, D. Barranco, N. Wendt, X. Liu, W. Yin, X. Huang, A. Gupta, W. Jang, X. Wang, I. Arganda-Carreras, J. Lichtman, and H. Pfister. MitoEM dataset: Large-scale 3D mitochondria instance segmentation from EM images. In MICCAI, 2020.
 [5] A. Jorstad and P. Fua. Refining mitochondria segmentation in electron microscopy imagery with active surfaces. In Computer Vision - ECCV 2014 Workshops, pages 367–379, 2015.
 [6] A. Lucchi, Y. Li, and P. Fua. Learning for structured prediction using approximate subgradient descent with working sets. In CVPR, pages 1987–1994, 2013.

 [7] J. Liu, W. Li, C. Xiao, B. Hong, Q. Xie, and H. Han. Automatic detection and segmentation of mitochondria from SEM images using deep neural network. In EMBC, pages 628–631, 2018.
 [8] I. Oztel, G. Yolcu, I. Ersoy, T. White, and F. Bunyak. Mitochondria segmentation in electron microscopy volumes using deep convolutional neural network. In BIBM, pages 1195–1200, 2017.
 [9] C. Xiao, X. Chen, W. Li, L. Li, L. Wang, Q. Xie, and H. Han. Automatic mitochondria segmentation for EM data using a 3D supervised convolutional network. Frontiers in Neuroanatomy, 12:92, 2018.
 [10] H. Cheng and A. Varshney. Volume segmentation using convolutional neural networks with limited training data. In ICIP, pages 590–594, 2017.
 [11] V. Casser, K. Kang, H. Pfister, and D. Haehn. Fast mitochondria detection for connectomics. In MIDL, pages 111–120, 2020.
 [12] J. Liu, L. Li, Y. Yang, B. Hong, X. Chen, Q. Xie, and H. Han. Automatic reconstruction of mitochondria and endoplasmic reticulum in electron microscopy volumes by deep learning. Frontiers in Neuroscience, 14, 2020.
 [13] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9726–9735, 2020.
 [14] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In ICML, volume 119, pages 1597–1607, 2020.
 [15] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan. Supervised contrastive learning. In Advances in Neural Information Processing Systems, volume 33, 2020.
 [16] X. Zhao, R. Vemulapalli, P. Mansfield, B. Gong, B. Green, L. Shapira, and Y. Wu. Contrastive learning for labelefficient semantic segmentation. ArXiv, abs/2012.06985, 2020.
 [17] H. Li, X. Yang, J. Liang, W. Shi, C. Chen, H. Dou, R. Li, R. Gao, G. Zhou, J. Fang, X. Liang, R. Huang, A. Frangi, Z. Chen, and D. Ni. Contrastive rendering for ultrasound image segmentation. In MICCAI, pages 563–572, 2020.
 [18] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In MICCAI, pages 424–432, 2016.
 [19] H. Chen, X. Qi, L. Yu, and P.-A. Heng. DCAN: Deep contour-aware networks for accurate gland segmentation. In CVPR, pages 2487–2496, 2016.
 [20] F. Meyer and S. Beucher. Morphological segmentation. Journal of Visual Communication and Image Representation, pages 21–46, 1990.
 [21] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. CoRR, 2019.
 [22] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436, 2012.
 [23] K. Lee, J. Zung, P. Li, V. Jain, and H. Seung. Superhuman accuracy on the SNEMI3D connectomics challenge. CoRR, 2017.
 [24] D. R. Berger, H. S. Seung, and J. W. Lichtman. VAST (Volume Annotation and Segmentation Tool): Efficient manual and semi-automatic labeling of large 3D image stacks. Frontiers in Neural Circuits, 12:88, 2018.