
Automated image analysis in large-scale cellular electron microscopy: A literature survey

Large-scale electron microscopy (EM) datasets generated using (semi-) automated microscopes are becoming the standard in EM. Given the vast amounts of data, manual analysis of all data is not feasible, thus automated analysis is crucial. The main challenges in automated analysis include the annotation that is needed to analyse and interpret biomedical images, coupled with achieving high-throughput. Here, we review the current state-of-the-art of automated computer techniques and major challenges for the analysis of structures in cellular EM. The advanced computer vision, deep learning and software tools that have been developed in the last five years for automatic biomedical image analysis are discussed with respect to annotation, segmentation and scalability for EM data. Integration of automatic image acquisition and analysis will allow for high-throughput analysis of millimeter-range datasets with nanometer resolution.





1 Large-scale cellular EM

Electron microscopy (EM) is widely used in life sciences to study tissues, cells, subcellular components and (macro) molecular complexes at nanometer scale. Two-dimensional (2D) EM aids in diagnosis of diseases, but routinely it still depends upon biased snapshots of areas of interest. Automated pipelines for collection, stitching and open access publishing of 2D EM have been pioneered for both transmission EM (TEM) images (faas2012virtual) and scanning TEM (STEM) (sokol2015large) for acquisition of areas up to 1mm at nanometer-range resolution, Table 1. Nowadays, imaging of large areas at high resolution is entering the field as a routine method and is provided by most EM manufacturers. This we term nanotomy, for nano-anatomy (ravelli2013destruction; de2020large; dittmayer2021preparation). The large-scale images allow for open access world-wide data sharing; see nanotomy.org (www.nanotomy.org) for more than 40 published studies and the accessible nanotomy data.

Figure 1: Typical large-scale EM allows tissue to be analyzed at high resolution. Overview and snapshots of several cellular, subcellular and macromolecular structures of which up to a million can be present per dataset (de2020large). Bars: red m; green m; white m. Full access to digital zoomable data at full resolution is via
2D EM Methodology
Transmission Electron Microscope (TEM): A widefield electron beam illuminates an ultra-thin specimen and transmitted electrons are detected at the other side of the sample. Structures that are electron dense appear dark and others appear lighter depending on their (lack of) scattering.
Scanning Electron Microscope (SEM): The raster scanning beam interacts with the material and can result in back scattering or formation of secondary electrons. Their intensity reveals sample information.
Scanning Transmission Electron Microscopy (STEM): SEM on ultrathin sections, using a detector for the transmitted electrons.
Serial section TEM (ssTEM) or SEM (ssSEM): Volume EM technique for examining 3D ultrastructure by scanning adjacent ultrathin (typically 60-80 nm) sections using TEM or SEM, respectively.
Serial Block-face scanning EM (SB SEM): The block face is scanned, followed by removal of the top layer by a diamond knife (typically 20-60 nm), after which the newly exposed block face is scanned. This can be repeated thousands of times.
Focused Ion Beam SEM (FIB-SEM): Block face imaging as above, but sections are repeatedly removed by a focused ion beam that has higher precision than a knife (typically down to 4 nm), suitable for smaller volumes.
Table 1: Major large-scale EM techniques. Note that the biomaterial for the listed techniques is stained with heavy metals to generate contrast. For further information see the MyScope website and the papers reviewed by peddie2014exploring and titze2016volume.

A typical nanotomy dataset has a size of 5-25GB at 2.5nm pixel size. Nanotomy allows scientists to pan and zoom through different tissues or cellular structures in a Google Earth-like manner, Fig. 1. Thus, large-scale 2D EM provides unbiased data recording to discover events such as pathogenesis of diseases at the supra-cellular level for morphological changes. Moreover, nanotomy allows for the quantification of subcellular hallmarks. With state-of-the-art 2D EM technology, such as multibeam scanning electron microscopy (MB-SEM) (eberle2015high; ren2016transmission), up to 100 times faster acquisition and higher throughput allow for imaging of tissue-wide sections in the range of hours instead of days. For a side-by-side example of single beam versus multibeam nanotomy, see de2021state. Given the automated and faster image acquisition in 2D EM, a data avalanche (petabyte-range per microscope/month) will soon be a reality.
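The relation between imaged area, pixel size and data volume can be made concrete with a small calculation. The sketch below is only illustrative: the 250 µm × 250 µm field of view and the 8-bit uncompressed storage are assumptions, not values from the reviewed datasets, and the function name `raw_image_size` is hypothetical.

```python
def raw_image_size(width_um, height_um, pixel_nm, bytes_per_pixel=1):
    """Return (pixels_x, pixels_y, size_in_GB) for an uncompressed 2D image.

    Assumes square pixels and 8-bit grayscale by default (bytes_per_pixel=1).
    """
    pixels_x = round(width_um * 1000 / pixel_nm)   # 1 um = 1000 nm
    pixels_y = round(height_um * 1000 / pixel_nm)
    size_gb = pixels_x * pixels_y * bytes_per_pixel / 1e9  # decimal gigabytes
    return pixels_x, pixels_y, size_gb


# Hypothetical 250 um x 250 um field of view at a 2.5 nm pixel size.
print(raw_image_size(250, 250, 2.5))
```

Under these assumptions the image is 100,000 × 100,000 pixels and occupies 10 GB, which falls within the 5-25GB range quoted above.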

Automated large-scale three-dimensional (3D) or volume EM (vEM), which creates a stack of images, is also booming as reviewed by peddie2014exploring and titze2016volume, Table 1. Examples include whole-cell volume reconstruction of up to 35 cellular organelle classes (heinrich2021whole). Data acquisition time is no longer the bottleneck; data analysis is. Note that one person needed two weeks to manually label a fraction of the imaged volume. The whole cell could take 60 person years. Hence, there is an urgent need for automatic image analysis.

Semantic segmentation of EM images is the automatic process of mapping each pixel to known or newly discovered classes of subcellular structures. The challenges of automatic segmentation of EM images are due to the rendering in one color channel (grayscale) of subcellular structures with different sizes, shapes and heterogeneous appearances, surrounded by complicated structures. 2D EM datasets only have neighbouring (x, y axes) context as reference for structural analysis. The vEM datasets leverage knowledge from adjacent sections (z axis) that share high resemblance, which aids in better reconstruction of the volume. Additionally, large-scale 2D EM images are more spatially distributed across cells than traditional EM micrographs of a selected region or vEM datasets that occupy a smaller spatial area, even though both large-scale 2D EM and vEM contain gigabytes of information. Automatic analysis therefore requires global image processing without losing coherent information between patches.
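One common way to process an image that does not fit in memory while keeping context across patch borders is to cut it into overlapping tiles. The following is a minimal sketch of that idea, not a method taken from the reviewed papers; the function names `tile_starts` and `tile_grid` and the tile and overlap sizes are assumptions chosen for illustration.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of tiles of size `tile` along one axis.

    Neighbouring tiles share at least `overlap` pixels; the last tile is
    placed flush with the image border so that every pixel is covered.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    if length <= tile:
        return [0]
    step = tile - overlap
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)
    return starts


def tile_grid(height, width, tile, overlap):
    """Top-left (row, col) corners of all tiles covering a 2D image."""
    return [(r, c)
            for r in tile_starts(height, tile, overlap)
            for c in tile_starts(width, tile, overlap)]


# Toy example: a 9 x 10 image, 4 x 4 tiles, 1 pixel of shared context.
print(tile_grid(9, 10, tile=4, overlap=1))
```

Predictions in the overlapping margins can then be cropped or averaged when the tiles are stitched back together; which of the two works better depends on the model and is left open here.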

Here, we review automated methods for large scale EM image segmentation and analysis.

2 Deep learning and segmentation

Deep learning has become the state-of-the-art methodology for many computer vision tasks, including segmentation. This has been a paradigm shift compared to traditional segmentation methods, which were generally performed using machine learning classifiers such as AdaBoost, random forests and Support Vector Machines (SVMs) and required domain-oriented, handcrafted features as inputs, Table 2. The introduction of Deep Neural Networks (DNN) has enabled automatic extraction of hierarchical representations from raw image data. Fully Connected Neural Networks (FCNN) and Convolutional Neural Networks (CNN) have significantly advanced computer vision tasks such as classification, detection, and segmentation with end-to-end learning for feature extraction and prediction. FCNNs are the most regular DNNs, in which two adjacent layers are fully pairwise connected. Images need to be converted to a fixed-size column vector for multiplication by the weights connected to each of the units (neurons) in the next layer, followed by a non-linear function. Unlike FCNNs, which require a fixed input size, CNNs can process arrays of image data (2D or 3D) directly using weight-sharing filter banks.
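The practical effect of weight sharing can be illustrated by counting trainable parameters. The sketch below compares one fully connected layer with one convolutional layer; the image size, number of units and kernel size are illustrative assumptions, and the helper names are hypothetical.

```python
def fully_connected_params(height, width, units):
    """Weights plus biases of a dense layer fed with a flattened image."""
    inputs = height * width               # image flattened to a column vector
    return inputs * units + units


def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a 2D convolutional layer with square kernels."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# Hypothetical 256 x 256 grayscale patch, 100 units versus 100 filters of 3 x 3.
print(fully_connected_params(256, 256, 100))
print(conv_params(3, 1, 100))
```

The dense layer needs 6,553,700 parameters and is tied to the 256 × 256 input size, whereas the convolutional layer needs 1,000 parameters regardless of the image size.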

Terms / Definition / Example

Descriptors
Handcrafted descriptors. Definition: Functions that are implemented using domain expertise for the extraction of features. Example: Canny edge detector, Harris corner detector, and more sophisticated ones, such as HOG, LBP, SIFT, and SURF.
Trainable descriptors. Definition: Functions that learn features from given training sets. Such descriptors can be contour-, color- or texture-based. Example: The trainable COSFIRE approach is an example of a contour- and color-based trainable descriptor (azzopardi2013trainable; gecer2017color).

Classifiers
Linear classifier. Definition: A model that separates a given feature space with a linear boundary. In a 2D space the model is simply a line, while a plane and hyperplanes are used for 3D and higher dimensional spaces. Example: Logistic regression and Support Vector Machine (SVM) with a linear kernel. The latter learns a linear classifier after transforming the input features to a higher dimensional space.
Non-linear classifier. Definition: A model that learns a non-linear function to separate classes. The degree of non-linearity is arbitrary. Generalization error tends to increase with increasing non-linearity. Regularization is a technique used to keep a good tradeoff between non-linearity and generalization. Example: Neural networks (NNs), such as deep NNs (DNNs), fully connected NNs (FCNNs) and convolutional NNs (CNNs), are end-to-end models that learn features from the training set and a non-linear function that maps the input to output via the learned features.

Meta-learners
Weak and strong learners. Definition: A weak learner is a model that performs just better than random guessing, while a strong learner is a model that achieves high performance. Example: Decision trees with very few and with many layers can be considered as weak and strong learners, respectively.
Ensemble modeling. Definition: The combination of outcomes of multiple learners to achieve better predictive performance while reducing the risk of overfitting. Example: Boosting (freund1997decision); an iterative ensemble approach. Bootstrap Aggregating, or Bagging, trains weak learners concurrently on subsets of independent and identically distributed data sampled with replacement. Random forest (breiman2001random); a class of bagging models with decision trees.

Challenges with deep architectures
Vanishing gradient. Definition: Occurs when the partial derivatives of the loss function (which measures the error between the ground truth and the prediction) in a learning algorithm (e.g. gradient descent) become very close to zero. Such negligible derivatives make the learning algorithm unable to propagate useful gradient information from the model output back to the layers near the input of the model. Example: Caused by activation functions (e.g. sigmoid) that map a large input space to only [0,1], leading to very small derivatives even when the input change is large. Using activation functions (e.g. ReLU) that avoid small derivatives is an effective way to address the vanishing gradient problem.
Degradation in DNNs. Definition: Insignificant changes in accuracy with increasing number of layers. Additional deeper layers that give the network the power to calculate a more complex function, without an increase in the training set, may lead to performance saturation and eventually degradation. Example: Residual Neural Networks (ResNet) use residual blocks that connect outputs of stacked layers to the block's input layer with identity skip connections. This allows for uninterrupted flow of information from previous layers to subsequent ones, realising a complex function with fewer parameters and hence less risk of both overfitting and vanishing gradients.
Table 2: Definitions of common terms used in the literature of machine/deep learning frameworks.
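The vanishing gradient entry in Table 2 can be illustrated numerically. During backpropagation the gradient is multiplied by one activation derivative per layer; the sketch below multiplies only those derivatives and deliberately ignores the weight matrices, which is a simplifying assumption. The ten-layer depth and the chosen pre-activation values are also illustrative.

```python
import math


def sigmoid_grad(x):
    """Derivative of the sigmoid; it never exceeds 0.25 (reached at x = 0)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_gradient(grad_fn, pre_activations):
    """Product of activation derivatives along a chain of layers."""
    product = 1.0
    for x in pre_activations:
        product *= grad_fn(x)
    return product


# Ten layers; the sigmoid is evaluated at its most favourable point (x = 0).
print(chained_gradient(sigmoid_grad, [0.0] * 10))
print(chained_gradient(relu_grad, [1.0] * 10))
```

Even in the best case the sigmoid chain shrinks the gradient to 0.25 to the power of 10 (below one millionth), while the ReLU chain passes it through unchanged for active units.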

A typical CNN consists of convolutional filters or shared weight filter banks in each layer of the neural network. Convolutional filters are linear functions that are used for low-level basic operations, such as blurring and edge detection. A 2D convolutional filter uses a 2D matrix, referred to as a kernel, centered on a local region in a given image and applies a linear operation between the kernel coefficients and the respective pixel values in the local region concerned. The resulting scalar value is the response for the considered region. The filter is then slid across the whole image to compute the responses at every location. All responses form what is known as a response map or feature map, which has the same size as the input image if the sliding is done one pixel at a time and the borders are padded. 3D filters operate in a similar way but use 3D kernels and are applied to 3D vEM images. In a CNN, the 2D and 3D filters share their weights across the whole image, which exploits the spatial structure of the data and makes the learned features robust to translation (krizhevsky2012imagenet; yamashita2018convolutional). The spatial extent of the connectivity of a filter with local pixels is called the receptive field or the kernel size of the filter. The response maps obtained by such filters are then processed by a non-linear activation function, before being downsampled by a pooling unit in order to learn abstract representations (lecun2015deep). As a result, the response maps keep shrinking relative to the input image in subsequent convolutional layers, while each of their units covers a progressively larger region of the input. Finally, the response maps are fed to a fully connected layer which determines a label for the given image. Typically, many such convolutional filters (referred to as the number of channels) are applied to a given image or intermediate feature maps to capture various abstract representations of the same input.
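The filter, activation and pooling steps described above can be sketched in a few lines of plain Python. This is a didactic sketch under stated assumptions: zero padding at the borders, a stride of one pixel, 2 × 2 max pooling, and the cross-correlation convention used by most deep learning libraries (the kernel is not flipped). The function names and the toy edge kernel are hypothetical.

```python
def conv2d_same(image, kernel):
    """Slide `kernel` over `image` one pixel at a time with zero padding,
    so that the response map has the same size as the input."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    if 0 <= rr < h and 0 <= cc < w:   # outside counts as zero
                        acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc
    return out


def relu(fmap):
    """Non-linear activation: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """Downsample by keeping the maximum of each 2 x 2 block."""
    h, w = len(fmap) // 2 * 2, len(fmap[0]) // 2 * 2
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]


# Toy image with a vertical edge and a horizontal-gradient (edge) kernel.
image = [[0, 0, 1, 1] for _ in range(4)]
edge_kernel = [[-1, 0, 1]]
response = conv2d_same(image, edge_kernel)
print(response)
print(max_pool2x2(relu(response)))
```

In a CNN the kernel coefficients are not fixed as in this example but learned from the training data, and many kernels are applied in parallel to produce multiple channels.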

Progress in the development of CNNs has led to a plethora of applications including the automatic analysis of medical images. The CNN designed by ciresan2012deep, for instance, was used for the segmentation of neuronal membranes in stacks of EM images. The images were segmented by predicting the label of each local region or patch covered by a convolutional filter in a sliding window approach. Despite its success in winning the 2012 ISBI (IEEE International Symposium on Biomedical Imaging) EM segmentation challenge, the method has two major limitations. First, the sliding window approach is slow as it suffers from redundancy due to the processing of large overlaps between adjacent patches. Second, there is always a trade-off between the size of the patches (context) and full-resolution prediction. It turns out that localisation ability decreases with an increasing context due to downsampling by the many max pooling layers. Improvements in the semantic segmentation of EM images continued with the development of the Fully Convolutional Network (FCN) (long2015fully). FCN allows for variable sized images as input by replacing the fully connected layers of a standard CNN with fully convolutional maps, Fig. 2. Now, the spatial maps in the last layer of an FCN correspond to certain local patches or pixels of an input image depending on the network depth.
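The redundancy of the sliding window approach can be estimated with simple arithmetic. The sketch below assumes one patch per output pixel (stride one, padded borders); the 512 × 512 image and the 65 × 65 patch are illustrative assumptions rather than the settings of any specific reviewed method.

```python
def sliding_window_cost(height, width, patch):
    """Return (number of patches, pixel reads, redundancy factor) for dense
    patch-wise prediction with one square patch per output pixel."""
    patches = height * width                 # one classification per pixel
    pixel_reads = patches * patch * patch    # every patch is read in full
    redundancy = pixel_reads / (height * width)
    return patches, pixel_reads, redundancy


# Hypothetical 512 x 512 section classified with 65 x 65 patches.
print(sliding_window_cost(512, 512, 65))
```

Under these assumptions every input pixel is processed patch × patch times (4,225 times here), whereas a fully convolutional network computes shared feature maps once for the whole image.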

Figure 2: Encoder-decoder networks for (a) FCN and (b) U-Net. Each of the first four convolutional layers in the encoder is followed by the nonlinear activation function ReLU and max pooling. The last layer uses a softmax function to assign a class probability score to each pixel. (a) The FCN decoder includes an upsampling component that is linearly combined with the low-level feature maps in the third convolutional layer of the encoder. The sizes of these feature maps are 4 times smaller than the size of the input image. Finally, there is a direct upsampling from this reduced size to the original input size, followed by softmax for classification. (b) The symmetrical U-Net architecture shares the feature maps in the encoder with the decoder path by means of skip connections.

Encoder-decoder architectures of the FCN type have been catalysts in providing better localization and use of larger context. The encoder (contracting path) extracts features from the input, and the decoder (expanding path) converts the abstract features into the spatial resolution of the original image for precise localization, Fig. 2. In practice, the encoder consists of a set of convolutional layers in an FCN network that successively downsamples the input to intermediate spatial maps up to some extent. Upsampling in the decoder is then performed by using transposed convolutions on interpolated feature maps.
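How a transposed convolution upsamples a feature map can be shown in one dimension. The sketch below is a simplified illustration: it uses a single channel, no padding, and hand-picked kernels, whereas a decoder learns its kernels during training. The function name is hypothetical.

```python
def transposed_conv1d(x, kernel, stride):
    """1D transposed convolution: each input value stamps a scaled copy of
    the kernel into the output, with consecutive copies `stride` apart."""
    k = len(kernel)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


# A kernel of ones with stride 2 repeats every value (nearest-neighbour style).
print(transposed_conv1d([1, 2], [1, 1], stride=2))
# A triangular kernel interpolates linearly between neighbouring values.
print(transposed_conv1d([1, 2, 3], [0.5, 1, 0.5], stride=2))
```

The output length is (n - 1) × stride + k for an input of length n and a kernel of length k, which is how the decoder regains spatial resolution step by step.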

The decoder also captures multi-scale features using skip connections to fuse feature maps from shallow layers, providing larger context. Skip connections bypass some of the neural network layers and take the output of one layer as the input to the subsequent ones. As a result, an alternative and shorter path is provided for backpropagating the error of the loss function, which also contributes to avoiding the vanishing gradient and the degradation problems in deep networks, Table 2. In principle, skip connections also allow for a better upsampling in higher layers. For instance, the symmetric U-Net architecture transfers full feature maps from the encoder to the decoder paths, achieving the best segmentation results of neuronal membranes in EM images in the ISBI 2015 challenge (ronneberger2015u). In order to reduce the load on memory, another encoder-decoder architecture, namely SegNet, was proposed to transfer only the pooling indices between the encoder and decoder paths (badrinarayanan2017segnet). Nevertheless, U-Net can capture more complex information and can thus outperform SegNet in terms of accuracy. Such improvement is attributable to the U-Net's ability to use learnable upsampling filters and stored encoder layer outputs concatenated using skip connections.
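The difference between transferring pooling indices and transferring full feature maps can be sketched in one dimension. The code below illustrates only the index-based variant, under simplifying assumptions (1D signal, window of two, zeros at non-maximal positions); the function names are hypothetical and do not correspond to a particular library API.

```python
def max_pool_with_indices(x, window=2):
    """Max pooling that also records where each maximum was found."""
    pooled, indices = [], []
    for start in range(0, len(x) - window + 1, window):
        block = x[start:start + window]
        offset = max(range(window), key=lambda i: block[i])
        pooled.append(block[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool(pooled, indices, length):
    """Place each pooled value back at its recorded position; the remaining
    positions stay zero and are filled in by later convolutions."""
    out = [0.0] * length
    for value, index in zip(pooled, indices):
        out[index] = value
    return out


signal = [1.0, 5.0, 2.0, 0.0, 7.0, 3.0]
pooled, indices = max_pool_with_indices(signal)
print(pooled, indices)
print(max_unpool(pooled, indices, len(signal)))
```

Only one index per pooling window has to be stored, which is the memory saving referred to above. A U-Net style skip connection would instead keep the whole encoder feature map and concatenate it with the upsampled decoder map.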

The concept of convolutional output layers in FCNs enables the conversion of popular DNNs, such as AlexNet, VGGNet, ResNet and GoogleNet (Inceptionv1) (krizhevsky2012imagenet; simonyan2014very; he2016deep; szegedy2015going), into encoder-decoder architectures for semantic segmentation. In principle, DNNs are universal approximators that can realise a complex function with even two layers of the network. In practice, however, shallow neural networks with a few layers are not adequate to learn robust models. Deeper networks with small receptive fields, such as VGG-16 and AlexNet, became popular for image classification due to their generalization ability. DNNs have a large number of parameters to learn using the loss function (or error) that quantifies the penalty of the predicted values with respect to the desired ones. The addition of layers to make an architecture deeper brings with it other scientific challenges, such as the vanishing gradient and network degradation, Table 2. Methods to address these challenges with training deeper networks include different strategies in initializing network parameters (glorot2010understanding), training networks in multiple stages (simonyan2014very), and using companion loss functions or auxiliary supervision in the middle layers (szegedy2015going). Residual Neural Networks (ResNets) are among the most influential recent networks in computer vision, as they overcome the vanishing gradient and the degradation issues simultaneously (he2016deep), Table 2.
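The identity skip connection of a residual block amounts to adding the block's input to the output of its stacked layers. The sketch below shows this on plain lists; the layers are stand-in functions chosen for illustration, not trained network layers, and the function name is hypothetical.

```python
def residual_block(x, layers):
    """Compute F(x) + x, where F is the composition of `layers`."""
    fx = list(x)
    for layer in layers:
        fx = layer(fx)
    return [f + identity for f, identity in zip(fx, x)]


def zero_layer(values):
    """Stand-in for stacked layers that have learned to output zero."""
    return [0.0 for _ in values]


def half_layer(values):
    """Stand-in for stacked layers that scale their input by one half."""
    return [0.5 * v for v in values]


print(residual_block([1.0, 2.0, 3.0], [zero_layer]))
print(residual_block([1.0, 2.0, 3.0], [half_layer]))
```

When the stacked layers contribute nothing, the block reduces to the identity, so adding such a block cannot make the representation worse. This is the intuition behind using residual blocks against degradation; the identity path also gives gradients a direct route to earlier layers.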

Spatial pyramid pooling and dilated (or atrous) convolutions have been introduced by the DeepLab family of deep segmentation architectures. DeepLab models addressed the challenges of achieving robustness for different sizes of input images and considering larger context without increasing computational complexity, respectively (chen2014semantic; chen2017deeplab; chen2017rethinking; chen2018encoder). Moreover, multi-scale context aggregation using the Pyramid Scene Parsing Network (PSP-Net) or spatial attention using the attention U-Net have also gained popularity for their ability to capture larger context (zhao2017pyramid; oktay2018attention). The ability to deal with varying scale while capturing more context is particularly important in EM analysis where the neighbourhood of a structure may have an impact in determining a precise delineation.
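The way dilated convolutions enlarge context without adding weights can be sketched in one dimension. The example assumes stride one and no padding; the kernel, input and dilation rates are illustrative and the function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a kernel whose taps are `dilation` pixels apart."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions with given dilations."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


def dilated_conv1d(x, kernel, dilation):
    """'Valid' 1D dilated convolution (cross-correlation convention)."""
    span = (len(kernel) - 1) * dilation
    return [sum(w * x[i + j * dilation] for j, w in enumerate(kernel))
            for i in range(len(x) - span)]


print(effective_kernel_size(3, 2))
print(stacked_receptive_field(3, [1, 2, 4]))
print(dilated_conv1d([1, 2, 3, 4, 5], [1, 0, -1], dilation=2))
```

A 3-tap kernel with dilation 2 spans 5 input positions, and three stacked 3-tap layers with dilations 1, 2 and 4 see 15 positions, while each layer still has only three weights per filter.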

3 Literature search

The following search query was used in both PubMed and Web of Science on words in titles only, restricted to 2017-2021: TI=((electron microscopy OR EM) AND (segmentation OR supervised OR unsupervised OR self-supervised)) NOT cryo. Cryo-EM (kucukelbir2014quantifying) was excluded because it involves molecule datasets as opposed to cellular EM. Results from the query that are beyond our scope were excluded.
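The boolean logic of the title query can be mimicked with a small filter, for instance to re-check an exported list of titles. This is only an approximation under stated assumptions: matching is case-insensitive on whole words, hyphens are treated as spaces, and the exact tokenization rules of PubMed and Web of Science may differ. The function names and example titles are hypothetical.

```python
import re

MODALITY_TERMS = ["electron microscopy", "EM"]
TASK_TERMS = ["segmentation", "supervised", "unsupervised", "self-supervised"]
EXCLUDED_TERMS = ["cryo"]


def normalize(text):
    """Lower-case, split on non-alphanumerics, and pad with spaces so that
    substring tests only match whole words or whole phrases."""
    return " " + " ".join(re.findall(r"[a-z0-9]+", text.lower())) + " "


def contains(title, term):
    return normalize(term) in normalize(title)


def title_matches(title):
    """(modality term) AND (task term) AND NOT (excluded term)."""
    has_modality = any(contains(title, t) for t in MODALITY_TERMS)
    has_task = any(contains(title, t) for t in TASK_TERMS)
    excluded = any(contains(title, t) for t in EXCLUDED_TERMS)
    return has_modality and has_task and not excluded


for title in ["Unsupervised segmentation of mitochondria in EM images",
              "Cryo-EM particle segmentation",
              "Segmentation of fluorescence microscopy images"]:
    print(title_matches(title), title)
```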

A detailed review of the resulting 28 papers (Table 3) is given in terms of the data annotation and studied structures, the segmentation approaches, and computational scalability.

Classifier Type Dataset EM type Structures Study
Residual Deconvolutional Network (RDN) S - - - Membranes fakhry2016residual
DeepEM3D S - - - - Membranes zeng2017deepem3d
Feature selection and boosting S - - - - Mitochondria, synapses, membranes
Pre-trained networks S - - - Neuropil, axons drawitsch2018fluoem
2D Convolutional Neural Network (CNN) S - - - - Mitochondria oztel2018deep
3D Residual FCN S - - - Mitochondria xiao2018automatic
Two-stream U-Net UN - - - Mitochondria, synapses bermudez2019visual
Random forest S - - - - Glomerular basement membrane cao2019automatic
Fully Convolutional Network (FCN) S - - - - Mitochondria dietlmeier2019few
Fully Residual U-Net (FRU-Net) S - - - - Membranes gomez2019deep
3D Convolutional Neural Network (CNN) S - - Cells, mitochondria, membranes guay2019transfer
Residual Neural Network (ResNet) S - - - - Neural cell body, cell nucleus jiang2019effective
Morphological operators/superpixels UN - - - Mitochondria, membranes karabaug2019segmentation
Random forest S - - Mitochondria peng2019mitochondria
U-Net, ResNet, HighwayNet, DenseNet S - - - - Membranes urakubo2019uni
DenseUNet S - - - - Mitochondria cao2020denseunet
Fully residual CNN (FR-CNN) S - - - - Membranes he2020deep
HighRes3DZMNet S - - - - Mitochondria, endolysosomes mekuvc2020automatic
U-Net, autoencoder UN - - Mitochondria peng2020unsupervised
DeepACSON S - - - - Axons abdollahzadeh2021deepacson
2D-3D hybrid network S - - - Cells, granules, mitochondria guay2021dense
3D U-Net S - - - - Up to 35 sub-cellular structures heinrich2021whole
CDeep3EM, EM-Net, PReLU-Net, ResNet S - - - - Mitochondria khadangi2021stellar
Hierarchical encoder-decoder (HED-Net) S - - Mitochondria luo2021hierarchical
Generative adversarial network (GAN) UN - - - - Cytoplasmic capsids shaga2021improved
Annotation-crowd-sourcing, ‘Etch a Cell’ S - - - - ER, mitochondria, nucleus spiers2021deep
U-Net SS - - - - Membranes takaya2021sequential
Hierarchical view CNN (HIVE-Net) S - - Axons, nuclei, mitochondria yuan2021hive
Table 3: Selected research papers for review. Abbreviations used - S (Supervised), UN (Unsupervised), SS (Semi-supervised), F (FIB-SEM), SS (Serial section), SB (SB-SEM).

4 Data, anatomical structures and annotation

Early examples of automated segmentation include the reconstruction of brain tissues for connectomics, which is the mapping of neuronal cell bodies and their connections (synapses). The pioneering work on automated neuronal membrane segmentation by ciresan2012deep showed the success of neural networks on EM images in the ISBI 2012 challenge. It has motivated more research activity in this direction, resulting in key research papers such as the U-Net architecture (ronneberger2015u). Properties of the most commonly used datasets are shown in Table 4.

Acquisition Dataset Region Imaged tissue parameters (x,y,z) Reference
Volume () Pixels Resolution ()
ssTEM ISBI 2012 / Drosophila I VNC Nervous cord (Drosophila) ciresan2012deep
ssSEM ISBI 2013 / SNEMI3D Cerebral cortex (Mouse) arganda20133d
ssSEM Kasthuri dataset Neocortex (Mouse) kasthuri2015saturated
SB SEM HeLa cells Cultured cells (Human) iudin2016empiar
FIB-SEM EPFL Mouse Hippocampus CA1 Hippocampus (Mouse) lucchi2013learning
FIB-SEM FIB-25 Optic lobe (Drosophila) takemura2015synaptic
Table 4: Commonly re-used 3D EM datasets for image analysis improvement.

Large-scale connectome datasets mostly focused on sub-cellular components related to brain cells such as synapses, pre-synaptic sites, post-synaptic sites, axons and mitochondria (takemura2015synaptic; kasthuri2015saturated). With open cell organelle datasets and challenges to reconstruct tissues at the cellular level (heinrich2021whole), several organelles besides mitochondria, namely nucleus, plasma membrane, endoplasmic reticulum, nuclear envelope, vesicles, lysosomes, endosomes and microtubules, have now become the focus in open-source datasets.

Detailed annotation of large-scale EM images is required for automatic image segmentation and analysis of structures. Based on the 28 reviewed papers, three annotation approaches are considered: 1) human annotators; 2) software tools for biologists; and 3) specialized microscopy imaging modalities.

4.1 Human annotators

Human annotation is split into two categories, annotations by one or few experts, and collaborative annotations by large groups of experts and non-experts. The latter is referred to as crowd-sourced annotation. Projects such as Etch-a-Cell enable annotation through crowd-sourcing (spiers2021deep). Crowd-sourcing allows volunteers to participate in large-scale annotation tasks digitally, with the aid of tutorials and other guided workflows. For example, volunteers are invited to annotate certain structures, such as endoplasmic reticulum (ER), mitochondria or nuclei in order to collect ground truth segmentation labels for supervised machine learning algorithms. As human segmentations can be erroneous due to imprecise delineation of structures or bias, consensus between various volunteer segmentations is often used as ground truth.
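One simple way to derive a consensus from several volunteer segmentations is a per-pixel majority vote. The sketch below assumes binary masks of equal size and a strict majority threshold; these are illustrative choices, and actual crowd-sourcing projects may aggregate annotations differently. The function name is hypothetical.

```python
def consensus_mask(masks, min_fraction=0.5):
    """Per-pixel vote over binary masks: a pixel is foreground when more
    than `min_fraction` of the annotators marked it."""
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [[1 if sum(m[r][c] for m in masks) / n > min_fraction else 0
             for c in range(width)]
            for r in range(height)]


# Three hypothetical volunteer annotations of the same 2 x 3 region.
volunteers = [
    [[1, 1, 0], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 1]],
    [[1, 1, 1], [0, 0, 0]],
]
print(consensus_mask(volunteers))
```

Raising `min_fraction` makes the consensus more conservative, which trades missed structures for fewer false delineations.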

Compared to crowd-sourcing, expert annotations are more accurate but require more time and resources. For example, labeling structures in 30 platelets from a small fraction of the imaged samples required 9 months from two human annotators (guay2021dense). Also, the large scale connectomics project required extensive labelling of ground truth data (plaza2018analyzing), and the proofreading (the manual correction of automatic segmentation of image data) of the ground truth dataset took around 5 years in the work of takemura2015synaptic. Manual segmentation of serial section SEM for axonal reconstruction was estimated to require 100 days for a volume of


4.2 Biomedical image analysis software

Biomedical image analysis software applications are black box tools that allow many biologists to annotate or proofread EM datasets without having to know the underlying mechanisms. Large-scale connectome reconstruction promoted the development of various tools for proofreading and analysis of the reconstructed data. Most of these tools were developed for proofreading and analysis of the connectome (a map showing how all neurons are connected to each other) in serial sectioned image stacks (ssSEM, ssTEM) and block-face images (SB SEM, FIB-SEM). Such tools mostly come with 2D viewers and 2D annotations using brush/flood-fill functions, together with 3D viewers for visualization, Table 5. Generally, the reconstructed or annotated maps are corrected by external analysis tools that read data from servers. The data storage format in the server and the mechanism of data distribution to the client determine how well the tool can scale for annotation or proofreading of larger volumes.

No single software tool is typically enough for the entire automated image analysis pipeline. Attempts have been made at bundling several tools into packages for off-the-shelf segmentation. UNI-EM (urakubo2019uni) provides the entire segmentation workflow including pre-processing, data preparation, training, inference, postprocessing, proofreading and visualization. One of its main components is DOJO, as a web-based tool for proofreading (haehn2014design). An additional 3D annotator is included in the package for correcting 3D segmentations using a surface mesh-based 3D viewer. Automatic segmentations are performed by Python-based CNN models such as U-Net, ResNet, Highway Net, DenseNet and flood filling networks (srivastava2015highway; huang2017densely; januszewski2018high), combined with a TensorBoard monitor for performance monitoring (mane2015tensorboard). The user does not need programming skills to use these models.

TrakEM2 is another software tool for optimized manual and semi-automatic reconstruction of neuronal circuits in serial section tera-scale EM images (cardona2012trakem2; schmid2010high). The TrakEM2 plugin is widely used through Fiji for neuronal tracking and registration of 3D EM stacks of neuronal structures (xiao2018automatic; jiang2019effective; ciresan2012deep; li2018fast). CATMAID is a companion tool for TrakEM2 for data storage, management and collaborative annotation of large-scale 3D ssTEM datasets using a client-server model (saalfeld2009catmaid). However, 3D neurite tracing to the next lateral 2D image plane obtained from a server can be suboptimal due to low bandwidth and network latency. To overcome this challenge, webKnossos stores and transmits 3D SB SEM datasets in the form of small 3D voxel blocks, as used in Knossos (a standalone data annotation application for connectomics) (helmstaedter2011high). WebKnossos is a cloud/browser-based 3D annotation tool for large-scale distributed data analysis (boergens2017webknossos). An open-source demo of dense connectome reconstruction using webKnossos is given by motta2019dense. Similar to CATMAID, VAST also has a data management framework with specific file format conversion before analysis (berger2018vast). A VAST-compatible dataset by kasthuri2015saturated is publicly available for annotation, which makes it directly usable in an online collaborative framework for large-scale analysis.

Feature/Software | Visualization client | Annotation client | Collaborative annotation | Data conversion format | Scalability | Train/Test | Inclusion of custom algorithm
UNI-EM | 2D, 3D/DOJO | 2D DOJO, 3D Mesh | Web-based | PNG/TIFF | High | DNN | Python script
webKnossos | 2D, 3D/webKnossos | 2D/webKnossos | Web-based | Knossos | High | - | Python script
FlyEM | 2D, 3D/NeuTu | 2D/NeuTu | Web-based | DVID | High | - | -
TrakEM2 | 2D, 3D/TrakEM2 | 2D/TrakEM2 | Web-based | CATMAID | Moderate | - | -
VAST | 2D, 3D/VAST | 2D/VAST | Central server | VAST | High | - | -
Ilastik | 2D, 3D/Ilastik | 2D/Ilastik | - | HDF5 | Low | Random forest | -
Weka | 2D/ImageJ | 2D/ImageJ | - | PNG/TIFF | Low | Random forest | Java script
IMOD | 2D/IMOD | - | - | - | Camera specific | Undefined | -
Table 5: Reviewed biomedical image analysis software for manual or semi-automatic image segmentation, annotation and proofreading.

The FlyEM project by the Janelia Research Lab aims to fully reconstruct the neural connectivity of the Drosophila fly brain using EM imaging. Tools and software packages developed in that research lab for proofreading image volumes or for crowd-sourcing are open-source. NeuroProof is a tool (plaza2014focused) that introduces focused proofreading, which makes use of automated segmentation and prior information (synapse connectivity) to restrict proofreading to the sites that are most impactful for completing the neural circuit; see Section 6 for more detail. The Janelia research tools, such as Raveler and NeuTu, aim at interactive proofreading in a distributed and scalable manner (farm2014raveler; zhao2018neutu). Another software package, namely DVID (Distributed, Versioned, Image-oriented Dataservice), facilitates the distribution of and access to all datasets through a higher-level API, without requiring any specific file format (katz2019dvid).

Ilastik and Trainable Weka Segmentation (TWS) are other tools that are used as plugins in Fiji to segment synaptic junctions and mitochondria (sommer2011ilastik; carreras17). Ilastik originally focused on segmenting synaptic junctions and was further developed into a platform that includes fast interactive training through shallow classifiers (random forests) (berg2019ilastik; kreshuk2011automated). Weka, considered a de facto standard for evaluating segmentation, is a trainable classifier plugin in Fiji that also allows the training of a random forest classifier for binary segmentation. TWS provides options to select the features and to load a stored classifier to visualize results on test images, and it also provides performance metrics. Both tools are extensively used for labeling neuronal 3D images. Another tool, IMOD, can semi-automatically register 3D EM serial sections and was used to segment structures such as small extracellular vesicles (kremer1996computer; gomez2019deep).

The above review of annotation tools shows a trend towards semi-automated segmentation on web frameworks for collaborative proofreading. Handling large volumes of data through a common API in a distributed and scalable backend framework has been the focus of efforts to improve existing proofreading and annotation tools.

4.3 The CLEM microscopy technique

Different modalities of microscopy can add information to EM images and can be used to improve automated analysis. Correlative Light and Electron Microscopy (CLEM) is a technique in which structures targeted with fluorescent probes are visualized using light microscopy (LM) and placed in the cellular or subcellular context provided by EM (ando2018). drawitsch2018fluoem performed CLEM for 40-50 EM sections of a 3D connectome dataset using a workflow for LM-to-EM registration. The workflow included the LM reconstruction of fluorescently labeled 3D data, the local reconstruction of all axons in small subvolumes of the respective EM data, and the determination of the axons that best explained the LM signal. Besides webKnossos, which they used for manual reconstruction, they also utilised the partially automatic methods proposed in (berning2015segem; dorkenwald2017automated; staffler2017synem; januszewski2018high). Other attempts at 2D CLEM are based on EM workflows or markers to speed up image registration.

5 Segmentation approaches

The taxonomy illustrated in Fig. 3 contains the main segmentation methodologies reviewed in this section.

Figure 3: Categorization of deep learning methods used for EM segmentation. The gray shading is used for clarity; it indicates which components belong to which category (supervised, semi-supervised or unsupervised).

5.1 Supervised learning

Supervised learning in segmentation refers to the family of machine learning algorithms that use a set of annotated images (training data) to create a computational model that can segment structures in unseen images (test data). The training set is used by the algorithm to determine the model’s parameters in such a way as to maximize the model’s generalization ability.

5.1.1 End-to-end learning

End-to-end learning refers to the use of gradient-based methods to adjust the parameters of a complex deep neural network as a whole (glasmachers2017limits). In principle, increasing network complexity - e.g. by increasing the number of layers - allows for more complex functions to be learned. In practice, however, there are technical challenges that limit a network’s learning capacity, like vanishing gradients and network degradation, Table 2.

Neuronal membranes and ResNets

The ResNet architecture attempts to counter such limitations with residual blocks (he2016deep). Skip connections enable the design of very deep networks while propagating the gradient of the loss function through a lower number of layers. Several variants of the U-Net architecture using residual blocks for 2D segmentation have been proposed for neuronal structures such as membranes, neural cell bodies and cell nuclei.
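The effect of a skip connection on gradient flow can be made explicit with a short derivation. The notation is assumed here for illustration only: x_l is the input of residual block l, F is the residual function with weights W_l, and \mathcal{L} is the loss. It follows the usual identity-mapping formulation of residual blocks rather than any specific network reviewed here.

```latex
x_{l+1} = x_l + F(x_l; W_l)
\quad\Longrightarrow\quad
x_L = x_l + \sum_{i=l}^{L-1} F(x_i; W_i),
\qquad
\frac{\partial \mathcal{L}}{\partial x_l}
  = \frac{\partial \mathcal{L}}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i; W_i) \right)
```

The additive term 1 shows that part of the gradient reaches block l directly from the deeper block L, without passing through the intermediate weight layers, which is why residual designs are less prone to vanishing gradients.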

FusionNet, a variant of U-Net that uses residual blocks at each level of the network, outperforms the standard U-Net architecture in the task of neuronal membrane segmentation (quan2016fusionnet). A variant of FusionNet with two types of residual blocks improved segmentation performance over both the traditional U-Net and FusionNet (he2020deep). The improvement is measured in terms of segmentation continuity or integrity, for example the removal of mitochondria or small vesicles that appear in the 2D sections of the EM ISBI 2012 benchmark dataset.

A fully residual variant of U-Net, FRU-Net, showed better generalization than the standard U-Net in detecting small extracellular vesicles in TEM images (gomez2019deep).

DenseNets connect each layer to every other layer in a feed-forward manner and help to easily train and regularize DNNs. Better multi-scale feature extraction with limited parameters helped in achieving competitive segmentation performance on the ISBI 2012 EM dataset using DenseUNet, a combination of U-Net and DenseNet (cao2020denseunet).

The deconvolutional network of noh2015learning introduced learnable unpooling layers to reverse the pooling effect differently than the simple decoder networks that use interpolated unpooling. The Residual deconvolutional network (RDN) by fakhry2016residual captured all missed borders and showed very minimal inconsistencies in continuity of membrane detection across the 3D slices on the ISBI 2013 dataset.

An architecture combining ResNet with atrous convolutions in the last layer was used for the segmentation of neural cell bodies and nuclei (jiang2019effective). Multi-scale contextual feature integration, implemented with spatial pyramid pooling in the decoder, helps integrate high-level, low-resolution features with low-level, high-resolution features. The pyramid approach outperforms U-Net and Deeplab v3+ and does not require post-processing, as verified in 3D using TrakEM2.

Mitochondria and efficient 3D networks

Mitochondrial distribution inside a cell and alterations in its shape are closely related to neuronal degeneration (leading to disorders such as Alzheimer’s or Parkinson’s disease) or cellular death (cancer studies). High-resolution automatic analysis is required to study the physiological changes in mitochondria. This is possible through serial-section EM techniques such as FIB-SEM and SB SEM. A 2D processing pipeline for mitochondria segmentation was performed by direct upsampling of the encoder features (oztel2018deep). The shallower architecture and faster segmentation network processes smaller blocks or patches, performing better than whole image processing using SegNet (badrinarayanan2017segnet).

A fully supervised deep 3D residual network that models spatial and temporal information using a hybrid 2D-3D design was proposed by xiao2018automatic. The hybrid 2D-3D network refers to the use of 3D convolutions at the start and end of an encoder-decoder network. Using 2D max-pooling instead of 3D in the same network helped in capturing anisotropy for SB SEM datasets. Auxiliary supervision on the mid-level features was key for better deep supervision to avoid the vanishing gradient problem. Compared with architectures such as U-Net and 3D U-Net, the proposed network needs less proofreading owing to its better segmentation accuracy.

Segmentation of serial-section EM datasets using only 2D DNNs suffers from coarser outputs due to discontinuities, as 2D networks do not account for inter-slice information, whereas 3D DNNs require a lot of memory. A solution was provided by HIVE-Net, a multi-task pseudo-3D residual network that makes use of multi-view 2D convolutions (yuan2021hive). HIVE-Net extracts features using four branches: three orthogonal views and a final branch to calculate context information from one of the views. Experiments show that HIVE-Net, with fewer parameters, achieves state-of-the-art performance in comparison to deep learning models such as U-Net, 3D U-Net or the hybrid 2D-3D network (xiao2018automatic).

Shape-based segmentation for mitochondria, axons and nuclei

Shape information can also be used for faster and more robust segmentation of axons, mitochondria and nuclei in large volumes of serial-section EM datasets (abdollahzadeh2021deepacson). The problem of under-segmentation (when pixels belonging to different semantic objects are grouped into a single region) mainly occurs in low-resolution images with a large field of view, where distinct image features are missing. DeepACSON, a Deep learning-based AutomatiC Segmentation of axONs, uses a shape-based delineation method to segment coarser structures in large field-of-view images. DeepACSON outperformed DeepEM2D/3D and FFN (zeng2017deepem3d; januszewski2018high) with respect to morphological measurements, using the ovality of mitochondria, the tubularity of axons and the circularity of nuclei. DeepACSON shows that effective and faster analysis is possible even at low resolution, on par with high-resolution morphological analysis that involves highly detailed local calculations.

Robust segmentation of locally heterogeneous mitochondrial shapes under complex backgrounds in high-resolution EM datasets may be achieved by leveraging shape information in a Hierarchical Encoder Decoder (HED) network (luo2021hierarchical). Based on the eccentricity (elliptic or circular shape) of mitochondria, ground-truth labels are sub-categorized into two sets for training a first-stage multi-task network. A second-stage network is then trained on all segmentation labels for better robustness. The two-stage network yields fewer false positives and fewer missed detections in the segmentation of mitochondria when compared with U-Net, 3D U-Net, and HIVE-Net.

Whole cell 3D reconstruction

Robust and scalable automatic methods for whole-cell 3D reconstruction are required to study the intricate organisation of thousands of structures inside a cell. Networks were trained on FIB-SEM blocks from 5 different cell types and EM preparation methods to segment around 35 different cellular organelles. The experimental results demonstrated that training on a diverse set of all samples improves generalizability more than training on only one specific target sample. These comprehensive datasets and the trained models for 3D reconstruction of cells were made open-source by heinrich2021whole to explore local cellular interactions and their arrangements in tissues at high resolution.

5.1.2 Ensemble learning

Ensemble learning methods combine the outputs of multiple network predictions for better robustness. Pixel-wise or voxel-wise averaging and majority or median voting are amongst the main aggregation methods. Robust segmentation of structures in high-throughput ssSEM or SB SEM images can be achieved using ensemble models, irrespective of issues such as misalignment in the image stacks.
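The aggregation rules named above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `aggregate`, the nested-list image representation and the 0.5 threshold are hypothetical choices made for the example, not part of any reviewed method.

```python
from statistics import mean, median


def aggregate(prob_maps, method="mean", threshold=0.5):
    """Combine per-pixel foreground probabilities from several models.

    prob_maps: list of 2D lists (one per model), all with the same shape.
    Returns a 2D list of binary labels (0 = background, 1 = foreground).
    """
    n_rows, n_cols = len(prob_maps[0]), len(prob_maps[0][0])
    labels = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            values = [pm[r][c] for pm in prob_maps]
            if method == "mean":        # pixel-wise averaging
                score = mean(values)
            elif method == "median":    # median voting
                score = median(values)
            elif method == "majority":  # each model casts a binary vote
                votes = sum(v >= threshold for v in values)
                score = 1.0 if 2 * votes > len(values) else 0.0
            else:
                raise ValueError(f"unknown method: {method}")
            row.append(1 if score >= threshold else 0)
        labels.append(row)
    return labels


# Three hypothetical model outputs for a 1x3 image.
maps = [
    [[0.9, 0.4, 0.45]],
    [[0.8, 0.6, 0.45]],
    [[0.3, 0.7, 0.90]],
]
for rule in ("mean", "median", "majority"):
    print(rule, aggregate(maps, rule))
```

In this toy input the last pixel is labeled foreground by averaging but background by median and majority voting, because a single confident model pulls the mean above the threshold.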

The DeepEM3D model was used in one such ensemble method proposed by zeng2017deepem3d for segmenting neuronal boundaries. DeepEM3D uses deep inception-residual modules in the encoding path and multi-scale contextual feature aggregation in the decoding path (zeng2017deepem3d). Variants of the DeepEM3D architecture were trained so that each model is selective for neuronal boundaries of a different thickness. Voxel-wise averaging of predictions accounted for misalignment and anisotropy in ssSEM image stacks.

Random forests are examples of ensemble approaches that can efficiently combine multiple decision trees for improved generalization. A stack of random forests (RFS) was, in fact, investigated for the segmentation of the glomerular basement membrane from TEM images based on different intensity ranges (cao2019automatic). Segmentation was then performed by training zoom-view random forests based on groups of membrane intensity images and one full-view random forest taking pixels sampled from all groups. The two-level integrated RFS enhances generalization across different gray-scale intra-image variations and different morphologies of the membrane.

Structured prediction is a family of modelling techniques that forecast a set of values rather than a scalar value. In the segmentation context, structured prediction methods seek the joint prediction of the label of the pixel under consideration as well as the labels of the extended neighbourhood. The hierarchical approach by peng2019mitochondria demonstrates the effectiveness of such structured prediction methods. It uses an iterative procedure to fine-tune the segmentation based not only on the local preserving projection (LPP) handcrafted features but also on the class labels determined in previous iterations. Structured and cascaded approaches tend to improve the segmentation results by reducing the number of false positives and false negatives.

Simultaneous segmentation of more than one structure using image descriptors at different scales can be beneficial for brain tissue structures in connectomics. cetina2018multi focus on concurrently segmenting two structures, namely synapses with mitochondria and mitochondria with membranes, using multi-scale feature selection and boosting. Boosting with the PIBoost algorithm, which is a multi-class generalization of AdaBoost with weak binary learners (fernandez2014multi), shows better accuracy for both isotropic and anisotropic stacks (SB SEM) due to its multi-scale representation ability and robustness to class imbalance.

Dense segmentation of cells and a large number of their organelles in high-throughput SB EM images is important for 3D whole cell modeling. Hybrid 2D-3D and 3D convolutional modules with separate output heads were proposed by guay2021dense. Multi-loss training was performed based on outputs from the hybrid 2D-3D network and the 3D spatial pyramid module. An ensemble of randomly initialized instances of the same network, each trained on only a portion of the SB-SEM volume for seven classes in platelet cells, was the most robust to structural variations in large datasets and outperformed individual 2D and 3D U-Net approaches.

Multiple network outputs were also combined using a workflow for binary EM segmentation provided by the EM-stellar platform (khadangi2021stellar). The network architectures chosen for experimentation were CDeep3EM, EM-Net, PReLU-Net, ResNet, SegNet, U-Net and VGG-16 (haberl18; khadangi2021net; he2015delving). A cross-evaluation using a heatmap of different evaluation metrics and networks shows that configuring an ensemble of various architectures is required to obtain the best results.

khadangi2021stellar also demonstrated that no single deep architecture performs consistently well, and that is why ensemble approaches may have an edge over individual methods.

Ensemble learning methods aim to provide tools for setting up workflows for robust segmentation of large-scale EM datasets. Rapid initialization and training workflow for networks such as EM-stellar, CDeep3EM and other pre-trained models can help improve algorithms and compare their robustness in new EM datasets with minimal annotations.

5.1.3 Transfer learning

Transfer learning aims to adapt the knowledge acquired from one dataset to another, and is mostly used when an application has an insufficient number of training samples. A model that is pre-trained on a certain dataset is used as the initialization of another model for a different dataset. The pre-trained model is then fine-tuned, usually in the final layers, with the training samples of the new dataset. Through transfer learning, the same neural network segmentation pathways have been used across biological systems (cells and tissues) in 3D (guay2019transfer). A three-class segmentation was proposed by mekuvc2020automatic for the delineation of mitochondria, endolysosomes and background. The domain information of the larger number of mitochondrial structures, which have a similar texture to endolysosomes, was used to learn a binary classifier against background. The resulting weights were used to initialize a three-class model in which only the last layer was fine-tuned to distinguish between the classes. Transfer learning thus enables the modelling of a dataset even when there are limited labelled data.

Fine-tuning a pre-trained network, however, comes with the risk of over-fitting on the few labeled training examples of the new application. This challenge has opened up new research avenues, typically known as few-shot learning and domain adaptation. Few-shot learning is a kind of meta-learning approach, where the idea is to learn how to learn from the pre-trained model rather than aiming to generalize from the training data (shaban2017one). An example is given by dong2018few about learning to understand the similarity or difference between objects. A pre-trained model is made to learn the similarity or difference between classes on the few labeled samples referred to as the support set. Few-shot learning thus depends on learning features between a pre-trained model and a support set for transfer learning with few labeled examples. Complex non-linear mitochondrial morphology was captured using active feature selection and boosting (dietlmeier2019few). The VGG-16 model pre-trained on the ImageNet database was used as a feature extractor for mitochondria segmentation. Hypercolumns that contain the activations of all CNN layers for each pixel were passed through a linear regressor for selecting features. Only 20 patches or blocks, taken from 2 of the 100 images in a FIB-SEM stack, were used for training a gradient-based boosting classifier (XGBoost). The proposed few-shot learning shows that far less training data is required as compared to training with many blocks using XGBoost, and that even a single training sample (single-shot) can yield competitive segmentation accuracy.

Domain adaptation is also a form of transfer learning, where the source and target datasets share the same labels (classes) but have a different data distribution. Changes in data distribution can be due to slightly different experimental parameters during EM imaging or due to the imaging of different tissue types or body locations. roels2019domain aimed to learn a latent space with a shared encoder in such a way that the source (annotated samples) and target (few annotated samples) representations are aligned in that feature space. Transferring knowledge from EM images to HeLa cells and from isotropic to anisotropic SEM images is possible using learning in latent space for domain adaptation. In contrast, a two-stream U-Net was jointly trained using a differential loss function to regularize two U-Nets (one for each domain) to avoid domain shift (bermudez2018domain). The proposed two-stream U-Net shows that only a small fraction of labeled target data was required for domain adaptation to achieve state-of-the-art performance when compared to a U-Net trained on fully annotated data.

5.2 Semi-supervised learning

Semi-supervised approaches aim to make use of unlabeled data for training along with a small set of labeled data (zhu2009introduction). The distribution patterns in the unlabeled data help build models that generalize better than models trained with supervised learning on the few labeled samples alone. The aim in semi-supervised learning is to improve the model performance in subsequent iterations using pseudo labels generated from the output of the pre-trained model. Incremental learning is an example of semi-supervised learning where the goal is to, for example, select the best features by adding decision trees to the classifier. utgoff1989incremental demonstrated that adding only relevant decision trees can incrementally improve the prediction without needing to go back and retrain the model.

Label propagation in images of a 3D stack using pseudolabels from predictions of a trained network (using few labeled examples) was performed in an incremental setting (takaya2021sequential). The experimental results indicate that supervised learning on continuous serial-section EM images generalizes less well than a sequential semi-supervised segmentation (4S) approach. In particular, 4S was applied in an inductive manner where the first local slices were used for predicting the labels of the next. Then, self-training was done using the next local slices of the EM stack to include original labels and a few pseudolabels. The 4S approach improved the performance of the network by reducing false positives along with improving segmentation accuracy when compared with U-Net. Such a semi-supervised approach applied on continuous EM images (having strong correlation between images in stacks) reduces the annotation effort by experts on high-throughput 3D EM datasets.
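The general idea of sequential self-training with pseudolabels can be sketched as follows. This is only a toy illustration, not the 4S implementation: a nearest-centroid intensity classifier stands in for the network, each slice is a flat list of intensities, and all function names as well as the `margin` confidence rule are assumptions made for the example. The first slice is assumed to contain both classes.

```python
def fit_centroids(pixels, labels):
    """Mean intensity per class (0 = background, 1 = foreground)."""
    means = {}
    for cls in (0, 1):
        values = [p for p, lab in zip(pixels, labels) if lab == cls]
        means[cls] = sum(values) / len(values)
    return means


def predict(means, pixels, margin):
    """Return (label, is_confident) for every pixel."""
    out = []
    for p in pixels:
        d0, d1 = abs(p - means[0]), abs(p - means[1])
        out.append((0 if d0 <= d1 else 1, abs(d0 - d1) >= margin))
    return out


def sequential_self_training(slices, first_labels, margin=0.2):
    """Propagate labels slice by slice, retraining on confident pseudolabels."""
    train_pixels = list(slices[0])
    train_labels = list(first_labels)
    all_labels = [list(first_labels)]
    for current in slices[1:]:
        means = fit_centroids(train_pixels, train_labels)   # (re)train
        preds = predict(means, current, margin)             # label next slice
        all_labels.append([lab for lab, _ in preds])
        for p, (lab, confident) in zip(current, preds):     # keep confident
            if confident:                                   # pseudolabels only
                train_pixels.append(p)
                train_labels.append(lab)
    return all_labels


# Toy stack whose intensities drift slowly from slice to slice.
stack = [
    [0.1, 0.2, 0.8, 0.9],
    [0.2, 0.3, 0.9, 1.0],
    [0.3, 0.4, 1.0, 1.1],
]
print(sequential_self_training(stack, first_labels=[0, 0, 1, 1]))
```

Because the class centroids are re-estimated after every slice, the model follows the gradual intensity drift, which mirrors the strong correlation between neighbouring slices that the approach relies on.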

5.3 Unsupervised approaches

Unsupervised approaches can be categorized into two groups. The first and more traditional group relies on image processing and thresholding without involving learning algorithms. An example of such a strategy is the work by karabaug2019segmentation, who used traditional image processing algorithms to detect nuclear envelopes (NE) in HeLa cells, a cancer cell line widely used for experimentation (iudin2016empiar). Low-pass filtering, edge detection and dilation to connect disjointed edges, super-pixel analysis, and removal of small super-pixels near the image boundary, followed by smoothing and filling of holes, were performed for NE segmentation. NEs have shapes that are more irregular on the top and bottom slices than on the central ones, for which a connectivity check of islands to the main nuclear region using adjacent 3D images was carried out as a post-processing step. The classical unsupervised algorithm performed better than four deep learning models, VGG16, ResNet18, Inception-ResNetv2 and U-Net, for NE segmentation (simonyan2014very; he2016deep; szegedy2017inception; ronneberger2015u). Because the cell nucleus appears smaller in the edge slices than in the middle 2D slices, the training images are highly imbalanced in NE shapes, which hampers the training of deep architectures. However, the learning-free unsupervised approach assumes that the cell nucleus is always located at the center of the three-dimensional stack and that the NE is darker than the nucleus and the surrounding background; such learning-free approaches may therefore not generalize robustly.

The other and more advanced group of unsupervised approaches uses learning algorithms that configure models from unlabeled data. Self-supervision improves a model on the discovered patterns in unlabeled data by using them as labels for supervised learning. Unsupervised domain adaptation with self-supervision was used to determine pivot locations in an unlabeled target dataset that characterise regions of mitochondria or synapses (bermudez2019visual). Correspondences between similar structures in different image stacks were used as soft labels for other supervised techniques. The target domain locations from the correspondences were converted to dense responses, referred to as heatmaps, for adapting the model based on a two-stream U-Net (bermudez2018domain). The results were consistent with those obtained by a U-Net trained on fully annotated samples or by semi-supervision (use of transfer learning). No new annotation effort was required in case of domain shifts for volumes of FIB-SEM from different mouse specimens.

Adversarial learning is a methodology that trains networks to predict the same output for two datasets, the source data and adversarially perturbed data (adversarial examples), where the latter would otherwise yield a different output despite belonging to the same class. Each algorithm can use a different approach, such as sharing weights across domains (peng2020unsupervised) or using generated samples from a generator to confuse the discriminator, as is done by Generative Adversarial Networks, or GANs for short (goodfellow2014generative). Adversarial learning is also used for domain adaptation to learn non-discriminating features for robustness to shifts in data distributions (peng2020unsupervised). Domain-invariant features in the encoder are learnt through a reconstruction auto-encoder in an unsupervised manner. As the target has no labels, the shared decoder features are still not discriminative for the target domain, which is why the proposed method uses adversarial learning in the decoder stage. The adversarial learning aims to ensure that the network cannot distinguish between features of the source and target domains. GANs use adversarial learning to generate synthetic training samples with the same statistics as the source training data. Image synthesis using sinGAN was used to generate new images with a similar distribution but varied object configuration for a three-class detection of the stages of cytoplasmic capsids of human cytomegalovirus (HCMV) from a TEM dataset (shaga2021improved). Synthetic images added during the training of the detector of the three capsid classes enhanced the model’s generalisation performance significantly. The authors indicate that their work can serve as a baseline for future development on other EM-related datasets, such as SBF SEM or STEM images used for detecting virus morphogenesis.

5.4 Performance evaluation metrics

Segmentation can be evaluated using pixel-based matching or segment-based matching for binary segmentation. When the segmentation gives a unique index to each object it is called instance segmentation. Instance segmentation is evaluated by penalising overlaps with other individual segments.

Common pixel-based matching measures are accuracy, true positive rates and false positive rates (jiang2019effective). Precision, recall (sensitivity), specificity and F1-score (harmonic mean of precision and recall) are the most basic performance measures used in various studies to quantify the effectiveness of 2D and 3D segmentation methods (xiao2018automatic; dietlmeier2019few; khadangi2021stellar; takaya2021sequential). Most binary image segmentation tasks suffer from class imbalance as the background class is much larger than the objects of interest. To address class imbalance, which is also typical in medical image analysis, measures such as the Intersection over Union (IoU), also known as the Jaccard index (JAC), which quantifies how similar the ground-truth and predicted sets are, can be more appropriate. The Dice similarity coefficient (DSC) is another measure that addresses class imbalance by only considering the segmentation class for evaluation. The DSC and JAC are, in fact, the most commonly used measures for performance evaluation in this field (mekuvc2020automatic; xiao2018automatic; yuan2021hive; luo2021hierarchical; cao2019automatic; peng2019mitochondria; bermudez2018domain; peng2020unsupervised). xiao2018automatic also used the conformity coefficient, which is a global similarity score with more discrimination capabilities than Jaccard or DSC (chang2009performance). To calculate how far the segmented nuclear envelope is from the ground truth, karabaug2019segmentation used the Hausdorff distance besides the Jaccard index. The Hausdorff distance is the spatial distance between two sets and, apart from matching segments, it also takes into account the pixel/voxel localisation. Common metrics used for instance segmentation are the Aggregated Jaccard Index (AJI) and Panoptic Quality (PQ) (yuan2020; luo2021hierarchical), which account for under- and over-segmentation more accurately than the Jaccard index and DSC. Definitions of the most commonly used metrics are illustrated graphically in Fig. 4.
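The pixel-based measures and the Hausdorff distance discussed above can be written down compactly. The sketch below assumes binary masks given as flat 0/1 sequences and contours given as lists of coordinate tuples; the function names are chosen for this example only, and `math.dist` requires Python 3.8 or later.

```python
import math


def confusion_counts(gt, pred):
    """Count TP, FP, FN, TN for two flattened binary masks."""
    tp = fp = fn = tn = 0
    for g, p in zip(gt, pred):
        if g and p:
            tp += 1
        elif p:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pixel_metrics(gt, pred):
    """Standard pixel-wise scores; empty denominators yield 0.0."""
    tp, fp, fn, tn = confusion_counts(gt, pred)

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),  # equals pixel-wise F1
        "jaccard": ratio(tp, tp + fp + fn),       # IoU, ignores TN
    }


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        return max(min(math.dist(s, d) for d in dst) for s in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))


gt = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(pixel_metrics(gt, pred))
print(hausdorff([(0, 0), (0, 1)], [(0, 0), (3, 4)]))
```

In this toy example the accuracy is 0.8 while the Jaccard index is only 0.5, because the many true negatives of the background inflate accuracy but do not enter the Jaccard index or the DSC.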

The Jaccard curve, which was inspired by the precision/recall and receiver operating characteristic (ROC) curves, is a measure that quantifies the quality of a segmentation result without involving any thresholds (cetina2018multi).

Boundary matching and information theoretic scores have emerged as two important metrics to evaluate neuronal boundary maps (unnikrishnan2007toward; arbelaez2010contour; arganda2015crowdsourcing). The most popular ones are similarity-oriented measures between two clusterings, computed over pairs of labels instead of pixel-wise errors. Boundary maps are transformed to segmentations by finding connected components. The Rand index is a cluster similarity measure that quantifies the similarity between the results of two clustering methods as the ratio of the number of pairs of points on which the two clusterings agree (pairs grouped together in both, plus pairs separated in both) to the total number of pairs. Another measure is based on what are known as the split and merge errors. The split error is computed by taking two randomly selected voxels belonging to the same segment in the ground truth and assessing them based on the joint probability of belonging to the same region in the segmentation result. The merge error is based on whether two voxels predicted as belonging to the same segment do actually belong together. The Rand F-score is then the weighted harmonic mean of the corresponding split and merge scores. Metrics like foreground-restricted Rand scoring and foreground-restricted information theoretic scoring after border thinning are the state-of-the-art metrics for neuronal boundary segmentation (cao2020denseunet; zeng2017deepem3d; he2020deep; cetina2018multi; khadangi2021stellar).
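A pair-counting version of these scores can be sketched as follows. It is a simplification under stated assumptions: all pairs of distinct pixels are enumerated explicitly (only practical for small inputs), no foreground restriction or border thinning is applied, and the function names and the `alpha` weighting convention are choices made for this example.

```python
from itertools import combinations


def pair_counts(gt, seg):
    """Classify every pixel pair by whether it is joined in gt and/or seg."""
    both = gt_only = seg_only = neither = 0
    for i, j in combinations(range(len(gt)), 2):
        same_gt = gt[i] == gt[j]
        same_seg = seg[i] == seg[j]
        if same_gt and same_seg:
            both += 1
        elif same_gt:
            gt_only += 1      # pair split apart by the segmentation
        elif same_seg:
            seg_only += 1     # pair wrongly merged by the segmentation
        else:
            neither += 1
    return both, gt_only, seg_only, neither


def rand_index(gt, seg):
    """Fraction of pairs on which the two labelings agree."""
    both, gt_only, seg_only, neither = pair_counts(gt, seg)
    total = both + gt_only + seg_only + neither
    return (both + neither) / total if total else 1.0


def rand_f_score(gt, seg, alpha=0.5):
    """Weighted harmonic mean of the split and merge scores."""
    both, gt_only, seg_only, _ = pair_counts(gt, seg)
    if both == 0:
        return 0.0
    split_score = both / (both + gt_only)   # P(joined in seg | joined in gt)
    merge_score = both / (both + seg_only)  # P(joined in gt | joined in seg)
    return 1.0 / (alpha / merge_score + (1.0 - alpha) / split_score)


gt = [1, 1, 1, 2, 2, 2]
seg = [1, 1, 2, 2, 2, 2]   # one pixel moved into the wrong segment
print(rand_index(gt, seg), rand_f_score(gt, seg))
```

For this toy labeling, 10 of the 15 pixel pairs agree, so the Rand index is 2/3, while the Rand F-score is 8/13 because it ignores the pairs that are separated in both labelings.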

Figure 4: Illustration of common performance metrics for (top) binary and (bottom) instance segmentation approaches. Each GT component is matched with only one PR component, the one with which it has the largest intersection. For instance, a PR component that overlaps with two GT components is matched only with the one with which it has the larger overlap. The Aggregated Jaccard Index (AJI) is the sum of the intersections of all matched GT and PR components divided by the sum of their unions plus the areas of the unmatched PR components. The Panoptic Quality (PQ) is the sum of the IoU ratios of all associated GT-PR pairs (i.e. TPs) divided by the number of TPs plus half the number of unmatched GT and PR components (kumar2017dataset; kirillov2019panoptic).
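The two instance-level metrics described in the caption can be sketched as below. The example assumes that every instance is given as a Python set of pixel identifiers; the function names are hypothetical. PQ uses the common IoU > 0.5 matching rule, and AJI matches each GT component to the PR component with the largest intersection, following the caption.

```python
def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(gt, pr, iou_threshold=0.5):
    """Sum of IoUs of matched pairs over TP + FP/2 + FN/2."""
    matched_pr = set()
    iou_sum, tp = 0.0, 0
    for g in gt:
        for j, p in enumerate(pr):
            score = iou(g, p)
            if j not in matched_pr and score > iou_threshold:
                matched_pr.add(j)
                iou_sum += score
                tp += 1
                break
    fn = len(gt) - tp          # unmatched GT components
    fp = len(pr) - tp          # unmatched PR components
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


def aggregated_jaccard_index(gt, pr):
    """Aggregate intersections and unions over all GT components."""
    used = set()
    inter_sum = union_sum = 0
    for g in gt:
        best_j, best_inter = None, 0
        for j, p in enumerate(pr):
            inter = len(g & p)
            if inter > best_inter:
                best_j, best_inter = j, inter
        if best_j is None:             # GT component missed entirely
            union_sum += len(g)
        else:
            inter_sum += best_inter
            union_sum += len(g | pr[best_j])
            used.add(best_j)
    for j, p in enumerate(pr):         # penalise unmatched predictions
        if j not in used:
            union_sum += len(p)
    return inter_sum / union_sum if union_sum else 0.0


gt = [set(range(0, 10)), set(range(20, 30))]
pr = [set(range(0, 8)), set(range(40, 45))]
print(panoptic_quality(gt, pr), aggregated_jaccard_index(gt, pr))
```

Here one GT component is found with an IoU of 0.8, one is missed and one prediction is spurious, giving PQ = 0.8 / (1 + 0.5 + 0.5) = 0.4 and AJI = 8 / (10 + 10 + 5) = 0.32.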

6 Scalability and performance matters

Bioimage analysis software tools rely on high-performance computing, using either powerful computers with many processors running in parallel or compute clusters (distributed computing), for faster annotation and proofreading. The time taken to complete a task depends on the design, which is either distributed or cloud computing depending on whether local or remote infrastructure is used, respectively. A browser (or client) gets fast access to image volumes in cloud storage systems without having to deal with the specifics of the underlying infrastructure for data management and processing. This has been observed in vEM for neuron tracing and reconstruction in datasets of micrometer-wide sections covering the whole brain.

Scalability here refers to the data processing frameworks and computational resources that can handle big data. sculley2015hidden show that the problems with scalable machine/deep learning lie not in the algorithms involved, but in the supporting infrastructure, which is vast and complex. In this section, we discuss distributed frameworks, such as Hadoop and Apache Spark, and their use in software tools for image annotation (shvachko2010hadoop; zaharia2010spark). Furthermore, we discuss cloud computing and how it enables on-demand provision of computing resources such as cloud applications (containers) and storage devices. An illustration of the main components in bio-image analysis software is shown in Fig. 5.

Figure 5: Typical architectures of scalable computing in available bio-image analysis software.

6.1 Distributed computing

Distributed computing is the process by which a single problem is divided into many parts, controlled by a master node but processed in different computing units (worker nodes). More database or processing nodes can be added to the system based on demand, rather than using a single server with many nodes that may not be used at all times. There can also be redundant nodes (such as extra worker nodes or a master replica) that address hardware failures, a property known as fault tolerance, Fig. 5. Users or clients can access and process data remotely in different systems of the network as simply as accessing local data. Biomedical image analysis software tools for neuronal reconstruction such as VAST, NeuTu, webKnossos and TrakEM2 (CATMAID) use distributed storage and processing.
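The master/worker pattern described above can be illustrated on a single machine. In this sketch a thread pool stands in for the worker nodes of a cluster, and a simple intensity threshold stands in for the per-chunk segmentation task; both, as well as all function names, are assumptions made for the example and do not reflect how any of the reviewed tools is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(volume, chunk_size):
    """Master step: cut the stack of slices into independent chunks."""
    return [volume[i:i + chunk_size] for i in range(0, len(volume), chunk_size)]


def segment_chunk(chunk, threshold=128):
    """Worker step (placeholder task): threshold every slice of a chunk."""
    return [[1 if value >= threshold else 0 for value in image] for image in chunk]


def run_master(volume, chunk_size=2, n_workers=4):
    """Distribute chunks to workers and stitch the results back in order."""
    chunks = split_into_chunks(volume, chunk_size)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in submission order, so stitching is trivial.
        results = list(pool.map(segment_chunk, chunks))
    return [image for chunk in results for image in chunk]


# Toy volume: five slices, each flattened to two intensity values.
volume = [[0, 200], [130, 10], [255, 255], [5, 5], [128, 127]]
print(run_master(volume))
```

Adding workers only helps when chunks can be processed independently; tasks that need context across chunk borders, such as neurite tracing, require overlapping chunks or a later merging step.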

Evaluating large-scale connectomics datasets at the nanometer scale is important for obtaining efficient automated reconstructions of neurons. Detailed metrics for evaluating synapse connectivity were introduced in VAST. A software ecosystem in VAST is used to evaluate two large datasets (takemura2015synaptic; takemura2017connectome) and is deployed as a scalable cluster-based solution using Apache Spark (zaharia2010spark), an open source data processing framework for storing and processing data in batch or real time across clusters of computers. VAST is implemented on top of the Apache Spark framework to provide a flexible plugin architecture that is easy to deploy on different cluster environments for large-scale processing. Distributed Versioned Image-oriented data service (DVID) is a tool for branched or distributed versioning in connectomics reconstruction workflows, supporting collaborating proofreaders from any part of the world (katz2019dvid). DVID provides a web-based API for key-value based image labels, which gives efficient storage and fast access. NeuTu is a client program that uses DVID as its scalable image database for large-scale, collaborative 3D neuronal reconstruction proofreading.
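To illustrate the idea of branched versioning over key-value data, the following is a toy sketch in which each version stores only its own changes and reads fall back to the parent version. It is not DVID's actual API or data model; the class `VersionedStore` and its methods are hypothetical, and keys are assumed to be block coordinates with label identifiers as values.

```python
class VersionedStore:
    """Toy branched key-value store: each version only stores its own changes."""

    def __init__(self):
        self.parent = {"root": None}   # version -> parent version
        self.data = {"root": {}}       # version -> {key: value}

    def branch(self, parent, child):
        """Create a new version that inherits everything from `parent`."""
        if parent not in self.parent:
            raise KeyError(f"unknown parent version: {parent}")
        self.parent[child] = parent
        self.data[child] = {}

    def put(self, version, key, value):
        """Write a value in one version without touching any other version."""
        self.data[version][key] = value

    def get(self, version, key):
        """Walk up the ancestry until the key is found."""
        while version is not None:
            if key in self.data[version]:
                return self.data[version][key]
            version = self.parent[version]
        raise KeyError(key)
```

Two proofreaders could each work on their own branch of `"root"`: an edit made in one branch is invisible in the other until the branches are reconciled, which is a step this sketch leaves out.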

Decentralized computing, modeled after Google Maps (rasmussen2005keynote), was introduced in the Collaborative Annotation Toolkit for Massive Amounts of Image Data (CATMAID) (saalfeld2009catmaid). It supports in-browser decentralized annotation of large biological image stacks. Images are loaded from the immediate node or from nodes a few hops away in the network to make them accessible to the user, Fig. 5. Projects, image stack information, and annotations (metadata) are stored in a centralized server for cross-referencing and collaboration. WebKnossos (boergens2017webknossos) addresses latency issues in accessing sequential 2D images for neurite reconstruction.
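Map-style viewers typically serve a large image as a grid of fixed-size tiles, so that the browser only requests the tiles overlapping the current viewport. A minimal sketch of that lookup is given below, assuming non-negative pixel coordinates at a single zoom level and a tile size of 256 pixels; the function `visible_tiles` is hypothetical and does not reproduce CATMAID's code.

```python
def visible_tiles(view_x, view_y, view_w, view_h, tile_size=256):
    """Return (column, row) indices of the tiles overlapping a viewport.

    Coordinates are in pixels at the current zoom level.
    """
    first_col = view_x // tile_size
    first_row = view_y // tile_size
    # Subtract 1 so a viewport ending exactly on a tile edge
    # does not pull in the next tile.
    last_col = (view_x + view_w - 1) // tile_size
    last_row = (view_y + view_h - 1) // tile_size
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]
```

The number of tiles fetched depends on the viewport size rather than on the size of the full image, which is what keeps browsing of very large sections responsive.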

A data management system was developed for the storage and retrieval of large volumes of EM data, supporting visualization of raw and reconstructed neuronal circuit images (yuan2020). Supported by a backend Hadoop framework, it allows for distributed processing and storage of large amounts of data (shvachko2010hadoop). The delay in read/write operations from the client was further reduced using a three-level image cache at the client side. Scaling the clusters as more EM data are generated, together with redundancy of data across clusters, provides reliable data management and integrity.
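A multi-level client-side cache can be sketched as a chain of least-recently-used (LRU) caches that is searched from the fastest level to the slowest before the backend is contacted. The sketch below is a generic illustration only: what the three levels are in the cited system is not stated here, so the capacities, the classes `LRUCache` and `TieredCache`, and the `fetch` callable are all assumptions. Cached values are assumed never to be `None`.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)


class TieredCache:
    """Look up fast levels first; fall back to the backend on a full miss."""

    def __init__(self, capacities, fetch):
        self.levels = [LRUCache(c) for c in capacities]
        self.fetch = fetch  # callable that loads a tile from remote storage

    def get(self, key):
        for i, level in enumerate(self.levels):
            value = level.get(key)
            if value is not None:
                break
        else:
            i, value = len(self.levels), self.fetch(key)
        # Promote the tile into every faster level that missed it.
        for level in self.levels[:i]:
            level.put(key, value)
        return value
```

With capacities such as `[1, 2, 4]`, a tile evicted from the smallest level can still be served from a larger one without another round trip to the backend.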

6.2 Cloud computing

Cloud computing is the on-demand provision of services such as servers, applications, networking capabilities, and hardware resources on the internet for the end-user (kagadis2013cloud). There are three models of cloud service. The infrastructure as a service (IaaS) model only includes computing resources, networking, and storage. On the other hand, the platform as a service (PaaS) model includes the application design, testing and development tools, middleware, operating systems, and databases. Finally, the software as a service (SaaS) model facilitates the availability of all application services to users from any device with an internet connection and a web browser.

CDeep3EM (haberl18) is a plug-and-play tool for segmentation of biomedical images that runs on Amazon Web Services (AWS) and implements the deep learning model introduced by zeng2017deepem3d. CDeep3EM is pre-configured and made publicly available on AWS, which makes it a cloud-based deep convolutional neural network for 2D and 3D biomedical image segmentation. The underlying configuration dependencies and computational infrastructure are abstracted away by providing a cloud-based version for pre-processing, training and ensemble prediction using DeepEM3D.

EM-stellar (khadangi2021stellar) is a Jupyter notebook platform hosted on Google Colab with ready access to cloud computing resources. Among existing work on the use of cloud computing, the recent work by von2021 aims at democratising the use of deep learning models by leveraging the open, cloud-based Google Colab resources to simplify access to deep learning in microscopy. It offers code-free usage by providing state-of-the-art networks as notebooks for segmentation, object detection, denoising, and super-resolution microscopy for multiple biological datasets. Quantitative tools to analyse and optimize model performance, along with data augmentation and transfer learning options, are also included. The aim is to establish a framework for developers to interface with the existing notebooks and cloud infrastructure to evolve networks for specific image processing problems in microscopy.

7 Challenges and future trends

Segmentation of large-scale EM datasets focuses on distinctive image features captured through variants of the U-Net architecture. Techniques for neuronal membrane or mitochondria segmentation aim to improve robustness through additional supervision or ensemble techniques that yield fewer false positives and false negatives. Attempts have been made at whole cell 3D reconstruction for segmenting many organelles in volume EM datasets by providing public datasets for further cell model research (heinrich2021whole; guay2021dense). As large public datasets are not available for pre-training and new datasets generated from EM techniques lack labels for supervised learning, newer methodologies of transfer learning, namely few-shot learning and domain adaptation, have emerged. Due to the challenges of data annotation, semi-supervised and unsupervised learning techniques are becoming more appealing.

Semi-supervised approaches such as few-shot (few examples per class), one-shot (one example per class), and zero-shot (no examples at all) learning are used to adapt pre-trained models to new datasets (snell2017prototypical; vinyals2016matching; socher2013zero). Supervised few-shot learning for segmentation has shown promising results (dong2018few; hu2019attention). Few-shot learning with noisy labels and incremental (semi-supervised) learning are of particular interest for EM datasets (liang2022few; tao2020few).
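As a concrete illustration of few-shot classification, prototypical networks (snell2017prototypical) represent each class by the mean of its few support embeddings and assign a query to the nearest prototype. The sketch below shows only that final step, assuming the embeddings have already been produced by some pre-trained network; the function names and the class labels in the usage example are hypothetical.

```python
import math


def prototypes(support):
    """Average the embeddings of each class: {label: [vectors]} -> {label: mean}."""
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in support.items()
    }


def classify(query, protos):
    """Assign the query embedding to the class with the nearest prototype."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))
```

For example, with two support embeddings per class, `classify([0.7, 0.3], prototypes({"mitochondrion": [[1.0, 0.0], [0.8, 0.2]], "membrane": [[0.0, 1.0], [0.2, 0.8]]}))` selects the class whose mean embedding is closest in Euclidean distance.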

Self-supervised methods based on the contrastive learning framework, such as SimCLR (Simple Framework for Contrastive Learning of Visual Representations) and MoCo (Momentum Contrast), have been used to learn from similar and dissimilar pairs of data (hadsell2006dimensionality; chen2020simple; ciga2020self). Networks that are initialized either randomly or by pre-training on large datasets, such as ImageNet, do not perform better than networks pre-trained using MoCo (guay21; casser2018fast; perez2014workflow; mekuvc2020automatic). Self-training seems to be another methodology with high potential for further improving the robustness of pre-trained models (zoph2020rethinking). A notable study in medical X-ray data classification uses self-training, which is expected to be the future for training on medical datasets with very limited annotated data (rajan2021self).
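The contrastive objective can be made concrete with a small example. The sketch below computes a temperature-scaled cross-entropy loss over cosine similarities for a single anchor, in the spirit of the loss used by SimCLR (chen2020simple); it is a simplified, pure-Python illustration rather than a training implementation, and the function names and the temperature value of 0.5 are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nt_xent(anchor, positive, negatives, temperature=0.5):
    """Contrastive loss for one anchor: low when the positive is the most similar."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

Here the positive would be an augmented view of the same image patch and the negatives would be views of other patches; minimising the loss pulls the positive pair together and pushes the negatives apart in embedding space.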

Self-supervised methods can also benefit from multi-modal EM data, such as CLEM. seifert2020deepclem give further insight into the potential of deep learning for automatic image registration for CLEM. Other multi-modal training data that could be used for EM segmentation in the future include Color-EM or EDX information (pirozzi2018colorem).

Context information is important to segment structures in EM images. In the case of large scale 2D EM, morphological analysis of structures and their overall arrangement in the tissue is very relevant for medical interpretation, as it can give functional identification to a structure. Temporal context in 3D images helps in better segmentation of structures using neural networks. Similarly, the idea of large-scale 2D EM as sequential data to model the dependencies between patches can be effective to identify structures based on larger global context. Further inspiration can be obtained by the recent success of transformer networks, typically used for natural language processing, that use built-in self-attention between patches (dosovitskiy2020image; zheng2021rethinking).
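Self-attention between patches can be illustrated with scaled dot-product attention over a list of patch embeddings. The following sketch is deliberately simplified: it assumes identity projections (queries, keys, and values are all the raw patch embeddings), a single attention head, and no positional encoding, whereas actual transformer networks learn separate projection matrices. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(patches):
    """Scaled dot-product self-attention with identity projections (Q = K = V)."""
    scale = math.sqrt(len(patches[0]))
    output = []
    for query in patches:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in patches]
        weights = softmax(scores)
        # Each output is a weighted mix of all patch embeddings.
        output.append([
            sum(w * value[d] for w, value in zip(weights, patches))
            for d in range(len(query))
        ])
    return output
```

Because every output vector mixes information from all patches, a patch far away in the section can influence the representation of the current one, which is the sense in which attention provides global context.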

The ability to segment new datasets with minimal annotations allows these methods to scale to larger EM datasets generated from state-of-the-art technologies, such as MB-SEM. Other aspects of scalability are improving due to the widespread use of distributed or cloud computing. Access by biologists to such computing resources has also improved through the continued development of bioimage analysis software. Pathologists, for instance, can now use off-the-shelf cloud applications for data access and analysis. Distributed computing can also be performed on the cloud, which makes it even more appealing.

Indexing of segmented structures is an important topic for future research. The idea is to enable a domain expert to query a database in three different ways, namely, label-based, image-based, and proximity-based. A label-based query would require the user to specify the label of a structure of interest, an image-based query would require the submission of an image example of a structure of interest, and a proximity-based query would require the specification of the distance and direction between a set of structures of interest. Such functionality would enable domain experts to enhance their interaction with a database containing large volumes of large-scale EM images.
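Two of the three query types can be sketched over a simple in-memory index of segmented structures. This is a hypothetical illustration only: the `Structure` record, its fields, and both query functions are assumptions, centroids stand in for full segment geometry, the direction component of a proximity query is omitted, and image-based queries (which would need a learned similarity measure) are not covered.

```python
import math
from dataclasses import dataclass


@dataclass
class Structure:
    label: str
    x: float  # centroid position, e.g. in nanometres
    y: float


def query_by_label(index, label):
    """Label-based query: all structures carrying the given label."""
    return [s for s in index if s.label == label]


def query_by_proximity(index, label_a, label_b, max_distance):
    """Proximity-based query: pairs (a, b) whose centroids lie within max_distance."""
    return [
        (a, b)
        for a in query_by_label(index, label_a)
        for b in query_by_label(index, label_b)
        if a is not b and math.hypot(a.x - b.x, a.y - b.y) <= max_distance
    ]
```

A linear scan like this is only workable for small indexes; at the scale of large EM datasets a spatial index would be needed to keep proximity queries fast.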

8 Conclusion

Automated image analysis techniques for EM are evolving in accordance with the recent advancements in imaging technologies. For instance, nanotomy, or the acquisition of large-scale 2D EM, makes greater demands on the capture of global context by segmentation algorithms, without the aid of the 3D information available in volume EM. State-of-the-art segmentation methodologies, for both 2D and vEM, are thoroughly categorised in this review. An overview is also given of other relevant aspects of automated segmentation in EM, such as annotation software, visualisation, and storage and computing frameworks, with an emphasis on distributed and cloud computing.

Given that the lack of fully annotated data in medical imaging will persist and is likely to be compounded by the exponential generation of image data (e.g. by MB-SEM), we suspect that semi-supervised and self-supervised approaches will play a bigger role in the segmentation of EM data in the future.

9 Acknowledgement

This project has received funding from the Centre for Data Science and Systems Complexity at the University of Groningen. Part of the work has been sponsored by ZonMW grant 91111.006; the Netherlands Electron Microscopy Infrastructure (NEMI), NWO National Roadmap for Large-Scale Research Infrastructure of the Dutch Research Council (NWO 184.034.014); the Network for Pancreatic Organ donors with Diabetes (nPOD; RRID:SCR), a collaborative T1D research project sponsored by JDRF (nPOD: ) and The Leona M. & Harry B. Helmsley Charitable Trust (Grant ). The content and views expressed are the responsibility of the authors and do not necessarily reflect the official view of nPOD. Organ Procurement Organizations (OPO) partnering with nPOD to provide research resources are listed in