
Leveraging Self-supervised Denoising for Image Segmentation

by Mangal Prakash, et al.

Deep learning (DL) has arguably emerged as the method of choice for the detection and segmentation of biological structures in microscopy images. However, DL typically needs copious amounts of annotated training data, which for biomedical projects is typically unavailable and excessively expensive to generate. Additionally, tasks become harder in the presence of noise, requiring even more high-quality training data. Hence, we propose to use denoising networks to improve the performance of other DL-based image segmentation methods. More specifically, we present ideas on how state-of-the-art self-supervised CARE networks can improve cell/nuclei segmentation in microscopy data. Using two state-of-the-art baseline methods, U-Net and StarDist, we show that our ideas consistently improve the quality of the resulting segmentations, especially when only limited training data for noisy micrographs are available.



1 Introduction

Modern microscopy techniques enable us to capture biological processes at high spatial and temporal resolution. Today, the total amount of acquired image data can be so vast that analyzing it poses a tremendous challenge. In many cases, deep learning (DL) based analysis methods [litjens2017survey, falk2019u, rutter2019convolutional, schmidt2018] are leading the way to address this problem. Still, even common tasks, such as detection or segmentation, typically require human curation to fix remaining errors.

Arguably the two main causes of weak detection and segmentation performance are too little training data and input images acquired at low signal-to-noise ratios (SNR). In order to make the most of the available training data, augmentation and transfer learning [doersch2015unsupervised, zamir2018taskonomy] are often used. While data augmentation uses transformed copies of the training data to gain better training performance, transfer learning employs networks pretrained on similar tasks and/or data and finetunes them for the task/data at hand. To address the low SNR, a number of powerful content-aware restoration and denoising methods have recently been developed [zhang2017beyond, weigert2018content, lehtinen2018noise2noise].

Among them are self-supervised methods [krull2019noise2void, batson2019noise2self, krull2019probabilistic], which do not require annotated training data, and can be directly trained on the raw data to be denoised.

Figure 1: Tested network architectures and training schedules. (a) Baseline methods are directly trained to segment noisy data, (b) sequential setup, with denoising being the preprocessing step for subsequent segmentation, (c) finetuning of a pretrained denoising network for segmentation, and (d) finetune-sequential, combining the ideas of (b) and (c).

In this work, we investigate various ways in which self-supervised denoising can enable cell/nuclei segmentation, even in the presence of extreme levels of noise and limited training data. We explore the efficacy of denoising as a preprocessing step, as part of a transfer learning schema, and as a combination of the two.

We conducted all experiments with two popular DL-based segmentation methods: a standard U-Net [ronneberger2015u, falk2019u] and the more sophisticated StarDist [schmidt2018]. We find that self-supervised denoising generally improves segmentation results, especially when noise is abundant and training data limited, and we provide detailed results comparing all approaches for various amounts of training data, noise levels, and types of data. All datasets, results, and code can be found at

2 Methods and Experiments

Figure 2: Results for noise level n40 and n20 on DSB data. Sequential is abbreviated as Seq and Finetune is abbreviated as FT. It can be seen that our proposed training schemes consistently outperform the respective baseline, mainly when only limited segmentation GT is available.

As sketched in Fig. 1, we propose three ways to improve segmentations and compare our results to two baseline methods, namely a standard U-Net [ronneberger2015u] for 3-class pixel classification, and StarDist [schmidt2018], designed to learn and utilize a star-convex shape prior. These baselines are chosen based on popularity and because they follow rather different segmentation paradigms. We propose the following setups.

Sequential (Figure 1b): Here, two networks are employed. The first network is a Noise2Void (N2V) network [krull2019noise2void], trained to denoise the full body of available image data. The second network, which receives the denoised N2V output, is either a U-Net or a StarDist network, trained on all or parts of the available segmentation labels (GT). Note that all weights of the N2V network remain constant while the segmentation network is trained.

Finetune (Figure 1c): In contrast to the sequential setup, here we retrain the N2V network for segmentation. Since StarDist does not use the exact same network architecture as N2V, this approach only applies to the U-Net baseline.

Finetune Sequential (Figure 1d): As in the sequential setup, we first train an N2V denoising network. In contrast to before, the segmentation network is initialized with a copy of the trained N2V network and then finetuned for segmentation. Here too, the weights of the first network stay unchanged during the training of the segmentation network.
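The baseline and the three proposed setups differ in only two choices: whether the input is first passed through the frozen N2V network, and whether the segmentation network starts from the trained N2V weights. The following is a minimal sketch of that reading of Figure 1; the names `Scheme`, `SCHEMES`, and `run_inference`, as well as the toy denoiser and segmenter, are hypothetical and not part of the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Scheme:
    denoise_input: bool  # feed the frozen N2V output to the segmentation network
    init_from_n2v: bool  # initialize the segmentation network with trained N2V weights


# One entry per panel of Figure 1 (a-d).
SCHEMES = {
    "baseline": Scheme(denoise_input=False, init_from_n2v=False),
    "sequential": Scheme(denoise_input=True, init_from_n2v=False),
    "finetune": Scheme(denoise_input=False, init_from_n2v=True),
    "finetune_sequential": Scheme(denoise_input=True, init_from_n2v=True),
}


def run_inference(image: List[float], scheme: Scheme,
                  denoiser: Callable, segmenter: Callable):
    """Apply the frozen denoiser first if the scheme asks for it, then segment."""
    x = denoiser(image) if scheme.denoise_input else image
    return segmenter(x)


# Toy stand-ins for trained networks (illustration only).
def toy_denoiser(img):
    return [round(v) for v in img]


def toy_segmenter(img):
    return [int(v > 0) for v in img]
```

The `init_from_n2v` flag only matters at training time, which is why it does not appear in `run_inference`.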

Next we describe the detailed setup of the N2V, Segmentation U-Net, and StarDist networks.

N2V Denoising Network: We use the Noise2Void setup as described in [krull2019noise2void]. Conveniently, N2V is just a default U-Net with a modified loss for denoising, allowing us to design a single network that can later be used for N2V training as well as for the U-Net segmentation baseline. We use initial feature maps with batch norm and a batch size of and employ convolution kernels. For all experiments we choose the depth of the U-Net as described below.
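The "modified loss" of N2V can be illustrated with a blind-spot masking sketch: a few pixels are replaced by the value of a randomly chosen neighbor, and the loss is evaluated only at those pixels against the original noisy values. This is a simplified NumPy illustration under stated assumptions, not the implementation of [krull2019noise2void]; the function names, the neighborhood `radius`, and the number of masked pixels are hypothetical choices.

```python
import numpy as np


def n2v_mask_patch(patch, n_masked, radius=2, rng=None):
    """Replace n_masked random pixels by a random neighbor; return (masked, mask)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = patch.shape
    masked = patch.copy()
    mask = np.zeros((h, w), dtype=bool)
    flat = rng.choice(h * w, size=n_masked, replace=False)
    ys, xs = np.unravel_index(flat, (h, w))
    for y, x in zip(ys, xs):
        while True:
            dy, dx = rng.integers(-radius, radius + 1, size=2)
            ny = int(np.clip(y + dy, 0, h - 1))
            nx = int(np.clip(x + dx, 0, w - 1))
            # Resample until the neighbor is a different pixel.
            if ny != y or nx != x:
                break
        masked[y, x] = patch[ny, nx]  # values are read from the unmodified patch
        mask[y, x] = True
    return masked, mask


def masked_mse(prediction, target, mask):
    """Mean squared error evaluated only at the masked (blind-spot) pixels."""
    return float(np.mean((prediction[mask] - target[mask]) ** 2))
```

During training, the network would see `masked` as input and be penalized with `masked_mse(network_output, patch, mask)`, so it cannot simply copy the noisy value of the pixel it is asked to predict.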

U-Net Segmentation Network: We created a U-Net capable of performing either 3-class pixel classification (foreground, border, background) [chen2016dcan, guerrero2018multiclass] or N2V denoising. Hence, the U-Net we use has four output channels, one for each pixel class, and one to regress denoised pixel intensities. Note that, during pixel classification, we give extra emphasis to the border class by weighting it five times higher in the loss, as suggested in [schmidt2018]. Again we use feature maps, batch size of , and kernels. For all experiments, the depth of the U-Net is chosen to maximize segmentation performance. Hence, results below are not limited by the capacity of the network. All networks are trained with a standard learning rate scheduler as used in [krull2019noise2void]. We use an initial learning rate of and a batch size of with batch normalization. Training is done for epochs, each consisting of steps. Training data is augmented 8-fold by flips and 90 degree rotations.
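The extra emphasis on the border class can be sketched as a class-weighted cross-entropy. The snippet below is an illustration only: the class index order, the normalization by a plain pixel mean, and the function name are assumptions, since the text states only that the border class is weighted five times higher.

```python
import numpy as np

# Assumed index order: 0 = foreground, 1 = border, 2 = background.
CLASS_WEIGHTS = np.array([1.0, 5.0, 1.0])


def weighted_cross_entropy(probs, labels, class_weights=CLASS_WEIGHTS, eps=1e-7):
    """Class-weighted cross-entropy.

    probs:  (H, W, 3) array of per-pixel class probabilities (softmax output).
    labels: (H, W) integer array with values in {0, 1, 2}.
    """
    # Probability assigned to the true class of every pixel.
    p_true = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
    per_pixel = -class_weights[labels] * np.log(np.clip(p_true, eps, 1.0))
    return float(per_pixel.mean())
```

With uniform predictions, a border pixel contributes five times the loss of a foreground or background pixel, which pushes the network to separate touching objects.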

StarDist Segmentation Network: Number of feature maps, batch size, convolution kernels, network depth, learning rate, number of training epochs, and step size per epoch used for StarDist are set as described above. Again, the training data is augmented 8-fold by flips and 90 degree rotations. However, StarDist uses output channels that are trained as explained in [schmidt2018].
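The 8-fold augmentation by flips and 90 degree rotations corresponds to the eight symmetries of a square: four rotations, each with and without a mirror flip. A minimal NumPy sketch follows; the function name is hypothetical, and image and label arrays are assumed to be 2D and transformed identically.

```python
import numpy as np


def augment_8fold(image, labels):
    """Return the 8 rotated/flipped versions of an (image, labels) pair."""
    out = []
    for k in range(4):
        img_r = np.rot90(image, k)
        lab_r = np.rot90(labels, k)
        out.append((img_r, lab_r))                        # rotation by k * 90 degrees
        out.append((np.fliplr(img_r), np.fliplr(lab_r)))  # same rotation, mirrored
    return out
```

Flipping vertically as well would add nothing new, since a vertical flip equals a horizontal flip combined with a 180 degree rotation.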

3 Data and Evaluation Metrics

Figure 3: Results for noise level n200 and n150 on BBBC data. The abbreviations are the same as in Fig. 2. Again, all proposed training schemes outperform their baselines. Here our proposed sequential U-Net schemes even outperform StarDist and StarDist Sequential.
DSB 2018 n40 (ten values per row, one per training subset)
U-Net (AP): 0.4777, 0.4944, 0.5439, 0.5912, 0.6214, 0.6551, 0.6645, 0.6834, 0.7304, 0.7199
U-Net (SEG): 0.5218, 0.5634, 0.5840, 0.6095, 0.6217, 0.6403, 0.6493, 0.6685, 0.6835, 0.6929
U-Net Sequential (AP): 0.5608, 0.5862, 0.6127, 0.6523, 0.6679, 0.6791, 0.6958, 0.7226, 0.7360, 0.7373
U-Net Sequential (SEG): 0.5675, 0.5938, 0.6160, 0.6349, 0.6483, 0.6608, 0.6700, 0.6890, 0.6960, 0.6950
U-Net Finetune (AP): 0.5357, 0.5518, 0.5971, 0.6286, 0.6430, 0.6658, 0.6731, 0.7013, 0.7140, 0.7261
U-Net Finetune (SEG): 0.5628, 0.5711, 0.5987, 0.6253, 0.6346, 0.6444, 0.6580, 0.6681, 0.6840, 0.6901
U-Net Finetune Seq. (AP): 0.5944, 0.6259, 0.6357, 0.6646, 0.6761, 0.6839, 0.7028, 0.7158, 0.7261, 0.7267
U-Net Finetune Seq. (SEG): 0.5927, 0.6212, 0.6262, 0.6499, 0.6529, 0.6611, 0.6686, 0.6813, 0.6898, 0.6870
StarDist (AP): 0.4796, 0.6085, 0.6400, 0.6620, 0.7572, 0.7679, 0.7795, 0.7827, 0.7884, 0.7883
StarDist (SEG): 0.4789, 0.5639, 0.5735, 0.5913, 0.6683, 0.6788, 0.6948, 0.6997, 0.7087, 0.7150
StarDist Sequential (AP): 0.6802, 0.7331, 0.7337, 0.7549, 0.7640, 0.7761, 0.7766, 0.7876, 0.7914, 0.7939
StarDist Sequential (SEG): 0.6004, 0.6399, 0.6548, 0.6727, 0.6877, 0.6906, 0.6987, 0.7044, 0.7107, 0.7141

BBBC 004 n200 (ten values per row, one per training subset)
U-Net (AP): 0.5218, 0.5634, 0.5840, 0.6095, 0.6217, 0.6403, 0.6493, 0.6685, 0.6835, 0.6929
U-Net (SEG): 0.6175, 0.6405, 0.6736, 0.6720, 0.6868, 0.6897, 0.7046, 0.7161, 0.7211, 0.7114
U-Net Sequential (AP): 0.5675, 0.5938, 0.6160, 0.6349, 0.6483, 0.6608, 0.6700, 0.6890, 0.6960, 0.6950
U-Net Sequential (SEG): 0.6781, 0.7074, 0.7083, 0.7069, 0.7106, 0.7179, 0.7132, 0.7112, 0.7205, 0.7179
U-Net Finetune (AP): 0.5628, 0.5711, 0.5987, 0.6253, 0.6346, 0.6444, 0.6580, 0.6681, 0.6840, 0.6901
U-Net Finetune (SEG): 0.6624, 0.6822, 0.6856, 0.6889, 0.6955, 0.7129, 0.7073, 0.7148, 0.7219, 0.7201
U-Net Finetune Seq. (AP): 0.5927, 0.6212, 0.6262, 0.6499, 0.6529, 0.6611, 0.6686, 0.6813, 0.6898, 0.6870
U-Net Finetune Seq. (SEG): 0.6996, 0.7021, 0.7082, 0.7118, 0.7099, 0.7235, 0.7237, 0.7120, 0.7162, 0.7168
StarDist (AP): 0.4789, 0.5639, 0.5735, 0.5913, 0.6683, 0.6788, 0.6948, 0.6997, 0.7087, 0.7150
StarDist (SEG): 0.6313, 0.6597, 0.6823, 0.6914, 0.7018, 0.7050, 0.7113, 0.7135, 0.7128, 0.7174
StarDist Sequential (AP): 0.6004, 0.6399, 0.6548, 0.6727, 0.6877, 0.6906, 0.6987, 0.7044, 0.7107, 0.7141
StarDist Sequential (SEG): 0.6895, 0.6992, 0.7001, 0.7007, 0.7103, 0.7116, 0.7146, 0.7153, 0.7202, 0.7224

Table 1: Mean performance in terms of average precision (AP) and SEG for DSB n40 (8 repetitions) and for BBBC n200 (5 repetitions); each row is labeled with its metric. Bold numbers in the original table indicate the best performing scheme for a given fraction of segmentation GT. See the main text for further details.

In this work we use publicly available data, which we randomly split into training (85%) and test sets (15%). We further split the training data into ten stacked subsets, which we use to test our methods in data-limited training regimes. Additionally, we corrupt the raw microscopy data with pixel-wise independent and identically distributed Gaussian noise. Sample images for all datasets are shown in Fig. 4.
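The data preparation can be sketched as a random 85/15 split followed by training subsets of increasing size. This sketch reads "stacked" as nested, meaning each larger subset contains all smaller ones; that reading, the function names, and the example subset sizes are assumptions, not values from the paper.

```python
import numpy as np


def split_train_test(n_images, test_fraction=0.15, rng=None):
    """Randomly split image indices into (train, test) index arrays."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(n_images)
    n_test = int(round(test_fraction * n_images))
    return order[n_test:], order[:n_test]


def stacked_subsets(train_indices, sizes):
    """Nested training subsets: the subset of size s is the first s train indices."""
    return [train_indices[:s] for s in sizes]


# Hypothetical example: 100 images and three illustrative subset sizes.
train_idx, test_idx = split_train_test(100, rng=np.random.default_rng(0))
subsets = stacked_subsets(train_idx, sizes=[10, 20, 40])
```

Nesting the subsets keeps comparisons across training-set sizes consistent, because adding data never removes an image the smaller run has already seen.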

DSB 2018 Data:

From the Kaggle 2018 Data Science Bowl challenge, we take the same subset of data as has been used in [schmidt2018], showing a diverse collection of cell nuclei imaged by various fluorescence microscopes. We extracted image patches of size from the training set. For this data, manually generated segmentation GT is available. Training subsets through consist of randomly chosen image patches, respectively. Additional noise is added with mean and standard deviations , , and to both training data and test data. We refer to the modified datasets as n10, n20, and n40, respectively.

BBBC 004 Data:

This data is available from the Broad Bioimage Benchmark Collection and consists of synthetic nuclei images. Since the data is synthetic, perfect GT labels are available by construction. Here we use only those images which have been generated with an overlap probability of . We extracted image patches (of size ) from the training set. Training subsets through consist of image patches, respectively. Additional noise is added with mean and standard deviations and to training and test data. Following the naming convention from above, we refer to this data as n150 and n200.
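Corrupting an image with pixel-wise independent Gaussian noise takes only a few lines. In the sketch below, the zero mean and the reading of the dataset names as noise standard deviations (for example `sigma=40` for n40) are assumptions, since the numeric values are not preserved in this text; the function name is hypothetical.

```python
import numpy as np


def add_gaussian_noise(image, sigma, mean=0.0, rng=None):
    """Add i.i.d. Gaussian noise with the given mean and standard deviation."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=mean, scale=sigma, size=image.shape)
    return image.astype(np.float32) + noise.astype(np.float32)


# Hypothetical usage: a flat test image corrupted at an assumed "n40" level.
clean = np.full((256, 256), 100.0, dtype=np.float32)
noisy_n40 = add_gaussian_noise(clean, sigma=40.0, rng=np.random.default_rng(0))
```

The same function would be applied to both training and test images so that the networks are evaluated at the noise level they were trained on.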

All experiments we conduct are evaluated in terms of Average Precision (AP) [everingham2010pascal] and SEG [ulman2017objective]. The SEG measure is based on the Jaccard similarity index ($J$), computed for matching objects $G$ and $S$, and is given by $J(G, S) = |G \cap S| / |G \cup S|$. A ground truth object $G$ and a segmented object $S$ are considered to be matching if and only if at least 50% of the pixels of $G$ are overlapped by pixels in $S$. Average Precision, in contrast, counts the ratio of true positives to the sum of true positives, false positives, and false negatives, $AP = TP / (TP + FP + FN)$. All AP and SEG values we report here are obtained by finding the threshold that maximizes AP. For the U-Net this threshold is used to cut the foreground probability maps into discrete image regions. For StarDist the threshold controls the non-maxima suppression step [schmidt2018].
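Both metrics can be computed directly from two label images in which 0 is background and every object has its own positive integer id. The sketch below is an illustration under assumptions: it uses a strict majority overlap for SEG matching, which guarantees at most one match per object, and an IoU threshold of 0.5 for counting true positives in AP, a value this text does not state. The function names are hypothetical.

```python
import numpy as np


def _overlapping_objects(gt, pred, g):
    """Yield (pred_id, intersection, union) for objects overlapping GT object g."""
    region = gt == g
    ids, counts = np.unique(pred[region], return_counts=True)
    for s_id, inter in zip(ids, counts):
        if s_id != 0:
            union = region.sum() + (pred == s_id).sum() - inter
            yield s_id, int(inter), int(union)


def seg_score(gt, pred):
    """Mean Jaccard index over GT objects; unmatched GT objects score 0."""
    scores = []
    for g in np.unique(gt):
        if g == 0:
            continue
        size = (gt == g).sum()
        best = 0.0
        for _, inter, union in _overlapping_objects(gt, pred, g):
            if inter > 0.5 * size:  # strict majority, so at most one match
                best = inter / union
        scores.append(best)
    return float(np.mean(scores)) if scores else 0.0


def average_precision(gt, pred, iou_threshold=0.5):
    """TP / (TP + FP + FN); intended for IoU thresholds of at least 0.5."""
    gt_ids = [g for g in np.unique(gt) if g != 0]
    pred_ids = [p for p in np.unique(pred) if p != 0]
    tp = 0
    for g in gt_ids:
        if any(inter / union > iou_threshold
               for _, inter, union in _overlapping_objects(gt, pred, g)):
            tp += 1
    fp = len(pred_ids) - tp
    fn = len(gt_ids) - tp
    denom = tp + fp + fn
    return tp / denom if denom else 0.0  # convention chosen here for empty images
```

A merged pair of nuclei costs AP twice (one missed object and typically one poorly matching prediction), while a slightly shrunken or grown segment leaves AP unchanged but lowers SEG through the Jaccard index.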

4 Results

Figure 4: Visual comparison of segmentation results with baseline methods and proposed training schemes for DSB n40 and BBBC n200. From left to right we show first one noisy input image, then the two insets, respective noise-free data, and the various segmentation results with each object shown in a distinct color. Sequential is abbreviated as Seq and Finetune is abbreviated as FT. In line with the overall performance on the full body of data, our proposed methods also outperform the baselines in the examples shown.

We investigated all setups described above, on all noise levels, using all subsets of training data , making a total of experimental setups. Each experiment on the DSB data was repeated 8 times, while all experiments on the BBBC data were repeated 5 times, allowing us to report mean performance and standard error for selected noise levels in Figures 2 and 3. Additionally, we show results for DSB data at noise level n40 and results for BBBC data at n200 in Table 1. A complete set of figures and tables, for all conducted experiments, can be found online at
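Mean performance and standard error over repeated runs can be computed as follows. This is a small sketch; the use of the sample standard deviation (`ddof=1`) is an assumption, and the five scores are made-up illustration values rather than results from the paper.

```python
import numpy as np


def mean_and_sem(scores):
    """Return (mean, standard error of the mean) over repeated runs."""
    scores = np.asarray(scores, dtype=float)
    sem = scores.std(ddof=1) / np.sqrt(len(scores))
    return float(scores.mean()), float(sem)


# Made-up scores from five hypothetical repetitions of one setup.
example_mean, example_sem = mean_and_sem([0.70, 0.72, 0.74, 0.76, 0.78])
```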

Looking at all results it can be observed that all our proposed schemes outperform their respective baseline when the amount of available training data is limited. The Finetune Sequential scheme is typically performing best among all U-Net based pixel classification pipelines. StarDist, in itself a more powerful method, is indeed the better performing baseline. As before, the proposed StarDist Sequential scheme clearly outperforms its baseline method when fewer training images are available. Note that even if ample training data is provided, all our proposed training schemes perform at least on par with their baselines. It is important to be reminded that improved results using sequential training schemes are not due to limiting network sizes – we have tested various network sizes for both baseline methods and have settled for the best performing configuration we could find.

To our surprise, on the BBBC data, both sequential U-Net schemes outperform StarDist and StarDist Sequential for the n150 and n200 noise levels, despite the StarDist baseline consistently and significantly outperforming the U-Net baseline.

A visual comparison of segmentation results with all the methods trained on the training subset is given in Fig. 4. For the DSB data we show insets that exemplify the often occurring problem of merging segments (bad for AP), while the shown BBBC insets show variations in segmented areas (bad for SEG). These segmentation mistakes are particularly exemplified for baseline schemes whereas sequential schemes for both U-Net and StarDist seem to yield better quality segmentation, in general.

5 Discussion

It is known that there is an overlap between denoising and segmentation tasks [zamir2018taskonomy]. We interpret this as segmentation networks inherently solving the denoising task to a certain extent and vice-versa. In this work we investigated how disentangling the two can be exploited in practice, when noisy data is abundant, but annotations are rare – a situation that is virtually ubiquitously true in biomedical applications.

In these situations, all our proposed schemes show above baseline performance. Among all conducted experiments, sequential training schemes generally lead to the best results. Since this is not only true for the simple U-Net baseline, but also for StarDist, it stands to reason that similar observations would also hold for other DL based approaches and tasks. This suggests that N2V, or other self-supervised denoising methods [krull2019probabilistic, batson2019noise2self], can serve as universal preprocessing blocks for networks solving any given super-task in which denoising is a helpful sub-task.

These denoising blocks can benefit from the whole body of available noisy data, without relying on annotated GT labels required for the super-task. Hence, finding sensible training schedules to train such larger, modular networks is a promising direction of research.

In summary, we show that commonly used networks for image segmentation can likely be boosted in performance by combining them in various ways with unsupervised denoising modules. Our work offers simple recipes for improving DL based segmentation results. Since this is increasingly true at lower signal-to-noise regimes and when segmentation GT is limited, direct benefits for the biomedical imaging community will be inevitable.

6 Acknowledgements

The authors would like to acknowledge the Scientific Computing Facility at MPI-CBG and HPC Cluster at the Center for Information Services and High Performance Computing (ZIH) at TU Dresden for giving us access to their HPC cluster. We also thank Uwe Schmidt, Martin Weigert and Vladimir Ulman from MPI-CBG for helpful discussions. This work was supported by the German Federal Ministry of Research and Education (BMBF) under the code 01IS18026C (ScaDS2).