1 Introduction
Modern microscopy techniques enable us to capture biological processes at high spatial and temporal resolution. Today, the total amount of acquired image data can be so vast that analyzing it poses a tremendous challenge. In many cases, deep learning (DL) based analysis methods [litjens2017survey, falk2019u, rutter2019convolutional, schmidt2018] are leading the way to address this problem. Still, even common tasks, such as detection or segmentation, typically require human curation to fix remaining errors.
Arguably the two main causes of weak detection and segmentation performance are too little training data and input images acquired at low signal-to-noise ratios (SNR). In order to make the most of the available training data, data augmentation [simard2003best] and transfer learning [doersch2015unsupervised, zamir2018taskonomy] are often used. While data augmentation uses transformed copies of the training data to improve training performance, transfer learning employs networks pretrained on similar tasks and/or data and finetunes them for the task/data at hand. To address the low SNR, a number of powerful content-aware restoration and denoising methods have recently been developed [zhang2017beyond, weigert2018content, lehtinen2018noise2noise]. Among them are self-supervised methods [krull2019noise2void, batson2019noise2self, krull2019probabilistic], which do not require annotated training data and can be trained directly on the raw data to be denoised.
In this work, we investigate various ways in which self-supervised denoising can enable cell/nuclei segmentation, even in the presence of extreme levels of noise and limited training data. We explore the efficacy of denoising as a preprocessing step, as part of a transfer learning scheme, and as a combination of the two.
We conducted all experiments with two popular DL-based segmentation methods: a standard U-Net [ronneberger2015u, falk2019u] and the more sophisticated StarDist [schmidt2018]. We find that self-supervised denoising generally improves segmentation results, especially when noise is abundant and training data is limited, and we provide detailed results comparing all approaches across various amounts of training data, noise levels, and types of data. All datasets, results, and code can be found at github.com/juglab/VoidSeg.
2 Methods and Experiments
As sketched in Fig. 1, we propose three ways to improve segmentations and compare our results to two baseline methods: a standard U-Net [ronneberger2015u] for 3-class pixel classification, and StarDist [schmidt2018], which is designed to learn and utilize a star-convex shape prior. These baselines were chosen for their popularity and because they follow rather different segmentation paradigms. We propose the following setups.
Sequential (Fig. 1b): Here, two networks are employed. The first is a Noise2Void (N2V) network [krull2019noise2void], trained to denoise the full body of available image data. The second network, which receives the denoised N2V output as input, is either a U-Net or a StarDist network, trained on all or parts of the available segmentation labels (GT). Note that all weights of the N2V network remain constant during the training of the segmentation network.
Finetune (Fig. 1c): In contrast to the sequential setup, here we retrain the N2V network itself for segmentation. Since StarDist does not use the exact same network architecture as N2V, this approach only applies to the U-Net baseline.
Finetune Sequential (Fig. 1d): Very similar to the sequential setup, here we first train an N2V denoising network. In contrast to before, the segmentation network is initialized with a copy of the trained N2V network and then finetuned for segmentation. Here too, the weights of the first network stay unchanged during the training of the segmentation network.
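Conceptually, the three schemes differ only in how the weights of the denoising and segmentation networks are initialized and which of them are frozen. The following minimal PyTorch sketch illustrates this logic; names and framework are illustrative only and do not correspond to the actual VoidSeg implementation:

```python
import copy
import torch.nn as nn

def sequential(denoiser: nn.Module, segmenter: nn.Module) -> nn.Module:
    """Sequential: a frozen, pretrained N2V network feeds a segmenter
    (U-Net or StarDist) that is trained from scratch on denoised output."""
    for p in denoiser.parameters():
        p.requires_grad = False  # N2V weights remain constant
    return nn.Sequential(denoiser, segmenter)

def finetune(denoiser: nn.Module) -> nn.Module:
    """Finetune: the pretrained N2V network itself is retrained
    for segmentation (all weights stay trainable)."""
    return denoiser

def finetune_sequential(denoiser: nn.Module) -> nn.Module:
    """Finetune Sequential: frozen N2V network followed by a copy of
    itself, which is then finetuned for segmentation."""
    for p in denoiser.parameters():
        p.requires_grad = False
    segmenter = copy.deepcopy(denoiser)  # initialized with trained N2V weights
    for p in segmenter.parameters():
        p.requires_grad = True
    return nn.Sequential(denoiser, segmenter)
```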
Next we describe the detailed setup of the N2V, Segmentation U-Net, and StarDist networks.
N2V Denoising Network: We use the Noise2Void setup as described in [krull2019noise2void]. Conveniently, N2V is just a default U-Net with a modified loss for denoising, allowing us to design a single network that can later be used for N2V training as well as for the U-Net segmentation baseline. The number of initial feature maps, the batch size (with batch normalization), and the convolution kernel size are shared across all networks. For all experiments we choose the depth of the U-Net as described below.
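For reference, the core of N2V training is its blind-spot masking: a few pixels per patch are replaced by randomly chosen neighbors, and the loss is computed on those pixels only. Below is a minimal numpy sketch, with illustrative parameter values that are not necessarily those used in our experiments:

```python
import numpy as np

def n2v_mask(patch, num_pix=64, radius=2, rng=None):
    """Blind-spot masking as in Noise2Void: selected pixels are replaced
    by randomly chosen neighbors; the loss is evaluated on them only."""
    rng = rng or np.random.default_rng()
    masked = patch.copy()
    h, w = patch.shape
    ys = rng.integers(0, h, num_pix)
    xs = rng.integers(0, w, num_pix)
    for y, x in zip(ys, xs):
        dy, dx = rng.integers(-radius, radius + 1, size=2)
        masked[y, x] = patch[np.clip(y + dy, 0, h - 1),
                             np.clip(x + dx, 0, w - 1)]
    mask = np.zeros(patch.shape, dtype=bool)
    mask[ys, xs] = True
    # The N2V loss is then an MSE restricted to masked pixels, e.g.:
    # loss = (((net(masked) - patch) ** 2)[mask]).mean()
    return masked, mask
```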
U-Net Segmentation Network:
We created a U-Net capable of performing either 3-class pixel classification (foreground, border, background) [chen2016dcan, guerrero2018multiclass] or N2V denoising.
Hence, our U-Net has four output channels: one for each pixel class and one to regress denoised pixel intensities.
Note that during pixel classification we give extra emphasis to the border class by weighting it five times higher in the loss, as suggested in [schmidt2018]; a sketch of this weighting is given below.
The number of feature maps, the batch size, and the kernel size again match the setup above.
For all experiments, the depth of the U-Net is chosen to maximize segmentation performance.
Hence, the results below are not limited by the capacity of the network.
All networks are trained with a standard learning rate scheduler as used in [krull2019noise2void], starting from a fixed initial learning rate and using batch normalization.
Training is done for a fixed number of epochs and steps per epoch, with each training batch augmented 8 fold by flips and 90 degree rotations.
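To make the 3-class formulation and the border weighting concrete, the following sketch derives per-pixel classes and loss weights from an instance label image. It is a simplified illustration, not our exact implementation:

```python
import numpy as np
from skimage.segmentation import find_boundaries

def three_class_target(instances):
    """Map an instance label image to 0 = background, 1 = foreground, 2 = border."""
    target = np.zeros(instances.shape, dtype=np.int64)
    target[instances > 0] = 1
    target[find_boundaries(instances, mode="outer")] = 2
    return target

def pixel_weights(target, border_weight=5.0):
    """Per-pixel loss weights: the border class is weighted five times higher."""
    weights = np.ones(target.shape, dtype=np.float32)
    weights[target == 2] = border_weight
    return weights

# Training then minimizes a weighted pixel-wise cross-entropy, e.g. in PyTorch:
# loss = (F.cross_entropy(logits, target_t, reduction="none") * weights_t).mean()
```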
StarDist Segmentation Network: The number of feature maps, batch size, convolution kernels, network depth, learning rate, number of training epochs, and steps per epoch for StarDist are set as described above. Again, the training data is augmented 8 fold by flips and 90 degree rotations (see the sketch below). However, StarDist uses a different set of output channels, trained as explained in [schmidt2018].
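The 8-fold augmentation corresponds to the eight symmetries of the square: four 90 degree rotations, each with and without mirroring. A minimal sketch:

```python
import numpy as np

def augment8(image, labels):
    """Return the 8 flip/rotation variants (dihedral group of the square)
    of an (image, labels) training pair."""
    pairs = []
    for k in range(4):  # 0, 90, 180, 270 degree rotations
        img_r, lbl_r = np.rot90(image, k), np.rot90(labels, k)
        pairs.append((img_r, lbl_r))
        pairs.append((np.fliplr(img_r), np.fliplr(lbl_r)))  # mirrored variant
    return pairs
```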
3 Data and Evaluation Metrics
Table: AP and SEG scores for all schemes on DSB 2018 (noise level n40) and BBBC 004 (noise level n200). Columns S1 to S10 denote the ten nested training subsets, ordered from least (S1) to most (S10) training data.

DSB 2018 n40

| Scheme | Metric | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| U-Net | AP | 0.4777 | 0.4944 | 0.5439 | 0.5912 | 0.6214 | 0.6551 | 0.6645 | 0.6834 | 0.7304 | 0.7199 |
| U-Net | SEG | 0.5218 | 0.5634 | 0.5840 | 0.6095 | 0.6217 | 0.6403 | 0.6493 | 0.6685 | 0.6835 | 0.6929 |
| U-Net Sequential | AP | 0.5608 | 0.5862 | 0.6127 | 0.6523 | 0.6679 | 0.6791 | 0.6958 | 0.7226 | 0.7360 | 0.7373 |
| U-Net Sequential | SEG | 0.5675 | 0.5938 | 0.6160 | 0.6349 | 0.6483 | 0.6608 | 0.6700 | 0.6890 | 0.6960 | 0.6950 |
| U-Net Finetune | AP | 0.5357 | 0.5518 | 0.5971 | 0.6286 | 0.6430 | 0.6658 | 0.6731 | 0.7013 | 0.7140 | 0.7261 |
| U-Net Finetune | SEG | 0.5628 | 0.5711 | 0.5987 | 0.6253 | 0.6346 | 0.6444 | 0.6580 | 0.6681 | 0.6840 | 0.6901 |
| U-Net Finetune Seq. | AP | 0.5944 | 0.6259 | 0.6357 | 0.6646 | 0.6761 | 0.6839 | 0.7028 | 0.7158 | 0.7261 | 0.7267 |
| U-Net Finetune Seq. | SEG | 0.5927 | 0.6212 | 0.6262 | 0.6499 | 0.6529 | 0.6611 | 0.6686 | 0.6813 | 0.6898 | 0.6870 |
| StarDist | AP | 0.4796 | 0.6085 | 0.6400 | 0.6620 | 0.7572 | 0.7679 | 0.7795 | 0.7827 | 0.7884 | 0.7883 |
| StarDist | SEG | 0.4789 | 0.5639 | 0.5735 | 0.5913 | 0.6683 | 0.6788 | 0.6948 | 0.6997 | 0.7087 | 0.7150 |
| StarDist Sequential | AP | 0.6802 | 0.7331 | 0.7337 | 0.7549 | 0.7640 | 0.7761 | 0.7766 | 0.7876 | 0.7914 | 0.7939 |
| StarDist Sequential | SEG | 0.6004 | 0.6399 | 0.6548 | 0.6727 | 0.6877 | 0.6906 | 0.6987 | 0.7044 | 0.7107 | 0.7141 |

BBBC 004 n200

| Scheme | Metric | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| U-Net | AP | 0.5218 | 0.5634 | 0.5840 | 0.6095 | 0.6217 | 0.6403 | 0.6493 | 0.6685 | 0.6835 | 0.6929 |
| U-Net | SEG | 0.6175 | 0.6405 | 0.6736 | 0.6720 | 0.6868 | 0.6897 | 0.7046 | 0.7161 | 0.7211 | 0.7114 |
| U-Net Sequential | AP | 0.5675 | 0.5938 | 0.6160 | 0.6349 | 0.6483 | 0.6608 | 0.6700 | 0.6890 | 0.6960 | 0.6950 |
| U-Net Sequential | SEG | 0.6781 | 0.7074 | 0.7083 | 0.7069 | 0.7106 | 0.7179 | 0.7132 | 0.7112 | 0.7205 | 0.7179 |
| U-Net Finetune | AP | 0.5628 | 0.5711 | 0.5987 | 0.6253 | 0.6346 | 0.6444 | 0.6580 | 0.6681 | 0.6840 | 0.6901 |
| U-Net Finetune | SEG | 0.6624 | 0.6822 | 0.6856 | 0.6889 | 0.6955 | 0.7129 | 0.7073 | 0.7148 | 0.7219 | 0.7201 |
| U-Net Finetune Seq. | AP | 0.5927 | 0.6212 | 0.6262 | 0.6499 | 0.6529 | 0.6611 | 0.6686 | 0.6813 | 0.6898 | 0.6870 |
| U-Net Finetune Seq. | SEG | 0.6996 | 0.7021 | 0.7082 | 0.7118 | 0.7099 | 0.7235 | 0.7237 | 0.7120 | 0.7162 | 0.7168 |
| StarDist | AP | 0.4789 | 0.5639 | 0.5735 | 0.5913 | 0.6683 | 0.6788 | 0.6948 | 0.6997 | 0.7087 | 0.7150 |
| StarDist | SEG | 0.6313 | 0.6597 | 0.6823 | 0.6914 | 0.7018 | 0.7050 | 0.7113 | 0.7135 | 0.7128 | 0.7174 |
| StarDist Sequential | AP | 0.6004 | 0.6399 | 0.6548 | 0.6727 | 0.6877 | 0.6906 | 0.6987 | 0.7044 | 0.7107 | 0.7141 |
| StarDist Sequential | SEG | 0.6895 | 0.6992 | 0.7001 | 0.7007 | 0.7103 | 0.7116 | 0.7146 | 0.7153 | 0.7202 | 0.7224 |
In this work we use publicly available data, which we randomly split into training (85%) and test (15%) sets. We further split the training data into ten nested subsets of increasing size, which we use to test our methods in data-limited training regimes. Additionally, we corrupt the raw microscopy data with pixel-independent, identically distributed Gaussian noise. Sample images for all datasets are shown in Fig. 4.
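Both preprocessing steps are straightforward; the following sketch summarizes them. The mapping of noise level names such as n40 to a Gaussian standard deviation of 40, as well as the subset fractions, are assumptions made for illustration:

```python
import numpy as np

def corrupt(image, sigma, rng=None):
    """Add pixel-independent, identically distributed Gaussian noise
    (e.g. sigma = 40 for a noise level we would call n40; an assumption)."""
    rng = rng or np.random.default_rng()
    return image + rng.normal(0.0, sigma, image.shape)

def nested_subsets(images, labels, fractions):
    """Build nested training subsets: every larger subset contains
    all images of the smaller ones."""
    n = len(images)
    return [(images[:max(1, int(f * n))], labels[:max(1, int(f * n))])
            for f in sorted(fractions)]
```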
DSB 2018 Data: From the Kaggle 2018 Data Science Bowl challenge, we take the same subset of data as was used in [schmidt2018].
BBBC 004 Data: This data is available from the Broad Bioimage Benchmark Collection and consists of synthetic nuclei images.
Since the data is synthetic, perfect GT labels are available by construction.
Here we use only those images which have been generated with an overlap probability of 0.
All experiments we conduct are evaluated in terms of Average Precision (AP) [everingham2010pascal] and SEG [ulman2017objective]. The SEG measure is based on the Jaccard similarity index $J(S, R) = |S \cap R| / |S \cup R|$, computed for matching objects $S$ and $R$. A ground truth object $R$ and a segmented object $S$ are considered to be matching if and only if more than 50% of the pixels of $R$ are overlapped by pixels in $S$. Average Precision, in contrast, counts the ratio of true positives to the sum of true positives, false positives, and false negatives, i.e. $\mathrm{AP} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$. All AP and SEG values we report here are obtained by finding the threshold that maximizes AP. For the U-Net this threshold is used to cut the foreground probability maps into discrete image regions.
For StarDist the threshold controls the non-maxima suppression step [schmidt2018].
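For clarity, the following sketch computes the Jaccard matrix, a simplified AP (greedy matching at an assumed IoU threshold of 0.5), and SEG. It illustrates the definitions above rather than our exact evaluation code:

```python
import numpy as np

def jaccard_matrix(gt, pred):
    """J(S, R) = |S ∩ R| / |S ∪ R| for every (GT object R, predicted object S)."""
    gt_ids = [i for i in np.unique(gt) if i != 0]
    pr_ids = [i for i in np.unique(pred) if i != 0]
    J = np.zeros((len(gt_ids), len(pr_ids)))
    for a, r in enumerate(gt_ids):
        R = gt == r
        for b, s in enumerate(pr_ids):
            S = pred == s
            J[a, b] = (R & S).sum() / (R | S).sum()
    return J

def average_precision(J, iou_thr=0.5):
    """AP = TP / (TP + FP + FN), greedily matching each GT object to its
    best unmatched prediction at IoU >= iou_thr (simplified matching)."""
    tp, used = 0, set()
    for row in J:
        for b in np.argsort(row)[::-1]:
            if row[b] < iou_thr:
                break
            if b not in used:
                used.add(b)
                tp += 1
                break
    fp, fn = J.shape[1] - tp, J.shape[0] - tp
    return tp / max(1, tp + fp + fn)

def seg_score(gt, pred):
    """SEG: for each GT object R, take the predicted object covering more
    than half of R (if any) and average J(S, R) over all GT objects."""
    scores = []
    for r in np.unique(gt):
        if r == 0:
            continue
        R = gt == r
        s = int(np.bincount(pred[R]).argmax())  # label covering most of R
        S = pred == s
        if s == 0 or (R & S).sum() <= 0.5 * R.sum():
            scores.append(0.0)  # no prediction covers the majority of R
        else:
            scores.append((R & S).sum() / (R | S).sum())
    return float(np.mean(scores)) if scores else 0.0
```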
4 Results
We investigated all setups described above, on all noise levels, using all subsets of training data.
Each experiment on the DSB data, as well as each experiment on the BBBC data, was repeated multiple times, allowing us to report mean performance and standard error for selected noise levels in the corresponding figures.
Across all results, we observe that all our proposed schemes outperform their respective baselines when the amount of available training data is limited. The Finetune Sequential scheme typically performs best among all U-Net based pixel classification pipelines. StarDist, in itself a more powerful method, is indeed the stronger baseline. As before, the proposed StarDist Sequential scheme clearly outperforms its baseline when fewer training images are available. Note that even if ample training data is provided, all our proposed training schemes perform at least on par with their baselines. Importantly, the improved results of sequential training schemes are not due to limiting network sizes: we tested various network sizes for both baseline methods and settled on the best performing configuration we could find.
To our surprise, on the BBBC data, both sequential U-Net schemes outperform StarDist and StarDist Sequential for the n150 and n200 noise levels, despite the StarDist baseline consistently and significantly outperforming the U-Net baseline.
A visual comparison of segmentation results for all methods trained on a small training subset is given in Fig. 4. For the DSB data we show insets that exemplify the frequently occurring problem of merged segments (which hurts AP), while the BBBC insets show variations in segmented areas (which hurt SEG). These mistakes are most pronounced for the baseline schemes, whereas the sequential schemes for both U-Net and StarDist generally yield better segmentations.
5 Discussion
It is known that the denoising and segmentation tasks overlap [zamir2018taskonomy]. We interpret this to mean that segmentation networks inherently solve the denoising task to a certain extent, and vice versa. In this work we investigated how disentangling the two can be exploited in practice when noisy data is abundant but annotations are rare, a situation that is virtually ubiquitous in biomedical applications.
In these situations, all our proposed schemes show above-baseline performance. Among all conducted experiments, sequential training schemes generally lead to the best results. Since this holds not only for the simple U-Net baseline but also for StarDist, it stands to reason that similar observations would hold for other DL based approaches and tasks. This suggests that N2V, or other self-supervised denoising methods [krull2019probabilistic, batson2019noise2self], can serve as universal preprocessing blocks for networks solving any given super-task in which denoising is a helpful sub-task.
These denoising blocks can benefit from the whole body of available noisy data, without relying on annotated GT labels required for the super-task. Hence, finding sensible training schedules to train such larger, modular networks is a promising direction of research.
In summary, we show that commonly used segmentation networks can be boosted in performance by combining them in various ways with self-supervised denoising modules. Our work offers simple recipes for improving DL based segmentation results. Since the gains are largest in low signal-to-noise regimes and when segmentation GT is limited, we expect direct benefits for the biomedical imaging community.
6 Acknowledgements
The authors would like to acknowledge the Scientific Computing Facility at MPI-CBG and the HPC Cluster at the Center for Information Services and High Performance Computing (ZIH) at TU Dresden for giving us access to their HPC resources. We also thank Uwe Schmidt, Martin Weigert, and Vladimir Ulman from MPI-CBG for helpful discussions. This work was supported by the German Federal Ministry of Research and Education (BMBF) under the code 01IS18026C (ScaDS2).