Learned Watershed: End-to-End Learning of Seeded Segmentation

by   Steffen Wolf, et al.

Learned boundary maps are known to outperform hand-crafted ones as a basis for the watershed algorithm. We show, for the first time, how to train watershed computation jointly with boundary map prediction. The estimator for the merging priorities is cast as a neural network that is convolutional (over space) and recurrent (over iterations). The latter allows learning of complex shape priors. The method gives the best known seeded segmentation results on the CREMI segmentation challenge.



1 Introduction

The watershed algorithm is an important computational primitive in low-level computer vision. Since it does not penalize segment boundary length, it exhibits no shrinkage bias like multi-terminal cuts or (conditional) random fields and is especially suited to segment objects with high surface-to-volume ratio, e.g. neurons in biological images.

In its classic form, the watershed algorithm comprises three basic steps: altitude computation, seed definition, and region assignment. These steps are designed manually for each application of interest. In a typical setup, the altitude is the output of an edge detector (e.g. the gradient magnitude or the gPb detector [2]), the seeds are located at the local minima of the altitude image, and pixels are assigned to seeds according to the drop-of-water principle [9].

In light of the very successful trend towards learning-based image analysis, it is desirable to eliminate hand-crafted heuristics from the watershed algorithm as well. Existing work shows that learned edge detectors significantly improve segmentation quality, especially when convolutional neural networks (CNNs) are used [7, 27, 33, 4]. We take this idea one step further and propose to learn altitude estimation and region assignment jointly, in an end-to-end fashion: Our approach no longer employs an auxiliary objective (e.g. accurate boundary strength prediction), but trains the altitude function together with the subsequent region assignment decisions so that the final segmentation error is minimized directly. The resulting training algorithm is closely related to reinforcement learning.



Figure 1: Illustration of watershed learning. (a) Raw data with two seeds, ground truth boundary (green), and false boundary (red) at an early training stage. (b) A hole in the altitude map causes the left region to bleed out along the orange path, meeting the blue path at a false watershed. (c) Profile of topographic distances along the two paths. Training increases the altitude at the missed edge (green) and decreases it at the false edge (red). Upon convergence, dotted paths meet at the correct location.

Our method keeps the basic structure of the watershed algorithm intact: Starting from given seeds (incorporating seed definition into end-to-end learning is a future goal of our research, but beyond the scope of this paper), we maintain a priority queue storing the topographic distance of candidate pixels to their nearest seed. Each iteration assigns the currently best candidate to “its” region and updates the queue. The topographic distance is induced by an altitude function estimated with a CNN. Crucially, and deviating from prior work, we compute altitudes on demand, allowing their conditioning on prior decisions, i.e. partial segmentations. The CNN thus gets the opportunity to learn priors for likely region shapes in the present data. We show how these models can be trained end-to-end from given ground truth segmentations using structured learning. Our experiments show that the resulting segmentations are better than those from hand-crafted algorithms or unstructured learning.

2 Related Work

Various authors demonstrated that learned boundary probabilities (or, more generally, boundary strengths) are superior to designed ones. In the most common setting, these probabilities are defined on the pixel grid, i.e. on the nodes of a grid graph, and serve as input of a node-based watershed algorithm. Training minimizes a suitable loss (e.g. squared or cross-entropy loss) between the predicted probabilities and manually generated ground truth boundary maps in an unstructured manner, i.e. over all pixels independently. This approach works especially well with powerful models like CNNs. In the important application of connectomics (see section 6.3), this was first demonstrated by [14]. A much deeper network [7] was the winning entry of the ISBI 2012 Neuro-Segmentation Challenge [3]. Results could be improved further by progress in CNN architectures and more sophisticated data augmentation, using e.g. U-Nets [27], FusionNets [25] or networks based on inception modules [5]. Clustering of the resulting watershed superpixels by means of the GALA algorithm [24, 16] (using altitudes from [3] and [27], respectively) or the lifted multicut [5] (using altitudes from their own CNN) led to additional performance gains.

When ground truth is provided in terms of region labels rather than boundary maps, a suitable boundary map must be created first. Simple morphological operations were found sufficient in [27], while [5] preferred smooth probabilities derived from a distance transform starting at the true boundaries. Outside connectomics, [4] achieved superior results by defining the ground truth altitude map in terms of the vector distance transform, which allows optimizing the prediction’s gradient direction and height separately.

Alternatively, one can employ the edge-based watershed algorithm and learn boundary probabilities for the grid graph’s edges. The corresponding ground truth simply indicates if the end points of each edge are supposed to be in different segments or not. From a theoretical perspective, the distinction between node- and edge-based watersheds is not very significant because both can be transformed into each other [22]. However, the algorithmic details differ considerably. Edge-based altitude learning was first proposed in [12], who used hand-crafted features and logistic regression. Subsequently, [31] employed a CNN to learn features and boundary probabilities simultaneously. Watershed superpixel generation and clustering on the basis of these altitudes was investigated in [35].

Learning with unstructured loss functions has the disadvantage that an error at a single point (node or edge) has little effect on the loss, but may lead to large segmentation errors: A single missed boundary pixel can cause a big false merger. Learning with structured loss functions, as advocated in this paper, avoids this by considering the boundaries in each image jointly, so that the loss can be defined in terms of segmentation accuracy rather than pointwise differences. Holistically-nested edge detection [33, 17] achieves a weak form of this by coupling the loss at multiple resolutions using deep supervision. Such a network was successfully used as a basis for watershed segmentation in [6]. The MALIS algorithm [30] computes shortest paths between pairs of nodes and applies a correction to the highest edge along paths affected by false splits or mergers. This is similar to our training, but we apply corrections to root error edges as defined below. Learned, sparse reconstruction methods such as MaskExtend [20] and Flood-filling networks [15] predict region membership for all nodes in a patch jointly, performing region growing for one seed at a time in a one-against-the-rest fashion. In contrast, our algorithm grows all seeds simultaneously and competitively.

3 Mathematical Framework

The watershed algorithm is especially suitable when regions are primarily defined by their boundaries, not by appearance differences. This is often the case when the goal is instance segmentation (one neuron vs. its neighbors) as opposed to semantic segmentation (neurons vs. blood vessels). In graphical model terms, pairwise potentials between adjacent nodes are crucial in this situation, whereas unary potentials are of lesser importance or missing altogether. Many real-world applications have these characteristics, see [8] and section 6 for examples.

We consider 4-connected grid graphs $G = (V, E)$. The input image $I: V \to \mathbb{R}^D$ maps all nodes to $D$-dimensional vectors of raw data. A segmentation is defined by a label image $S: V \to \{1, \dots, n\}$ specifying the region index or label of each node. The ground truth segmentation is called $S^*$. Pairwise potentials (i.e. edge weights) are defined by an altitude function over the graph’s edges

$$f: E \to \mathbb{R} \tag{1}$$

where higher values indicate stronger boundary evidence. Since this paper focuses on how to learn $f$, we assume that a set of seed nodes $M = \{m_1, \dots, m_n\} \subset V$ is provided by a suitable oracle (see section 6 for details). The watershed algorithm determines $S$ by finding a mapping $\sigma: V \to M$ that assigns each node to the best seed so that

$$S(u) = k \iff \sigma(u) = m_k \tag{2}$$

Initially, node assignments are unknown (designated by $\emptyset$) except at the seeds, where they are assumed to be correct:

$$\sigma_0(u) = \begin{cases} m_k & \text{if } u = m_k \in M \\ \emptyset & \text{otherwise} \end{cases} \tag{3}$$

In this paper, we build upon the edge-based variant of the watershed algorithm [21, 9]. This variant is also known as watershed cuts because segment boundaries are defined by cuts in the graph, i.e. by the set of edges whose incident nodes have different labels. We denote the cuts in our solution as $\partial S$ and in the ground truth as $\partial S^*$.

Let $\Pi(m, u)$ denote the set of all paths from seed $m$ to node $u$. Then the max-arc topographic distance between $m$ and $u$ is defined as [11]

$$T(m, u) = \min_{\pi \in \Pi(m, u)} \ \max_{e \in \pi} f(e) \tag{4}$$

In words, the highest edge in a path determines the path’s altitude, and the path of lowest altitude determines the topographic distance. The watershed algorithm assigns each node to the topographically closest seed [26]:

$$\sigma(u) = \arg\min_{m \in M} T(m, u) \tag{5}$$

The minimum distance path from seed $\sigma(u)$ to node $u$ shall be denoted by $\pi(u)$. This path is not necessarily unique, but ties are rare and can be broken arbitrarily when $f$ is a real-valued function of noisy input data.

It was shown in [9] that the resulting partitioning is equivalent to the minimum spanning forest (MSF) over seeds $M$ and edge weights $f$. Thus, we can compute the watershed segmentation incrementally using Prim’s algorithm: Starting from initial seeds $M$, each iteration finds the lowest edge whose start point is already assigned, but whose end point is not

$$(u, v) = \arg\min_{(u, v) \in E:\ \sigma(u) \neq \emptyset,\ \sigma(v) = \emptyset} f(u, v) \tag{6}$$

and propagates the seed assignment from $u$ to $v$:

$$\sigma(v) \leftarrow \sigma(u) \tag{7}$$

In a traditional watershed implementation, the altitude is a fixed, hand-designed function of the input data

$$f = f_{\text{fixed}}(u, v \mid I) \tag{8}$$

for example, the image’s Gaussian gradient magnitude or the “global Probability of boundary” (gPb) detector [2].
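The incremental computation with Prim's algorithm can be illustrated by a minimal Python sketch. It assumes a generic undirected graph given as a dictionary of edge altitudes rather than an image grid; the function name `seeded_watershed`, the toy line graph and the tie-breaking counter are illustrative choices, not the authors' implementation.

```python
import heapq

def seeded_watershed(altitude, seeds):
    """Edge-based seeded watershed via Prim's algorithm.

    altitude: dict {(u, v): f} of undirected edge altitudes (hypothetical format)
    seeds:    dict {seed node: label}
    Returns a dict {node: label} describing the minimum spanning forest.
    """
    neighbors = {}
    for (u, v), f in altitude.items():
        neighbors.setdefault(u, []).append((f, v))
        neighbors.setdefault(v, []).append((f, u))

    label = dict(seeds)      # initially only the seeds are assigned
    queue, tick = [], 0      # tick breaks ties between equal altitudes

    def push_edges(u):
        nonlocal tick
        for f, v in neighbors.get(u, []):
            if v not in label:
                heapq.heappush(queue, (f, tick, u, v))
                tick += 1

    for s in seeds:
        push_edges(s)
    while queue:
        f, _, u, v = heapq.heappop(queue)   # lowest edge leaving the assigned set
        if v in label:                      # end point was assigned in the meantime
            continue
        label[v] = label[u]                 # propagate the seed assignment from u to v
        push_edges(v)
    return label

# Line graph 0-1-2-3-4 with a high edge between nodes 1 and 2.
altitude = {(0, 1): 0.1, (1, 2): 0.9, (2, 3): 0.2, (3, 4): 0.1}
labels = seeded_watershed(altitude, seeds={0: "a", 4: "b"})
print(sorted(labels.items()))  # [(0, 'a'), (1, 'a'), (2, 'b'), (3, 'b'), (4, 'b')]
```

In this toy example the cut falls on the highest edge (1, 2), which matches the max-arc topographic distance: node 2 is reached from seed 4 at altitude 0.2, but from seed 0 only at altitude 0.9.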

4 Joint Structured Learning of Altitude and Region Assignment

We propose to use structured learning to train an altitude regressor jointly with the region assignment procedure defined by Prim’s algorithm. We will discuss two types of learnable altitude functions: $f_{\text{static}}$ comprises models that, once trained, only depend on the input image $I$, whereas $f_{\text{dynamic}}$ additionally incorporates dynamically changing information about the current state of Prim’s algorithm.

4.1 Static Altitude Prediction

To find optimal parameters $\theta$ of a model $f_{\text{static}}(e \mid I; \theta)$, consider how Prim’s algorithm proceeds: It builds an MSF which assigns each node $u$ to the closest seed $\sigma(u)$ by identifying the shortest path $\pi(u)$ from $\sigma(u)$ to $u$. Such a path can be wrong in two ways: it may cross $\partial S^*$ and thus miss a ground truth cut edge, or it may end at a false cut edge, placing the cut in the interior of a ground truth region (see figure 1). More formally, we have to distinguish two failure modes: (i) A node was assigned to the wrong seed, i.e. $\sigma(u) \neq \sigma^*(u)$, or (ii) it was assigned to the correct seed via a non-admissible path, i.e. a path taking a detour across a different region. To treat both cases uniformly, we construct the corresponding ground truth paths $\pi^*(u)$.


Figure 2: Example of root errors in the minimal spanning forest (a) and the constrained MSF (b) of a grid graph. The root error in (a) is a missing cut, the root error in (b) is a false cut. Orange and blue indicate the segmentation $S$ in (a) and $S^*$ in (b). The root errors $r(u)$ and $r^*(u)$ of a wrongly labeled node $u$ are marked red, with corresponding paths $\pi(u)$ and $\pi^*(u)$ depicted by arrows.

These paths can be found by running Prim’s algorithm with a modified altitude

$$\hat f(e) = \begin{cases} \infty & \text{if } e \in \partial S^* \\ f(e) & \text{otherwise} \end{cases} \tag{9}$$

forcing cuts in the resulting constrained MSF to coincide with $\partial S^*$ (see figure 2). We denote the topographic distances along $\pi(u)$ and $\pi^*(u)$ as $T(u)$ and $T^*(u)$ respectively. By construction of the MSF, $T(u)$ and $T^*(u)$ are equal for all correct nodes. Conversely, they differ for incorrect nodes, causing distance $T^*(u)$ to exceed distance $T(u)$. This property defines the set of incorrect nodes:

$$V_- = \{ u \in V : T^*(u) > T(u) \} \tag{10}$$

Every incorrect path $\pi(u)$ contains at least one erroneous cut edge. The first such edge shall be called the path’s root error edge $r(u)$ and is always a missing cut. Training should increase its altitude until it becomes part of the cut set $\partial S$. The root error edge $r^*(u)$ of a ground truth path is the first false cut edge in $\pi^*(u)$ in failure mode (i) and the first edge where $\pi^*(u)$ deviates from $\pi(u)$ in mode (ii). Here, the altitude should be decreased to make the edge part of the MSF, see figure 2. Accordingly, we denote the sets of root edges as $R = \{ r(u) : u \in V_- \}$ and $R^* = \{ r^*(u) : u \in V_- \}$.

Since all assignment decisions in Prim’s algorithm are conditioned on decisions taken earlier, the errors in any path also depend on the path’s root error. Structured learning must therefore consider these errors jointly, and we argue that training updates must be derived solely from the root edges: They are the only locations whose required correction direction is unambiguously known. In contrast, we cannot even tell if subsequent errors will simply disappear once the root error has been fixed, or need updates of their own. When the latter applies, however, these edges will eventually become root errors in later training epochs, and we delay updating them to that point.

Since we need a differentiable loss to perform gradient descent, we use the perceptron loss of distance differences:

$$L(\theta) = \sum_{u \in V_-} \big( T^*(u) - T(u) \big) \tag{11}$$

Correct nodes have zero contribution since $T^*(u) = T(u)$ holds for them. To serve as a basis for structured learning, we transform this into a loss over altitude differences at root edges. Since topographic distances equal the highest altitude along the shortest path, we have

$$T(u) = \max_{e \in \pi(u)} f(e) \ \geq\ f(r(u)) \tag{12}$$

To derive similar relations for $T^*(u)$, consider how the constrained MSF is constructed from the unconstrained one: First, edges crossing $\partial S^*$ are removed from the MSF. Each of the resulting orphaned subgraphs is then reconnected into the constrained MSF via the lowest edge not crossing $\partial S^*$. The newly inserted edges are the root edges of all their child nodes, i.e. all nodes in the respective subgraph. Since these root edges did not belong to the original MSF, their altitude cannot be less than the maximum altitude in the corresponding child subgraph. For $u \in V_-$, it follows that

$$T^*(u) = \max_{e \in \pi^*(u)} f(e) = f(r^*(u)) \tag{13}$$

We can therefore upper-bound the perceptron loss by

$$L(\theta) \ \leq\ \sum_{u \in V_-} \big( f(r^*(u)) - f(r(u)) \big) \tag{14}$$

and minimize this upper bound. By rearranging the sum, the loss can be simplified into

$$\tilde L(\theta) = \sum_{e \in R^*} w(e)\, f(e) \ -\ \sum_{e \in R} w(e)\, f(e) \tag{15}$$

where we introduced a weight function counting the children of each root edge

$$w(e) = \big| \{ u \in V_- : r(u) = e \ \text{or}\ r^*(u) = e \} \big| \tag{16}$$


A training epoch of structured learning thus consists of the following steps:

  1. Compute $f$ and $\hat f$ with current model parameters $\theta_t$ and determine the MSF and the constrained MSF.

  2. Identify root edges and define the weights $w(e)$ and the loss from equation (15).

  3. Obtain an updated parameter vector $\theta_{t+1}$ via gradient descent on this loss at $\theta_t$.

These steps are iterated until convergence, yielding the final parameter vector.
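Steps 1 and 2 can be sketched in Python on a small graph. This is an illustration under stated assumptions, not the authors' implementation: altitudes are given as a plain dictionary instead of CNN outputs, every ground truth region is assumed to be connected and to contain a seed, and the root edge of a ground truth path is taken to be its first edge that is absent from the unconstrained MSF (following the reconnection argument above). The names `prim`, `path` and `root_edge_loss` are hypothetical.

```python
import heapq
from collections import Counter

def prim(altitude, seeds):
    """Seeded Prim: return parent pointers {node: predecessor} of the forest."""
    nbrs = {}
    for (u, v), f in altitude.items():
        nbrs.setdefault(u, []).append((f, v))
        nbrs.setdefault(v, []).append((f, u))
    parent = {s: None for s in seeds}
    queue = [(f, s, v) for s in seeds for f, v in nbrs.get(s, [])]
    heapq.heapify(queue)
    while queue:
        f, u, v = heapq.heappop(queue)
        if v in parent:
            continue
        parent[v] = u
        for g, w in nbrs[v]:
            if w not in parent:
                heapq.heappush(queue, (g, v, w))
    return parent

def path(parent, u):
    """Edges of the forest path from the seed to u, starting at the seed."""
    edges = []
    while parent[u] is not None:
        edges.append(frozenset((parent[u], u)))
        u = parent[u]
    return edges[::-1]

def root_edge_loss(altitude, seeds, gt):
    f = {frozenset(e): a for e, a in altitude.items()}
    cut = {e for e in f if len({gt[u] for u in e}) == 2}    # ground truth cut edges
    blocked = {(u, v): float("inf") if frozenset((u, v)) in cut else a
               for (u, v), a in altitude.items()}           # modified altitude
    msf, cmsf = prim(altitude, seeds), prim(blocked, seeds)
    msf_edges = {frozenset((p, v)) for v, p in msf.items() if p is not None}
    w_missing, w_false = Counter(), Counter()               # children per root edge
    for u in msf:
        pi, pi_star = path(msf, u), path(cmsf, u)
        t = max((f[e] for e in pi), default=0.0)
        t_star = max((f[e] for e in pi_star), default=0.0)
        if t_star > t:                                      # u is an incorrect node
            w_missing[next(e for e in pi if e in cut)] += 1
            w_false[next(e for e in pi_star if e not in msf_edges)] += 1
    loss = (sum(w * f[e] for e, w in w_false.items())
            - sum(w * f[e] for e, w in w_missing.items()))
    return loss, w_missing, w_false

# Line graph 0-1-2-3-4; the true boundary lies between nodes 2 and 3,
# but the altitude there (0.2) is lower than at the false edge (1, 2).
altitude = {(0, 1): 0.1, (1, 2): 0.9, (2, 3): 0.2, (3, 4): 0.1}
gt = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b"}
loss, w_missing, w_false = root_edge_loss(altitude, [0, 4], gt)
print(round(loss, 6), dict(w_missing), dict(w_false))
```

In this example only node 2 is incorrect. Gradient descent on the returned loss would raise the altitude of the missed cut edge (2, 3) and lower that of the false cut edge (1, 2), which is the correction direction described in the text.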

4.2 Relation to Reinforcement Learning

In this section we compare the structured loss function with policy gradient reinforcement learning, which will serve as motivation for a refinement of the weighting function $w$. To see the analogy, we refer to continuous control deep reinforcement learning as proposed by [29, 28, 18].

Looking at the region growing procedure from a reinforcement learning perspective, we define states as tuples where is the edge under consideration, and the action space is the altitude to be predicted by . The Policy Gradient Theorem [29] defines the appropriate update direction of the parameter vector . In a continuous action space, it reads


where is the performance to be optimized, the discounted state distribution, the policy to be learned, and the action-value function estimating the discounted expected future reward


In our case, the state distribution reduces to because Prim’s algorithm reaches each edge exactly once. Inserting our deterministic altitude prediction


where is the Dirac distribution, we get


Comparing equation (20) with equation (15), we observe that , where our weights essentially play the role of the action-value function . This suggests to introduce a discount factor in . To do so, we replace the temporal differences between states in (18) with tree distances or counting the number of edges between node and its root edge. This gives the discounted weights


with discount factor $\gamma$ to be chosen such that the discounted weight decays roughly according to the size of the CNN’s receptive field. Substituting the discounted weights for the child-counting weights in (15) significantly improves convergence in our experiments. This analogy further motivates the application of current deep reinforcement training methods as described in section 5.2.
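A small Python sketch may help to see the effect of discounting. It assumes that each incorrect node contributes the discount factor raised to the number of edges between the node and its root edge, so that a node directly behind the root edge contributes 1; the function name `discounted_weights` and the data layout are hypothetical.

```python
from collections import defaultdict

def discounted_weights(paths, roots, gamma):
    """Sum discounted child contributions per root edge.

    paths: {node: list of edges from the seed to the node}
    roots: {node: root edge of that node's path}
    gamma: discount factor in (0, 1]; gamma = 1 recovers plain child counting
    """
    w = defaultdict(float)
    for u, root in roots.items():
        # number of path edges that lie between the root edge and node u
        dist = len(paths[u]) - 1 - paths[u].index(root)
        w[root] += gamma ** dist
    return dict(w)

# Three nodes behind the same root edge (1, 2), at tree distances 0, 1 and 2.
paths = {
    2: [(0, 1), (1, 2)],
    3: [(0, 1), (1, 2), (2, 3)],
    4: [(0, 1), (1, 2), (2, 3), (3, 4)],
}
roots = {2: (1, 2), 3: (1, 2), 4: (1, 2)}
print(discounted_weights(paths, roots, gamma=0.5))  # {(1, 2): 1.75}
print(discounted_weights(paths, roots, gamma=1.0))  # {(1, 2): 3.0}
```

With a discount factor below one, nodes far away from the root edge influence its update less, which limits the weight to a neighborhood comparable to the receptive field.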

4.3 Dynamic Altitude Prediction

In every iteration, region growing according to Prim’s algorithm only considers edges with exactly one end node located in the already assigned set. This offers the possibility to delay altitude estimation to the time when the values are actually needed. On-demand altitude computations can take advantage of the partial segmentations already available to derive additional shape clues that help resolving difficult assignment decisions.

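Relative Assignments: To incorporate partial segmentations, we remove their dependence on the incidental choice of label values by means of label-independent projection. Consider an edge $(u, v)$ where node $u$ is assigned to seed $\sigma(u)$ and node $v$ is unassigned. We now construct a labeling relative to $u$, distinguishing nodes assigned to $\sigma(u)$ (“me” region), to another seed (“them”) and unassigned (“nobody”). Relative labelings are represented by a standard 1-of-3 coding:

$$P_u(x) = \begin{cases} (1, 0, 0) & \text{if } \sigma(x) = \sigma(u) \quad \text{(“me”)} \\ (0, 1, 0) & \text{if } \sigma(x) \notin \{\sigma(u), \emptyset\} \quad \text{(“them”)} \\ (0, 0, 1) & \text{if } \sigma(x) = \emptyset \quad \text{(“nobody”)} \end{cases} \tag{22}$$

In practice, we process relative labelings by adding a new branch to our neural network that receives the relative labeling as an input, see section 5.1 for details.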
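The relative labeling can be illustrated with a few lines of Python. The channel order (me, them, nobody), the use of `None` for unassigned nodes and the function name `relative_labels` are assumptions of this sketch.

```python
def relative_labels(assignment, me):
    """Project a partial segmentation onto a label-independent 1-of-3 coding.

    assignment: 2D list holding a seed label per pixel, or None if unassigned
    me:         seed label of the region that currently wants to grow
    Returns a 2D list of one-hot tuples in the order (me, them, nobody).
    """
    ME, THEM, NOBODY = (1, 0, 0), (0, 1, 0), (0, 0, 1)
    return [
        [NOBODY if s is None else ME if s == me else THEM for s in row]
        for row in assignment
    ]

partial = [
    ["a", "a", None],
    ["b", None, None],
]
print(relative_labels(partial, me="a"))
# [[(1, 0, 0), (1, 0, 0), (0, 0, 1)], [(0, 1, 0), (0, 0, 1), (0, 0, 1)]]
```

Calling the function with `me="b"` swaps the roles of the two regions, so the network input does not depend on which label value a region happens to carry.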

Non-Markovian modeling: Another potentially useful cue is afforded by the fact that Prim’s algorithm propagates the assignments recursively. Thus, during every evaluation of $f_{\text{dynamic}}$, the complete history from previous iterations along the growth paths can be incorporated. We encode the history of past assignment decisions as an $N$-dimensional vector $h(u)$ in each node $u$. In practice, we incorporate history by adding a recurrent layer to our neural network.

We introduce the dynamic altitude prediction

$$\big( f(u, v),\ h(v) \big) = f_{\text{dynamic}}\big( u, v \mid I, P_u, h(u); \theta \big) \tag{23}$$

which receives the relative assignment projection $P_u$ and $u$’s hidden state $h(u)$ as additional inputs and outputs both the edge’s altitude $f(u, v)$ and $v$’s hidden state $h(v)$. This variant of the altitude estimator performs best in our experiments. The emergent behavior of our models suggests that the algorithm uses history to adjust local diffusion parameters such as preferred direction and “viscosity”, similar to the viscous watershed transform described in [32].

Figure 3: Overview of the implementation of the learned watershed algorithm with neural network and priority queue. In each iteration the minimal edge according to equation (6) is found using a priority queue (a) and the region label is propagated (b), which updates the relative assignment projection. For all unassigned edges that are not in the priority queue and need to be considered by Prim’s algorithm in the next iteration, the altitude is evaluated using the dynamic edge prediction network (c).

5 Methods

5.1 Neural Network Architecture

Our network architecture builds mainly on the work of Yu and Koltun [34] who introduced dilated convolutions to achieve dense segmentations and systematically aggregate multi-scale contextual information without pooling operations. We split our network into two convolutional branches (see Figure 4): The upper branch processes the static input (raw data and edge detector output), and the lower one the dynamic input (the region projections). Since the input of the upper branch doesn’t change during prediction, its network activations can be precomputed for all edges, leading to a significant speed-up. We choose gated recurrent units (GRU) instead of long short-term memory (LSTM) in the recurrent network part, because GRUs have no internal state and get all history from the hidden state vector, saving on memory and bookkeeping.

5.2 Training Methods

Augmenting the Input Image: We noted above that structured learning is superior because it considers edges jointly. However, it can only rely on the sparse training sets of root edges. In contrast, unstructured learning can make use of all edges and thus has a much bigger training set. This means that more powerful predictors, e.g. much deeper CNNs, can be trained, leading to more robust predictions and bigger receptive fields.

To combine the advantages of both approaches, we propose to augment the input image with an additional channel holding node boundary probabilities predicted by an unstructured model $f_{\text{CNN}}$:

$$\tilde I = \big[\, I,\ f_{\text{CNN}}(I) \,\big] \tag{24}$$

We train the CNN $f_{\text{CNN}}$ separately beforehand and replace $I$ with the augmented input $\tilde I$ everywhere in the static and dynamic altitude models. This simplifies structured learning because the predictor only needs to learn a refinement of the already reasonable altitudes in $f_{\text{CNN}}(I)$. In principle, one could even train the unstructured and the structured model jointly, but the combined model is too big for reliable optimization.

Training Schedule: Taking advantage of the close relationship with reinforcement learning, we adopt the asynchronous update procedure proposed by [23]. Here, independent workers fetch the current CNN parameters from the master and compute loss gradients for randomly selected training images in parallel. The master then applies these updates to the parameters in an asynchronous fashion. We found experimentally that this led to faster and more stable convergence than sequential training.

In order to train the recurrent network part, we replace the standard temporal input ordering with the succession of edges defined by the paths of the MSF and the constrained MSF. In a sense, backpropagation through time thus becomes backpropagation along the minimum spanning forest.

Figure 4: Network architecture: The static convolutional body extracts features from the raw input and edge detector output. The shallower dynamic body processes the interactions of different region projections. This information is combined in a fully connected layer and set into a temporal context using a recurrent GRU layer. The network output is the priority of the edge towards the pixel at the center of the field of view.

6 Experiments and Results

Our experiments illustrate the performance of our proposed end-to-end trainable watershed in combination with static and dynamic altitude prediction. To this end, we compare with standard watershed and power watershed algorithms [8] on statically trained CNNs according to [5], see section 6.2. Furthermore we show in section 6.3 that the learned watershed surpasses the state-of-the-art segmentation in an adapted version of the CREMI Neuron Segmentation Challenge [10].

6.1 Experimental Setup and Evaluation Metrics

Seed Generation Oracle: All segmentation algorithms start at initial seeds which are here provided by a “perfect” oracle. In our experiments, this oracle uses the ground truth segmentation to select one pixel with maximal distance to the region boundary per ground truth region.
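A seed oracle of this kind can be sketched in pure Python. The sketch below is an approximation under stated assumptions: distances are measured in 4-connected (city-block) steps via breadth-first search rather than with a Euclidean distance transform, the image border counts as a region boundary, and ties are broken by scan order. The function name `oracle_seeds` is hypothetical.

```python
from collections import deque

def oracle_seeds(gt):
    """Pick one pixel per ground truth region that is farthest from the boundary."""
    h, w = len(gt), len(gt[0])
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            for dy, dx in steps:
                ny, nx = y + dy, x + dx
                # boundary pixel: touches the image border or another label
                if not (0 <= ny < h and 0 <= nx < w) or gt[ny][nx] != gt[y][x]:
                    dist[y][x] = 0
                    queue.append((y, x))
                    break
    while queue:                      # multi-source BFS from all boundary pixels
        y, x = queue.popleft()
        for dy, dx in steps:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    seeds = {}
    for y in range(h):
        for x in range(w):
            best = seeds.get(gt[y][x])
            if best is None or dist[y][x] > dist[best[0]][best[1]]:
                seeds[gt[y][x]] = (y, x)
    return seeds

# Two rectangular regions of 5 x 5 pixels side by side.
gt = [[1] * 5 + [2] * 5 for _ in range(5)]
seeds = oracle_seeds(gt)
print(seeds)  # {1: (2, 2), 2: (2, 7)}
```

Both seeds land in the centers of their regions, which is where the distance to the boundary is largest.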

Segmentation Metrics: In accordance with the CREMI challenge [10], we use the following segmentation metrics: The Rand score measures the probability of agreement between segmentation $S$ and ground truth $S^*$ w.r.t. a randomly chosen node pair $(u, v)$. Two segmentations agree if both assign $u$ and $v$ to the same region or to different regions. The Rand error is the complement of the Rand score, so that smaller values are better.

The Variation of Information (VOI) between $S$ and $S^*$ is defined as $\mathrm{VOI} = H(S \mid S^*) + H(S^* \mid S)$, where $H$ is the conditional entropy [19]. To distinguish split errors from merge errors, we report the summands separately as $\mathrm{VOI}_{\text{split}} = H(S \mid S^*)$ and $\mathrm{VOI}_{\text{merge}} = H(S^* \mid S)$.
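Both metrics can be computed from label co-occurrence counts, as the following Python sketch shows. It implements the plain Rand score and the two conditional entropies (in bits) on flat label sequences; it does not reproduce the boundary tolerance or the adapted Rand variant of the official CREMI evaluation, and the function names are hypothetical.

```python
from collections import Counter
from math import comb, log2

def rand_score(seg, gt):
    """Fraction of node pairs on which two labelings agree."""
    total = comb(len(seg), 2)
    same_both = sum(comb(c, 2) for c in Counter(zip(seg, gt)).values())
    same_seg = sum(comb(c, 2) for c in Counter(seg).values())
    same_gt = sum(comb(c, 2) for c in Counter(gt).values())
    # pairs that are separated in both labelings
    apart_both = total - same_seg - same_gt + same_both
    return (same_both + apart_both) / total

def conditional_entropy(a, b):
    """H(A | B) in bits for two flat label sequences of equal length."""
    n = len(a)
    joint, marginal_b = Counter(zip(a, b)), Counter(b)
    return sum(-c / n * log2(c / marginal_b[y]) for (_, y), c in joint.items())

def voi(seg, gt):
    """Return (VOI split, VOI merge) = (H(seg | gt), H(gt | seg))."""
    return conditional_entropy(seg, gt), conditional_entropy(gt, seg)

gt = [1, 1, 2, 2]
seg = [1, 1, 1, 1]             # both true regions were merged into one
print(1 - rand_score(seg, gt))  # Rand error, here 2/3
print(voi(seg, gt))             # split part is zero, merge part is 1 bit
```

The pure merge error in this example shows up only in the merge summand, which is why reporting the two conditional entropies separately is informative.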

6.2 Artificial Data

Figure 5: Artificial data example. a) Raw data and prediction of baseline CNN. b) Ground truth. c) The brown region leaks out when standard watershed runs on top of baseline CNN. d) Our algorithm uses learned shape priors to close boundary gaps.

Dataset: In order to compare our static and dynamic models with solutions based on unstructured learning, we create an artificial segmentation benchmark dataset with variable difficulty. First, we generate an edge image via the zero crossing of a 2D Gaussian process. This image is then smoothed with a Gaussian filter and corrupted with Gaussian noise at different noise levels. For each noise level, we generate 1900 training images and 100 test images of size 252x252. One test image with corresponding ground truth and results is shown in figure 5.

Baseline: We choose a recent edge detection network from [34] to predict boundaries between different instances in combination with standard watershed (WS) and Power Watershed (PWS) [8] to generate an instance segmentation. Since these algorithms work best on slightly smoothed inputs, we apply Gaussian smoothing to the CNN output. The optimal smoothing parameters are determined by grid search on the training dataset. Additionally, we apply all watershed methods directly to the smoothed raw image and report their overall best result as RAW + WS.

Performance: The measured segmentation errors of all algorithms are shown in table 1. Observed differences in performance mainly indicate how well each method handles low-contrast edges and narrow gaps between regions. The structurally trained watersheds outperform the unstructured baselines, because our loss function heavily penalizes the resulting segmentation errors. In all experiments, the dynamic prediction function has the best performance, due to its superior modeling power. It can identify holes and close most contours correctly because it learns to derive shape and contingency clues from monitoring intermediate results during the flooding process. A representative example of this effect is shown in figure 5.

Method                 Noise level 1    Noise level 2    Noise level 3
Learned WS (dynamic)   5.8 ± 0.8        12.5 ± 1.7       32.2 ± 1.8
Learned WS (static)    6.4 ± 0.9        13.8 ± 1.6       32.4 ± 2.2
NN + WS                6.5 ± 0.8        14.9 ± 3.6       33.4 ± 1.7
NN + PWS               6.5 ± 0.8        14.9 ± 1.7       33.2 ± 1.7
RAW + WS               24.0 ± 1.6       41.9 ± 1.8       55.0 ± 1.8
Table 1: Quality of the segmentation results on our artificial dataset. For the baseline watersheds, we report the lowest error over all parameters, based on the Rand error and a 2 pixel boundary distance tolerance.

6.3 Neurite Segmentation

Dataset: The MICCAI Challenge on Circuit Reconstruction from Electron Microscopy Images [10] contains 375 fully annotated slices of electron microscopy images (of resolution 1250x1250 pixels). Part of a data slice is displayed in figure 7 top. Since the test ground truth segmentation has not been disclosed, we generate a new train/test split from the 3 original challenge training datasets by splitting them into 3x75 z-continuous training blocks and 3x50 z-continuous test blocks.

Ideally, we would compare with [5] whose results define the state-of-the-art on the CREMI Challenge at time of submission. However, their pipeline, as described in their supplementary material, optimizes 2D segmentations jointly across multiple slices with a complex graphical model, which is beyond the scope of this paper.

Instead, we isolate the 2D segmentation aspect by adapting the challenge in the following manner: We run each segmentation algorithm with fixed ground truth seeds (see section 6.1) and evaluate their results on each -slice separately. The restriction to 2D evaluation requires a slight manual correction of the ground truth: The ground truth accuracy in -direction is just slice. The official 3D evaluation scores compensate for this by ignoring deviations of pixels in -direction. Since this trick doesn’t work in 2D, we remove 4 regions with no visual evidence in the image and all segments smaller than 5 pixels. Boundary tolerances in the x-y plane are treated as in the official CREMI scores where deviations from the true boundary are ignored if they do not exceed 6.25 pixels.

(a) our method follows long thin neurites
(b) it finds weak boundaries using shape priors
(c) failure case
Figure 6: Detailed success and failure cases of our method.

Baseline: We compare the Learned Watershed performance against the Power Watershed [8], Viscous Watershed [32], Random Walker [13], Stochastic Watershed [1] and Distance Transform Watershed [5]. The boundary probability prediction (the same as in equation (24)) was provided by a deep CNN trained with an unstructured loss function. In particular, the Distance Transform Watershed (DTWS) and this prediction were used to produce the current state-of-the-art on the CREMI challenge. To obtain the DTWS, one thresholds the boundary probabilities, computes a distance transform of the background, i.e. the non-boundary pixels, and runs the watershed algorithm on the inverted distances. According to [5], this is the best known heuristic to close boundary gaps in these data, but requires manual parameter tuning. We found the parameters of all baseline algorithms by grid search using the training dataset. To ensure fair comparison, we start region growing from ground truth seeds in all cases. Our algorithm takes the augmented image from equation (24) as input and learns how to close boundary gaps.

Comparison to state-of-the-art: We show the 2D CREMI segmentation scores in table 2. It is evident that the learned watershed transform outperforms DTWS in both ARAND and VOI score. Qualitatively, we find that the flooding patterns and therefore the region shapes of the learned watershed prefer to adhere to biologically sensible structures. We illustrate this with our results on one CREMI test slice in figure 7, as well as specific examples in figure 6. We find throughout the dataset that especially thin processes, as depicted in Fig. 6(a) left, are a strength of our algorithm. Biologically sensible shape completions can also be found for roundish objects and are particularly noticeable when boundary evidence is weak, as shown in Fig. 6(b) center. However, in rare cases, we find incorrect shape completions (see Fig. 6(c) right), mainly in areas of weak boundary evidence. It stands to reason that these errors could be fixed by providing more training data.

Figure 7: From top: Raw data. Ground truth. Result of distance transform WS (arrows point out major errors). Our algorithm.
Method          ARAND            VOI split        VOI merge
PowerWS         0.122 ± 0.003    0.340 ± 0.031    0.180 ± 0.019
ViscousWS       0.093 ± 0.003    0.328 ± 0.030    0.069 ± 0.003
RandomWalker    0.103 ± 0.004    0.355 ± 0.037    0.060 ± 0.004
Stochastic WS   0.193 ± 0.012    0.612 ± 0.080    0.077 ± 0.004
DTWS            0.085 ± 0.001    0.320 ± 0.029    0.070 ± 0.005
Learned WS      0.082 ± 0.001    0.319 ± 0.030    0.057 ± 0.004
Table 2: CREMI segmentation metrics evaluated on 2D slices: The Variation of Information between a predicted segmentation and ground truth (lower is better) and the Adapted Rand Error (lower is better) [3].

7 Conclusion

This paper proposes an end-to-end learnable seeded watershed algorithm that performs well on artificial data and on EM images for neuron segmentation. We found the following aspects to be critical success factors: First, we train a very powerful CNN to control region growing. Second, the CNN is trained in a structured fashion, allowing it to optimize segmentation performance directly, instead of treating pixels independently. Third, we improve modeling power by incorporating dynamic information about the current state of the segmentation. Specifically, feeding the current partial segmentation into the CNN provides shape clues for the next assignment decision, and maintaining a latent history along assignment paths allows the network to adjust growing parameters locally. We demonstrate experimentally that the resulting algorithm successfully solves difficult configurations like narrow region parts and low-contrast boundaries, where previous algorithms fail. In future work, we plan to include seed generation into the end-to-end learning scheme.