1 Introduction
The watershed algorithm is an important computational primitive in low-level computer vision. Since it does not penalize segment boundary length, it exhibits none of the shrinkage bias of multi-terminal cuts or (conditional) random fields and is especially suited to segmenting objects with a high surface-to-volume ratio, e.g. neurons in biological images.
In its classic form, the watershed algorithm comprises three basic steps: altitude computation, seed definition, and region assignment. These steps are designed manually for each application of interest. In a typical setup, the altitude is the output of an edge detector (e.g. the gradient magnitude or the gPb detector [2]), the seeds are located at the local minima of the altitude image, and pixels are assigned to seeds according to the drop-of-water principle [9].
In light of the very successful trend towards learning-based image analysis, it is desirable to eliminate hand-crafted heuristics from the watershed algorithm as well. Existing work shows that learned edge detectors significantly improve segmentation quality, especially when convolutional neural networks (CNNs) are used [7, 27, 33, 4]. We take this idea one step further and propose to learn altitude estimation and region assignment jointly, in an end-to-end fashion: Our approach no longer employs an auxiliary objective (e.g. accurate boundary strength prediction), but trains the altitude function together with the subsequent region assignment decisions so that the final segmentation error is minimized directly. The resulting training algorithm is closely related to reinforcement learning.
Our method keeps the basic structure of the watershed algorithm intact: Starting from given seeds (incorporating seed definition into end-to-end learning is a future goal of our research, but beyond the scope of this paper), we maintain a priority queue storing the topographic distance of candidate pixels to their nearest seed. Each iteration assigns the currently best candidate to "its" region and updates the queue. The topographic distance is induced by an altitude function estimated with a CNN. Crucially, and deviating from prior work, we compute altitudes on demand, allowing their conditioning on prior decisions, i.e. partial segmentations. The CNN thus gets the opportunity to learn priors for likely region shapes in the present data. We show how these models can be trained end-to-end from given ground truth segmentations using structured learning. Our experiments show that the resulting segmentations are better than those from hand-crafted algorithms or unstructured learning.
2 Related Work
Various authors demonstrated that learned boundary probabilities (or, more generally, boundary strengths) are superior to designed ones. In the most common setting, these probabilities are defined on the pixel grid, i.e. on the nodes of a grid graph, and serve as input of a node-based watershed algorithm. Training minimizes a suitable loss (e.g. squared or cross-entropy loss) between the predicted probabilities and manually generated ground truth boundary maps in an unstructured manner, i.e. over all pixels independently. This approach works especially well with powerful models like CNNs. In the important application of connectomics (see section 6.3), this was first demonstrated by [14]. A much deeper network [7] was the winning entry of the ISBI 2012 Neuro-Segmentation Challenge [3]. Results could be improved further by progress in CNN architectures and more sophisticated data augmentation, using e.g. U-Nets [27], FusionNets [25] or networks based on inception modules [5]. Clustering of the resulting watershed superpixels by means of the GALA algorithm [24, 16] (using altitudes from [3] resp. [27]) or the lifted multicut [5] (using altitudes from their own CNN) led to additional performance gains. When ground truth is provided in terms of region labels rather than boundary maps, a suitable boundary map must be created first. Simple morphological operations were found sufficient in [27], while [5] preferred smooth probabilities derived from a distance transform starting at the true boundaries. Outside connectomics, [4] achieved superior results by defining the ground truth altitude map in terms of the vector distance transform, which allows optimizing the prediction's gradient direction and height separately.
Alternatively, one can employ the edge-based watershed algorithm and learn boundary probabilities for the grid graph's edges. The corresponding ground truth simply indicates if the end points of each edge are supposed to be in different segments or not. From a theoretical perspective, the distinction between node- and edge-based watersheds is not very significant, because both can be transformed into each other [22]. However, the algorithmic details differ considerably. Edge-based altitude learning was first proposed in [12], who used hand-crafted features and logistic regression. Subsequently, [31] employed a CNN to learn features and boundary probabilities simultaneously. Watershed superpixel generation and clustering on the basis of these altitudes was investigated in [35].

Learning with unstructured loss functions has the disadvantage that an error at a single point (node or edge) has little effect on the loss, but may lead to large segmentation errors: A single missed boundary pixel can cause a big false merger. Learning with structured loss functions, as advocated in this paper, avoids this by considering the boundaries in each image jointly, so that the loss can be defined in terms of segmentation accuracy rather than pointwise differences. Holistically-nested edge detection [33, 17] achieves a weak form of this by coupling the loss at multiple resolutions using deep supervision. Such a network was successfully used as a basis for watershed segmentation in [6]. The MALIS algorithm [30] computes shortest paths between pairs of nodes and applies a correction to the highest edge along paths affected by false splits or mergers. This is similar to our training, but we apply corrections to root error edges as defined below. Learned, sparse reconstruction methods such as MaskExtend [20] and flood-filling networks [15] predict region membership for all nodes in a patch jointly, performing region growing for one seed at a time in a one-against-the-rest fashion. In contrast, our algorithm grows all seeds simultaneously and competitively.

3 Mathematical Framework
The watershed algorithm is especially suitable when regions are primarily defined by their boundaries, not by appearance differences. This is often the case when the goal is instance segmentation (one neuron vs. its neighbors) as opposed to semantic segmentation (neurons vs. blood vessels). In graphical model terms, pairwise potentials between adjacent nodes are crucial in this situation, whereas unary potentials are of lesser importance or missing altogether. Many real-world applications have these characteristics; see [8] and section 6 for examples.
We consider 4-connected grid graphs $G = (V, E)$. The input image $I : V \to \mathbb{R}^D$ maps all nodes to $D$-dimensional vectors of raw data. A segmentation is defined by a label image $S$ specifying the region index or label of each node. The ground truth segmentation is called $S^*$. Pairwise potentials (i.e. edge weights) are defined by an altitude function over the graph's edges

(1)   $f : E \to \mathbb{R}$

where higher values indicate stronger boundary evidence. Since this paper focuses on how to learn $f$, we assume that a set of seed nodes $S_0 \subset V$ is provided by a suitable oracle (see section 6 for details). The watershed algorithm determines $S$ by finding a mapping $c : V \to S_0$ that assigns each node to the best seed so that

(2)   $S(v) = S\big(c(v)\big)$

Initially, node assignments are unknown (designated by $\emptyset$) except at the seeds, where they are assumed to be correct:

(3)   $c(v) = \begin{cases} v & \text{if } v \in S_0 \\ \emptyset & \text{otherwise} \end{cases}$
In this paper, we build upon the edge-based variant of the watershed algorithm [21, 9]. This variant is also known as watershed cuts because segment boundaries are defined by cuts in the graph, i.e. by the set of edges whose incident nodes have different labels. We denote the cuts in our solution as $\partial S$ and in the ground truth as $\partial S^*$.
Let $\Pi_{s,v}$ denote the set of all paths from seed $s$ to node $v$. Then the max-arc topographic distance between $s$ and $v$ is defined as [11]

(4)   $d(s, v) = \min_{\pi \in \Pi_{s,v}} \max_{e \in \pi} f(e)$

In words, the highest edge in a path determines the path's altitude, and the path of lowest altitude determines the topographic distance. The watershed algorithm assigns each node to the topographically closest seed [26]:

(5)   $c(v) = \operatorname*{argmin}_{s \in S_0} d(s, v)$
The minimum distance path from seed $c(v)$ to node $v$ shall be denoted by $\pi_v$. This path is not necessarily unique, but ties are rare and can be broken arbitrarily when $f$ is a real-valued function of noisy input data.
It was shown in [9] that the resulting partitioning is equivalent to the minimum spanning forest (MSF) over seeds $S_0$ and edge weights $f$. Thus, we can compute the watershed segmentation incrementally using Prim's algorithm: Starting from the initial seeds, each iteration finds the lowest edge whose start point $u$ is already assigned, but whose end point $v$ is not,

(6)   $(u, v) = \operatorname*{argmin}_{(u', v') \in E \,:\; c(u') \neq \emptyset,\; c(v') = \emptyset} f\big((u', v')\big)$

and propagates the seed assignment from $u$ to $v$:

(7)   $c(v) = c(u)$
In a traditional watershed implementation, the altitude is a fixed, hand-designed function of the input data,

(8)   $f(e) = f_{\mathrm{HC}}(I; e)$

for example, the image's Gaussian gradient magnitude or the "global Probability of boundary" (gPb) detector [2].
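The region-growing rule of equations (6) and (7) translates directly into a priority-queue variant of Prim's algorithm. The sketch below is our own minimal illustration, not the paper's implementation; as one simple choice, the altitude of an edge is taken as the mean of the two incident pixels of a per-pixel boundary map.

```python
import heapq
import numpy as np

def seeded_watershed(altitude, seeds):
    """Edge-based seeded watershed (minimum spanning forest via Prim's algorithm).

    altitude: 2D per-pixel boundary map, higher = stronger boundary evidence.
    seeds: dict mapping (row, col) -> positive integer label.
    Returns a label image assigning every pixel to a seed (0 = unassigned).
    """
    h, w = altitude.shape
    labels = np.zeros((h, w), dtype=int)          # 0 plays the role of the empty label
    queue = []                                    # entries: (edge altitude, row, col, label)

    def push_neighbours(r, c, lab):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and labels[rr, cc] == 0:
                # edge altitude: mean of the two incident pixel values (one simple choice)
                f_e = 0.5 * (altitude[r, c] + altitude[rr, cc])
                heapq.heappush(queue, (f_e, rr, cc, lab))

    for (r, c), lab in seeds.items():
        labels[r, c] = lab
        push_neighbours(r, c, lab)

    while queue:
        _, r, c, lab = heapq.heappop(queue)
        if labels[r, c] == 0:                     # eq. (6): lowest edge into the unassigned set
            labels[r, c] = lab                    # eq. (7): propagate the seed's label
            push_neighbours(r, c, lab)
    return labels
```

With a flat image containing one high-altitude ridge column, the two seeded regions flood up to the ridge from both sides.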
4 Joint Structured Learning of Altitude and Region Assignment
We propose to use structured learning to train an altitude regressor jointly with the region assignment procedure defined by Prim's algorithm. We will discuss two types of learnable altitude functions: $f^{\mathrm{stat}}_\theta(I; e)$ comprises models that, once trained, only depend on the input image $I$, whereas $f^{\mathrm{dyn}}_\theta$ additionally incorporates dynamically changing information about the current state of Prim's algorithm.
4.1 Static Altitude Prediction
To find optimal parameters $\theta$ of a model $f_\theta$, consider how Prim's algorithm proceeds: It builds an MSF which assigns each node $v$ to the closest seed by identifying the shortest path $\pi_v$ from $c(v)$ to $v$. Such a path can be wrong in two ways: it may cross $\partial S^*$ and thus miss a ground truth cut edge, or it may end at a false cut edge, placing $v$ in the interior of a ground truth region (see figure 1). More formally, we have to distinguish two failure modes: (i) a node was assigned to the wrong seed, i.e. $S(v) \neq S^*(v)$, or (ii) it was assigned to the correct seed via a non-admissible path, i.e. a path taking a detour across a different region. To treat both cases uniformly, we construct the corresponding ground truth paths $\pi^*_v$.
These paths can be found by running Prim's algorithm with a modified altitude

(9)   $f^*(e) = \begin{cases} \infty & \text{if } e \in \partial S^* \\ f(e) & \text{otherwise} \end{cases}$

forcing cuts in the resulting constrained MSF to coincide with $\partial S^*$ (see figure 2). We denote the topographic distances along $\pi_v$ and $\pi^*_v$ as $d_v$ and $d^*_v$ respectively. By construction of the MSF, $d_v$ and $d^*_v$ are equal for all correct nodes. Conversely, they differ for incorrect nodes, causing distance $d^*_v$ to exceed distance $d_v$. This property defines the set of incorrect nodes:

(10)   $V^{\mathrm{err}} = \{\, v \in V : d^*_v > d_v \,\}$
Every incorrect path $\pi_v$ contains at least one erroneous cut edge. The first such edge, denoted $e_v$, shall be called the path's root error edge and is always a missing cut. Training should increase its altitude until it becomes part of the cut set $\partial S$. The root error edge $e^*_v$ of a ground truth path $\pi^*_v$ is the first false cut edge in $\pi^*_v$ in failure mode (i) and the first edge where $\pi^*_v$ deviates from $\pi_v$ in mode (ii). Here, the altitude should be decreased to make the edge part of the MSF, see figure 2. Accordingly, we denote the sets of root edges as $R = \{e_v\}$ and $R^* = \{e^*_v\}$.
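On a toy 1-D chain, the constrained MSF and the detection of incorrect nodes can be made fully explicit. The sketch below is our own illustration under simplifying assumptions (on a chain with two seeds, the MSF simply drops one cut edge, so max-arc distances are easy to enumerate); variable names are ours.

```python
import numpy as np

# Toy 1-D chain: 6 nodes, seeds at node 0 and node 5, f[i] = altitude of edge (i, i+1).
# The ground truth wants the cut at edge 2, but the predicted altitudes peak at edge 3.
f = np.array([0.1, 0.2, 0.5, 0.9, 0.1])
gt_cut = 2

def chain_watershed(f, cut):
    """Assignment and max-arc topographic distances on a chain for a given cut edge."""
    n = len(f) + 1
    assign = np.array([0 if v <= cut else 1 for v in range(n)])
    d = np.zeros(n)
    for v in range(n):
        if v <= cut:
            d[v] = f[:v].max() if v > 0 else 0.0      # path from left seed uses edges 0..v-1
        else:
            d[v] = f[v:].max() if v < n - 1 else 0.0  # path from right seed uses edges v..n-2
    return assign, d

pred_cut = int(np.argmax(f))                    # unconstrained MSF: cut the single highest edge
assign, d = chain_watershed(f, pred_cut)
assign_gt, d_star = chain_watershed(f, gt_cut)  # constrained MSF: cut forced onto the GT edge
v_err = np.where(d_star > d)[0]                 # incorrect nodes: d*_v > d_v

# Node 3 is wrongly attached to the left seed.  Its root error edges are:
#   - edge 2 (the missed ground truth cut on the predicted path): altitude must rise,
#   - edge 3 (the false cut on the ground truth path): altitude must fall.
```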
Since all assignment decisions in Prim’s algorithm are conditioned on decisions taken earlier, the errors in any path also depend on the path’s root error. Structured learning must therefore consider these errors jointly, and we argue that training updates must be derived solely from the root edges: They are the only locations whose required correction direction is unambiguously known. In contrast, we cannot even tell if subsequent errors will simply disappear once the root error has been fixed, or need updates of their own. When the latter applies, however, these edges will eventually become root errors in later training epochs, and we delay updating them to that point.
Since we need a differentiable loss to perform gradient descent, we use the perceptron loss of distance differences:

(11)   $L(\theta) = \sum_{v \in V} \big( d^*_v - d_v \big)$
Correct nodes have zero contribution since $d^*_v = d_v$ holds for them. To serve as a basis for structured learning, we transform this into a loss over altitude differences at root edges. Since topographic distances equal the highest altitude along the shortest path, we have

(12)   $d_v = \max_{e \in \pi_v} f(e) \;\geq\; f(e_v)$
To derive similar relations for $d^*_v$, consider how the constrained MSF is constructed from the unconstrained one: First, edges crossing $\partial S^*$ are removed from the MSF. Each of the resulting orphaned subgraphs is then reconnected into the constrained MSF via the lowest edge not crossing $\partial S^*$. The newly inserted edges are the root edges of all their child nodes, i.e. all nodes in the respective subgraph. Since these root edges did not belong to the original MSF, their altitude cannot be less than the maximum altitude in the corresponding child subgraph. For $d^*_v$, it follows that

(13)   $d^*_v = f(e^*_v)$
We can therefore upper-bound the perceptron loss by

(14)   $L(\theta) \;\leq\; \hat L(\theta) = \sum_{v \in V^{\mathrm{err}}} \big( f(e^*_v) - f(e_v) \big)$
and minimize this upper bound. By rearranging the sum, the loss can be simplified into

(15)   $\hat L(\theta) = \sum_{e \in R^*} w(e)\, f(e) \;-\; \sum_{e \in R} w(e)\, f(e)$

where we introduced a weight function counting the children of each root edge

(16)   $w(e) = \big|\, \{\, v \in V^{\mathrm{err}} : e_v = e \ \text{or}\ e^*_v = e \,\} \,\big|$
A training epoch of structured learning thus consists of the following steps:

1. Compute $d_v$ and $d^*_v$ with the current model parameters $\theta$ and determine the MSF and the constrained MSF.

2. Identify the root edges and define the weights $w(e)$ and the loss $\hat L(\theta)$.

3. Obtain an updated parameter vector via gradient descent on $\hat L$ at $\theta$.

These steps are iterated until convergence, and the resulting final parameter vector is denoted as $\hat\theta$.
4.2 Relation to Reinforcement Learning
In this section, we compare the structured loss function with policy gradient reinforcement learning, which will serve as motivation for a refinement of the weighting function $w$. To see the analogy, we refer to continuous-control deep reinforcement learning as proposed by [29, 28, 18].
Looking at the region growing procedure from a reinforcement learning perspective, we define states $s$ as tuples of the input and the edge under consideration, and the action space is the altitude $a$ to be predicted by $f_\theta$. The Policy Gradient Theorem [29] defines the appropriate update direction of the parameter vector $\theta$. In a continuous action space, it reads

(17)   $\nabla_\theta J = \int_{\mathcal S} \rho(s) \int_{\mathcal A} \nabla_\theta\, \pi_\theta(a \mid s)\, Q(s, a)\; \mathrm{d}a\, \mathrm{d}s$

where $J$ is the performance to be optimized, $\rho$ the discounted state distribution, $\pi_\theta$ the policy to be learned, and $Q$ the action-value function estimating the discounted expected future reward

(18)   $Q(s_t, a_t) = \mathbb{E}\Big[ \textstyle\sum_{k \geq 0} \gamma^k\, r_{t+k} \Big]$

In our case, the state distribution reduces to a constant because Prim's algorithm reaches each edge exactly once. Inserting our deterministic altitude prediction

(19)   $\pi_\theta(a \mid s) = \delta\big(a - f_\theta(s)\big)$

where $\delta$ is the Dirac distribution, we get

(20)   $\nabla_\theta J \;\propto\; \sum_{s} \nabla_\theta f_\theta(s)\; \partial_a Q(s, a)\big|_{a = f_\theta(s)}$
Comparing equation (20) with equation (15), we observe that $-\nabla_\theta \hat L$ takes the form of $\nabla_\theta J$ when $\partial_a Q$ is identified with $\pm w(e)$, i.e. our weights essentially play the role of the action-value function $Q$. This suggests introducing a discount factor into $w$. To do so, we replace the temporal differences between states in (18) with tree distances $\delta(v, e_v)$ or $\delta(v, e^*_v)$ counting the number of edges between node $v$ and its root edge. This gives the discounted weights

(21)   $\tilde w(e) = \sum_{v \,:\, e_v = e \ \text{or}\ e^*_v = e} \gamma^{\delta(v, e)}$

with discount factor $\gamma$ to be chosen such that $\tilde w$ decays roughly according to the size of the CNN's receptive field. Substituting $\tilde w$ for $w$ in (15) significantly improves convergence in our experiments. This analogy further motivates the application of current deep reinforcement learning training methods as described in section 5.2.
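The discounted weights of equation (21) simply replace each child's unit contribution by $\gamma$ raised to its tree distance from the root edge. A minimal sketch with made-up root edges and distances:

```python
# Tree distance (number of edges) from each child node to its root error edge.
# The per-edge lists below are illustrative, not taken from real data.
children = {'a': [0, 1, 1, 2],   # root edge 'a' has four children at these distances
            'b': [0, 3]}         # root edge 'b' has two
gamma = 0.5                      # discount factor; in practice matched to the receptive field

# eq. (21): discounted weight per root edge
w_disc = {e: sum(gamma ** d for d in dists) for e, dists in children.items()}
```

Children far from the root edge thus contribute little, mirroring the intuition that the CNN's receptive field cannot "see" errors far down the growth path.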
4.3 Dynamic Altitude Prediction
In every iteration, region growing according to Prim's algorithm only considers edges with exactly one end node located in the already assigned set. This offers the possibility to delay altitude estimation to the time when the values are actually needed. On-demand altitude computation can take advantage of the partial segmentations already available to derive additional shape clues that help resolve difficult assignment decisions.
Relative Assignments: To incorporate partial segmentations, we remove their dependence on the incidental choice of label values by means of a label-independent projection. Consider an edge $(u, v)$ where node $u$ is assigned to seed $s = c(u)$ and node $v$ is unassigned. We now construct a labeling relative to $s$, distinguishing nodes assigned to $s$ ("me" region), to another seed ("them"), and unassigned ("nobody"). Relative labelings $M_u$ are represented by a standard 1-of-3 coding:

(22)   $M_u(v') = \begin{cases} (1, 0, 0) & \text{if } c(v') = c(u) \\ (0, 1, 0) & \text{if } c(v') \notin \{\, c(u), \emptyset \,\} \\ (0, 0, 1) & \text{if } c(v') = \emptyset \end{cases}$
In practice, we process relative labelings by adding a new branch to our neural network that receives $M_u$ as an input, see section 5.1 for details.
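The 1-of-3 coding of equation (22) amounts to three boolean masks relative to the growing seed. A small numpy sketch (our own helper, with label 0 denoting "unassigned"):

```python
import numpy as np

ME, THEM, NOBODY = 0, 1, 2

def relative_encoding(assign, me):
    """1-of-3 coding of a (partial) assignment image relative to seed label `me`.

    assign: integer label image, 0 = unassigned.
    Returns an array of shape (3, H, W) with exactly one 1 per pixel.
    """
    rel = np.zeros((3,) + assign.shape, dtype=np.float32)
    rel[ME]     = (assign == me)                   # "me": same seed as the growing region
    rel[THEM]   = (assign != me) & (assign != 0)   # "them": any other seed
    rel[NOBODY] = (assign == 0)                    # "nobody": not yet assigned
    return rel
```

The three channels partition the image, so they sum to one everywhere, which is what makes the projection independent of the incidental label values.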
Non-Markovian modeling: Another potentially useful cue is afforded by the fact that Prim's algorithm propagates the assignments recursively. Thus, during every evaluation of the altitude function, the complete history from previous iterations along the growth paths can be incorporated. We encode the history of past assignment decisions as a fixed-dimensional vector $h_v$ in each node. In practice, we incorporate history by adding a recurrent layer to our neural network.
We introduce the dynamic altitude prediction

(23)   $\big( f(e),\, h_v \big) = f^{\mathrm{dyn}}_\theta\big( I,\, M_u,\, h_u;\; e = (u, v) \big)$

that receives the relative assignment projection $M_u$ and $u$'s hidden state $h_u$ as additional inputs and outputs both the edge's altitude and $v$'s hidden state $h_v$. This variant of the altitude estimator performs best in our experiments. The emergent behavior of our models suggests that the algorithm uses history to adjust local diffusion parameters such as preferred direction and "viscosity", similar to the viscous watershed transform described in [32].
5 Methods
5.1 Neural Network Architecture
Our network architecture builds mainly on the work of Yu and Koltun [34], who introduced dilated convolutions to achieve dense segmentations and systematically aggregate multi-scale contextual information without pooling operations. We split our network into two convolutional branches (see Figure 4): The upper branch processes the static input, and the lower one the dynamic input. Since the input of the upper branch doesn't change during prediction, its network activations can be precomputed for all edges, leading to a significant speed-up. We choose gated recurrent units (GRU) instead of long short-term memory (LSTM) in the recurrent network part, because GRUs have no internal state and get all history from the hidden state vector, saving on memory and bookkeeping.

5.2 Training Methods
Augmenting the Input Image: We noted above that structured learning is superior because it considers edges jointly. However, it can only rely on the sparse training sets of root edges $R$ and $R^*$. In contrast, unstructured learning can make use of all edges and thus has a much bigger training set. This means that more powerful predictors, e.g. much deeper CNNs, can be trained, leading to more robust predictions and bigger receptive fields.
To combine the advantages of both approaches, we propose to augment the input image with an additional channel holding node boundary probabilities predicted by an unstructured model $g_{\theta'}$:

(24)   $\tilde I = \big( I,\; g_{\theta'}(I) \big)$

We train the CNN $g_{\theta'}$ separately beforehand and replace $I$ with the augmented input $\tilde I$ everywhere in $f^{\mathrm{stat}}_\theta$ and $f^{\mathrm{dyn}}_\theta$. This simplifies structured learning because the predictor only needs to learn a refinement of the already reasonable altitudes in $g_{\theta'}(I)$. In principle, one could even train $f_\theta$ and $g_{\theta'}$ jointly, but the combined model is too big for reliable optimization.
Training Schedule: Taking advantage of the close relationship with reinforcement learning, we adopt the asynchronous update procedure proposed by [23]. Here, independent workers fetch the current CNN parameters from the master and compute loss gradients for randomly selected training images in parallel. The master then applies these updates to the parameters in an asynchronous fashion. We found experimentally that this leads to faster and more stable convergence than sequential training.
In order to train the recurrent network part, we replace the standard temporal input ordering with the succession of edges defined by the paths $\pi_v$ and $\pi^*_v$. In a sense, backpropagation through time thus becomes backpropagation along the minimum spanning forest.
6 Experiments and Results
Our experiments illustrate the performance of our proposed end-to-end trainable watershed in combination with static and dynamic altitude prediction. To this end, we compare with standard watershed and power watershed algorithms [8] on statically trained CNNs according to [5], see section 6.2. Furthermore, we show in section 6.3 that the learned watershed surpasses the state-of-the-art segmentation in an adapted version of the CREMI Neuron Segmentation Challenge [10].
6.1 Experimental Setup and Evaluation Metrics
Seed Generation Oracle: All segmentation algorithms start at initial seeds which are here provided by a “perfect” oracle. In our experiments, this oracle uses the ground truth segmentation to select one pixel with maximal distance to the region boundary per ground truth region.
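For small images, the oracle can be sketched by brute force: for every ground truth region, pick the pixel farthest from any pixel of a different region (this reading ignores the image border, one possible interpretation of "region boundary"). This is our own illustration, not the paper's code.

```python
import numpy as np

def oracle_seeds(gt):
    """One seed per ground truth region: the pixel with maximal Euclidean
    distance to the nearest pixel of any other region (brute force, small images).

    gt: integer label image with at least two regions.
    Returns a dict label -> (row, col).
    """
    coords = np.array(list(np.ndindex(gt.shape)))          # all (row, col) pairs
    labels = gt[tuple(coords.T)]
    seeds = {}
    for lab in np.unique(gt):
        inside, outside = coords[labels == lab], coords[labels != lab]
        # distance of every inside pixel to its closest outside pixel
        d = np.sqrt(((inside[:, None, :] - outside[None, :, :]) ** 2).sum(-1)).min(1)
        r, c = inside[np.argmax(d)]
        seeds[int(lab)] = (int(r), int(c))
    return seeds
```

A real implementation would use a distance transform instead of the quadratic pairwise computation, but the selection rule is the same.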
Segmentation Metrics: In accordance with the CREMI challenge [10], we use the following segmentation metrics: The Rand score measures the probability of agreement between segmentation $S$ and ground truth $S^*$ w.r.t. a randomly chosen node pair $(u, v)$: two segmentations agree if both assign $u$ and $v$ to the same region, or both assign them to different regions. The Rand error ARAND is one minus the Rand score, so that smaller values are better.

The Variation of Information (VOI) between $S$ and $S^*$ is defined as $\mathrm{VOI} = H(S \mid S^*) + H(S^* \mid S)$, where $H(\cdot \mid \cdot)$ is the conditional entropy [19]. To distinguish split errors from merge errors, we report the summands separately as $\mathrm{VOI}_{\mathrm{split}} = H(S \mid S^*)$ and $\mathrm{VOI}_{\mathrm{merge}} = H(S^* \mid S)$.
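The VOI split/merge decomposition can be computed directly from the label co-occurrence counts of the two segmentations. A compact sketch of our own (using base-2 logarithms; the choice of base only rescales the values):

```python
from collections import Counter
import numpy as np

def voi_split_merge(seg, gt):
    """Return (H(seg|gt), H(gt|seg)): the split-error and merge-error
    summands of the Variation of Information, in bits."""
    seg, gt = np.ravel(seg), np.ravel(gt)
    n = seg.size
    p_joint = Counter(zip(seg.tolist(), gt.tolist()))      # co-occurrence counts
    p_seg, p_gt = Counter(seg.tolist()), Counter(gt.tolist())
    # H(S|G) = -sum p(s,g) log p(s,g)/p(g);  H(G|S) analogously
    h_seg_given_gt = -sum(c / n * np.log2(c / p_gt[g]) for (s, g), c in p_joint.items())
    h_gt_given_seg = -sum(c / n * np.log2(c / p_seg[s]) for (s, g), c in p_joint.items())
    return h_seg_given_gt, h_gt_given_seg
```

Merging two equally sized ground truth regions leaves the split term at zero while the merge term grows, which is exactly the asymmetry the separate reporting is meant to expose.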
6.2 Artificial Data
Dataset: In order to compare our models $f^{\mathrm{stat}}_\theta$ and $f^{\mathrm{dyn}}_\theta$ with solutions based on unstructured learning, we create an artificial segmentation benchmark dataset with variable difficulty. First, we generate an edge image via the zero crossings of a 2D Gaussian process. This image is then smoothed with a Gaussian filter and corrupted with additive Gaussian noise at three noise levels. For each noise level, we generate 1900 training images and 100 test images of size 252×252. One test image with corresponding ground truth and results is shown in figure 5.
Baseline: We choose a recent edge detection network from [34] to predict boundaries between different instances, in combination with the standard watershed (WS) and the Power Watershed (PWS) [8], to generate an instance segmentation. Since these algorithms work best on slightly smoothed inputs, we apply Gaussian smoothing to the CNN output. The optimal smoothing parameters are determined by grid search on the training dataset. Additionally, we apply all watershed methods directly to the smoothed raw image and report their overall best result as RAW + WS.
Performance: The measured segmentation errors of all algorithms are shown in table 1. Observed differences in performance mainly indicate how well each method handles lowcontrast edges and narrow gaps between regions. The structurally trained watersheds outperform the unstructured baselines, because our loss function heavily penalizes the resulting segmentation errors. In all experiments, the dynamic prediction function has the best performance, due to its superior modeling power. It can identify holes and close most contours correctly because it learns to derive shape and contingency clues from monitoring intermediate results during the flooding process. A representative example of this effect is shown in figure 5.
Table 1: ARAND segmentation errors (in %, mean ± std. dev.) on the artificial dataset at three noise levels (left to right: increasing noise).

  Learned WS (dynamic)    5.8 ± 0.8   12.5 ± 1.7   32.2 ± 1.8
  Learned WS (static)     6.4 ± 0.9   13.8 ± 1.6   32.4 ± 2.2
  NN + WS                 6.5 ± 0.8   14.9 ± 3.6   33.4 ± 1.7
  NN + PWS                6.5 ± 0.8   14.9 ± 1.7   33.2 ± 1.7
  RAW + WS               24.0 ± 1.6   41.9 ± 1.8   55.0 ± 1.8
6.3 Neurite Segmentation
Dataset: The MICCAI Challenge on Circuit Reconstruction from Electron Microscopy Images [10] contains 375 fully annotated slices of electron microscopy images (of resolution 1250×1250 pixels). Part of a data slice is displayed in figure 7 top. Since the test ground truth segmentation has not been disclosed, we generate a new train/test split from the 3 original challenge training datasets by splitting them into 3×75 z-continuous training and 3×50 z-continuous test blocks.
Ideally, we would compare with [5] whose results define the stateoftheart on the CREMI Challenge at time of submission. However, their pipeline, as described in their supplementary material, optimizes 2D segmentations jointly across multiple slices with a complex graphical model, which is beyond the scope of this paper.
Instead, we isolate the 2D segmentation aspect by adapting the challenge in the following manner: We run each segmentation algorithm with fixed ground truth seeds (see section 6.1) and evaluate their results on each slice separately. The restriction to 2D evaluation requires a slight manual correction of the ground truth: The ground truth is only accurate up to one slice in the z-direction. The official 3D evaluation scores compensate for this by ignoring deviations within this tolerance in the z-direction. Since this trick doesn't work in 2D, we remove 4 regions with no visual evidence in the image and all segments smaller than 5 pixels. Boundary tolerances in the xy-plane are treated as in the official CREMI scores, where deviations from the true boundary are ignored if they do not exceed 6.25 pixels.
Baseline: We compare the Learned Watershed performance against the Power Watershed [8], Viscous Watershed [32], Random Walker [13], Stochastic Watershed [1] and Distance Transform Watershed [5]. The boundary probability prediction (the same as in equation (24)) was provided by a deep CNN trained with an unstructured loss function. In particular, the Distance Transform Watershed (DTWS) and this prediction were used to produce the current state-of-the-art on the CREMI challenge. To obtain the DTWS, one thresholds the boundary probability map, computes a distance transform of the background, i.e. the non-boundary pixels, and runs the watershed algorithm on the inverted distances. According to [5], this is the best known heuristic to close boundary gaps in these data, but it requires manual parameter tuning. We found the parameters of all baseline algorithms by grid search using the training dataset. To ensure a fair comparison, we start region growing from ground truth seeds in all cases. Our algorithm takes the augmented image from equation (24) as input and learns how to close boundary gaps.
Comparison to state-of-the-art: We show the 2D CREMI segmentation scores in table 2. It is evident that the learned watershed transform significantly outperforms DTWS in both ARAND and VOI score. Qualitatively, we find that the flooding patterns, and therefore the region shapes, of the learned watershed prefer to adhere to biologically sensible structures. We illustrate this with our results on one CREMI test slice in figure 7, as well as specific examples in figure 6. We find throughout the dataset that especially thin processes, as depicted in fig. 6(a) left, are a strength of our algorithm. Biologically sensible shape completions can also be found for roundish objects and are particularly noticeable when boundary evidence is weak, as shown in fig. 6(b) center. However, in rare cases, we find incorrect shape completions (see fig. 6(c) right), mainly in areas of weak boundary evidence. It stands to reason that these errors could be fixed by providing more training data.
Table 2: 2D CREMI segmentation scores (mean ± std. dev.; smaller is better).

                   ARAND            VOI split        VOI merge
  PowerWS          0.122 ± 0.003    0.340 ± 0.031    0.180 ± 0.019
  ViscousWS        0.093 ± 0.003    0.328 ± 0.030    0.069 ± 0.003
  RandomWalker     0.103 ± 0.004    0.355 ± 0.037    0.060 ± 0.004
  Stochastic WS    0.193 ± 0.012    0.612 ± 0.080    0.077 ± 0.004
  DTWS             0.085 ± 0.001    0.320 ± 0.029    0.070 ± 0.005
  Learned WS       0.082 ± 0.001    0.319 ± 0.030    0.057 ± 0.004
7 Conclusion
This paper proposes an end-to-end learnable seeded watershed algorithm that performs well on artificial data and on neuro-segmentation EM images. We found the following aspects to be critical success factors: First, we train a very powerful CNN to control region growing. Second, the CNN is trained in a structured fashion, allowing it to optimize segmentation performance directly, instead of treating pixels independently. Third, we improve modeling power by incorporating dynamic information about the current state of the segmentation. Specifically, feeding the current partial segmentation into the CNN provides shape clues for the next assignment decision, and maintaining a latent history along assignment paths allows growing parameters to be adjusted locally. We demonstrate experimentally that the resulting algorithm successfully solves difficult configurations like narrow region parts and low-contrast boundaries, where previous algorithms fail. In future work, we plan to include seed generation in the end-to-end learning scheme.
References
 [1] J. Angulo and D. Jeulin. Stochastic watershed segmentation. In PROC. of the 8th International Symposium on Mathematical Morphology, pages 265–276, 2007.
 [2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Trans. Patt. Anal. Mach. Intell., 33(5):898–916, 2011.
 [3] I. ArgandaCarreras, S. Turaga, D. Berger, et al. Crowdsourcing the creation of image segmentation algorithms for connectomics. Front. Neuroanatomy, 9:142, 2015.
 [4] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. arXiv:1611.08303, 2016.
 [5] T. Beier, C. Pape, N. Rahaman, T. Prange, et al. Multicut brings automated neurite segmentation closer to human performance. Nature Methods, 14(2):101–102, 2017.
 [6] J. Cai, L. Lu, Z. Zhang, F. Xing, L. Yang, and Q. Yin. Pancreas segmentation in MRI using graphbased decision fusion on convolutional neural networks. In Proc. MICCAI, 2016.
 [7] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. Proc. NIPS’12, 2012.
 [8] C. Couprie, L. Grady, L. Najman, and H. Talbot. Power watershed: A unifying graphbased optimization framework. IEEE Trans. Patt. Anal. Mach. Intell., 33(7), 2011.
 [9] J. Cousty, G. Bertrand, L. Najman, and M. Couprie. Watershed cuts: Minimum spanning forests and the drop of water principle. IEEE Trans. Patt. Anal. Mach. Intell., 2009.
 [10] CREMI. MICCAI Challenge on Circuit Reconstruction from Electron Microscopy Images, 2017.
 [11] A. X. Falcão, J. Stolfi, and R. de Alencar Lotufo. The image foresting transform: Theory, algorithms, and applications. IEEE Trans. Patt. Anal. Mach. Intell., 26(1):19–29, 2004.
 [12] C. Fowlkes, D. Martin, and J. Malik. Learning affinity functions for image segmentation: combining patchbased and gradientbased approaches. In Proc. CVPR, 2003.
 [13] L. Grady. Random walks for image segmentation. IEEE Trans. Patt. Anal. Mach. Intell., 28(11):1768–1783, 2006.
 [14] V. Jain, J. F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. L. Briggman, M. N. Helmstaedter, W. Denk, and H. S. Seung. Supervised learning of image restoration with convolutional networks. Proc. ICCV’07, pages 1–8, 2007.
 [15] M. Januszewski, J. MaitinShepard, P. Li, J. Kornfeld, W. Denk, and V. Jain. Floodfilling networks. arXiv:1611.00421, 2016.
 [16] S. KnowlesBarley, V. Kaynig, T. R. Jones, A. Wilson, J. Morgan, D. Lee, D. Berger, N. Kasthuri, J. W. Lichtman, and H. Pfister. RhoanaNet pipeline: Dense automatic neural annotation. arXiv:1611.06973, 2016.
 [17] I. Kokkinos. Pushing the boundaries of boundary detection using deep learning. arXiv:1511.07386, 2015.
 [18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv:1509.02971, 2015.
 [19] M. Meila. Comparing clusterings: an axiomatic view. In Proc. ICML’05, pages 577–584, 2005.
 [20] Y. Meirovitch, A. Matveev, H. Saribekyan, D. Budden, D. Rolnick, G. Odor, S. K.B. T. R. Jones, H. Pfister, J. W. Lichtman, and N. Shavit. A multipass approach to largescale connectomics. arXiv preprint:1612.02120, 2016.
 [21] F. Meyer. Minimum spanning forests for morphological segmentation. In Mathematical morphology and its applications to image processing, pages 77–84. 1994.
 [22] F. Meyer. Watersheds on weighted graphs. Pattern Recognition Letters, 47:72 – 79, 2014.
 [23] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. Asynchronous methods for deep reinforcement learning. In Proc. ICML’16, 2016.
 [24] J. NunezIglesias, R. Kennedy, T. Parag, J. Shi, and D. Chklovskii. Machine learning of hierarchical clustering to segment 2D and 3D images. PLoS one, 8:e71715, 2013.
 [25] T. M. Quan, D. G. Hilderbrand, and W.K. Jeong. FusionNet: a deep fully residual convolutional neural network for image segmentation in connectomics. arXiv:1612.05360, 2016.
 [26] J. B. Roerdink and A. Meijster. The watershed transform: Definitions, algorithms and parallelization strategies. Fundamenta informaticae, 41(1, 2):187–228, 2000.
 [27] O. Ronneberger, P. Fischer, and T. Brox. UNet: convolutional networks for biomedical image segmentation. Proc. MICCAI’15, pages 234–241, 2015.
 [28] D. Silver, G. Lever, N. Heess, T. Degris, et al. Deterministic policy gradient algorithms. In Proc. ICML’14, 2014.
 [29] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In Proc. NIPS’99, 1999.
 [30] S. C. Turaga, K. L. Briggman, M. Helmstaedter, W. Denk, and H. S. Seung. Maximin affinity learning of image segmentation. arXiv:0911.5372, 2009.
 [31] S. C. Turaga, J. F. Murray, V. Jain, F. Roth, M. Helmstaedter, K. Briggman, W. Denk, and H. S. Seung. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation, 22(2):511–538, 2010.
 [32] C. Vachier and F. Meyer. The viscous watershed transform. J. Math. Imaging and Vision, 22(2):251–267, 2005.
 [33] S. Xie and Z. Tu. Holisticallynested edge detection. In Proc. ICCV’15, pages 1395–1403, 2015.
 [34] F. Yu and V. Koltun. Multiscale context aggregation by dilated convolutions. arXiv:1511.07122, 2015.
 [35] A. Zlateski and H. S. Seung. Image segmentation by sizedependent single linkage clustering of a watershed basin graph. arXiv:1505.00249, 2015.