1 Introduction
In the computer vision literature, several approaches to assess the quality of contour detection and segmentation algorithms can be found. Most of these measures have been designed to capture the intuition of what humans consider to be two similar results. In particular, these measures are supposed to be robust to certain tolerated deviations, like small shifts of contours. For contour detection in the Berkeley segmentation dataset [14], for example, the precision and recall of detected boundary pixels within a threshold distance to the ground truth became the widely used standard [13, 1]. Contour error measures are, however, not a good fit for segmentations, since small errors in the detection of a contour can lead to the split or merge of segments. Therefore, alternatives like the Variation of Information (VOI), the Rand Index (RI) [18], the probabilistic Rand index [20, 21], and the segmentation covering measure [1] have been proposed.
However, these measures do not acknowledge that there are different criteria for segmentation comparison, and instead accumulate errors uniformly, even for many small differences that are irrelevant in practice. Especially in the field of biomedical image processing, we are often more interested in counting true topological errors like splits and merges of objects than in counting small deviations from the ground truth contours. This is in particular the case for imaging methods for which no unique “ground truth” labeling exists. In the imaging of neural tissue with Electron Microscopy (EM), for example, the preparation protocol can alter the volume of neural processes, such that it is hard to know what the true size was [19]. Further, the imaging resolution and data quality might just not be sufficient to clearly locate contours between objects [3], resulting in a high inter-observer variability.
To address these issues, we present a novel measure to evaluate segmentations on a clearly specified tolerance criterion: At the core of our measure, which we call Tolerant Edit Distance (TED; source code available at http://github.com/funkey/ted), is an explicit tolerance criterion (e.g., boundary shifts within a certain range). Using integer linear programming, we find the minimal weighted sum of split and merge errors exceeding the tolerance criterion, and thus provide a time-to-fix estimate. By interpreting a segmentation as a general labeling of voxels, our measure does not require voxels of the same object to form a connected component, and thus supports anisotropic volumes, missing data, or known object connections via paths outside the volume being considered. The reported results are intuitive and easy to interpret, and errors can be localized in the volume. An illustration of the TED can be found in Figure 1.
Application to Neuron Segmentation. To demonstrate the usefulness of our measure, we present our results in the context of automatic neuron segmentation from EM volumes, an active field of biomedical image processing (for recent advances, see [5, 10, 15, 16, 8]). In this field, the criterion to assess the quality of a segmentation depends on the biological question: On one hand, skeletons of neurons are sufficient to identify individual neurons [17], to study neuron types and their function [4], and to obtain the wiring diagram of a nervous system (the so-called connectome) [3]. In these cases, topological correctness is far more important than the diameter of a neural process or the exact location of its boundary (see Figure 2 for examples). On the other hand, for biophysically realistic neuron simulation, volumetric information is needed to model action potential time dynamics, and to understand and simulate information processing capabilities of single neurons [12]. In this case, the segmentation should be close to the true volume of the reconstructed neurons. Only small deviations in the boundary location might still be tolerable.
Current state-of-the-art methods for automatic neuron segmentation can broadly be divided into isotropic [15, 11, 16, 8] and anisotropic methods [5, 10, 6]. For both types, reporting segmentation accuracy in terms of VOI or RI became the de facto standard [15, 11, 10, 16, 8]. Less frequently used [5, 6] are the Anisotropic Edit Distance (AED) [5] and the Warping Error (WE) [9]. The AED is tailored to the specific error correction steps required for anisotropic volumes (splits and merges of 2D neuron slices within a section, connections and disconnections of slices between sections). The WE aims to measure the difference between ground truth and a proposal segmentation in terms of their topological differences. As such, the WE was the first error measure for neuron segmentation that deals with the delicate question of up to which point a boundary shift is not considered to be an error. However, since the WE assumes a foreground-background segmentation where connected foreground objects represent neurons, it is only applicable to isotropic volumes (in anisotropic volumes, connectedness of neurons is not always preserved). Furthermore, only suboptimal solutions to the WE are found using a greedy, randomized heuristic, which makes it difficult to use for evaluation purposes. Consequently, the WE has found its main application in the training of neural networks for image classification [9].
2 Tolerant Edit Distance
The TED measures the difference between two segmentations $x: \Omega \to K_x$ and $y: \Omega \to K_y$, where $\Omega$ is a discrete set of voxel (or supervoxel) locations in a volume, and $K_x$ and $K_y$ are sets of labels used by $x$ and $y$, respectively. The difference is reported in terms of the minimal number of splits and merges appearing in a relabeling of $y$, as compared with $x$. How $y$ is allowed to be relabeled is defined on a tolerance criterion, e.g., the maximal displacement of an object boundary.
We say that a label $k \in K_x$ overlaps with a label $l \in K_y$ if there exists at least one location $i \in \Omega$ such that $x(i) = k$ and $y(i) = l$. If $x$ and $y$ represent the same segmentation, each label $k \in K_x$ overlaps with exactly one label $l \in K_y$, and vice versa. Consequently, if a label $k \in K_x$ overlaps with $n$ labels from $K_y$, we count it as $n - 1$ splits. Analogously, if a label $l \in K_y$ overlaps with $n$ labels from $K_x$, we count it as $n - 1$ merges. For two labelings $x$ and $y$, we denote as $s(x, y)$ and $m(x, y)$ the sum of splits and merges over all labels.
Let a tolerance function $T$ be a binary indicator on two labeling functions $y$ and $y'$,
(1) $T(y, y') = \begin{cases} 1 & \text{if } y' \text{ is a tolerable relabeling of } y \\ 0 & \text{otherwise} \end{cases}$
Further, let $\mathcal{Y}$ be the set of all labeling functions $y': \Omega \to K_y$, i.e., all possible labelings of $\Omega$ using the labels of $y$, and let $\mathcal{Y}' = \{ y' \in \mathcal{Y} \mid T(y, y') = 1 \}$ be the set of all tolerated relabelings of $y$. The TED is the minimal weighted sum of splits and merges over all tolerable relabelings $y' \in \mathcal{Y}'$:
(2) $\mathrm{TED}(x, y) = \min_{y' \in \mathcal{Y}'} \; \alpha \, s(x, y') + \beta \, m(x, y')$
where the weights $\alpha$ and $\beta$ represent the time or effort needed to fix a split or merge, respectively.
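The split and merge counts defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: labelings are assumed to be equal-length sequences indexed by location, and the function name `count_splits_merges` and the toy data are hypothetical.

```python
from collections import defaultdict


def count_splits_merges(x, y):
    """Return (splits, merges) between labelings x and y.

    x and y are equal-length sequences; x[i] and y[i] are the labels
    at location i. A label of x overlapping n labels of y counts as
    n - 1 splits; a label of y overlapping n labels of x counts as
    n - 1 merges.
    """
    if len(x) != len(y):
        raise ValueError("labelings must cover the same locations")

    partners_of_x = defaultdict(set)  # k -> labels of y overlapping k
    partners_of_y = defaultdict(set)  # l -> labels of x overlapping l
    for k, l in zip(x, y):
        partners_of_x[k].add(l)
        partners_of_y[l].add(k)

    splits = sum(len(ls) - 1 for ls in partners_of_x.values())
    merges = sum(len(ks) - 1 for ks in partners_of_y.values())
    return splits, merges


# Toy 1D example (hypothetical data): x has two objects, and y places
# the boundary two locations too far to the right.
x = ["A", "A", "A", "A", "B", "B", "B", "B"]
y = ["a", "a", "a", "a", "a", "a", "b", "b"]
print(count_splits_merges(x, y))  # (1, 1)
```

Without any tolerance, the shifted boundary is reported as one split (label B overlaps both a and b) and one merge (label a overlaps both A and B).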
In order to find the minimum of (2), we assume that the tolerance function is local, i.e., there exists a set of tolerable labels $A_i \subseteq K_y$ for each location $i \in \Omega$, and a tolerable labeling is any combination of those labels:
$T(y, y') = 1 \;\Leftrightarrow\; y'(i) \in A_i \quad \forall i \in \Omega$
An example of such a tolerance function is shown in Figure 1 (c). With this assumption, we solve (2) with the following integer linear program (ILP):
$\min_{v} \; \alpha s + \beta m$ subject to
(3) $\sum_{l \in A_i} v_{i \leftarrow l} = 1 \quad \forall i \in \Omega$
(4) $\sum_{i \in \Omega} v_{i \leftarrow l} \geq 1 \quad \forall l \in K_y$
(5) $a_{kl} - v_{i \leftarrow l} \geq 0 \quad \forall i \in \Omega: x(i) = k$
(6) $a_{kl} - \sum_{i \in \Omega: x(i) = k} v_{i \leftarrow l} \leq 0 \quad \forall k \in K_x \; \forall l \in K_y$
(7) $s_k - \sum_{l \in K_y} a_{kl} = -1 \quad \forall k \in K_x$
(8) $m_l - \sum_{k \in K_x} a_{kl} = -1 \quad \forall l \in K_y$
(9) $s - \sum_{k \in K_x} s_k = 0$
(10) $m - \sum_{l \in K_y} m_l = 0$
At the core of this ILP are binary indicator variables $v_{i \leftarrow l}$ that indicate the assignment of label $l$ to location $i$. Constraints (3) and (4) ensure that exactly one of the tolerable labels gets chosen for each location and that each label of $y$ appears at least once. Further, we introduce binary variables $a_{kl}$ that indicate the presence of a joint assignment of label $k$ from $x$ and label $l$ from $y'$ at at least one location. With constraints (5) and (6) we make sure that each $a_{kl} = 1$ if and only if there is at least one location $i$ such that $x(i) = k$ and $y'(i) = l$. To count the number of times a label $k$ is split in $y'$, we further introduce integers $s_k$. These counts equal the number of labels that $k$ was matched with, minus one, which we ensure with constraints (7). Analogously, we introduce integers $m_l$ and constraints (8) for merges caused by label $l$ in $y'$. The final split and merge numbers $s$ and $m$ are just the sums of the label-wise splits and merges, ensured by (9) and (10).
Once the optimal solution of this ILP has been found, the variables $a_{kl}$ can be used to determine which labels got split and merged, and thus to localize errors.
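For very small problems, the feasible set of the ILP can also be enumerated directly, which is a convenient way to sanity-check a solver's output. The sketch below is a brute-force stand-in and not the ILP itself: it tries every combination of tolerable labels, keeps only relabelings in which each label of the relabeled segmentation appears at least once, and returns the cheapest one. The function names, the list-of-sets representation of the tolerable labels per location, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def count_splits_merges(x, y):
    """Sum of (overlap count - 1) per label of x (splits) and of y (merges)."""
    partners_of_x, partners_of_y = defaultdict(set), defaultdict(set)
    for k, l in zip(x, y):
        partners_of_x[k].add(l)
        partners_of_y[l].add(k)
    splits = sum(len(ls) - 1 for ls in partners_of_x.values())
    merges = sum(len(ks) - 1 for ks in partners_of_y.values())
    return splits, merges


def ted_brute_force(x, tolerable, labels_y, alpha=1.0, beta=1.0):
    """Minimise alpha * splits + beta * merges over tolerable relabelings.

    x          -- fixed labeling, x[i] is the label at location i
    tolerable  -- tolerable[i] is the set of labels allowed at location i
    labels_y   -- labels that must each be used at least once
    Returns (cost, splits, merges, relabeling), or None if infeasible.
    """
    best = None
    for candidate in product(*[sorted(a) for a in tolerable]):
        # Every label of the relabeled segmentation has to appear.
        if not set(labels_y).issubset(candidate):
            continue
        s, m = count_splits_merges(x, candidate)
        cost = alpha * s + beta * m
        if best is None or cost < best[0]:
            best = (cost, s, m, list(candidate))
    return best


# Hypothetical toy data: the boundary in y is one location off.
x = ["A", "A", "A", "A", "B", "B", "B", "B"]
y = ["a", "a", "a", "a", "a", "b", "b", "b"]

# Tolerant case: the two locations next to the boundary may take either label.
tolerable = [{"a"}] * 4 + [{"a", "b"}] * 2 + [{"b"}] * 2
relaxed = ted_brute_force(x, tolerable, {"a", "b"})
print(relaxed[:3])  # (0.0, 0, 0)

# Zero tolerance: every location must keep its label from y.
strict = ted_brute_force(x, [{l} for l in y], {"a", "b"})
print(strict[:3])  # (2.0, 1, 1)
```

The enumeration grows exponentially with the number of locations that have more than one tolerable label, so it is only meant for checking tiny cases.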
3 Results
Shift of Object Boundary. To illustrate the behaviour of different error measures in the case of object boundary displacements, we created a simple artificial 1D labeling consisting of two regions. We show the errors of segmentations obtained by shifting the boundary between the objects.
It can clearly be seen that the TED assigns the same numbers (one split and one merge error) as soon as the given tolerance criterion is exceeded, regardless of where the error happens. This is the desired outcome for applications like neuron segmentation, where it is important to count the number of topological errors regardless of how many voxels are affected.
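A 1D experiment of this kind can be imitated with a small brute-force script. The sketch below is not the authors' setup: it assumes that a label is tolerable at a location if it occurs in the shifted labeling within `threshold` locations, uses unit weights for splits and merges, and picks hypothetical values for the array size, boundary position, and threshold.

```python
from collections import defaultdict
from itertools import product


def count_splits_merges(x, y):
    """Sum of (overlap count - 1) per label of x (splits) and of y (merges)."""
    partners_of_x, partners_of_y = defaultdict(set), defaultdict(set)
    for k, l in zip(x, y):
        partners_of_x[k].add(l)
        partners_of_y[l].add(k)
    splits = sum(len(ls) - 1 for ls in partners_of_x.values())
    merges = sum(len(ks) - 1 for ks in partners_of_y.values())
    return splits, merges


def tolerable_labels(y, threshold):
    """Assumed local tolerance: labels of y within `threshold` of each location."""
    n = len(y)
    return [
        set(y[max(0, i - threshold): min(n, i + threshold + 1)])
        for i in range(n)
    ]


def ted_1d(x, y, threshold):
    """Brute-force TED with unit weights; returns (splits, merges)."""
    best = None
    for cand in product(*[sorted(a) for a in tolerable_labels(y, threshold)]):
        if not set(y).issubset(cand):  # each label of y must survive
            continue
        s, m = count_splits_merges(x, cand)
        if best is None or s + m < sum(best):
            best = (s, m)
    return best


# Hypothetical experiment: two regions, boundary shifted by 0..5 locations.
size, boundary, threshold = 20, 10, 2
x = ["A"] * boundary + ["B"] * (size - boundary)
for shift in range(6):
    y = ["a"] * (boundary + shift) + ["b"] * (size - boundary - shift)
    print(shift, ted_1d(x, y, threshold))
```

Under these assumptions, shifts of up to two locations are absorbed by the tolerance and reported as `(0, 0)`, while every larger shift is reported as `(1, 1)`, independent of how far the boundary moved.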
Influence of Distance Threshold. In order to study the effect of the threshold distance for boundary shifts, we used an automatic segmentation result (obtained using Sopnet [5]) on a publicly available EM dataset [7] and evaluated the TED for varying thresholds.
The TED reveals that most of the errors occur within a range of boundary shifts corresponding to about 12 pixels in the xy-plane of this dataset. Depending on the biological need, those errors might be tolerable. In the same plot, we show the VOI of the closest tolerable relabeling to the ground truth under the given boundary shift threshold (i.e., the equivalent of Figure 1 (d) on the proposal segmentation). From this example, we can see that these errors contribute quite significantly to the total VOI, and thus can overshadow true topological errors.
Comparison to RI and VOI. We compare RI and VOI against TED for three manual modifications of the ground truth labeling of [7].
For the shift experiment, we shifted the boundaries of neurons in the ground truth by a small distance. For the splits and merges experiment, we split and merged neurons at 10 randomly selected locations, respectively. It can be seen that the small shifts of object boundaries can contribute significantly to the measures RI and VOI, which confirms our previous observation.
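The sensitivity of RI and VOI to boundary shifts can be reproduced on a toy 1D labeling. The sketch below uses the standard definitions (fraction of agreeing location pairs for RI, and the sum of the two conditional entropies in bits for VOI); the function names and the toy data are hypothetical, and the numbers it prints are not those of the experiments reported here.

```python
from collections import Counter
from itertools import combinations
from math import log2


def rand_index(x, y):
    """Fraction of location pairs that x and y treat alike (same vs. different)."""
    pairs = list(combinations(range(len(x)), 2))
    agree = sum((x[i] == x[j]) == (y[i] == y[j]) for i, j in pairs)
    return agree / len(pairs)


def entropy(labels):
    """Shannon entropy in bits of the empirical label distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def variation_of_information(x, y):
    """VOI in bits: H(x|y) + H(y|x) = 2 H(x, y) - H(x) - H(y)."""
    joint = list(zip(x, y))
    return 2 * entropy(joint) - entropy(x) - entropy(y)


# Hypothetical toy data: two regions of 50 locations each, with the
# boundary of y shifted by an increasing number of locations.
size, boundary = 100, 50
x = ["A"] * boundary + ["B"] * (size - boundary)
for shift in (0, 1, 2, 5, 10):
    y = ["a"] * (boundary + shift) + ["b"] * (size - boundary - shift)
    ri = rand_index(x, y)
    voi = variation_of_information(x, y)
    print(f"shift={shift:2d}  RI={ri:.3f}  VOI={voi:.3f} bits")
```

Both measures change gradually with the size of the shift: they report how many locations are affected, not whether a topological error occurred.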
Localization of Errors. Due to the explicit tolerance criterion of the TED, errors can be localized in the volume. In Figure 3 we show example split and merge errors detected by the TED on an automatic segmentation result for the SNEMI dataset [2]. The boundary shift tolerance was set as a physical distance and translated into voxels using the resolution of this volume.
4 Conclusions
We presented the TED, a novel measure for segmentation comparison, which tolerates small errors based on an explicit tolerance criterion.
Although we demonstrated the TED in the domain of neuron segmentation, our error measure is not intrinsically limited to this application. In our future work, we will investigate its use for other computer vision problems, and especially on the training of algorithms to minimize this error measure.
A current limitation of the TED is the restriction to local tolerance functions. Although more involved tolerance criteria could in theory be incorporated into the ILP by adding auxiliary variables, it remains questionable whether the resulting problem is still tractable. Even with the current formulation, it is conceivable that an optimal solution to the ILP cannot be found in reasonable time, although we did not observe this empirically. This could in particular be the case if ground truth and proposal segmentation differ a lot and a very lax tolerance criterion is used. In these cases, approximate solutions to the proposed ILP might be considered.
References
 [1] P. Arbeláez, M. Maire, C. C. Fowlkes, and J. Malik. From Contours to Regions: An Empirical Evaluation. In CVPR, 2009.
 [2] I. Arganda-Carreras, S. H. Seung, A. Vishwanathan, and D. R. Berger. SNEMI 3D: 3D Segmentation of Neurites in EM Images, 2013.
 [3] A. Cardona. Towards semi-automatic reconstruction of neural circuits. Neuroinformatics, 11(1):31–33, 2013.
 [4] W. Denk, K. L. Briggman, and M. Helmstaedter. Structural neurobiology: missing link to a mechanistic understanding of neural computation. Nature reviews Neuroscience, 2012.

 [5] J. Funke, B. Andres, F. A. Hamprecht, A. Cardona, and M. Cook. Efficient automatic 3D-reconstruction of branching neurons from EM data. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1004–1011, 2012.
 [6] J. Funke, J. Martel, S. Gerhard, B. Andres, D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber, H. Pfister, A. Cardona, and M. Cook. Candidate Sampling for Neuron Reconstruction from Anisotropic Electron Microscopy Volumes. In MICCAI, pages 17–24, 2014.
 [7] S. Gerhard, J. Funke, J. Martel, A. Cardona, and R. D. Fetter. Segmented anisotropic ssTEM dataset of neural tissue, 2013.
 [8] G. B. Huang and V. Jain. Deep and Wide Multiscale Recursive Networks for Robust Image Labeling. In arXiv preprint arXiv:1310.0354, 2014.
 [9] V. Jain, B. Bollmann, M. Richardson, D. R. Berger, M. Helmstaedter, K. L. Briggman, W. Denk, J. B. Bowden, J. Mendenhall, W. C. Abraham, K. Harris, N. Kasthuri, K. J. Hayworth, R. Schalek, J. Tapia, J. Lichtman, and S. H. Seung. Boundary Learning by Optimization with Topological Constraints. In CVPR, 2010.
 [10] V. Kaynig and A. Vazquez-Reina. Large-scale automatic reconstruction of neuronal processes from electron microscopy images. IEEE Transactions on Medical Imaging, 1(1):1–14, 2013.

 [11] T. Kröger, S. Mikula, W. Denk, U. Köthe, and F. A. Hamprecht. Learning to segment neurons with non-local quality measures. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 8150 LNCS, pages 419–427, 2013.
 [12] M. London and M. Häusser. Dendritic Computation. Annual Review of Neuroscience, 28(1):503–532, 2005.
 [13] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues. PAMI, 2004.
 [14] D. R. Martin, C. C. Fowlkes, D. Tal, and J. Malik. A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In ICCV, volume 2, pages 416–423, 2001.
 [15] J. Nunez-Iglesias, R. Kennedy, T. Parag, J. Shi, and D. B. Chklovskii. Machine Learning of Hierarchical Clustering to Segment 2D and 3D Images. PLoS ONE, 8(8):e71715, 2013.
 [16] T. Parag, S. M. Plaza, and L. K. Scheffer. Small Sample Learning of Superpixel Classifiers for EM Segmentation Extended Version. In CoRR, volume abs/1406.1, 2014.
 [17] H. Peng, P. Chung, F. Long, L. Qu, A. Jenett, A. M. Seeds, E. W. Myers, and J. H. Simpson. BrainAligner: 3D registration atlases of Drosophila brains. Nature Methods, 8(6):493–498, 2011.
 [18] W. M. Rand. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66:846–850, 1971.
 [19] G. E. Sosinsky, J. Crum, Y. Z. Jones, J. Lanman, B. Smarr, M. Terada, M. E. Martone, T. J. Deerinck, J. E. Johnson, and M. H. Ellisman. The combination of chemical fixation procedures with high pressure freezing and freeze substitution preserves highly labile tissue ultrastructure for electron tomography applications. Journal of Structural Biology, 161(3):359–371, 2008.
 [20] R. Unnikrishnan and M. Hebert. Measures of Similarity. In Seventh IEEE Workshop on Applications of Computer Vision, 2005.
 [21] R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward objective evaluation of image segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):929–944, 2007.