Learned versus Hand-Designed Feature Representations for 3d Agglomeration

by John A. Bogovic, et al.
Howard Hughes Medical Institute

For image recognition and labeling tasks, recent results suggest that machine learning methods that rely on manually specified feature representations may be outperformed by methods that automatically derive feature representations based on the data. Yet for problems that involve analysis of 3d objects, such as mesh segmentation, shape retrieval, or neuron fragment agglomeration, there remains a strong reliance on hand-designed feature descriptors. In this paper, we evaluate a large set of hand-designed 3d feature descriptors alongside features learned from the raw data using both end-to-end and unsupervised learning techniques, in the context of agglomeration of 3d neuron fragments. By combining unsupervised learning techniques with a novel dynamic pooling scheme, we show how pure learning-based methods are for the first time competitive with hand-designed 3d shape descriptors. We investigate data augmentation strategies for dramatically increasing the size of the training set, and show how combining both learned and hand-designed features leads to the highest accuracy.




1 Introduction

A core issue underlying any machine learning approach is the choice of feature representation. Traditionally, features have been hand-designed according to domain knowledge and experience (for example, Gabor filters for image analysis or cepstral coefficients for automatic speech recognition). Recently, it has become more common to attempt to learn features based on supervised or unsupervised learning methods [22, 15, 4, 7, 8, 26, 11]. These automatically derived feature representations have the advantage of not requiring domain expertise and potentially yielding a much larger set of features for a classifier. Perhaps most importantly, however, automatic methods may discover features that are more finely tuned for the particular problem being solved and thus lead to improved accuracy.

For many problems that involve analysis of 3d objects there remains a strong reliance on hand-designed feature descriptors even when machine learning is used in conjunction with such descriptors. For example, the field of 3d shape retrieval has a substantial history of benchmarking hand-designed shape descriptors [23, 31]. Mesh segmentation has recently been addressed using a conditional random field with energy terms based on a hand-curated set of shape-based features [20]. Supervoxel agglomeration for connectomic reconstruction of neurons has, thus far, also been largely dependent on manually specified feature representations [1, 19].

Designing features for representing specific kinds of 3d objects is arguably more intuitive than hand-designing representations for low-level data (such as raw image patches). For example, describing a neuron fragment in terms of quantities such as curvature, volume, and orientation seems natural. On the other hand, it is less clear that this intuitive appeal is a good justification for such a feature representation in a specific task such as neuron fragment agglomeration (Figure 1 and supplementary Figure 5).

(a) Positive Example (b) Negative Example
Figure 1: Renderings of canonical positive and negative edge examples from the training set. We denote a pair of supervoxels which are subject to binary classification as a single edge in the overall agglomeration task [19]. The small black sphere indicates the decision point around which most computations are centered, and the dotted box indicates a cube with a length of pixels in each dimension.

The primary contributions of our work are:

  1. A large set of diverse hand-designed 3d shape descriptors that dramatically improve performance over simple baseline features used in prior work. We evaluate each feature individually, evaluate an ensemble set of all hand-designed features, and compare the computational cost of the features.

  2. An unsupervised learning approach for deriving 3d feature descriptors that, when combined with a novel dynamic pooling scheme, yields performance comparable to an ensemble set of all hand-designed features. To our knowledge, this is the first time purely learned features have been shown to provide competitive performance on a task involving analysis or classification of 3d shapes.

  3. An end-to-end supervised learning approach for deriving 3d feature descriptors. We introduce data augmentation strategies that dramatically expand the size of the training set and thus improve generalization performance of the end-to-end feature learning scheme.

2 Agglomeration of 3d Neuron Fragments

We focus on the application domain of segmentation of large-scale electron microscopy data for the purposes of ‘connectomic’ reconstruction of nervous system structure. Mapping neural circuit connectivity at the resolution of individual synapses is an important goal for neurobiology, which requires nanometer resolution imaging of tissue over large fields of view [14]. Interpreting the resulting tera- or peta-voxel sized datasets currently involves substantial human effort, and thus increased or complete automation through highly accurate computational reconstruction would be ideal [18].

Automated pipelines for segmentation of both natural and non-natural images have converged on a broadly similar set of steps: boundary prediction, oversegmentation, and agglomeration of segments [12, 11, 19, 1, 5]. In this section we describe the source of the raw data, the creation of 3d segments, and the machine learning problem of fragment agglomeration.

Electron microscopy images: Tissue from a drosophila melanogaster brain was imaged using focused ion-beam scanning electron microscopy (FIB-SEM [21]) at a resolution of nm. The tissue was prepared using high-pressure freeze substitution and stained with heavy metals for contrast during electron microscopy. As compared to traditional electron microscopy methods such as serial-section transmission electron microscopy (ssTEM), FIB-SEM provides the ability to image tissue at very high resolution in all three spatial dimensions. Isotropic resolution at the sub-nm scale is particularly advantageous in drosophila due to the small neurite size that is typical throughout the neuropil.

Boundary prediction: We trained a deep and wide multiscale recursive (DAWMR) network [17] to generate affinity graphs from the electron microscopy data. Affinity graphs are similar to pixel-wise boundary prediction maps, except that they encode connectivity relationships between neighboring pixels (in our case, -connectivity due to the 3d image space) [32]. We supplied the DAWMR network with megavoxels of hand-segmented image data for training (with rotation and x-y reflection augmentations further increasing the total amount of data seen during training). The network uses a total field of view of pixels in the prediction of any single affinity edge.

The ground truth affinity graphs are binary representations, where 0 represents the case where two pixels are disconnected (they belong to different objects, or are both part of ‘outside’ space unassigned to any object) and 1 represents the case where two pixels are part of the same object. The DAWMR networks, trained on this ground truth, generate analog [0, 1]-valued affinity graphs.
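To make the ground-truth encoding concrete, the following is a minimal sketch of building a binary affinity graph from a labeled volume. It assumes the usual convention that 1 marks a connected voxel pair and 0 a disconnected one, that label 0 means ‘outside’ space, and that there is one edge per voxel along each of the three axes. The function names `affinity_graph` and `boundary_map` are hypothetical and are not taken from the DAWMR code.

```python
import numpy as np

def affinity_graph(labels):
    """Binary affinity graph of shape (3,) + labels.shape.

    aff[k][p] is 1 when voxel p and its predecessor along axis k carry the
    same nonzero label, and 0 otherwise (label 0 is assumed to be 'outside').
    """
    labels = np.asarray(labels)
    aff = np.zeros((3,) + labels.shape, dtype=np.float32)
    for axis in range(3):
        lab = np.moveaxis(labels, axis, 0)
        out = np.moveaxis(aff[axis], axis, 0)  # a view, so writes land in aff
        out[1:] = (lab[1:] == lab[:-1]) & (lab[1:] != 0)
    return aff

def boundary_map(aff):
    """Collapse the three edge directions by averaging."""
    return aff.mean(axis=0)
```

Averaging over the edge directions, as in `boundary_map`, corresponds to the collapsed representation that the hand-designed features operate on.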

Oversegmentation: The DAWMR-generated affinity graph is thresholded at a value of and objects are ‘grown’ by a seeded watershed procedure to an affinity value of . The affinity graph is then re-segmented at , new objects are added into the overall segmentation, and all objects are grown to a threshold of . This procedure is repeated for thresholds and . A distance-transform based object-breaking watershed procedure is then applied that slightly reduced the rate of undersegmentation in large objects. Finally, all objects are grown to a threshold of .

Training and test sets: Two separate megavoxel volumes were processed by the DAWMR network and oversegmented according to the procedure described above. Neither volume contained data used to train the boundary predictor. Pairs of segments within pixel of each other (we refer to these identified segment-pairs as edges) were labeled by humans as to whether the segments belong to the same or different neuron. One of the two volumes was randomly chosen to be the training set ( edges: positive and negative), and the other volume serves as a test set ( edges: positive and negative). Figure 1 shows examples of both positive and negative segment-pairs.

Learning binary agglomeration decisions: Superpixel agglomeration has recently been posed as a machine learning problem; some methods attempt to optimize classifier performance over a sequence of predictions that reflect, for example, variable ordering of agglomeration decisions based on classifier confidence [27, 19]. In this work, we simply train a classifier on a one-step cost function that reflects the ground truth binary edge assignments. This is designed to simplify the interpretation of feature contributions and ease the computational burden of the many classification experiments we perform. We learn binary agglomeration decisions using a dropout multilayer perceptron (MLP) [16], and for comparison provide certain results using a decision-stump boosting classifier [13].

3 Hand-Designed Features

In this section we describe the proposed hand-designed features and evaluate the performance of each feature by measuring its accuracy on the agglomeration classification problem.

The features for a given pair of segments are computed from a fixed-radius subvolume centered around a ‘decision point’ between the two segments (Figure 1). The subvolume consists of the raw image values as well as the affinity graph produced by the DAWMR network. For simplicity, we often collapse the affinity graph by averaging over the three edge directions, which we refer to as the ‘boundary map.’

The decision point is defined as the midpoint of the shortest line segment that touches both segments. The motivation for this scheme lies in the intuition (based on observing human classification strategies) that the relevant image and object information required to decide whether two segments should be merged is concentrated near the interface between the two segments.
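As an illustration of this definition, the sketch below finds the closest pair of voxels between two segment masks by brute force and returns the midpoint together with the separation distance. It is only a sketch: the function name `decision_point` is hypothetical, distances are measured in voxel units, and the pairwise distance matrix makes it suitable for small subvolumes only (a spatial index would be needed for large segments).

```python
import numpy as np

def decision_point(mask_a, mask_b):
    """Return (midpoint, distance) for the closest voxel pair of two masks.

    mask_a, mask_b: boolean 3d arrays of the same shape, each non-empty.
    """
    pts_a = np.argwhere(mask_a)
    pts_b = np.argwhere(mask_b)
    # Squared Euclidean distance between every voxel of A and every voxel of B.
    diff = pts_a[:, None, :] - pts_b[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    i, j = np.unravel_index(np.argmin(d2), d2.shape)
    midpoint = (pts_a[i] + pts_b[j]) / 2.0
    return midpoint, float(np.sqrt(d2[i, j]))
```

The returned distance is the same quantity as the scalar ‘proximity’ feature described in Section 3.1.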

3.1 Feature Descriptions

Boundary map statistics: after identifying a set of pixels that constitute the interface between two segments, we compute a number of statistics of the boundary map values over these pixels: mean, median, moments (variance, skewness, kurtosis), quartiles, length, minimum value, and maximum value. We also compute these statistics from the first and second derivative of the boundary map. This follows many previous approaches that identify some type of interface between segments, measure statistics at boundary map locations along this interface, and use these statistics as features to train an agglomeration classifier [1, 11, 19]. The boundary map is obtained by averaging the edges of the affinity graph. As noted in Table 1, we consider the ensemble of these statistics (experiment 7) as a baseline feature set.
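The statistics listed above can be sketched as follows, given the boundary map values already gathered over the interface pixels. This is an illustrative sketch only: the function name `boundary_stats`, the use of population (biased) moments, and the linear-interpolation quartiles are assumptions, and the paper does not specify how the interface itself is extracted.

```python
import numpy as np

def boundary_stats(values):
    """Summary statistics of boundary map values along a segment interface."""
    v = np.asarray(values, dtype=np.float64)
    mean = v.mean()
    centered = v - mean
    var = (centered ** 2).mean()
    std = np.sqrt(var)
    # Standardized third and fourth moments; defined as 0 for constant input.
    skew = (centered ** 3).mean() / std ** 3 if std > 0 else 0.0
    kurt = (centered ** 4).mean() / std ** 4 if std > 0 else 0.0
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    return {
        "mean": mean, "median": median,
        "variance": var, "skewness": skew, "kurtosis": kurt,
        "q1": q1, "q3": q3,
        "length": v.size, "min": v.min(), "max": v.max(),
    }
```

The same function could be applied to derivative images of the boundary map sampled at the same interface pixels.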

Size: the volumes of both segments, and their log values.

Proximity: a scalar giving the shortest distance from a voxel assigned to one segment to a voxel assigned to the other segment.

Growth: segments are isolated within the component mask and are grown via a seeded watershed transform until they share a catchment basin. The affinity graph value at which this occurs yields the first growth feature. The second growth feature is given by the distance from the decision point to the location at which the catchment basins merge.

Rays: lines are propagated from the centroid of a segment until they terminate [30]. The features describe the average distance these rays travel before termination under one of two conditions: the affinity graph value falls below a specified threshold, or the ray exits a mask defined by the union of the two segments. We seed rays from both segments and use five choices of affinity graph threshold. Our experiments used rays uniformly distributed over the sphere.

Another type of ray feature describes the average distance rays travel through one segment when seeded from the other segment. Figure 2(a) shows an example of the rays used for this feature.

SIFT: scale-invariant feature transform (SIFT) descriptors are computed that summarize the image gradient magnitudes and orientations near the decision point [24]. We cluster the descriptors using k-means and represent each descriptor as a feature vector based on a soft vector quantization encoding. SIFT features are computed using both the image data and the affinity graph.

Angles: we compute two vectors giving the orientation of each of the segments, and a third vector that points from the center of mass of one segment to the center of mass of the other. The orientation of each segment is computed from a smooth vector field determined by the eigenvector associated with the largest eigenvalue of a windowed segment’s second-moment matrix (see the Appendix of [19] for details). Features include the length of the third vector and the angles formed by each of the two orientation vectors with the third vector. This procedure is repeated with downsampled object masks, and with objects grown using the affinity graph watershed transform (as for growth features) at multiple choices of threshold, yielding 33 angle features in total (Table 1, experiment 11). Intuitively, we expect that two segments should be merged if the orientation of one segment is parallel with the vector pointing to the other segment.

(a) Ray Features (b) Level Set Initialization (c) Level Set Evolution
Figure 2: Demonstration of ray and level set features on positive edge examples. Rays originate in the red segment and penetrate the blue segment. The surfaces in (b) show the initialization state of the level set, and the multiple green surfaces in (c) show the results after various amounts of level set evolution.

Level sets: a segment is eroded to produce a contour that initializes a level set [25]. It is then evolved under a speed function that should, ideally, result in the deformed segment moving towards and into the other segment if those segments belong together. The speed function determining the evolution consists of orientation and gradient vector flow fields.

The orientation vector field is computed from the primary eigenvector of the second-image-moment matrix. This field serves to move the initial contour from one segment to the other and provides evidence for positive examples. The eigenvalues of the moment matrix describe how tubular, flat, or spherical each segment is. Therefore, we also compute the mean and standard deviation of the three eigenvalues, yielding 6 orientation features.

A gradient vector flow (GVF) field [33] is computed from the boundary prediction map in a manner similar to [34]. This field can prevent the contour from crossing the boundary between segments and serves as evidence for negative examples. We compute the mean and standard deviation of the curl and divergence of the gradient vector field over the interface between segments, yielding 4 GVF features.

The level set overlap feature is the number of pixels belonging to both the level set result and the other segment. This process is repeated in reverse (starting the evolution from the other segment). We use these two overlap quantities, along with the mean, minimum, maximum, and absolute difference of the two results, to yield 6 overlap features in total.
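The overlap computation itself is simple once the level set evolution has produced a binary result. The sketch below assumes the evolved contours are already available as boolean masks (the evolution is not implemented here), and the function names are hypothetical. It assembles the two overlap counts and the four derived quantities into a six-element feature vector.

```python
import numpy as np

def level_set_overlap(evolved_mask, other_segment_mask):
    """Number of voxels shared by an evolved level set and the other segment."""
    return int(np.count_nonzero(evolved_mask & other_segment_mask))

def overlap_features(overlap_ab, overlap_ba):
    """Two directional overlaps plus their mean, min, max, and absolute difference."""
    return [
        overlap_ab,
        overlap_ba,
        (overlap_ab + overlap_ba) / 2.0,
        min(overlap_ab, overlap_ba),
        max(overlap_ab, overlap_ba),
        abs(overlap_ab - overlap_ba),
    ]
```

Including the symmetric summaries (mean, minimum, maximum, absolute difference) makes the feature vector less sensitive to which segment happens to be listed first.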

Shape diameter function: the local width of each segment, represented via statistics on the shape diameter function as defined in [29]. The shape diameter function has been widely used for 3d mesh analysis and segmentation. We include both moments (8 features) and quantile-based statistics (10 features).

Shape context: the local shape of each segment, using a 3d implementation of [3]. In particular, we consider the shape to be the set of all points inside the window and on the boundary of either segment. Shape context is computed using the window’s central point as a reference, and a histogram with radial, polar angle, and azimuth angle bins. We cluster these quantities using k-means, then represent the feature using soft vector quantization.

Exp.  Feature set description             Train ACC(%)  Train AUC(%)  Test ACC(%)  Test AUC(%)  Dim.    Cost

      Boundary Map (bm)
   1  bm mean, median, interface len.            83.64         92.44        80.92        90.46     6     5.7
   2  exp 1 + bm moments                         85.27         93.39        82.68        91.64     9     5.7
   3  exp 1 + bm quantiles                       84.54         93.11        82.04        91.33     8     5.7
   4  exp 1 + bm quantiles, min/max              85.03         93.36        82.69        91.63    10     7.9
   5  exp 1 + bm deriv. mean, median             90.31         96.71        88.64        95.73    14     7.9
   6  exp 1 + all bm deriv. stats                91.85         97.61        89.11        96.05    42    14.0
   7  baseline (exp 1:6)                         92.30         97.85        89.41        96.05    49    14.0
   8  baseline (exp 1:6), boosting               92.17         97.88        88.56        95.36    49       -

   9  exp 7 + growth                             92.55         98.09        89.67        96.32    51     1.0
  10  exp 7 + proximity                          92.18         97.85        89.09        96.02    50   489.1
  11  exp 7 + angles                             95.74         99.25        89.65        96.27    82    13.0
  12  exp 7 + size                               93.31         98.43        90.28        96.61    53     1.9
  13  exp 7 + rays                               94.36         98.92        90.06        96.52    91    44.2
  14  exp 7 + shape diam. quantiles              93.52         98.56        89.82        96.46    59   402.5
  15  exp 7 + shape diam. moments                94.26         98.72        86.32        92.14    57   402.5
  16  exp 7 + shape context                      94.71         99.03        89.91        96.50    69     5.5
  17  exp 7 + convex hull                        93.47         98.56        89.97        96.74    57     8.7
  18  exp 7 + level sets overlap                 92.23         97.90        89.16        96.08    55   464.0
  19  exp 7 + level sets gradient v.f.           93.20         98.28        89.59        96.17    53    35.0
  20  exp 7 + level sets orientation             93.74         98.61        90.13        96.75    55   229.4

  21  exp 7 + SIFT soft v.q.                     99.04         99.93        88.75        95.67   149    56.0
  22  exp 7 + image moments                      93.16         98.29        89.26        96.12    53     4.1
  23  exp 7 + image deriv. stats                 95.58         99.22        89.09        95.61    85     5.8
  24  exp 7 + image stats                        96.42         99.43        88.85        95.73    94     5.8
  25  all hand-designed (exp 1:24)               99.98         99.98        92.33        97.61   363       -
Table 1: Classification experiments with hand-designed features. For each experiment, we provide the training and test accuracy (ACC), area under the ROC curve (AUC), number of total dimensions in the input feature vector (Dim.), and relative computation time for each individual feature (Cost). The notation ‘exp N +’ denotes that the feature set from experiment N was added (i.e., set union) to the feature set in that experiment. All experiments except 8 used a drop-out multilayer perceptron as the classifier.

Convex hull: the number of pixels in the convex hull of each segment contained inside and outside of the segment and the log values of these quantities [2].

3.2 Classification Experiments and Results

We performed a variety of classification experiments in which we varied the set of hand-designed features provided to the classifier, as summarized in Table 1. The ‘Cost’ column represents the wall-clock time taken to compute each feature set, normalized by the time taken for the fastest feature (‘growth’). Figure 3 shows the precision-recall curve for experiments using hand-designed features as well as results from experiments using the feature learning schemes described in subsequent sections.

As our classifier, we use a drop-out multilayer perceptron with rectified linear hidden units [16], but also present results using a decision-stump boosting classifier for comparison (experiment 8).

Performance improves substantially as the feature set grows from a simple set of features derived from boundary map values (experiment 1: 80.92% test set classification accuracy) to the combined set of all hand-designed features (experiment 25: 92.33% accuracy). Interestingly, when considered in isolation, some of the simplest features, such as size and convex hull, provide some of the largest improvements in accuracy. However, using all the hand-designed features together yields significantly higher accuracy and improved precision as compared to any individual feature.

(a) Hand-Designed and Learned Experiments (b) Unsupervised Learning Experiments
Figure 3: Precision-recall curves comparing (a) hand-designed, end-to-end, and unsupervised feature learning schemes, and (b) different pooling schemes for unsupervised features. Unsupervised representation learning combined with dynamic pooling (unsupervised dyn-obj dyn-bnd) yields comparable performance to an ensemble of all hand-designed features (all hand-designed), while combining both learned and hand-designed features yields the best performance (unsupervised dyn-obj dyn-bnd, hand-designed).

4 Learned Features

In this section, we describe two data-driven feature representations. In contrast to hand-designed features, these representations do not require domain knowledge specialized to the data set being considered, and can therefore be easily adapted to new types of data. In addition, they are tuned to the statistics of the particular problem being solved, and may therefore prove to be complementary to or exceed the performance of hand-designed features.

4.1 End-to-end Learning

A naive but powerful approach is to simply provide the raw input signal values to the classifier. In such an approach, the classifier generally consists of multiple non-linear processing layers, and the classifier is tasked with mapping the raw input signal to intermediate hidden representations that improve overall classification performance. This approach, sometimes called ‘end-to-end’ learning, has achieved state-of-the-art performance on a variety of vision problems using multi-layer perceptrons and convolutional neural networks [6, 22].

We implement the end-to-end learning approach in the context of 3d agglomeration by creating, for each edge, a feature vector that contains image, segment, and boundary information within a 3d bounding box centered around the ‘decision point’ (as defined at the beginning of Section 3). Specifically, we provide raw image values from the electron microscopy data, boundary map values, and two binary segment masks. A particular mask is non-zero only where a given segment belonging to the edge is present.

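A multiscale representation of the region around the decision point can be obtained by extracting the raw voxel values using multiple windows of varying radii. Further, to control the dimensionality of the input when using a large window radius, the subvolume of raw values can be downsampled by some factor in each spatial dimension. As a result, for a particular scale consisting of a bounding box of radius $r$ and a downsampling factor of $d$, the total dimensionality of the feature vector is $4\,(2r/d + 1)^3$: four input channels (raw image, boundary map, and two segment masks), each sampled on a cube with $2r/d + 1$ voxels per side. For example, a radius of 5 without downsampling gives $4 \cdot 11^3 = 5324$ dimensions, matching the Dim. column of Table 2.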
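The dimensionality of the end-to-end input can be checked with a short sketch. It assumes four input channels (raw image, boundary map, and two segment masks, as described above) and a window that lies fully inside the volume; the function names and the strided-slicing form of downsampling are assumptions made for illustration.

```python
import numpy as np

def end_to_end_dim(radius, downsample, channels=4):
    """Feature vector length for one scale of the end-to-end representation."""
    side = (2 * radius) // downsample + 1
    return channels * side ** 3

def extract_window(channels, center, radius, downsample):
    """Flatten a strided cube around `center` from each channel and concatenate.

    channels: sequence of 3d arrays with identical shape.
    The window is assumed to lie entirely inside the volume.
    """
    sl = tuple(slice(c - radius, c + radius + 1, downsample) for c in center)
    return np.concatenate([np.asarray(ch)[sl].ravel() for ch in channels])
```

With a radius of 5 and no downsampling this gives 5324 dimensions, and adding a second scale with a radius of 10 downsampled by 2 gives 10,648 in total, which agrees with the dimensions reported for the end-to-end experiments in Table 2.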

4.2 Unsupervised Feature Learning

End-to-end learning can be particularly difficult when the size of the training set is limited (relative to the dimensionality of the data), as the classifier must discover useful patterns and invariances in the original data representation from a limited amount of supervised signal. However, the original (unlabeled) data itself can be useful as an additional signal, by learning representations that are capable of reproducing the data. These ‘unsupervised’ approaches learn feature representations by optimizing models that reconstruct the raw data in the presence of various forms of regularization such as a bottleneck or sparsity [15, 28].

[Figure 4 panels. Rows: Dynamic Object pooling, Dynamic Boundary pooling. Columns: Segments, Pooling region, Rendering.]
Figure 4: Examples of dynamic pooling: the top row shows object pooling for the positive edge segments shown in Figure 1, and the bottom row shows boundary pooling for the negative edge segments. The left column shows 2d slices of the segmentation, and the center column shows the corresponding raw image data with an overlay of the pooling region, where the dynamic pooling regions correspond to using a window of radius 10 voxels, and the total slice area corresponds to the context needed to generate the feature representation for all locations in the window. The right column gives a rendering of the 3d pooling regions, where the pooling window is given by the bounding box indicated by dashed lines.

We experimented with using the unsupervised feature learning and extraction module used in DAWMR networks and adapting it to the agglomeration task. The core of this module consists of vector quantization (VQ) of patches of the data, where the dictionary is learned using orthogonal matching pursuit (OMP-1), and encoding is performed using soft-thresholding with reverse polarity. This core component is performed at two scales (original resolution and downsampling by two in each spatial dimension), and a foveated representation is produced by concatenating the encoding produced at a center location with a max-pooled encoding over all locations within a radius of two of the center. Therefore, a support region is used to produce the representation centered at a given voxel.
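The encoding and foveation steps can be sketched as follows. This is a hedged illustration rather than the DAWMR implementation: it assumes the common form of soft-thresholding with reverse polarity in which positive and negative dictionary responses are thresholded separately and concatenated, the threshold `alpha` is a hypothetical value, the pooling neighborhood is taken to be a cube rather than a ball, and dictionary learning with OMP-1 is not shown.

```python
import numpy as np

def encode(patches, dictionary, alpha=0.5):
    """Soft-threshold encoding with reverse polarity.

    patches: (n, p) flattened patches; dictionary: (k, p) dictionary rows.
    Returns an (n, 2k) array: thresholded positive and negative responses.
    """
    proj = patches @ dictionary.T
    pos = np.maximum(0.0, proj - alpha)
    neg = np.maximum(0.0, -proj - alpha)
    return np.concatenate([pos, neg], axis=1)

def foveate(codes, center, radius=2):
    """Concatenate the code at `center` with a max-pool over its neighborhood.

    codes: (X, Y, Z, F) array of per-voxel encodings.
    """
    x, y, z = center
    lo = [max(0, c - radius) for c in (x, y, z)]
    hi = [c + radius + 1 for c in (x, y, z)]
    window = codes[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    pooled = window.reshape(-1, codes.shape[-1]).max(axis=0)
    return np.concatenate([codes[x, y, z], pooled])
```

Keeping the center code alongside the pooled code preserves precise information at the location of interest while adding some tolerance to small shifts in the surrounding context.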

A straightforward method of adapting this feature representation to the problem of 3d agglomeration is to simply extract the feature representation at the decision point, which we will refer to as simply the ‘midpoint’ feature. Similar to end-to-end learning described above, the input data to the feature learning and extraction module consists of the raw image values, boundary map values, and a single binary segment mask that is non-zero only where either segment belonging to the edge is present. (We found that a single binary segment mask gave comparable performance to using two separate masks as used in end-to-end learning.)

However, the agglomeration task of deciding whether or not to merge two segments likely requires a greater context than the boundary prediction problem that the DAWMR feature representation was originally designed for. Therefore, we also consider extracting the foveated feature representation from every location within a fixed-radius window of the decision point, and average-pool these features. We refer to this as ‘static-all’ pooling, and concatenate this feature with the midpoint feature to obtain the ‘midpoint + all’ feature set.

We further introduce the notion of dynamic pooling, where the region to pool over is dependent on the segments themselves. For instance, rather than average pooling over all features within a window as in ‘static-all’ pooling, we can restrict the average pooling to only those features corresponding to locations in either of the two segments (within a fixed-radius window of the decision point). This procedure, which we term dynamic object (‘dyn-obj’) pooling, may improve results over ‘static-all’ pooling by ignoring locations that are irrelevant to the agglomeration decision.

Another approach to dynamic pooling is to focus on those locations whose interpretation would change as a result of the agglomeration decision. In particular, the interpretation of those locations ‘in-between’ the two segments would change depending on whether the two segments were merged into a single object or kept as two separate segments. Therefore, we introduce the notion of dynamic pooling along the boundary between two segments, which we refer to as ‘dyn-bnd’ pooling. This is done by dilating each segment by a fixed amount (in our experiments, by half the radius used for the window around the decision point), and then considering those locations in the intersection of the two dilated segments. Both dynamic pooling methods are illustrated in Figure 4.
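The three pooling regions can be sketched in a few lines. The sketch below is illustrative only: the function names are hypothetical, the window is taken to be a cube around the decision point, and the dilation uses a 6-connected structuring element applied repeatedly, which is an assumption since the shape of the dilation is not specified in the text.

```python
import numpy as np

def dilate(mask, steps):
    """Binary dilation with 6-connectivity, applied `steps` times."""
    out = np.asarray(mask, dtype=bool).copy()
    for _ in range(steps):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = out.copy()
        for axis in range(3):
            for shift in (0, 2):  # neighbor on either side along this axis
                sl = [slice(1, -1)] * 3
                sl[axis] = slice(shift, shift + out.shape[axis])
                grown |= padded[tuple(sl)]
        out = grown
    return out

def pooling_region(mask_a, mask_b, center, radius, mode):
    """Boolean mask of locations to pool over for 'all', 'dyn-obj', or 'dyn-bnd'."""
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    window = np.zeros(mask_a.shape, dtype=bool)
    lo = [max(0, c - radius) for c in center]
    hi = [c + radius + 1 for c in center]
    window[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = True
    if mode == "all":
        return window
    if mode == "dyn-obj":
        return window & (mask_a | mask_b)
    if mode == "dyn-bnd":
        grow = radius // 2  # dilate by half the window radius
        return window & dilate(mask_a, grow) & dilate(mask_b, grow)
    raise ValueError("unknown pooling mode: " + mode)

def average_pool(features, region):
    """Mean of (X, Y, Z, F) features over the voxels selected by `region`."""
    if not region.any():
        return np.zeros(features.shape[-1])
    return features[region].mean(axis=0)
```

Because the region depends on the segment masks, the number of pooled locations varies from edge to edge, while the pooled feature vector keeps a fixed length.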

Finally, similar to end-to-end learning above, we consider multiscale dynamic pooling representations given by extracting features within windows of differing radii.

4.3 Classification Experiments and Results

Exp.  Feature set description             Train ACC(%)  Train AUC(%)  Test ACC(%)  Test AUC(%)    Dim.    Cost

      End-to-end
   1  r=5, d=1, 20 hidden units                  98.87         99.64        82.62        92.36    5324     7.6
   2  r=5, d=1                                   100.0         99.98        84.34        93.53    5324     7.6
   3  r=5, d=1, 400 hidden units                 100.0         99.98        85.02        93.62    5324     7.6
   4  exp 2 + (r=10, d=2)                        100.0         99.98        84.50        93.75  10,648    12.6
   5  exp 3 + (r=10, d=2)                        100.0         99.98        85.54        93.99  10,648    12.6

      Unsupervised
   6  midpoint                                   100.0         99.98        87.84        95.19    8000     9.9
   7  midpoint + static-all (r=10)               100.0         99.98        88.85        95.89  16,000   371.7
   8  midpoint + dyn-obj (r=10)                  100.0         99.98        89.65        96.28  16,000   368.0
   9  dyn-bnd (r=10)                             100.0         99.98        91.24        96.96    8000   371.2
  10  dyn-obj + dyn-bnd (r=4 + r=10)             100.0         99.98        91.38        97.14  16,000   246.6
Table 2: Classification experiments with learned features. Here r denotes the window radius and d the downsampling factor. Dynamic pooling strategies (dyn-obj and dyn-bnd) are critical to achieving accuracy levels competitive with hand-designed features.

Results using the two data-driven representations above are presented in Table 2 and Figure 3(a). As with the hand-designed features, classification was performed using a drop-out multilayer perceptron; hidden unit counts that differ from the default are noted in Table 2. The end-to-end features outperform the rudimentary boundary map features (Table 1, experiment 1) on the test set, but not the larger baseline feature set (Table 1, experiment 7). The unsupervised feature set achieves much better test set performance, approaching that of the full hand-designed feature set (91.38% versus 92.33% test accuracy). Figure 3(b) demonstrates the improvements that dynamic pooling methods can achieve.

5 Training Set Augmentation

Next, we describe experiments designed to improve generalization performance through synthetic augmentation of the training set. The motivation behind this methodology comes from work such as Decoste and Schölkopf [9] and Drucker et al. [10], which create ‘virtual’ examples by applying some set of transformations to examples in the original training set and use these examples during classifier training. Thus, the training procedure is more likely to produce a classifier invariant to the given transformations. In this work, we experiment with ‘swap,’ ‘isometry,’ and ‘jitter’ augmentations of the training data.

The ‘swap’ transformations exchange the identities of the first and second segments (i.e. swapping the ordering). The ‘isometry’ augmentation considers all possible isometries of the underlying data. The image data is slightly anisotropic, as the -axis corresponds to the milling direction in FIB-SEM, orthogonal to the imaging plane (see Section 2, Electron microscopy images). Distance-preserving maps of the data therefore include four rotations of the - plane, reflection of the -axis, and reflection of the -axis. In total, these transforms form a group of order 16, equivalent to the isometries of a square prism, or . Finally, ‘jitter’ augmentations slightly shift the location of the decision point. In this work, our experiments use different decision points, where the original decision point is offset by all combinations of in all coordinates.
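The isometry and jitter augmentations can be illustrated with a short sketch. Several details here are assumptions made for illustration: the milling direction is taken to be the last array axis, the in-plane reflection is applied along the first axis, and jitter offsets are drawn from {-step, 0, +step} in each coordinate, with `step` a hypothetical parameter. The function names are likewise hypothetical.

```python
import itertools
import numpy as np

def isometries(volume):
    """Yield the 16 isometries of a volume whose first two axes span the imaging plane.

    Combines 4 in-plane rotations, an in-plane reflection, and a reflection
    along the last (assumed milling) axis: 4 * 2 * 2 = 16 transforms.
    """
    for flip_milling in (False, True):
        v = volume[:, :, ::-1] if flip_milling else volume
        for flip_in_plane in (False, True):
            w = v[::-1] if flip_in_plane else v
            for k in range(4):
                yield np.rot90(w, k, axes=(0, 1))

def jitter_offsets(step=1):
    """All combinations of {-step, 0, +step} over three coordinates (assumed form)."""
    return list(itertools.product((-step, 0, step), repeat=3))

def augmentation_factor(n_jitter):
    """Multiplier on training set size when swap, isometry, and jitter are combined."""
    return 2 * 16 * n_jitter
```

Under these assumptions there are 27 jitter offsets (including the original decision point), so applying all three augmentations together would multiply the number of training examples by 864.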

5.1 Augmentation Results

We experimented with hand-designed, end-to-end, and unsupervised features using training set augmentation. We also explored using a more powerful classifier with two hidden layers; this deeper classifier could be especially important when augmentation is used, as the amount of training data increases dramatically.

Table 4 in the supplementary material gives the full results for experiments with different types of augmentation. Although augmented training examples had a slightly detrimental effect on the classification results when using hand-designed features, the end-to-end features benefited significantly from using all augmentation types simultaneously (thus expanding the training set from 14,552 to 12,572,928 examples). The two-layer MLP classifier further improved performance for end-to-end features using all augmentations. Overall, however, even with augmentation, generalization performance for end-to-end features was still much worse than for unsupervised features trained without augmentation. Adding augmentations to the unsupervised feature experiments had minimal effect on generalization performance.
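The training set sizes listed in Table 4 can be checked with a few lines of arithmetic. The sketch below assumes multiplicative factors of 2 (swap), 16 (isometry), and 27 (jitter); these factors are inferred from the table rather than quoted from the text, and the variable names are illustrative.

```python
# Base number of training examples (Table 4, no augmentation).
base = 14_552

# Assumed multiplicative factor contributed by each augmentation type.
factors = {"swap": 2, "isometry": 16, "jitter": 27}

# Training set size when each augmentation is applied on its own.
sizes = {name: base * factor for name, factor in factors.items()}

# Training set size when all three augmentations are combined.
all_aug = base * factors["swap"] * factors["isometry"] * factors["jitter"]

# Ratio for the unsupervised 'all augmentations' rows of Table 4.
unsup_ratio = 1_571_616 / base
```

Under these assumptions the combined factor is 864, which reproduces the end-to-end ‘all augmentations’ count. The unsupervised rows use a smaller count (a ratio of 108 to the base); the text notes only that these features are invariant to segment order, so the remaining difference is not explained here.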

6 Feature Combination and Selection

We explore combining various learned feature schemes with hand-designed features, with the hypothesis that the hand-designed features may more easily capture higher-level or non-linear edge or segment characteristics than the learned methods. We use all training set augmentations for these experiments, since this case markedly improved the end-to-end feature learning approach. For computational reasons, we omitted the most expensive hand-designed features, namely SIFT features, shape diameter (moments and quantiles), level set overlap, and level set orientation. Proximity was not omitted because it is necessary for other aspects of the pipeline. Results of these experiments are given in Table 3. Test set accuracy improves for both end-to-end and unsupervised learned features when used in combination with hand-designed features, though the improvement is more marked for end-to-end features.

| Exp. | Feature Set Description | Training ACC(%) | Training AUC(%) | Testing ACC(%) | Testing AUC(%) | Dim. | Training Ex. |
|---|---|---|---|---|---|---|---|
| 1 | hand-designed + end-to-end | 95.06 | 99.07 | 92.09 | 97.74 | 5546 | 12,572,928 |
| 2 | hand-designed + unsupervised | 100.0 | 99.98 | 92.21 | 97.67 | 16,222 | 1,571,616 |

Table 3: Classification experiments using a combination of hand-designed and learned features.
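The dimensionalities in Table 3 are consistent with simply concatenating a reduced hand-designed feature vector with each learned representation. The sketch below assumes concatenation and a 222-dimensional reduced hand-designed set; both are inferred from the reported dimensions (5546 − 5324 and 16,222 − 16,000) and are not stated explicitly in the text. The arrays are zero-filled placeholders.

```python
import numpy as np

n_examples = 8  # toy batch size, for illustration only

# Placeholder feature matrices with the dimensionalities discussed above.
hand_designed = np.zeros((n_examples, 222))   # inferred size of the reduced set
end_to_end = np.zeros((n_examples, 5324))     # end-to-end learned features
unsupervised = np.zeros((n_examples, 16000))  # unsupervised learned features

# Combine by concatenating along the feature axis (an assumed scheme).
combo_end_to_end = np.concatenate([hand_designed, end_to_end], axis=1)
combo_unsupervised = np.concatenate([hand_designed, unsupervised], axis=1)
```

The combined matrices would then be passed to the same classifier as any other feature set.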

7 Discussion

We have demonstrated that features derived purely from learning algorithms can provide highly informative representations for a classification task involving 3d objects. The key innovation in achieving this result was a type of dynamic pooling that selectively pools feature representations from different spatial locations in a manner (dynamically) dependent on the shape of the underlying objects involved in the classification. We were able to implement this strategy in a straightforward way using an unsupervised learning approach, as the feature learning phase was separated from the encoding stage in which the pooling is performed.
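To make the idea of shape-dependent pooling concrete, here is a minimal NumPy sketch in which the pooling regions for each example are given by the binary masks of its two segments. This is an illustrative assumption and not necessarily the pooling scheme used in the experiments; the function names, the (C, Z, Y, X) feature layout, and the choice of max pooling are hypothetical.

```python
import numpy as np


def masked_max_pool(features, mask):
    """Max-pool a (C, Z, Y, X) feature volume over voxels where mask is True.

    Returns a length-C vector; an empty mask yields zeros.
    """
    if not mask.any():
        return np.zeros(features.shape[0], dtype=features.dtype)
    # Boolean indexing over the three spatial axes gives a (C, N) array.
    return features[:, mask].max(axis=1)


def dynamic_pool(features, seg_a, seg_b):
    """Concatenate pooled features from two segment-specific regions.

    The pooling regions change from example to example because they follow
    the segment masks rather than a fixed spatial grid.
    """
    return np.concatenate(
        [masked_max_pool(features, seg_a), masked_max_pool(features, seg_b)]
    )


# Toy example: 2 channels on a 2x2x2 window, segments split along z.
features = np.arange(16, dtype=float).reshape(2, 2, 2, 2)
seg_a = np.zeros((2, 2, 2), dtype=bool)
seg_a[0] = True
seg_b = ~seg_a
pooled = dynamic_pool(features, seg_a, seg_b)
```

Because the masks are inputs rather than learned parameters, this kind of pooling is easy to apply after unsupervised feature learning, but it would have to be threaded through backpropagation in an end-to-end setting.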

These methods and results are a starting point for further work involving feature learning methods applied to 3d objects. In particular, the results motivate a more sophisticated end-to-end strategy that also incorporates dynamic pooling. Learning such an architecture will be more involved than in the unsupervised case, as the variations in spatial pooling (from one example to the next) will need to be incorporated into the learning algorithm.

Another open question is whether learning architectures for these types of problems would benefit from more complicated non-linearities or recurrent/recursive structure; some of the hand-designed features that appear to provide predictive benefit are based on highly non-linear iterative methods (e.g., level sets) or ray-tracing (e.g., ray features and shape diameter function), both of which are computations that might be difficult for a typical multilayer network architecture to emulate. Adding specific representational capacity motivated by these hand-designed strategies while preserving the ability to train most details of the architecture could offer a superior approach.

Acknowledgements: We thank Zhiyuan Lu for sample preparation, Shan Xu and Harald Hess for FIB-SEM imaging, and Corey Fisher and Chris Ordish for data annotation.


  • [1] B. Andres, U. Koethe, M. Helmstaedter, W. Denk, and F. Hamprecht. Segmentation of SBFSEM volume data of neural tissue by hierarchical classification. In Proceedings of the 30th DAGM Symposium on Pattern Recognition, pages 142–152. Springer, 2008.
  • [2] E. Bas, M. G. Uzunbas, D. Metaxas, and E. Myers. Contextual grouping in a concept: a multistage decision strategy for EM segmentation. In ISBI, pages 1–8, 2012.
  • [3] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE PAMI, 24(4):509–522, Apr. 2002.
  • [4] Y. Bengio. Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1):1–127, 2009.
  • [5] D. Chklovskii, S. Vitaladevuni, and L. Scheffer. Semi-automated reconstruction of neural circuits using electron microscopy. Current Opinion in Neurobiology, 2010.
  • [6] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural computation, 22(12):3207–3220, 2010.
  • [7] A. Coates, A. Karpathy, and A. Ng. Emergence of object-selective features in unsupervised feature learning. In Advances in Neural Information Processing Systems 25, pages 2690–2698, 2012.
  • [8] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
  • [9] D. Decoste and B. Schölkopf. Training Invariant Support Vector Machines. Machine Learning, 46:161–190, 2002.
  • [10] H. Drucker, R. Schapire, and P. Simard. Boosting Performance in Neural Networks. International Journal of Pattern Recognition and Artificial Intelligence, 07(04):705–719, Aug. 1993.
  • [11] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale feature learning, purity trees, and optimal covers. arXiv preprint arXiv:1202.2160, 2012.
  • [12] C. Fowlkes, D. Martin, and J. Malik. Learning affinity functions for image segmentation: Combining patch-based and gradient-based approaches. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2003), 2003.
  • [13] Y. Freund and L. Mason. The alternating decision tree learning algorithm. In Machine Learning: Proceedings of the Sixteenth International Conference, pages 124–133, 1999.
  • [14] M. Helmstaedter, K. L. Briggman, and W. Denk. 3D structural imaging of the brain with photons and electrons. Current Opinion in Neurobiology, 18(6):633–641, 2008.
  • [15] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504, 2006.
  • [16] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • [17] G. B. Huang and V. Jain. Deep and wide multiscale recursive networks for robust image labeling. arXiv preprint arXiv:1310.0354, 2013.
  • [18] V. Jain, H. Seung, and S. Turaga. Machines that learn to segment images: a crucial technology for connectomics. Current Opinion in Neurobiology, 2010.
  • [19] V. Jain, S. C. Turaga, K. Briggman, M. N. Helmstaedter, W. Denk, and H. S. Seung. Learning to agglomerate superpixel hierarchies. In Advances in Neural Information Processing Systems, pages 648–656, 2011.
  • [20] E. Kalogerakis, A. Hertzmann, and K. Singh. Learning 3d mesh segmentation and labeling. ACM Transactions on Graphics (TOG), 29(4):102, 2010.
  • [21] G. Knott, H. Marchman, D. Wall, and B. Lich. Serial section scanning electron microscopy of adult brain tissue using focused ion beam milling. Journal of Neuroscience, 28(12):2959, 2008.
  • [22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [23] B. Li, A. Godil, M. Aono, X. Bai, T. Furuya, L. Li, R. López-Sastre, H. Johan, R. Ohbuchi, C. Redondo-Cabrera, et al. Shrec’12 track: Generic 3d shape retrieval. In Proceedings of the 5th Eurographics conference on 3D Object Retrieval, pages 119–126. Eurographics Association, 2012.
  • [24] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, Nov. 2004.
  • [25] R. Malladi, J. A. Sethian, and B. C. Vemuri. Shape modeling with front propagation: a level set approach. IEEE PAMI, 17(2):158–175, 1995.
  • [26] M. Ranzato, Y.-L. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. Advances in Neural Information Processing Systems, 20:1185–1192, 2007.
  • [27] J. Nunez-Iglesias, R. Kennedy, T. Parag, J. Shi, and D. B. Chklovskii. Machine learning of hierarchical clustering to segment 2D and 3D images. In CVPR, pages 1–15, 2013.
  • [28] B. A. Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
  • [29] L. Shapira, A. Shamir, and D. Cohen-Or. Consistent mesh partitioning and skeletonisation using the shape diameter function. The Visual Computer, 24(4):249–259, Jan. 2008.
  • [30] K. Smith, A. Carleton, and V. Lepetit. Fast Ray Features for Learning Irregular Shapes. IEEE Computer Vision, pages 397–404, 2009.
  • [31] J. W. Tangelder and R. C. Veltkamp. A survey of content based 3d shape retrieval methods. Multimedia tools and applications, 39(3):441–471, 2008.
  • [32] S. C. Turaga, J. F. Murray, V. Jain, F. Roth, M. Helmstaedter, K. Briggman, W. Denk, and H. S. Seung. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation, 22(2):511–538, 2010.
  • [33] C. Xu and J. L. Prince. Snakes, shapes, and gradient vector flow. IEEE Transactions on Image Processing, 7(3):359–69, Jan. 1998.
  • [34] Z. Yang, J. A. Bogovic, A. Carass, M. Ye, P. C. Searson, and J. L. Prince. Automatic cell segmentation in fluorescence images of confluent cell monolayers using multi-object geometric deformable model. In SPIE Medical Imaging, volume 8669, page 866904, Mar. 2013.

Appendix A Supplementary: Edge examples

[Figure 5 panels: Positive Edge, Negative Edge. Views: Rendering, Slice.]
Figure 5: Complex 3d segment shapes and spatial relationships make agglomeration more challenging. The interdigitation of the segments in the positive example creates a complex interface. The negative edge demonstrates that very thin structures can abut large segments, creating a large interface relative to the smaller object. Furthermore, since much of the larger segment lies outside of the decision window, global properties (e.g., orientation) cannot be computed.

Appendix B Supplementary: Error Analysis

In this section, we explore potential causes of classifier errors and make suggestions for future improvements of the pipeline. Specifically, we examine a set of edges, half of which were false positives and half of which were false negatives, for which both the MLP classifier and the human expert were confident. We manually examined the segments, image, and affinity graph for these edges in an attempt to glean potential patterns that might help drive future improvements.

(a) False positive (b) False negative
Figure 6: Examples of false positive and false negative edges for experiment 25 in Table 1. The yellow line in (a) shows the true boundary between cells that have been undersegmented.

One characteristic that seems common among errors is ‘undersegmentation,’ the presence of segments that overlap more than one true object. Undersegmentation appears in of the error cases; of those examples are false positives. It is possible that these errors are due to segments that erroneously grow across cell boundaries. This can cause segments to become adjacent when the true objects are not, thereby confusing the classifier. An example of this phenomenon is shown in Figure 6(a).

Another property of some errors seems to be that they occur near boundaries of internal cell structures, such as mitochondria. Figure 6(b) shows an example of such a false negative edge. Notice that in this example, the red segment lies inside a mitochondrion, the blue segment consists of part of the cell outside the mitochondrion, and the two segments share a mutual boundary.

The patterns of error we observed above suggest some improvements for future work. First, the undersegmentations could be ameliorated by refinements either to the boundary prediction or to the procedure that generates segments from the boundaries. Segments that are too large could cause some of the false positives we have observed, which suggests that a more conservative oversegmentation scheme that yields smaller objects might be preferable. Of course, whether this approach would introduce additional false negatives would need exploration.

Secondly, the errors occurring near mitochondrial and other intra-cellular boundaries suggest that our methodology might benefit from a framework that explicitly identifies the locations of these problem areas. This new information could improve agglomeration, boundary prediction, or both.

Appendix C Supplementary: Precision-Recall Plots

Figure 7(a) shows the precision-recall curves in the low recall region. Experiments with very high training-set accuracy do not achieve high precision on the testing set due to overfitting. Figure 7(b) shows the precision-recall curves for end-to-end feature learning. Including training set augmentation improves performance much more than a multi-scale approach. Combining end-to-end features with hand-designed features improved the end-to-end features significantly.

(a) Low-recall region (b) End-to-end
Figure 7: Precision-recall curves in (a) low-recall regions and (b) for the end-to-end learned feature scheme.
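For readers who want to reproduce curves of this kind, the following sketch computes precision and recall at every score threshold from binary edge labels and classifier scores. It is a generic NumPy implementation written for illustration, not the evaluation code used for Figure 7; the function name and toy inputs are hypothetical, and ties between scores are not treated specially.

```python
import numpy as np


def precision_recall(labels, scores):
    """Return precision and recall after each example, sorted by score.

    labels: binary array (1 = positive edge), scores: classifier outputs.
    Assumes at least one positive label.
    """
    labels = np.asarray(labels)
    # Sort examples from highest to lowest score.
    order = np.argsort(-np.asarray(scores), kind="stable")
    sorted_labels = labels[order]
    true_pos = np.cumsum(sorted_labels)        # positives accepted so far
    false_pos = np.cumsum(1 - sorted_labels)   # negatives accepted so far
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / sorted_labels.sum()
    return precision, recall


# Toy example with four edges.
precision, recall = precision_recall([1, 0, 1, 1], [0.9, 0.8, 0.7, 0.1])
```

The low-recall region discussed above corresponds to the first few entries of these arrays, where only the highest-scoring edges are accepted.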

Appendix D Supplementary: Full Augmentation Results

| Aug. | Exp. | Feature Set Description | Training ACC(%) | Training AUC(%) | Testing ACC(%) | Testing AUC(%) | Dim. | Training Ex. |
|---|---|---|---|---|---|---|---|---|
| None | 1 | All Hand-sel. | 99.98 | 99.98 | 91.04 | 96.33 | 363 | 14,552 |
| None | 2 | End-end | 100.0 | 99.98 | 84.34 | 93.53 | 5324 | 14,552 |
| None | 3 | Unsup. Learned | 100.0 | 99.98 | 91.38 | 97.14 | 16,000 | 14,552 |
| Swap | 4 | All Hand-sel. | 100.0 | 99.98 | 90.91 | 96.39 | 363 | 29,104 |
| Swap | 5 | End-end | 100.0 | 99.98 | 84.93 | 93.07 | 5324 | 29,104 |
| Swap | 6 | Unsup. Learned | invariant to segment order | | | | | |
| Isometry | 7 | All Hand-sel. | 99.98 | 99.98 | 91.13 | 96.49 | 363 | 232,832 |
| Isometry | 8 | End-end | 97.91 | 99.77 | 84.85 | 93.30 | 5324 | 232,832 |
| Isometry | 9 | Unsup. Learned | 99.84 | 99.98 | 91.36 | 97.27 | 16,000 | 232,832 |
| Jitter | 10 | All Hand-sel. | 100.0 | 99.98 | 90.06 | 95.18 | 363 | 392,904 |
| Jitter | 11 | End-end | 99.96 | 99.98 | 85.37 | 93.51 | 5324 | 392,904 |
| Jitter | 12 | Unsup. Learned | 100.0 | 99.98 | 91.35 | 97.18 | 16,000 | 392,904 |
| All | 13 | End-end | 89.48 | 96.32 | 86.38 | 94.55 | 5324 | 12,572,928 |
| All | 14 | Unsup. Learned | 100.0 | 99.98 | 91.57 | 97.41 | 16,000 | 1,571,616 |
| None | 15 | All Hand-sel. (MLP2) | 100.0 | 99.98 | 91.07 | 96.01 | 363 | 14,552 |
| None | 16 | End-end (MLP2) | 100.0 | 99.98 | 84.48 | 92.67 | 5324 | 14,552 |
| None | 17 | Unsup. Learned (MLP2) | 100.0 | 99.98 | 91.77 | 97.13 | 16,000 | 14,552 |
| All | 18 | End-end (MLP2) | 91.99 | 97.54 | 88.00 | 95.34 | 5324 | 12,572,928 |
| All | 19 | Unsup. Learned (MLP2) | 99.98 | 99.98 | 91.50 | 97.24 | 16,000 | 1,571,616 |

Table 4: Classification experiments using augmented training data and MLPs with two hidden layers. The hand-designed feature set is comparable to experiment in Table 1, end-to-end features are comparable to experiment in Table 2, and unsupervised features are comparable to experiment in Table 2. For computational reasons, we omit the ‘all augmentations’ experiment using all hand-designed features.