Appearance-based Medial Axis Transform for natural images
We introduce Appearance-MAT (AMAT), a generalization of the medial axis transform for natural images, that is framed as a weighted geometric set cover problem. We make the following contributions: i) we extend previous medial point detection methods for color images, by associating each medial point with a local scale; ii) inspired by the invertibility property of the binary MAT, we also associate each medial point with a local encoding that allows us to invert the AMAT, reconstructing the input image; iii) we describe a clustering scheme that takes advantage of the additional scale and appearance information to group individual points into medial branches, providing a shape decomposition of the underlying image regions. In our experiments, we show state-of-the-art performance in medial point detection on Berkeley Medial AXes (BMAX500), a new dataset of medial axes based on the BSDS500 database, and good generalization on the SK506 and WH-SYMMAX datasets. We also measure the quality of reconstructed images from BMAX500, obtained by inverting their computed AMAT. Our approach delivers significantly better reconstruction quality with respect to three baselines, using just 10 annotations are available at https://github.com/tsogkas/amat .READ FULL TEXT VIEW PDF
Appearance-based Medial Axis Transform for natural images
Symmetry is a ubiquitous property in the natural world, with a well-established role in human vision. Humans instinctively recognize and use symmetry to analyze complex scenes, as it facilitates the encoding of shapes and their discrimination and recall from memory [7, 34, 52]
. In the context of computer vision,local symmetry is of particular interest, because of its robustness to viewpoint changes and its connection to salient structures, such as object parts. This intuition is fundamental to many milestones in object representation theory, including generalized cylinders , superquadrics , geons , and shock graphs .
Fundamental notions of local symmetry were introduced decades ago by Blum in the context of binary shapes with the medial axis transform (MAT) [11, 12]. The MAT is a powerful shape abstraction, and provides a compact representation that preserves topological properties of the input shape. These properties are invariant to translation, rotation, scaling, articulation, and their locality offers robustness to occlusion. The MAT has been very effective in reducing the computational complexity of algorithms for various tasks, including shape matching  and recognition , mesh editing [26, 55], and shape manipulation . For these reasons many researchers have tried to achieve a good balance between MAT sparsity and reconstruction quality [46, 25].
Extending the notion of the MAT to natural images can correspondingly benefit applications that rely on a sparse set of highly informative keypoints/landmarks, such as registration , retrieval [44, 5]
, pose estimation and body tracking, and structure from motion . It could also assist segmentation by enforcing region-based constraints through their medial point representatives , and by providing a practical alternative to manual scribbles/seeds for interactive segmentation [13, 32, 20, 27]. Another interesting application is artistic rendering of images:  use approximate medial axes to simulate brush strokes and generate a painting-like version of the input photograph.
Unfortunately, the MAT has not found widespread use in tasks involving natural images, due to the lack of a generalization that accommodates color and texture. Previous works have mostly attacked medial point detection [49, 38], which amounts to determining the locations of points lying on medial axes but not the scale of the respective medial disks. The type of axes considered is also typically constrained to make the problem more concrete:  only considers elongated structures, on either foreground objects or background;  focuses on object skeletons, ignoring background structures. These methods lack another key characteristic of the MAT: medial point locations alone do not provide sufficient information to reconstruct the input.
In this paper we introduce the first “complete” MAT for natural images, dubbed Appearance-MAT (AMAT). First, we provide a new definition in the context of natural images by framing MAT as a weighted geometric set cover (WGSC) problem. Our definition is centered around the MAT invertibility property and elicits a straightforward criterion for quality assessment, in terms of the reconstruction of the input image. Second, our algorithm associates each medial point with scale as well as local appearance information that can be used to reconstruct the input. Thus, the AMAT encompasses all the fundamental features of its binary counterpart. Third, we describe a simple bottom-up grouping scheme that exploits the additional scale and appearance information to connect points into medial branches. These branches correspond to meaningful image regions, and extracting them can support image segmentation and object proposal generation, while offering a shape decomposition of the underlying structure as well.
Being bottom-up in nature, our method does not assume object-level knowledge. It computes medial axes of both foreground and background structures, yielding a compact representation that only uses of the image pixels. Yet, this sparse set of points carries most of image signal, differing from other sparse image descriptions, edge maps, which strip the input of all appearance information.
We perform experiments in medial point detection on a new dataset of medial axes, the Berkeley-Medial AXes (BMAX500), which is built on the popular BSDS500 dataset, showing state-of-the-art performance. We also measure the quality of reconstructions obtained by inverting the AMAT of images from the same dataset, using a variety of standard image quality metrics. We compare with three reconstruction baselines: one built on the medial point detection algorithm from  and two built from the ground-truth segmentations in BSDS500. Our method significantly outperforms the baselines in terms of reconstruction quality, while attaining a compression ratio.
The outline of the paper is as follows: we start by reviewing related work on medial axis extraction for binary shapes and natural images in Section 2. In Section 3 we describe our approach. Section 4 includes implementation details and in Section 5 we present our results. Finally, in Section 6 we conclude and discuss ideas for future directions.
Blum introduced the medial axis transform, or skeleton, of 2D shapes in his seminal works [11, 12]. Since then, researchers have developed algorithms for reliable and efficient medial axis extraction, its extension to 3D shapes, and its application to computer vision tasks.
Siddiqi define shocks as the singularities of a curve evolution process acting on the boundaries of a shape, and they organize them into a directed, acyclic shock graph . Shock graphs were successfully used in shape matching , recognition , and database indexing . Bone graphs  offer improved stability and a more intuitive representation of an object’s parts, by identifying and analyzing ligature structures. Visual part correspondences are also established and used to measure part and aggregated shape similarity in . The correspondence of skeleton branches to object parts is further explored in [28, 6]. More recently, Stolpner deal with the problem of approximating a 3D solid via a union of overlapping spheres .
The value of the MAT has been equally appreciated by the graphics community, where object shapes are routinely represented as point clouds or triangular meshes. Giesen  introduced the scale axis transform, a skeletal shape representation that yields a hierarchy of successively simplified skeletons, which are obtained by multiplicative scaling of the MAT’s radii. Li  use quadratic error minimization to compute an accurate linear approximation of the MAT, called Q-MAT. They show experiments on medial axis simplification where they reduce the number of nodes of an initial medial mesh by three orders of magnitude, while preserving good surface reconstruction. A comprehensive compilation of medial methods and their applications in the binary setting can be found in .
Compared to the binary setting, the number of works on medial axis detection for natural images is rather limited. Levinstein  detect symmetric parts of objects by learning to merge adjacent deformable, maximally inscribed disks, modeled as superpixels. Learned attachment relations are then used to combine detected parts into coarse skeletal representations. Lee extend that work by introducing a deformable disk model that can capture curved and tapered parts, and also add continuity constraints to the medial point grouping process . In other works medial point detection is posed as a classification problem where pixels are labeled as “medial” or “not-medial”, inspired by similar methods for boundary detection . Tsogkas and Kokkinos use multiple instance learning (MIL) to deal with the unknown scale and orientation during training , while Shen adapt a CNN with side outputs  for object skeleton extraction 
. All these approaches exploit appearance information by incorporating a machine learning algorithm.
Our work can be regarded as lying at the intersection of previous work on binary and natural images. From a technical standpoint, it shares more similarities with binary methods, for instance , which solves the set cover problem for volumes in the 3D space. At the same time, it can be applied to real images, without assuming a figure-ground segmentation, but it also demonstrates unique characteristics. Our method does not involve learning, and is not constrained in detecting a particular subset of medial axes as [49, 38]. It also complements existing methods by augmenting point locations with scale and appearance descriptions, which are necessary for reconstructing the input.
Consider a 2D binary shape, , like the one in Figure 2, and its boundary . The medial axis of is the set of points that are centers of the maximally inscribed (medial) disks, bitangent to in the interior of the shape. The medial (disk) radius is the distance between and the points where the disk touches . The process of mapping to the set of pairs is called the medial axis transform (MAT). Given these pairs, we can reconstruct as a union of overlapping disks that sweep-out its interior by “expanding” a value of one (1) inside the area covered by each medial disk.
We argue that a MAT for real images should satisfy a similar principle: given the MAT of an image, we should be able to “invert” it, reconstructing the image itself. There are several reasons why extending this idea to real images is a challenging task: natural images depict complex scenes, cluttered with numerous objects, instead of just a single foreground shape. Moreover, unlike binary images, real images exhibit complicated color and texture distributions. Nevertheless, we can exploit image redundancies and assume that an image is composed of many small regions of relatively uniform appearance. This is the same assumption that underlies most superpixel algorithms which break up an image into non-overlapping patches, while respecting perceptually meaningful region boundaries [39, 24, 1].
In the rest of the paper we denote a disk of radius , centered at point , as . For brevity, we often refer to such a disk as a -disk or -disk. is a collection of such disks of varying centers and radii, . The intersection of a -disk with an image is a disk-shaped region of the image, and is denoted by . Finally, we use to denote function composition, and for an appropriate error metric (, the norm).
Consider an RGB image , and a disk-shaped region . Let be a function that maps
to a vector; we call the encoding of . Now let be a function that maps back to a disk patch . We call the decoding function. In the general case, and will be lossy mappings, which means that the reconstruction error . Using the above, we define the AMAT as the set of tuples , such that:
In Section 3.1 we discuss constraining .
Our framework allows to take any form; for example, could be a histogram representation of color in and could return the mode of the distribution. In this paper we opt for simplicity: computes the mean of each color channel “summarizing” , in a vector . Conversely, constructs an approximation by replicating in the respective disk-shaped area. When the -disk is fully enclosed in a uniform region the reconstruction error is low, whereas when the disk crosses a strong image boundary, the encoding cannot accurately represent the underlying image region, resulting in a higher error.
Note that the definition in Equation (1) suggests conceptual similarities with superpixel representations. Selecting the points is equivalent to covering the input image with disk-shaped superpixels. Minimizing the total reconstruction error implies that these “superdisks” do not cross region boundaries, as this would incur a high reconstruction error, as shown in Figure 2. However, there are two important differences: First, in our case a canonical shape (disk) is used, whereas superpixels can have any form. Second, our disks are overlapping, in contrast to standard, non-overlapping superpixels.
Using canonical shapes helps achieve sparsity of the final MAT. Disks are optimal in that sense, as they are rotationally invariant and are fully defined using only their center and radius. By contrast, a free-form element requires storing coordinates of all its boundary points. On the other hand, using one shape and no overlap would not reduce reconstruction quality, but it would result in disjointed medial points instead of smooth, connected medial axes.
The geometric set cover is the extension of the well studied set cover problem, in a geometric space. Here we only consider the case of a two-dimensional space and we particularly focus on the weighted version of the problem, which is defined as follows: Consider a universe of points and subsets , called ranges. A common choice for is intersections of with simple shape primitives, such as disks or rectangles.
Now assume that each element in is associated with a non-negative weight or cost . Solving the WGSC problem amounts to finding a sub-collection that covers the entire (all elements of are contained in at least one set in ), while having the minimum total cost ; the total cost is simply the sum of costs of individual elements in . WGSC is a strongly NP-hard problem for which polynomial-time approximate solutions (PTAS) exist. The interested reader can find more details on WGSC and related algorithms in [31, 50, 19, 14].
The AMAT formulation lends itself naturally to a WGSC interpretation. The spatial support of an input image , is the universe of points. As we consider the set of -disks with chosen from a finite set . The -disks can be placed at any position such that is fully contained in . We also assign a cost to each -disk, . Note that for brevity, we drop the subscripts and simply use . We provide more details regarding computation of in Section 4.
As Equation (1) suggests, the goal is to find a subset of disks that cover the entire image, while maintaining a low total reconstruction cost. A trivial solution would be to select each pixel as a disk of radius , in which case , and ; each pixel can be perfectly represented by its mean value. Such a solution is of no practical use. Staying true to the spirit of the MAT, we seek a solution that is sparse (low number of medial points ), while being able to adequately reconstruct the input image. One possible way to do this would be to agree on a fixed “budget” of points, and look for the optimal solution, given . However, choosing an acceptable can be a nuisance, as its value can vary significantly from image to image.
In the original MAT, sparsity is implicitly induced through the use of maximal disks, touching the shape boundary at two or more points. Extending the maximality principle to real images is not straightforward because color and texture boundaries are not robustly defined. Relying on the output of an edge extraction algorithm is not a viable option either, as it would make our method sensitive to errors from which it would be impossible to recover.
Instead, we choose to regularize the minimization criterion in Equation (1) by adding a scale-dependent term to the costs . This way we favor the selection of larger disks at each point, as long as is not “too” large with respect to the error incurred by picking instead of . Selecting a high value for leads to a sparser solution with higher total reconstruction error, whereas a low value for aims for a better reconstruction, by utilizing more, smaller disks to cover . Figure 2 (right) shows a toy example of these two cases and Figure 3 shows how varying progressively removes details in a real image, keeping only the coarser structures.
There are many polynomial-time-approximate-solution (PTAS) algorithms for the vanilla set cover problem and its geometric variants. Here we use the simple, greedy algorithm described in , adapted for the weighted case. The steps of our method are described in Algorithm 1.
We start by computing the costs for all possible disks . We define the effective cost of as , where is the number of new pixels covered by (pixels that have not been covered by a previously selected disk). Starting from an empty set , we pick the disk with the lowest and add it to the solution, removing the area from the remaining pixels to be covered. We also adjust the cost of all disks that intersect with , because each disk should be penalized only for the new pixels it is covering. This process is repeated until all image pixels have been covered by at least one disk.
The scale and appearance associated with each medial point provide a rich description that can be used to group points belonging to the same region into medial branches. The beneficial effects of grouping in low-level vision tasks have been observed in previous works [16, 57, 21, 33]. In our case, grouping pixels into branches can help us refine the final medial axis, by aggregating consensus from neighboring points, and break the image into meaningful regions.
We group detected medial points using an agglomerative scheme that starts at fine scales and progressively merges together nearby points at coarser scales. Our grouping criterion relies on proximity in scale-space and appearance
. Intuitively, points that lie close have higher probability of belonging to the same branch. We also expect that the scale of points will changegradually along a branch, so points that lie close to each other but have very different radii should probably not be grouped together. Finally, two points should not be grouped if their encodings are very dissimilar, regardless of their proximity in scale-space.
We initialize branches as the connected components of the AMAT output. Starting at a scale , we consider one branch at a time, and examine all other branches within a neighborhood of size and a scale neighborhood . If two branches coexist in this scale-space neighborhood and their average encodings (summed along the branch curve) are similar, they are merged. The grouping algorithm terminates when all scales have been considered.
The output of our algorithm captures mostly region centerlines but there are still imperfections in the form of noisy, disconnected medial point responses or “lumps”, instead of thin contours. Such imperfections are expected because of the approximate solution to the minimization problem of Equation (1) and the use of a discrete grid.
Grouping MAT points into branches makes it possible to process each branch individually, enabling the correction of these errors post hoc. We perform simple morphological operations (dilation and thinning) on the points of each branch to merge neighboring and isolated pixels together, while removing redundant responses. We also adjust the scales of the medial points, to ensure that the medial disks corresponding to the simplified structure span the same image area. Because grouped branches correspond to relatively homogeneous regions, reconstruction results after simplification are practically identical. Examples of simplified medial axes are illustrated in Figure 4.
Using a simple error metric such as MSE to compute
is not effective since disks with low MSE scores do not necessarily respect image boundaries. We propose the following alternative heuristic: First, we convert the RGB image to the CIELAB color space which is more suitable for measuring perceptual distances. Then, we define the cost ofas
Intuitively, a low cost implies that the encoding is representative of all disks that are fully contained in , hence is not crossing any region boundaries.
The main motivation behind the choice of simple functions , was simplicity and computational efficiency. Such functions also allow us to inject certain desired characteristics in the AMAT solution, such as appearance uniformity and alignment with boundaries.
However, natural images often contain high-frequency textures or noise, which can lead to the accumulation of large errors in Equation (2), and promote the selection of disks that do not correspond to perceptually coherent regions. Simple processing techniques (, Gaussian filtering) can reduce noise but they also degrade image boundaries and blend together neighboring regions.
To alleviate this problem, we “simplify” the input image before extracting the AMAT, using a method that smooths high frequency regions, while preserving important edges . In practice, this preprocessing produces an image that is perceptually very similar to the original, but without high-frequency textures that can cause the greedy algorithm to fail by placing disks at undesired locations.
Generating the reconstruction of a single disk-shaped region, , is trivially achieved by replicating . However, since medial disks overlap, most pixels in the image domain will be covered by multiple disks with different encodings. We resolve this ambiguity in a simple way: while computing the AMAT, we keep track of the number of disks each pixel is covered by; this quantity is called depth in the context of the set cover problem. We then use the average of all disks covering a point with depth as its reconstructed value:
For the smoothing algorithm we use the default values and that the authors suggest for natural images . Regarding the scale cost term described in Section 3.1, we found that is a value that strikes a good balance between reconstruction quality and sparsity of the generated medial axis. The maximum radius must be finite to keep complexity manageable, but large enough to capture large uniform structures in the image. Based on the size of images used in our experiments we used scales, excluding to force disks to be larger than single pixels; thus .
Computing requires computing differences for all disks in . If is large, this number can grow quickly, yielding complexity. However, the most time-consuming step is the greedy approximation algorithm: At each iteration we cover at most pixels, but we also have to update the costs of all overlapping disks. This has complexity. One could parallelize the procedure by partitioning an image, simultaneously processing individual parts, and combining the results. Our single-thread MATLAB implementation takes for the AMAT, grouping, and simplification steps, on a image.
We evaluate the performance of our method on two tasks: i) localization of medial points in an image; and ii) generating accurate reconstructions of images, given their AMAT.
We want to emphasize the difference between the problem we are addressing and the objectives pursued in other works. In  the authors focus on detecting local reflective symmetries of elongated structures, and they build a dataset with annotations of segments in the BSDS300 that fit this description. As a result, a large portion of the segments in BSDS is not used in performance evaluation. In  the authors are explicitly interested in extracting object skeletons, completely ignoring background structures. Although extracting object skeletons may be convenient for some tasks, it does not constitute a generalized notion of MAT.
In our work we do not make such distinctions. The central idea behind the AMAT is to be able to reproduce the full input image, so we view all parts of the image as equally important. This is also the reason we choose BSDS500 as a basis for constructing medial axes annotations. BSDS500 contains multiple segmentations for each image, offering higher probability of capturing segments at varying scales, making it more relevant to the problem we are trying to solve than datasets with object-level annotations.
Following , we individually apply a skeletonization algorithm  to binary masks of all segments in a given segmentation, extracting segment skeletons. The medial axis ground-truth for the image is formed by taking the union of all the segment skeletons, and this process is repeated for all available annotations (usually 5-7 per image). To conduct a fair comparison, we retrain the CG+BG+TG variant (MIL-color) from  on BMAX500. We also tried to retrain the CNN used in , but the outputs we obtained were too noisy, and of no practical use. We hypothesize that this is because of the lack of consensus among the multiple ground-truth maps available for each image, which leads to convergence problems for the network; this has been previously reported in . We evaluate performance using the standard precision, recall and F-measure metrics, and show the superior results of our method in Table 1. Note that our algorithm outputs binary skeletons, so plotting a PR-curve by varying a score threshold is not applicable in our case. “Human” performance is defined in the same manner as in [30, 49]. For all methods, detections within a distance of of the image diagonal from a ground-truth positive are considered as true positives. We show qualitative results of the medial axes and the grouped branches in Figure 4.
As an additional baseline we compute skeletons after running Arbelaez’s segmentation algorithm [3, 4] at scales 0.2 (F=0.61), 0.3 (F=0.58), 0.4 (F=0.54), 0.5 (F=0.5). We point out that the performance of UCM + skeletonization depends critically on the threshold selection. The optimal threshold is not known a-priori and, given a desired level of skeleton detail, the appropriate value varies from image to image. By contrast, AMAT’s scale parameter is more intuitive to select and provides image-independent control of skeleton detail.
We also evaluate the performance of the AMAT on two additional datasets: WH-SYMMAX  (F=0.44) and SK506  (F=0.33). We compare with the pretrained FSDS  evaluating only on foreground skeletons, since our approach does not distinguish foreground from background. FSDS performs better than AMAT (F=0.67 and F=0.45 respectively). This is unsurprising, given that FSDS is a supervised method trained on these datasets in a way that allows it to take advantage of rich, object-specific information. However, this specialization comes at a cost: FSDS cannot generalize well to structures it has not seen before, which is evident when running it on BMAX500 (F=0.34 F=0.56 for AMAT).
We now assess the quality of reconstructions we obtain by inverting the computed AMAT of images from the BSDS500 dataset. We compare with a baseline reconstruction algorithm based on the MIL approach of 
(after retraining MIL-color on BMAX500). Their method uses features extracted in rectangular areas to produce a map of medial point strength at 13 scales and 8 orientations, for each pixel. A single confidence value for each point is derived through a noisy-or operation, which does away with scale and orientation information. As a surrogate, in our experiments we associate each point with the scale/orientation combination that has the highest score.
The scheme we use to create a crude reconstruction with their approach is the following: We start by sorting medial point scores in decreasing order and we pick the highest-scoring point. The rectangular region at the respective scale and orientation is then marked as covered, and the process is repeated until the whole image has been reconstructed. Similarly to our own method, point encodings are the mean RGB values within the rectangle, and local reconstructions are computed by averaging overlapping encodings. We also compare with two more baselines: one obtained by considering ground-truth (GT) segments in BSDS500 and representing them by their mean RGB values (GT-seg); and a second, obtained through the GT skeletons and radii in BMAX500 (GT-skel). For the latter, we use the reconstruction process described in Section 4.
We have defined the first complete medial axis transform for natural images. Our approach bridges the gap between MAT methods for binary shapes and medial axis/local symmetry detection methods for real images. We have demonstrated state-of-the-art performance in medial point detection and shown that we can produce a high-quality rendering of the input image using as few as of its pixels.
That said, it is important to note that AMAT is not designed to be optimal for either of these tasks. Instead, it is designed to strike a balance between two conflicting objectives: i) capturing an image’s salient structures (in the form of medial axes and their respective scale/appearance information); ii) providing an accurate reconstruction of the original image from this abstracted representation. Therefore, performance should be assessed on both objectives jointly.
We also want to emphasize that AMAT is a purely bottom-up algorithm, completely unsupervised and train-free. We consider this an important advantage of our approach, as it means that it can generalize well and in a predictable way to new datasets, without the need for additional tuning. Despite the lack of training, we have shown that it performs surprisingly well, and can even be competitive with supervised methods fine-tuned to specific datasets.
In future work, our goal is to parameterize our method to accommodate the relative roles of shape and appearance, and allow for flexible hierarchical grouping of medial branches to support segmentations of varying granularities. Furthermore, although our current choice of favors simplicity and compactness at the cost of texture, our framework can accommodate any encoding/decoding functions. Designing alternatives to better capture and reconstruct texture, or for specific discriminative tasks, is another exciting future direction.
This work was funded by NSERC. We thank Ioannis Gkioulekas for his valuable suggestions and feedback.