The diffuse interface model by Bertozzi and Flenner combines methods for diffusion on graphs with efficient partial differential equation techniques to solve binary segmentation problems. As with other methods inspired by physical phenomena[Bertozzi et al., 2007, Jung et al., 2007, Li and Kim, 2011], it requires the minimization of an energy expression, specifically the Ginzburg-Landau (GL) energy functional. The formulation generalizes the GL functional to the case of functions defined on graphs, and its minimization is related to the minimization of weighted graph cuts [Bertozzi and Flenner, 2012]
. In this sense, it parallels other techniques based on inference on graphs via diffusion operators or function estimation[Coifman et al., 2005, Chung, 1997, Zhou and Schölkopf, 2004, Szlam et al., 2008, Wang et al., 2008, Bühler and Hein, 2009, Szlam and Bresson, 2010, Hein and Setzer, 2011].
Multiclass segmentation methods that cast the problem as a series of binary classification problems use a number of different strategies: (i) deal directly with some binary coding or indicator for the labels [Dietterich and Bakiri, 1995, Wang et al., 2008]
, (ii) build a hierarchy or combination of classifiers based on the one-vs-all approach or on class rankings[Hastie and Tibshirani, 1998, Har-Peled et al., 2003] or (iii) apply a recursive partitioning scheme consisting of successively subdividing clusters, until the desired number of classes is reached [Szlam and Bresson, 2010, Hein and Setzer, 2011]. While there are advantages to these approaches, such as possible robustness to mislabeled data, there can be a considerable number of classifiers to compute, and performance is affected by the number of classes to partition.
In contrast, we propose an extension of the diffuse interface model that obtains a simultaneous segmentation into multiple classes. The multiclass extension is built by modifying the GL energy functional to remove the prejudicial effect that the order of the labelings, given by integer values, has in the smoothing term of the original binary diffuse interface model. A new term that promotes homogenization in a multiclass setup is introduced. The expression penalizes data points that are located close in the graph but are not assigned to the same class. This penalty is applied independently
of how different the integer values are, representing the class labels. In this way, the characteristics of the multiclass classification task are incorporated directly into the energy functional, with a measure of smoothness independent of label order, allowing us to obtain high-quality results. Alternative multiclass methods minimize a Kullback-Leibler divergence function[Subramanya and Bilmes, 2011] or expressions involving the discrete Laplace operator on graphs [Zhou et al., 2004, Wang et al., 2008].
This paper is organized as follows. Section 2 reviews the diffuse interface model for binary classification, and describes its application to semi-supervised learning. Section 3 discusses our proposed multiclass extension and the corresponding computational algorithm. Section 4 presents results obtained with our method. Finally, section 5 draws conclusions and delineates future work.
2 Data Segmentation With the Ginzburg-Landau Model
The diffuse interface model [Bertozzi and Flenner, 2012] is based on a continuous approach, using the Ginzburg-Landau (GL) energy functional to measure the quality of data segmentation. A good segmentation is characterized by a state with small energy. Let be a scalar field defined over a space of arbitrary dimensionality, and representing the state of the system. The GL energy is written as the functional
with denoting the spatial gradient operator, a real constant value, and a double well potential with minima at :
Segmentation requires minimizing the GL functional. The norm of the gradient is a smoothing term that penalizes variations in the field . The potential term, on the other hand, compels to adopt the discrete labels of or , clustering the state of the system around two classes. Jointly minimizing these two terms pushes the system domain towards homogeneous regions with values close to the minima of the double well potential, making the model appropriate for binary segmentation.
The smoothing term and potential term are in conflict at the interface between the two regions, with the first term favoring a gradual transition, and the second term penalizing deviations from the discrete labels. A compromise between these conflicting goals is established via the constant . A small value of denotes a small length transition and a sharper interface, while a large weights the gradient norm more, leading to a slower transition. The result is a diffuse interface between regions, with sharpness regulated by .
It can be shown that in the limit this function approximates the total variation (TV) formulation in the sense of functional () convergence [Kohn and Sternberg, 1989], producing piecewise constant solutions but with greater computational efficiency than conventional TV minimization methods. Thus, the diffuse interface model provides a framework to compute piecewise constant functions with diffuse transitions, approaching the ideal of the TV formulation, but with the advantage that the smooth energy functional is more tractable numerically and can be minimized by simple numerical methods such as gradient descent.
The GL energy has been used to approximate the TV norm for image segmentation [Bertozzi and Flenner, 2012]
and image inpainting[Bertozzi et al., 2007, Dobrosotskaya and Bertozzi, 2008]. Furthermore, a calculus on graphs equivalent to TV has been introduced in [Gilboa and Osher, 2008, Szlam and Bresson, 2010].
Application of Diffuse Interface Models to Graphs
An undirected, weighted neighborhood graph is used to represent the local relationships in the data set. This is a common technique to segment classes that are not linearly separable. In the -neighborhood graph model, each vertex
of the graph corresponds to a data point with feature vector, while the weight is a measure of similarity between and . Moreover, it satisfies the symmetry property . The neighborhood is defined as the set of closest points in the feature space. Accordingly, edges exist between each vertex and the vertices of its -nearest neighbors. Following the approach of [Bertozzi and Flenner, 2012], we calculate weights using the local scaling of Zelnik-Manor and Perona [Zelnik-Manor and Perona, 2005],
Here, defines a local value for each , where is the position of the th closest data point to , and is a global parameter.
It is convenient to express calculations on graphs via the graph Laplacian matrix, denoted by . The procedure we use to build the graph Laplacian is as follows.
Compute the similarity matrix with components defined in (3). As the neighborhood relationship is not symmetric, the resulting matrix is also not symmetric. Make it a symmetric matrix by connecting vertices and if is among the -nearest neighbors of or if is among the -nearest neighbors of [von Luxburg, 2006].
Define as a diagonal matrix whose th diagonal element represents the degree of the vertex , evaluated as
Calculate the graph Laplacian: .
Generally, the graph Laplacian is normalized to guarantee spectral convergence in the limit of large sample size [von Luxburg, 2006]. The symmetric normalized graph Laplacian is defined as
Data segmentation can now be carried out through a graph-based formulation of the GL energy. To implement this task, a fidelity term is added to the functional as initially suggested in [Dobrosotskaya and Bertozzi, 2010]. This enables the specification of a priori information in the system, for example the known labels of certain points in the data set. This kind of setup is called semi-supervised learning (SSL). The discrete GL energy for SSL on graphs can be written as [Bertozzi and Flenner, 2012]:
In the discrete formulation, is a vector whose component represents the state of the vertex , is a real constant characterizing the smoothness of the transition between classes, and is a fidelity weight taking value if the label (i.e. class) of the data point associated with vertex is known beforehand, or if it is not known (semi-supervised).
Equation (6) may be understood as an example of the more general form of an energy functional for data classification,
where the norm is a regularization term and is a fidelity term. The choice of the regularization norm has non-trivial consequences in the final classification accuracy. Attractive qualities of the norm include allowing classes to be close in a metric space, and obtain segmentations for nonlinearly separable data. Both of these goals are addressed using the GL energy functional for SSL.
Minimizing the functional simulates a diffusion process on the graph. The information of the few labels known is propagated through the discrete structure by means of the smoothing term, while the potential term clusters the vertices around the states and the fidelity term enforces the known labels. The energy minimization process itself attempts to reduce the interface regions. Note that in the absence of the fidelity term, the process could lead to a trivial steady-state solution of the diffusion equation, with all data points assigned the same label.
The final state of each vertex is obtained by thresholding, and the resulting homogeneous regions with labels of and constitute the two-class data segmentation.
3 Multiclass Extension
The double-well potential in the diffuse interface model for SSL flows the state of the system towards two definite labels. Multiple-class segmentation requires a more general potential function that allows clusters around more than two labels. For this purpose, we use the periodic-well potential suggested by Li and Kim [Li and Kim, 2011],
where denotes the fractional part of ,
and is the largest integer not greater than .
This periodic potential well promotes a multiclass solution, but the graph Laplacian term in Equation (6) also requires modification for effective calculations due to the fixed ordering of class labels in the multiple class setting. The graph Laplacian term penalizes large changes in the spatial distribution of the system state more than smaller gradual changes. In a multiclass framework, this implies that the penalty for two spatially contiguous classes with different labels may vary according to the (arbitrary) ordering of the labels.
This phenomenon is shown in Figure 1. Suppose that the goal is to segment the image into three classes: class 0 composed by the black region, class 1 composed by the gray region and class 2 composed by the white region. It is clear that the horizontal interfaces comprise a jump of size 1 (analogous to a two class segmentation) while the vertical interface implies a jump of size 2. Accordingly, the smoothing term will assign a higher cost to the vertical interface, even though from the point of view of the classification, there is no specific reason for this. In this example, the problem cannot be solved with a different label assignment. There will always be an interface with higher costs than others independent of the integer values used.
Thus, the multiclass approach breaks the symmetry among classes, influencing the diffuse interface evolution in an undesirable manner. Eliminating this inconvenience requires restoring the symmetry, so that the difference between two classes is always the same, regardless of their labels. This objective is achieved by introducing a new class difference measure.
3.1 Generalized Difference Function
The final class labels are determined by thresholding each vertex , with the label set to the nearest integer:
The boundaries between classes then occur at half-integer values corresponding to the unstable equilibrium states of the potential well. Define the function to represent the distance to the nearest half-integer:
A schematic of is depicted in Figure 2. The function is used to define a generalized difference function between classes that restores symmetry in the energy functional. Define the generalized difference function as:
Thus, if the vertices are in different classes, the difference between each state’s value and the nearest half-integer is added, whereas if they are in the same class, these differences are subtracted. The function corresponds to the tree distance (see Fig. 2). Strictly speaking, is not a metric since it does not satisfy . Nevertheless, the cost of interfaces between classes becomes the same regardless of class labeling when this generalized distance function is implemented.
The GL energy functional for SSL, using the new generalized difference function , is expressed as
Note that could also be used in the fidelity term, but for simplicity this modification is not included. In practice, this has little effect on the results.
3.2 Computational Algorithm
The GL energy functional given by (13) may be minimized iteratively, using gradient descent:
where is a shorthand for , represents the time step and the gradient direction is given by:
The gradient of the generalized difference function is not defined at half integer values. Hence, we modify the method using a greedy strategy: after detecting that a vertex changes class, the new class that minimizes the smoothing term is selected, and the fractional part of the state computed by the gradient descent update is preserved. Consequently, the new state of vertex is the result of gradient descent, but if this causes a change in class, then a new state is determined.
Specifically, let represent an integer in the range of the problem, i.e. , where is the number of classes in the problem. Given the fractional part resulting from the gradient descent update, define . Find the integer that minimizes , the smoothing term in the energy functional, and use as the new vertex state. A summary of the procedure is shown in Algorithm 1 with denoting the maximum number of iterations.
The performance of the multiclass diffuse interface model is evaluated using a number of data sets from the literature, with differing characteristics. Data and image segmentation problems are considered on synthetic and real data sets.
4.1 Synthetic Data
A synthetic three-class segmentation problem is constructed following an analogous procedure used in [Bühler and Hein, 2009] for “two moon” binary classification, using three half circles (“three moons”). The half circles are generated in . The two top circles have radius and are centered at and . The bottom half circle has radius and is centered at . We sample 1500 data points (500 from each of these half circles) and embed them in . The embedding is completed by adding Gaussian noise with to each of the 100 components for each data point. The dimensionality of the data set, together with the noise, make this a nontrivial problem.
Three-class segmentation. Left: spectral clustering. Right: multiclass GL (adaptive).
The difficulty of the problem is illustrated in Figure 3, where we use both spectral clustering decomposition and the multiclass GL method. The same graph structure is used for both methods. The symmetric graph Laplacian is computed based on edge weights given by (3), using nearest neighbors and local scaling based on the closest point. The spectral clustering results are obtained by applying a -means algorithm to the first eigenvectors of the symmetric graph Laplacian. The average error obtained, over 100 executions of spectral clustering, is 20% (). The figure displays the best result obtained, corresponding to an error of .
The multiclass GL method was implemented with the following parameters: interface scale , step size and number of iterations . The fidelity term is determined by labeling 25 points randomly selected from each class (5% of all points), and setting the fidelity weight to for those points. Several runs of the procedure are performed to isolate effects from the random initialization and the arbitrary selection of fidelity points. The average error obtained, over 100 runs with four different fidelity sets, is 5.2% (). In general terms, the system evolves from an initially inhomogeneous state, rapidly developing small islands around fidelity points that become seeds for homogeneous regions and progressing to a configuration of classes forming nearly uniform clusters.
The multiclass results were further improved by incrementally decreasing to allow sharper transitions between states as in [Bertozzi and Flenner, 2012]. With this approach, the average error obtained over 100 runs is reduced to 2.6% (). The best result obtained in these runs is displayed in Figure 3 and corresponds to an average error of 2.13%. In these runs, is reduced from to in decrements of 10%, with iterations performed per step. The average computing time per run in this adaptive technique is 1.53s in an Intel Quad-Core @ 2.4 GHz, without any parallel processing.
For comparison, we note the results from the literature for the simpler two moon problem (, noise). The best errors reported include: 6% for p-Laplacian [Bühler and Hein, 2009], 4.6% for ratio-minimization relaxed Cheeger cut [Szlam and Bresson, 2010], and 2.3% for binary GL [Bertozzi and Flenner, 2012]. While these are not SSL methods the last of these does involve other prior information in the form of a mass balance constraint. It can be seen that both of our procedures, fixed and adaptive , produce high-quality results even for the more complex three-class segmentation problem. Calculation times are also competitive with those reported for the binary case (0.5s - 50s).
4.2 Image Segmentation
As another test setup, we use a grayscale image of size , taken from [Jung et al., 2007, Li and Kim, 2011] and composed of 5 classes: black, dark gray, medium gray, light gray and white. This image contains structure, such as an internal hole and junctions where multiple classes meet. The image information is represented through feature vectors defined as , with and corresponding to coordinates of the pixel and equal to the intensity of the pixel. All of these are normalized so as to obtain values in the range .
The graph is constructed using nearest neighbors and local scaling based on the closest point. We use parameters , and . We then choose 1500 random points (4% of the total) for the fidelity term, with . Figure 4 displays the original image with the randomly selected fidelity points (top left), and the five-class segmentation. Each class image shows in white the pixels identified as belonging to the class, and in black the pixels of the other classes. In this case, all the classes are segmented perfectly with an average run time of 59.7s. The method of Li and Kim [Li and Kim, 2011] also segments this image perfectly, with a reported run time of 0.625s. However, their approach uses additional information, including a pre-assignment of specific grayscale levels to classes, and the overall densities of each class. Our approach does not require these.
4.3 MNIST Data
The MNIST data set available at http://yann.lecun.com/exdb/mnist/ is composed of 70,000 images of size , corresponding to a broad sample of handwritten digits 0 through 9. We use the multiclass diffuse interface model to segment the data set automatically into 10 classes, one per handwritten digit. Before constructing the graph, we preprocess the data by normalizing and projecting into 50 principal components, following the approach in [Szlam and Bresson, 2010]. No further steps, such as smoothing convolutions, are required. The graph is computed with nearest neighbors and local scaling based on the closest points.
An adaptive variant of the algorithm is implemented, with parameters , , decrement 10%, , and 40 iterations per step. For the fidelity term, 7,000 images (10% of total) are chosen, with weight . The average error obtained, over 20 runs with four different fidelity sets, is 7% (
). The confusion matrix for the best result obtained, corresponding to a 6.86% error, is given in Table1: each row represents the segmentation obtained, while the columns represent the true digit labels. For reference, the average computing time per run in this adaptive technique is 132s. Note that, in the segmentations, the largest mistakes made are in trying to distinguish digits 4 from 9 and 7 from 9.
For comparison, errors reported using unsupervised clustering algorithms in the literature are: 12.9% for p-Laplacian [Bühler and Hein, 2009], 11.8% for ratio-minimization relaxed Cheeger cut [Szlam and Bresson, 2010], and 12.36% for the multicut version of the normalized 1-cut [Hein and Setzer, 2011]
. A more sophisticated graph-based diffusion method applied in a semi-supervised setup (transductive classification), with function-adapted eigenfunctions, a graph constructed with 13 neighbors, and self-tuning with the 9th neighbor reported in[Szlam et al., 2008] obtains an error of 7.4%. Results with similar errors are reported in [Liu et al., 2010]. Thus, the performance of the multiclass GL on this data set improves upon other published results, while requiring less preprocessing and a simpler regularization of the functions on the graph.
|Obtained / True||0||1||2||3||4||5||6||7||8||9|
We have proposed a new multiclass segmentation procedure, based on the diffuse interface model. The method obtains segmentations of several classes simultaneously without using one-vs-all or alternative sequences of binary segmentations required by other multiclass methods. The local scaling method of Zelnik-Manor and Perona, used to construct the graph, constitutes a useful representation of the characteristics of the data set and is adequate to deal with high-dimensional data.
Our modified diffusion method, represented by the non-linear smoothing term introduced in the Ginzburg-Landau functional, exploits the structure of the multiclass model and is not affected by the ordering of class labels. It efficiently propagates class information that is known beforehand, as evidenced by the small proportion of fidelity points (4% - 10% of dataset) needed to perform accurate segmentations. Moreover, the method is robust to initial conditions. As long as the initialization represents all classes uniformly, different initial random configurations produce very similar results. The main limitation of the method appears to be that fidelity points must be representative of class distribution. As long as this holds, such as in the examples discussed, the long-time behavior of the solution relies less on choosing the “right” initial conditions than do other learning techniques on graphs.
State-of-the-art results with small classification errors were obtained for all classification tasks. Furthermore, the results do not depend on the particular class label assignments. Future work includes investigating the diffuse interface parameter . We conjecture that the proposed functional converges (in the -convergence sense) to a total variational type functional on graphs as approaches zero, but the exact nature of the limiting functional is unknown.
This research has been supported by the Air Force Office of Scientific Research MURI grant FA9550-10-1-0569 and by ONR grant N0001411AF00002.
- [Allwein et al., 2000] Allwein, E. L., Schapire, R. E., and Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141.
- [Bertozzi et al., 2007] Bertozzi, A., Esedoḡlu, S., and Gillette, A. (2007). Inpainting of binary images using the Cahn-Hilliard equation. IEEE Transactions on Image Processing, 16(1):285–291.
- [Bertozzi and Flenner, 2012] Bertozzi, A. L. and Flenner, A. (2012). Diffuse interface models on graphs for classification of high dimensional data. Multiscale Modeling and Simulation, 10(3):1090–1118.
- [Bühler and Hein, 2009] Bühler, T. and Hein, M. (2009). Spectral clustering based on the graph -Laplacian. In Bottou, L. and Littman, M., editors, Proceedings of the 26th International Conference on Machine Learning, pages 81–88. Omnipress, Montreal, Canada.
- [Chung, 1997] Chung, F. R. K. (1997). Spectral graph theory. In Regional Conference Series in Mathematics, volume 92. Conference Board of the Mathematical Sciences (CBMS), Washington, DC.
- [Coifman et al., 2005] Coifman, R. R., Lafon, S., Lee, A. B., Maggioni, M., Nadler, B., Warner, F., and Zucker, S. W. (2005). Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431.
[Dietterich and Bakiri, 1995]
Dietterich, T. G. and Bakiri, G. (1995).
Solving multiclass learning problems via error-correcting output
Journal of Artificial Intelligence Research, 2(1):263–286.
- [Dobrosotskaya and Bertozzi, 2008] Dobrosotskaya, J. A. and Bertozzi, A. L. (2008). A wavelet-laplace variational technique for image deconvolution and inpainting. IEEE Trans. Image Process., 17(5):657–663.
- [Dobrosotskaya and Bertozzi, 2010] Dobrosotskaya, J. A. and Bertozzi, A. L. (2010). Wavelet analogue of the Ginzburg-Landau energy and its gamma-convergence. Interfaces and Free Boundaries, 12(2):497–525.
- [Gilboa and Osher, 2008] Gilboa, G. and Osher, S. (2008). Nonlocal operators with applications to image processing. Multiscale Modeling and Simulation, 7(3):1005–1028.
- [Har-Peled et al., 2003] Har-Peled, S., Roth, D., and Zimak, D. (2003). Constraint classification for multiclass classification and ranking. In S. Becker, S. T. and Obermayer, K., editors, Advances in Neural Information Processing Systems 15, pages 785–792. MIT Press, Cambridge, MA.
- [Hastie and Tibshirani, 1998] Hastie, T. and Tibshirani, R. (1998). Classification by pairwise coupling. In Advances in Neural Information Processing Systems 10. MIT Press, Cambridge, MA.
- [Hein and Setzer, 2011] Hein, M. and Setzer, S. (2011). Beyond spectral clustering - tight relaxations of balanced graph cuts. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K., editors, Advances in Neural Information Processing Systems 24, pages 2366–2374.
[Jung et al., 2007]
Jung, Y. M., Kang, S. H., and Shen, J. (2007).
Multiphase image segmentation via Modica-Mortola phase transition.SIAM J. Appl. Math, 67(5):1213–1232.
- [Kohn and Sternberg, 1989] Kohn, R. V. and Sternberg, P. (1989). Local minimizers and singular perturbations. Proc. Roy. Soc. Edinburgh Sect. A, 111(1-2):69–84.
- [Li and Kim, 2011] Li, Y. and Kim, J. (2011). Multiphase image segmentation using a phase-field model. Computers and Mathematics with Applications, 62:737–745.
- [Liu et al., 2010] Liu, W., He, J., and Chang, S.-F. (2010). Large graph construction for scalable semi-supervised learning. Proceedings of the 27th International Conference on Machine Learning.
- [Subramanya and Bilmes, 2011] Subramanya, A. and Bilmes, J. (2011). Semi-supervised learning with measure propagation. Journal of Machine Learning Research, 12:3311–3370.
- [Szlam and Bresson, 2010] Szlam, A. and Bresson, X. (2010). Total variation and cheeger cuts. In Fürnkranz, J. and Joachims, T., editors, Proceedings of the 27th International Conference on Machine Learning, pages 1039–1046. Omnipress, Haifa, Israel.
- [Szlam et al., 2008] Szlam, A. D., Maggioni, M., and Coifman, R. R. (2008). Regularization on graphs with function-adapted diffusion processes. Journal of Machine Learning Research, 9:1711–1739.
- [von Luxburg, 2006] von Luxburg, U. (2006). A tutorial on spectral clustering. Technical Report TR-149, Max Planck Institute for Biological Cybernetics.
- [Wang et al., 2008] Wang, J., Jebara, T., and Chang, S.-F. (2008). Graph transduction via alternating minimization. Proceedings of the 25th International Conference on Machine Learning.
- [Zelnik-Manor and Perona, 2005] Zelnik-Manor, L. and Perona, P. (2005). Self-tuning spectral clustering. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA.
- [Zhou et al., 2004] Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2004). Learning with local and global consistency. In Thrun, S., Saul, L. K., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16, pages 321–328. MIT Press, Cambridge, MA.
- [Zhou and Schölkopf, 2004] Zhou, D. and Schölkopf, B. (2004). A regularization framework for learning from graph data. In Workshop on Statistical Relational Learning. International Conference on Machine Learning, Banff, Canada.