Construction of the similarity matrix for the spectral clustering method: numerical experiments

04/24/2019 · Paola Favati et al. · University of Pisa, Consiglio Nazionale delle Ricerche

Spectral clustering is a powerful method for finding structure in a dataset through the eigenvectors of a similarity matrix. It often outperforms traditional clustering algorithms such as k-means when the structure of the individual clusters is highly non-convex. Its accuracy depends on how the similarity between pairs of data points is defined. Two important items contribute to the construction of the similarity matrix: the sparsity of the underlying weighted graph, which depends mainly on the distances among data points, and the similarity function. When a Gaussian similarity function is used, the choice of the scale parameter σ can be critical. In this paper we examine both items, the sparsity and the selection of suitable σ's, based either directly on the graph associated to the dataset or on the minimal spanning tree (MST) of the graph. An extensive numerical experimentation on artificial and real-world datasets has been carried out to compare the performance of the methods.


1 Introduction

Clustering, a key step for many data mining problems, can be applied to a variety of different kinds of documents as long as a distance measure can be assigned to define the similarity among the data objects. A clustering technique classifies the data into groups, called clusters, in such a way that the objects belonging to a same cluster are more similar to each other than to the objects belonging to different clusters. Clearly, a specific formulation of the clustering problem depends considerably on the metric that is assigned.

We assume here that the data objects are $n$ distinct points $x_1, \dots, x_n$ of a subset of $\mathbb{R}^d$. It is standard to measure the distances in $\mathbb{R}^d$ by using norms. Both the 1-norm, known as Manhattan, and the 2-norm, known as Euclidean, are suitable, while the $\infty$-norm is not (see [1], Sec. 3.2.1.4). We use the Euclidean distance $d_{ij} = \|x_i - x_j\|_2$. The given data objects are represented in a natural way by an undirected weighted complete graph $G$, having $n$ nodes and the distances $d_{ij}$ as weights on the edges. We assume also that the required number $k$ of clusters is a priori fixed, with $k \ll n$. Thus, a clustering corresponds to a partitioning of the indices $\{1, \dots, n\}$ into $k$ subsets.

To solve the clustering problem many different methods which work directly on the space $\mathbb{R}^d$ have been devised, but they are not entirely satisfying. For example, the widely used Euclidean-based $k$-means is very sensitive to the initialization and, applied directly to the original data, cannot deal with clusters that are nonlinearly separable in $\mathbb{R}^d$, has difficulties in taking into account other features like the cardinality or the density of each cluster, and in some situations can be very slow [2]. For this reason another approach can be followed using a graph-based procedure, which performs the clustering by constructing the so-called similarity graph $G_S$, having the same nodes of $G$ and possibly a subset of the edges of $G$. The weights on the edges of $G_S$ are defined by using a similarity function to model the local neighborhood relationships between the data points, i.e. a higher similarity is assigned to pairs of closely related objects than to objects which are only weakly related. The nonnegative symmetric adjacency matrix $W$ associated to $G_S$ is the so-called similarity matrix. By using a spectral clustering algorithm the space $\mathbb{R}^d$, where the original data live, is mapped to a new $k$-dimensional feature space $\mathcal F$. The transformed representation gives better results than the original one, because the corresponding procedures are more flexible, can adjust to varying local densities and discover arbitrarily shaped clusters ([1], Sec. 6.7.1). Then the clustering is computed in $\mathcal F$ and mapped back to $\mathbb{R}^d$. Spectral clustering algorithms have become very popular (see [3] for a thorough survey). In the experiments we choose the algorithm described in [4].

The similarity graph changes depending on whether only distances among points are accounted for or density considerations are also applied (see the literature review in [5] for a list of the different approaches). In general the procedures which specifically exploit the density are more expensive.

A similarity function widely used in the literature to construct the similarity matrix is the Gaussian one. It depends on a scale parameter $\sigma$ whose choice can be crucial. A possible way to select $\sigma$ is to run the spectral algorithm repeatedly for different values of $\sigma$ and select the one that provides the best clustering according to some criterion. Besides the obvious increase in the computational cost, the drawback of this procedure is the difficulty of choosing a reliable quality measure.

In this paper we take into consideration procedures for setting suitable values of $\sigma$ without performing multiple runs. These procedures use techniques based either directly on a subgraph of the complete graph $G$ associated to the dataset or on the minimum spanning tree (MST) of $G$. We focus on aspects like the sparsity level of the similarity matrix and the computational cost of its construction. The paper is organized as follows: after a very concise review of the spectral clustering method, of the performance indices and of the minimum spanning tree in Section 2, in Section 3 we outline the different types of sparsity of the similarity matrices, the role of the scale parameter in their construction and the different resulting methods we take into consideration. Finally, in Section 4 the results of the experimentation on both artificial and real-world datasets are presented and discussed.

2 Preliminaries

The idea behind a graph-based procedure is to convert the representation of the objects given in the original space $\mathbb{R}^d$ into a transformed representation in a new $k$-dimensional feature space $\mathcal F$, $k$ being the required number of clusters, in such a way that the similarity structure of the data is preserved. To this aim, the clustering problem is reformulated making use of the similarity graph $G_S$. The weight $w_{ij}$ on the edge connecting the $i$th and the $j$th nodes depends mainly on their distance $d_{ij}$ and is large for close points, small for far-away points. Other aspects, like for example the nearby density, could be taken into consideration. For a correct application of the spectral algorithm, the graph $G_S$ should be undirected and connected [6]. Section 3 is dedicated to the description of the similarity graphs we consider.

2.1 The normalized spectral clustering algorithm

Once the similarity matrix $W$ associated to the graph $G_S$ has been constructed, the spectral clustering can be performed. Spectral clustering algorithms are based on the eigen-decomposition of Laplacian matrices. A detailed description of many such algorithms can be found in [3].

The unnormalized Laplacian matrix is $L = D - W$, where $D$ is the degree matrix of $W$, i.e. the diagonal matrix whose $i$th principal entry is given by $d_{ii} = \sum_{j=1}^n w_{ij}$. The matrix $D$ can be seen as a local average of the similarity and is used as a normalization factor for $W$. An exposition of the properties of the matrix $L$ and of its normalized versions, on which spectral algorithms rely, can be found in the tutorial [6]. In our experimentation we use the algorithm proposed in [4], which exploits the normalized matrix

$$\widetilde L = D^{-1/2}\, W\, D^{-1/2},$$

whose eigenvectors coincide with those of the normalized Laplacian $I - \widetilde L$.

This algorithm runs as follows:

  1. construct the matrix $\widetilde L = D^{-1/2} W D^{-1/2}$;

  2. compute the matrix $U = [u_1, u_2, \dots, u_k]$, where the $u_j$ are the mutually orthogonal eigenvectors of $\widetilde L$ which correspond to the $k$ largest eigenvalues (ordered downward);

  3. perform a row normalization on $U$ to map the points onto the unit hypersphere, by setting $t_{ij} = u_{ij} \big/ \bigl(\sum_{h=1}^{k} u_{ih}^2\bigr)^{1/2}$.

    For $i = 1, \dots, n$, points $y_i$ of components $t_{i1}, \dots, t_{ik}$ are obtained. The new feature space $\mathcal F$, which is a subspace of $\mathbb{R}^k$, is so identified;

  4. compute the clustering of the points $y_i$ of $\mathcal F$, which is mapped back to the original points $x_i$ of $\mathbb{R}^d$.

The graph-based approach allows discovering clusters of arbitrary shape. This property is not specific to the particular clustering method used in the final step 4. Since the transformed representation enhances the clustering properties of the data, even the simple $k$-means can be successfully applied. In our experimentation we apply the clustering algorithm described in [7], which is based on a sequence of rotations and turns out to be fast and efficient.

Since the spectral clustering algorithm performs the dimensionality reduction through the eigenvectors of $\widetilde L$, the computation of the eigenvectors (using the Lanczos algorithm) can have a substantial computational cost for large $n$ and $k$. The sparsity of the matrix $W$ is then an important issue from this point of view.
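To make the steps above concrete, here is a minimal NumPy sketch of this normalized spectral clustering scheme (function and parameter names are ours); the row-normalized embedding is clustered with k-means, used here as a simple stand-in for the rotation-based algorithm of [7] employed in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(W, k, random_state=0):
    """Normalized spectral clustering in the style of [4].

    W : (n, n) symmetric nonnegative similarity matrix (zero diagonal).
    k : required number of clusters.
    """
    # Degree vector and symmetric normalization D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, np.finfo(float).tiny))
    L = (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

    # Eigenvectors associated with the k largest eigenvalues
    # (np.linalg.eigh returns eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, -k:]

    # Row normalization: map each embedded point onto the unit hypersphere.
    T = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    # Cluster the embedded points; k-means is a simple stand-in for the
    # rotation-based discretization of [7].
    return KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(T)
```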

2.2 The performance index

To measure the clustering validity one can rely on both internal and external indices (see [8] and its references). Internal indices generally evaluate the clustering in terms of compactness and separability, measured through Euclidean distances. When the data belong to $\mathbb{R}^2$ or $\mathbb{R}^3$ and the clusters are not convex, or are badly separated, or have different densities, these indices might not comply with what the human eye would suggest.

External indices evaluate how well the obtained clustering matches the target, i.e. the assumed clustering of the data. The results obtained in the experimentation of Section 4 have been evaluated by means of the external Normalized Mutual Information index (NMI) [9]. Three other external indices have also been used for comparison purposes, namely the purity index [10], the Rand index [11] and the Clustering Error [12]. Since their grading of all the methods that will be introduced in Section 3 fully matched that of the NMI, in Section 4 we report only the clustering evaluations based on the NMI index.

Let $\mathcal T = \{T_1, \dots, T_k\}$ be the target clustering, $\mathcal C = \{C_1, \dots, C_k\}$ the computed clustering, and define $n_{ij} = |C_i \cap T_j|$, $a_i = |C_i|$ and $b_j = |T_j|$, for $i, j = 1, \dots, k$. The number $n_{ij}$ counts how many points in the $i$th cluster of $\mathcal C$ belong also to the $j$th cluster of $\mathcal T$. The NMI is given by

$$\mathrm{NMI}(\mathcal C, \mathcal T) = \frac{2\, I(\mathcal C, \mathcal T)}{H(\mathcal C) + H(\mathcal T)}, \qquad I(\mathcal C, \mathcal T) = \sum_{i,j} \frac{n_{ij}}{n} \log \frac{n\, n_{ij}}{a_i\, b_j}, \qquad H(\mathcal C) = -\sum_i \frac{a_i}{n} \log \frac{a_i}{n},$$

where $H(\mathcal T)$ is defined analogously. In the literature there is also a version where the normalization is performed with the geometric mean of the entropies instead of the arithmetic one. In the case the computed clustering does not have the expected number $k$ of clusters, the algorithm is considered failed.
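For concreteness, a small sketch of the NMI with the arithmetic-mean normalization defined above, assuming two integer label vectors of equal length (names are ours):

```python
import numpy as np

def nmi(labels_pred, labels_true):
    """NMI with arithmetic-mean normalization, as defined above.
    Assumes each labeling contains at least two nonempty clusters."""
    labels_pred = np.asarray(labels_pred)
    labels_true = np.asarray(labels_true)
    n = labels_pred.size

    cp = np.unique(labels_pred)
    ct = np.unique(labels_true)
    # Contingency counts n_ij = |C_i ∩ T_j|.
    N = np.array([[np.sum((labels_pred == ci) & (labels_true == tj)) for tj in ct]
                  for ci in cp], dtype=float)
    a = N.sum(axis=1)   # cluster sizes |C_i|
    b = N.sum(axis=0)   # cluster sizes |T_j|

    # Mutual information (0 log 0 treated as 0) and entropies.
    nz = N > 0
    I = np.sum(N[nz] / n * np.log(n * N[nz] / np.outer(a, b)[nz]))
    H_c = -np.sum(a / n * np.log(a / n))
    H_t = -np.sum(b / n * np.log(b / n))
    return 2.0 * I / (H_c + H_t)
```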

The larger the value of NMI, the better the clustering performance according to the assumed target. Perfectly matching clusterings have NMI = 1, but lower indices can still be considered acceptable when the difference is due to a few far-away points, possibly outliers. The treatment of outliers is not trivial, as can be seen from the last three datasets in Figure 1, where it is not easy to decide whether the points of the two bridges should be considered outliers, or as belonging to the clusters of the other points (and to which one), or even whether they should form two separate clusters. For this reason, no specific treatment for outlier detection is performed in the experiments.

2.3 Minimum spanning tree

Given a connected undirected weighted graph with $n$ nodes, its minimum spanning tree (MST) is a subgraph which has the same nodes, connected by $n-1$ edges of minimal total weight. The classical Prim's algorithm [13] computes the MST with a computational cost of order $O(n^2)$ if the graph is dense and of order $O(m + n \log n)$, $m$ being the number of edges, if the graph is sparse, provided that adjacency lists are used for the graph representation and a Fibonacci heap is used as a working data structure [13].

Prim's algorithm starts from an arbitrary root node, constructs a tree which grows until it spans all the nodes, and returns the $n$-vector $p$ such that $p_i$ is the parent of the $i$th node. The MST is unique if there are no ties in the pairwise distances. Since the graph is connected, its MST is connected as well. From this property it follows that the graph obtained by removing all edges longer than the longest edge of its MST is still connected [6].

Clustering procedures which exploit the MST of a graph have been proposed [14, 15]. They act by removing from the graph the edges with the largest distances, until $k$ disconnected components are obtained, avoiding dropping edges which would lead to single-element subsets considered outliers. In any case, MST-based clustering algorithms are known to be highly unstable in the presence of noise and/or outliers [16].

We are interested in the MST not as a clustering procedure, but as a technique for extracting suitable values of the parameters that will be used in the construction of the similarity graphs, as described in the next section.
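As an illustration, the MST of the complete distance graph, together with its largest edge weight (used later to set scale and threshold parameters), can be obtained with SciPy, whose routine is not necessarily an implementation of Prim's algorithm; a minimal sketch, with names of our choosing:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_of_complete_graph(X):
    """Return the pairwise Euclidean distance matrix, the MST edges of the
    complete graph on the rows of X, and the largest MST edge weight."""
    D = squareform(pdist(X))                  # distances d_ij
    T = minimum_spanning_tree(D).tocoo()      # n-1 edges of minimal total weight
    edges = list(zip(T.row, T.col, T.data))
    return D, edges, T.data.max()
```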

3 The methods

Sparsity is an important issue from the computational point of view: a sparse graph can be represented by adjacency lists, which allow both a reduction of the required storage and the implementation of more efficient matrix-vector product algorithms. Hence, when $n$ is large, a sparse representation of the similarity among the data, conveying sufficient information to get an acceptable clustering, should be used.

The more immediate sparsity structures are obtained from distance-based graphs which take into account the pairwise distances of the points in $\mathbb{R}^d$. The sparse graph obtained by dropping from the complete graph $G$ some selected edges is denoted by $\widehat G$, and the similarity graph, which has the same nodes and edges of $\widehat G$ and weights defined through a similarity function, is denoted by $G_S$. Two cases will be considered:

– the full case, when $\widehat G = G$, i.e. no edge is dropped;

– the sparse case, when $\widehat G$ is a proper subgraph of $G$ obtained according to a chosen sparsity model.

The ratio $s = n_d / n_e$, where $n_d$ is the number of edges dropped from $G$ and $n_e = n(n-1)/2$ is the total number of edges of $G$, measures the sparsity level. The greater $s$, the higher the sparsity.

3.1 The sparsity model

In this paper we consider the distance-based sparsity obtained through techniques like the ones outlined in [6], where the sparsity level is imposed through a sparsity parameter. The two following techniques are considered here.

(a) $\epsilon$-neighbor sparsity, where $\epsilon$ is a selected threshold. In the sparse graph $\widehat G$ the nodes $i$ and $j$ are connected only if their pairwise distance $d_{ij}$ is not greater than $\epsilon$. As noted in [6], if $\epsilon$ is set equal to the largest weight of the MST of $G$, then the graph $\widehat G$ is connected, but this choice could give a too large $\epsilon$ if some very far-away outliers are present. For this reason smaller values should be chosen. We denote these methods as $\epsilon$-methods.

(b) $\kappa$-nearest neighbor sparsity, where $\kappa$ is the sparsity parameter. In the sparse directed graph node $i$ is connected to node $j$ if $x_j$ is among the $\kappa$ nearest neighbors of $x_i$. Since the nearest-neighbor relationship is not symmetric, a symmetrization step must be provided to get an undirected graph. The symmetrization can be performed by connecting nodes $i$ and $j$ with an undirected edge

(b1) if both points $x_i$ and $x_j$ are among the $\kappa$ nearest neighbors of the other one. We denote these methods as $m$-methods (the $m$ stands for mutual).

(b2) if either point $x_i$ or point $x_j$ is among the $\kappa$ nearest neighbors of the other one. We denote these methods as $nm$-methods (the $nm$ stands for non mutual).

Contrary to $\epsilon$-neighbor sparse graphs, $\kappa$-nearest neighbor sparse graphs allow an a-priori control of the achievable sparsity level. Moreover, $\kappa$-nearest neighbor sparse graphs allow an easier detection of outliers. In fact, if before symmetrization no directed edge exists from any node $j$ to the node $i$, the point $x_i$ could be recognized as an outlier.

If the graph obtained by applying techniques (a) and (b) turns out to be unconnected, because too many connections have been lost, an aggregation step must be performed to guarantee sufficient connectivity.

As concerns the $\epsilon$-methods, in the experimentation we set $\epsilon$ equal to the mean of the distances of each point from its $\kappa$th closest neighbor. In any case the chosen integer $\kappa$ depends on $n$ but is much smaller than $n$. In [6] a value of $\kappa$ of the order of $\log n$ is suggested, on the basis of asymptotic connectivity considerations. In our experiments we also consider a second, larger value of $\kappa$, which leads to a lower sparsity of the similarity graph without a significant increase of the computational cost for the similarity matrix construction (see Subsection 3.4).
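A minimal NumPy sketch of the three sparsity models under the notation above (function names and the boolean-matrix representation are ours):

```python
import numpy as np

def knn_mask(D, kappa):
    """Boolean matrix: entry (i, j) is True if x_j is among the kappa nearest
    neighbors of x_i (the directed kappa-NN relation).  Assumes distinct
    points, so that column 0 of each sorted row is the point itself."""
    n = D.shape[0]
    idx = np.argsort(D, axis=1)[:, 1:kappa + 1]
    M = np.zeros((n, n), dtype=bool)
    M[np.arange(n)[:, None], idx] = True
    return M

def sparse_adjacency(D, model, kappa):
    """Adjacency (boolean) of the sparse graph for the three models:
    'eps' : epsilon-neighbor graph, epsilon = mean distance to the kappa-th neighbor,
    'm'   : mutual kappa-nearest neighbor graph,
    'nm'  : non-mutual kappa-nearest neighbor graph."""
    M = knn_mask(D, kappa)
    if model == 'eps':
        eps = np.sort(D, axis=1)[:, kappa].mean()   # mean kappa-th neighbor distance
        A = (D <= eps) & (D > 0)
    elif model == 'm':
        A = M & M.T          # symmetrization: both directions required
    elif model == 'nm':
        A = M | M.T          # symmetrization: either direction suffices
    else:
        raise ValueError(model)
    np.fill_diagonal(A, False)
    return A
```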

3.2 The similarity function

Let $\mathcal E$ denote the set of edges of the sparse graph $\widehat G$ obtained according to one of the techniques described above. Obviously $(i,i) \notin \mathcal E$ and $d_{ij} > 0$ for $(i,j) \in \mathcal E$. A simple definition of the similarity function would be $w_{ij} = 1/d_{ij}$ for $(i,j) \in \mathcal E$, but this function is not satisfactory in the presence of very close points. Similarity functions considered in the literature are:

the unit similarity function

$$w_{ij} = 1, \qquad (i,j) \in \mathcal E, \qquad\qquad (1)$$

and the Gaussian similarity function

$$w_{ij} = \exp\!\left(-\frac{d_{ij}^2}{\sigma^2}\right), \qquad (i,j) \in \mathcal E, \qquad\qquad (2)$$

where the scale parameter $\sigma$ controls the decay rate of the similarity as the distance grows.

The values $w_{ij}$ are the weights of the similarity graph $G_S$. The similarity matrix $W$ is the adjacency matrix of $G_S$, hence its entries are $w_{ij}$ for $(i,j) \in \mathcal E$ and zero otherwise.

When the Gaussian similarity function (2) is used, a correct choice of $\sigma$ may be critical for the efficiency of the spectral algorithm. A too small value of $\sigma$ would give weights very close to 0, so that all the points would appear equally far away. On the contrary, a too large value of $\sigma$ would give weights very close to 1, so that all the points would appear equally close. In both cases it would be difficult to discriminate between close and distant points. Hence a value of $\sigma$ intermediate between the smallest and the largest distance $d_{ij}$, with $i \ne j$, must be chosen.

Instead of a global $\sigma$, different local scales $\sigma_i$ can be used for the different objects $x_i$. Then (2) is replaced by

$$w_{ij} = \exp\!\left(-\frac{d_{ij}^2}{\sigma_i\,\sigma_j}\right), \qquad (i,j) \in \mathcal E, \qquad\qquad (3)$$

in order to tune the pairwise distances according to the local statistics of the neighborhoods surrounding $x_i$ and $x_j$ [17]. These local scales are suggested especially for high-dimensional problems with large variations in the local structure, where (3) is expected to give better results than (2).
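A small sketch of the three similarity functions as written above, evaluated only on the retained edges (function name and arguments are ours):

```python
import numpy as np

def similarity_matrix(D, A, kind='gauss_global', sigma=None, local_scales=None):
    """Weights of the similarity graph on the retained edges.

    D : (n, n) distance matrix, A : boolean adjacency of the sparse graph.
    kind = 'unit'         : w_ij = 1                              (formula (1))
    kind = 'gauss_global' : w_ij = exp(-d_ij^2 / sigma^2)         (formula (2))
    kind = 'gauss_local'  : w_ij = exp(-d_ij^2 / (sigma_i sigma_j)) (formula (3))
    """
    W = np.zeros_like(D, dtype=float)
    if kind == 'unit':
        W[A] = 1.0
    elif kind == 'gauss_global':
        W[A] = np.exp(-D[A] ** 2 / sigma ** 2)
    elif kind == 'gauss_local':
        S = np.outer(local_scales, local_scales)
        W[A] = np.exp(-D[A] ** 2 / S[A])
    else:
        raise ValueError(kind)
    return W
```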

A first simple idea for determining a reasonable value of $\sigma$ takes into account the decay rate of the exponential function, whose inflection point discriminates between a first part of rapid decay and a second flatter part. This suggests looking for gaps in the curve of the distances $d_{ij}$ sorted in descending order. If we can identify a large gap, we can exploit its location as a suitable value for $\sigma$.

Another idea takes into account the histogram of the distances $d_{ij}$. If the data form clusters, the histogram is multi-modal and, if $\sigma$ is chosen around the first mode, the affinity values of the points which form a cluster can be expected to be significantly larger than the others. This suggests choosing for $\sigma$ a value somewhat smaller than the first mode of the histogram. In [18] this technique is said to work well for spherical-like clusters.

Although simple, these ideas are difficult to implement, because of the difficulty of detecting the right gap or mode. We have verified that their applicability is in practice restricted to a small number of cases, and for this reason we have discarded them.

To find a suitable value for the scale parameter, it is suggested in the literature to run the spectral clustering algorithm repeatedly for different values of $\sigma$ and select the one that provides the best clustering according to a chosen quality measure. For example, in [4] it is suggested to choose for $\sigma$ the value which gives the tightest clusters in the feature space $\mathcal F$. To implement this technique, a decreasing sequence of values of $\sigma$ could be used, starting with a value smaller than the largest distance $d_{ij}$. However, the drawback of this procedure is the choice of the quality measure, which might be a non-monotone function of $\sigma$. In this case a reliable determination of an acceptable $\sigma$ could not be obtained using only a small number of tries. Moreover, a test based on an internal performance index, like for example the DB index [19], does not guarantee that the final clustering is consistent with what a human would set as a target.

For a single run of the spectral clustering algorithm, a careful a-priori selection of $\sigma$ should be considered, and we suggest in the next subsection to exploit either the minimum spanning tree of $G$, denoted by MST, or $G$ directly. In the first case the MST is assumed to be a reliable representation of the data and $\sigma$ is set equal to its largest weight. In the second case, we suggest taking for the local scale $\sigma_i$ the distance from $x_i$ of its $\nu$th nearest neighbor, $\nu$ being a chosen index much smaller than $n$. In [17] the index $\nu = 7$ is suggested, but this choice, independent of $n$, appears somewhat arbitrary. Our choice of the $\sigma_i$'s will be described in the next subsection. From the local scales a single global $\sigma$ can be obtained by averaging the $\sigma_i$'s.
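The two a-priori choices just described amount to a few lines; a sketch under the assumptions above (names are ours):

```python
import numpy as np

def scales_from_graph(D, nu):
    """Local scales sigma_i = distance from x_i to its nu-th nearest neighbor;
    a single global sigma is obtained by averaging them.  Index nu is used
    (not nu - 1) because entry 0 of each sorted row is the point itself."""
    local = np.sort(D, axis=1)[:, nu]
    return local, local.mean()

def scale_from_mst(mst_edge_weights):
    """Global scale: the largest weight of the minimum spanning tree."""
    return float(np.max(mst_edge_weights))
```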

If the similarity functions (2) or (3) are used, the graph $G_S$ might turn out to be sparse in practice even when no particular sparsity structure is explicitly imposed on $\widehat G$, depending on the magnitude of the scale parameters, due to the possible fast decay of the exponential function. So, when the edges with negligible weights are dropped and adjacency lists are used to represent $G_S$, for all computational purposes this graph can be considered sparse.

3.3 Definition of the methods

By the term “method” we mean the combination of a sparsity model and a similarity function (either unit, global Gaussian or local Gaussian). The different similarity matrices so constructed are then processed by the spectral algorithm.

– In the full case, no sparsity is imposed on $G$, i.e. $\widehat G = G$. In the following we refer to these methods as $f$-methods. The unit similarity function is not used in this case.

—  Method $f_2$ implements the similarity function (2). In [6] the largest weight of the MST is suggested as the value of $\sigma$. In our experimentation this choice often appeared too large: the matrix $W$ resulted nearly full and gave really poor clusterings. Hence, if the largest MST weight turns out to be larger than the mean $\mu$ of all the pairwise distances $d_{ij}$, we set $\sigma$ equal to $\mu$.

—  Method $f_3$ implements the similarity function (3). The $i$th local scale $\sigma_i$ is set equal to the distance from $x_i$ of its $\nu$th nearest neighbor, with $\nu$ equal to the smaller of the two values considered for $\kappa$ in Subsection 3.1. The sparsity level obtained with this value is already too low in most of the datasets considered in the experimentation, so the larger value, which would give even lower sparsity levels, has not been used.

—  Method $f_4$ implements the similarity function (2) with $\sigma$ equal to the mean of the previous $\sigma_i$'s.

– In the sparse case, let $\mathcal E$ and $\mathcal E_{\mathrm{MST}}$ be the edge sets of $\widehat G$ and of the MST, respectively. We set

$$\sigma = \max_{(i,j) \in \mathcal E \cap \mathcal E_{\mathrm{MST}}} d_{ij}, \qquad\qquad (4)$$

i.e. $\sigma$ is the largest weight among the MST edges retained in $\widehat G$.

The name of a method indicates the sparsity model, with a subscript which indicates the similarity function: specifically, subscript 1 indicates similarity (1), subscript 2 indicates similarity (2) with $\sigma$ given in (4), subscript 3 indicates similarity (3) with the local scales $\sigma_i$ defined above, subscript 4 indicates similarity (2) with $\sigma$ equal to the mean of the $\sigma_i$'s. For example $\epsilon_1$ denotes the $\epsilon$-method applied with the unit similarity function and $m_3$ denotes the $m$-method applied with the local Gaussian function (3). Each method is applied with both values of $\kappa$.

On the whole, 27 methods are considered for the experiments.

3.4 Computational costs

We now consider the computational cost of the described methods, under the assumption that all the pairwise distances are available; hence the cost for the construction of $G$ is not accounted for. When the graph is sparse, adjacency lists are used for the graph representation. The cost of constructing the similarity matrix is given by $C = C_1 + C_2 + C_3$, where

$C_1$ is the cost of sparsifying $G$,

$C_2$ is the cost of computing the scale parameters,

$C_3$ is the cost of evaluating the similarity function.

For the $f$-methods no sparsification is performed, hence $C_1 = 0$ and $C_3 = O(n^2)$, since the similarity function is evaluated on a nearly full matrix. For method $f_2$, $C_2$ is of order $O(n^2)$, as follows from Subsection 2.3. Also for methods $f_3$ and $f_4$, $C_2 = O(n^2)$: indeed, given an integer $\nu$, $O(n)$ computational steps suffice to select the $\nu$th nearest neighbor of a point and to identify the set of its $\nu$ nearest neighbors, for each of the $n$ points (see [13]).
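In NumPy, this linear-time selection corresponds to a partial partition of each row of the distance matrix rather than a full sort; a small illustrative sketch (names are ours):

```python
import numpy as np

def kth_neighbor_distance(D, kappa):
    """Distance of each point from its kappa-th nearest neighbor, obtained
    with a linear-time selection per row (np.partition) instead of a sort.
    Index kappa (not kappa - 1) is used because entry 0 of each row of the
    distance matrix D is the point itself, at distance 0."""
    return np.partition(D, kappa, axis=1)[:, kappa]

# e.g. the threshold for the epsilon-neighbor graph:
# eps = kth_neighbor_distance(D, kappa).mean()
```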

For the same reason, the detection of the edges of $\widehat G$ for the $\epsilon$-methods has a cost $C_1 = O(n^2)$. For method $\epsilon_1$ the costs $C_2$ and $C_3$ are zero. The number of edges of $\widehat G$ is not a-priori quantifiable and in the worst case is of order $O(n^2)$. For method $\epsilon_2$, $C_2$ is dominated by the computation of the MST of $\widehat G$ and $C_3$ is proportional to the number of retained edges. For methods $\epsilon_3$ and $\epsilon_4$ both $C_2$ and $C_3$ are of order $O(n^2)$. In any case all the costs turn out to be of order $O(n^2)$.

For the $m$-methods and the $nm$-methods, the number of edges of $\widehat G$ is of order $O(\kappa n)$ and their selection has a cost of order $O(n^2)$. When the unit similarity function is used, the costs $C_2$ and $C_3$ are zero. Otherwise, both $C_2$ and $C_3$ are dominated by the cost $C_1$ of the sparsifying step.

In brief, all the above described methods have costs of order $O(n^2)$. The methods which exploit the sparsity of the graphs have computational costs dominated by the cost of the sparsifying step.

4 The experimentation

The experiments have been conducted on an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz using double precision arithmetic.

4.1 The datasets

Datasets belonging to four sets $A$, $B$, $R$ and $U$ have been considered. The target clustering is known for every dataset.

A, B. Sets $A$ and $B$ contain 12 artificial datasets of $\mathbb{R}^2$ which can be found in the Spatial Data Mining Project of [20]. Their clusters have circular or noncircular shape, uniform or nonuniform density. In Figure 1 they are shown in order of growing difficulty. Different gray levels evidence the assumed target clusters. The first 9 datasets belong to set $A$ and have outliers and well separated clusters; the last 3 datasets belong to set $B$, have bridges which connect different clusters and lend themselves to possibly different targets. The datasets have from 76 to 289 points, normalized in such a way that the maximum distance between points of each dataset is equal to 1.

Figure 1: Artificial datasets generated in $\mathbb{R}^2$.

R. Set $R$ contains 6 artificial datasets generated in $\mathbb{R}^3$, which consist of two interlacing rings with increasing data dispersion and have 900 points each (see three of them in Figure 2), and 2 datasets with 200 points each which can be found in [5] under the name I-.

Figure 2: Artificial datasets generated in $\mathbb{R}^3$.

U. Set $U$ contains 4 real-world datasets, iris, wine, vote and seeds, which have been taken from the UCI repository [21].

Dataset iris is a typical test case for many statistical classification techniques in machine learning. It consists of 150 samples from three species of iris, with four features measured for each sample: the length and the width of sepals and petals. Iris contains 150 points in $\mathbb{R}^4$ and 3 clusters are expected. Actually, the dataset contains only two well-separated clusters: one cluster contains the measurements of only one species, while the other cluster contains the measurements of the other two species and is not evidently separable without further supervised information.

Dataset wine consists of 178 samples of chemical analyses of wines derived from three different cultivars. The analysis determined 13 attributes for each sample. Wine contains 178 points in $\mathbb{R}^{13}$ and 3 clusters are expected.

Dataset vote consists of 435 samples. Each sample lists the votes (a vote can be “yes”, “no” or can be missing) given by one U.S. congressman (out of 168 republican and 267 democrat congressmen) on 16 different questions. Vote contains 435 points in $\mathbb{R}^{16}$ and 2 clusters are expected.

Dataset seeds consists of 210 samples measuring the geometrical properties of the kernels of three different varieties of wheat: Kama, Rosa and Canadian, randomly selected for the experiment. A soft X-ray technique was used to construct seven real-valued attributes. Seeds contains 210 points in $\mathbb{R}^7$ and 3 clusters are expected.

4.2 Results

The methods listed in Subsection 3.3 have been applied to each dataset, almost always obtaining clusterings with the expected number of clusters. For $i = 1, \dots, 27$ and $j = 1, \dots, n_D$, the $i$th method has been applied to the $j$th dataset of each set, with $n_D = 9$ for set $A$, $n_D = 3$ for set $B$, $n_D = 8$ for set $R$ and $n_D = 4$ for set $U$. The following quantities have been computed:
– the sparsity level $s_{ij}$ of the similarity graph $G_S$,
– the accuracy $a_{ij}$, measured by the NMI index of the computed clustering.

For each of the sets $A$, $B$, $R$ and $U$, the averaged sparsity level $s_i$ and the averaged accuracy $a_i$ of the $i$th method are computed by averaging $s_{ij}$ and $a_{ij}$ over the datasets of the set.

The sparsity level is measured with respect to the machine zero $\epsilon_M$, i.e. counting as zero the entries of $W$ which are smaller than $\epsilon_M$. When the Gaussian similarity function is used, the reference to $\epsilon_M$ instead of the real zero gives significant sparsity levels even in the case of a (theoretically) dense graph, as can be seen in Table 1. Except for the $f$-methods, the sparsity levels vary negligibly when the methods belong to the same class, pointing out that, once a sparsity model has been chosen, the number of edges in $G_S$ is almost unaffected by the choice of the similarity function. In Table 1 the rows referring to the sparse methods therefore hold for any choice of the similarity function.

method Set A Set B Set R Set U
52% 74% 67% 44%
19% 24% 40% 24%
16% 18% 39% 18%
92% 95% 93% 94%
94% 95% 98% 95%
96% 97% 99% 97%
88% 92% 90% 91%
90% 91% 94% 91%
93% 94% 97% 95%
Table 1: Averaged sparsity levels of the methods.

Let $r_i$ denote the rank of the $i$th method, i.e. the position of its averaged accuracy $a_i$ in the non-increasing sequence of the averaged accuracies. Methods with equal accuracy receive the same ranking number according to the standard competition “1224” policy (i.e. the ranking of each method is equal to 1 plus the number of methods which have a larger accuracy). Table 2 shows the averaged accuracies $a_i$ and the corresponding rankings $r_i$. From Tables 1 and 2, it appears that there is no strict relation between the sparsity level of a method and its accuracy, in the sense that the accuracy seems to depend on “which” more than on “how many” edges are retained in the graph $\widehat G$. For example, on class $R$ some of the sparse methods obtain good results with both values of $\kappa$, while other methods with an intermediate sparsity level obtain poor results.
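For concreteness, a short sketch of the competition ranking just described (names are ours):

```python
import numpy as np

def competition_rank(accuracies):
    """Standard competition ('1224') ranking: the rank of each method is
    1 plus the number of methods with a strictly larger accuracy."""
    a = np.asarray(accuracies, dtype=float)
    return np.array([1 + np.sum(a > x) for x in a])

# competition_rank([0.9, 0.8, 0.9, 0.7])  ->  array([1, 3, 1, 4])
```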

method    Set A ($a_i$  $r_i$)    Set B ($a_i$  $r_i$)    Set R ($a_i$  $r_i$)    Set U ($a_i$  $r_i$)
0.87 11 0.59 27 0.84 11 0.52 27
0.73 26 0.69 18 0.76 26 0.58 1
0.72 27 0.72 11 0.72 27 0.58 1
0.88 9 0.76 6 0.78 20 0.56 11
0.84 13 0.76 6 0.78 20 0.56 11
0.88 9 0.75 8 0.78 20 0.56 11
0.87 11 0.75 8 0.83 12 0.56 11
0.77 21 0.74 10 0.77 23 0.57 7
0.83 19 0.68 20 0.77 23 0.57 7
0.80 20 0.69 18 0.77 23 0.57 7
0.84 13 0.68 20 0.82 16 0.58 1
0.92 7 0.68 20 0.83 12 0.55 21
0.95 6 0.71 12 0.85 7 0.56 11
0.92 7 0.68 20 0.83 12 0.56 11
1.00 1 0.71 12 0.93 5 0.56 11
0.84 13 0.85 1 0.85 7 0.54 26
0.84 13 0.85 1 0.85 7 0.55 21
0.84 13 0.85 1 0.85 7 0.55 21
0.84 13 0.85 1 0.86 6 0.55 21
0.76 23 0.71 12 0.82 16 0.58 1
0.76 23 0.68 20 0.83 12 0.58 1
0.76 23 0.66 26 0.82 16 0.58 1
0.77 21 0.67 25 0.82 16 0.57 7
0.98 3 0.70 17 0.94 1 0.56 11
0.98 3 0.71 12 0.94 1 0.56 11
0.98 3 0.71 12 0.94 1 0.56 11
1.00 1 0.78 5 0.94 1 0.55 21
Table 2: Averaged accuracies and ranking of the methods.

Figures 3 and 4 plot the averaged accuracies shown in Table 2 for the $\epsilon$-, $m$- and $nm$-methods, grouped according to the value of $\kappa$. From these figures it appears evident that the choice of the similarity function influences the averaged accuracy more markedly for the smaller value of $\kappa$.

Figure 3: Averaged accuracies for the $\epsilon$-, $m$- and $nm$-methods, for one value of $\kappa$.
Figure 4: Averaged accuracies for the $\epsilon$-, $m$- and $nm$-methods, for the other value of $\kappa$.

It is worth noting that the averaged accuracy of all the different methods for the problems of class $U$ is quite low and the range of its values is very small. This class of problems is therefore not well suited to analyzing and comparing the considered methods. Restricting ourselves to the classes of problems $A$, $B$ and $R$, from Table 2 it appears that the $f$-methods and the $nm$-methods are outperformed by the $m$-methods on the problems of classes $A$ and $R$ and by the $\epsilon$-methods on the problems of class $B$. For this reason we limit our analysis to the $\epsilon$- and $m$-methods.

Since the averaged values of the accuracy are not sufficient to show the different features of these methods, we analyze the behavior of selected methods on selected problems. We choose a representative method for both the $\epsilon$-methods and the $m$-methods. More precisely, due to their good performance on average, the methods $\epsilon_2$ and $m_2$, for the two values of $\kappa$, are selected.

method    class A (3 problems)    class B (3 problems)    class R (2 problems)
          1     0.54  0.35        0.38  0.9   0.96        0.52  0.24
          1     1     1           0.91  0.47  0.75        0.81  0.81
          0.68  0.58  0.35        0.85  0.9   0.8         0.57  0.51
          1     1     1           0.95  0.47  0.93        0.81  0.81
Table 3: Accuracies of the four selected methods on selected problems (three of class A, the three of class B, two of class R).

In Table 3 we list the accuracies of the four methods under investigation on three problems of class $A$, on all the problems of class $B$ and on two problems of class $R$. On all the remaining problems of classes $A$ and $R$ these four methods find the target clusters. By inspection of Table 3, it appears that the $m$-methods always outperform the $\epsilon$-methods except for problems with bridges. For the $\epsilon$-methods the choice of the radius $\epsilon$ can be crucial, especially when there are clusters with different densities. In order to better understand the meaning of Table 3, see Figures 5, 6, 7 and 8, where critical clustering outcomes are plotted and the corresponding values of accuracy are given.

Figure 5: Clusterings obtained for three problems of class $A$ (left, middle and right) by applying the $\epsilon_2$ method with the larger value of $\kappa$.
Figure 6: Clusterings obtained for a problem of class $B$ by applying the $\epsilon_2$ method (left) and the $m_2$ method (right).
Figure 7: Clusterings obtained for a problem of class $B$ by applying the $\epsilon_2$ method (left) and the $m_2$ method (right), with one value of $\kappa$.
Figure 8: Clusterings obtained for the same problem by applying the $\epsilon_2$ method (left) and the $m_2$ method (right), with the other value of $\kappa$.

Figure 5 shows the poor performance of the $\epsilon_2$ method, with the larger of the two radii, on three problems of class $A$. While for the first of them the choice of a smaller radius allows finding the target clustering, for the other two problems the smaller choice produces analogous results, maybe due to nonuniform densities. Figure 6 shows, for a problem of class $B$ which has short bridges, the clustering obtained by the $\epsilon_2$ method (on the left), which has irrelevant differences from the target one, and the wrong clustering obtained by the $m_2$ method (on the right). Similar results are obtained for the other value of $\kappa$. Figures 7 and 8 show that the choice of $\kappa$ is critical for both methods $\epsilon_2$ and $m_2$ on a problem which has bridges and clusters of nonuniform densities. To further investigate the negative influence of the bridges, we have removed the bridges from the problems of class $B$, obtaining the target clusterings with all four methods under consideration. In this case the nonuniform densities of two of these problems do not affect the behavior of the $\epsilon$-methods, due to the sufficiently large distances among the clusters.

4.3 Conclusions

In this paper several techniques to construct the similarity matrix have been considered. The corresponding methods have been tested on different datasets by applying a normalized spectral algorithm. The methods based on the dense graph and those obtained by exploiting the non mutual $\kappa$-nearest neighbor sparsity are clearly outperformed by those based on the $\epsilon$-neighbor sparsity and on the mutual $\kappa$-nearest neighbor sparsity. The performance of the more effective $\epsilon$- and $m$-methods depends on the characteristics of the problems. Among the different $m$-methods, the one applying the Gaussian similarity function given in (2), with $\sigma$ given in (4), appears to be the most reliable, even if in a few cases it is outperformed by the corresponding $\epsilon$-method.

References

  • [1] C.C. Aggarwal, Data Mining: the textbook. Springer, 2014.
  • [2] A. Vattani, k-means requires exponentially many iterations even in the plane, Discrete Comput. Geom., 45 (2011) pp. 596–616.
  • [3] M.C.V. Nascimento and A.C.P.L.F. de Carvalho, Spectral methods for graph clustering - A survey, Eur. J. Oper. Res., 211 (2011) pp. 221–231.
  • [4] A.Y. Ng, M.I. Jordan and Y. Weiss, On Spectral Clustering: Analysis and an algorithm, Adv. Neur. In., 14 (NIPS 2001).
  • [5] T. Inkaya, A parameter-free similarity graph for spectral clustering, Expert Syst. Appl., 42 (2015) pp. 9489–9498.
  • [6] U. von Luxburg, A Tutorial on Spectral Clustering, Stat. Comput., 17, (2007), arXiv:0711.0189 [cs.DS].
  • [7] S.X. Yu and J. Shi, Multiclass Spectral Clustering, Proceedings of the Ninth IEEE Int'l Conf. on Computer Vision (ICCV), 2003.

  • [8] B. Desgraupes, Clustering Indices, https://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf
  • [9] A. Strehl and J. Ghosh, Cluster ensembles – a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res., 3 (2002) pp. 583–617.
  • [10] C.D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
  • [11] W.M. Rand, Objective criteria for the evaluation of clustering methods, J. Am. Stat. Assoc., 66 (1971) pp. 846–850.
  • [12] F. Bach and M. Jordan, Learning spectral clustering, Proc. NIPS’03, pp. 305–312.
  • [13] T.H. Cormen, C.E. Leiserson, R.L. Rivest and C. Stein, Introduction to Algorithms (third ed.). MIT Press, (2009).
  • [14] O. Grygorash, Y. Zhou and Z. Jorgensen, Minimum spanning tree based clustering algorithms, Proc. IEEE Int'l Conf. Tools with Artificial Intelligence, 2006, pp. 73–81.

  • [15] C. Zhong, D. Miao and P. Fränti, Minimum spanning tree based split-and-merge: A hierarchical clustering method, Inform. Sciences, 181 (2011) pp. 3397–3410.
  • [16] J.R. Slagle, C.L. Chang and R.T.C. Lee, Experiments with some cluster analysis algorithms, Pattern Recogn., 6 (1974) pp. 181–187.
  • [17] L. Zelnik-Manor and P. Perona, Self-tuning spectral clustering, Adv. Neur. In., 17 (2004), pp. 1601–1608.
  • [18] I. Fischer and J. Poland, New Methods for spectral clustering, Tech. Rep. IDSIA-12-04, 2004, USI-SUPSI
  • [19] D.L. Davies and D.W. Bouldin, A cluster separation measure, IEEE T. Pattern Anal., 1 (1979), pp. 224–227.
  • [20] O. Sourina, Current projects in the home page of Olga Sourina, http://www.ntu.edu.sg/home/eosourina/projects.html
  • [21] M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA. http://archive.ics.uci.edu/ml.