Clustering refers to the assignment of unlabeled data points to clusters (groups) so that points belonging to the same cluster are more similar to each other than to points in different clusters. There are various types of clustering strategies, including crisp and fuzzy clustering. In crisp (or hard) clustering, a data point belongs to one and only one cluster, while in fuzzy clustering, a data point can belong to several clusters. Fuzzy clustering is useful in many applications, e.g., categorizing news articles into science, business, and sports clusters, where an article containing the keyword “gold” could belong to all three clusters. Furthermore, fuzzy clustering makes it possible to open discussions with domain experts.
Clustering algorithms behave differently for different reasons. The first reason relates to dataset features such as the geometry and density distribution of clusters. The second reason is the choice of input parameters such as the fuzziness coefficient $m$ ($m = 1$ indicating that clustering is crisp and $m > 1$ that clustering becomes fuzzy).
These parameters all affect the quality of clustering. To study how the choice of parameters impacts clustering quality, we need a quality criterion. For instance, when the dataset is well separated and has only two variables, a scatter plot can help determine the number of clusters. However, when the dataset has more than two variables, a good quality index is needed to compare various cluster configurations and choose the appropriate number of clusters.
Achieving a good clustering involves both minimizing intra-cluster distance (compactness) and maximizing inter-cluster distance (separability). A common issue in this process is that clusters are split up while they could be more compact. Many cluster quality indices have been proposed to address this problem for hard and fuzzy clustering, but none of them is highly efficient in all cases.
Moreover, there is no real-life gold standard for clustering analysis, since various experts may have different points of view about the same data and express different constraints on the number and size of clusters. Thanks to a visual index, different solutions can be presented with respect to the data. Thus, experts can make a trade-off between their opinion and the best local solutions proposed by the visual index.
Hence, in this paper, we first review existing quality indices that are well-suited to fuzzy clustering, such as [3, 4, 5, 6, 7, 8]. Then, we propose an innovative, visual quality index for the well-known Fuzzy C-Means (FCM) method. Moreover, we compare our proposal with state-of-the-art quality indices from the literature on several numerical real-world and artificial datasets.
The remainder of this paper is organized as follows. Section 2 recalls the principles of fuzzy clustering. Section 3 surveys quality indices for fuzzy clustering. Section 4 details our visual quality index. Section 5 reports on the experimental comparison of our quality index against existing ones on different datasets. Finally, we conclude this paper and provide research perspectives in Section 6.
2 Principles of Fuzzy Clustering
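Fuzzy inertia $FI$ (Equation 1) is a core measure in fuzzy clustering. It is composed of fuzzy within-inertia $FW$ (Equation 2) and fuzzy between-inertia $FB$ (Equation 3). Membership coefficients $u_{ik}$ of data point $x_i$ to cluster $k$ are usually stored in a membership matrix $U$ that is used to calculate $FI$, $FW$ and $FB$. Note that $FI = FW + FB$. Moreover, $FI$ is not constant because it depends on the membership coefficients $u_{ik}$. When they change, the values of $FW$ and $FB$ also change.
$$FI = \sum_{i=1}^{n} \sum_{k=1}^{K} u_{ik}^{m} \, d^{2}(x_i, \bar{x}) \qquad (1)$$
$$FW = \sum_{i=1}^{n} \sum_{k=1}^{K} u_{ik}^{m} \, d^{2}(x_i, c_k) \qquad (2)$$
$$FB = \sum_{i=1}^{n} \sum_{k=1}^{K} u_{ik}^{m} \, d^{2}(c_k, \bar{x}) \qquad (3)$$
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (4)$$
where $n$ is the number of instances, $K$ is the number of clusters, $m$ is the fuzziness coefficient (by default, $m = 2$), $c_k$ is the center of cluster $k$, $\bar{x}$ is the grand mean (the arithmetic mean of all data, Equation 4), and function $d^{2}$ computes the squared Euclidean distance.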
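As an illustration, the three inertias can be computed directly from a membership matrix. The following is a minimal Python sketch, not the implementation used in this paper; the names `FI`, `FW` and `FB`, the helper functions and the toy data are assumptions made for the example. It assumes a fuzziness coefficient of 2 and centers computed as membership-weighted means, in which case fuzzy inertia decomposes numerically into within- and between-inertia.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fuzzy_centers(X, U, m=2.0):
    """Membership-weighted means: c_k = sum_i u_ik^m x_i / sum_i u_ik^m."""
    n, K, p = len(X), len(U[0]), len(X[0])
    centers = []
    for k in range(K):
        w = [U[i][k] ** m for i in range(n)]
        total = sum(w)
        centers.append([sum(w[i] * X[i][j] for i in range(n)) / total
                        for j in range(p)])
    return centers


def fuzzy_inertias(X, U, centers, m=2.0):
    """Return (FI, FW, FB): total, within and between fuzzy inertia."""
    n, K, p = len(X), len(U[0]), len(X[0])
    grand_mean = [sum(x[j] for x in X) / n for j in range(p)]
    FI = FW = FB = 0.0
    for i in range(n):
        for k in range(K):
            w = U[i][k] ** m
            FI += w * sq_dist(X[i], grand_mean)
            FW += w * sq_dist(X[i], centers[k])
            FB += w * sq_dist(centers[k], grand_mean)
    return FI, FW, FB


# Toy data (made up for the example): four points, two clusters.
X = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]]
U = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
centers = fuzzy_centers(X, U)
FI, FW, FB = fuzzy_inertias(X, U, centers)
print(FI, FW + FB)  # the two values agree up to rounding error
```

Changing the rows of `U` changes all three values, which is why fuzzy inertia cannot be treated as a constant of the dataset.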
FCM is a common method for fuzzy clustering that adapts the principle of the K-Means algorithm. FCM, proposed by Dunn [10] and extended by Bezdek et al. [11], applies to numerical data. Since numerical data are the most common case, we choose to test our proposals with FCM.
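The aim of the FCM algorithm is to minimize fuzzy within-inertia $FW$. It starts by choosing $K$ data points as initial centroids of the clusters. Then, membership matrix values (Equation 5) are assigned to each data point in the dataset, and the centroids of clusters are updated based on Equation 6. Both steps are repeated until a termination criterion is reached. In FCM, this criterion can be a fixed number of iterations. Alternatively, a threshold can be used, in which case the algorithm stops when the change between two successive iterations falls below this threshold.
$$u_{ik} = \left[ \sum_{j=1}^{K} \left( \frac{d^{2}(x_i, c_k)}{d^{2}(x_i, c_j)} \right)^{\frac{1}{m-1}} \right]^{-1} \qquad (5)$$
$$c_k = \frac{\sum_{i=1}^{n} u_{ik}^{m} \, x_i}{\sum_{i=1}^{n} u_{ik}^{m}} \qquad (6)$$
with the same notation as in Equations 1 to 4.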
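The alternating scheme described above can be sketched in a few lines of Python. This is an illustrative sketch of the standard FCM updates, not the code used in our experiments. The function names, the optional `init` argument, the default values of `max_iter` and `tol`, and the stopping test on the largest squared centroid shift are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_memberships(X, centers, m=2.0):
    """u_ik is proportional to d2(x_i, c_k) ** (-1 / (m - 1)); rows sum to 1."""
    U = []
    for x in X:
        d2 = [sq_dist(x, c) for c in centers]
        if min(d2) == 0.0:
            # The point coincides with a centroid: full membership to it.
            hit = d2.index(0.0)
            U.append([1.0 if k == hit else 0.0 for k in range(len(centers))])
        else:
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            total = sum(inv)
            U.append([v / total for v in inv])
    return U


def update_centers(X, U, m=2.0):
    """c_k is the mean of the data weighted by u_ik ** m."""
    n, K, p = len(X), len(U[0]), len(X[0])
    centers = []
    for k in range(K):
        w = [U[i][k] ** m for i in range(n)]
        total = sum(w)
        centers.append([sum(w[i] * X[i][j] for i in range(n)) / total
                        for j in range(p)])
    return centers


def fcm(X, K, m=2.0, max_iter=100, tol=1e-6, seed=0, init=None):
    """Alternate both updates until centroids stop moving or max_iter is hit."""
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(X, K)
    centers = [list(c) for c in start]
    for _ in range(max_iter):
        U = update_memberships(X, centers, m)
        new_centers = update_centers(X, U, m)
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return update_memberships(X, centers, m), centers


# Two well-separated toy groups (made up for the example).
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
     [8.0, 8.0], [8.0, 9.0], [9.0, 8.0], [9.0, 9.0]]
U, centers = fcm(X, K=2, init=[X[0], X[-1]])
print(sorted(centers))
```

When `init` is omitted, the sketch draws $K$ data points at random as initial centroids, as in the description above; fixing `init` only makes the example reproducible.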
3 Fuzzy Clustering Quality Indices
According to Wang et al. [12], there are two groups of quality indices. Quality indices in the first group are based only on membership values. They notably include the partition coefficient index [3] (Equation 7; to be maximized) and Chen and Linkens’ index [4] (Equation 8; to be maximized). Chen and Linkens’ index takes into consideration both compactness (its first term) and separability (its second term).
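To make the first group concrete, here is a minimal Python sketch of the partition coefficient in its usual form, i.e., the mean over data points of the sum of squared memberships. The function name and the two toy membership matrices are assumptions made for the example.

```python
def partition_coefficient(U):
    """Mean over data points of the sum of squared membership coefficients."""
    n = len(U)
    return sum(u ** 2 for row in U for u in row) / n


crisp = [[1.0, 0.0], [0.0, 1.0]]    # every point fully in one cluster
uniform = [[0.5, 0.5], [0.5, 0.5]]  # every point equally shared
print(partition_coefficient(crisp))    # 1.0
print(partition_coefficient(uniform))  # 0.5
```

With two clusters, the value thus ranges from 0.5 (maximally fuzzy partition) to 1 (crisp partition), which illustrates why the index only reflects membership values and ignores the geometry of the data.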
Quality indices in the second group associate membership values to cluster centers and data. They include an adaptation of the Ratio index to fuzzy clustering (Equation 9; to be maximized), Fukuyama and Sugeno’s index [6] (Equation 10; to be minimized), and Xie and Beni’s index [7, 13] (Equation 11; to be minimized).
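The second group can be illustrated with commonly used forms of two of these indices. The sketch below is hedged: it assumes that Xie and Beni's index divides fuzzy within-inertia by the number of points times the smallest squared distance between two centers, and that Fukuyama and Sugeno's index is the difference between within- and between-inertia with the grand mean of the data as reference point (some formulations use the mean of the cluster centers instead). Function names and data are made up, and crisp memberships are used only so that the values can be checked by hand.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_inertia(X, U, centers, m=2.0):
    """Sum over points and clusters of u_ik^m * d2(x_i, c_k)."""
    return sum(U[i][k] ** m * sq_dist(X[i], centers[k])
               for i in range(len(X)) for k in range(len(centers)))


def xie_beni(X, U, centers, m=2.0):
    """Within-inertia over n times the smallest squared center separation."""
    min_sep = min(sq_dist(centers[j], centers[k])
                  for j in range(len(centers))
                  for k in range(j + 1, len(centers)))
    return within_inertia(X, U, centers, m) / (len(X) * min_sep)


def fukuyama_sugeno(X, U, centers, m=2.0):
    """Within-inertia minus between-inertia (grand mean as reference)."""
    n, p = len(X), len(X[0])
    grand_mean = [sum(x[j] for x in X) / n for j in range(p)]
    between = sum(U[i][k] ** m * sq_dist(centers[k], grand_mean)
                  for i in range(n) for k in range(len(centers)))
    return within_inertia(X, U, centers, m) - between


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers = [[0.0, 1.0], [10.0, 1.0]]
U = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(xie_beni(X, U, centers))         # 4 / (4 * 100) = 0.01
print(fukuyama_sugeno(X, U, centers))  # 4 - 100 = -96.0
```

Both values are small or negative here because the two groups are compact and far apart, which is the situation these indices are meant to reward when minimized.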
When the number of clusters increases, the value of quality indices mechanically increases, too. Then, the important question is: how useful is the addition of a new cluster? To answer this question, the most common solutions are penalization and the Elbow Rule [14].
The first way to penalize a quality index is to multiply it by a quantity that diminishes the index when the number of clusters increases. In this case, the main difficulty is to choose the penalty. For instance, the penalized version of the Ratio index is Caliński and Harabasz’s index [5] (Equation 12; to be maximized), where the penalty is based on both the number of clusters and the number of data points.
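The effect of such a penalty can be sketched as follows. The example assumes that the fuzzy Ratio index is the ratio of between- to within-inertia and that the penalized version keeps the classical Caliński and Harabasz factor, i.e., the ratio is multiplied by the number of points minus the number of clusters and divided by the number of clusters minus one. Function names and input values are made up.

```python
def fuzzy_ratio(FW, FB):
    """Between-inertia divided by within-inertia (to be maximized)."""
    return FB / FW


def fuzzy_calinski_harabasz(FW, FB, n, K):
    """Ratio index multiplied by the classical penalty (n - K) / (K - 1)."""
    return fuzzy_ratio(FW, FB) * (n - K) / (K - 1)


# Same inertias, increasing number of clusters: the penalty factor shrinks.
for K in (2, 3, 4):
    print(K, fuzzy_calinski_harabasz(FW=4.0, FB=100.0, n=40, K=K))
```

For fixed inertias, the penalized value decreases as the number of clusters grows, so an additional cluster is only rewarded if it improves the ratio enough to offset the penalty.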
The second way to penalize a quality index is to evaluate how the index evolves relative to the number of clusters $K$, by considering the curve of the index’s successive values. The most appropriate value of $K$ can be determined visually with the help of the Elbow Rule or by algebraic calculation [15].
To construct a visual determination of the Elbow Rule, $K$ is represented on the horizontal axis and the considered quality index on the vertical axis. Then, we look for the value of $K$ where there is a change in the curve’s concavity. This change represents the optimal number of clusters. To construct an algebraic determination, let $I_K$ be the index value for $K$ clusters. The variations of $I_K$ before and after $K$ are compared. In case of a positive Elbow, the second difference $(I_{K+1} - I_K) - (I_K - I_{K-1})$ is minimized. Yet, since the values before and after $K$ are used for calculation, the Elbow Rule can be applied to more than two clusters only.
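The algebraic determination can be sketched as follows, assuming an index to be maximized that is evaluated for consecutive numbers of clusters. The function name and the index values are made up for the example.

```python
def elbow_by_second_difference(values):
    """Return the number of clusters that minimizes the second difference.

    `values` maps consecutive numbers of clusters to index values.
    The first and last entries cannot be selected because they lack a
    neighbor on one side.
    """
    ks = sorted(values)
    best_k, best_d2 = None, None
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        d2 = (values[next_k] - values[k]) - (values[k] - values[prev_k])
        if best_d2 is None or d2 < best_d2:
            best_k, best_d2 = k, d2
    return best_k


# Hypothetical index values: a sharp gain up to 3 clusters, then a plateau.
values = {2: 0.40, 3: 0.70, 4: 0.78, 5: 0.82, 6: 0.84, 7: 0.85}
print(elbow_by_second_difference(values))  # 3
```

Since the smallest number of clusters in `values` is 2 and has no left neighbor, the rule can only return 3 or more, which matches the restriction stated above.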
Among all the above-stated quality indices, no single index gives the best result for every dataset. Thus, there is room for a new quality index that is specifically tailored for fuzzy validation and helps the user choose the value of $K$.
4 An Index Associated with a Visual Solution
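To build a new quality index, we first consider fuzzy within-inertia $FW$ to evaluate compactness and fuzzy between-inertia $FB$ to evaluate separability. We can choose to calculate either the difference $FB - FW$, which is similar to Fukuyama and Sugeno’s index except for the sign, or the ratio $FB / FW$, which is similar to the Ratio index. Unfortunately, fuzzy inertia $FI = FW + FB$ is not constant. To take this particularity of fuzzy clustering into account, we propose to standardize the difference by considering the Standardized Fuzzy Difference $SFD = (FB - FW) / FI$ instead. Then, $SFD \in [-1, 1]$.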
Adding a new cluster often improves clustering quality mechanically. Thus, many authors penalize the quality index with respect to the number of clusters, e.g., Caliński and Harabasz’s index (Section 3). To obtain a penalized index, $SFD$ is first linearly transformed into an index belonging to $[0, 1]$, obtaining the Transformed Standardized Fuzzy Difference $TSFD$ (Equation 13; to be maximized). Finally, by penalizing $TSFD$, we obtain the Penalized Standardized Fuzzy Difference $PSFD$ (Equation 14; to be maximized).
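The standardization and the linear transformation can be illustrated numerically. The sketch below rests on two assumptions: the standardized difference is the between-inertia minus the within-inertia, divided by their sum (the fuzzy inertia), and the linear transformation maps the interval from -1 to 1 onto the interval from 0 to 1 by adding one and halving. Function names and input values are made up.

```python
def sfd(FW, FB):
    """Standardized difference (FB - FW) / (FW + FB), between -1 and 1."""
    return (FB - FW) / (FW + FB)


def tsfd(FW, FB):
    """Linear map of the standardized difference onto [0, 1]."""
    return (1.0 + sfd(FW, FB)) / 2.0


FW, FB = 4.0, 100.0
print(sfd(FW, FB))   # 96 / 104
print(tsfd(FW, FB))  # 100 / 104
```

Under these assumptions, the transformed index simplifies to the share of between-inertia in fuzzy inertia, since adding one to the standardized difference gives twice the between-inertia divided by the fuzzy inertia.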
Instead of penalizing the quality index, another solution is to visualize the search for the best number of clusters $K$. A first solution is to apply the Elbow Rule to the index, which is plotted with respect to $K$ in Figure 1(a). The drawback of this method is that the horizontal axis corresponds to an arithmetic scale of $K$ values, which is not satisfying. To fix this problem, we suggest a different plot, which we call Visual TSFD. Our aim is not to give an automatic solution, but to help the user visually choose the most appropriate value of $K$. The visualization we propose is shown in Figure 1(b), where the blue line is the proposed plot, the full red line is the diagonal that corresponds to the best solutions, and each dashed red line connects the origin to the point associated with a value of $K$. The smaller the angle between the full red line and a dashed red line, the better the solution. As the value of $K$ increases, the angle between the dashed red line and the diagonal decreases. Then, we choose the value of $K$ beyond which the decrease becomes negligible. This value is considered as the optimal number of clusters. In Figure 1(b), for example, a first value of $K$ gives an acceptable solution, a larger value gives a better one, and values beyond that are not worth considering.
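The angle criterion can be sketched numerically, although the choice is meant to be made visually by the user. The example assumes that each candidate number of clusters yields one point lying on or below the diagonal; the coordinates, the function names and the threshold `min_gain` (in degrees) are hypothetical and only serve to show how a "negligible decrease" could be detected.

```python
import math


def angle_to_diagonal(x, y):
    """Angle in degrees between the line from the origin to (x, y) and y = x."""
    return 45.0 - math.degrees(math.atan2(y, x))


def pick_k(points, min_gain=1.0):
    """Return the first K after which the angle decreases by less than min_gain."""
    ks = sorted(points)
    angles = [angle_to_diagonal(*points[k]) for k in ks]
    for i in range(len(ks) - 1):
        if angles[i] - angles[i + 1] < min_gain:
            return ks[i]
    return ks[-1]


# Hypothetical points, one per candidate number of clusters.
points = {2: (10.0, 5.0), 3: (9.0, 7.0), 4: (8.5, 7.5), 5: (8.2, 7.3)}
for k in sorted(points):
    print(k, round(angle_to_diagonal(*points[k]), 2))
print(pick_k(points))  # 4
```

In this made-up example, the angle drops markedly up to four clusters and by only about a quarter of a degree afterwards, so four clusters would be retained. A different threshold would formalize a different trade-off, which is why the final decision is left to the user.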
5 Experimental Validation
In this section, we compare our proposals, including Visual TSFD and the use of the Elbow Rule, to state-of-the-art clustering quality indices for FCM-like clustering algorithms (Section 3).
In our experiments, the FCM algorithm is parameterized with its default settings for the termination criterion and the fuzziness coefficient. All clustering quality indices are coded in Python version 2.7.4.
Quality indices are compared on ten real-life datasets (Table 1; IDs 1-10) from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) and seven artificial datasets (Table 1; IDs 11-17). In real-life datasets, the true number of clusters is assumed to equal the number of labels. Although using the number of labels as the number of clusters is debatable, it is acceptable if the set of descriptive variables explains the labels well. In artificial datasets, the number of clusters is known by construction. Moreover, we created new artificial datasets by introducing overlapping and noise into some of the existing datasets, such as E1071-3 [16], Ruspini [1] and E1071-5 [16] (Table 1; IDs 12-14). To create a new dataset, new data points are introduced, and each must be labeled.
To obtain a dataset with overlapping, we modify the construction of the E1071 artificial datasets [16]. In the original datasets, there are three or five clusters of equal size (50), each generated according to a Gaussian distribution with a standard deviation of 0.3. To increase overlapping in the three clusters while retaining the same cluster size, we only change the standard deviation from 0.3 to 0.4. Then, there is no labeling problem.
To introduce noise in a dataset, we add to each cluster noisy points generated by a Gaussian variable around the gravity center of each label. Noisy data are often generated by distributions with positive skewness. For example, in a two-dimensional dataset, for each label, we add points that are far away from the corresponding gravity center, especially on the right-hand side, which generally contains the most points. Then, we draw a random number between 0 and 1. Depending on whether this number falls below a fixed threshold, the point is attributed to the left-hand side or to the right-hand side. This method helps obtain noisy data that are, respectively, smaller and greater than the expected value for the considered label. This process is applied to the Ruspini dataset [1].
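The construction of the overlapped datasets can be sketched as follows. Only the cluster size (50) and the two standard deviations (0.3 and 0.4) come from the description above; the cluster means, the function name and the random seed are hypothetical, and the sketch does not cover the skewed noise procedure.

```python
import random


def gaussian_clusters(means, std, size=50, seed=0):
    """Draw `size` labeled points per cluster around each mean."""
    rng = random.Random(seed)
    data = []
    for label, mean in enumerate(means):
        for _ in range(size):
            point = [rng.gauss(mu, std) for mu in mean]
            data.append((point, label))
    return data


# Hypothetical cluster means for a three-cluster, two-dimensional dataset.
means = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
original = gaussian_clusters(means, std=0.3)
overlapped = gaussian_clusters(means, std=0.4)
print(len(original), len(overlapped))  # 150 150
```

Because every point is drawn around a known mean, its label is known by construction, which is why increasing the standard deviation raises no labeling problem.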
5.2 Experimental Results
In our experiments, all validation indices (Sections 3 and 4) are applied to all the datasets from Table 1. Since presenting all the results would take too much space, we retain only the best results for each index.
| # of wins for real-life datasets | 0 | 1 | 3 | 2 | 3 | 0 | 3 | 5 |
| # of wins for artificial datasets | 4 | 5 | 4 | 5 | 5 | 4 | 4 | 6 |
| Total # of wins | 4 | 6 | 7 | 7 | 8 | 4 | 7 | 11 |
As shown in Table 1, it is more difficult to predict an appropriate number of clusters for real-life datasets than for artificial datasets. Considering all indices, the average rate of success is indeed 21% in the case of real data, against 66% in the case of artificial data. Whatever the type of data, Visual TSFD outperforms the other indices, with 5 wins out of 10 in the case of real datasets, and 6 wins out of 7 in the case of artificial datasets. The worst results are obtained by two indices that each achieve 0/10 and 4/7 wins. The other indices achieve intermediate results. In addition, when the value given by Visual TSFD is erroneous, it is quite close to the expected number of clusters, in contrast to our closest competitor (Table 1; Wine, Glass, Segmentation, Ecoli and Bensaid). For example, the optimal number of clusters should be 6 for the Glass dataset, and Visual TSFD’s results are 5 and 7.
Furthermore, we compare in Figures 2 and 3 Visual TSFD and the plot obtained with the Elbow Rule (labeled Elbow) with respect to the number of clusters, on a sample of both real-life and artificial datasets bearing different characteristics, i.e., Glass, Vehicle, Ecoli, Ruspini, Ruspini_noised and E1071-5-overlapped (Table 1). As is clearly visible from Figures 2 and 3, Visual TSFD gives a better visual idea than Elbow. Elbow indeed highlights values of 3 or 4, while the blue plot systematically indicates larger values.
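The average success rates quoted above can be rechecked from the win counts in the summary rows of Table 1 (eight indices, ten real-life and seven artificial datasets). The variable names in this short check are chosen for the example.

```python
real_wins = [0, 1, 3, 2, 3, 0, 3, 5]        # wins per index, 10 real datasets
artificial_wins = [4, 5, 4, 5, 5, 4, 4, 6]  # wins per index, 7 artificial ones
n_real, n_artificial = 10, 7

real_rate = sum(real_wins) / (len(real_wins) * n_real)
artificial_rate = sum(artificial_wins) / (len(artificial_wins) * n_artificial)
totals = [r + a for r, a in zip(real_wins, artificial_wins)]

print(round(100 * real_rate), round(100 * artificial_rate))  # 21 66
print(totals)  # [4, 6, 7, 7, 8, 4, 7, 11]
```

The real-life rate is 17 wins out of 80 index-dataset pairs and the artificial rate is 37 out of 56, which gives the reported 21% and 66%.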
Finally, since our work targets real-life datasets, there is no ground truth or gold standard for clustering analysis. In such a context, Visual TSFD has the advantage of providing options to experts instead of outputting a single value. This makes our method more flexible than existing ones in real-life scenarios.
6 Conclusion and Perspectives
In this paper, we propose a novel quality index for FCM called Visual TSFD, which provides an overview of fuzzy clustering with respect to the number of clusters. We compare Visual TSFD to several clustering quality indices from the literature and experimentally show that it achieves the most wins among the compared indices (11 out of 17 datasets). Furthermore, Visual TSFD can also be used in the case of categorical data with Fuzzy K-Medoids [18]. Thus, Visual TSFD makes it possible to deal with heterogeneous datasets, which makes our method a simple but noteworthy contribution, in our opinion. Our next step is to design an ensemble fuzzy clustering method based on Visual TSFD that would deal with both numerical and categorical data.
This project is supported by the Rhône Alpes Region’s ARC 5: “Cultures, Sciences, Sociétés et Médiations” through A. Öztürk’s Ph.D. grant.
- [1] Ruspini, E.H.: Numerical methods for fuzzy clustering. Information Sciences 2(3) (1970) 319–350
- [2] Pal, N.R., Bezdek, J.C.: Correction to “On cluster validity for the fuzzy c-means model” [correspondence]. IEEE Transactions on Fuzzy Systems 5(1) (1997) 152–153
- [3] Bezdek, J.C.: Cluster validity with fuzzy sets. (1973)
- [4] Chen, M.Y., Linkens, D.A.: Rule-base self-generation and simplification for data-driven fuzzy models. In: Fuzzy Systems, 2001. The 10th IEEE International Conference on. Volume 1., IEEE (2001) 424–427
- [5] Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Communications in Statistics-theory and Methods 3(1) (1974) 1–27
- [6] Fukuyama, Y., Sugeno, M.: A new method of choosing the number of clusters for the fuzzy c-mean method. In: Proc. 5th Fuzzy Syst. Symp., 1989. (1989) 247–250
- [7] Xie, X.L., Beni, G.: A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(8) (1991) 841–847
- [8] Zhang, D., Ji, M., Yang, J., Zhang, Y., Xie, F.: A novel cluster validity index for fuzzy clustering based on bipartite modularity. Fuzzy Sets and Systems 253 (2014) 122–137
- [9] MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. Volume 1., Oakland, CA, USA. (1967) 281–297
- [10] Dunn, J.C.: A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. (1973)
- [11] Bezdek, J.C., Ehrlich, R., Full, W.: FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10(2-3) (1984) 191–203
- [12] Wang, W., Zhang, Y.: On fuzzy cluster validity indices. Fuzzy Sets and Systems 158(19) (2007) 2095–2117
- [13] Pal, N.R., Bezdek, J.C.: On cluster validity for the fuzzy c-means model. IEEE Transactions on Fuzzy Systems 3(3) (1995) 370–379
- [14] Cattell, R.B.: The scree test for the number of factors. Multivariate Behavioral Research 1(2) (1966) 245–276
- [15] Dimitriadou, E., Dolničar, S., Weingessel, A.: An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67(1) (2002) 137–159
- [16] Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F., Chang, C.C., Lin, C.C., Meyer, M.D.: Package ‘e1071’. Version 1.6-8 (2017)
- [17] Bensaid, A.M., Hall, L.O., Bezdek, J.C., Clarke, L.P., Silbiger, M.L., Arrington, J.A., Murtagh, R.F.: Validity-guided (re)clustering with applications to image segmentation. IEEE Transactions on Fuzzy Systems 4(2) (1996) 112–123
- [18] Park, H.S., Jun, C.H.: A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications 36(2) (2009) 3336–3341