1 Introduction
Clustering refers to the assignment of unlabeled data points into clusters (groups) so that points belonging to the same cluster are more similar to each other than to points in different clusters. There are various types of clustering strategies, including crisp and fuzzy clustering. In crisp (or hard) clustering, a data point can belong to one and only one cluster, while in fuzzy clustering [1], a data point can belong to several clusters. Fuzzy clustering is very useful in many applications, e.g., the categorization of news articles into a science, a business, and a sport cluster, where an article containing the keyword "gold" could belong to all three clusters. Furthermore, fuzzy clustering also makes it possible to open discussions with domain experts.
Clustering algorithms behave differently for different reasons. The first reason relates to dataset features such as geometry and the density distribution of clusters. The second reason is the choice of input parameters such as the fuzziness coefficient m (values of m close to 1 indicating that clustering is crisp, while clustering becomes fuzzier as m grows).
These parameters all affect the quality of clustering. To study how the choice of parameters impacts clustering quality, we need a quality criterion. For instance, when the dataset is well separated and has only two variables, a scatter plot can help determine the number of clusters. However, when the dataset has more than two variables, a good quality index is needed to compare various cluster configurations and choose the appropriate number of clusters.
Achieving a good clustering involves both minimizing the intracluster distance (compactness) and maximizing the intercluster distance (separability). A common issue in this process is that clusters get split up when they could instead be made more compact. Many cluster quality indices have been proposed to address this problem for hard and fuzzy clustering, but none of them is always highly efficient [2].
Moreover, there is no real-life gold standard for clustering analysis, since various experts may have different points of view about the same data and express different constraints on the number and size of clusters. Thanks to a visual index, different solutions can be presented with respect to the data. Thus, experts can make a trade-off between their own opinion and the best local solutions proposed by the visual index.
Hence, in this paper, we first review existing quality indices that are well-suited to fuzzy clustering, such as [3, 4, 5, 6, 7, 8]. Then, we propose an innovative, visual quality index for the well-known Fuzzy C-Means (FCM) method. Moreover, we compare our proposal with state-of-the-art quality indices from the literature on several numerical real-world and artificial datasets.
The remainder of this paper is organized as follows. Section 2 recalls the principles of fuzzy clustering. Section 3 surveys quality indices for fuzzy clustering. Section 4 details our visual quality index. Section 5 reports on the experimental comparison of our quality index against existing ones on different datasets. Finally, we conclude this paper and provide research perspectives in Section 6.
2 Principles of Fuzzy Clustering
Fuzzy inertia FI is a core measure in fuzzy clustering. Fuzzy inertia (Equation 1) is composed of fuzzy within-inertia FW (Equation 2) and fuzzy between-inertia FB (Equation 3). The membership coefficients u_ik of data point x_i to cluster k are usually stored in a membership matrix U that is used to calculate FI, FW and FB. Note that FI = FW + FB. Moreover, FI is not constant, because it depends on U: when U changes, the values of FW and FB also change.
FI = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \, d^2(x_i, g) \qquad (1)

FW = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \, d^2(x_i, b_k) \qquad (2)

FB = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \, d^2(b_k, g) \qquad (3)
where n is the number of instances, c is the number of clusters, m is the fuzziness coefficient (by default, m = 2), b_k is the center of cluster k, g is the grand mean (the arithmetic mean of all data; Equation 4), and function d^2 computes the squared Euclidean distance.
g = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (4)
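The decomposition FI = FW + FB can be checked numerically. The sketch below is an illustration we add (assuming numpy and the squared Euclidean distance d^2 above): it computes the three inertias from a membership matrix, with cluster centers taken as the u^m-weighted means, as in FCM.

```python
import numpy as np

def fuzzy_inertias(X, u, m=2.0):
    """Compute fuzzy inertia FI, within-inertia FW and between-inertia FB.

    X: (n, p) data matrix; u: (n, c) membership matrix whose rows sum to 1.
    Cluster centers are the u^m-weighted means (Equation 6)."""
    w = u ** m                                    # weights u_ik^m, shape (n, c)
    g = X.mean(axis=0)                            # grand mean (Equation 4)
    centers = (w.T @ X) / w.sum(axis=0)[:, None]  # b_k, shape (c, p)
    d2_xg = ((X - g) ** 2).sum(axis=1)            # d^2(x_i, g)
    d2_xb = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # d^2(x_i, b_k)
    d2_bg = ((centers - g) ** 2).sum(axis=1)      # d^2(b_k, g)
    fi = (w * d2_xg[:, None]).sum()               # Equation 1
    fw = (w * d2_xb).sum()                        # Equation 2
    fb = (w * d2_bg[None, :]).sum()               # Equation 3
    return fi, fw, fb
```

Because the centers are the weighted means, Huygens' decomposition holds exactly and FI equals FW + FB up to floating-point error.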
FCM is a common method for fuzzy clustering that adapts the principle of the K-Means algorithm [9]. FCM, proposed by [10] and extended by [11], applies to numerical data. Since numerical data are the most common case, we choose to experiment with FCM. The aim of the FCM algorithm is to minimize FW. It starts by choosing c data points as initial centroids of the clusters. Then, membership matrix values (Equation 5) are assigned to each data point in the dataset. Cluster centroids are updated based on Equation 6 until a termination criterion is met. In FCM, this criterion can be a fixed number of iterations. Alternatively, a threshold ε on the relative change of the objective can be used; the algorithm then stops when |FW_t − FW_{t−1}| / FW_t < ε, where FW_t denotes the value of FW at iteration t.
u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{d(x_i, b_k)}{d(x_i, b_j)} \right)^{\frac{2}{m-1}} \right]^{-1} \qquad (5)

b_k = \frac{\sum_{i=1}^{n} u_{ik}^{m} \, x_i}{\sum_{i=1}^{n} u_{ik}^{m}} \qquad (6)
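The alternating updates of Equations 5 and 6 can be sketched as follows. This is a minimal illustration, not a reference implementation: the random initialization, the distance floor of 1e-12 guarding against zero distances, and the stopping test on the relative change of FW are our own assumptions.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, eps=1e-4, seed=0):
    """Sketch of Fuzzy C-Means: alternate the membership update (Equation 5)
    and the centroid update (Equation 6) until the relative change of the
    objective FW falls below eps."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = X[rng.choice(n, size=c, replace=False)]  # c data points as initial centroids
    fw_prev = None
    for _ in range(max_iter):
        # squared distances d^2(x_i, b_k), floored to avoid division by zero
        d2 = np.maximum(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), 1e-12)
        # membership update (Equation 5): u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        u = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
        w = u ** m
        fw = (w * d2).sum()  # objective FW for the current centers
        # centroid update (Equation 6)
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
        if fw_prev is not None and abs(fw_prev - fw) / fw < eps:
            break
        fw_prev = fw
    return u, centers
```

On well-separated data, the membership rows sum to 1 by construction and the hardened labels (argmax over clusters) recover the groups.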
3 Fuzzy Clustering Quality Indices
According to Wang and Zhang [12], there are two groups of quality indices. Quality indices in the first group are based only on membership values. They notably include the partition coefficient index PC [3] (Equation 7; 1/c ≤ PC ≤ 1; to be maximized) and Chen and Linkens' index CL [4] (Equation 8; to be maximized). CL takes into consideration both compactness (first term of Equation 8) and separability (second term of Equation 8).
PC = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^2 \qquad (7)

CL = \frac{1}{n} \sum_{i=1}^{n} \max_{k} u_{ik} - \frac{1}{K} \sum_{k=1}^{c-1} \sum_{l=k+1}^{c} \left[ \frac{1}{n} \sum_{i=1}^{n} \min(u_{ik}, u_{il}) \right] \qquad (8)
where K = \sum_{k=1}^{c-1} k = c(c-1)/2.
Quality indices in the second group associate membership values with cluster centers and data. They include an adaptation of the Ratio index to fuzzy clustering [5] (Equation 9; to be maximized), Fukuyama and Sugeno's index FS [6] (Equation 10; to be minimized), and Xie and Beni's index XB [7, 13] (Equation 11; to be minimized).
Ratio = \frac{FB}{FW} \qquad (9)

FS = FW - FB \qquad (10)

XB = \frac{FW}{n \cdot \min_{k \neq l} d^2(b_k, b_l)} \qquad (11)
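As an illustration, these indices can be computed directly from the membership matrix and the data. The sketch below (our own, assuming numpy and FCM-style weighted-mean centers) uses the standard formulations of PC (Equation 7), FS (Equation 10) and XB (Equation 11); the exact variants cited above may differ in minor details.

```python
import numpy as np

def _weights_and_centers(X, u, m):
    # u^m weights and the corresponding FCM centers (Equation 6)
    w = u ** m
    centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return w, centers

def pc_index(u):
    """Partition coefficient (Equation 7): mean squared membership; maximize."""
    return (u ** 2).sum() / u.shape[0]

def fs_index(X, u, m=2.0):
    """Fukuyama-Sugeno (Equation 10): FW - FB; minimize."""
    w, centers = _weights_and_centers(X, u, m)
    g = X.mean(axis=0)
    fw = (w * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)).sum()
    fb = (w * ((centers - g) ** 2).sum(axis=1)[None, :]).sum()
    return fw - fb

def xb_index(X, u, m=2.0):
    """Xie-Beni (Equation 11): FW over n times the minimal squared
    center separation; minimize."""
    n, c = u.shape
    w, centers = _weights_and_centers(X, u, m)
    fw = (w * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)).sum()
    sep = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    min_sep = sep[~np.eye(c, dtype=bool)].min()  # min over k != l
    return fw / (n * min_sep)
```

On a perfectly crisp, well-separated partition, PC reaches its maximum of 1, FS is strongly negative (FB dominates FW), and XB is close to 0.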
When the number of clusters increases, the value of quality indices mechanically improves, too. The important question then is: how useful is the addition of a new cluster? To answer this question, the most common solutions are penalization and the Elbow Rule [14].
The first way to penalize a quality index is to multiply it by a quantity that diminishes the index when the number of clusters increases. In this case, the main difficulty is to choose the penalty. For instance, the penalized version of Ratio is Calinski and Harabasz's index CH [5] (Equation 12; to be maximized), where the penalty is based on both the number of clusters and the number of data points.
CH = \frac{FB / (c - 1)}{FW / (n - c)} \qquad (12)
The second way to penalize a quality index is to evaluate the index's evolution relative to the number of clusters, by considering the curve of the index's successive values. The most appropriate value of c can then be determined visually with the help of the Elbow Rule, or by algebraic calculation [15].
To construct a visual determination of the Elbow Rule, c is represented on the horizontal axis and the considered quality index on the vertical axis. Then, we look for the value of c where the curve's concavity changes. This change indicates the optimal number of clusters. To construct an algebraic determination, let I_c be the index value for c clusters. The variations of I_c before and after c are compared: in case of a positive Elbow, the second difference (I_{c+1} − I_c) − (I_c − I_{c−1}) is minimized. Yet, since the values before and after c are used in the calculation, the Elbow Rule can only be applied when more than two clusters are considered.
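The algebraic determination above can be sketched as follows; the function name and the list-based interface are ours, for illustration only.

```python
def elbow_c(index_values, c_values):
    """Return the c minimizing the second difference
    (I_{c+1} - I_c) - (I_c - I_{c-1}).

    index_values[i] is the index value for c_values[i]; c_values must be
    consecutive. Only interior points are eligible, which is why the rule
    needs more than two candidate numbers of clusters."""
    best_c, best_d2 = None, float("inf")
    for i in range(1, len(c_values) - 1):
        d2 = index_values[i + 1] - 2 * index_values[i] + index_values[i - 1]
        if d2 < best_d2:
            best_c, best_d2 = c_values[i], d2
    return best_c
```

For a curve that jumps sharply at c = 3 and then flattens, the most negative second difference sits at the elbow.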
Among all the above-stated quality indices, none gives the best result on every dataset. Thus, there is room for a new quality index that is specifically tailored to fuzzy validation and helps the user choose the value of c.
4 An Index Associated with a Visual Solution
To build a new quality index, we first consider FW to evaluate compactness and FB to evaluate separability. We can choose to calculate either FB − FW, which is similar to FS except for the sign, or FB / FW, which is similar to Ratio. Unfortunately, FI is not constant and FI = FW + FB. To take this particularity of fuzzy clustering into account, we propose to standardize FB − FW by FI, considering the Standardized Fuzzy Difference SFD = (FB − FW) / FI instead. Then, SFD ∈ [−1, 1].
Adding a new cluster often improves clustering quality mechanically. Thus, many authors penalize the quality index with respect to c (the larger c is, the stronger the penalty), e.g., CH (Section 3). To obtain a penalized index, SFD is first linearly transformed into an index belonging to [0, 1], yielding the Transformed Standardized Fuzzy Difference (Equation 13; to be maximized). Finally, by penalizing TSFD with respect to c, we obtain the Penalized Standardized Fuzzy Difference (Equation 14; to be maximized).

TSFD = \frac{1 + SFD}{2} = \frac{FI + FB - FW}{2 \, FI} \qquad (13)
PSFD = \frac{n - c}{n - 1} \, TSFD \qquad (14)
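The chain from FW and FB to the proposed indices can be sketched as follows. Note that the penalty factor (n − c)/(n − 1), which shrinks the index as c grows, is an assumption we use for illustration in place of the exact penalty of Equation 14.

```python
def sfd_indices(fw, fb, n, c):
    """Standardized Fuzzy Difference and its variants.

    SFD = (FB - FW) / FI lies in [-1, 1]; TSFD rescales it linearly to
    [0, 1] (Equation 13). The penalty factor (n - c)/(n - 1) below is an
    assumption standing in for the exact form of Equation 14."""
    fi = fw + fb                      # FI = FW + FB
    sfd = (fb - fw) / fi              # in [-1, 1]
    tsfd = (1.0 + sfd) / 2.0          # linear transform to [0, 1]
    psfd = tsfd * (n - c) / (n - 1)   # assumed penalty, decreasing in c
    return sfd, tsfd, psfd
```

For instance, with FW = 1 and FB = 3, SFD = 0.5 and TSFD = 0.75, and the penalty strictly lowers TSFD as long as c > 1.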
Instead of penalizing the quality index, another solution is to visualize the search for the best number of clusters c. A first solution is to apply the Elbow Rule to TSFD: TSFD is plotted with respect to c in Figure 1(a). The drawback of this method is that the horizontal axis corresponds to an arithmetic scale of c values, which is not fully satisfying. To fix this problem, we suggest plotting c × TSFD with respect to c, which we call Visual TSFD. Our aim is not to give an automatic solution, but to help the user visually choose the most appropriate value of c. The visualization we propose is shown in Figure 1(b), where the blue line plots c × TSFD with respect to c, the full red line is the diagonal that corresponds to the best solutions, i.e., such that TSFD = 1, and the dashed red line connects the origin to each point associated with successive values of c. The smaller the angle between the full red line and the dashed red line, the better the solution. As the value of c increases, the angle between the dashed red line and the diagonal decreases. Then, we choose the value of c beyond which the decrease becomes negligible. This value is considered the optimal number of clusters. For example, in Figure 1(b), a first candidate appears at a small value of c, a slightly larger value improves on it, and beyond that point further increases in c are not very interesting.
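Under our reading of Figure 1(b), where the blue curve plots c × TSFD against c, the angle between the diagonal and the ray from the origin to the point (c, c × TSFD) depends on TSFD alone: it equals arctan(1) − arctan(TSFD). A small sketch of this geometry (the function name and dictionary interface are ours):

```python
import math

def visual_tsfd_angles(tsfd_by_c):
    """Angle (degrees) between the diagonal (TSFD = 1) and the ray from the
    origin to the point (c, c * TSFD_c); the smaller, the better."""
    return {c: math.degrees(math.atan(1.0) - math.atan(t))
            for c, t in tsfd_by_c.items()}
```

As TSFD increases toward 1, the angle decreases toward 0, which is the behavior the user exploits when reading the chart.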
5 Experimental Validation
In this section, we compare our proposals PSFD, Visual TSFD and the Elbow Rule applied to TSFD (denoted Elbow TSFD) to state-of-the-art clustering quality indices for FCM-like clustering algorithms, i.e., PC, CL, CH, FS and XB (Section 3).
In our experiments, the FCM algorithm is parameterized with its default settings: a threshold-based termination criterion and the default fuzziness coefficient m = 2. All clustering quality indices are coded in Python version 2.7.4.
5.1 Datasets
Quality indices are compared on ten real-life datasets (Table 1; IDs 1-10) from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) and seven artificial datasets (Table 1; IDs 11-17). In real-life datasets, the true number of clusters is assimilated to the number of labels. Although using the number of labels as the number of clusters is debatable, it is acceptable if the set of descriptive variables explains the labels well. In artificial datasets, the number of clusters is known by construction. Moreover, we created new artificial datasets by introducing overlapping and noise into some of the existing datasets, namely E1071-3 [16], Ruspini [1] and E1071-5 [16] (Table 1; IDs 12-14). To create a new dataset, new data points are introduced, and each must be labeled. To obtain a dataset with overlapping, we modify the construction of the E1071 artificial datasets [16]. In the original datasets, there are three or five clusters of equal size (50), each generated according to a Gaussian distribution. To increase overlapping while retaining the same cluster size, we only change the standard deviation from 0.3 to 0.4. Then, there is no labeling problem. To introduce noise into a dataset, we add in each cluster noisy points generated by a Gaussian variable around each label's gravity center. Noisy data are often generated by distributions with positive skewness. For example, in a two-dimensional dataset, for each label, we add points that are far away from the corresponding gravity center, especially on the right-hand side, which generally contains the most points. Then, we draw a random number between 0 and 1: if it falls below a given threshold, the point is attributed to the left-hand side; otherwise, it is attributed to the right-hand side. This method yields noisy data that are respectively smaller and greater than the expected value for the considered label. This process is applied to the Ruspini dataset [1].

5.2 Experimental Results
In our experiments, all validation indices (Sections 3 and 4) are applied to all the datasets from Table 1. Since presenting all the results would take too much space, we retain only the best result(s) for each index.
Table 1. Datasets (n: number of instances; c: expected number of clusters) and best number(s) of clusters output by each quality index.

ID | Dataset | n | c | PC | CL | CH | FS | XB | PSFD | Elbow TSFD | Visual TSFD
1 | Wine | 178 | 3 | 2 | 2 | 8 | 12 | 8 | 2 | 3 | 5
2 | Iris | 150 | 3 | 2 | 2 | 3 | 3 | 3 | 2 | 3 | 3
3 | Seeds | 210 | 3 | 2 | 3 | 3 | 3 | 3 | 2 | 3 | 3
4 | Glass | 214 | 6 | 2 | 2 | 12 | 12 | 12 | 2 | 4 | 5, 7
5 | Vehicle | 846 | 4 | 2 | 2 | 2 | 2 | 5 | 2 | 3 | 4, 5
6 | Segmentation | 2310 | 7 | 2 | 4 | 4 | 4 | 12 | 12 | 3 | 7, 8
7 | Libras Movement | 360 | 15 | 2 | 18 | 16 | 16 | 18 | 2 | 14 | 14, 16
8 | Ecoli | 336 | 8 | 2 | 3 | 3 | 3 | 12 | 3 | 3 | 3, 7
9 | Yeast | 1484 | 10 | 2 | 2 | 5 | 2 | 12 | 2 | 4 | 7, 8
10 | WineQualityRed | 1599 | 6 | 2 | 2 | 6 | 7 | 6 | 2 | 3 | 6
11 | Bensaid [17] | 49 | 3 | 3 | 3 | 9 | 11 | 11 | 3 | 3 | 5
12 | E1071-3 [16] | 150 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3
13 | Ruspini [1] | 75 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 3 | 4
14 | E1071-5 [16] | 250 | 5 | 2 | 5 | 4 | 5 | 5 | 2 | 3 | 5
15 | E1071-3 overlapped | 150 | 3 | 2 | 3 | 3 | 2 | 3 | 2 | 3 | 3
16 | Ruspini noised | 95 | 4 | 4 | 12 | 4 | 4 | 4 | 4 | 4 | 4
17 | E1071-5 overlapped | 250 | 5 | 2 | 2 | 4 | 5 | 4 | 2 | 3 | 5
# of wins for real-life datasets | | | | 0 | 1 | 3 | 2 | 3 | 0 | 3 | 5
# of wins for artificial datasets | | | | 4 | 5 | 4 | 5 | 5 | 4 | 4 | 6
Total # of wins | | | | 4 | 6 | 7 | 7 | 8 | 4 | 7 | 11
As shown in Table 1, it is more difficult to predict an appropriate number of clusters for real-life datasets than for artificial datasets. Considering all indices, the average rate of success is indeed 21% on real data, against 66% on artificial data. Whatever the type of data, Visual TSFD outperforms the other indices, with 5 wins out of 10 on real-life datasets and 6 wins out of 7 on artificial datasets. The worst-performing indices achieve only 0/10 and 4/7 wins each. The other indices achieve intermediary results. In addition, when the value given by Visual TSFD is erroneous, it remains quite close to the expected number of clusters, in contrast to our closest competitor (Table 1; Wine, Glass, Segmentation, Ecoli and Bensaid). For example, the optimal number of clusters should be 6 for the Glass dataset: our closest competitor outputs 12, while Visual TSFD's results are 5 and 7. Furthermore, in Figures 2 and 3, we compare Visual TSFD and the plot obtained with the Elbow Rule applied to TSFD (labeled Elbow TSFD) with respect to c, on a sample of both real-life and artificial datasets bearing different characteristics, i.e., Glass, Vehicle, Ecoli, Ruspini, Ruspini noised and E1071-5 overlapped (Table 1). As is clearly visible in Figures 2 and 3, Visual TSFD gives a better visual idea than Elbow TSFD. Elbow TSFD indeed highlights values of 3 or 4, while the blue plot systematically indicates larger values.
Finally, since our work targets real-life datasets, for which there is no ground truth or gold standard for clustering analysis, Visual TSFD has the advantage of providing options to experts instead of outputting a single value. This makes our method more flexible than existing ones in real-life scenarios.
6 Conclusion and Perspectives
In this paper, we propose a novel quality index for FCM called Visual TSFD, which provides an overview of fuzzy clustering quality with respect to the number of clusters. We compare Visual TSFD to several clustering quality indices from the literature and experimentally show that it outperforms them on various datasets. Furthermore, Visual TSFD can also be used with categorical data through Fuzzy K-Medoids [18]. Thus, Visual TSFD makes it possible to deal with heterogeneous datasets, which makes our method a simple but noteworthy contribution, in our opinion. As a result, our next step is to design an ensemble fuzzy clustering method based on Visual TSFD that would deal with both numerical and categorical data.
Acknowledgments
This project is supported by the Rhône Alpes Region’s ARC 5: “Cultures, Sciences, Sociétés et Médiations” through A. Öztürk’s Ph.D. grant.
References
 [1] Ruspini, E.H.: Numerical methods for fuzzy clustering. Information Sciences 2(3) (1970) 319–350
 [2] Pal, N.R., Bezdek, J.C.: Correction to "On cluster validity for the fuzzy c-means model" [correspondence]. IEEE Transactions on Fuzzy Systems 5(1) (1997) 152–153
 [3] Bezdek, J.C.: Cluster validity with fuzzy sets. (1973)
 [4] Chen, M.Y., Linkens, D.A.: Rule-base self-generation and simplification for data-driven fuzzy models. In: The 10th IEEE International Conference on Fuzzy Systems. Volume 1., IEEE (2001) 424–427
 [5] Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods 3(1) (1974) 1–27
 [6] Fukuyama, Y., Sugeno, M.: A new method of choosing the number of clusters for the fuzzy c-means method. In: Proc. 5th Fuzzy Systems Symposium (1989) 247–250
 [7] Xie, X.L., Beni, G.: A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(8) (1991) 841–847
 [8] Zhang, D., Ji, M., Yang, J., Zhang, Y., Xie, F.: A novel cluster validity index for fuzzy clustering based on bipartite modularity. Fuzzy Sets and Systems 253 (2014) 122–137

 [9] MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Volume 1., Oakland, CA, USA (1967) 281–297
 [10] Dunn, J.C.: A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. (1973)
 [11] Bezdek, J.C., Ehrlich, R., Full, W.: FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10(2-3) (1984) 191–203
 [12] Wang, W., Zhang, Y.: On fuzzy cluster validity indices. Fuzzy Sets and Systems 158(19) (2007) 2095–2117
 [13] Pal, N.R., Bezdek, J.C.: On cluster validity for the fuzzy c-means model. IEEE Transactions on Fuzzy Systems 3(3) (1995) 370–379
 [14] Cattell, R.B.: The scree test for the number of factors. Multivariate Behavioral Research 1(2) (1966) 245–276
 [15] Dimitriadou, E., Dolničar, S., Weingessel, A.: An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67(1) (2002) 137–159
 [16] Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F., Chang, C.C., Lin, C.C., Meyer, M.D.: Package ‘e1071’. Version 1.68 (2017)
 [17] Bensaid, A.M., Hall, L.O., Bezdek, J.C., Clarke, L.P., Silbiger, M.L., Arrington, J.A., Murtagh, R.F.: Validityguided (re) clustering with applications to image segmentation. IEEE Transactions on Fuzzy Systems 4(2) (1996) 112–123
 [18] Park, H.S., Jun, C.H.: A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications 36(2) (2009) 3336–3341