Clustering, an important data analysis technique for finding information underlying unlabelled data, is widely applied in areas including data mining, machine learning and information retrieval. Clustering algorithms based on various theories have been developed over the past decades jain2010data ; filippone2008survey ; xu2005survey . To handle data that is too large to be stored in memory, incremental clustering approaches have been proposed. In incremental clustering, data is processed chunk by chunk, where each chunk is a packet of the data. The key idea of these approaches is to find representatives for each cluster in each data chunk; the final data analysis is then carried out based on the representatives identified from all the chunks. Most incremental clustering approaches handle single view data, in which each object is represented by a single kind of features. However, large multi-view data has become prevalent nowadays because huge amounts of data can easily be collected every day from different sources or represented by different features. For example, an image can be represented by different kinds of features, including color, texture, histogram, etc., which can be considered as different views. As each view can provide distinctive yet complementary information, mining large multi-view data is important for different parties, including enterprises and government organizations. Under this circumstance, there is a need to develop incremental clustering approaches that can handle data which is both large and multi-view. For large multi-view data clustering, the key challenge is how to categorize the data set within the incremental clustering framework while making good use of related information from the multiple views of the data. In this paper we propose a new incremental clustering approach called incremental minimax optimization based fuzzy clustering (IminimaxFCM) to handle large multi-view data.
In IminimaxFCM, the entire data set is processed chunk by chunk. As every data object in each chunk has multiple views, minimax optimization is applied to integrate multiple views to achieve consensus results.
For large data clustering, different strategies have been applied in the literature. For example, a random sampling strategy is used in CLARA Kaufman2009 and CLARANS Ng:1994:EEC:645920.672827 to handle large data. A summarization strategy is applied in BIRCH livny1996birch . In guha2003clustering , an incremental clustering strategy for stream data is adopted: a k-Median based algorithm called LSEARCH is developed using a chunk-based processing style. In chitta2011approximate ; chitta2012efficient , random sampling and random Fourier maps based kernel k-means are proposed to perform large scale kernel clustering. Besides the hard clustering approaches discussed above, similar strategies are also used in soft or fuzzy clustering approaches in order to handle more real world data sets where data objects may not be well separated. It has been discussed that soft clustering, such as the popular approaches in the literature mei2012fuzzy , may capture the natural structure of a data set more accurately, since each object may belong to all clusters with various degrees of membership. For example, an incremental clustering strategy is applied in an online Nonnegative Matrix Factorization approach wang2011efficient for large scale document clustering. In addition, several incremental fuzzy clustering algorithms based on the well known Fuzzy c-means (FCM) Bezdek:1981:PRF:539444 and Fuzzy c-medoids (FCMD) krishnapuram2001low have been developed. The popular algorithms include single-pass FCM (SPFCM) hore2007single , online FCM (OFCM) hore2008online ; hore2009scalable , online fuzzy c-medoids (OFCMD) labroche2010new and history based online fuzzy c-medoids (HOFCMD) labroche2010new . To improve the clustering performance, incremental fuzzy clustering with multiple medoids wang2014incremental has been proposed. However, all the hard and soft clustering approaches above are single view clustering approaches in which the data is represented by single view features or a single relational matrix; hence they are not suitable for handling large multi-view data.
To handle large multi-view data, multi-view clustering techniques can be integrated into the incremental clustering framework. Many multi-view clustering approaches have been proposed in the literature, mainly following three strategies. The first strategy is to formulate an objective function that minimizes the disagreement of different views and optimize it directly to get the consensus clustering result 2011NIPScosc ; 2012ICDMmvKernelKmeans ; 2013IJCAIRmultiviewKmeans . The second strategy, applied in 2012cvpraffinity ; 2013AAAIconvexSubspaceMultiview , consists of two steps: first, a unified representation (view) is identified or learned; then an existing clustering algorithm such as K-means macqueen1967some or spectral clustering ng2002spectral is used to get the final clustering result. In the third strategy bruno2009multiview ; greene2009matrix , each view of the data is processed independently and the consensus clustering result is achieved in an additional step based on the result of each view.
The above approaches are also based on hard clustering, in which each object is assigned to only one cluster. However, in many real world applications more information about the cluster assignment needs to be captured. For example, in document categorization each document may belong to several topics with different degrees of attachment. Moreover, each document may be written in different languages or collected from various sources. Therefore, several soft or fuzzy multi-view clustering approaches have been proposed. For example, a Nonnegative Matrix Factorization (NMF) lee1999learning based approach is proposed in liu2013multi . Fuzzy clustering based approaches have also been proposed, including CoFKM 2009COFKM and WV-Co-FCM 2014WVCOFCM . In wang2014multi , minimax optimization based multi-view spectral clustering is proposed to handle multi-view relational data; the experimental results have shown that minimax optimization helps to integrate different views more effectively. However, all these approaches assume that the data set can be stored and processed in batch; hence they are not suitable for clustering large multi-view data which may be too large to be stored in memory.
To handle large multi-view vector data, we propose a new incremental clustering approach called incremental minimax optimization based fuzzy clustering (IminimaxFCM). To the best of our knowledge, this is the first research effort to develop an incremental fuzzy clustering approach for large multi-view data based on minimax optimization. Moreover, inspired by recent advances in minimax optimization and fuzzy clustering, IminimaxFCM can integrate different views better and also produces a soft assignment for each object. In IminimaxFCM, the data is processed one chunk at a time instead of loading the entire data set into memory for analysis. Multi-view centroids for each cluster in a data chunk are identified based on minimax optimization by minimizing the maximum disagreement of the weighted views. These multi-view centroids are then used as the representatives of the multi-view data to carry out the final data partition. The detailed formulation, derivation and an in-depth analysis of the approach are given in the paper. Experiments with IminimaxFCM on six real world data sets show that it achieves better clustering accuracy than related incremental fuzzy clustering approaches.
The rest of the paper is organized as follows: in the next section, a review of the related incremental fuzzy approaches reported in the literature is given. In section 3, the details of the proposed approach IminimaxFCM are presented. Comprehensive experiments on several real world data sets are conducted and the detailed comparison results with related techniques are reported in section 4. Finally, conclusions are drawn in section 5.
2 Related work
In this section, two related single view based incremental fuzzy clustering algorithms are reviewed. Their common and unique characteristics are discussed as well.
Here, m is the fuzzifier, K is the number of clusters and N is the number of objects; u_ci is the membership of object x_i in cluster c and v_c is the centroid of the c-th cluster. As FCM is not able to handle data that is too large to be stored in memory, SPFCM was proposed to process the data chunk by chunk. For each chunk, a set of centroids is calculated to represent the chunk, with one centroid per cluster. In SPFCM, the centroids identified from the previous chunk are combined into the next chunk, and the final set of centroids for the entire data is generated after the last chunk is processed. Instead of applying FCM directly on each chunk, weighted FCM (wFCM) is used in SPFCM to determine the final set of centroids.
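The alternating FCM updates that SPFCM and OFCM build on can be sketched as follows. This is a minimal NumPy sketch of the standard FCM membership and centroid updates; the function and variable names are ours, not from the paper:

```python
import numpy as np

def fcm_step(X, V, m=2.0):
    """One alternating-optimization step of standard FCM.

    X: (N, D) data, V: (K, D) current centroids, m: fuzzifier (> 1).
    Returns updated memberships U of shape (K, N) and centroids V of shape (K, D).
    """
    # Squared Euclidean distances between each centroid and each object.
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # (K, N)
    d2 = np.maximum(d2, 1e-12)                               # avoid division by zero
    # Membership update: u_ci proportional to d_ci^(-2/(m-1)), normalized over clusters.
    inv = d2 ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=0, keepdims=True)
    # Centroid update: weighted mean of the data with weights u_ci^m.
    W = U ** m
    V_new = (W @ X) / W.sum(axis=1, keepdims=True)
    return U, V_new
```

Iterating this step until the centroids stabilize gives the batch FCM result; the incremental variants below apply the same updates per chunk.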
The significant difference between SPFCM and OFCM is the way the centroids of each chunk are handled. In SPFCM, the weight for the centroid of each cluster in each chunk is calculated as follows:
Here, w_c is the weight of the centroid of the c-th cluster, n is the number of objects in the chunk, K is the number of clusters, u_ci is the membership of object x_i in cluster c and w_i is the weight of object x_i. For the first chunk of the data (p = 1), w_i is set to 1 for every object. From the second chunk of data onwards, the K weighted centroids from the previous chunk are combined with the p-th chunk of data. The objects are then clustered by wFCM, in which the weights of the objects in the p-th chunk are all set to 1 and the weights of the combined centroids are those calculated from the previous chunk. These steps continue until the last chunk of data is processed.
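The centroid-weight computation described above can be sketched as follows (our illustrative names; the weight of each carried-over centroid is the membership-weighted mass it absorbed from the chunk):

```python
import numpy as np

def spfcm_weights(U, w):
    """SPFCM weight of each chunk centroid: w_c = sum_i u_ci * w_i.

    U: (K, n) wFCM memberships for the current chunk, whose n columns
    cover the chunk's data objects plus any centroids carried over from
    the previous chunk; w: (n,) weights of those objects.
    """
    return U @ np.asarray(w, dtype=float)
```

Since each column of U sums to 1, total weight is conserved: the centroid weights of a chunk sum to the total weight of the objects fed into it.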
Similar to SPFCM, OFCM hore2008online is also developed based on FCM, and the data is processed chunk by chunk. However, there are two significant differences between OFCM and SPFCM. The first is the way the centroids of each chunk are handled. In OFCM, every chunk is processed independently, and an additional step is needed to identify the final set of centroids for the entire data set based on all the centroids identified from the chunks. The second is the way the weight of each cluster in a chunk is calculated. In OFCM, the weight is calculated as follows:
The weighted centroids are then input into wFCM to identify the final set of centroids for the entire data set.
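The final wFCM pass over the collected weighted centroids differs from plain FCM only in that object weights enter the centroid update. A minimal sketch (our names, assuming squared Euclidean distance as in FCM):

```python
import numpy as np

def wfcm_step(X, V, w, m=2.0):
    """One step of weighted FCM (wFCM), as used to cluster the weighted
    centroids collected from all chunks into the final set of centroids.

    X: (N, D) objects (here: collected centroids), V: (K, D) centroids,
    w: (N,) object weights, m: fuzzifier.
    """
    d2 = np.maximum(((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2), 1e-12)
    inv = d2 ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=0, keepdims=True)   # membership update, as in FCM
    Wm = (U ** m) * w[None, :]                 # object weights enter the centroid update
    V_new = (Wm @ X) / Wm.sum(axis=1, keepdims=True)
    return U, V_new
```

Setting all weights to 1 recovers the standard FCM step, which is why the same routine serves both the per-chunk and the final clustering.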
As discussed above, both methods are designed for single view data, which is represented by single view features; they are not able to handle large multi-view data. In our method, we integrate a multi-view clustering technique based on minimax optimization into the incremental clustering framework to identify multi-view centroids for each chunk separately. We then identify the final set of centroids based on all the multi-view centroids from all the chunks. Next, we propose our novel incremental minimax optimization based fuzzy clustering approach, IminimaxFCM, including the detailed formulation, derivation and an in-depth analysis.
3 The proposed approach
In this section, we first propose two naive and simple incremental multi-view clustering approaches based on OFCM and SPFCM, which we refer to as NaiveMVOFCM and NaiveMVSPFCM. Next, we formulate the objective function of the proposed approach IminimaxFCM. In IminimaxFCM, the consensus clustering results and the multi-view centroids of each chunk are generated based on minimax optimization, in which the maximum disagreement among the weighted views is minimized. Moreover, the weight of each view is learned automatically in the clustering process. We show that IminimaxFCM clustering is formulated as a minimax optimization problem over a cost function with constraints, and the Lagrangian multiplier method is applied to derive the updating rules. Finally, we introduce the IminimaxFCM algorithm, including its detailed steps; the time complexity of the algorithm is discussed as well. In summary, multi-view centroids are identified by IminimaxFCM to represent each cluster in a chunk, and the final set of multi-view centroids is identified at the end.
3.1 NaiveMVOFCM and NaiveMVSPFCM
We first propose NaiveMVOFCM and NaiveMVSPFCM, in which OFCM and SPFCM respectively are applied to each view of the data set to obtain a set of centroids representing that view. The final label of each data object is then determined based on its distance to the centroids of each view. The details are shown in Algorithm 1 and Algorithm 2 respectively. Though these two methods are able to handle large multi-view data, their clustering performance may not be good because they do not use the information among complementary views to achieve a consensus clustering result. Next, we propose the method called IminimaxFCM, which integrates the different views based on minimax optimization.
|Algorithm 1: NaiveMVOFCM|
|Input: Data set of views with size|
|Cluster Number , stopping criterion , fuzzifier|
Output: Cluster Indicator q
|Identify centroids for each view based on|
|by using OFCM|
|Algorithm 2: NaiveMVSPFCM|
|Input: Data set of views with size|
|Cluster Number , stopping criterion , fuzzifier|
Output: Cluster Indicator q
|Identify centroids for each view based on|
|by using SPFCM|
The objective function based on minimax optimization is formulated for IminimaxFCM as follows:
In the objective function, the membership matrix contains the consensus memberships, and the centroid matrix of each view contains the centroids that represent the data of that view, with columns of dimension equal to that of the objects in the view. The cost of each view is the standard objective function of FCM computed on that view. Each view carries a weight, and a parameter controls the distribution of the weights across different views. The fuzzifier of fuzzy clustering controls the fuzziness of the membership.
The goal is to conduct a minimax optimization of the objective function under the constraints, in order to obtain the consensus membership and the multi-view centroids that represent the data. In this new formulation for multi-view clustering, the weight of each view is determined automatically based on minimax optimization, without users having to specify the weights. Moreover, by using minimax optimization, the different views are integrated harmoniously by weighting each cost term differently.
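A plausible form of this minimax objective, consistent with the description above and with related minimax multi-view formulations, can be sketched as follows (our reconstruction; the symbols U, V, alpha, gamma and m are our notation, not necessarily the paper's):

```latex
\min_{U,\,\{V^{(v)}\}}\ \max_{\boldsymbol{\alpha}}\
  \sum_{v=1}^{P} \alpha_v^{\gamma}\, Q^{(v)},
\qquad
Q^{(v)} = \sum_{c=1}^{K}\sum_{i=1}^{N} u_{ci}^{\,m}\,
          \bigl\| x_i^{(v)} - v_c^{(v)} \bigr\|^{2},
\qquad
\text{s.t.}\ \sum_{c=1}^{K} u_{ci} = 1,\ u_{ci}\ge 0,\quad
\sum_{v=1}^{P} \alpha_v = 1,\ \alpha_v \ge 0,\ 0<\gamma<1 .
```

Here each $Q^{(v)}$ is the FCM cost of view $v$, and with $0<\gamma<1$ the inner problem is concave in $\boldsymbol{\alpha}$, so the maximization over the view weights is well defined.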
It is difficult to solve for the variables in (4) directly because (4) is nonconvex. However, the objective function is convex with respect to the memberships and the centroids and concave with respect to the view weights; therefore, similar to FCM, alternating optimization (AO) can be used to solve the problem by updating one variable while keeping the others fixed.
3.3.1 Minimization: Fixing the centroids and view weights, and updating the memberships
The Lagrangian multiplier method is applied to solve the constrained optimization problem of the objective function. The Lagrangian function incorporating the constraints is given as follows:
where the first term is the objective function of IminimaxFCM and the remaining terms contain the Lagrange multipliers for the constraints. The first-order condition for the solution is as follows:
As shown in (12), the weight of each view is taken into account when updating the memberships.
3.3.2 Minimization: Fixing the memberships and view weights, and updating the centroids
3.3.3 Maximization: Fixing the memberships and centroids, and updating the view weights
Based on the Lagrangian multiplier method, the condition for solving for the view weights is as follows:
Here the cost term of each view is the weighted sum of distances from all the data points in that view to their corresponding centroids. The larger this cost is, the more the view contributes to the objective function. From (16), we can see that the larger the cost of a view is, the higher the weight assigned to it, which yields the maximum of the weighted cost. This maximum is then minimized with respect to the memberships and centroids in order to suppress the high cost views and achieve a harmonious consensus clustering result. Next, we present the details of the IminimaxFCM algorithm.
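The "larger cost, larger weight" behaviour can be illustrated with the closed-form maximizing weights implied by an objective of the form sum_v alpha_v^gamma * Q_v under sum_v alpha_v = 1. This is a sketch under our assumed form of the objective, not the paper's exact update (16):

```python
import numpy as np

def view_weights(costs, gamma=0.5):
    """Maximizing weights for sum_v alpha_v^gamma * Q_v subject to
    sum_v alpha_v = 1, 0 < gamma < 1: alpha_v is proportional to
    Q_v ** (1 / (1 - gamma)), so the costliest view gets the largest weight."""
    q = np.asarray(costs, dtype=float) ** (1.0 / (1.0 - gamma))
    return q / q.sum()
```

For example, with per-view costs [1.0, 4.0] and gamma = 0.5, the weights are [1, 16] / 17, so the view with four times the cost receives sixteen times the weight; minimizing over memberships and centroids then pushes that dominant cost down.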
3.4 IminimaxFCM Algorithm
The proposed IminimaxFCM processes the large multi-view data chunk by chunk, and the multi-view centroids of each chunk are identified based on minimax optimization. The framework of IminimaxFCM is shown in Fig. 1.
As shown in the framework, the entire multi-view data set is divided into chunks, in which each data object has multiple views. In Phase 1, the multi-view centroids that represent each chunk are identified independently based on Algorithm 3. In Phase 2, the final set of centroids that represents the entire data set is identified using all the centroids identified from all the chunks. The details of the algorithm are outlined as follows.
First, the data set is partitioned into non-overlapping chunks in which each object has multiple views. In step 1, the centroid sets of each view are initialized. In step 2, the centroid matrix of each view for each chunk is calculated based on Algorithm 3 by integrating the multiple views; each column of a centroid matrix is the centroid of one cluster in the chunk under the corresponding view. In Algorithm 3, the multi-view data in each chunk is processed, and the multi-view centroids are identified by iteratively updating (14), (12) and (15) until convergence. After each chunk is processed, its multi-view centroids are added to the collected centroid set. In step 3, the final set of centroids is identified from the collected centroids by using Algorithm 3. In the final step, the cluster indicator q is determined for each object: each object is assigned to the cluster whose centroids over all views have the smallest sum of distances to the object.
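The two-phase scheme just described can be sketched as a skeleton (the function and parameter names are ours; `cluster_chunk` stands in for Algorithm 3 applied to one chunk, `cluster_final` for Algorithm 3 applied to the collected centroids, and `assign` for the final distance-based labelling):

```python
def iminimaxfcm(chunks, cluster_chunk, cluster_final, assign):
    """Skeleton of the two-phase incremental scheme.

    chunks: iterable of multi-view data chunks;
    cluster_chunk: maps one chunk to its multi-view centroids (Phase 1);
    cluster_final: clusters all collected centroids into the final
    multi-view centroids (Phase 2);
    assign: labels each object in a chunk by its summed distance over
    all views to the final centroids.
    Returns the per-chunk label lists.
    """
    collected = []
    for chunk in chunks:                         # Phase 1: per-chunk centroids
        collected.append(cluster_chunk(chunk))
    final_centroids = cluster_final(collected)   # Phase 2: final centroids
    return [assign(chunk, final_centroids) for chunk in chunks]
```

Only the collected centroids, not the raw chunks, are needed in Phase 2, which is what keeps the memory footprint bounded by the chunk size.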
|Input: Data set of views with size|
|Cluster Number , Chunk number , stopping criterion , fuzzifier|
Output: Cluster Indicator q
|1 Initialize Centroids set for each view ,|
|Calculate centroid matrix for chunk of each view|
|based on Algorithm 3|
|3 Calculate final centroid matrix for each view based on Algorithm 3|
|based on all the identified centroids of views|
|from all the chunks|
|Algorithm 3|
|Input: Data set of views with size , Cluster Number|
|stopping criterion , fuzzifier , Parameter|
|Output: multi-view centroids|
|1 Initialize consensus membership and for each view|
|5 Update using equation (14);|
|6 end for|
|7 end for|
|10 Update using equation (12);|
|11 end for|
|12 end for|
|14 Update using equation (15);|
|15 end for|
The time complexity of IminimaxFCM depends on the chunk number M, the object dimension D, the view number P and the cluster number K. The first term is the time complexity of processing the M chunks of P views, which scales with the number of objects in each chunk; the second term is the time complexity of the final clustering and label assignment, which scales with the number of centroids identified from all the chunks.
4 Experimental results
In this section, experimental studies of the proposed approach are conducted on six real world data sets, including image and document data sets. Table 1 summarizes the characteristics of the data sets, including the number of objects, the number of classes, and the different features with their dimensions. More details about the data sets are given in the following subsections. In the experiments, we compare IminimaxFCM with two incremental single view based approaches, SPFCM and OFCM, to examine whether using multiple views improves the clustering performance. We also compare with two simple incremental multi-view based approaches, NaiveMVOFCM and NaiveMVSPFCM, to examine whether minimax optimization based IminimaxFCM performs better. The experiments were implemented in Matlab and conducted on a PC with a four-core Intel i5-2400 CPU and 8 gigabytes of memory.
4.1 Data sets
We compare the performance of the five algorithms on the following data sets.
Multiple Features (MF)111This data set can be downloaded from https://archive.ics.uci.edu/ml/datasets/Multiple+Features.: This data set consists of 2000 handwritten digit images (0-9) extracted from a collection of Dutch utility maps. It has 10 classes, each with 200 images. Each object is described by 6 different views (Fourier coefficients, profile correlations, Karhunen-Loève coefficients, pixel averages, Zernike moments and morphological features).
Image segmentation (IS) data set222This data set can be downloaded from https://archive.ics.uci.edu/ml/datasets/Image+Segmentation.: This data set is composed of 2310 outdoor images in 7 classes. Each image is represented by 19 features, which can be considered as two views: a shape view and an RGB view. The shape view consists of 9 features describing the shape information of each image, and the RGB view consists of 10 features describing the RGB values of each image.
Caltech7/20: These two data sets are subsets of the Caltech 101 image data set fei2004learning . Following the setting in dueck2007non , 7 widely used classes (1474 images), including Face, Motorbikes, Dollar-Bill, Garfield, Snoopy, Stop-Sign and Windsor-Chair, are selected to form Caltech7. A larger subset, referred to as Caltech20 (2386 images), is formed by choosing 20 classes: Face, Leopards, Motorbikes, Binocular, Brain, Camera, Car-Side, Dollar-Bill, Ferry, Garfield, Hedgehog, Pagoda, Rhino, Snoopy, Stapler, Stop-Sign, Water-Lilly, Windsor-Chair, Wrench and Yin-yang. Six kinds of features are extracted from all the images: a 48 dimension Gabor feature, 40 dimension wavelet moments (WM), a 254 dimension CENTRIST feature, a 1984 dimension HOG feature, a 512 dimension GIST feature, and a 928 dimension LBP feature.
Reuters: This data set contains documents originally written in five different languages (English, French, German, Spanish and Italian) and their translations amini2009learning . This multilingual data set covers a common set of six classes. We use the documents originally in English as the first view and their four translations as the other four views. We randomly sample 1500 documents from this collection, with each of the 6 classes having 250 documents.
Forest333This data set can be downloaded from https://archive.ics.uci.edu/ml/datasets/Covertype.: This data set is from the United States Geological Survey and the United States Forest Service; it has 7 classes and 581012 objects. Each object is represented by a 54 dimensional feature vector. To construct multiple views, we divide the 54 features into three views, each representing one kind of feature. The first view consists of the first 10 features, which are quantitative features including distances and degrees. The second view consists of the next 4 features, which represent the wilderness area each tree belongs to. The third view consists of the last 40 features, which represent the soil type of the place where each tree grows.
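The view construction for the Forest data set amounts to column slicing. A minimal sketch (the function name is ours; the column split follows the 10/4/40 layout described above):

```python
import numpy as np

def split_forest_views(X):
    """Split the 54 Covertype features into the three views described above:
    10 quantitative columns, 4 wilderness-area columns, 40 soil-type columns."""
    assert X.shape[1] == 54
    return X[:, :10], X[:, 10:14], X[:, 14:54]
```

Each returned array is then treated as one view of the same set of objects.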
4.2 Evaluation criterion
Three popular external criteria, Accuracy mei2012fuzzy , F-measure larsen1999fast and Normalized Mutual Information (NMI) strehl2003cluster , are used to evaluate the clustering results; they measure the agreement between the clustering results produced by an algorithm and the ground truth. If we refer to classes as the ground truth and clusters as the results of a clustering algorithm, NMI is calculated as follows:
where N is the total number of objects, N_k and N_c are the numbers of objects in cluster k and class c respectively, and N_{k,c} is the number of objects common to class c and cluster k. For F-measure, the calculation based on precision and recall is as follows:
Accuracy is calculated as follows after obtaining a one-to-one match between clusters and classes:
where the numerator counts the objects common to each cluster and its matched class. The higher the values of the three criteria, the better the clustering result; a value of 1 is reached only when the clustering result is identical to the ground truth.
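The NMI criterion can be computed directly from the class/cluster contingency counts. A self-contained sketch using the geometric-mean normalization of strehl2003cluster (the function name is ours):

```python
import numpy as np

def nmi(labels, preds):
    """Normalized mutual information I(C;K) / sqrt(H(C) * H(K)) between
    ground-truth class labels and predicted cluster labels."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    n = len(labels)
    cls, kls = np.unique(labels), np.unique(preds)
    # Contingency counts: objects common to class c and cluster k.
    cont = np.array([[np.logical_and(labels == c, preds == k).sum()
                      for k in kls] for c in cls], dtype=float)
    pi, pj = cont.sum(axis=1) / n, cont.sum(axis=0) / n
    pij = cont / n
    nz = pij > 0
    mi = (pij[nz] * np.log(pij[nz] / (pi[:, None] * pj[None, :])[nz])).sum()
    h = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()  # entropy
    denom = np.sqrt(h(pi) * h(pj))
    return mi / denom if denom > 0 else 0.0
```

A permutation of cluster labels leaves NMI unchanged, which is why no cluster-to-class matching is needed for this criterion (unlike Accuracy).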
4.3 Experimental setting
For the Reuters data set, following the experimental setting in 2011NIPScosc , Probabilistic Latent Semantic Analysis (PLSA) hofmann1999probabilistic is applied to project the data to a 100-dimensional space, and the clustering approaches are conducted on the low dimensional data. For OFCM and SPFCM, which are single view based methods, the features of all views are concatenated as the input. For initialization, we initialize the centroids using the method adopted in krishnapuram2001low : the object with the minimum distance to all the other objects is selected as the first centroid, and the remaining centroids are chosen consecutively by selecting the objects that maximize their minimal distance to the existing centroids. This helps the centroids distribute evenly in the data space and avoids converging to a bad local optimum. For NaiveMVOFCM and NaiveMVSPFCM, OFCM and SPFCM with the same initialization method are applied to each view of the data set to obtain a set of centroids representing that view. For IminimaxFCM, we use two methods to initialize the consensus membership. In the first method, for each view, we first choose the centroids with the method mentioned above; if the centroid of a cluster in a view is a particular object, that object's membership in the cluster is set to 1 and the memberships of the other objects in the same cluster are set accordingly. The consensus membership is then calculated as the average over all views. The detailed steps of the initialization are as follows. The second initialization method applies FCM to each view to generate a membership matrix for that view; then, as in the first method, the consensus matrix is calculated as the average of the membership matrices of all views. We refer to IminimaxFCM with these two initialization methods as IminimaxFCM1 and IminimaxFCM2 respectively.
|Initialization for IminimaxFCM|
|Set the number of clusters , consensus membership matrix is|
|Calculate the first centroid for view:|
|Centroids set ;|
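The centroid-selection rule used in the initialization above (most central object first, then farthest-first selection) can be sketched as follows (our function names):

```python
import numpy as np

def init_centroids(X, K):
    """Select K initial centroids: the first is the object with minimum
    total distance to all others; each subsequent centroid maximizes its
    minimal distance to the centroids chosen so far."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))  # pairwise distances
    idx = [int(D.sum(axis=1).argmin())]      # most central object
    for _ in range(K - 1):
        dmin = D[:, idx].min(axis=1)         # distance to nearest chosen centroid
        dmin[idx] = -1.0                     # exclude already-chosen objects
        idx.append(int(dmin.argmax()))       # farthest-first selection
    return X[idx], idx
```

This spreads the initial centroids across the data space, which is the stated motivation for avoiding bad local optima.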
4.4 Results on data sets
We partition each data set randomly into equal sized chunks. The size of each chunk can be specified by the user, normally as a certain percentage of the entire data size. The last chunk may be smaller than the others if the entire data set cannot be divided evenly by the chunk size. We conduct experiments with chunk sizes of 1%, 2.5%, 5%, 10% and 25% of the entire data set size for the MF, IS, Caltech7, Caltech20 and Reuters data sets. For the Forest data set, limited by memory, smaller chunk sizes of 0.1%, 0.25%, 0.5%, 1% and 2.5% of the entire data set size are chosen. For each data set, the same fuzzifier m is set for all approaches, and 20 trials are run with random orderings of the input data. We report the mean and standard deviation of accuracy, NMI and F-measure over the 20 trials. The results on the six data sets are shown in Tables 2, 3, 4, 5, 6 and 7 respectively. From the tables we can see that IminimaxFCM1 and IminimaxFCM2 consistently produce the best partitions across the various chunk sizes.
Table 2 (a), (b), (c) show the accuracy, NMI and F-measure results on the MF data set respectively. The results show that our proposed multi-view based approaches IminimaxFCM1 and IminimaxFCM2 perform better than the single concatenated view based approaches OFCM and SPFCM. Compared with OFCM, the improvements of IminimaxFCM1 in average accuracy, NMI and F-measure over all the chunk sizes are , and , and the improvements of IminimaxFCM2 are , and . Compared with SPFCM, the improvements of IminimaxFCM1 and IminimaxFCM2 are , , and , , respectively. Moreover, the results show that IminimaxFCM1 and IminimaxFCM2, by using minimax optimization to integrate the different views, also perform better than NaiveMVOFCM and NaiveMVSPFCM, in which each view is processed individually without considering the complementary information among views. Compared with NaiveMVOFCM, the improvements of IminimaxFCM1 and IminimaxFCM2 are , , and , , respectively. Compared with NaiveMVSPFCM, the improvements of IminimaxFCM1 and IminimaxFCM2 are , , and , , respectively. The results on the other data sets show a similar pattern in performance, and the best performance is always achieved by IminimaxFCM1 or IminimaxFCM2.