Clustering is an unsupervised learning technique that aims at partitioning data into a number of homogeneous groups (or clusters). However, traditional clustering methods typically provide a single clustering, and fail to reveal the diverse patterns underlying the data. In fact, several different clustering solutions may co-exist in a given problem, and each may provide a reasonable organization of the data, e.g., people can be assigned to different communities based on different roles; proteins can be categorized differently based on their amino acid sequences or their 3D structure. In these scenarios, it would be desirable to present multiple alternative clusterings to the users, as these alternative clusterings can explain the underlying structure of the data from different viewpoints.
To address the aforementioned problem, the research field of multi-clustering has emerged during the last decade. Naive solutions run a single clustering algorithm with different parameter values, or explore different clustering algorithms [Bailey2013]. These approaches may generate multiple clusterings with high redundancy, since they do not take into account the already explored clusterings. To overcome this drawback, two general strategies have been introduced. The first one simultaneously generates multiple clusterings, which are required to be different from each other [Jain, Meka, and Dhillon2008, Dang and Bailey2010]. The second one generates multiple clusterings in a greedy manner, and forces the new clusterings to be different from the already generated ones [Cui, Fern, and Dy2007, Hu et al.2015, Yang and Zhang2017].
Most of these multi-clustering methods consider multiple clusterings in the full feature space. However, as the dimensionality of the data increases, clustering methods encounter the challenge of the curse of dimensionality [Parsons, Haque, and Liu2004]. Furthermore, some features may be relevant to some clusterings but not others. This phenomenon is also observed in data with moderate dimensionality. Subspace clustering aims at finding clusters in subspaces of the original feature space, but it faces an exponential search space ($2^d$ candidate subspaces for $d$ features) and focuses on exploring only one clustering. Some approaches try to find alternative clusterings in a weighted feature space [Caruana et al.2006, Hu et al.2015] or in a transformed feature space [Cui, Fern, and Dy2007, Davidson and Qi2008]; however, the former methods cannot adequately control the redundancy between different clusterings, and the latter cannot find multiple orthogonal subspaces at the same time.
To overcome these issues, we propose an approach called Multiple Independent Subspace Clusterings (MISC) to explore diverse clusterings in multiple independent subspaces, one clustering for each subspace. During the first stage, MISC uses Independent Subspace Analysis (ISA) [Szabó, Póczos, and Lőrincz2012] to explore multiple pairwise-independent (i.e., non-redundant) subspaces by minimizing the mutual information among them, and seeks the number of independent subspaces via the minimum description length principle [Rissanen2007]. During the second stage, MISC automatically determines the number of clusters in each subspace via Bayesian $k$-means [Welling2006], and groups the data embedded in each subspace using graph regularized semi-nonnegative matrix factorization [Ding, Li, and Jordan2010]. To group non-linearly separable data in a subspace, it further maps the data into a reproducing kernel Hilbert space via the kernel trick.
This paper makes the following contributions:
- We introduce an approach called MISC to explore multiple clusterings in independent subspaces. MISC automatically computes the number of independent subspaces, which provide multiple individual views of the data.
- MISC leverages graph regularized semi-nonnegative matrix factorization and kernel mapping to group non-linearly separable clusters, and can determine the number of clusters in each subspace.
- Experimental results show that MISC can explore different clusterings in various subspaces, and it significantly outperforms other related and competitive approaches [Caruana et al.2006, Bae and Bailey2006, Cui, Fern, and Dy2007, Davidson and Qi2008, Jain, Meka, and Dhillon2008, Hu et al.2015, Yang and Zhang2017, Niu, Dy, and Jordan2010, Guan et al.2010, Niu, Dy, and Ghahramani2012].
Existing multi-clustering approaches can be classified into two categories depending on how they control redundancy, either based on clustering labels, or on feature space.
COALA (Constrained Orthogonal Average Link Algorithm) [Bae and Bailey2006] is the classic algorithm that controls redundancy through clustering labels. It transforms linked pairs of the reference clustering into cannot-link constraints, and then uses agglomerative clustering to find an alternative clustering. MNMF (Multiple clustering by Nonnegative Matrix Factorization) [Yang and Zhang2017] derives a diversity regularization term from the labels of existing clusterings, and then integrates this term with the objective function of NMF to seek another clustering. The performance of both COALA and MNMF heavily depends on the quality of already discovered clusterings. To alleviate this issue, other methods simultaneously seek multiple clusterings by minimizing the correlation between the labels of two distinct clusterings and by optimizing the quality of each clustering [Jain, Meka, and Dhillon2008, Wang et al.2018]. For example, De-means (Decorrelated $k$-means) [Jain, Meka, and Dhillon2008] simultaneously learns two disparate clusterings by minimizing a $k$-means sum squared error objective for the two clustering solutions, and by minimizing the correlation between the two clusterings. CAMI (Clustering for Alternatives with Mutual Information) [Dang and Bailey2010] optimizes a dual-objective function, in which the log-likelihood objective (accounting for the quality) is maximized, while the mutual information objective (accounting for the dissimilarity) of pairwise clusterings is minimized.
Multi-clustering solutions that explore multiple clusterings using a feature-based criterion have also been studied. Some of them assign weights to features. For example, MetaC (Meta Clustering) [Caruana et al.2006] first applies $k$-means to generate a large number of base clusterings using weighted features based on the Zipf distribution [Zipf1949], and then obtains multiple clusterings via a hierarchical clustering ensemble. MSC (Multiple Stable Clusterings) [Hu et al.2015] detects multiple stable clusterings in each weighted feature space using the idea of clustering stability based on Laplacian Eigengap. Unfortunately, MSC cannot guarantee diversity among multiple clusterings, since it cannot control the redundancy very well. Other feature-wise multi-clusterings are based on transformed features. They use a data space to characterize the existing clusterings and try to construct a new feature space, which is either orthogonal to that data space or independent from it. Once the novel feature space is constructed, any clustering algorithm can be used in this space to generate an alternative clustering. OSC (Orthogonal subspace clustering) [Cui, Fern, and Dy2007] transforms the original feature space into an orthogonal subspace using a projection framework based on the given clustering, and then groups the transformed data into different clusters. ADFT (Alternative Distance Function Transformation) [Davidson and Qi2008] adopts a distance metric learning technique [Xing et al.2003] and singular value decomposition to obtain an alternative orthogonal subspace based on a given clustering. Thereafter, it obtains an alternative clustering by running the clustering algorithm in the new orthogonal feature space. mSC (Multiple Spectral Clusterings) [Niu, Dy, and Jordan2010] finds multiple clusterings by augmenting a spectral clustering objective function, and by using the Hilbert-Schmidt independence criterion (HSIC) [Gretton et al.2005] among multiple views to control the redundancy. NBMC (Nonparametric Bayesian Multiple Clustering) [Guan et al.2010] and NBMC-OFV (Nonparametric Bayesian model for Multiple Clustering with Overlapping Feature Views) [Niu, Dy, and Ghahramani2012] both employ a Bayesian model to explore multiple feature views and clusterings therein.
Feature-based multiple clustering methods typically seek a full space transformation matrix, or measure the similarity between samples in the full space. Therefore, their performance may be compromised with high-dimensional data. Furthermore, some data only show cluster structure on a subset of features. Given the above analysis, we advocate separately exploring diverse clusterings in independent subspaces, and introduce an approach called MISC. MISC first uses independent subspace analysis to obtain multiple independent subspaces, and then performs clustering in each independent subspace to achieve multiple clusterings. Extensive experimental results show that MISC can effectively uncover multiple diverse clusterings in each identified subspace.
MISC consists of two phases: (1) finding multiple independent subspaces, and (2) exploring a clustering in each subspace. In the following, we provide the details of each phase.
Independent Subspace Analysis
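Blind Source Separation (BSS) is a classic problem in signal processing. Independent Component Analysis (ICA) is a statistical technique that can solve the BSS problem by decomposing complex data into independent subparts [Hyvärinen, Hoyer, and Inki2001]. Let us consider a data matrix $\mathbf{X} \in \mathbb{R}^{d \times n}$ for $n$ samples with $d$ features. ICA describes $\mathbf{X}$ as a linear mixture of sources, i.e., $\mathbf{X} = \mathbf{A}\mathbf{S}$, where $\mathbf{A} \in \mathbb{R}^{d \times d}$ is the mixing matrix and $\mathbf{S} \in \mathbb{R}^{d \times n}$ corresponds to the source components. The source matrix $\mathbf{S}$ represents $n$ observations under multiple independent row vectors, i.e., $\mathbf{S} = [\mathbf{s}_1; \mathbf{s}_2; \dots; \mathbf{s}_d]$, where each $\mathbf{s}_i \in \mathbb{R}^{1 \times n}$ corresponds to a source component.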
Unlike ICA, which requires pairwise independence between all individual source components, Independent Subspace Analysis (ISA) aims at finding a linear transformation of the given data, and it yields several jointly independent source subspaces, each of which contains one or more source components. Let us assume there are $m$ independent subspaces; ISA seeks the corresponding source subspaces $\mathbf{S}_1, \dots, \mathbf{S}_m$ by minimizing the mutual information between pairwise subspaces as follows:
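Written out under the assumption that $\mathbf{S}_1, \dots, \mathbf{S}_m$ denote the $m$ source subspaces (disjoint groups of rows of $\mathbf{S}$) and $I(\cdot,\cdot)$ denotes mutual information, an objective of this kind can be sketched as below. The symbols are notational assumptions, not necessarily those of the original display.

```latex
\min_{\mathbf{A}} \; \sum_{i \neq j} I(\mathbf{S}_i, \mathbf{S}_j),
\qquad \text{s.t.} \quad \mathbf{X} = \mathbf{A}\mathbf{S}, \quad
\mathbf{S} = [\mathbf{S}_1; \mathbf{S}_2; \dots; \mathbf{S}_m]
```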
Various ISA solvers are available, and they vary in terms of the applied cost functions and optimization techniques [Szabó, Póczos, and Lőrincz2012]. For example, fastISA [Hyvärinen and Köster2006] seeks the mixing matrix by iteratively updating its rows in a fixed-point manner. Unfortunately, fastISA can only find equal-sized subspaces, while multiple clusterings may exist in subspaces of different sizes. Here we adopt a variant of ISA [Szabó, Póczos, and Lőrincz2012], which makes use of the “ISA separation principle”, stating that ISA can be solved by first performing ICA, and then searching and merging the components. As such, the independence between the groups is maximized, and the groups do not need to have an equal number of components. This ISA solution only needs to specify the number of subspaces $m$, which is difficult to determine. To compute the number of subspaces, we use a greedy search strategy, which combines agglomerative clustering and the Minimum Description Length (MDL) principle [Rissanen2007].
The first step of agglomerative clustering is to merge subspaces. Given two subspaces $\mathbf{S}_i$ and $\mathbf{S}_j$, we compute their independence as follows:
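One form consistent with the description that follows is sketched here. It assumes that the score compares the cost of coding two subspaces jointly against the cost of coding them separately, so that strongly dependent pairs receive the smallest (most negative) values. The symbols $I$, $H$, $\mathbf{S}_i$, and $\mathbf{S}_j$ are notational assumptions.

```latex
I(\mathbf{S}_i, \mathbf{S}_j) = H(\mathbf{S}_i \cup \mathbf{S}_j) - H(\mathbf{S}_i) - H(\mathbf{S}_j)
```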
where $H(\mathbf{S}_i)$ is the entropy cost to encode the objects in the subspace $\mathbf{S}_i$ using the probability-density function $p(\mathbf{S}_i)$, which can be obtained using kernel density estimation (https://bitbucket.org/szzoli/ite/downloads/). We compute the independence score $I(\mathbf{S}_i, \mathbf{S}_j)$ of each pair of subspaces and merge the pair with the smallest score. We repeat this step until only one subspace remains or no remaining pair of subspaces satisfies the merging criterion.
We apply the MDL principle to determine the number of subspaces. MDL is widely used for model selection. Its core idea is to choose the model that allows a receiver to exactly reconstruct the original data using the most succinct transmission. MDL balances the coding length of the model and the coding length of the deviations of the data from that model. More concretely, the coding cost for transmitting data $\mathcal{D}$ together with a model $\mathcal{M}$ is $L(\mathcal{D}, \mathcal{M}) = L(\mathcal{M}) + L(\mathcal{D} \mid \mathcal{M})$.
When subspaces are merged in each iteration, we update the total coding cost $L(\mathcal{D}, \mathcal{M})$. Finally, we choose the number of subspaces corresponding to the smallest $L(\mathcal{D}, \mathcal{M})$. Concretely, we use the technique in [Rissanen2007, Ye et al.2016] to measure the length of the model and data coding as follows:
where $n$ is the number of samples, $d$ is the number of features, and $p(\cdot)$ is the probability-density function for each subspace. As a result, we obtain $m$ independent subspaces.
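The greedy search described above can be sketched in a few lines. This is only an illustration under loud assumptions: a fitted Gaussian density stands in for kernel density estimation, the pair score is assumed to be the joint coding cost minus the separate coding costs, and a BIC-style penalty of half a log of the sample count per covariance parameter stands in for the model coding length of [Rissanen2007, Ye et al.2016]. All function names are hypothetical.

```python
import numpy as np
from itertools import combinations


def gaussian_code_length(S, idx):
    """Cost (in nats) of coding rows `idx` of S (d x n) under a fitted Gaussian.

    Assumption: a Gaussian density replaces the kernel density estimate.
    """
    n = S.shape[1]
    cov = np.atleast_2d(np.cov(S[idx]))
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * n * (len(idx) * np.log(2.0 * np.pi * np.e) + logdet)


def total_cost(S, groups):
    """Two-part cost: data coding plus a BIC-style model penalty (assumption)."""
    n = S.shape[1]
    data = sum(gaussian_code_length(S, g) for g in groups)
    n_params = sum(len(g) * (len(g) + 1) / 2 for g in groups)
    return data + 0.5 * n_params * np.log(n)


def greedy_merge(S):
    """Merge the most dependent pair at each step; keep the cheapest partition."""
    groups = [[i] for i in range(S.shape[0])]
    best_cost, best_groups = total_cost(S, groups), [list(g) for g in groups]
    while len(groups) > 1:
        def score(pair):
            a, b = groups[pair[0]], groups[pair[1]]
            # Joint cost minus separate costs: most negative = most dependent.
            return (gaussian_code_length(S, a + b)
                    - gaussian_code_length(S, a)
                    - gaussian_code_length(S, b))

        i, j = min(combinations(range(len(groups)), 2), key=score)
        merged = sorted(groups[i] + groups[j])
        groups = [g for t, g in enumerate(groups) if t not in (i, j)] + [merged]
        cost = total_cost(S, groups)
        if cost < best_cost:
            best_cost, best_groups = cost, [list(g) for g in groups]
    return sorted(best_groups)
```

On a source matrix whose rows form two mutually independent pairs of strongly correlated components, this sketch is expected to stop at two subspaces, because a further merge saves little coding cost while adding covariance parameters.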
Exploring Multiple Clusterings
After obtaining multiple independent subspaces, we use Bayesian $k$-means [Welling2006] to guide the computation of the number of clusters in each subspace. Bayesian $k$-means adopts a variational Bayesian framework [Ghahramani and Beal1999] to iteratively choose the optimal number of clusters. We then perform Graph regularized Semi-NMF (GSNMF) to cluster data embedded in each subspace. GSNMF improves upon SNMF by leveraging the geometric structure of samples to regularize the matrix factorization.
SNMF [Ding, Li, and Jordan2010] is a variant of the classical NMF [Lee and Seung1999]; it extends the application of traditional NMF from nonnegative inputs to mixed-sign inputs. At the same time, it preserves the strong clustering interpretability. The objective function of SNMF can be formulated as follows:
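In the notation of this section (data $\mathbf{X} \in \mathbb{R}^{d \times n}$, and assuming $k$ clusters), the semi-NMF objective of [Ding, Li, and Jordan2010] is commonly written as below. The symbols $\mathbf{Z}$ and $\mathbf{H}$ are notational assumptions.

```latex
\min_{\mathbf{Z},\, \mathbf{H} \geq 0} \; \left\| \mathbf{X} - \mathbf{Z}\mathbf{H}^\top \right\|_F^2,
\qquad \mathbf{Z} \in \mathbb{R}^{d \times k}, \quad \mathbf{H} \in \mathbb{R}^{n \times k}
```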
where $\mathbf{Z}$ can be viewed as the cluster centroids, and $\mathbf{H}$ is the soft cluster assignment matrix in the latent space. We can transform the soft clusters to hard clusters by clustering the index matrix $\mathbf{H}$.
Inspired by GNMF (Graph regularized Nonnegative Matrix Factorization) [Cai et al.2011], we make use of the intrinsic geometric structure of samples to guide the factorization of $\mathbf{X}$, and cascade it to $\mathbf{H}$. As a result, we obtain the following objective function for the graph regularized SNMF (GSNMF):
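Assuming the graph term enters as in GNMF [Cai et al.2011], with regularization parameter $\alpha$ and graph Laplacian $\mathbf{L}$ (both symbols are notational assumptions), the GSNMF objective can be sketched as:

```latex
\min_{\mathbf{Z},\, \mathbf{H} \geq 0} \; \left\| \mathbf{X} - \mathbf{Z}\mathbf{H}^\top \right\|_F^2
+ \alpha \, \mathrm{tr}\!\left( \mathbf{H}^\top \mathbf{L} \mathbf{H} \right)
```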
where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, $\alpha$ is the regularization parameter; $\mathbf{L} = \mathbf{D} - \mathbf{G}$ is the graph Laplacian matrix, $\mathbf{G}$ is the weighted adjacency matrix of the graph [Cai et al.2011], and $\mathbf{D}$ is the diagonal degree matrix whose entries are the row sums of $\mathbf{G}$. By minimizing the graph regularized term, we assume that if $\mathbf{x}_i$ and $\mathbf{x}_j$ are close to each other, then their cluster labels $\mathbf{h}_i$ and $\mathbf{h}_j$ should be close as well.
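To make the graph term concrete, the following sketch builds a symmetric 0-1 weighted nearest-neighbor graph and its Laplacian, and evaluates the trace regularizer. The function names and the neighborhood parameter `p` are hypothetical; the adjacency matrix is called `G` here purely for illustration.

```python
import numpy as np


def knn_graph_laplacian(X, p=5):
    """Return (G, D, L) for a symmetric 0-1 p-nearest-neighbor graph.

    X holds one sample per column (d x n); requires p < n.
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbor
    nn = np.argsort(dist, axis=1)[:, :p]      # indices of the p closest samples
    G = np.zeros((n, n))
    G[np.repeat(np.arange(n), p), nn.ravel()] = 1.0
    G = np.maximum(G, G.T)                    # symmetrize the 0-1 weights
    D = np.diag(G.sum(axis=1))                # degree matrix (row sums of G)
    return G, D, D - G


def smoothness(H, L):
    """Graph regularizer tr(H^T L H) for an assignment matrix H (n x k)."""
    return np.trace(H.T @ L @ H)
```

For a symmetric adjacency matrix, the regularizer equals half the weighted sum of squared distances between the rows of the assignment matrix over all connected pairs, which is why minimizing it pulls the assignments of neighboring samples together.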
However, GSNMF, similarly to NMF and SNMF, does not perform well with data that are non-linearly separable in input space. To avoid this potential issue, we consider mapping the data points onto a reproducing kernel Hilbert space via a mapping function $\phi(\cdot)$, and reformulate Eq. (7) as follows:
This formulation makes it difficult to compute $\mathbf{Z}$ and $\mathbf{H}$, since they depend on the mapping function $\phi(\cdot)$. To solve this problem, we add constraints on the basis vectors $\mathbf{Z}$. As such, the basis matrix can be further formulated as the combination of weighted samples $\mathbf{Z} = \phi(\mathbf{X})\mathbf{W}$, in which $\mathbf{W} \in \mathbb{R}^{n \times k}$ is the weight matrix. Eq. (8) can be rewritten as follows:
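Under the assumption that the kernelized objective keeps the same two terms, the form before and after substituting $\mathbf{Z} = \phi(\mathbf{X})\mathbf{W}$ can be sketched as follows, with $\mathbf{K} = \phi(\mathbf{X})^\top \phi(\mathbf{X})$ denoting the kernel matrix. The expansion in the last line only needs $\mathbf{K}$, never $\phi$ itself.

```latex
\begin{aligned}
&\min_{\mathbf{Z},\, \mathbf{H} \geq 0} \; \left\| \phi(\mathbf{X}) - \mathbf{Z}\mathbf{H}^\top \right\|_F^2
  + \alpha \, \mathrm{tr}\!\left( \mathbf{H}^\top \mathbf{L} \mathbf{H} \right) \\
&\min_{\mathbf{W},\, \mathbf{H} \geq 0} \; \left\| \phi(\mathbf{X}) - \phi(\mathbf{X})\mathbf{W}\mathbf{H}^\top \right\|_F^2
  + \alpha \, \mathrm{tr}\!\left( \mathbf{H}^\top \mathbf{L} \mathbf{H} \right) \\
&\quad = \mathrm{tr}(\mathbf{K}) - 2\,\mathrm{tr}\!\left( \mathbf{K}\mathbf{W}\mathbf{H}^\top \right)
  + \mathrm{tr}\!\left( \mathbf{H}\mathbf{W}^\top \mathbf{K} \mathbf{W} \mathbf{H}^\top \right)
  + \alpha \, \mathrm{tr}\!\left( \mathbf{H}^\top \mathbf{L} \mathbf{H} \right)
\end{aligned}
```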
Through kernel mapping, the resulting kernel GSNMF (KGSNMF) can properly cluster not only linearly separable data but also non-linearly separable data.
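Optimization: We follow the idea of standard NMF to optimize $\mathbf{W}$ and $\mathbf{H}$ by an alternating optimization technique. In particular, we alternate the optimization of $\mathbf{W}$ and $\mathbf{H}$, while fixing the other as constant. For simplicity, we use $\mathbf{K}$ to represent $\phi(\mathbf{X})^\top \phi(\mathbf{X})$.
Optimizing with respect to $\mathbf{H}$ is equivalent to optimizing the following function:
To embed the constraint $\mathbf{H} \geq 0$, we introduce the Lagrange multiplier $\mathbf{\Psi}$:
Letting the partial derivative $\partial \mathcal{L} / \partial \mathbf{H} = 0$, we obtain
Based on the Karush-Kuhn-Tucker (KKT) [Boyd and Vandenberghe2004] complementarity condition $\mathbf{\Psi}_{ij}\mathbf{H}_{ij} = 0$, we have:
Eq. (13) leads to the following updating formula for $\mathbf{H}$:
where we separate the positive and negative parts of a matrix $\mathbf{M}$ by setting $\mathbf{M}^{+}_{ij} = (|\mathbf{M}_{ij}| + \mathbf{M}_{ij})/2$ and $\mathbf{M}^{-}_{ij} = (|\mathbf{M}_{ij}| - \mathbf{M}_{ij})/2$.
Similarly, we can get the updating formula for $\mathbf{W}$:
By iteratively applying Eqs. (14) and (15) in each independent subspace, we can obtain the optimized $\mathbf{W}$ and $\mathbf{H}$. Each $\mathbf{H}$ obtained from each subspace corresponds to one clustering. As such, we obtain $m$ clusterings from $m$ independent subspaces.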
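One derivation consistent with the steps above is sketched here. It assumes the kernelized objective $\mathrm{tr}(\mathbf{K}) - 2\,\mathrm{tr}(\mathbf{K}\mathbf{W}\mathbf{H}^\top) + \mathrm{tr}(\mathbf{H}\mathbf{W}^\top\mathbf{K}\mathbf{W}\mathbf{H}^\top) + \alpha\,\mathrm{tr}(\mathbf{H}^\top\mathbf{L}\mathbf{H})$ with $\mathbf{L} = \mathbf{D} - \mathbf{G}$, and it follows the multiplicative-update pattern of [Ding, Li, and Jordan2010]. The closed-form $\mathbf{W}$ step additionally assumes that $\mathbf{K}$ and $\mathbf{H}^\top\mathbf{H}$ are invertible; the update used in the original equations may differ.

```latex
\begin{aligned}
O(\mathbf{H}) &= -2\,\mathrm{tr}\!\left(\mathbf{K}\mathbf{W}\mathbf{H}^\top\right)
  + \mathrm{tr}\!\left(\mathbf{H}\mathbf{W}^\top\mathbf{K}\mathbf{W}\mathbf{H}^\top\right)
  + \alpha\,\mathrm{tr}\!\left(\mathbf{H}^\top\mathbf{L}\mathbf{H}\right) \\
\mathcal{L}(\mathbf{H}) &= O(\mathbf{H}) - \mathrm{tr}\!\left(\mathbf{\Psi}\mathbf{H}^\top\right) \\
\frac{\partial \mathcal{L}}{\partial \mathbf{H}} &= -2\,\mathbf{K}\mathbf{W}
  + 2\,\mathbf{H}\mathbf{W}^\top\mathbf{K}\mathbf{W} + 2\alpha\,\mathbf{L}\mathbf{H} - \mathbf{\Psi} = 0 \\
0 &= \left(-\mathbf{K}\mathbf{W} + \mathbf{H}\mathbf{W}^\top\mathbf{K}\mathbf{W}
  + \alpha\,\mathbf{L}\mathbf{H}\right)_{ij} \mathbf{H}_{ij} \\
\mathbf{H}_{ij} &\leftarrow \mathbf{H}_{ij}
  \sqrt{\frac{\left[(\mathbf{K}\mathbf{W})^{+} + \mathbf{H}(\mathbf{W}^\top\mathbf{K}\mathbf{W})^{-}
  + \alpha\,\mathbf{G}\mathbf{H}\right]_{ij}}
  {\left[(\mathbf{K}\mathbf{W})^{-} + \mathbf{H}(\mathbf{W}^\top\mathbf{K}\mathbf{W})^{+}
  + \alpha\,\mathbf{D}\mathbf{H}\right]_{ij}}} \\
\mathbf{W} &= \mathbf{H}\left(\mathbf{H}^\top\mathbf{H}\right)^{-1}
\end{aligned}
```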
Algorithm 1 presents the whole MISC procedure. Line 1 computes the source matrix $\mathbf{S}$ via independent component analysis; Line 2 merges the subspaces according to Eq. (2) using agglomerative clustering, and saves the MDL cost $L(\mathcal{D}, \mathcal{M})$ for each merge; Line 3 chooses the best set of subspaces with the minimum MDL cost $L(\mathcal{D}, \mathcal{M})$ through a sorting operation; Lines 5-9 cluster data for each subspace through KGSNMF; Lines 10-18 give the procedure of KGSNMF.
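The KGSNMF inner loop can be sketched as follows. This is not the authors' released implementation: it assumes the multiplicative update for the assignment matrix derived from the kernelized objective in the style of [Ding, Li, and Jordan2010], a pseudo-inverse closed form for the weight matrix, and a Gaussian kernel. All function and parameter names (`kgsnmf`, `solve_w`, `alpha`, `eps`) are hypothetical.

```python
import numpy as np


def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix for samples stored as columns of X (d x n)."""
    sq = np.sum(X ** 2, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def pos(M):
    """Element-wise positive part (|M| + M) / 2."""
    return (np.abs(M) + M) / 2.0


def neg(M):
    """Element-wise negative part (|M| - M) / 2."""
    return (np.abs(M) - M) / 2.0


def objective(K, G, W, H, alpha):
    """Kernelized reconstruction error plus graph regularizer."""
    L = np.diag(G.sum(axis=1)) - G
    fit = (np.trace(K) - 2.0 * np.trace(K @ W @ H.T)
           + np.trace(H @ W.T @ K @ W @ H.T))
    return fit + alpha * np.trace(H.T @ L @ H)


def solve_w(H):
    """Assumed closed-form weight update W = H (H^T H)^+."""
    return H @ np.linalg.pinv(H.T @ H)


def kgsnmf(K, G, k, alpha=1.0, n_iter=200, seed=0, eps=1e-10):
    """Alternate the W step and the multiplicative H step; H stays nonnegative."""
    rng = np.random.default_rng(seed)
    H = rng.random((K.shape[0], k)) + 0.1     # strictly positive start
    D = np.diag(G.sum(axis=1))
    history = []
    for _ in range(n_iter):
        W = solve_w(H)
        KW = K @ W
        WKW = W.T @ KW
        num = pos(KW) + H @ neg(WKW) + alpha * (G @ H)
        den = neg(KW) + H @ pos(WKW) + alpha * (D @ H)
        H = H * np.sqrt(num / (den + eps))    # eps guards against division by zero
        history.append(objective(K, G, solve_w(H), H, alpha))
    return solve_w(H), H, history
```

A hard clustering can then be read off, for instance with `np.argmax(H, axis=1)`, which is a simple stand-in for clustering the rows of the assignment matrix.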
The complexity of ISA is and the complexity of MDL is (for each merge). Since we need to merge the subspaces for at most times, the overall time complexity of the first stage is . For the second stage, MISC takes time to construct the -nearest neighbor graph. Assuming the multiplicative updates stop after iterations and the number of clusters is , then the cost for KGSNMF is . In summary, the overall time complexity of MISC is .
Experiments on synthetic data
We first conduct two types of experiments on synthetic data. The first type shows that MISC can find multiple independent subspaces, and the second type shows that our KGSNMF has a better clustering performance than SNMF.
The first synthetic dataset contains four subspaces consisting of samples with features: the first subspace contains four clusters, corresponding to the shapes of the digits ‘2’, ‘0’, ‘1’, ‘9’ (Fig. 1(a)); the second subspace also contains four clusters, corresponding to the shapes of the letters ‘A’ (three shapes) and ‘I’ (Fig. 1(b)); the third one contains six clusters generated by a Gaussian distribution (Fig. 1(c)); the last one contains two clusters, which are non-linearly separable (Fig. 1(d)). To ensure the non-redundancy among the four subspaces, we randomly permute the sample index in each subspace before merging them into a full space. Note that the synthetic data is diverse; it includes subspaces with the same scale, such as the first and the second subspaces, as well as subspaces with different scales, such as the second, third, and fourth subspaces. We choose the Gaussian heat kernel as the kernel function and the kernel width is set to the standard variance. Following the settings of GNMF in [Cai et al.2011], we use 0-1 weighting and adopt the neighborhood size to compute the graph adjacency matrix, and then set the regularization parameter in Eq. (8). We apply MISC on the first synthetic dataset and plot the found subspace views and clustering results in the last four subfigures of Fig. 1.
The first view shown in Fig. 1(e) corresponds to the second original subspace; the second view shown in Fig. 1(f) corresponds to the first original subspace; the third view shown in Fig. 1(g) corresponds to the third original subspace; and the fourth view shown in Fig. 1(h) corresponds to the fourth original subspace. Due to the ISA procedure, the original feature space has been normalized and converted into the new space, so the four original subspaces are similar to the four subspaces found by MISC, but not identical. The relative position of each cluster in the new subspace is still the same as before, but the new subspaces are rotated and stretched because ICA tries to find subspaces which are linear combinations of the original ones. For each subspace, we use KGSNMF to cluster the data. KGSNMF correctly identifies the clusters for the first, third, and fourth views; the clustering of the second view is close to the original one. Since KGSNMF accounts for the intrinsic geometric structure and for non-linearly separable clusters, it obtains good clustering results on both non-linearly separable and spherical clusters.
The second and third synthetic datasets are collected from the Fundamental Clustering Problem Suite (FCPS) (http://www.uni-marburg.de/fb12/datenbionik/downloads/FCPS). We use them to investigate whether KGSNMF achieves a better clustering performance than SNMF. Atom, the second synthetic dataset, consists of samples with three features. It contains two nonlinearly separable clusters with different variance as shown in Fig. 2. Lsun, the third synthetic dataset, consists of samples with two features. It contains three clusters with different variance and inner-cluster distance as shown in Fig. 3. We choose a Gaussian kernel and set the regularization parameter for KGSNMF and GSNMF as before. The clustering results on Atom are plotted in Fig. 2, and we can see that both KSNMF (KGSNMF without the graph regularization term) and KGSNMF correctly separate the two clusters, while $k$-means, SNMF, and GSNMF do not. This is because the introduced kernel function can map the nonlinearly separable space to a high-dimensional linearly separable space. The clustering results for Lsun are shown in Fig. 3. $k$-means, SNMF, GSNMF, and KSNMF do not cluster the data very well. $k$-means, SNMF, and GSNMF are all influenced by the distribution of the clusters at the bottom. KSNMF can mitigate the impact, but it still cannot perfectly separate the clusters, whereas KGSNMF separates them correctly. Overall, KSNMF achieves good clustering results especially on nonlinearly separable clusters, such as on Atom. The impact of different structures could be alleviated to some extent on both linearly and nonlinearly separable data. The embedded graph regularized term can better represent the details of the intrinsic geometry of the data; as such KGSNMF obtains better clustering results than KSNMF.
Experiments on real-world datasets
We test MISC on four real-world datasets widely used for multiple clustering, including a color image dataset, two gray image datasets, and a text dataset.
Amsterdam Library of Object Images dataset. The ALOI dataset (http://aloi.science.uva.nl/) consists of images of 1000 common objects taken from different angles and under various illumination conditions. We have chosen four objects: green box, red box, tennis ball, and red ball, with different colors and shapes from different viewing directions for a total of 288 images (Fig. 4). Following the preprocessing in [Dalal and Triggs2005], we extracted 840 features (https://github.com/adikhosla/feature-extraction) and further applied Principal Component Analysis (PCA) to reduce the number of features to 49, which retain more than 90% of the variance of the original data.
Dancing Stick Figures dataset. The DSF dataset [Günnemann et al.2014] consists of 900 samples of images with random noise across nine stick figures (Fig. 5). The nine raw stick figures are obtained by arranging the upper and lower body in three different positions each; this provides two views for the dataset. As for ALOI, we also applied PCA as preprocessing and retained more than 90% of the data’s variance.
CMUface dataset. The CMUface dataset (http://archive.ics.uci.edu/ml/datasets.html) contains 640 grey images of 20 individuals with varying poses (up, straight, right, and left). As such, it can be clustered either by identity or by pose. Again, we apply PCA to reduce the dimensionality while retaining more than 90% of the data’s variance.
WebKB dataset. The WebKB dataset (http://www.cs.cmu.edu/~webkb/) contains html documents from four universities: Cornell University; University of Texas, Austin; University of Washington; and University of Wisconsin, Madison. The pages are additionally labeled as being from 4 categories: course, faculty, project, and student. We preprocessed the data by removing rare words, stop words, and words with a small variance, retaining 1041 samples and 456 words.
We compare MISC with MetaC, MSC, OSC, COALA, De-means, ADFT, MNMF, mSC, NBMC, and NBMC-OFV (all methods are discussed in the related work section). The input parameters of these algorithms were set or optimized as suggested by the authors. We also set the number of subspaces as 2 and the number of clusters as that of true labels of CMUface and WebKB datasets, respectively. We use the widely-known F1-measure (F1) and normalized mutual information (NMI) to evaluate the quality of the clusterings. Since we do not know which view a clustering corresponds to, we compare each clustering with the true label under each view, and finally compute the confusion matrix and report the results (average of ten independent repetitions) in Table 1.
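As an illustration of comparing one clustering against the true labels of each view, the following sketch computes NMI with NumPy only. It assumes normalization by the geometric mean of the two label entropies, which is one of several common NMI variants and may differ from the variant used in the experiments. The function name `nmi` is hypothetical.

```python
import numpy as np


def nmi(labels_a, labels_b):
    """Normalized mutual information between two label vectors.

    Assumption: MI is normalized by sqrt(H(a) * H(b)); returns 0.0 when
    either labeling has a single cluster.
    """
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)             # contingency counts
    joint /= a.size                           # joint distribution
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa * np.log(pa))
    hb = -np.sum(pb * np.log(pb))
    denom = np.sqrt(ha * hb)
    return float(mi / denom) if denom > 0 else 0.0


def score_against_views(clustering, view_labels):
    """NMI of one clustering against the true labels of every view."""
    return [nmi(clustering, truth) for truth in view_labels]
```

A clustering that matches one view and is unrelated to the other would score close to 1 against the first and close to 0 against the second, which is the pattern a pair of non-redundant clusterings should produce.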
Fig. 6 shows the two clusterings found by MISC on the ALOI dataset: one reveals the subspace corresponding to shape (Fig. 6(a)), and the other subspace corresponding to color (Fig. 6(b)). Similarly, Fig. 7 gives the two clusterings of MISC on the DSF dataset: one reveals the subspace corresponding to the upper-body (Fig. 7(a)), and the other subspace representing the lower-body (Fig. 7(b)). Fig. 8 provides two clusterings of MISC on the CMUface dataset: one represents the clustering according to ‘identity’ (Fig. 8(a)) and the other according to ‘pose’ (Fig. 8(b)). All the figures confirm that MISC is capable of finding meaningful clusterings embedded in the respective subspaces.
MISC gives the best results across both evaluation metrics on each view for ALOI and DSF. Although the competitive algorithms can also find two different clusterings on these two datasets, the corresponding F1 and NMI values are smaller (by at least 20%) than those of MISC. The reason is that MISC first uses ISA to convert the full feature space into two independent subspaces, and then clusters the data in each subspace. In contrast, De-means and MNMF find two clusterings in the full feature space, and do not perform well when the actual clusterings are embedded in subspaces. In addition, although ADFT and OSC do explore the second clustering with respect to a feature-weighted subspace or a feature-transformed subspace, this clustering is still affected by the reference one, which is computed in the full space. In contrast, the second clustering explored by MISC is independent from the first one, and has a meaningful interpretation.
MISC does not perfectly identify the two given clusterings for the CMUface and WebKB datasets. Nevertheless, it can still distinguish the two different views on each dataset. It is possible that these different views embedded in subspaces share some common features and are not completely independent; as such, the two subspaces found by MISC do not quite correspond to the original views. The other methods (De-means, ADFT, and MNMF) cannot identify the two views well, because both of their clusterings are close to the ‘identity’ clustering and far away from the ‘pose’ one. Compared to MISC, De-means finds multiple clusterings in the full space; as such, it cannot discover clusters embedded in subspaces. MNMF, ADFT, and OSC find multiple clusterings sequentially, thus subsequent ones depend on the formerly found ones. NBMC-OFV achieves the best results on CMUface and WebKB. The reason is that NBMC-OFV can discover multiple partially overlapping views, whereas the other algorithms cannot.
In summary, the advantages of MISC can be attributed to the explored multiple independent subspaces and to the kernel graph regularized semi-nonnegative matrix factorization, which contribute to the finding of low-redundant clusterings of high-quality.
In this paper, we study how to find multiple clusterings from data, and present an approach called MISC. MISC assumes that diverse clusterings may be embedded in different subspaces. It first uses independent subspace analysis to explore statistically independent subspaces, and it determines the number of subspaces and the number of clusters in each subspace. Next, it introduces a kernel graph regularized semi-nonnegative matrix factorization method to find linearly and non-linearly separable clusters in the subspaces. Experimental results on synthetic and real-world data demonstrate that MISC can identify meaningful alternative clusterings, and it also outperforms state-of-the-art multiple clustering methods. In the future, we plan to investigate solutions to find alternative clusterings embedded in overlapping subspaces. The code for MISC is available at http://mlda.swu.edu.cn/codes.php?name=MISC.
This work is supported by NSFC (61872300, 61741217, 61873214, and 61871020), NSF of CQ CSTC (cstc2018jcyjAX0228, cstc2016jcyjA0351, and CSTC2016SHMSZX0824), the Open Research Project of Hubei Key Laboratory of Intelligent Geo-Information Processing (KLIGIP-2017A05), and the National Science and Technology Support Program (2015BAK41B04).
- [Bae and Bailey2006] Bae, E., and Bailey, J. 2006. Coala: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. In ICDM, 53–62.
- [Bailey2013] Bailey, J. 2013. Alternative clustering analysis: A review. Data Clustering: Algorithms and Applications 535–550.
- [Boyd and Vandenberghe2004] Boyd, S., and Vandenberghe, L. 2004. Convex optimization. Cambridge University Press.
- [Cai et al.2011] Cai, D.; He, X.; Han, J.; and Huang, T. S. 2011. Graph regularized nonnegative matrix factorization for data representation. TPAMI 33(8):1548–1560.
- [Caruana et al.2006] Caruana, R.; Elhawary, M.; Nguyen, N.; and Smith, C. 2006. Meta clustering. In ICDM, 107–118.
- [Cui, Fern, and Dy2007] Cui, Y.; Fern, X. Z.; and Dy, J. G. 2007. Non-redundant multi-view clustering via orthogonalization. In ICDM, 133–142.
- [Dalal and Triggs2005] Dalal, N., and Triggs, B. 2005. Histograms of oriented gradients for human detection. In CVPR, 886–893.
- [Dang and Bailey2010] Dang, X. H., and Bailey, J. 2010. Generation of alternative clusterings using the cami approach. In SDM, 118–129.
- [Davidson and Qi2008] Davidson, I., and Qi, Z. 2008. Finding alternative clusterings using constraints. In ICDM, 773–778.
- [Ding, Li, and Jordan2010] Ding, C. H.; Li, T.; and Jordan, M. I. 2010. Convex and semi-nonnegative matrix factorizations. TPAMI 32(1):45–55.
- [Ghahramani and Beal1999] Ghahramani, Z., and Beal, M. J. 1999. Variational inference for bayesian mixtures of factor analysers. In NIPS, 449–455.
- [Gretton et al.2005] Gretton, A.; Bousquet, O.; Smola, A.; and Schölkopf, B. 2005. Measuring statistical dependence with hilbert-schmidt norms. 3734(2005):63–77.
- [Guan et al.2010] Guan, Y.; Dy, J. G.; Niu, D.; and Ghahramani, Z. 2010. Variational inference for nonparametric multiple clustering. In MultiClust Workshop, KDD.
- [Günnemann et al.2014] Günnemann, S.; Färber, I.; Rüdiger, M.; and Seidl, T. 2014. Smvc: semi-supervised multi-view clustering in subspace projections. In KDD, 253–262.
- [Hu et al.2015] Hu, J.; Qian, Q.; Pei, J.; Jin, R.; and Zhu, S. 2015. Finding multiple stable clusterings. In ICDM, 171–180.
- [Hyvärinen and Köster2006] Hyvärinen, A., and Köster, U. 2006. Fastisa: A fast fixed-point algorithm for independent subspace analysis. In EsANN, 371–376.
- [Hyvärinen, Hoyer, and Inki2001] Hyvärinen, A.; Hoyer, P. O.; and Inki, M. 2001. Topographic independent component analysis. Neural Computation 13(7):1527–1558.
- [Jain, Meka, and Dhillon2008] Jain, P.; Meka, R.; and Dhillon, I. S. 2008. Simultaneous unsupervised learning of disparate clusterings. In SDM, 858–869.
- [Lee and Seung1999] Lee, D. D., and Seung, H. S. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791.
- [Niu, Dy, and Ghahramani2012] Niu, D.; Dy, J.; and Ghahramani, Z. 2012. A nonparametric bayesian model for multiple clustering with overlapping feature views. In Artificial Intelligence and Statistics, 814–822.
- [Niu, Dy, and Jordan2010] Niu, D.; Dy, J. G.; and Jordan, M. I. 2010. Multiple non-redundant spectral clustering views. In ICML, 831–838.
- [Parsons, Haque, and Liu2004] Parsons, L.; Haque, E.; and Liu, H. 2004. Subspace clustering for high dimensional data. In KDD, 90–105.
- [Rissanen2007] Rissanen, J. 2007. Information and complexity in statistical modeling. Springer Science & Business Media.
- [Szabó, Póczos, and Lőrincz2012] Szabó, Z.; Póczos, B.; and Lőrincz, A. 2012. Separation theorem for independent subspace analysis and its consequences. Pattern Recognition 45(4):1782–1791.
- [Wang et al.2018] Wang, X.; Yu, G.; Domeniconi, C.; Wang, J.; Yu, Z.; and Zhang, Z. 2018. Multiple co-clusterings. In ICDM, 1–6.
- [Welling2006] Welling, M. 2006. Bayesian k-means as a “maximization-expectation” algorithm. In SDM, 474–478.
- [Xing et al.2003] Xing, E. P.; Jordan, M. I.; Russell, S. J.; and Ng, A. Y. 2003. Distance metric learning with application to clustering with side-information. In NIPS, 521–528.
- [Yang and Zhang2017] Yang, S., and Zhang, L. 2017. Non-redundant multiple clustering by nonnegative matrix factorization. Machine Learning 106(5):695–712.
- [Ye et al.2016] Ye, W.; Maurus, S.; Hubig, N.; and Plant, C. 2016. Generalized independent subspace clustering. In ICDM, 569–578.
- [Zipf1949] Zipf, G. K. 1949. Human behavior and the principle of least effort. The Southwestern Social Science Quarterly 30(2):147–149.