# Information-Maximization Clustering based on Squared-Loss Mutual Information

Information-maximization clustering learns a probabilistic classifier in an unsupervised manner so that mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it only involves continuous optimization of model parameters, which is substantially easier to solve than discrete optimization of cluster assignments. However, existing methods still involve non-convex optimization problems, and therefore finding a good local optimal solution is not straightforward in practice. In this paper, we propose an alternative information-maximization clustering method based on a squared-loss variant of mutual information. This novel approach gives a clustering solution analytically in a computationally efficient way via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.


## 1 Introduction

The goal of clustering is to classify data samples into disjoint groups in an unsupervised manner. K-means (MacQueen, 1967) is a classic but still popular clustering algorithm. However, since k-means only produces linearly separated clusters, its usefulness is rather limited in practice.

To cope with this problem, various non-linear clustering methods have been developed. Kernel k-means (Girolami, 2002) performs k-means in a feature space induced by a reproducing kernel function (Schölkopf and Smola, 2002). Spectral clustering (Shi and Malik, 2000; Ng et al., 2002) first unfolds non-linear data manifolds by a spectral embedding method, and then performs k-means in the embedded space. Blurring mean-shift (Fukunaga and Hostetler, 1975; Carreira-Perpiñán, 2006) uses a non-parametric kernel density estimator for modeling the data-generating probability density, and finds clusters based on the modes of the estimated density.

Discriminative clustering learns a discriminative classifier for separating clusters, where class labels are also treated as parameters to be optimized (Xu et al., 2005; Bach and Harchaoui, 2008). Dependence-maximization clustering determines cluster assignments so that their dependence on input data is maximized (Song et al., 2007; Faivishevsky and Goldberger, 2010).

These non-linear clustering techniques would be capable of handling highly complex real-world data. However, they suffer from a lack of objective model selection strategies ('model selection' in this paper refers to the choice of tuning parameters in kernel functions or similarity measures, not the choice of the number of clusters). More specifically, the above non-linear clustering methods contain tuning parameters such as the width of Gaussian functions and the number of nearest neighbors in kernel functions or similarity measures, and these tuning parameter values need to be determined manually in an unsupervised manner. The problem of learning similarities/kernels was addressed in earlier works (Meila and Shi, 2001; Shental et al., 2003; Cour et al., 2005; Bach and Jordan, 2006), but they considered supervised setups, i.e., labeled samples are assumed to be given. Zelnik-Manor and Perona (2005) provided a useful unsupervised heuristic to determine the similarity in a data-dependent way. However, it still requires the number of nearest neighbors to be determined manually (although the magic number '7' was shown to work well in their experiments).

Another line of clustering framework called information-maximization clustering exhibited the state-of-the-art performance (Agakov and Barber, 2006; Gomes et al., 2010). In this information-maximization approach, probabilistic classifiers such as a kernelized Gaussian classifier (Agakov and Barber, 2006) and a kernel logistic regression classifier (Gomes et al., 2010) are learned so that mutual information (MI) between feature vectors and cluster assignments is maximized in an unsupervised manner. A notable advantage of this approach is that classifier training is formulated as a continuous optimization problem, which is substantially simpler than discrete optimization of cluster assignments. Indeed, classifier training can be carried out in a computationally efficient manner by a gradient method (Agakov and Barber, 2006) or a quasi-Newton method (Gomes et al., 2010). Furthermore, Agakov and Barber (2006) provided a model selection strategy based on the information-maximization principle. Thus, kernel parameters can be systematically optimized in an unsupervised way.

However, in the above MI-based clustering approach, the optimization problems are non-convex, and finding a good local optimal solution is not straightforward in practice. The goal of this paper is to overcome this problem by providing a novel information-maximization clustering method. More specifically, we propose to employ a variant of MI called squared-loss MI (SMI), and develop a new clustering algorithm whose solution can be computed analytically in a computationally efficient way via kernel eigenvalue decomposition. Furthermore, for kernel parameter optimization, we propose to use a non-parametric SMI estimator called least-squares MI (LSMI; Suzuki et al., 2009), which was proved to achieve the optimal convergence rate with analytic-form solutions. Through experiments on various real-world datasets such as images, natural languages, accelerometric sensors, and speech, we demonstrate the usefulness of the proposed clustering method.

The rest of this paper is structured as follows. In Section 2, we describe our proposed information-maximization clustering method based on SMI. Then the proposed method is compared with existing clustering methods qualitatively in Section 3 and quantitatively in Section 4. Finally, this paper is concluded in Section 5.

## 2 Information-Maximization Clustering with Squared-Loss Mutual Information

In this section, we describe our proposed clustering algorithm.

### 2.1 Formulation of Information-Maximization Clustering

Suppose we are given $d$-dimensional i.i.d. feature vectors of size $n$,

$$\{x_i \mid x_i \in \mathbb{R}^d\}_{i=1}^{n},$$

which are assumed to be drawn independently from a distribution with density $p^*(x)$. The goal of clustering is to give cluster assignments,

$$\{y_i \mid y_i \in \{1,\ldots,c\}\}_{i=1}^{n},$$

to the feature vectors $\{x_i\}_{i=1}^{n}$, where $c$ denotes the number of classes. Throughout this paper, we assume that $c$ is known.

In order to solve the clustering problem, we take the information-maximization approach (Agakov and Barber, 2006; Gomes et al., 2010). That is, we regard clustering as an unsupervised classification problem, and learn the class-posterior probability $p^*(y|x)$ so that 'information' between feature vector $x$ and class label $y$ is maximized.

The dependence-maximization approach (Song et al., 2007; Faivishevsky and Goldberger, 2010, see also Section 3.7) is related to, but substantially different from the above information-maximization approach. In the dependence-maximization approach, cluster assignments $\{y_i\}_{i=1}^{n}$ are directly determined so that their dependence on feature vectors $\{x_i\}_{i=1}^{n}$ is maximized. Thus, the dependence-maximization approach intrinsically involves combinatorial optimization with respect to $\{y_i\}_{i=1}^{n}$. On the other hand, the information-maximization approach involves continuous optimization with respect to the parameter $\alpha$ included in a class-posterior model $p(y|x;\alpha)$. This continuous optimization of $\alpha$ is substantially easier to solve than discrete optimization of $\{y_i\}_{i=1}^{n}$.

Another advantage of the information-maximization approach is that it naturally allows out-of-sample clustering based on the discriminative model $p(y|x;\alpha)$, i.e., a cluster assignment for a new feature vector can be obtained based on the learned discriminative model.

### 2.2 Squared-Loss Mutual Information

As an information measure, we adopt squared-loss mutual information (SMI). SMI between feature vector $x$ and class label $y$ is defined by

$$\mathrm{SMI} := \frac{1}{2}\int \sum_{y=1}^{c} p^*(x)p^*(y)\left(\frac{p^*(x,y)}{p^*(x)p^*(y)} - 1\right)^2 \mathrm{d}x, \qquad (1)$$

where $p^*(x,y)$ denotes the joint density of $x$ and $y$, $p^*(x)$ is the marginal density of $x$, and $p^*(y)$ is the marginal probability of $y$. SMI is the Pearson divergence (Pearson, 1900) from $p^*(x,y)$ to $p^*(x)p^*(y)$, while the ordinary MI (Cover and Thomas, 2006),

$$\mathrm{MI} := \int \sum_{y=1}^{c} p^*(x,y)\log\frac{p^*(x,y)}{p^*(x)p^*(y)}\,\mathrm{d}x, \qquad (2)$$

is the Kullback-Leibler divergence (Kullback and Leibler, 1951) from $p^*(x,y)$ to $p^*(x)p^*(y)$. The Pearson divergence and the Kullback-Leibler divergence both belong to the class of Ali-Silvey-Csiszár divergences (also known as $f$-divergences; see Ali and Silvey, 1966; Csiszár, 1967), and thus they share similar properties. For example, SMI is non-negative and takes zero if and only if $x$ and $y$ are statistically independent, as is the ordinary MI.
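To make the definition concrete, SMI can be evaluated in closed form when $x$ is also discrete, since the integral reduces to a finite sum. The following Python sketch is our own illustration (the function name `smi_discrete` is not from the paper); it also demonstrates the independence property stated above:

```python
import numpy as np

def smi_discrete(pxy):
    """Squared-loss mutual information for a discrete joint distribution.

    pxy: 2-D array with pxy[i, j] = p(x=i, y=j), entries summing to one.
    Computes SMI = (1/2) * sum_{x,y} p(x)p(y) * (p(x,y)/(p(x)p(y)) - 1)^2,
    i.e., the discrete analogue of Eq.(1).
    """
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    ratio = pxy / (px * py)               # density-ratio r(x, y)
    return 0.5 * np.sum(px * py * (ratio - 1.0) ** 2)
```

For an independent joint distribution the ratio is identically one and SMI vanishes, while any dependence makes it strictly positive.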

In the existing information-maximization clustering methods (Agakov and Barber, 2006; Gomes et al., 2010, see also Section 3.8), MI is used as the information measure. On the other hand, in this paper, we adopt SMI because it allows us to develop a clustering algorithm whose solution can be computed analytically in a computationally efficient way via kernel eigenvalue decomposition.

### 2.3 Clustering by SMI Maximization

Here, we give a computationally-efficient clustering algorithm based on SMI (1).

Expanding the squared term in Eq.(1), we can express SMI as

$$\mathrm{SMI} = \frac{1}{2}\int \sum_{y=1}^{c} p^*(x)p^*(y)\left(\frac{p^*(x,y)}{p^*(x)p^*(y)}\right)^2 \mathrm{d}x - \int \sum_{y=1}^{c} p^*(x)p^*(y)\frac{p^*(x,y)}{p^*(x)p^*(y)}\,\mathrm{d}x + \frac{1}{2} = \frac{1}{2}\int \sum_{y=1}^{c} p^*(y|x)p^*(x)\frac{p^*(y|x)}{p^*(y)}\,\mathrm{d}x - \frac{1}{2}. \qquad (3)$$

Suppose that the class-prior probability $p^*(y)$ is set to a user-specified value $\pi_y$ for $y=1,\ldots,c$, where $\pi_y > 0$ and $\sum_{y=1}^{c}\pi_y = 1$. Without loss of generality, we assume that $\{\pi_y\}_{y=1}^{c}$ are sorted in the ascending order:

$$\pi_1 \le \cdots \le \pi_c.$$

If $\{\pi_y\}_{y=1}^{c}$ is unknown, we may merely adopt the uniform class-prior distribution:

$$p^*(y) = \frac{1}{c} \ \text{ for } \ y=1,\ldots,c, \qquad (4)$$

which will be non-informative and thus allow us to avoid biasing clustering solutions (such a cluster-balance constraint is often employed in existing clustering algorithms; e.g., Shi and Malik, 2000; Xu et al., 2005; Niu et al., 2011). Substituting $\pi_y$ into $p^*(y)$, we can express Eq.(3) as

$$\frac{1}{2}\int \sum_{y=1}^{c} \frac{1}{\pi_y}\, p^*(y|x)p^*(x)p^*(y|x)\,\mathrm{d}x - \frac{1}{2}. \qquad (5)$$

Let us approximate the class-posterior probability $p^*(y|x)$ by the following kernel model:

$$p(y|x;\alpha) := \sum_{i=1}^{n} \alpha_{y,i} K(x, x_i), \qquad (6)$$

where $\alpha = (\alpha_{1,1},\ldots,\alpha_{c,n})^\top$ is the parameter vector, $\top$ denotes the transpose, and $K(x,x')$ denotes a kernel function with a kernel parameter $t$. In the experiments, we will use a sparse variant of the local-scaling kernel (Zelnik-Manor and Perona, 2005):

$$K(x_i, x_j) = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma_i\sigma_j}\right) & \text{if } x_i \in \mathcal{N}_t(x_j) \text{ or } x_j \in \mathcal{N}_t(x_i),\\[2mm] 0 & \text{otherwise}, \end{cases} \qquad (7)$$

where $\mathcal{N}_t(x)$ denotes the set of $t$ nearest neighbors for $x$ ($t$ is the kernel parameter), $\sigma_i$ is a local scaling factor defined as $\sigma_i = \|x_i - x_i^{(t)}\|$, and $x_i^{(t)}$ is the $t$-th nearest neighbor of $x_i$.
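The sparse local-scaling kernel can be sketched in Python as follows. This is our own illustrative implementation (the name `local_scaling_kernel` and the dense pairwise-distance computation are assumptions, not the authors' MATLAB code):

```python
import numpy as np

def local_scaling_kernel(X, t=7):
    """Sparse local-scaling kernel matrix, a sketch of Eq.(7).

    X: (n, d) array of feature vectors; t: number of nearest neighbors.
    K[i, j] = exp(-||xi - xj||^2 / (2 sigma_i sigma_j)) if xi is among the
    t nearest neighbors of xj or vice versa, and 0 otherwise, where
    sigma_i is the distance from xi to its t-th nearest neighbor.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    order = np.argsort(D, axis=1)                 # column 0 is the point itself
    sigma = D[np.arange(n), order[:, t]]          # distance to t-th neighbor
    K = np.exp(-D ** 2 / (2.0 * np.outer(sigma, sigma)))
    # Keep K[i, j] only when j is among the t nearest neighbors of i, or vice versa.
    nn = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), t)
    nn[rows, order[:, 1:t + 1].ravel()] = True
    K *= (nn | nn.T)
    return K
```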

Further approximating the expectation with respect to $p^*(x)$ included in Eq.(5) by the empirical average of samples $\{x_i\}_{i=1}^{n}$, we arrive at the following SMI approximator:

$$\widehat{\mathrm{SMI}} := \frac{1}{2n}\sum_{y=1}^{c} \frac{1}{\pi_y}\, \alpha_y^\top K^2 \alpha_y - \frac{1}{2}, \qquad (8)$$

where $\alpha_y := (\alpha_{y,1},\ldots,\alpha_{y,n})^\top$ and $K_{i,j} := K(x_i, x_j)$.

For each cluster $y$, we maximize $\widehat{\mathrm{SMI}}$ under $\|\alpha_y\| = 1$ (note that this unit-norm constraint is not essential since the obtained solution is renormalized later). Since this is the Rayleigh quotient, the maximizer is given by the normalized principal eigenvector of $K$ (Horn and Johnson, 1985). To avoid all the solutions being reduced to the same principal eigenvector, we impose their mutual orthogonality: $\alpha_y^\top \alpha_{y'} = 0$ for $y \ne y'$. Then the solutions are given by the normalized eigenvectors $\phi_1,\ldots,\phi_c$ associated with the eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$ of $K$. Since the sign of $\phi_y$ is arbitrary, we set the sign as

$$\tilde{\phi}_y = \phi_y \times \mathrm{sign}(\phi_y^\top 1_n),$$

where $\mathrm{sign}(\cdot)$ denotes the sign of a scalar and $1_n$ denotes the $n$-dimensional vector with all ones.

On the other hand, since

$$p^*(y) = \int p^*(y|x)p^*(x)\,\mathrm{d}x \approx \frac{1}{n}\sum_{i=1}^{n} p(y|x_i;\alpha) = \frac{1}{n}\alpha_y^\top K 1_n,$$

and the class-prior probability $p^*(y)$ was set to $\pi_y$ for $y=1,\ldots,c$, we have the following normalization condition:

$$\frac{1}{n}\alpha_y^\top K 1_n = \pi_y.$$

Furthermore, probability estimates should be non-negative, which can be achieved by rounding up negative outputs to zero.

Taking these normalization and non-negativity issues into account, cluster assignment $y_i$ for $x_i$ is determined as the maximizer of the approximation of $p(y|x_i)$:

$$y_i = \mathop{\mathrm{argmax}}_{y} \frac{\pi_y\,[\max(0_n, K\tilde{\phi}_y)]_i}{\max(0_n, K\tilde{\phi}_y)^\top 1_n} = \mathop{\mathrm{argmax}}_{y} \frac{\pi_y\,[\max(0_n, \tilde{\phi}_y)]_i}{\max(0_n, \tilde{\phi}_y)^\top 1_n},$$

where the max operation for vectors is applied in the element-wise manner and $[\cdot]_i$ denotes the $i$-th element of a vector. Note that we used $K\tilde{\phi}_y = \lambda_y \tilde{\phi}_y$ in the above derivation. For out-of-sample prediction, cluster assignment $y'$ for a new sample $x'$ may be obtained as

$$y' = \mathop{\mathrm{argmax}}_{y} \frac{\pi_y \max\!\left(0, \sum_{i=1}^{n} \tilde{\phi}_{y,i} K(x', x_i)\right)}{\max(0_n, \tilde{\phi}_y)^\top 1_n}.$$

We call the above method SMI-based clustering (SMIC).
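Putting the pieces together, the core of SMIC (eigendecomposition of the kernel matrix, sign fixing, rounding up negative entries, and prior-weighted assignment) can be sketched as follows. This is an illustrative Python sketch of ours, defaulting to the uniform prior of Eq.(4); it is not the authors' implementation:

```python
import numpy as np

def smic(K, c, pi=None):
    """SMI-based clustering (SMIC) from a kernel matrix -- a minimal sketch.

    K: (n, n) symmetric kernel matrix; c: number of clusters;
    pi: class-prior probabilities (uniform if None).
    Returns cluster labels in {0, ..., c-1}.
    """
    pi = np.full(c, 1.0 / c) if pi is None else np.asarray(pi, float)
    # Principal eigenvectors of K maximize the Rayleigh quotients in Eq.(8).
    eigvals, eigvecs = np.linalg.eigh(K)        # eigenvalues in ascending order
    phi = eigvecs[:, ::-1][:, :c]               # top-c eigenvectors
    # Fix the arbitrary sign so that each eigenvector has a non-negative sum.
    phi *= np.where(phi.sum(axis=0) >= 0, 1.0, -1.0)
    # Round negative entries up to zero, normalize, and weight by class priors.
    pos = np.maximum(phi, 0.0)
    scores = pi * pos / (pos.sum(axis=0) + 1e-12)
    return scores.argmax(axis=1)
```

On a kernel matrix that is nearly block-diagonal (well-separated clusters), the top eigenvectors localize on the blocks, so the prior-weighted argmax recovers the blocks.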

### 2.4 Kernel Parameter Choice by SMI Maximization

The solution of SMIC depends on the choice of the kernel parameter $t$ included in the kernel function $K(x,x')$. Since SMIC was developed in the framework of SMI maximization, it would be natural to determine the kernel parameter so as to maximize SMI. A direct approach is to use the SMI estimator (8) also for kernel parameter choice. However, this direct approach is not favorable because (8) is an unsupervised SMI estimator (i.e., SMI is estimated only from unlabeled samples $\{x_i\}_{i=1}^{n}$). On the other hand, in the model selection stage, we have already obtained labeled samples $\{(x_i, y_i)\}_{i=1}^{n}$, and thus supervised estimation of SMI is possible. For supervised SMI estimation, a non-parametric SMI estimator called least-squares mutual information (LSMI; Suzuki et al., 2009) was shown to achieve the optimal convergence rate. For this reason, we propose to use LSMI for model selection, instead of (8).

LSMI is an estimator of SMI based on paired samples $\{(x_i, y_i)\}_{i=1}^{n}$. The key idea of LSMI is to learn the following density-ratio function,

$$r^*(x,y) := \frac{p^*(x,y)}{p^*(x)p^*(y)}, \qquad (9)$$

without going through density estimation of $p^*(x,y)$, $p^*(x)$, and $p^*(y)$. More specifically, let us employ the following density-ratio model:

$$r(x,y;\theta) := \sum_{\ell: y_\ell = y} \theta_\ell L(x, x_\ell), \qquad (10)$$

where $\theta = (\theta_1,\ldots,\theta_n)^\top$ and $L(x,x')$ is a kernel function with a kernel parameter $\gamma$. In the experiments, we will use the Gaussian kernel:

$$L(x,x') = \exp\left(-\frac{\|x-x'\|^2}{2\gamma^2}\right), \qquad (11)$$

where the Gaussian width $\gamma$ is the kernel parameter.

The parameter $\theta$ in the above density-ratio model is learned so that the following squared error is minimized:

$$\min_{\theta}\ \frac{1}{2}\int \sum_{y=1}^{c} \big(r(x,y;\theta) - r^*(x,y)\big)^2\, p^*(x)p^*(y)\,\mathrm{d}x. \qquad (12)$$

Let $\theta_y$ be the parameter vector corresponding to the kernel bases $\{L(x, x_\ell^{(y)})\}_{\ell=1}^{n_y}$, i.e., $\theta_y$ is the sub-vector of $\theta$ consisting of indices $\{\ell \mid y_\ell = y\}$. Let $n_y$ be the length of $\theta_y$, i.e., the number of samples in cluster $y$. Then an empirical and regularized version of the optimization problem (12) is given for each $y$ as follows:

$$\min_{\theta_y}\left[\frac{1}{2}\theta_y^\top \widehat{H}^{(y)} \theta_y - \theta_y^\top \widehat{h}^{(y)} + \frac{\delta}{2}\theta_y^\top \theta_y\right], \qquad (13)$$

where $\delta$ ($\ge 0$) is the regularization parameter. $\widehat{H}^{(y)}$ is the $n_y \times n_y$ matrix and $\widehat{h}^{(y)}$ is the $n_y$-dimensional vector defined as

$$\widehat{H}^{(y)}_{\ell,\ell'} := \frac{n_y}{n^2}\sum_{i=1}^{n} L(x_i, x_\ell^{(y)})\, L(x_i, x_{\ell'}^{(y)}), \qquad \widehat{h}^{(y)}_{\ell} := \frac{1}{n}\sum_{i: y_i = y} L(x_i, x_\ell^{(y)}),$$

where $x_\ell^{(y)}$ is the $\ell$-th sample in class $y$ (which corresponds to $\theta_\ell^{(y)}$).

A notable advantage of LSMI is that the solution $\widehat{\theta}^{(y)}$ can be computed analytically as

$$\widehat{\theta}^{(y)} = \big(\widehat{H}^{(y)} + \delta I\big)^{-1}\widehat{h}^{(y)}.$$

Then a density-ratio estimator is obtained analytically as follows (note that, in the original LSMI paper (Suzuki et al., 2009), the entire parameter $\theta$ for all classes was optimized at once; we found that, when the density-ratio model (10) is used for SMI approximation, exactly the same solution can be computed more efficiently by class-wise optimization, and in our preliminary experiments this class-wise optimization significantly reduced the computation time. The original LSMI is applicable to more general setups such as regression, multi-label classification, and structured-output prediction; our speedup was brought by focusing on classification scenarios where Kronecker's delta function is used as the kernel for class labels in the density-ratio model (10)):

$$\widehat{r}(x,y) = \sum_{\ell=1}^{n_y} \widehat{\theta}^{(y)}_{\ell} L(x, x_\ell^{(y)}).$$
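The class-wise analytic solution can be sketched in Python as follows. This is our own illustration using the Gaussian kernel (11); the function names are hypothetical:

```python
import numpy as np

def fit_density_ratio(X, y, gamma=1.0, delta=0.1):
    """Class-wise analytic LSMI density-ratio fit, a sketch of Eq.(13).

    Returns a function r_hat(x, label) estimating p(x,y) / (p(x)p(y)),
    using Gaussian kernels centered at the samples of each class.
    """
    X = np.asarray(X, float)
    n = len(X)
    models = {}
    for label in np.unique(y):
        C = X[y == label]                     # kernel centers for this class
        ny = len(C)
        # Phi[i, l] = L(x_i, x_l^{(y)}) for all samples x_i.
        diff2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        Phi = np.exp(-diff2 / (2.0 * gamma ** 2))
        H = (ny / n ** 2) * Phi.T @ Phi       # H-hat^{(y)}
        h = Phi[y == label].sum(axis=0) / n   # h-hat^{(y)}
        theta = np.linalg.solve(H + delta * np.eye(ny), h)
        models[label] = (C, theta)

    def r_hat(x, label):
        C, theta = models[label]
        k = np.exp(-((C - x) ** 2).sum(-1) / (2.0 * gamma ** 2))
        return float(k @ theta)
    return r_hat
```

For a matched pair (a point near the samples of its own class) the estimated ratio is large, roughly $1/p^*(y)$ when the class fully determines the location of $x$, while for a mismatched pair it is close to zero.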

The accuracy of the above least-squares density-ratio estimator depends on the choice of the kernel parameter $\gamma$ included in $L(x,x')$ and the regularization parameter $\delta$ in Eq.(13). Suzuki et al. (2009) showed that these tuning parameter values can be systematically optimized based on cross-validation as follows: First, the samples $\mathcal{Z} = \{(x_i, y_i)\}_{i=1}^{n}$ are divided into $M$ disjoint subsets $\{\mathcal{Z}_m\}_{m=1}^{M}$ of approximately the same size (we use $M=5$ in the experiments). Then a density-ratio estimator $\widehat{r}_m(x,y)$ is obtained using $\mathcal{Z}\setminus\mathcal{Z}_m$ (i.e., all samples without $\mathcal{Z}_m$), and its out-of-sample error (which corresponds to Eq.(12) without the irrelevant constant) for the hold-out samples $\mathcal{Z}_m$ is computed as

$$\mathrm{CV}_m := \frac{1}{2|\mathcal{Z}_m|^2}\sum_{x,y \in \mathcal{Z}_m} \widehat{r}_m(x,y)^2 - \frac{1}{|\mathcal{Z}_m|}\sum_{(x,y) \in \mathcal{Z}_m} \widehat{r}_m(x,y),$$

where the first sum is over all pairwise combinations of $x$ and $y$ in $\mathcal{Z}_m$, while the second sum is over the observed joint pairs. This procedure is repeated for $m=1,\ldots,M$, and the average of the above hold-out error over all $m$ is computed as

$$\mathrm{CV} := \frac{1}{M}\sum_{m=1}^{M} \mathrm{CV}_m.$$

Finally, the kernel parameter $\gamma$ and the regularization parameter $\delta$ that minimize the average hold-out error are chosen as the most suitable ones.

Finally, based on the following expression of SMI (1),

$$\mathrm{SMI} = -\frac{1}{2}\int \sum_{y=1}^{c} r^*(x,y)^2\, p^*(x)p^*(y)\,\mathrm{d}x + \int \sum_{y=1}^{c} r^*(x,y)\, p^*(x,y)\,\mathrm{d}x - \frac{1}{2},$$

an SMI estimator called LSMI is given as follows:

$$\mathrm{LSMI} := -\frac{1}{2n^2}\sum_{i,j=1}^{n} \widehat{r}(x_i, y_j)^2 + \frac{1}{n}\sum_{i=1}^{n} \widehat{r}(x_i, y_i) - \frac{1}{2}, \qquad (14)$$

where $\widehat{r}(x,y)$ is the density-ratio estimator obtained above. Since $\widehat{r}(x,y)$ can be computed analytically, LSMI can also be computed analytically.
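Given a fitted density-ratio function, the LSMI score of Eq.(14) is a direct empirical average. The following sketch is illustrative; `r_hat` can be any callable approximating the density ratio:

```python
import numpy as np

def lsmi_score(r_hat, X, y):
    """LSMI estimate of SMI, Eq.(14), given a fitted density-ratio function.

    r_hat(x, label) should approximate p(x,y) / (p(x)p(y)).
    """
    n = len(X)
    labels = np.asarray(y)
    # First term: average of r_hat^2 over all n^2 sample/label pairings.
    sq = sum(r_hat(X[i], labels[j]) ** 2 for i in range(n) for j in range(n))
    # Second term: average of r_hat over the observed joint pairs.
    lin = sum(r_hat(X[i], labels[i]) for i in range(n))
    return -sq / (2.0 * n ** 2) + lin / n - 0.5
```

As a sanity check, plugging in the constant ratio $\widehat{r} \equiv 1$ (the exact ratio under independence) yields $-1/2 + 1 - 1/2 = 0$, matching the fact that SMI vanishes for independent variables.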

We use LSMI for model selection of SMIC. More specifically, we compute LSMI as a function of the kernel parameter $t$ of $K(x,x')$ included in the cluster-posterior model (6), and choose the one that maximizes LSMI. A pseudo code of the entire SMI-maximization clustering procedure is summarized in Figures 1-3. Its MATLAB implementation is publicly available online.

## 3 Existing Clustering Methods

In this section, we review existing clustering methods and qualitatively discuss the relation to the proposed approach.

### 3.1 K-Means Clustering

K-means clustering (MacQueen, 1967) would be one of the most popular clustering algorithms. It tries to minimize the following distortion measure with respect to the cluster assignments $\{y_i\}_{i=1}^{n}$:

$$\sum_{y=1}^{c}\sum_{i: y_i = y} \|x_i - \mu_y\|^2, \qquad (15)$$

where $\mu_y := \frac{1}{n_y}\sum_{i: y_i = y} x_i$ is the centroid of cluster $y$ and $n_y$ is the number of samples in cluster $y$.
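For reference, the standard greedy (Lloyd) iteration for minimizing Eq.(15) alternates between nearest-centroid assignment and centroid recomputation. The following is a plain illustrative implementation of ours, not tied to any particular library:

```python
import numpy as np

def kmeans(X, c, n_iter=100, seed=0):
    """Plain Lloyd's algorithm minimizing the distortion in Eq.(15) -- a sketch.

    Alternates between assigning each sample to its nearest centroid and
    recomputing each centroid as the mean of its assigned samples.
    """
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=c, replace=False)]   # initial centroids
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        new_mu = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                           else mu[k] for k in range(c)])
        if np.allclose(new_mu, mu):                     # converged
            break
        mu = new_mu
    return labels, mu
```

Each step never increases the distortion, so the iteration converges, but only to a local optimum that depends on the initialization.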

The original k-means algorithm is capable of only producing linearly separated clusters (Duda et al., 2001). However, since samples are used only in terms of their inner products, its non-linear variant can be immediately obtained by performing k-means in a feature space induced by a reproducing kernel function (Girolami, 2002).

As the optimization problem of (kernel) k-means is NP-hard (Aloise et al., 2009), a greedy optimization algorithm is usually used for finding a local optimal solution in practice. It was shown that the solution to a continuously-relaxed variant of the kernel k-means problem is given by the principal components of the kernel matrix (Zha et al., 2002; Ding and He, 2004). Thus, post-discretization of the relaxed solution may give a good approximation to the original problem, which is computationally efficient. This idea is similar to the proposed SMIC method described in Section 2.3. However, an essential difference is that SMIC handles the continuous solution directly as a parameter estimate of the class-posterior model.

The performance of kernel k-means depends heavily on the choice of kernel functions, and there is no systematic way to determine the kernel function. This is a critical weakness of kernel k-means in practice. On the other hand, our proposed approach offers a natural model selection strategy, which is a significant advantage over kernel k-means.

### 3.2 Spectral Clustering

The basic idea of spectral clustering (Shi and Malik, 2000; Ng et al., 2002) is to first unfold non-linear data manifolds by a spectral embedding method, and then perform k-means in the embedded space. More specifically, given a sample-sample similarity matrix $W$ ($W_{i,j} \ge 0$; large $W_{i,j}$ means that $x_i$ and $x_j$ are similar), the minimizer of the following criterion with respect to $\{\xi_i\}_{i=1}^{n}$ is obtained under some normalization constraint:

$$\sum_{i,j=1}^{n} W_{i,j}\left\|\frac{\xi_i}{\sqrt{D_{i,i}}} - \frac{\xi_j}{\sqrt{D_{j,j}}}\right\|^2,$$

where $D$ is the diagonal matrix with the $i$-th diagonal element given by $D_{i,i} := \sum_{j=1}^{n} W_{i,j}$. Consequently, the embedded samples are given by the principal eigenvectors of $D^{-1/2}WD^{-1/2}$, followed by normalization. Note that spectral clustering was shown to be equivalent to a weighted variant of kernel k-means with some specific kernel (Dhillon et al., 2004).
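The embedding step can be sketched in Python as follows. This is our illustration following the row-normalized variant of Ng et al. (2002); the name `spectral_embed` is hypothetical:

```python
import numpy as np

def spectral_embed(W, c):
    """Spectral embedding used before k-means in spectral clustering -- a sketch.

    W: (n, n) symmetric non-negative similarity matrix. Returns the
    row-normalized top-c eigenvectors of D^{-1/2} W D^{-1/2}.
    """
    d = W.sum(axis=1)                                 # degrees D_{i,i}
    dinv_sqrt = 1.0 / np.sqrt(d)
    L = dinv_sqrt[:, None] * W * dinv_sqrt[None, :]   # normalized similarity
    eigvals, eigvecs = np.linalg.eigh(L)              # ascending eigenvalues
    U = eigvecs[:, ::-1][:, :c]                       # principal eigenvectors
    return U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
```

For an exactly block-structured similarity matrix, rows of the embedding are identical within a block and distinct across blocks, so the subsequent k-means step becomes trivial.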

The performance of spectral clustering depends heavily on the choice of the sample-sample similarity $W_{i,j}$. Zelnik-Manor and Perona (2005) proposed a useful unsupervised heuristic to determine the similarity in a data-dependent manner, called local scaling:

$$W_{i,j} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i\sigma_j}\right),$$

where $\sigma_i$ is a local scaling factor defined as

$$\sigma_i = \|x_i - x_i^{(t)}\|,$$

and $x_i^{(t)}$ is the $t$-th nearest neighbor of $x_i$. $t$ is the tuning parameter in the local scaling similarity, and $t = 7$ was shown to be useful (Zelnik-Manor and Perona, 2005; Sugiyama, 2007). However, this magic number '7' does not seem to always work well in general.

If $D^{-1/2}WD^{-1/2}$ is regarded as a kernel matrix, spectral clustering will be similar to the proposed SMIC method described in Section 2.3. However, SMIC does not require the post k-means processing since the principal components have a clear interpretation as parameter estimates of the class-posterior model (6). Furthermore, our proposed approach provides a systematic model selection strategy, which is a notable advantage over spectral clustering.

### 3.3 Blurring Mean-Shift Clustering

Blurring mean-shift (Fukunaga and Hostetler, 1975) is a non-parametric clustering method based on the modes of the data-generating probability density.

In the blurring mean-shift algorithm, a kernel density estimator (Silverman, 1986) is used for modeling the data-generating probability density:

$$\widehat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} K\!\left(\frac{\|x - x_i\|^2}{\sigma^2}\right),$$

where $K(\cdot)$ is a kernel function such as the Gaussian kernel $K(t) = \mathrm{e}^{-t/2}$. Taking the derivative of $\widehat{p}(x)$ with respect to $x$ and equating the derivative at $x = x_i$ to zero, we obtain the following updating formula for sample $x_i$ ($i=1,\ldots,n$):

$$x_i \longleftarrow \frac{\sum_{j=1}^{n} W_{i,j}\, x_j}{\sum_{j'=1}^{n} W_{i,j'}},$$

where $W_{i,j} := K'\!\left(\frac{\|x_i - x_j\|^2}{\sigma^2}\right)$ and $K'(\cdot)$ is the derivative of $K(\cdot)$. Each mode of the density is regarded as a representative of a cluster, and each data point is assigned to the cluster to which it converges.
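The update rule can be sketched as follows. This is an illustrative Python version of ours that uses a plain Gaussian weight (absorbing the constant factor of the kernel derivative, which cancels in the ratio) and runs a fixed number of iterations rather than implementing a stopping heuristic:

```python
import numpy as np

def blurring_mean_shift(X, sigma=1.0, n_iter=20):
    """Blurring mean-shift iterations -- a sketch with a Gaussian kernel.

    Each point is repeatedly replaced by a weighted average of all points,
    with weights W[i, j] = exp(-||xi - xj||^2 / (2 sigma^2)), so that
    samples contract toward the modes of the estimated density.
    """
    X = np.asarray(X, float).copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2.0 * sigma ** 2))
        X = (W @ X) / W.sum(axis=1, keepdims=True)
    return X
```

With a small bandwidth, points in well-separated groups collapse onto one representative per group; with too many iterations or too large a bandwidth, everything eventually merges, which is exactly the stopping-criterion issue discussed below.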

Carreira-Perpiñán (2007) showed that the blurring mean-shift algorithm can be interpreted as an expectation-maximization algorithm (Dempster et al., 1977), where the normalized weight $W_{i,j}/\sum_{j'=1}^{n} W_{i,j'}$ is regarded as the posterior probability of the $i$-th sample belonging to the $j$-th cluster. Furthermore, the above update rule can be expressed in a matrix form as $X \longleftarrow XP$, where $X = (x_1,\ldots,x_n)$ is a sample matrix and $P := WD^{-1}$ is a stochastic matrix of the random walk in a graph with adjacency $W$ (Chung, 1997). $D$ is defined as $D_{i,i} := \sum_{j=1}^{n} W_{i,j}$ and $D_{i,j} := 0$ for $i \ne j$. If $P$ is independent of $X$, the above iterative algorithm corresponds to the power method (Golub and Loan, 1996) for finding the leading left eigenvector of $P$. Then, this algorithm is highly related to spectral clustering, which computes the principal eigenvectors of $D^{-1/2}WD^{-1/2}$ (see Section 3.2). Although $P$ depends on $X$ in reality, Carreira-Perpiñán (2006) insisted that this analysis is still valid since $W$ and $P$ quickly reach a quasi-stable state.

An attractive property of blurring mean-shift is that the number of clusters is automatically determined as the number of modes in the probability density estimate. However, this choice depends on the kernel parameter $\sigma$, and there is no systematic way to determine $\sigma$, which is restrictive compared with the proposed method. Another critical drawback of the blurring mean-shift algorithm is that it eventually converges to a single point (i.e., a single cluster; see Cheng, 1995, for details), and therefore a sensible stopping criterion is necessary in practice. Although Carreira-Perpiñán (2006) gave a useful heuristic for stopping the iteration, it is not clear whether this heuristic always works well in practice.

### 3.4 Discriminative Clustering

The support vector machine (SVM; Vapnik, 1995) is a supervised discriminative classifier that tries to find a hyperplane separating positive and negative samples with the maximum margin.

Xu et al. (2005) extended SVM to unsupervised classification scenarios (i.e., clustering), which is called maximum-margin clustering (MMC).

MMC inherits the idea of SVM and tries to find the cluster assignments $y = (y_1,\ldots,y_n)^\top$ so that the margin between two clusters is maximized under proper constraints:

$$\min_{y \in \{+1,-1\}^n}\ \max_{\lambda}\ 2\lambda^\top 1_n - \langle K \circ \lambda\lambda^\top,\, yy^\top\rangle \quad \text{subject to} \quad -\varepsilon \le 1_n^\top y \le \varepsilon \ \text{ and } \ 0_n \le \lambda \le C 1_n,$$

where $\circ$ denotes the Hadamard product (also known as the entry-wise product), and $\varepsilon$ and $C$ are tuning parameters. The constraint $-\varepsilon \le 1_n^\top y \le \varepsilon$ corresponds to balancing the cluster size.

Since the above optimization problem is combinatorial with respect to $y$ and thus hard to solve directly, it is relaxed to a semi-definite program by replacing $yy^\top$ (which is a rank-one matrix with entries in $\{+1,-1\}$) with a real positive semi-definite matrix (Xu et al., 2005). Since then, several approaches have been developed for further improving the computational efficiency of MMC (Valizadegan and Jin, 2007; Zhao et al., 2008; Zhang et al., 2009; Li et al., 2009; Wang et al., 2010).

The performance of MMC depends heavily on the choice of the tuning parameters $\varepsilon$ and $C$, but there is no systematic method to tune these parameters. The fact that our proposed approach is equipped with a model selection strategy would practically be a strong advantage over MMC.

Following a similar line to MMC, a discriminative and flexible framework for clustering (DIFFRAC; Bach and Harchaoui, 2008) was proposed. DIFFRAC tries to solve a regularized least-squares problem with respect to a linear predictor and class labels. Thanks to the simple least-squares formulation, the parameters in the linear predictor can be optimized analytically, and thus the optimization problem is much simplified. A kernelized version of the DIFFRAC optimization problem is given by

$$\min_{\Pi}\ \mathrm{tr}\big(\Pi\Pi^\top \kappa\Gamma(\Gamma K \Gamma + n\kappa I_n)^{-1}\Gamma\big),$$

where $\Pi$ is the $n \times c$ cluster indicator matrix, which takes 1 only at one of the elements in each row (this corresponds to the index of the cluster to which the sample belongs) and all the others are zeros. $\kappa$ ($\ge 0$) is the regularization parameter, and $\Gamma := I_n - \frac{1}{n}1_n 1_n^\top$ is a centering matrix. In practice, the above optimization problem is relaxed to a semi-definite program by replacing $\Pi\Pi^\top$ with a real positive semi-definite matrix. However, DIFFRAC is still computationally expensive and it suffers from a lack of objective model selection strategies.

### 3.5 Generative Clustering

In the generative clustering framework (Duda et al., 2001), class labels are determined by

$$\widehat{y} = \mathop{\mathrm{argmax}}_{y}\, p^*(y|x) = \mathop{\mathrm{argmax}}_{y}\, p^*(x,y),$$

where $p^*(y|x)$ is the class-posterior probability and $p^*(x,y)$ is the data-generating joint probability. Typically, $p^*(x,y)$ is modeled as

$$p(x,y;\beta,\pi) = p(x|y;\beta)\,p(y;\pi),$$

where $\beta$ and $\pi$ are parameters. A canonical model choice is the Gaussian distribution for $p(x|y;\beta)$ and the multinomial distribution for $p(y;\pi)$.

However, since class labels $\{y_i\}_{i=1}^{n}$ are unknown, one may not directly learn $\beta$ and $\pi$ in the joint-probability model $p(x,y;\beta,\pi)$. An approach to coping with this problem is to consider a marginal model,

$$p(x;\beta,\pi) = \sum_{y=1}^{c} p(x|y;\beta)\,p(y;\pi),$$

and learn the parameters $\beta$ and $\pi$ by maximum likelihood estimation (Duda et al., 2001):

$$\max_{\beta,\pi}\ \prod_{i=1}^{n} p(x_i;\beta,\pi).$$

Since the likelihood function of the above mixture model is non-convex, a gradient method (Amari, 1967) may be used for finding a local maximizer in practice. For determining the number of clusters (mixtures) and the mixing-element model $p(x|y;\beta)$, likelihood cross-validation (Härdle et al., 2004) may be used.

Another approach to coping with the unavailability of class labels is to regard $\{y_i\}_{i=1}^{n}$ as latent variables, and apply the expectation-maximization (EM) algorithm (Dempster et al., 1977) for finding a local maximizer of the joint likelihood:

$$\max_{\beta,\pi}\ \prod_{i=1}^{n} p(x_i, y_i;\beta,\pi).$$

A more flexible variant of the EM algorithm called the split-and-merge EM algorithm (Ueda et al., 2000) is also available, which dynamically controls the number of clusters during the EM iteration.

Instead of point-estimating the parameters $\beta$ and $\pi$, one can also consider their distributions in the Bayesian framework (Bishop, 2006). Let us introduce prior distributions $p(\beta)$ and $p(\pi)$ for the parameters $\beta$ and $\pi$. Then the posterior distribution of the parameters is expressed as

$$p(\beta,\pi|X) \propto p(X|\beta,\pi)\,p(\beta)\,p(\pi),$$

where $X = (x_1,\ldots,x_n)$. Based on the Bayesian predictive distribution,

$$\widehat{p}(y|x,X) \propto \iint p(x,y|\beta,\pi)\,p(\beta,\pi|X)\,\mathrm{d}\beta\,\mathrm{d}\pi,$$

class labels are determined as

$$\max_{y}\ \widehat{p}(y|x,X).$$

Because the integration included in the Bayesian predictive distribution is computationally expensive, conjugate priors are often adopted in practice. For example, for the Gaussian-cluster model $p(x|y;\beta)$, the Gaussian prior is assumed for the mean parameter and the Wishart prior for the precision parameter (i.e., the inverse covariance); the Dirichlet prior is assumed for the multinomial model $p(y;\pi)$. Otherwise, the posterior distribution is approximated by the Laplace approximation (MacKay, 2003), Markov chain Monte Carlo sampling (Andrieu et al., 2003), or the variational approximation (Attias, 2000; Ghahramani and Beal, 2000). The number of clusters can be determined based on the maximization of the marginal likelihood:

$$\max_{c}\ \iint p(X|\beta,\pi)\,p(\beta)\,p(\pi)\,\mathrm{d}\beta\,\mathrm{d}\pi. \qquad (16)$$

The generative clustering methods are statistically well-founded. However, density models for each cluster need to be specified in advance, which lacks flexibility in practice. Furthermore, in the Bayesian approach, the choice of cluster models and prior distributions are often limited to conjugate pairs in practice. On the other hand, in the frequentist approach, only local solutions can be obtained in practice due to the non-convexity caused by mixture modeling.

### 3.6 Posterior-Maximization Clustering

Another possible clustering approach based on probabilistic inference is to directly maximize the posterior probability of class labels $Y = (y_1,\ldots,y_n)$ (Bishop, 2006):

$$\max_{Y}\ p^*(Y|X).$$

Let us model the cluster-wise data distribution by $p(X|Y,\beta)$.

An approximate inference method called iterative conditional modes (Kurihara and Welling, 2009) alternately maximizes the posterior probabilities of $Y$ and $\beta$ until convergence:

$$\widehat{Y} \longleftarrow \mathop{\mathrm{argmax}}_{Y}\ p(Y|X,\widehat{\beta}), \qquad \widehat{\beta} \longleftarrow \mathop{\mathrm{argmax}}_{\beta}\ p(\beta|X,\widehat{Y}).$$

When the Gaussian model with identity covariance is assumed for $p(x|y;\beta)$, this algorithm is reduced to the k-means algorithm (see Section 3.1) under the uniform priors.

Let us consider the class-prior probability $p^*(y)$ and model it by $p(y;\pi)$. Introducing the prior distributions $p(\beta)$ and $p(\pi)$, we can approximate the posterior distribution of $Y$ as

$$p(Y|X) \propto \iint p(X|Y,\beta)\,p(\beta)\,p(Y|\pi)\,p(\pi)\,\mathrm{d}\beta\,\mathrm{d}\pi.$$

Similarly to generative clustering described in Section 3.5, conjugate priors such as the Gauss-Wishart prior and the Dirichlet prior are practically useful in improving the computational efficiency. The number of clusters can also be similarly determined by maximizing the marginal likelihood (16). However, direct optimization of $p(Y|X)$ is often computationally intractable due to $c^n$ combinations, where $c$ is the number of clusters and $n$ is the number of samples. For this reason, efficient sampling schemes such as Markov chain Monte Carlo are indispensable in this approach.

A Dirichlet process mixture (Ferguson, 1973; Antoniak, 1974) is a non-parametric extension of the above approach, where an infinite number of clusters are implicitly considered and the number of clusters is automatically determined based on observed data. In order to improve the computational efficiency of this infinite mixture approach, various approximation schemes such as Markov chain Monte Carlo sampling (Neal, 2000) and variational approximation (Blei and Jordan, 2006) have been introduced. Furthermore, variants of Dirichlet processes such as hierarchical Dirichlet processes (Teh et al., 2007), nested Dirichlet processes (Rodríguez et al., 2008), and dependent Dirichlet processes (Lin et al., 2010) have been developed recently.

However, even in this non-parametric Bayesian approach, the density model for each cluster still needs to be parametrically specified in advance, and the choice is often limited to Gaussian models. This severely limits the flexibility in practice.

### 3.7 Dependence-Maximization Clustering

The Hilbert-Schmidt independence criterion (HSIC; Gretton et al., 2005) is a dependence measure based on a reproducing kernel function (Aronszajn, 1950). Song et al. (2007) proposed a dependence-maximization clustering method called clustering with HSIC (CLUHSIC), which tries to determine cluster assignments so that their dependence on feature vectors is maximized.

More specifically, CLUHSIC tries to find the cluster indicator matrix Π (see Section 3.4) that maximizes

 tr(KΠAΠ⊤),

where K is the kernel matrix for feature vectors and A is a cluster-cluster similarity matrix. Note that ΠAΠ⊤ can be regarded as the kernel matrix for cluster assignments. Song et al. (2007) used a greedy algorithm to optimize the cluster indicator matrix, which is computationally demanding. Yang et al. (2010) gave spectral and semi-definite relaxation techniques to improve the computational efficiency of CLUHSIC.
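For concreteness, the CLUHSIC objective can be evaluated as follows for hard cluster assignments; this is a minimal sketch (the function name and the one-hot construction of Π are ours):

```python
import numpy as np

def cluhsic_objective(K, y, A):
    """Evaluate the CLUHSIC objective tr(K Pi A Pi^T) for hard
    assignments y, where K is the kernel matrix over feature vectors,
    Pi is the cluster indicator matrix, and A is a cluster-cluster
    similarity matrix."""
    n, c = len(y), A.shape[0]
    Pi = np.zeros((n, c))
    Pi[np.arange(n), y] = 1.0            # one-hot cluster indicators
    return np.trace(K @ Pi @ A @ Pi.T)
```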

HSIC is a kernel-based independence measure and the kernel function needs to be determined in advance. However, there is no systematic model selection strategy for HSIC, and using the Gaussian kernel with width set to the median distance between samples is a standard heuristic in practice (Fukumizu et al., 2009). On the other hand, our proposed approach is equipped with an objective model selection strategy, which is a notable advantage over CLUHSIC.
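The median-distance heuristic mentioned above can be sketched as follows (function names are ours):

```python
import numpy as np

def median_heuristic_width(X):
    """Gaussian kernel width set to the median distance between
    distinct pairs of samples -- the standard heuristic."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    iu = np.triu_indices(len(X), k=1)        # distinct pairs only
    return np.median(d[iu])

def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix with width sigma."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))
```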

Another line of dependence-maximization clustering adopts mutual information (MI) as a dependency measure. Recently, a dependence-maximization clustering method called mean nearest-neighbor (MNN) clustering was proposed (Faivishevsky and Goldberger, 2010). MNN is based on the k-nearest-neighbor entropy estimator proposed by Kozachenko and Leonenko (1987).

The performance of the original k-nearest-neighbor entropy estimator depends on the choice of the number of nearest neighbors, k. MNN avoids this problem by the heuristic of averaging over all possible values of k. The resulting objective function is given by

 ∑_{y=1}^{c} 1/(n_y−1) ∑_{i≠j: y_i=y_j=y} log(∥x_i−x_j∥² + ϵ), (17)

where ϵ (>0) is a smoothing parameter and n_y is the number of samples in cluster y. This objective function is then minimized with respect to cluster assignments using a greedy algorithm.
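Objective (17) can be evaluated as follows; this is a sketch (the function name and the particular default value of ϵ are ours, since the recommended value is not reproduced here):

```python
import numpy as np

def mnn_objective(X, y, eps=1e-10):
    """Evaluate objective (17): for each cluster, sum the logs of the
    smoothed squared distances over all within-cluster ordered pairs
    (i != j), weighted by 1/(n_y - 1).  eps is the smoothing
    parameter; its value here is an arbitrary placeholder."""
    total = 0.0
    for k in np.unique(y):
        Xk = X[y == k]
        nk = len(Xk)
        if nk < 2:
            continue
        sq = ((Xk[:, None, :] - Xk[None, :, :]) ** 2).sum(axis=2)
        off = ~np.eye(nk, dtype=bool)        # mask out i == j pairs
        total += np.log(sq[off] + eps).sum() / (nk - 1)
    return total
```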

Although averaging out the tuning parameter k is convenient, this heuristic is not well justified theoretically. Moreover, the choice of the smoothing parameter ϵ is arbitrary: a particular value was recommended in the MATLAB code provided by one of the authors, but there seems to be no justification for this choice. Also, due to the greedy optimization scheme, MNN is computationally expensive. On the other hand, our proposed approach offers a well-justified model selection strategy, and the SMI-based clustering gives an analytic-form solution that can be computed efficiently.

### 3.8 Information-Maximization Clustering with Mutual Information

Finally, we review methods of information-maximization clustering based on mutual information (Agakov and Barber, 2006; Gomes et al., 2010), which belong to the same family of clustering algorithms as our proposed method.

Mutual information (MI) is defined and expressed as

 MI := ∫ ∑_{y=1}^{c} p∗(x,y) log [ p∗(x,y) / (p∗(x) p∗(y)) ] dx
     = ∫ ∑_{y=1}^{c} p∗(y|x) p∗(x) log p∗(y|x) dx − ∫ ∑_{y=1}^{c} p∗(y|x) p∗(x) log p∗(y) dx. (18)

Let us approximate the class-posterior probability p∗(y|x) by a conditional-probability model p(y|x;α) with parameter α. Then the marginal probability p∗(y) can be approximated as

 p∗(y) = ∫ p∗(y|x) p∗(x) dx ≈ (1/n) ∑_{i=1}^{n} p(y|x_i;α). (19)

By further approximating the expectation with respect to p∗(x) included in Eq.(18) by the empirical average over samples {x_i}_{i=1}^{n}, the following MI estimator can be obtained (Agakov and Barber, 2006; Gomes et al., 2010):

 ˆMI := (1/n) ∑_{i=1}^{n} ∑_{y=1}^{c} p(y|x_i;α) log p(y|x_i;α)
      − ∑_{y=1}^{c} ( (1/n) ∑_{i=1}^{n} p(y|x_i;α) ) log ( (1/n) ∑_{j=1}^{n} p(y|x_j;α) ). (20)
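Estimator (20) can be computed directly from the n×c matrix of model outputs p(y|x_i;α); the sketch below (function name ours) adds a small constant inside the logarithms purely for numerical stability:

```python
import numpy as np

def mi_estimator(P):
    """Evaluate the MI estimator (20), where P is an n x c matrix whose
    (i, y) entry is the model output p(y | x_i; alpha)."""
    eps = 1e-12                               # guard against log(0)
    # first term: average of sum_y p(y|x_i) log p(y|x_i)
    first = (P * np.log(P + eps)).sum(axis=1).mean()
    # second term: entropy-like term of the empirical marginal, Eq.(19)
    marg = P.mean(axis=0)
    second = (marg * np.log(marg + eps)).sum()
    return first - second
```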

In Agakov and Barber (2006), the Gaussian model,

 p(y|x;α) ∝ exp(−∥x−c_y∥² / (2s