# Multiple Sample Clustering

Clustering algorithms that view each data object as a single sample drawn from some distribution, for example a Gaussian distribution, have been a hot topic for decades. Many clustering algorithms, such as k-means and spectral clustering, were proposed under this single-sample assumption. In real life, however, an input object often consists of multiple samples drawn from a hidden distribution. Traditional clustering algorithms cannot handle this situation, which calls for multiple sample clustering algorithms. Existing multiple sample clustering algorithms, however, can only handle scalar samples or samples from a Gaussian distribution, which limits their field of application. In this paper, we propose a general framework for multiple sample clustering from which various algorithms can be generated. We apply two specific cases of this framework, a Wasserstein distance version and a Bhattacharyya distance version, to both synthetic data and stock price data. The simulation results show that the sufficient statistic can greatly improve clustering accuracy and stability.



## 1 Introduction

Multiple sample data is a specific kind of data structure in clustering. Suppose there are different groups of vectors. Multiple sample clustering tries to cluster the groups rather than the vectors themselves. Most clustering algorithms, e.g. $k$-means, linear discriminant analysis, principal component analysis, etc., view each input object as a single sample drawn from a univariate (or multivariate) Gaussian distribution.

Multiple sample clustering has many applications in daily life. In video rating, for example, each video is rated by different viewers, and the rating scores of one video can be viewed as multiple samples drawn from the same distribution. Clustering videos can help video providers (such as Netflix) recommend the most relevant movies (or ads) to specific users [davidson2010youtube, mei2009automatic, yang2007online]. The quality of products coming from the same batch can also be seen as multiple samples drawn from a hidden distribution. If we can cluster similar batches together, the manager can use this information to improve the production process [undey2003online, russell1998recursive]. Moreover, time-series data, such as readings from sensor networks [bandyopadhyay2003energy] or stock prices at different times [harris1991stock], can also be viewed as multiple samples from a hidden distribution. Clustering the sensors of a network can reveal the relevance between sensors and reduce the communication cost between them. Clustering stocks can help us build an investment portfolio.

Related work

Multiple sample clustering is a subfield of unsupervised learning, which detects structure in data without prior knowledge. There are two main branches of unsupervised learning: principal component analysis and clustering. Principal component analysis [pearson1901liii] focuses on building linearly uncorrelated variables from a set of observations; it is not the main concern of this paper. Clustering aims at partitioning the whole observed set into separate subgroups. In the clustering field, researchers have been using distribution information for decades. In 2006, Antonio Irpino [irpino2006new] applied the Wasserstein distance to the clustering of histogram symbolic data, clustering different states in America based on the temperatures of each month. In 2013, Claudia Canali and Riccardo Lancellotti [canali2013automatic] applied the Bhattacharyya distance to virtual machine clustering based on the behavior histograms of virtual machines. However, both algorithms are only useful for scalar samples. In 2007, Davis and Dhillon [davis2007differential] applied the KL divergence to multiple sample clustering under the Gaussian distribution assumption. Their simulation results show that using distribution information yields much higher clustering accuracy than using only the mean vector of each sample group. However, the KL divergence is not a true distance, so many graph-based clustering algorithms are inapplicable under it, and the algorithm cannot fit all multiple sample data structures. A general framework for multiple sample clustering remains undeveloped.

Contribution and paper outline In this paper, we propose a general framework for multiple sample clustering, from which various algorithms can be generated. An adapted algorithm called KL divergence++ is built on Davis and Dhillon's work [davis2007differential] and can achieve higher clustering accuracy than the original. Finally, we compare the performance of six different algorithms to illustrate the importance of distribution information in multiple sample clustering.

This paper is organized as follows. Section 2 introduces the necessary notation and background knowledge. Section 3 formulates the multiple sample clustering problem. Section 4 presents the general framework and the clustering algorithms generated from it, and section 5 reports the simulation results on both synthetic and real data. The main results and future research directions are contained in section 6. [fortunato2016community][javed2018community][rossetti2018community]

Necessary Notations Although every symbol is explained where it first appears, we summarize the basic notation used in this paper for clarity.

## 2 Preliminary

In this section, we introduce some essential concepts and tools for multiple sample clustering. Recall the probability density function of the univariate Gaussian distribution with mean $m$ and variance $\sigma^2$:

$$p(x \mid m, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right)$$

The multivariate Gaussian distribution is a natural extension of it. Consider a $d$-dimensional multivariate Gaussian distribution with mean vector $m$ and covariance matrix $\Sigma$. Its probability density function is

$$p(x \mid m, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - m)^T \Sigma^{-1} (x - m)\right) \tag{1}$$

where $|\Sigma|$ is the determinant of $\Sigma$.

### 2.1 Distribution Metrics

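Mathematically, a metric on a space $X$ is a function $d: X \times X \to [0, \infty)$, where $\times$ denotes the Cartesian product between two spaces, with the following four properties for all elements $x, y, z \in X$:

non-negativity: $d(x, y) \geq 0$

identity: $d(x, y) = 0$ if and only if $x = y$

symmetry: $d(x, y) = d(y, x)$

subadditivity: $d(x, z) \leq d(x, y) + d(y, z)$

We discuss two distribution metrics in turn: the Wasserstein distance and the Bhattacharyya distance. The KL divergence appears in section 2.1.3, but it is not a distance metric since it is neither symmetric nor subadditive.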

#### 2.1.1 Wasserstein Distance

The Wasserstein distance dates back to 1781, when Gaspard Monge formulated the optimal transport objective function to measure the effort needed to move a pile of dirt into another place with a different shape [monge1781memoire]. Because of this history, the Wasserstein distance is also called the earth mover's distance. Consider two spaces $X$ and $Y$, two measures $\mu$ and $\nu$ defined on $X$ and $Y$ respectively and satisfying $\mu(X) = \nu(Y)$, and a map $T: X \to Y$ with $\nu(B) = \mu(T^{-1}(B))$ for every measurable $B \subseteq Y$. Given a cost function $c(x, y)$, the Wasserstein distance is defined as the minimum value of the total cost over the choice of the map $T$:

$$D_W(\mu, \nu) = \min_T \int_X c(x, T(x)) \, d\mu(x), \qquad x \in X,\ y = T(x) \in Y \tag{2}$$

In practice, for vectors $x$ and $y$ in Euclidean space, we always set $c(x, y) = \|x - y\|_2^2$, and the corresponding $D_{W,2}$ is called the 2-norm Wasserstein distance.

Although the Wasserstein distance can measure the distance between continuous distributions, between discrete distributions, or between a continuous and a discrete distribution, we only consider Gaussian distributions for computational efficiency. Nevertheless, the framework proposed in this paper can be easily extended to other distributions.

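Consider two Gaussians $\mu_1 = \mathcal{N}(m_1, \Sigma_1)$ and $\mu_2 = \mathcal{N}(m_2, \Sigma_2)$, where $m_1$ and $m_2$ are the mean vectors and $\Sigma_1$ and $\Sigma_2$ are the respective covariance matrices. The Wasserstein distance between two $d$-dimensional Gaussian distributions can be computed in closed form:

$$D_{W,2}(\mu_1, \mu_2) := \|m_1 - m_2\|_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\right)^{1/2}\right) \tag{3}$$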
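The closed form in eq. (3) can be evaluated with a few lines of NumPy. The following is a minimal sketch, not the authors' implementation: the function names `psd_sqrt` and `wasserstein2_gaussian` are illustrative, and the matrix square root is computed by eigendecomposition under the assumption that both covariance matrices are symmetric positive semidefinite.

```python
import numpy as np

def psd_sqrt(mat):
    """Square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative round-off
    return (vecs * np.sqrt(vals)) @ vecs.T

def wasserstein2_gaussian(m1, S1, m2, S2):
    """Eq. (3): ||m1 - m2||^2 + Tr(S1 + S2 - 2 (S1^1/2 S2 S1^1/2)^1/2)."""
    root1 = psd_sqrt(S1)
    cross = psd_sqrt(root1 @ S2 @ root1)
    mean_term = np.sum((m1 - m2) ** 2)
    cov_term = np.trace(S1 + S2 - 2.0 * cross)
    return float(mean_term + cov_term)

# Example: isotropic Gaussians in two dimensions.
m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.array([3.0, 4.0]), 4.0 * np.eye(2)
print(wasserstein2_gaussian(m1, S1, m2, S2))
```

For commuting covariance matrices, as in the example, the trace term reduces to the squared difference of the standard deviations summed over the dimensions.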

#### 2.1.2 Bhattacharyya Distance

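The Bhattacharyya distance between two distributions $\mu$ and $\nu$ on the same domain $X$ is

$$D_B(\mu, \nu) = -\ln \int_{x \in X} \sqrt{\mu(x)\nu(x)} \, dx \tag{4}$$

For two Gaussian distributions $\mu = \mathcal{N}(m_1, \Sigma_1)$ and $\nu = \mathcal{N}(m_2, \Sigma_2)$, the Bhattacharyya distance is

$$D_B(\mu, \nu) = \frac{1}{8}(m_1 - m_2)^T \Sigma^{-1} (m_1 - m_2) + \frac{1}{2} \ln\left(\frac{|\Sigma|}{\sqrt{|\Sigma_1||\Sigma_2|}}\right) \tag{5}$$

where $\Sigma = \frac{\Sigma_1 + \Sigma_2}{2}$.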
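As an illustration, eq. (5) can be computed as follows. This is a hedged sketch: the name `bhattacharyya_gaussian` is illustrative, and the code assumes both covariance matrices are symmetric positive definite so that the log-determinants are finite.

```python
import numpy as np

def bhattacharyya_gaussian(m1, S1, m2, S2):
    """Eq. (5) for N(m1, S1) and N(m2, S2), with S = (S1 + S2) / 2."""
    S = 0.5 * (S1 + S2)
    diff = m1 - m2
    # Mahalanobis-type term, using a linear solve instead of an explicit inverse.
    maha = diff @ np.linalg.solve(S, diff)
    # slogdet returns (sign, log|det|); the log-determinant avoids overflow.
    logdet_S = np.linalg.slogdet(S)[1]
    logdet_S1 = np.linalg.slogdet(S1)[1]
    logdet_S2 = np.linalg.slogdet(S2)[1]
    return float(0.125 * maha + 0.5 * (logdet_S - 0.5 * (logdet_S1 + logdet_S2)))

# Example: equal covariances, so only the mean term contributes.
print(bhattacharyya_gaussian(np.zeros(2), np.eye(2), np.array([2.0, 0.0]), np.eye(2)))
```

The function is symmetric in its two Gaussians by construction, which is what makes the resulting adjacency matrix usable for graph partitioning.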

#### 2.1.3 KL Divergence Computation between Gaussians

The KL divergence between two distributions $\mu$ and $\nu$ is

$$D_{KL}(\mu \| \nu) = \int_X \mu(x) \log\left(\frac{\mu(x)}{\nu(x)}\right) dx \tag{6}$$

If $\mu_1 = \mathcal{N}(m_1, \Sigma_1)$ and $\mu_2 = \mathcal{N}(m_2, \Sigma_2)$ are two $d$-dimensional Gaussian distributions, the KL divergence between $\mu_1$ and $\mu_2$ is

$$D_{KL}(\mu_1 \| \mu_2) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - d + \mathrm{Tr}\left(\Sigma_2^{-1}\Sigma_1\right) + (m_2 - m_1)^T \Sigma_2^{-1} (m_2 - m_1)\right] \tag{7}$$

Compared with the Wasserstein distance and the Bhattacharyya distance, the KL divergence is not a real distance since it is not symmetric. As a result, graph partitioning algorithms, spectral clustering for example, cannot be applied to the KL divergence adjacency matrix between Gaussians.
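The asymmetry is easy to see numerically. The sketch below evaluates eq. (7) in both directions for one pair of Gaussians; the function name `kl_gaussian` is illustrative, and the covariance matrices are assumed to be symmetric positive definite.

```python
import numpy as np

def kl_gaussian(m1, S1, m2, S2):
    """Eq. (7): KL divergence from N(m1, S1) to N(m2, S2)."""
    d = m1.shape[0]
    diff = m2 - m1
    trace_term = np.trace(np.linalg.solve(S2, S1))  # Tr(S2^-1 S1)
    maha = diff @ np.linalg.solve(S2, diff)          # (m2-m1)^T S2^-1 (m2-m1)
    logdet_ratio = np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1]
    return float(0.5 * (logdet_ratio - d + trace_term + maha))

m = np.zeros(2)
forward = kl_gaussian(m, np.eye(2), m, 4.0 * np.eye(2))
backward = kl_gaussian(m, 4.0 * np.eye(2), m, np.eye(2))
print(forward, backward)  # the two directions give different values
```

Because the two directions differ, a matrix of pairwise KL divergences is in general not symmetric, which is the obstacle to graph partitioning mentioned above.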

### 2.2 Spectral Clustering

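Spectral clustering is a graph partitioning algorithm. In the graph partitioning problem, let $G = (V, E)$ be an undirected graph with vertex set $V = \{v_1, \dots, v_n\}$ and similarity adjacency matrix $W$, where $w_{ij}$ represents the similarity between $v_i$ and $v_j$. Graph partitioning algorithms try to split $V$ into $k$ non-overlapping parts $A_1, \dots, A_k$ by minimizing an objective function. Define $W(A, B) = \sum_{v_i \in A, v_j \in B} w_{ij}$, $\bar{A}$ the complement of $A$, $\mathrm{vol}(A) = \sum_{v_i \in A} \sum_{j} w_{ij}$, and $|A|$ the number of nodes in $A$. The two most commonly used objective functions in graph partitioning are

$$\begin{aligned} \mathrm{RatioCut}(A_1, \dots, A_k) &:= \frac{1}{2}\sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{|A_i|} \\ \mathrm{Ncut}(A_1, \dots, A_k) &:= \frac{1}{2}\sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)} \end{aligned} \tag{8}$$

However, minimizing the RatioCut and Ncut objective functions is NP-hard. Spectral clustering solves a relaxed version of these problems. The implementation steps are shown in the following table: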

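We only analyze the normalized cut (Ncut) in the case $k = 2$. Other situations can be easily extended from this analysis.

Let $D$ be the diagonal degree matrix with $D_{ii} = \sum_j w_{ij}$, let $L = D - W$ be the graph Laplacian, and let $\mathbb{1}$ be the all-ones vector. For the objective function $\mathrm{Ncut}(A, \bar{A})$, we define the vector $f \in \mathbb{R}^n$ with elements

$$f_i = \begin{cases} \sqrt{\dfrac{\mathrm{vol}(\bar{A})}{\mathrm{vol}(A)}} & \text{if } v_i \in A \\[2ex] -\sqrt{\dfrac{\mathrm{vol}(A)}{\mathrm{vol}(\bar{A})}} & \text{if } v_i \in \bar{A} \end{cases} \tag{11}$$

We can prove that $(Df)^T \mathbb{1} = 0$, $f^T D f = \mathrm{vol}(V)$ and $f^T L f = 2\,\mathrm{vol}(V)\,\mathrm{Ncut}(A, \bar{A})$. Thus we can rewrite the problem of minimizing Ncut as

$$\min_{A} f^T L f \quad \text{s.t. } f \text{ of the form in eq. (11)},\ Df \perp \mathbb{1},\ f^T D f = \mathrm{vol}(V) \tag{12}$$

Then we relax this problem by allowing $f$ to take arbitrary real values:

$$\min_{f \in \mathbb{R}^n} f^T L f \quad \text{s.t. } Df \perp \mathbb{1},\ f^T D f = \mathrm{vol}(V) \tag{13}$$

Then we substitute $g := D^{1/2} f$. After substitution, the problem is

$$\min_{g \in \mathbb{R}^n} g^T D^{-1/2} L D^{-1/2} g \quad \text{s.t. } g \perp D^{1/2}\mathbb{1},\ \|g\|^2 = \mathrm{vol}(V) \tag{14}$$

The solution of eq. (14) is given by the eigenvector corresponding to the second smallest eigenvalue of $D^{-1/2} L D^{-1/2}$.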
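The relaxation above translates directly into code for the two-cluster case. The following sketch is an illustration under stated assumptions rather than the paper's table of steps: the function name `spectral_bipartition` is hypothetical, every node is assumed to have positive degree, and the relaxed solution is turned back into a partition by thresholding at zero, which is one common rounding choice.

```python
import numpy as np

def spectral_bipartition(W):
    """Split a graph in two using the relaxed Ncut problem of eq. (14)."""
    deg = W.sum(axis=1)                     # degrees D_ii
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.diag(deg) - W                    # graph Laplacian L = D - W
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]  # D^-1/2 L D^-1/2
    vals, vecs = np.linalg.eigh(L_sym)      # eigenvalues in ascending order
    g = vecs[:, 1]                          # second smallest eigenvalue
    f = d_inv_sqrt * g                      # undo the substitution g = D^1/2 f
    return (f > 0).astype(int)              # ASSUMPTION: round by sign

# Two tightly connected triples joined by weak edges.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
print(spectral_bipartition(W))
```

The overall sign of an eigenvector is arbitrary, so which group receives label 0 can differ between runs or library versions; only the grouping itself is meaningful.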

## 3 Problem Formulation

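In multiple sample clustering, each clustering object is not a vector but a set of vectors sampled from a hidden distribution. Suppose we have $n$ clustering objects, and the $i$-th clustering object consists of $q_i$ sample vectors $a_{i1}, \dots, a_{iq_i} \in \mathbb{R}^d$. The vectors of the $i$-th object are sampled from a Gaussian distribution $\mathcal{N}(m_i, \Sigma_i)$. We can compute the estimated mean vector $\tilde{m}_i$ and the unbiased estimate $\tilde{\Sigma}_i$ of the covariance matrix:

$$\tilde{m}_i = \frac{1}{q_i}\sum_{j=1}^{q_i} a_{ij} \tag{15}$$

$$\tilde{\Sigma}_i = \frac{1}{q_i - 1}\sum_{j=1}^{q_i} (a_{ij} - \tilde{m}_i)(a_{ij} - \tilde{m}_i)^T \tag{16}$$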
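Eqs. (15) and (16) amount to the sample mean and the unbiased sample covariance of one group. A minimal sketch follows; the function name `estimate_gaussian` is illustrative, and each group is assumed to be stored as an array with one sample per row and at least two rows.

```python
import numpy as np

def estimate_gaussian(samples):
    """Return the estimates of eqs. (15) and (16) for one sample group.

    samples: array of shape (q_i, d), one sample vector per row.
    """
    q = samples.shape[0]
    mean = samples.mean(axis=0)                 # eq. (15)
    centered = samples - mean
    cov = centered.T @ centered / (q - 1)       # eq. (16), unbiased
    return mean, cov

rng = np.random.default_rng(0)
group = rng.normal(size=(30, 4))                # 30 samples in 4 dimensions
mean, cov = estimate_gaussian(group)
print(mean.shape, cov.shape)                    # (4,) (4, 4)
```

The covariance agrees with `np.cov(group, rowvar=False)`, which uses the same $q_i - 1$ normalization by default.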

Traditional clustering algorithms, such as $k$-means and spectral clustering, can only use the mean vector $\tilde{m}_i$ to cluster vector groups. In contrast, Wasserstein distance based clustering, Bhattacharyya distance based clustering, KL divergence based clustering, and the KL divergence++ algorithm can use the first moment information $\tilde{m}_i$, the second moment information $\tilde{\Sigma}_i$, and even the whole distribution information for clustering. We elaborate on these algorithms in section 4.

## 4 Clustering Algorithm

We now introduce a general framework for multiple sample clustering. Two algorithms, Wasserstein distance based clustering and Bhattacharyya distance based clustering, are generated from the framework.

The structure of the framework is shown in the following figure.

We can see from the figure that multiple sample clustering has three main parts: estimating the distribution corresponding to each vector group, computing the adjacency matrix $W$ based on a specific distribution metric, and using a graph partitioning algorithm to output the clustered groups $C_1, \dots, C_k$, where $C_i \cap C_j = \emptyset$ for $i \neq j$. We summarize the general framework for multiple sample clustering in table 3:
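The three parts can be sketched end to end as follows. This is an illustrative sketch under several assumptions that are not taken from the paper: the Bhattacharyya distance is used as the distribution metric, distances are converted to similarities with an exponential kernel whose scale is the median pairwise distance, and the graph partitioning step is the two-cluster spectral relaxation of section 2.2. All function names are hypothetical.

```python
import numpy as np

def estimate_gaussian(samples):
    """Part 1: sample mean and unbiased covariance of one group."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    return mean, centered.T @ centered / (len(samples) - 1)

def bhattacharyya_gaussian(m1, S1, m2, S2):
    """Bhattacharyya distance between two Gaussians, eq. (5)."""
    S = 0.5 * (S1 + S2)
    diff = m1 - m2
    logdet = lambda M: np.linalg.slogdet(M)[1]
    return (0.125 * diff @ np.linalg.solve(S, diff)
            + 0.5 * (logdet(S) - 0.5 * (logdet(S1) + logdet(S2))))

def cluster_groups(groups, distance=bhattacharyya_gaussian):
    """Cluster sample groups into two clusters; returns 0/1 labels."""
    params = [estimate_gaussian(np.asarray(g)) for g in groups]
    n = len(params)
    # Part 2: symmetric matrix of pairwise distribution distances.
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = distance(*params[i], *params[j])
    # ASSUMPTION: exponential kernel with median scale turns distances
    # into similarities; the paper's own choice is not reproduced here.
    scale = np.median(dist[np.triu_indices(n, k=1)])
    W = np.exp(-dist / scale)
    np.fill_diagonal(W, 0.0)
    # Part 3: two-way normalized spectral partition (section 2.2).
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)
    return (d_inv_sqrt * vecs[:, 1] > 0).astype(int)

rng = np.random.default_rng(0)
groups = [rng.normal(0.0, 1.0, size=(40, 2)) for _ in range(10)]
groups += [rng.normal(5.0, 2.0, size=(40, 2)) for _ in range(10)]
print(cluster_groups(groups))
```

Passing a different symmetric `distance` function, for example a Wasserstein distance between Gaussians, changes the metric without touching the other two parts, which is the sense in which the framework generates different algorithms.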

### 4.1 Wasserstein Distance Based Clustering

The Wasserstein distance based clustering algorithm is a special case of the framework. Consider a multiple sample dataset in which each object is a group of sample vectors. We assume that the samples in each group are drawn from the same Gaussian distribution, so that the estimated mean vector $\tilde{m}_i$ and covariance matrix $\tilde{\Sigma}_i$ are the sufficient statistic of the Gaussian distribution $\mathcal{N}(\tilde{m}_i, \tilde{\Sigma}_i)$. Then we use the Wasserstein distance to build a symmetric adjacency matrix $W$. Finally, we apply normalized spectral clustering to the adjacency matrix to obtain the final clusters. The pseudocode of this algorithm is shown in table 4.

### 4.2 Bhattacharyya Distance Based Clustering

The Bhattacharyya distance based clustering algorithm has the same form as Wasserstein distance based clustering. The only difference is that it replaces the Wasserstein-based adjacency matrix with one whose entries are computed from the Bhattacharyya distance $D_B$ between the estimated Gaussians. The pseudocode of this algorithm is shown in table 5.

### 4.3 KL Divergence Based Clustering Algorithm

The KL divergence based clustering algorithm was proposed by Davis and Dhillon in 2007 [davis2007differential]. Unlike previous algorithms, which only allow clustering of scalar samples, this algorithm was the first to make clustering of vector samples possible. We can view it as an extension of the $k$-means algorithm. The algorithm has three steps. Step one: for each vector group, estimate the corresponding distribution $\mathcal{N}(\tilde{m}_i, \tilde{\Sigma}_i)$ and make an initial assignment of the groups to $k$ clusters. Step two: in the $t$-th loop, compute the clustering center of each cluster from its current members. Step three: update the assignment, so that each vector group is assigned to the cluster whose center has the smallest KL divergence from the group's distribution. Go back to step two until the cluster assignment no longer changes. The pseudocode of this algorithm is shown in table 6.
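The three steps can be sketched as a $k$-means-style loop. This is a hedged sketch, not the pseudocode of table 6. The cluster center used here is the Gaussian that minimizes the summed KL divergence from the member Gaussians to the center, which gives the average of the member means and the average of the member covariances plus the scatter of the member means; treating that as the center update is an assumption of this sketch. The names `kl_gaussian` and `kl_cluster` are illustrative.

```python
import numpy as np

def kl_gaussian(m1, S1, m2, S2):
    """KL divergence from N(m1, S1) to N(m2, S2), eq. (7)."""
    diff = m2 - m1
    logdet_ratio = np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1]
    return 0.5 * (logdet_ratio - len(m1) + np.trace(np.linalg.solve(S2, S1))
                  + diff @ np.linalg.solve(S2, diff))

def kl_cluster(means, covs, init_labels, k, max_iter=100):
    """means: (n, d) array, covs: (n, d, d) array, init_labels: length n."""
    labels = np.asarray(init_labels).copy()
    for _ in range(max_iter):
        # Step two: one representative Gaussian per non-empty cluster.
        centers = {}
        for j in range(k):
            idx = np.flatnonzero(labels == j)
            if idx.size == 0:
                continue  # an empty cluster simply drops out
            m = means[idx].mean(axis=0)
            dev = means[idx] - m
            S = covs[idx].mean(axis=0) + dev.T @ dev / idx.size
            centers[j] = (m, S)
        # Step three: assign each Gaussian to its closest center.
        new_labels = np.array([
            min(centers, key=lambda j: kl_gaussian(means[i], covs[i], *centers[j]))
            for i in range(len(means))
        ])
        if np.array_equal(new_labels, labels):
            break  # the assignment no longer changes
        labels = new_labels
    return labels

means = np.array([[0, 0], [0.2, 0], [0, 0.2],
                  [10, 10], [10.2, 10], [10, 10.2]], dtype=float)
covs = np.stack([np.eye(2)] * 6)
print(kl_cluster(means, covs, init_labels=[0, 1, 0, 1, 0, 1], k=2))
```

Like $k$-means, the loop only finds a local optimum, so the result depends on the initial assignment.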

### 4.4 KL Divergence++ Algorithm

KL divergence++ is an improved version of the KL divergence based clustering algorithm. It is inspired by the $k$-means++ algorithm of Arthur [arthur2007k], which chooses the initial clustering centers carefully so that the distances between them are statistically large [arthur2007k]. Applying the same idea to the KL divergence yields the KL divergence++ algorithm. As described in section 4.3, the KL divergence based clustering algorithm has three steps, and KL divergence++ only changes step one. Instead of making the initial assignment randomly, we choose the initial clustering centers sequentially, in three sub-steps. Sub-step one: randomly choose the first clustering center from the input Gaussians. Sub-step two: to choose the next clustering center, compute the KL divergence between each input distribution and every clustering center chosen so far, take for each distribution the smallest of these divergences, and collect the results in a vector $v$. Then compute the probability vector $p$, where $p_i$ is proportional to $v_i$. Sub-step three: choose the next clustering center at random according to the probability vector $p$. Go back to sub-step two until all clustering centers are chosen.

Algorithm 4: KL-divergence++ algorithm

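Stage 1: use the $k$-means++ strategy to choose the initial cluster centers

1: For the $n$ vector groups, estimate the corresponding Gaussian distributions $\mathcal{N}(\tilde{m}_i, \tilde{\Sigma}_i)$, $i = 1, \dots, n$:
2: $\tilde{m}_1, \dots, \tilde{m}_n$ are the mean vectors of the input Gaussians.
3: $\tilde{\Sigma}_1, \dots, \tilde{\Sigma}_n$ are the covariance matrices of the input Gaussians.
4: Choose the first clustering center $c_1$ randomly from the input Gaussians.
5: for $q = 1, \dots, k-1$ do (choose the $(q+1)$-th clustering center)
6:  Compute the divergence matrix $M \in \mathbb{R}^{n \times q}$, where $M_{ij} = D_{KL}(\mathcal{N}(\tilde{m}_i, \tilde{\Sigma}_i) \| c_j)$
7:  Build the minimum distance vector $v$, where $v_i = \min_j M_{ij}$.
8:  Compute the probability vector $p$, where $p_i = v_i / \sum_{l=1}^{n} v_l$
9:  Choose $c_{q+1}$ from the input Gaussians based on the probability vector $p$.
10: end for
11: Arrange the initial assignment $\pi$ based on the initial cluster centers $c_1, \dots, c_k$.

Stage 2: run the KL divergence based clustering

12: while not converged do
13:  for $j = 1, \dots, k$ do (update the means of the clustering centers)
14:   $\bar{m}_j = \frac{1}{|\pi_j|} \sum_{i \in \pi_j} \tilde{m}_i$
15:  end for
16:  for $j = 1, \dots, k$ do (update the covariances of the clustering centers)
17:   $\bar{\Sigma}_j = \frac{1}{|\pi_j|} \sum_{i \in \pi_j} \left(\tilde{\Sigma}_i + (\tilde{m}_i - \bar{m}_j)(\tilde{m}_i - \bar{m}_j)^T\right)$
18:  end for
19:  for $i = 1, \dots, n$ do (assign each Gaussian to the closest cluster representative Gaussian)
20:   assign $i$ to the cluster $\arg\min_j D_{KL}(\mathcal{N}(\tilde{m}_i, \tilde{\Sigma}_i) \| \mathcal{N}(\bar{m}_j, \bar{\Sigma}_j))$
21:  end for
22: end while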
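The seeding stage of the algorithm can be sketched as follows. This is an illustration under assumptions: the sampling probability is taken to be proportional to the smallest KL divergence to an already chosen center, the divergence is measured from each input Gaussian to the center, and the names `kl_gaussian` and `kl_plus_plus_seeds` are hypothetical. The input Gaussians are assumed not to be all identical, otherwise every probability would be zero.

```python
import numpy as np

def kl_gaussian(m1, S1, m2, S2):
    """KL divergence from N(m1, S1) to N(m2, S2), eq. (7)."""
    diff = m2 - m1
    logdet_ratio = np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1]
    return 0.5 * (logdet_ratio - len(m1) + np.trace(np.linalg.solve(S2, S1))
                  + diff @ np.linalg.solve(S2, diff))

def kl_plus_plus_seeds(means, covs, k, rng):
    """Return the indices of k input Gaussians chosen as initial centers."""
    n = len(means)
    seeds = [int(rng.integers(n))]               # first center: uniform draw
    for _ in range(k - 1):
        # Divergence from every input Gaussian to every chosen center.
        div = np.array([[kl_gaussian(means[i], covs[i], means[c], covs[c])
                         for c in seeds] for i in range(n)])
        nearest = np.clip(div.min(axis=1), 0.0, None)  # remove round-off
        nearest[seeds] = 0.0                     # never re-pick a center
        probs = nearest / nearest.sum()
        seeds.append(int(rng.choice(n, p=probs)))
    return seeds

means = np.array([[0, 0], [0.2, 0], [0, 0.2],
                  [10, 10], [10.2, 10], [10, 10.2]], dtype=float)
covs = np.stack([np.eye(2)] * 6)
print(kl_plus_plus_seeds(means, covs, k=3, rng=np.random.default_rng(0)))
```

Gaussians that are far from every chosen center receive most of the probability mass, so the seeds tend to spread over different clusters. The initial assignment then maps each input Gaussian to its nearest seed.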

## 5 Simulation Results

In this section, we apply $k$-means, spectral clustering, Wasserstein distance based clustering, Bhattacharyya distance based clustering, KL divergence based clustering and KL divergence++ to synthetic data and stock price data, and compare their clustering accuracy on these data sets. For the synthetic data, we directly compute the normalized mutual information, a clustering accuracy index, between the clustering result and the ground truth. On a real data set, however, the accuracy of unsupervised learning is very hard to measure since we do not have the ground truth. As a result, for the stock price data set, we use the clustering result on the original data as the ground truth. Then we add different levels of i.i.d. Gaussian noise to the data. Finally, we compute the clustering accuracy as the mutual information between the clustering result on the noisy data and this "ground truth". We first introduce the accuracy index, normalized mutual information, in section 5.1, and then describe the simulation process and results for the synthetic data and the real stock price data.

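### 5.1 Normalized Mutual Information

Normalized mutual information (NMI) is an index for estimating clustering quality. It is derived from mutual information, which is a measure of the mutual dependence between two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other. Suppose we have two different clustering results $\pi_a$ and $\pi_b$ of $n$ clustering objects. It is easy to transform $\pi_a$ and $\pi_b$ into two discrete variables $X$ and $Y$, the cluster labels of a uniformly chosen object. Let $p_X$ and $p_Y$ be the corresponding distributions, defined on the label spaces $\mathcal{X}$ and $\mathcal{Y}$ respectively. Then we have

$$p_X(i) = \frac{1}{n}|\pi_a = i|, \qquad p_Y(j) = \frac{1}{n}|\pi_b = j|, \qquad p_{X,Y}(i, j) = \frac{1}{n}|(\pi_a = i, \pi_b = j)| \tag{17}$$

where $|\pi_a = i|$ is the number of clustering objects assigned to cluster $i$ by $\pi_a$, and $|(\pi_a = i, \pi_b = j)|$ is the number of objects assigned to cluster $i$ by $\pi_a$ and to cluster $j$ by $\pi_b$. The mutual information between $X$ and $Y$ is defined as

$$I(X, Y) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_{X,Y}(x, y) \log\frac{p_{X,Y}(x, y)}{p_X(x)\,p_Y(y)} \tag{18}$$

The entropies of the random variables $X$ and $Y$ are defined as

$$H(X) = -\sum_{x \in \mathcal{X}} p_X(x) \log p_X(x), \qquad H(Y) = -\sum_{y \in \mathcal{Y}} p_Y(y) \log p_Y(y) \tag{19}$$

The normalized mutual information is defined as

$$\mathrm{NMI}(X, Y) = \frac{2 \times I(X, Y)}{H(X) + H(Y)} \tag{20}$$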
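Eqs. (17) to (20) can be computed from two label vectors with a contingency table. A minimal sketch follows; the function name is illustrative, and returning 1 when both clusterings consist of a single cluster (where eq. (20) would be 0/0) is a convention chosen for this sketch.

```python
import numpy as np

def normalized_mutual_information(labels_a, labels_b):
    """NMI of eq. (20) between two clusterings of the same n objects."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    n = a.size
    # Map arbitrary labels to 0..K-1 so they can index a table.
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)   # co-occurrence counts
    joint /= n                              # joint distribution, eq. (17)
    p_a = joint.sum(axis=1)                 # marginal of the first clustering
    p_b = joint.sum(axis=0)                 # marginal of the second clustering
    nonzero = joint > 0                     # 0 log 0 is treated as 0
    mutual = np.sum(joint[nonzero]
                    * np.log(joint[nonzero] / np.outer(p_a, p_b)[nonzero]))
    h_a = -np.sum(p_a * np.log(p_a))        # entropies, eq. (19)
    h_b = -np.sum(p_b * np.log(p_b))
    if h_a + h_b == 0.0:
        return 1.0                          # ASSUMPTION: both trivial
    return float(2.0 * mutual / (h_a + h_b))

print(normalized_mutual_information([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]))
print(normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index does not depend on how the clusters are numbered, which is why it can compare a clustering result with a ground truth whose labels are named differently.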

### 5.2 Synthetic Data

We apply the same synthetic data generating strategy as [davis2007differential] to maintain comparability. Our synthetic dataset is a set of 200 objects, each consisting of 30 samples. The samples of one object are all drawn from the same one of $k$ randomly generated $d$-dimensional multivariate Gaussians. For each Gaussian, the mean vector is chosen uniformly on the surface of the $d$-dimensional unit simplex. The covariance matrix is a random matrix of the form $U \Lambda U^T$, with $\Lambda$ the diagonal matrix of its eigenvalues and $U$ a random orthogonal matrix. We apply six different algorithms: $k$-means, spectral clustering, the KL divergence based clustering algorithm, KL divergence++, Wasserstein distance based clustering, and Bhattacharyya distance based clustering. The simulation results are shown in figure 2. The first panel shows the NMI curve when we fix the number of clusters $k$ and vary the dimension $d$ from 4 to 10. The second shows the NMI curve when we fix the dimension $d$ and vary the number of clusters $k$. Every curve is the average over 1000 iterations. This simulation was run on a computer with an Intel I5-8400 and took 51142 seconds. The average NMI is shown in figure 2, table 7 and table 8.

Then, we compute the variance of the normalized mutual information for each algorithm over the 1000 iterations. The detailed data is shown in figure 3, table 9 and table 10.

We compare the results of the first-moment information based clustering algorithms ($k$-means and spectral clustering) with those of the distribution information based clustering algorithms (KL divergence based clustering, KL divergence++, Wasserstein distance based clustering, and Bhattacharyya distance based clustering). We find that distribution information can greatly improve the clustering accuracy, and that the Bhattacharyya distance based algorithm has the highest average NMI. Moreover, Bhattacharyya distance based clustering also has the lowest variance among all second-moment information based algorithms. This simulation only uses the Gaussian distribution as the assumed hidden distribution, but the results can be easily extended to other distribution assumptions.

### 5.3 Stock Clustering

We use the New York Stock Exchange data collected by Dominik Gawlik [anudc:4896]. There are four different prices for each stock every day: open price (open), close price (close), low price (low) and high price (high). We can view each stock as a set of multiple samples of these four features at different times, more specifically 1726 four-dimensional samples from Jun 4th, 2010 to Oct 7th, 2016. We then cluster the stocks based on these samples. Six different algorithms are applied: $k$-means++, spectral clustering, KL divergence based clustering, KL divergence++, Wasserstein distance based spectral clustering and Bhattacharyya distance based clustering. For each algorithm, we use its assignment result on the original data as the ground truth and then add i.i.d. Gaussian noise to the stock data. We apply the same clustering algorithm to the noisy data and compute the mutual information with the "ground truth" of that algorithm. This simulation was run on the I5-8400 platform, where 100 iterations took 78450 seconds. The results are shown in the six panels of figure 4: