1 Introduction
Multiple sample data is a specific kind of data structure in clustering. Suppose there are several groups of vectors. Multiple sample clustering tries to cluster the groups rather than the vectors themselves. Most common algorithms, e.g. k-means, linear discriminant analysis, and principal component analysis, view each input object as a single sample drawn from a univariate (or multivariate) Gaussian distribution.
Multiple sample clustering has many applications in daily life. In video rating, for example, each video can be rated by different viewers, and the ratings of each video can be viewed as multiple samples drawn from the same distribution. Clustering videos can help video providers (such as Netflix) recommend the most relevant movies (or ads) to specific users[davidson2010youtube, mei2009automatic, yang2007online]. The quality of products coming from the same batch can also be seen as multiple samples drawn from a hidden distribution. If we can cluster similar batches together, the manager can use this information to improve the production process[undey2003online, russell1998recursive]. Moreover, time series data, such as readings from sensor networks[bandyopadhyay2003energy] or stock prices at different times[harris1991stock], can also be viewed as multiple samples from a hidden distribution. Clustering sensors can reveal which sensors are related and reduce the communication cost between them. Clustering stocks can help us build an investment portfolio.
Related work
Multiple sample clustering is a subfield of unsupervised learning, which detects structure in data without prior knowledge. There are two main branches of unsupervised learning: principal component analysis and clustering. Principal component analysis[pearson1901liii] focuses on building linearly uncorrelated variables from a set of observations. This is not the main concern of our paper. Clustering aims at partitioning the whole observed set into separate subgroups. In the clustering field, researchers have been exploiting distribution information for decades. In 2006, Antonio Irpino[irpino2006new] applied the Wasserstein distance to the clustering of histogram symbolic data, clustering different states in America based on the temperatures of each month. In 2013, Claudia Canali and Riccardo Lancellotti[canali2013automatic] applied the Bhattacharyya distance to virtual machine clustering based on the behavior histograms of virtual machines. However, both algorithms are only useful for scalar samples. In 2007, Davis and Dhillon[davis2007differential] applied KL divergence to multiple sample clustering under the Gaussian distribution assumption. Their simulation results show that distribution information improves clustering accuracy substantially compared with using only the mean vector of each sample group. However, KL divergence is not a true distance, so many graph-based clustering algorithms are inapplicable under it, and the algorithm cannot fit all multiple sample data structures. A general framework for multiple sample clustering remains undeveloped.
Contribution and paper outline
In this paper, we propose a general framework for multiple sample clustering, from which various algorithms can be generated. An adapted algorithm called KL divergence++ is built on Davis and Dhillon's work[davis2007differential] and achieves higher clustering accuracy than the original one. Finally, we compare the performance of six different algorithms to illustrate the importance of distribution information in multiple sample clustering.
This paper is organized as follows. Section 2 gives the necessary notation and background knowledge. Section 3 formulates the multiple sample clustering problem. Section 4 proposes the general framework to solve it, together with the specific clustering algorithms, and section 5 presents the simulation results on both synthetic and real data. The main results and future research directions are summarized in section 6. [fortunato2016community][javed2018community][rossetti2018community]
Necessary Notations
Although all symbols are explained where they first appear, we list some basic notation here for clarity.
Mathematical Meaning  Symbols 

vector  bold lower case: 
matrix  bold capital letter: 
spaces  
set  
distribution 
2 Preliminary
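In this section, we introduce some essential concepts and tools for multiple sample clustering. The probability density function of the univariate Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
The multivariate Gaussian distribution is a natural extension of it. Suppose a $d$-dimensional multivariate Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Its probability density function is
$$p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right) \tag{1}$$
where $|\boldsymbol{\Sigma}|$ is the determinant of $\boldsymbol{\Sigma}$.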
2.1 Distribution Metrics
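Mathematically, a metric on a space $\mathcal{X}$ is a function $\rho: \mathcal{X} \times \mathcal{X} \to [0, \infty)$, where $\times$ denotes the Cartesian product between two spaces, with the following four properties for all elements $x, y, z \in \mathcal{X}$:
 nonnegativity: $\rho(x, y) \ge 0$
 identity: $\rho(x, y) = 0 \iff x = y$
 symmetry: $\rho(x, y) = \rho(y, x)$
 subadditivity: $\rho(x, z) \le \rho(x, y) + \rho(y, z)$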
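We will discuss two distribution distances separately: the Wasserstein distance and the Bhattacharyya distance. Both are nonnegative and symmetric, which is what the graph-based algorithms below require, although strictly speaking the Bhattacharyya distance does not satisfy subadditivity. KL divergence is introduced in section 2.1.3, but it is not a distance metric since it is not symmetric and does not obey the subadditivity rule.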
2.1.1 Wasserstein Distance
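The Wasserstein distance dates back to 1781, when Gaspard Monge formulated the optimal transport objective function to measure the effort needed to move a pile of dirt into another place with a different shape[monge1781memoire]. Because of this history, the Wasserstein distance is also called the earth mover's distance. Consider two spaces $\mathcal{X}$ and $\mathcal{Y}$, two measures $P$ and $Q$ defined on $\mathcal{X}$ and $\mathcal{Y}$ respectively and satisfying $P(\mathcal{X}) = Q(\mathcal{Y})$, a cost function $c(x, y)$, and a map $T: \mathcal{X} \to \mathcal{Y}$ with $Q(B) = P(T^{-1}(B))$ for every measurable $B \subseteq \mathcal{Y}$. The Wasserstein distance is defined as the minimum value of the cost function over the choice of the map $T$:
$$W(P, Q) = \min_{T} \int_{\mathcal{X}} c(x, T(x)) \, dP(x) \tag{2}$$
In practice, for vectors $\mathbf{x}$ and $\mathbf{y}$ in Euclidean space, we always set $c(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2^2$. The square root of the resulting minimum, $W_2$, is called the 2-norm Wasserstein distance.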
Although the Wasserstein distance can measure the distance between continuous distributions, between discrete distributions, or between a continuous and a discrete distribution, we only consider Gaussian distributions for computational efficiency. Nevertheless, the framework proposed in this paper can be extended to other distributions.
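Suppose two Gaussians $\mathcal{N}_1 = \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}_2 = \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$, where $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are the mean vectors and $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ are the respective covariance matrices. The Wasserstein distance between two $d$-dimensional Gaussian distributions can be computed in closed form:
$$W_2^2(\mathcal{N}_1, \mathcal{N}_2) = \|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|_2^2 + \mathrm{tr}\left(\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2 - 2\left(\boldsymbol{\Sigma}_2^{1/2} \boldsymbol{\Sigma}_1 \boldsymbol{\Sigma}_2^{1/2}\right)^{1/2}\right) \tag{3}$$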
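The closed form for Gaussians can be evaluated with a few lines of NumPy. The following is a minimal sketch, assuming the standard squared 2-Wasserstein formula for Gaussians; the function names `psd_sqrt` and `wasserstein2_squared` are illustrative and not part of the original work.

```python
import numpy as np

def psd_sqrt(mat):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

def wasserstein2_squared(mu1, cov1, mu2, cov2):
    """Squared 2-Wasserstein distance between N(mu1, cov1) and N(mu2, cov2)."""
    root2 = psd_sqrt(cov2)
    cross = psd_sqrt(root2 @ cov1 @ root2)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.trace(cov1 + cov2 - 2.0 * cross)
    # The covariance term is nonnegative in exact arithmetic; clamp round-off.
    return float(mean_term + max(cov_term, 0.0))
```

For isotropic covariances $\sigma_1^2 I$ and $\sigma_2^2 I$ the covariance term reduces to $d(\sigma_1 - \sigma_2)^2$, which gives a quick sanity check.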
2.1.2 Bhattacharyya Distance
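The Bhattacharyya distance between two distributions $p$ and $q$ on the same domain $\mathcal{X}$ is
$$D_B(p, q) = -\ln \int_{\mathcal{X}} \sqrt{p(\mathbf{x}) q(\mathbf{x})} \, d\mathbf{x} \tag{4}$$
For the Bhattacharyya distance between Gaussian distributions $\mathcal{N}_1 = \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}_2 = \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$,
$$D_B(\mathcal{N}_1, \mathcal{N}_2) = \frac{1}{8} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) + \frac{1}{2} \ln \frac{|\boldsymbol{\Sigma}|}{\sqrt{|\boldsymbol{\Sigma}_1| |\boldsymbol{\Sigma}_2|}} \tag{5}$$
where $\boldsymbol{\Sigma} = (\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2)/2$.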
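As an illustration, the Gaussian case can be computed as below. This is a sketch under the standard closed form for the Bhattacharyya distance between Gaussians; the function name is an assumption, and log-determinants are used instead of determinants for numerical stability.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2)."""
    cov = 0.5 * (cov1 + cov2)          # averaged covariance
    diff = mu1 - mu2
    mean_term = 0.125 * diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    cov_term = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return float(mean_term + cov_term)
```

The value is symmetric in its two arguments and equals zero when both Gaussians coincide.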
2.1.3 KL divergence computation between Gaussians
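The KL divergence between two distributions $p$ and $q$ is
$$D_{KL}(p \,\|\, q) = \int p(\mathbf{x}) \ln \frac{p(\mathbf{x})}{q(\mathbf{x})} \, d\mathbf{x} \tag{6}$$
If $\mathcal{N}_1 = \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}_2 = \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$ are two $d$-dimensional Gaussian distributions, the KL divergence between $\mathcal{N}_1$ and $\mathcal{N}_2$ is
$$D_{KL}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = \frac{1}{2} \left[ \mathrm{tr}\left(\boldsymbol{\Sigma}_2^{-1} \boldsymbol{\Sigma}_1\right) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^\top \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) - d + \ln \frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} \right] \tag{7}$$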
Compared with the Wasserstein distance and the Bhattacharyya distance, KL divergence is not a true distance since it is not symmetric. As a result, graph partitioning algorithms that require a symmetric adjacency matrix, spectral clustering for example, cannot be applied directly to the KL divergence adjacency matrix between Gaussians.
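The asymmetry is easy to observe numerically. The sketch below assumes the standard closed form of the KL divergence between two Gaussians; the function name `kl_gaussian` is illustrative.

```python
import numpy as np

def kl_gaussian(mu1, cov1, mu2, cov2):
    """KL divergence KL( N(mu1, cov1) || N(mu2, cov2) )."""
    d = mu1.shape[0]
    diff = mu2 - mu1
    trace_term = np.trace(np.linalg.solve(cov2, cov1))   # tr(cov2^{-1} cov1)
    maha_term = diff @ np.linalg.solve(cov2, diff)       # Mahalanobis term
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return float(0.5 * (trace_term + maha_term - d + logdet2 - logdet1))

# Swapping the arguments changes the value, so the matrix of pairwise
# divergences is not symmetric in general.
forward = kl_gaussian(np.zeros(2), np.eye(2), np.zeros(2), 4.0 * np.eye(2))
backward = kl_gaussian(np.zeros(2), 4.0 * np.eye(2), np.zeros(2), np.eye(2))
```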
2.2 Spectral Clustering
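Spectral clustering is one of the graph partitioning algorithms. In the graph partitioning problem, let $G = (V, E)$ be an undirected graph with vertex set $V = \{v_1, \dots, v_n\}$ and similarity adjacency matrix $\mathbf{W} = (w_{ij})$, where $w_{ij}$ represents the similarity between $v_i$ and $v_j$. Graph partitioning algorithms try to split $V$ into $k$ non-overlapping parts $A_1, \dots, A_k$ by minimizing an objective function. Define $W(A, B) = \sum_{i \in A, j \in B} w_{ij}$, $\bar{A}$ the complement of $A$, $\mathrm{vol}(A) = \sum_{i \in A} d_i$ with $d_i = \sum_j w_{ij}$, and $|A|$ the number of nodes in $A$. The two most commonly used objective functions in graph partitioning are:
$$\mathrm{RatioCut}(A_1, \dots, A_k) = \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{|A_i|}, \qquad \mathrm{Ncut}(A_1, \dots, A_k) = \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)} \tag{8}$$
However, minimizing the RatioCut and Ncut objective functions are NP-hard problems. Spectral clustering solves a relaxed version of them. The implementation steps are shown in the following table: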
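Spectral Clustering
Input: graph matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$, where $w_{ij}$ describes the connection intensity between node $i$ and node $j$, and the number of clusters $k$.
Output: Partition $A_1, \dots, A_k$ of the input data (a bipartition $A$ and $\bar{A}$ when $k = 2$).
1: Compute the diagonal degree matrix $\mathbf{D}$ with elements:
$$d_i = \sum_{j=1}^{n} w_{ij} \tag{9}$$
2: Compute the normalized Laplacian matrix:
$$\mathbf{L}_{sym} = \mathbf{D}^{-1/2} (\mathbf{D} - \mathbf{W}) \mathbf{D}^{-1/2} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2} \tag{10}$$
3: Compute the first $k$ eigenvectors $\mathbf{u}_1, \dots, \mathbf{u}_k$ of $\mathbf{L}_{sym}$, i.e. those with the $k$ smallest eigenvalues.
4: Let $\mathbf{U} \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $\mathbf{u}_1, \dots, \mathbf{u}_k$ as columns.
5: Form the matrix $\mathbf{T} \in \mathbb{R}^{n \times k}$ from $\mathbf{U}$ by normalizing the rows to norm 1, that is, set $t_{ij} = u_{ij} / \left(\sum_{l} u_{il}^2\right)^{1/2}$.
6: For $i = 1, \dots, n$, let $\mathbf{y}_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $\mathbf{T}$.
7: Apply the k-means algorithm to the points $\mathbf{y}_1, \dots, \mathbf{y}_n$ and get clusters $C_1, \dots, C_k$.
8: Output clusters $A_1, \dots, A_k$ with $A_i = \{ j \mid \mathbf{y}_j \in C_i \}$.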
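The steps in the table can be sketched in NumPy as follows. This is an illustrative implementation, assuming the normalized Laplacian $\mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$ and a small k-means routine with a deterministic farthest-point initialization (a simplification chosen here so the example is self-contained; it is not taken from the original work).

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(W, k):
    """Normalized spectral clustering on a symmetric similarity matrix W."""
    degrees = W.sum(axis=1)                                   # step 1
    inv_sqrt = 1.0 / np.sqrt(degrees)
    L_sym = np.eye(len(W)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]  # step 2
    _, eigvecs = np.linalg.eigh(L_sym)                        # ascending eigenvalues
    U = eigvecs[:, :k]                                        # steps 3-4
    T = U / np.linalg.norm(U, axis=1, keepdims=True)          # step 5
    return kmeans(T, k)                                       # steps 6-8
```

The function assumes every node has positive degree; isolated nodes would need separate handling.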
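We only analyze the normalized Ncut in the $k = 2$ case. Other situations can be extended from this analysis.
For the objective function $\mathrm{Ncut}(A, \bar{A})$, we define the vector $\mathbf{f} \in \mathbb{R}^n$ with elements
$$f_i = \begin{cases} \sqrt{\mathrm{vol}(\bar{A}) / \mathrm{vol}(A)} & \text{if } v_i \in A \\ -\sqrt{\mathrm{vol}(A) / \mathrm{vol}(\bar{A})} & \text{if } v_i \in \bar{A} \end{cases} \tag{11}$$
With $\mathbf{L} = \mathbf{D} - \mathbf{W}$, we can prove that $(\mathbf{D}\mathbf{f})^\top \mathbf{1} = 0$, $\mathbf{f}^\top \mathbf{D} \mathbf{f} = \mathrm{vol}(V)$ and $\mathbf{f}^\top \mathbf{L} \mathbf{f} = \mathrm{vol}(V) \, \mathrm{Ncut}(A, \bar{A})$. Thus we can rewrite the problem of minimizing Ncut as
$$\min_{A} \mathbf{f}^\top \mathbf{L} \mathbf{f} \quad \text{subject to } \mathbf{f} \text{ as in (11)}, \; \mathbf{D}\mathbf{f} \perp \mathbf{1}, \; \mathbf{f}^\top \mathbf{D} \mathbf{f} = \mathrm{vol}(V) \tag{12}$$
Then we relax this problem by allowing $\mathbf{f}$ to take arbitrary real values:
$$\min_{\mathbf{f} \in \mathbb{R}^n} \mathbf{f}^\top \mathbf{L} \mathbf{f} \quad \text{subject to } \mathbf{D}\mathbf{f} \perp \mathbf{1}, \; \mathbf{f}^\top \mathbf{D} \mathbf{f} = \mathrm{vol}(V) \tag{13}$$
Then we substitute $\mathbf{g} = \mathbf{D}^{1/2} \mathbf{f}$. After substitution, the problem is
$$\min_{\mathbf{g} \in \mathbb{R}^n} \mathbf{g}^\top \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2} \mathbf{g} \quad \text{subject to } \mathbf{g} \perp \mathbf{D}^{1/2} \mathbf{1}, \; \|\mathbf{g}\|^2 = \mathrm{vol}(V) \tag{14}$$
The solution of eq. (14) is given by the eigenvector of $\mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2}$ corresponding to its second smallest eigenvalue.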
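The identities behind this relaxation can be checked numerically on a small graph. The sketch below assumes the two-way definition $\mathrm{Ncut}(A, \bar{A}) = W(A, \bar{A}) \left(1/\mathrm{vol}(A) + 1/\mathrm{vol}(\bar{A})\right)$ and the unnormalized Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{W}$; the helper names are illustrative.

```python
import numpy as np

def ncut_two_way(W, in_A):
    """Ncut(A, A_bar) for a boolean membership mask in_A."""
    degrees = W.sum(axis=1)
    cut = W[np.ix_(in_A, ~in_A)].sum()            # total weight crossing the cut
    return cut * (1.0 / degrees[in_A].sum() + 1.0 / degrees[~in_A].sum())

def indicator_vector(W, in_A):
    """Cluster indicator vector f with the two values used in the derivation."""
    degrees = W.sum(axis=1)
    vol_A, vol_rest = degrees[in_A].sum(), degrees[~in_A].sum()
    return np.where(in_A, np.sqrt(vol_rest / vol_A), -np.sqrt(vol_A / vol_rest))

# Small random symmetric graph and an arbitrary bipartition.
rng = np.random.default_rng(0)
M = rng.random((6, 6))
W = (M + M.T) / 2.0
np.fill_diagonal(W, 0.0)
in_A = np.array([True, True, False, True, False, False])

f = indicator_vector(W, in_A)
D = np.diag(W.sum(axis=1))
L = D - W
vol_V = W.sum()
quadratic_form = f @ L @ f     # should equal vol(V) * Ncut(A, A_bar)
```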
3 Problem Formulation
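In multiple sample clustering, each clustering object is not a vector, but a set of vectors sampled from a hidden distribution. Suppose we have $n$ clustering objects $X_1, \dots, X_n$, and each clustering object consists of multiple sample vectors, $X_i = \{\mathbf{x}_{i1}, \dots, \mathbf{x}_{im_i}\}$. The vectors in set $X_i$ are sampled from a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$. We can compute the estimated mean vector $\hat{\boldsymbol{\mu}}_i$ and the unbiased estimate of the covariance matrix $\hat{\boldsymbol{\Sigma}}_i$:
$$\hat{\boldsymbol{\mu}}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} \mathbf{x}_{ij} \tag{15}$$
$$\hat{\boldsymbol{\Sigma}}_i = \frac{1}{m_i - 1} \sum_{j=1}^{m_i} (\mathbf{x}_{ij} - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_{ij} - \hat{\boldsymbol{\mu}}_i)^\top \tag{16}$$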
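For one group of samples stored as the rows of an array, these two estimates are a direct NumPy call. This is a minimal sketch; the function name `estimate_gaussian` is illustrative.

```python
import numpy as np

def estimate_gaussian(samples):
    """Return the sample mean and the unbiased sample covariance of one group.

    `samples` has shape (m, d): one row per sample vector.
    """
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False, ddof=1)   # ddof=1 divides by m - 1
    return mean, cov
```

Note that the covariance estimate is singular when a group has fewer samples than dimensions plus one, in which case the Gaussian distances above are not well defined without regularization.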
Traditional clustering algorithms, such as k-means and spectral clustering, can only use the mean vector $\hat{\boldsymbol{\mu}}_i$ to cluster vector groups. In contrast, Wasserstein distance based clustering, Bhattacharyya distance based clustering, KL divergence based clustering, and the KL divergence++ algorithm can use the first moment information $\hat{\boldsymbol{\mu}}_i$, the second moment information $\hat{\boldsymbol{\Sigma}}_i$, and even the whole distribution information for clustering. We elaborate on these algorithms in section 4.
4 Clustering Algorithm
We now introduce a general framework for multiple sample clustering. Two algorithms, Wasserstein distance based clustering and Bhattacharyya distance based clustering, are generated from the framework.
The structure of the framework is shown in the following figure.
We can see from the figure that there are three main parts in multiple sample clustering: estimating the distribution corresponding to each vector group $X_i$, computing the adjacency matrix $\mathbf{W}$ based on a specific distribution metric, and using a graph partitioning algorithm to output the clustered groups $A_1, \dots, A_k$, which together form a partition of the input vector groups. We summarize the general framework for multiple sample clustering in table 3.
4.1 Wasserstein Distance Based Clustering
The Wasserstein distance based clustering algorithm is a special case of the framework. For a multiple sample dataset $\{X_1, \dots, X_n\}$ where $X_i = \{\mathbf{x}_{i1}, \dots, \mathbf{x}_{im_i}\}$, we assume that the samples of each group are drawn from the same Gaussian distribution, so that the mean vector and covariance matrix are sufficient statistics of the Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$. Then we use the Wasserstein distance between the estimated Gaussians to build a symmetric adjacency matrix $\mathbf{W}$. Finally, we apply normalized spectral clustering to the adjacency matrix to get the final clusters $A_1, \dots, A_k$. The pseudo code of this algorithm is shown in table 4.
4.2 Bhattacharyya Distance Based Clustering
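The Bhattacharyya distance based clustering algorithm has the same form as Wasserstein distance based clustering. The only difference is that it replaces the Wasserstein adjacency matrix with the one whose entry $(i, j)$ is computed from the Bhattacharyya distance between the Gaussians estimated for $X_i$ and $X_j$. The pseudo code of this algorithm is shown in table 5.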
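Both algorithms share the same adjacency-building step and differ only in the distance plugged in. The sketch below illustrates that step for any symmetric distance function. The conversion from distances to similarities is not specified in the text above, so the Gaussian kernel with a median bandwidth used here is an assumption made for illustration, as are the function names.

```python
import numpy as np

def pairwise_distances(gaussians, distance):
    """Symmetric matrix of distances between estimated Gaussians.

    `gaussians` is a list of (mean, covariance) pairs and `distance` is a
    callable distance(mu1, cov1, mu2, cov2), e.g. a Wasserstein or
    Bhattacharyya distance between Gaussians.
    """
    n = len(gaussians)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = distance(*gaussians[i], *gaussians[j])
    return D

def similarity_from_distances(D, bandwidth=None):
    """Assumed Gaussian-kernel conversion: small distance -> similarity near 1."""
    if bandwidth is None:
        # Median of the off-diagonal distances; assumes it is positive.
        bandwidth = np.median(D[np.triu_indices_from(D, k=1)])
    return np.exp(-(D ** 2) / (2.0 * bandwidth ** 2))
```

The resulting matrix is symmetric with ones on the diagonal, so it can be passed to a spectral clustering routine as the similarity matrix.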
4.3 KLdivergence based clustering Algorithm
The KL divergence algorithm was proposed by Davis and Dhillon in 2007[davis2007differential]. Unlike the previous algorithms, which only allow clustering of scalar samples, this algorithm was the first to make clustering of vector samples possible. We can view this algorithm as an extension of the k-means algorithm. The algorithm can be separated into three steps. Step one: for vector groups $X_1, \dots, X_n$, estimate the corresponding distributions $\mathcal{N}(\hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i)$, where $\hat{\boldsymbol{\mu}}_i$ and $\hat{\boldsymbol{\Sigma}}_i$ are the estimated mean vector and covariance matrix of $X_i$, and make the initial assignment. Step two: in the $t$-th loop, compute the clustering centers, one Gaussian per cluster. Step three: make a new assignment, in which vector group $X_i$ is assigned to cluster $j$ if its distribution has the smallest KL divergence with the center of cluster $j$. Go back to step two until the cluster assignment no longer changes. The pseudo code of this algorithm is shown in table 6.
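The three steps can be sketched as a k-means style loop. The sketch below is an illustration rather than the reference implementation: the center update (average of the member means, and average of the member covariances plus the scatter of the member means) is an assumption about the method of [davis2007differential] and should be checked against that paper. All function names are illustrative.

```python
import numpy as np

def kl_gaussian(mu1, cov1, mu2, cov2):
    """KL( N(mu1, cov1) || N(mu2, cov2) )."""
    d = mu1.shape[0]
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.5 * (np.trace(np.linalg.solve(cov2, cov1))
                  + diff @ np.linalg.solve(cov2, diff) - d + logdet2 - logdet1)

def update_center(means, covs):
    """Assumed center update: moment-matched Gaussian of the cluster members."""
    center_mean = means.mean(axis=0)
    diffs = means - center_mean
    center_cov = (covs + diffs[:, :, None] * diffs[:, None, :]).mean(axis=0)
    return center_mean, center_cov

def kl_clustering(means, covs, labels, k, max_iter=100):
    """means: (n, d), covs: (n, d, d), labels: initial assignment in 0..k-1."""
    labels = np.asarray(labels).copy()
    for _ in range(max_iter):
        centers = []
        for j in range(k):                       # step two: cluster centers
            members = labels == j
            if not members.any():
                raise ValueError("a cluster became empty")
            centers.append(update_center(means[members], covs[members]))
        new_labels = np.array([                  # step three: reassignment
            int(np.argmin([kl_gaussian(m, c, cm, cc) for cm, cc in centers]))
            for m, c in zip(means, covs)
        ])
        if np.array_equal(new_labels, labels):   # assignment unchanged: stop
            break
        labels = new_labels
    return labels
```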
4.4 KL divergence++ algorithm
KL divergence++ is an improved version of the KL divergence based clustering algorithm. It is inspired by Arthur[arthur2007k], who chooses the initial clustering centers carefully so that the distances between them are statistically large enough[arthur2007k]. Applying the same idea yields the KL divergence++ algorithm. As described in section 4.3, the KL divergence based clustering algorithm has three steps, and KL divergence++ only changes step one. In step one, we do not make the assignment randomly. Instead, we choose the initial clustering centers sequentially, in three sub-steps. Sub-step one: randomly choose the first clustering center from the input Gaussians. Sub-step two: for the $(q+1)$-th clustering center, compute the KL divergence between each distribution and each of the $q$ clustering centers chosen so far, and for each distribution keep the smallest of these values, $m_i$. Then compute the probability vector $\mathbf{p}$, whose entry $p_i$ is proportional to $m_i$. Sub-step three: choose the $(q+1)$-th clustering center at random according to the probability vector, and go back to sub-step two until all clustering centers are chosen.
Algorithm 4: KL divergence++ algorithm
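# Use the k-means++ strategy to choose the initial cluster centers
1: For vector groups $X_1, \dots, X_n$, estimate the corresponding Gaussian distributions $\mathcal{N}(\hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i)$
2: $\hat{\boldsymbol{\mu}}_1, \dots, \hat{\boldsymbol{\mu}}_n$ are the mean vectors of the input Gaussians
3: $\hat{\boldsymbol{\Sigma}}_1, \dots, \hat{\boldsymbol{\Sigma}}_n$ are the covariance matrices of the input Gaussians.
4: Choose the first clustering center $c_1$ randomly from the input Gaussians.
5: for q = 1:(k-1) # choose the $(q+1)$-th clustering center
6: Compute the divergence matrix $\mathbf{M} \in \mathbb{R}^{n \times q}$ where $M_{ij} = D_{KL}(\mathcal{N}_i \,\|\, \mathcal{N}_{c_j})$
7: Build the minimum distance vector $\mathbf{m}$ where $m_i = \min_j M_{ij}$.
8: Compute the probability vector $\mathbf{p}$ where $p_i = m_i / \sum_{l} m_l$
9: Choose $c_{q+1}$ in the range of $1, \dots, n$ based on the probability vector $\mathbf{p}$.
10: end for
11: Arrange the initial assignment $\pi$ based on the initial cluster centers $c_1, \dots, c_k$.
# Do the KL divergence based clustering
12: while not converged do
13: for $j = 1, \dots, k$ do # update means of clustering centers
14: $\boldsymbol{\mu}_j^* = \frac{1}{|\pi_j|} \sum_{i \in \pi_j} \hat{\boldsymbol{\mu}}_i$
15: end for
16: for $j = 1, \dots, k$ do # update cluster covariances of clustering centers
17: $\boldsymbol{\Sigma}_j^* = \frac{1}{|\pi_j|} \sum_{i \in \pi_j} \left( \hat{\boldsymbol{\Sigma}}_i + (\hat{\boldsymbol{\mu}}_i - \boldsymbol{\mu}_j^*)(\hat{\boldsymbol{\mu}}_i - \boldsymbol{\mu}_j^*)^\top \right)$
18: end for
19: for $i = 1, \dots, n$ do # assign each Gaussian to the closest cluster representative Gaussian
20: $\pi(i) = \arg\min_j D_{KL}\left(\mathcal{N}(\hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i) \,\|\, \mathcal{N}(\boldsymbol{\mu}_j^*, \boldsymbol{\Sigma}_j^*)\right)$
21: end for
22: end while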
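The seeding phase of the algorithm can be sketched as follows. The sketch assumes that the sampling probability of a candidate is proportional to its smallest KL divergence to the centers chosen so far (the k-means++ analogue); the function names and the choice to zero out already chosen candidates are illustrative.

```python
import numpy as np

def kl_gaussian(mu1, cov1, mu2, cov2):
    """KL( N(mu1, cov1) || N(mu2, cov2) )."""
    d = mu1.shape[0]
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.5 * (np.trace(np.linalg.solve(cov2, cov1))
                  + diff @ np.linalg.solve(cov2, diff) - d + logdet2 - logdet1)

def kl_plus_plus_seeds(means, covs, k, rng):
    """Return the indices of k input Gaussians chosen as initial centers."""
    n = len(means)
    chosen = [int(rng.integers(n))]              # first center: uniform at random
    while len(chosen) < k:
        div = np.array([[kl_gaussian(means[i], covs[i], means[c], covs[c])
                         for c in chosen] for i in range(n)])
        min_div = np.clip(div.min(axis=1), 0.0, None)   # clamp round-off
        min_div[chosen] = 0.0                    # never pick a center twice
        probs = min_div / min_div.sum()          # assumes not all Gaussians coincide
        chosen.append(int(rng.choice(n, p=probs)))
    return chosen

def initial_assignment(means, covs, chosen):
    """Assign every Gaussian to the seed with the smallest KL divergence."""
    return np.array([
        int(np.argmin([kl_gaussian(m, c, means[s], covs[s]) for s in chosen]))
        for m, c in zip(means, covs)
    ])
```

The returned assignment can then be used as the starting point of the KL divergence based iteration.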
5 Simulation Results
In this section, we apply k-means, spectral clustering, Wasserstein distance based clustering, Bhattacharyya distance based clustering, KL divergence based clustering and KL divergence++ to synthetic data and to stock price data, and compare their clustering accuracy on these data sets. On synthetic data, we directly compute the normalized mutual information, a clustering accuracy index, between the clustering result and the ground truth. However, the accuracy of unsupervised learning is very hard to measure on a real data set since we do not have the ground truth. As a result, on the stock price data set, we use the clustering result of the original data set as the ground truth. Then we add different levels of i.i.d. Gaussian noise to the data. Finally, we compute the clustering accuracy as the mutual information between the clustering result of the noised data set and this "ground truth". We first introduce normalized mutual information, the accuracy index, in section 5.1, and then describe the simulation process and results on the synthetic data and on the real stock price data.
5.1 Normalized Mutual information
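Normalized mutual information (NMI) is an index for estimating clustering quality. It is derived from the mutual information, which is a measure of the mutual dependence between two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable. Suppose we have two different clustering results $A = \{A_1, \dots, A_k\}$ and $B = \{B_1, \dots, B_l\}$ of $n$ clustering objects. It is easy to transform $A$ and $B$ into two discrete variables, the cluster labels of a uniformly chosen object. Let $P_A$ and $P_B$ be the corresponding distributions, defined on $\{1, \dots, k\}$ and $\{1, \dots, l\}$ respectively. Then we have
$$P_A(i) = \frac{|A_i|}{n}, \qquad P_B(j) = \frac{|B_j|}{n}, \qquad P_{AB}(i, j) = \frac{|A_i \cap B_j|}{n} \tag{17}$$
where $|A_i|$ is the number of clustering objects assigned to cluster $A_i$. The mutual information between $A$ and $B$ is defined as
$$I(A; B) = \sum_{i=1}^{k} \sum_{j=1}^{l} P_{AB}(i, j) \log \frac{P_{AB}(i, j)}{P_A(i) P_B(j)} \tag{18}$$
The entropies of the random variables $A$ and $B$ are defined as
$$H(A) = -\sum_{i=1}^{k} P_A(i) \log P_A(i), \qquad H(B) = -\sum_{j=1}^{l} P_B(j) \log P_B(j) \tag{19}$$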
The normalized mutual information is defined as
$$\mathrm{NMI}(A, B) = \frac{I(A; B)}{\sqrt{H(A) H(B)}} \tag{20}$$
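The index can be computed directly from two label vectors. The sketch below assumes the geometric-mean normalization $I(A;B)/\sqrt{H(A)H(B)}$; other normalizations (for example by the arithmetic mean of the entropies) are also in common use, so this choice is an assumption. The function name is illustrative.

```python
import numpy as np

def normalized_mutual_information(labels_a, labels_b):
    """NMI between two labelings of the same objects (geometric normalization)."""
    _, a_idx = np.unique(np.asarray(labels_a), return_inverse=True)
    _, b_idx = np.unique(np.asarray(labels_b), return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)        # contingency table of counts
    joint /= joint.sum()                         # joint distribution P_AB
    p_a, p_b = joint.sum(axis=1), joint.sum(axis=0)
    nonzero = joint > 0                          # 0 * log 0 is treated as 0
    mi = np.sum(joint[nonzero] * np.log(joint[nonzero] / np.outer(p_a, p_b)[nonzero]))
    h_a = -np.sum(p_a * np.log(p_a))
    h_b = -np.sum(p_b * np.log(p_b))
    if h_a == 0.0 or h_b == 0.0:                 # a labeling with a single cluster
        return 1.0 if h_a == h_b else 0.0
    return float(mi / np.sqrt(h_a * h_b))
```

The value does not depend on how the clusters are numbered, which is why it is suitable for comparing a clustering result with a ground truth partition.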
5.2 Synthetic data
We apply the same synthetic data generating strategy as [davis2007differential] to maintain comparability. Our synthetic dataset is a set of 200 objects. Each object consists of 30 samples. The samples of the same object are drawn from one of $k$ randomly generated $d$-dimensional multivariate Gaussians $\mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$, $j = 1, \dots, k$. The mean vector $\boldsymbol{\mu}_j$ is chosen uniformly on the surface of the $d$-dimensional unit simplex. The covariance matrix $\boldsymbol{\Sigma}_j = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$ is a random matrix, with $\boldsymbol{\Lambda}$ a diagonal matrix of prescribed eigenvalues and $\mathbf{U}$ a random orthogonal matrix. We apply six different algorithms: k-means, spectral clustering, the KL divergence based clustering algorithm, KL divergence++, Wasserstein distance based clustering, and Bhattacharyya distance based clustering. The simulation results are shown in figure 2. The first panel shows the NMI when we fix the number of clusters $k$ and vary the dimension $d$ from 4 to 10. The second panel shows the NMI when we fix the dimension $d$ and vary the number of clusters $k$ from 2 to 9. Every point is the average result over 1000 iterations. This simulation was run on a computer with an Intel i5-8400 and took 51142 seconds. The average NMI is shown in figure 2, table 7 and table 8.
Table 7: Average NMI for dimension $d$ from 4 to 10
Algorithm \ d  4  5  6  7  8  9  10 

k-means  0.2605  0.1161  0.1777  0.1314  0.1103  0.0804  0.0610 
spectral clustering  0.2158  0.1005  0.1363  0.1117  0.0932  0.0726  0.0539 
KL divergence  0.9086  0.9406  0.9455  0.9438  0.9448  0.9401  0.9374 
KL divergence++  0.9242  0.9553  0.9555  0.9621  0.9621  0.9617  0.9580 
Wasserstein  0.7790  0.8540  0.9204  0.9439  0.9540  0.9625  0.9660 
Bhattacharyya  0.9090  0.9354  0.9516  0.9588  0.9599  0.9680  0.9716 
Table 8: Average NMI for number of clusters $k$ from 2 to 9
Algorithm \ k  2  3  4  5  6  7  8  9 

k-means  0.1073  0.0481  0.1998  0.0826  0.1185  0.1354  0.1607  0.1800 
spectral clustering  0.1133  0.0361  0.1693  0.0743  0.1102  0.1190  0.1424  0.1490 
KL divergence  0.9940  0.9545  0.9399  0.9515  0.9362  0.9389  0.9427  0.9397 
KL divergence++  1  0.9684  0.9647  0.9633  0.9560  0.9511  0.9551  0.9497 
Wasserstein  0.9999  0.9799  0.9555  0.9408  0.9377  0.9259  0.9239  0.8999 
Bhattacharyya  1  0.9882  0.9674  0.9586  0.9549  0.9492  0.9485  0.9441 
Then, we compute the variance of the normalized mutual information for each algorithm over the 1000 iterations. The detailed data is shown in figure 3, table 9 and table 10.
Table 9: Variance of NMI for dimension $d$ from 4 to 10
Algorithm \ d  4  5  6  7  8  9  10 

k-means  0.0018  0.0008  0.0012  0.0009  0.0007  0.0005  0.0004 
spectral clustering  0.0021  0.0009  0.0016  0.0009  0.0008  0.0005  0.0003 
KL divergence  0.0064  0.0046  0.0047  0.0047  0.0045  0.0044  0.0046 
KL divergence++  0.0053  0.0039  0.0045  0.0037  0.0035  0.0034  0.0036 
Wasserstein  0.0061  0.0063  0.0050  0.0041  0.0038  0.0033  0.0029 
Bhattacharyya  0.0052  0.0048  0.0039  0.0036  0.0033  0.0029  0.0025 
Table 10: Variance of NMI for number of clusters $k$ from 2 to 9
Algorithm \ k  2  3  4  5  6  7  8  9 

k-means  0.0033  0.0009  0.0021  0.0007  0.0007  0.0005  0.0006  0.0006 
spectral clustering  0.0024  0.0005  0.0024  0.0006  0.0006  0.0006  0.0007  0.0006 
KL divergence  0.0060  0.0123  0.0069  0.0045  0.0031  0.0025  0.0020  0.0017 
KL divergence++  0  0.0090  0.0050  0.0037  0.0028  0.0021  0.0017  0.0015 
Wasserstein  0  0.0045  0.0051  0.0047  0.0033  0.0028  0.0023  0.0023 
Bhattacharyya  0  0.0034  0.0047  0.0035  0.0026  0.0020  0.0017  0.0017 
We now compare the results of the first-moment based clustering algorithms (k-means and spectral clustering) with those of the distribution based clustering algorithms (KL divergence based clustering, KL divergence++, Wasserstein distance based clustering, and Bhattacharyya distance based clustering). Distribution information greatly improves the clustering accuracy: the average NMI rises from below 0.3 to above 0.9 in most settings. Among the distribution based algorithms, KL divergence++ and the Bhattacharyya distance based algorithm attain the highest average NMI, with Bhattacharyya leading at the higher dimensions (9 and 10). The Bhattacharyya distance based clustering also has the lowest variance among the second-moment based algorithms in most settings. This simulation only uses the Gaussian distribution as the assumed hidden distribution, but the approach can be extended to other distribution assumptions.
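The data generation described in this section can be sketched as below. Two points are assumptions made for illustration: sampling the mean from a flat Dirichlet distribution as a reading of "uniformly on the unit simplex", and the placeholder eigenvalues $1, \dots, d$, since the exact values are not recoverable from the text above. The function names are illustrative.

```python
import numpy as np

def random_gaussian(d, rng, eigenvalues):
    """Draw one random Gaussian: mean on the unit simplex, covariance U diag(eig) U^T."""
    mean = rng.dirichlet(np.ones(d))                 # uniform on the unit simplex
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q = Q * np.sign(np.diag(R))                      # sign fix: uniformly random orthogonal
    cov = (Q * eigenvalues) @ Q.T
    return mean, (cov + cov.T) / 2.0                 # symmetrize against round-off

def make_dataset(n_objects, n_samples, k, d, rng, eigenvalues=None):
    """Return a list of (n_samples, d) sample groups and their true cluster labels."""
    if eigenvalues is None:
        eigenvalues = np.arange(1, d + 1, dtype=float)   # placeholder assumption
    components = [random_gaussian(d, rng, eigenvalues) for _ in range(k)]
    truth = rng.integers(k, size=n_objects)
    groups = [rng.multivariate_normal(*components[j], size=n_samples) for j in truth]
    return groups, truth

# Sizes used in the text: 200 objects with 30 samples each.
rng = np.random.default_rng(0)
groups, truth = make_dataset(200, 30, k=5, d=6, rng=rng)
```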
5.3 Stock clustering
We use the New York stock exchange data collected by Dominik Gawlik [anudc:4896]. There are four different prices for each stock every day: open price (open), close price (close), low price (low) and high price (high). We can view each stock as a set of multiple samples of four features at different times, more specifically, 1726 four-dimensional samples from Jun 4th, 2010 to Oct 7th, 2016. We then cluster the stocks based on these samples. Six different algorithms are applied: k-means++, spectral clustering, KL divergence based clustering, KL divergence++, Wasserstein distance based spectral clustering and Bhattacharyya distance based clustering. For each algorithm, we use its assignment result on the original data as the ground truth and then add i.i.d. Gaussian noise to the stock data. We apply the same clustering algorithm to the noised data and compute the mutual information with the "ground truth" of that algorithm. This simulation was run on an Intel i5-8400 platform, and 100 iterations took 78450 seconds. The results are shown in the six panels of figure 4 and in the following tables:
Algorithm \ k  2  3  4  5  6  7  8  9 

k-means  0.9815  0.9691  0.7919  0.8561  0.7639  0.7284  0.7633  0.7755 
spectral clustering  0.9935  0.3093  0.3046  0.2835  0.1836  0.1335  0.1114  0.1076 
KL divergence  0.7632  0.8562  0.6755  0.8467  0.8455  0.8381  0.8283  0.8378 
KL divergence++  0.7817  0.8720  0.6770  0.8445  0.8457  0.8437  0.8314  0.8419 
Wasserstein  0.5979  0.7104  0.7659  0.6242  0.5761  0.1991  0.2655  0.3724 
Bhattacharyya  0.6348  0.4759  0.7517  0.5741  0.5999  0.6229  0.6437  0.6489 
Algorithm \ k  2  3  4  5  6  7  8  9 

k-means  0.9826  0.9815  0.7704  0.8656  0.7777  0.7285  0.7700  0.7839 
spectral clustering  0.9935  0.3714  0.3305  0.2982  0.1891  0.1345  0.1066  0.0979 
KL divergence  0.8513  0.8794  0.7225  0.8828  0.8868  0.8759  0.8789  0.8736 
KL divergence++  0.8547  0.8974  0.7196  0.8857  0.8877  0.8875  0.8826  0.8750 
Wasserstein  0.5909  0.6411  0.7549  0.6587  0.6370  0.6628  0.6635  0.5387 
Bhattacharyya  0.8677  0.6650  0.7683  0.6104  0.5979  0.6634  0.6426  0.6455 
Algorithm \ k  2  3  4  5  6  7  8  9 

k-means  0.9808  0.9775  0.7850  0.8677  0.7771  0.7618  0.7570  0.7945 
spectral clustering  0.9922  0.3920  0.3181  0.3040  0.1804  0.1270  0.0967  0.0951 
KL divergence  0.9136  0.8665  0.7479  0.9063  0.9204  0.9014  0.8856  0.8687 
KL divergence++  0.9220  0.8944  0.7683  0.9106  0.9278  0.9174  0.8918  0.8765 
Wasserstein  0.7037  0.6652  0.7387  0.7661  0.6339  0.7127  0.6890  0.6364 
Bhattacharyya  0.8336  0.8843  0.7347  0.5583  0.6099  0.6337  0.6233  0.6457 
In the stock dataset simulation, we can split the algorithms into two groups: the k-means based algorithms (k-means++, KL divergence based clustering and KL divergence++) and the spectral clustering based algorithms (spectral clustering, Wasserstein distance based spectral clustering, and Bhattacharyya distance based clustering). In each group, we can see that the introduction of second-order information improves the clustering stability, particularly for larger numbers of clusters $k$.
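The stability protocol used in this section can be sketched as a small helper that is independent of the particular clustering algorithm. The callables `cluster_fn` (maps a list of sample groups to a label vector) and `score_fn` (compares two label vectors, e.g. an NMI implementation) are hypothetical placeholders, as is the function name.

```python
import numpy as np

def stability_score(groups, cluster_fn, score_fn, noise_std, rng, n_trials=10):
    """Average agreement between clustering the clean data and noised copies.

    The clustering of the original data plays the role of the ground truth.
    """
    reference = cluster_fn(groups)
    scores = []
    for _ in range(n_trials):
        # Add i.i.d. Gaussian noise to every sample of every group.
        noisy = [g + rng.normal(0.0, noise_std, size=g.shape) for g in groups]
        scores.append(score_fn(reference, cluster_fn(noisy)))
    return float(np.mean(scores))
```

Because the reference is produced by the same algorithm, the score measures robustness to noise rather than correctness of the clusters.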
6 Conclusion
In this paper, we propose a framework for multiple sample clustering that places no restriction on the structure of the data. We propose two specific cases of this framework: Wasserstein distance based clustering and Bhattacharyya distance based clustering. We also propose an improved KL divergence based clustering algorithm, KL divergence++. On synthetic data, the simulation results show that introducing distribution information greatly improves the clustering accuracy. On stock price data, distribution information strengthens the stability of the clustering results. Given the pervasiveness of multiple sample data, this framework can be applied in many different areas.