I. Introduction
The amount of information on the Internet is growing exponentially over time, and approximately 80% of it is stored in the form of text. Text mining has therefore become a very active topic. One particular research area is document clustering, a major topic in the Information Retrieval community. It allows high-order similarities between objects described by the rows and columns of a data matrix to be captured efficiently. In the domain of text clustering, a document is described as a set of words.
The relationship between documents and words allows us to exploit the relationship between groups of words that occur mostly within a group of documents.
In [1], a co-similarity measure called XSim was proposed. It builds on the idea of iteratively generating the similarity matrices between documents and words, each of them built on the basis of the other. This measure works well for unsupervised document clustering.
However, in recent research, the sentence has been considered a more informative feature term for improving the effectiveness of document clustering [2]. By considering three levels (Documents, Sentences, Words) to represent the data set, we are able to deal with the dependency between Documents and Sentences, as well as between Sentences and Words and, by deduction, between Documents and Words.
Another important aspect of co-clustering is weight computation. A weighted value may be assigned to the link from a document to a word (or sentence), indicating the presence of that word (or sentence) in the document. A 0/1 encoding denotes the presence or absence of an object in a given document.
Different weighting schemes, such as tf-idf [3], may be incorporated to better represent the importance of words in the corpus. However, this has spawned the view that classical probability theory is unable to deal with the uncertainties inherent in natural language and machine learning.
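As an illustration, a minimal tf-idf computation in the classical raw-frequency style of [3] might look as follows; the smoothing-free idf formula is one common variant, not necessarily the exact one used in the paper:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute tf-idf weights for a small corpus given as lists of words.

    A minimal sketch of the classical scheme; production systems usually
    add smoothing and length normalization.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in corpus for word in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights

corpus = [["fuzzy", "clustering", "text"],
          ["text", "mining", "text"],
          ["fuzzy", "similarity"]]
w = tf_idf(corpus)
# "text" occurs in two of the three documents, so its idf is log(3/2);
# a word present in every document would get weight 0.
```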
We therefore apply a fuzzification process that converts crisp similarities into fuzzy ones. The conversion to fuzzy values is represented by membership functions [4], which allow a graphical representation of a fuzzy set [5]. These fuzzy similarity matrices are used to compute fuzzy similarities between documents, sentences and words in a triadic computation called FTSim (Fuzzy Triadic Similarity).
Moreover, with the development of the Web and the high availability of storage space, more and more documents become accessible. Data can be provided from multiple sites and can be seen as a collection of matrices. By processing these matrices separately, we incur a huge loss of information.
Several extensions of co-clustering methods have been proposed to deal with such multi-view data. Some works aim at combining multiple similarity matrices to perform a given learning task [6, 7]. The idea is to build clusters from multiple similarity matrices computed along different views.
Multi-view co-clustering architectures such as MVSim [8], based on the XSim measure [1], deal with the problem of learning co-similarities from a collection of matrices describing interrelated types of objects. This architecture was proved to have interesting properties in terms of both convergence and scalability, and it allows an efficient parallelization of the process.
For this, we provide parallel architectures for FTSim to tackle the problem of learning similarities from a collection of matrices. For multi-source or large matrices, we propose different parallel architectures in which each FTSim instance is the basic component, or node, used to deal with multiple matrices.
Thus, we consider a model in which the data sets are distributed over sites (relation matrices). They describe the connections between documents for each local data set.
Our goal is then to compute a fuzzy Documents × Documents matrix for each site, taking into account all the representative information expressed in the relations.
To combine multiple occurrences of FTSim, we propose sequential-, merging- and splitting-based parallel architectures.
The rest of the paper is organized as follows: Section 2 reviews background on similarity measures for multi-view data sets. Section 3 presents our fuzzy triadic similarity measure. Section 4 presents the three proposed architectures allowing parallel co-clustering. Section 5 concludes the paper and gives indications of future work.
II. Dealing with Multi-View Data Sets
Most existing clustering methods focus on data sets described by a unique data matrix, which can either be a matrix describing objects by their characteristics, or a relation matrix describing the intensity of the relation between instances of two types of objects, such as a Documents × Words matrix. In the latter case, both types of objects can be clustered; methods dealing with this task are referred to as co-clustering approaches and have been extensively studied.
However, in many applications, data sets involving more than two types of interacting, or simply related, objects are also frequent. A simple way to represent such data sets is to use as many matrices as there are relations between the objects. One could then use classical co-clustering methods to separately cluster the objects occurring in the different matrices, but in this way interactions between objects are not taken into account, leading to a loss of information. Therefore, handling the views together, referred to as the multi-view clustering task, is an interesting challenge in the learning domain to overcome the limits of classical clustering.
Many extensions of clustering methods have been proposed to deal with multi-view data. In [9], the authors describe a multi-view extension of k-means (MVKM) and of the EM algorithm. In [6] and [7], the authors build clusters from multiple similarity matrices computed along different views. In [10], a co-clustering system called MVSC was proposed. It performs multi-view spectral clustering using co-training, which has been widely used in semi-supervised learning problems. The general idea is to learn the clustering in one view and use it to label the data in another view so as to modify the graph structure (similarity matrix).
Closer to our approach, some works aim at combining multiple similarity matrices to perform a given learning task. The MVSim architecture [8], an extension of the XSim algorithm [1], adapts the latter to the multi-view context. It simultaneously computes the co-similarity matrix for each kind of object described by the relation matrices. The basic idea is to create a learning network isomorphic to the structure of these data sets. It was shown that this architecture can efficiently compute co-similarities on large data sets by splitting a data matrix into smaller ones.
III. FTSim: Fuzzy Triadic Similarity
Sentence-based analysis means that the similarity between documents should be based on matching sentences rather than on matching single words only. Sentences contain more information than single words (information regarding proximity and order of words) and have a higher descriptive power [11][12][13]. Thus a document must be broken into a set of sentences, and a sentence into a set of words. We focus on how to combine the advantages of the two representation models in document co-clustering.
To represent our textual data set, two representations have been proposed: the collection of matrices and the k-partite graph [14]. In the first, each matrix describes a view on the data. In the second, a graph is said to be k-partite when the nodes are partitioned into subsets such that no two nodes of the same subset are adjacent. Thus, in the k-partite graph paradigm [14], a given subset of nodes contains the instances of one type of object, and a link between two nodes of different subsets represents the relation between these two nodes.
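The collection-of-matrices representation can be sketched as follows; the nested-list input format and the 0/1 encoding are illustrative assumptions standing in for a real sentence splitter and tokenizer:

```python
def build_relations(documents):
    """Build 0/1 Documents x Sentences and Sentences x Words matrices.

    `documents` is a list of documents, each a list of sentences, each
    sentence a list of words (a simplifying assumption for illustration).
    """
    sentences = [s for doc in documents for s in doc]
    n_sent = len(sentences)
    ds = [[0] * n_sent for _ in documents]  # Documents x Sentences
    k = 0
    for i, doc in enumerate(documents):
        for _ in doc:
            ds[i][k] = 1  # sentence k belongs to document i
            k += 1
    vocab = sorted({w for s in sentences for w in s})
    # Sentences x Words: 1 iff the word occurs in the sentence.
    sw = [[1 if w in s else 0 for w in vocab] for s in sentences]
    return ds, sw, vocab

docs = [[["fuzzy", "sets"], ["text", "mining"]],
        [["fuzzy", "clustering"]]]
ds, sw, vocab = build_relations(docs)
```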
To explain our model, we consider matrices to represent the data sets, together with a three-partite graph representation of the data matrices and their three linking relations.
From a functional point of view, the proposed FTSim model can be represented as shown in figure 1, where two data matrices represent a corpus and describe the connections between Documents/Sentences and Sentences/Words, brought by the three-partite graph [15].
After generating the Documents × Sentences and Sentences × Words matrices, we proceed to a fuzzification process. It converts crisp values into fuzzy ones. The conversion to fuzzy values is represented by membership functions [4], which allow a graphical representation of a fuzzy set [5]. There are various methods for assigning membership values, or membership functions, to fuzzy variables, essentially the triangular and trapezoidal ones. The trapezoidal form is the most suitable for modeling fuzzy Sentences × Documents and Words × Sentences similarities.
For each document, we define a fuzzy membership function through a linear transformation from the lower bound value, which is assigned a membership of 0, to the upper bound value, which is assigned a membership of 1. This function is used because smaller values increase linearly in membership toward the larger values for a positive slope, and the opposite holds for a negative slope. The following formulas show the fuzzy linear membership functions for the two relation matrices.
(1) 
and
(2) 
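A minimal sketch of the positive-slope linear membership transformation described above; taking the matrix minimum and maximum as the lower and upper bounds is one plausible choice, not necessarily the paper's exact parametrization:

```python
def fuzzify(matrix):
    """Map crisp weights to [0, 1] with a linear membership function.

    The lower bound maps to membership 0 and the upper bound to 1
    (positive slope); here the bounds are simply the matrix min/max,
    an illustrative assumption.
    """
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:  # degenerate case: constant matrix
        return [[1.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]

m = fuzzify([[0, 2], [4, 1]])
# 0 -> 0.0, 4 -> 1.0, intermediate values scale linearly.
```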
Before proceeding to the fuzzy triadic computation, we must initialize the Documents × Documents, Sentences × Sentences and Words × Words fuzzy matrices as identity matrices. The similarity between a document (resp. sentence, word) and itself is equal to 1; all other values are initialized to zero. The Documents × Documents matrix is defined as follows:
(3) 
where each entry is the membership degree of one document with respect to another. The Sentences × Sentences and Words × Words matrices are determined in the same way.
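The identity initialization described above can be sketched as:

```python
def identity(n):
    """Initial fuzzy similarity matrix: each object is fully similar
    to itself (membership 1) and, initially, to nothing else."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

sim_docs = identity(3)    # Documents x Documents
sim_sents = identity(5)   # Sentences x Sentences
sim_words = identity(8)   # Words x Words
```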
After this initialization, we calculate the new Documents × Documents matrix, which represents fuzzy similarities between documents, using the fuzzified Documents × Sentences matrix and the Sentences × Sentences similarity matrix.
Usually, the similarity measure between two documents is defined as a function that sums the similarities between their shared sentences.
Our idea is to generalize this function to take into account the intersection between all the possible pairs of sentences occurring in the two documents. In this way, we capture not only the fuzzy similarity of their common sentences but also that coming from sentences which are not directly shared by the documents but are shared with some other documents. For each pair of sentences not directly shared by the documents, we take into account the fuzzy similarity between them as provided by the Sentences × Sentences matrix.
Since we work with fuzzy matrices formed by membership degrees, the computation must be carried out with the operators for fuzzy sets, especially intersection and union. Thus, the fuzzy document similarity, except for the diagonal entries, can be formulated as follows:
(4) 
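A sketch of this update, under the common assumption that fuzzy intersection is min and fuzzy union is max (a sup-min composition); this is one plausible reading of the operators above, not necessarily the paper's exact formula:

```python
def update_doc_similarity(ds, sim_sents):
    """Fuzzy document similarity via sup-min composition.

    ds is the fuzzified Documents x Sentences matrix and sim_sents the
    current fuzzy Sentences x Sentences matrix:
        sim[i][j] = max over sentence pairs (s, t) of
                    min(ds[i][s], sim_sents[s][t], ds[j][t])
    """
    n_docs, n_sents = len(ds), len(ds[0])
    sim = [[1.0 if i == j else 0.0 for j in range(n_docs)]
           for i in range(n_docs)]
    for i in range(n_docs):
        for j in range(n_docs):
            if i == j:
                continue  # the diagonal stays at 1 by definition
            best = 0.0
            for s in range(n_sents):
                for t in range(n_sents):
                    best = max(best,
                               min(ds[i][s], sim_sents[s][t], ds[j][t]))
            sim[i][j] = best
    return sim

ds = [[1.0, 0.0], [0.0, 1.0]]
sim_sents = [[1.0, 0.4], [0.4, 1.0]]
sim = update_doc_similarity(ds, sim_sents)
# The two documents share no sentence, but their sentences are
# 0.4-similar, so the documents inherit similarity 0.4.
```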
As for the document computation, we generalize fuzzy similarities to take into account the intersection between all the possible pairs of words occurring in two sentences. In this way, we capture not only the fuzzy similarity of their common words but also that coming from words which are not directly shared by the sentences but are shared with some other sentences.
For each pair of words not directly shared by the sentences, we take into account the fuzzy similarity between them as provided by the Words × Words matrix. The overall fuzzy similarity between sentences is defined in the following equation:
(5) 
Similarly, the fuzzy similarity between words is computed by taking into account the sentences in which they occur. The overall fuzzy similarity between words is defined in the following equation:
(6) 
IV. Parallel FTSim
For multi-source or large data sets, we propose different parallel architectures in which each FTSim instance is the basic component, or site, used to deal with multiple matrices.
Thus, we consider a model in which the data sets are composed of pairs of relation matrices (Documents × Sentences and Sentences × Words). They describe the connections between documents for each local data set. Our goal is then to compute a fuzzy Documents × Documents matrix for each data set, taking into account all the information expressed in the relations.
To combine multiple occurrences of FTSim, we can adopt three different architectures: a sequential-, a merging- or a splitting-based one.
IV-A Sequential-based parallel architecture
In this first model, an FTSim instance is associated with each local site. Each site is represented by the relation matrices corresponding to the similarities between sentences/documents and words/sentences, and the number of sites is equal to the number of data sources. Figure 2 shows the sequential-based parallel architecture.
As shown in figure 2, each FTSim instance is linked to the following one. The first instance computes the similarity matrices from the data matrices of the first data set, and the resulting document similarity matrix is used to initialize the next site.
The document similarity matrix produced by one data set is used to initialize the document similarity matrix of the next site. The initialization function presented in Algorithm 1 is then run with the second pair of relation matrices, and so on.
Algorithm 1 (initialization): compute the number of documents in the current and next data sets, then allocate the new document similarity matrix over these documents.
The natural question that arises is: how do we initialize the next site's document similarity matrix from the previous one?
At the beginning, the new matrix must contain all documents existing in the two data sets; it is initialized as an identity matrix. After that, it is updated with the similarities computed at the previous site. The different steps of the sequential-based parallel process are presented in Algorithm 2.
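The initialization of one site's document similarity matrix from the previous one can be sketched as follows; identifying documents by name and starting from the identity are illustrative choices consistent with the text:

```python
def init_next_site(prev_sim, prev_docs, next_docs):
    """Initialize the next site's Documents x Documents matrix.

    Start from the identity over the next site's documents, then copy
    the similarities already learned for documents shared with the
    previous site (documents are matched by identifier here, an
    illustrative scheme).
    """
    n = len(next_docs)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    index = {d: k for k, d in enumerate(prev_docs)}
    for i, di in enumerate(next_docs):
        for j, dj in enumerate(next_docs):
            if i != j and di in index and dj in index:
                sim[i][j] = prev_sim[index[di]][index[dj]]
    return sim

prev = [[1.0, 0.7], [0.7, 1.0]]
sim = init_next_site(prev, ["d1", "d2"], ["d1", "d2", "d3"])
# d1/d2 keep their learned similarity; d3 starts from scratch.
```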
Algorithm 2 (sequential process): execute FTSim on the first data set, initialize the next site's document similarity matrix with the result, then execute FTSim on the next data set, and so on until the last site.
Each FTSim instance is connected to the inputs of the following one, which creates a chain. In this way, the instances are run sequentially, in a static or dynamic order, and the similarity matrices are progressively updated.
The problem with this model is that the order matters: how do we choose the order of the matrices? How many iterations do we perform for each local FTSim? Without any prior knowledge about the relative interest of the relation matrices and the number of iterations for each local computation, this model seems difficult to optimize.
IV-B Merging-based parallel architecture
In the second model, we propose to compute the similarity matrices from several sites and merge them before performing the co-clustering algorithm on the result. Figure 3 shows the merging-based parallel architecture.
In this topology, all local instances are run in parallel, then the similarity matrices are simultaneously updated with an aggregation function. This policy offers the benefit that all FTSim instances have the same influence.
The aggregation function takes the document similarity matrices issued from each data source at a given iteration. Two rules are adopted:
Rule 1: if a given document appears in only one site, its corresponding similarity measures are assigned directly to the consensus matrix.
Rule 2: if a particular document appears in several different sites, the minimum of all similarity measures relevant to this document, ignoring zero values, is assigned to the consensus matrix.
The different steps of the aggregation are presented in Algorithm 3.
Algorithm 3 (aggregation): given the collection of document similarity matrices and the number of documents in each, apply Rule 1 to each document appearing in only one data set, and Rule 2 over all sites where a document appears otherwise.
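A sketch of the aggregation function, assuming documents are tracked by identifier across sites (an illustrative scheme): Rule 1 copies the single available value, Rule 2 takes the minimum over sites while ignoring zeros, as stated above.

```python
def aggregate(site_sims, site_docs, all_docs):
    """Merge per-site Documents x Documents matrices into a consensus.

    site_sims[k] is site k's similarity matrix over the documents
    listed in site_docs[k]; all_docs lists every document identifier.
    Rule 1: a value available at a single site is copied directly.
    Rule 2: a value available at several sites gets the minimum of the
    non-zero measures.
    """
    n = len(all_docs)
    consensus = [[1.0 if i == j else 0.0 for j in range(n)]
                 for i in range(n)]
    for i, di in enumerate(all_docs):
        for j, dj in enumerate(all_docs):
            if i == j:
                continue
            values = []
            for sim, docs in zip(site_sims, site_docs):
                if di in docs and dj in docs:
                    v = sim[docs.index(di)][docs.index(dj)]
                    if v > 0:  # Rule 2 ignores zero measures
                        values.append(v)
            if len(values) == 1:   # Rule 1
                consensus[i][j] = values[0]
            elif values:           # Rule 2
                consensus[i][j] = min(values)
    return consensus

site_docs = [["a", "b"], ["a", "b", "c"]]
site_sims = [[[1.0, 0.8], [0.8, 1.0]],
             [[1.0, 0.5, 0.2], [0.5, 1.0, 0.0], [0.2, 0.0, 1.0]]]
c = aggregate(site_sims, site_docs, ["a", "b", "c"])
# "a"/"b" appear at both sites -> min(0.8, 0.5) = 0.5 (Rule 2);
# "a"/"c" appear only at the second site -> 0.2 (Rule 1).
```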
So, at a given iteration, each FTSim instance produces its own similarity matrix. We thus get a set of output similarity matrices whose cardinality equals the number of data sets.
Therefore, we use the aggregation function described above to compute a consensus similarity matrix, merging all of the local matrices with the current consensus matrix.
In turn, this resulting consensus matrix is connected to the inputs of all the instances, to be taken into account at the next iteration, thus creating feedback loops that allow the system to spread the knowledge provided by each instance within the network. The different steps of the merging-based parallel process are presented in Algorithm 4.
Algorithm 4 (merging-based process): execute every FTSim instance in parallel on its own relation matrices, merge the resulting document similarity matrices with the aggregation function, then update each instance with the consensus matrix.
The complexity of this architecture is obviously related to that of the FTSim algorithm. In the merging-based parallel architecture, since each FTSim instance can run on an independent core, the method can easily be parallelized, keeping the global complexity unchanged (considering the number of iterations as a constant factor). The complexity of the merging function can therefore be ignored.
IV-C Splitting-based parallel architecture
In this section we present a general model that can use the previous architectures to efficiently compute FTSim on large data sets by splitting a data matrix into smaller ones. Figure 4 shows the splitting-based parallel architecture.
In order to reduce the complexity of treating huge data sets, it is possible to split a given data matrix into a collection of smaller ones, each submatrix becoming a component of our network and being processed as a separate view.
We have to evaluate splitting approaches with the aim of finding the one most suitable for our solution. Here, our goal is to cluster the documents and to explore the behavior of the proposed architecture when varying the number of splits. We adopt a random sentence-split method: for each matrix, the sentences are divided into subsets, thereby forming submatrices. The number of FTSim instances in the proposed network is thus equal to the number of splits.
For example, let us consider a problem with a single [documents/sentences] matrix in which we just want to cluster the documents. We can divide the problem into a collection of smaller matrices over subsets of the sentences. Thus, by using a distributed version of FTSim on several cores, we gain in both time and space complexity.
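The random sentence split can be sketched as follows; the shuffling and round-robin column assignment are illustrative choices for producing equal-sized random subsets:

```python
import random

def split_by_sentences(ds, n_splits, seed=0):
    """Split a Documents x Sentences matrix into n_splits submatrices,
    each over a random subset of the sentence columns.  Each submatrix
    then feeds one FTSim instance of the network.
    """
    n_sents = len(ds[0])
    cols = list(range(n_sents))
    random.Random(seed).shuffle(cols)
    # Round-robin assignment of shuffled columns to the splits.
    groups = [cols[k::n_splits] for k in range(n_splits)]
    return [[[row[c] for c in group] for row in ds] for group in groups]

ds = [[1, 0, 1, 0], [0, 1, 0, 1]]
parts = split_by_sentences(ds, 2)
# Two submatrices, each with 2 of the 4 sentence columns;
# together they cover every sentence exactly once.
```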
By splitting a matrix, we lose some information: the solution no longer computes the co-similarities between all pairs of sentences, but only between the words occurring in each submatrix. Thanks to the feedback loops of this architecture and to the presence of the common document similarity matrix, we are able to spread the information through the network and alleviate the problem of inter-matrix comparisons.
Thus, by using a parallel version of FTSim on several cores, we gain in both time and space complexity: the time complexity decreases, leading to an overall gain [16]. In the same way, the memory needed to store the similarity matrices between words decreases.
V. Conclusion
In this paper, a fuzzy triadic similarity model for the co-clustering task, called FTSim, has been proposed. It iteratively takes into account three abstraction levels: Documents, Sentences, Words. Sentences, consisting of one or more words, are used to determine the fuzzy similarity of two documents. We are able to cluster together documents that have similar concepts based on their shared (or similar) sentences and, in the same way, to cluster together sentences based on words. This also allows us to use any classical clustering algorithm, such as Fuzzy C-Means (FCM) [17] or other fuzzy partition-based clustering approaches [18].
Our proposition has been extended to suit multi-view models. Since the domain of text clustering focuses on documents and their similarities, our proposition spreads information about document similarities. We have presented three parallel architectures that combine FTSim instances to compute similarities from different sources.
In future work, we intend to further analyze the theoretical properties and the behavior of the three architectures in a multithreaded implementation.
References
 [1] F. Hussain, XSim: A New Co-similarity Measure: Application to Text Mining and Bioinformatics, PhD Thesis, 2010.
 [2] H. Chim and X. Deng, Efficient Phrase-Based Document Similarity for Clustering, IEEE Transactions on Knowledge and Data Engineering, vol. 20, pp. 1217-1229, 2008.
 [3] G. Salton and C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management, vol. 24, pp. 513-523, 1988.
 [4] S. Kundu, Min-transitivity of fuzzy leftness relationship and its application to decision making, Fuzzy Sets and Systems, vol. 93, pp. 357-367, 1997.
 [5] L.A. Zadeh, Fuzzy sets, Information and Control, vol. 8, pp. 338-353, 1965.
 [6] W. Tang, Z. Lu and I.S. Dhillon, Clustering with multiple graphs, Proceedings of the IEEE International Conference on Data Mining, pp. 1016-1021, 2009.
 [7] F. de Carvalho, Y. Lechevallier and F.M. de Melo, Partitioning hard clustering algorithms based on multiple dissimilarity matrices, Pattern Recognition, vol. 45, pp. 447-464, 2012.
 [8] G. Bisson and C. Grimal, Co-clustering of Multi-View Datasets: a Parallelizable Approach, IEEE International Conference on Data Mining, pp. 828-833, 2012.
 [9] I. Drost, S. Bickel and T. Scheffer, Discovering communities in linked data by multi-view clustering, Proceedings of the Annual Conference of the German Classification Society, Studies in Classification, Data Analysis, and Knowledge Organization, pp. 342-349, 2005.
 [10] A. Kumar and H. Daume, A co-training approach for multi-view spectral clustering, Proceedings of the International Conference on Machine Learning, pp. 393-400, 2011.
 [11] H. Chim and X. Deng, Efficient Phrase-Based Document Similarity for Clustering, IEEE Transactions on Knowledge and Data Engineering, vol. 20, pp. 1217-1229, 2008.
 [12] J.M. Torres-Moreno, P. Velázquez-Morales and J.G. Meunier, Cortex: Un algorithme pour la condensation automatique de textes, Colloque Interdisciplinaire en Sciences Cognitives, 2001.
 [13] S. Martin, J. Liermann and H. Ney, Algorithms for bigram and trigram word clustering, Speech Communication, vol. 24, pp. 19-37, 1998.
 [14] B. Long, X. Wu, Z.M. Zhang and S.Y. Philip, Unsupervised learning on k-partite graphs, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 317-326, 2006.
 [15] S. Alouane, M. Sassi Hidri and K. Barkaoui, Fuzzy Triadic Similarity for Text Categorization: Towards Parallel Computing, International Conference on Web and Information Technologies, pp. 265-274, 2013.
 [16] G. Bisson and C. Grimal, An Architecture to Efficiently Learn Co-Similarities from Multi-View Datasets, International Conference on Neural Information Processing, pp. 184-193, 2012.
 [17] J.C. Bezdek, FCM: The Fuzzy C-Means clustering algorithm, Computers & Geosciences, vol. 10(2-3), pp. 191-203, 1984.
 [18] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, 1967.