Parallel architectures for fuzzy triadic similarity learning

12/21/2013 ∙ by Sonia Alouane-Ksouri, et al. ∙ CNAM

In the context of document co-clustering, we define a new similarity measure which iteratively computes similarity while combining fuzzy sets in a three-partite graph. The fuzzy triadic similarity (FT-Sim) model can deal with the uncertainty captured by fuzzy sets. Moreover, with the development of the Web and the high availability of storage space, more and more documents become accessible. Documents can come from multiple sites, which makes similarity computation expensive. This problem motivated us to use parallel computing. In this paper, we introduce parallel architectures which are able to process large and multi-source data sets through a sequential, a merging or a splitting-based process. Then, we proceed to a local and a central (or global) computation using the basic FT-Sim measure. The idea behind these architectures is to reduce both time and space complexities thanks to parallel computation.

I Introduction

Nowadays, the amount of information on the internet is growing exponentially, and approximately 80% of it is stored in the form of text. Text mining has therefore become a very active topic. One particular research area is document clustering, which is a major topic in the Information Retrieval community. It allows high-order similarities between objects described by the rows and columns of a data matrix to be captured efficiently. In the domain of text clustering, a document is described as a set of words.

The relationship between documents and words allows for exploitation of the relationship between groups of words that occur mostly in a group of documents.

In [1], a co-similarity measure called X-Sim has been proposed. It builds on the idea of iteratively generating the similarity matrices between documents and words, each of them being built on the basis of the other. This measure works well for unsupervised document clustering.

However, in recent research, the sentence has been considered as a more informative feature for improving the effectiveness of document clustering [2]. By considering three levels, Documents × Sentences × Words, to represent the data set, we are able to deal with the dependency between Documents and Sentences, between Sentences and Words and, by deduction, between Documents and Words.

Another important aspect in co-clustering is the computation of weights. A weight may be assigned to the link from a document to a word (or sentence), indicating the presence of the word (sentence) in that document. The 0/1 encoding denotes the presence or absence of an object in a given document.

Different weighting schemes, such as tf-idf [3], may be incorporated to better represent the importance of words in the corpus. However, this has spawned the view that classical probability theory is unable to deal with uncertainties in natural language and machine learning.

So, we proceed to a fuzzification control process which converts crisp similarities to fuzzy ones. The conversion to fuzzy values is represented by the membership functions [4]. They allow a graphical representation of a fuzzy set [5]. These fuzzy similarity matrices are used to calculate fuzzy similarity between documents, sentences and words in a triadic computing called FT-Sim (Fuzzy Triadic Similarity).

Moreover, with the development of the Web and the high availability of storage space, more and more documents become accessible. Data can come from multiple sites and can be seen as a collection of matrices. Processing these matrices separately leads to a significant loss of information.

Several extensions to the co-clustering methods have been proposed to deal with such multi-view data. Some works aim at combining multiple similarity matrices to perform a given learning task [6, 7], the idea being to build clusters from multiple similarity matrices computed along different views.

Multi-view co-clustering architectures such as MVSim [8], based on the X-Sim measure [1], deal with the problem of learning co-similarities from a collection of matrices describing interrelated types of objects. It was shown that this architecture provides interesting properties in terms of both convergence and scalability, and that it allows an efficient parallelization of the process.

To this end, we provide parallel architectures for FT-Sim to tackle the problem of learning similarities from a collection of matrices. For multi-source or large matrices, we propose different parallel architectures in which each FT-Sim instance is the basic component (or node) used to deal with multiple matrices.

Thus, we consider a model in which data sets are distributed into sites (or relation matrices). They describe the connections between documents for each local data set.

Our goal is then to compute a fuzzy Documents × Documents matrix for each site, trying to take into account all the representative information expressed in the relations.

To combine multiple occurrences of FT-Sim, we propose sequential, merging and splitting-based parallel architectures.

The rest of the paper is organized as follows: in Section 2 we review the background related to similarity measures in multi-view data sets. In Section 3 we present our fuzzy triadic similarity measure. In Section 4 we present the three proposed architectures allowing parallel computing for co-clustering. Section 5 concludes the paper and gives indications of some future work.

II Dealing with Multi-view Data sets

Most of the existing clustering methods focus on data sets described by a unique data matrix, which can either be a matrix which describes objects by their characteristics, or a relation matrix that describes the intensity of the relation between instances of two types of objects, such as a Documents × Words matrix. In the latter case, both types of objects can be clustered; methods dealing with this task are referred to as co-clustering approaches and have been extensively studied.

However, in many applications, data sets involving more than two types of interacting (or simply related) objects are also frequent. A simple way to represent such data sets is to use as many matrices as there are relations between the objects. One could then use classical co-clustering methods to separately cluster the objects occurring in the different matrices but, in this way, interactions between objects are not taken into account, which leads to a loss of information. Therefore, handling the views together, referred to as the multi-view clustering task, is an interesting challenge in the learning domain to overcome the limits of classical clustering.

Many extensions to the clustering methods have been proposed to deal with multi-view data. In [9], the authors describe an extension of k-means (MVKM) and of EM algorithms using a multi-view model. In [6] and [7], the authors build clusters from multiple similarity matrices computed along different views. In [10], a co-clustering system called MVSC has been proposed. It performs multi-view spectral clustering using co-training, which has been widely used in semi-supervised learning problems. The general idea is to learn the clustering in one view and use it to label the data in another view so as to modify the graph structure (similarity matrix).

Closer to our approach, some works aim at combining multiple similarity matrices to perform a given learning task. The MVSim architecture [8], which is an extension of the X-Sim algorithm [1], adapts the latter to the multi-view context. It simultaneously computes the co-similarity matrix for each of the different kinds of objects described by the relation matrices. The basic idea is to create a learning network isomorphic to the structure of these data sets. It was shown that it is possible to use this architecture to efficiently compute co-similarities on large data sets by splitting a data matrix into smaller ones.

III FT-Sim: Fuzzy Triadic Similarity

Sentence-based analysis means that the similarity between documents should be based on matching sentences rather than on matching single words only. Sentences contain more information than single words (information regarding proximity and order of words) and have a higher descriptive power [11, 12, 13]. Thus a document must be broken into a set of sentences, and a sentence into a set of words. We focus on how to combine the advantages of the two representation models in document co-clustering.
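As an illustration of this decomposition, the following minimal Python sketch builds the two relation matrices of a toy corpus with a 0/1 presence encoding. The function name `build_relation_matrices`, the variable names `sd` and `ws`, and the naive punctuation-based sentence splitter are assumptions made for this example; they are not taken from the paper.

```python
import re

def build_relation_matrices(documents):
    """Return (sentences, words, sd, ws) using a 0/1 presence encoding.

    sd[s][d] = 1 if sentence s occurs in document d (Sentences x Documents).
    ws[w][s] = 1 if word w occurs in sentence s (Words x Sentences).
    """
    sentences, words = [], []
    doc_sentences = []  # for each document, the indices of its sentences
    for text in documents:
        indices = []
        # Naive split on '.', '!' and '?': an assumption, not the paper's tokenizer.
        for raw in re.split(r"[.!?]+", text):
            sent = " ".join(raw.lower().split())
            if not sent:
                continue
            if sent not in sentences:
                sentences.append(sent)
            indices.append(sentences.index(sent))
        doc_sentences.append(indices)
    for sent in sentences:
        for w in sent.split():
            if w not in words:
                words.append(w)
    sd = [[1 if s in doc_sentences[d] else 0 for d in range(len(documents))]
          for s in range(len(sentences))]
    ws = [[1 if words[w] in sentences[s].split() else 0
           for s in range(len(sentences))]
          for w in range(len(words))]
    return sentences, words, sd, ws

docs = ["Fuzzy sets model uncertainty. Graphs link objects.",
        "Graphs link objects. Words form sentences."]
sentences, words, sd, ws = build_relation_matrices(docs)
```

Here the sentence "graphs link objects" is shared by both documents, so its row in `sd` contains two ones, while the other two sentences each belong to a single document.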

To represent our textual data set, two representations have been proposed: the collection of matrices and the k-partite graph [14]. In the first, each matrix describes a view on the data. In the second, a graph is said to be k-partite when the nodes are partitioned into subsets with the condition that no two nodes of the same subset are adjacent. Thus in the k-partite graph paradigm [14], a given subset of nodes contains the instances of one type of objects, and a link between two nodes of different subsets represents the relation between these two nodes.

To explain our model, we represent the data sets as matrices and use a three-partite graph representation of these matrices, in which relations link documents, sentences and words.

From a functional point of view, the proposed FT-Sim model can be represented as shown in figure 1, where the two data matrices represent a corpus and describe the Documents/Sentences and Sentences/Words connections given by the three-partite graph [15].

Fig. 1: Functional diagram of FT-Sim.

After the generation of the two data matrices, we proceed to a fuzzification process which converts crisp values to fuzzy ones. The conversion to fuzzy values is represented by the membership functions [4]. They allow a graphical representation of a fuzzy set [5]. There are various methods to assign membership values or membership functions to fuzzy variables. We mention essentially the triangular and trapezoidal ones. The second form is the most suitable one for modeling fuzzy Sentences × Documents and Words × Sentences similarities.

For each document, we define a fuzzy membership function through a linear transformation between the lower bound value, which is assigned a membership of 0, and the upper bound value, which is assigned a membership of 1. With a positive slope, membership increases linearly from smaller to larger values; with a negative slope, it decreases.

The following formulas show the fuzzy linear membership functions for the two data matrices.

(1)

and

(2)
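A minimal Python sketch of such a linear membership function is given below. It only follows the verbal description above (membership 0 at the lower bound, 1 at the upper bound, linear in between); the names `linear_membership`, `fuzzify`, `low` and `high` are assumptions made for this illustration and do not reproduce the notation of formulas (1) and (2).

```python
def linear_membership(x, low, high):
    """Fuzzy linear membership: 0 at `low`, 1 at `high`, linear in between.

    With low < high the slope is positive (larger values get larger
    membership); with low > high the slope is negative.
    """
    if low == high:
        raise ValueError("low and high must differ")
    t = (x - low) / (high - low)
    # Clamp to [0, 1] so that values outside the bounds saturate.
    return max(0.0, min(1.0, t))

def fuzzify(matrix, low, high):
    """Apply the linear membership function to every crisp entry."""
    return [[linear_membership(v, low, high) for v in row] for row in matrix]

crisp = [[0, 2], [4, 1]]
fuzzy = fuzzify(crisp, low=0, high=4)
```

Swapping the bounds, as in `linear_membership(3, 4, 0)`, gives the negative-slope variant in which smaller crisp values receive larger membership degrees.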

Before proceeding to fuzzy triadic computing, we must initialize the Documents × Documents, Sentences × Sentences and Words × Words fuzzy matrices as identity matrices. The similarity between a document (resp. a sentence, a word) and itself is equal to 1. All other values are initialized with zero. The initial Documents × Documents matrix is as follows:

(3)

where each entry is the membership degree of one document with respect to another. Similarly, we determine the initial Sentences × Sentences and Words × Words matrices.

After this initialization, we calculate the new matrix which represents the fuzzy similarities between documents, using the Sentences × Documents data matrix and the Sentences × Sentences similarity matrix.

Usually, the similarity measure between two documents is defined as a function that is the sum of the similarities between their shared sentences.

Our idea is to generalize this function in order to take into account the intersection between all the possible pairs of sentences occurring in the two documents. In this way, we capture not only the fuzzy similarity of their common sentences but also the fuzzy similarities coming from sentences that are not directly common to the two documents but are shared with some other documents. For each pair of sentences not directly shared by the documents, we need to take into account the fuzzy similarity between them as provided by the Sentences × Sentences similarity matrix.

Since we work with fuzzy matrices formed by membership degrees, the computation must rely on the operators for fuzzy sets, especially intersection and union. Thus, the fuzzy similarity between two distinct documents can be formulated as follows:

(4)
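To make the idea concrete, here is a minimal Python sketch of one possible document similarity update. It is not a transcription of equation (4): it assumes the standard fuzzy operators of [5] (minimum for intersection, maximum for union) combined as a max-min composition over all pairs of sentences. The names `identity`, `update_document_similarity`, `sd` (Sentences × Documents memberships) and `ss` (Sentences × Sentences similarities, assumed symmetric) are hypothetical.

```python
def identity(n):
    """Identity matrix: each object is fully similar to itself only."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def update_document_similarity(sd, ss):
    """Assumed max-min update of the Documents x Documents matrix.

    sd[k][i]: fuzzy membership of sentence k in document i.
    ss[k][l]: fuzzy similarity between sentences k and l (symmetric).
    """
    n_sent, n_docs = len(sd), len(sd[0])
    dd = identity(n_docs)  # the diagonal stays at 1
    for i in range(n_docs):
        for j in range(i + 1, n_docs):
            best = 0.0
            for k in range(n_sent):
                for l in range(n_sent):
                    # Intersection (min) along the path i -> k -> l -> j,
                    # union (max) over all pairs of sentences.
                    best = max(best, min(sd[k][i], ss[k][l], sd[l][j]))
            dd[i][j] = dd[j][i] = best
    return dd

sd = [[0.9, 0.0],   # sentence 0 occurs only in document 0
      [0.4, 0.7],   # sentence 1 is shared by both documents
      [0.0, 0.8]]   # sentence 2 occurs only in document 1
dd_first = update_document_similarity(sd, identity(3))

ss = identity(3)
ss[0][2] = ss[2][0] = 0.6  # sentences 0 and 2 are similar but not shared
dd_second = update_document_similarity(sd, ss)
```

With the identity sentence matrix only the shared sentence contributes, so the two documents reach a similarity of 0.4. Once sentences 0 and 2 are known to be similar, the pair that is not directly shared raises the similarity to 0.6, which is the effect described in the text.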

As for the document similarity computation, we generalize fuzzy similarities in order to take into account the intersection between all the possible pairs of words occurring in two sentences. In this way, we capture not only the fuzzy similarity of their common words but also the fuzzy similarities coming from words that are not directly common to the two sentences but are shared with some other sentences.

For each pair of words not directly shared by the sentences, we need to take into account the fuzzy similarity between them as provided by the Words × Words similarity matrix. The overall fuzzy similarity between two sentences is defined in the following equation:

(5)

Similarly, for each pair of words not directly shared by the sentences, we need to take into account the fuzzy similarity between them as provided by the corresponding similarity matrix. The resulting overall fuzzy similarity is defined in the following equation:

(6)

IV Parallel FT-Sim

For multi-source or large data sets, we propose different parallel architectures in which each FT-Sim instance is the basic component (or site) used to deal with multiple matrices.

Thus, we consider a model in which the data sets are composed of Sentences × Documents and Words × Sentences relation matrices. They describe the connections between documents for each local data set. Our goal is then to compute a fuzzy Documents × Documents matrix for each data set, trying to take into account all the information expressed in the relations.

To combine multiple occurrences of FT-Sim, we can adopt three different architectures: a sequential, a merging or a splitting-based one.

IV-A Sequential-based parallel architecture

In this first model, an instance of FT-Sim is associated with each local site. Each site is represented by the relation matrices corresponding to the sentences/documents and words/sentences similarities of one of the data sources, and each instance is indexed by its site. Figure 2 shows the sequential-based parallel architecture.

Fig. 2: Sequential-based parallel architecture.

As shown in figure 2, we assume a link between each FT-Sim instance and the following one. The first instance computes the similarity matrices from the data matrices of the first data set, and the resulting document similarity matrix is used to initialize the next site.

The document similarity matrix resulting from one data set is used to initialize the next document similarity matrix (for instance, the second document similarity matrix at a given iteration). The initialization function presented in algorithm 1 is then run with the data matrices of the second data set, and so on.

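Algorithm 1: Initialization function
1. Compute the number of documents occurring in the current data set and in the next one.
2. Let the new document similarity matrix be the identity matrix over all of these documents.
3. Update it with the similarities already computed for the current data set.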

The natural question that arises is: how should the next document similarity matrix be initialized from the previous one?

In the beginning, the new matrix must contain all documents existing in the current and the next data sets. It is initialized as an identity matrix.

After that, the obtained matrix is updated with the similarities of the previous one. The different steps of the sequential-based parallel process are presented in algorithm 2.

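Algorithm 2: Sequential-based algorithm
1. Execute the first FT-Sim instance with the data matrices of the first data set.
2. Initialize the document similarity matrix of the next instance with the resulting one (algorithm 1).
3. Execute the next FT-Sim instance with its own data matrices, and repeat for the remaining data sets.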
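The chaining described above can be sketched in Python as follows. This is only an illustration under stated assumptions: similarity matrices are stored as nested dictionaries indexed by document identifiers, `init_similarity` and `run_sequential` are hypothetical names, and `toy_step` is a deliberately simple stand-in for a local FT-Sim computation (it only records the largest pairwise similarity observed at a site).

```python
def init_similarity(prev_sim, next_docs):
    """Identity over the union of documents, overlaid with previous values."""
    docs = sorted(set(prev_sim) | set(next_docs))
    sim = {a: {b: 1.0 if a == b else 0.0 for b in docs} for a in docs}
    for a, row in prev_sim.items():
        for b, value in row.items():
            sim[a][b] = value  # keep what earlier sites have learned
    return sim

def run_sequential(sites, local_step):
    """Run the instances as a chain; `sites` is a list of (docs, data)."""
    sim = {}
    for docs, data in sites:
        sim = init_similarity(sim, docs)
        sim = local_step(data, sim)
    return sim

def toy_step(data, sim):
    """Hypothetical stand-in for one local FT-Sim computation."""
    new = {a: dict(row) for a, row in sim.items()}
    for (a, b), value in data.items():
        new[a][b] = new[b][a] = max(sim[a][b], value)
    return new

sites = [(["d1", "d2"], {("d1", "d2"): 0.5}),
         (["d2", "d3"], {("d2", "d3"): 0.7})]
result = run_sequential(sites, toy_step)
```

In this toy run the similarity between `d1` and `d2` learned at the first site survives the initialization of the second one, while `d3` enters the matrix with identity values before the second site updates it.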

Each FT-Sim instance is connected to the inputs of the following one, which creates a chain. In that way, the instances are run sequentially, in a static or dynamic order, and the similarity matrices are progressively updated.

The problem with this model is that the order matters. How do we choose the order of the matrices? How many iterations do we perform for each local instance?

Thus, without any prior knowledge about the relative interest of the relation matrices and the number of iterations for each local computation, this model seems difficult to optimize.

IV-B Merging-based parallel architecture

In the second model, we propose to compute the similarity matrices from several sites and merge them before performing the co-clustering algorithm on the result. Figure 3 shows the merging-based parallel architecture.

Fig. 3: Merging-based parallel architecture.

In this topology, all local instances are run in parallel, then the similarity matrices are simultaneously updated with an aggregation function. This policy offers the benefit that all the FT-Sim instances have the same influence.

The aggregation function takes the document similarity matrices produced by each data source at a given iteration. Two rules are adopted:

Rule 1: If a given document appears in only one site, then we assign its corresponding similarity measures directly to the consensus matrix.

Rule 2: If a particular document appears in several different sites, we assign to the consensus matrix the minimum of all similarity measures relevant to this document, ignoring values equal to 0.

The different steps of aggregation computing are presented in algorithm 3.

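Algorithm 3: Merging function
Input: the collection of document similarity matrices produced by the local instances.
1. Compute the number of documents occurring in these matrices.
2. Let the consensus matrix be the identity matrix over all of these documents.
3. For each document: if it appears in only one data set, copy its similarities; otherwise, take the minimum over all sites where it appears with a non-zero value.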
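The two rules can be sketched in Python as follows. The nested-dictionary representation and the name `merge_similarities` are assumptions made for this example; the sketch follows the rules as stated above rather than the exact listing of algorithm 3.

```python
def merge_similarities(local_sims):
    """Merge local document similarity matrices into a consensus matrix.

    Each local matrix is a nested dict sim[a][b] over the documents of
    one site. Rule 1: a pair seen at a single site keeps that value.
    Rule 2: a pair seen at several sites takes the minimum of its
    non-zero values.
    """
    docs = sorted({d for sim in local_sims for d in sim})
    merged = {a: {b: 1.0 if a == b else 0.0 for b in docs} for a in docs}
    for a in docs:
        for b in docs:
            if a == b:
                continue
            values = [sim[a][b] for sim in local_sims
                      if a in sim and b in sim[a] and sim[a][b] != 0]
            if values:
                # With a single value, min() simply returns it (Rule 1).
                merged[a][b] = min(values)
    return merged

site_a = {"d1": {"d1": 1.0, "d2": 0.6},
          "d2": {"d1": 0.6, "d2": 1.0}}
site_b = {"d1": {"d1": 1.0, "d2": 0.4, "d3": 0.0},
          "d2": {"d1": 0.4, "d2": 1.0, "d3": 0.8},
          "d3": {"d1": 0.0, "d2": 0.8, "d3": 1.0}}
consensus = merge_similarities([site_a, site_b])
```

In this example the pair (`d1`, `d2`) is known to both sites and receives the minimum, 0.4, while the pair (`d2`, `d3`) is only known to the second site and keeps its value of 0.8.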

So, for a given iteration, each instance produces its own similarity matrix. We thus get a set of output similarity matrices, the cardinality of which is equal to the number of data sets.

Therefore, we use the aggregation function developed in the merging function to compute a consensus similarity matrix merging all of the local matrices with the current consensus matrix.

In turn, this resulting consensus matrix is connected to the inputs of all the instances, to be taken into account in the next iteration, thus creating feedback loops allowing the system to spread the knowledge provided by each instance within the network. The different steps of the merging-based parallel process are presented in algorithm 4.

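Algorithm 4: Parallel merging-based algorithm
Input: the collection of data matrices of all sites.
1. Execute every FT-Sim instance with its own data matrices and the current consensus matrix.
2. Merge the document similarity matrices of all instances.
3. Update each instance with the resulting consensus matrix.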
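The feedback loop of this architecture can be sketched in Python with the standard `concurrent.futures` module. Everything below is an illustration under assumptions: `toy_local_step` is a hypothetical stand-in for one local FT-Sim iteration, `merge` is a compact version of the two merging rules, and threads are used only for brevity (for CPU-bound work in CPython, `ProcessPoolExecutor` would be the natural replacement).

```python
from concurrent.futures import ThreadPoolExecutor

def identity(docs):
    return {a: {b: 1.0 if a == b else 0.0 for b in docs} for a in docs}

def merge(local_sims, docs):
    """Minimum of the non-zero values proposed by the local instances."""
    merged = identity(docs)
    for a in docs:
        for b in docs:
            values = [s[a][b] for s in local_sims if a != b and s[a][b] != 0]
            if values:
                merged[a][b] = min(values)
    return merged

def toy_local_step(data, consensus):
    """Hypothetical stand-in for one local FT-Sim iteration."""
    new = {a: dict(row) for a, row in consensus.items()}
    for (a, b), value in data.items():
        new[a][b] = new[b][a] = max(consensus[a][b], value)
    return new

def run_merging(site_data, docs, iterations=2, local_step=toy_local_step):
    """Run all instances in parallel, merge, and feed the consensus back."""
    consensus = identity(docs)
    with ThreadPoolExecutor(max_workers=len(site_data)) as pool:
        for _ in range(iterations):
            # Every instance starts from the same consensus matrix.
            local_sims = list(pool.map(
                lambda data: local_step(data, consensus), site_data))
            consensus = merge(local_sims, docs)
    return consensus

docs = ["d1", "d2", "d3"]
site_data = [{("d1", "d2"): 0.6},
             {("d1", "d2"): 0.4, ("d2", "d3"): 0.8}]
final = run_merging(site_data, docs)
```

Because every instance receives the same consensus matrix at each iteration, the value of the pair (`d2`, `d3`), which only the second site observes, becomes visible to the first site after one merge.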

The complexity of this architecture is obviously related to that of the FT-Sim algorithm. In the parallel merging-based architecture, as each FT-Sim instance can run on an independent core, the method can easily be parallelized, thus keeping the global complexity unchanged (considering the number of iterations as a constant factor). So, the complexity of the merging function can be ignored.

IV-C Splitting-based parallel architecture

In this section we present a generated model that can use previous architectures to efficiently compute FT-Sim on large data sets by splitting a data matrix into smaller ones. Figure 4 shows the splitting-based parallel architecture.

Fig. 4: Splitting-based parallel architecture.

In order to reduce the complexity of processing huge data sets, it is possible to split a given data matrix into a collection of smaller ones, each sub-matrix becoming a component of our network and being processed as a separate view.

We have to evaluate the splitting approaches with the aim of finding the one most suitable for our solution. Here, our goal is to cluster the documents and to explore the behavior of the proposed architecture when varying the number of splits, and hence of sub-matrices. We therefore adopt a random sentence-splitting method: for each matrix, the sentences are divided into subsets, thereby forming sub-matrices. The number of FT-Sim instances in the proposed network is thus equal to the number of splits.
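A random sentence split of this kind can be sketched in Python as follows. The function names `split_sentences` and `split_matrix`, the `seed` parameter and the row-per-sentence layout of the matrix are assumptions made for this example.

```python
import random

def split_sentences(n_sentences, n_splits, seed=0):
    """Randomly partition sentence indices into `n_splits` subsets."""
    indices = list(range(n_sentences))
    random.Random(seed).shuffle(indices)
    # Dealing the shuffled indices round-robin keeps the subsets balanced.
    return [sorted(indices[i::n_splits]) for i in range(n_splits)]

def split_matrix(sd, n_splits, seed=0):
    """Split a matrix with one row per sentence into row sub-matrices.

    Every sub-matrix keeps all document columns, so each FT-Sim instance
    still sees every document but only a subset of the sentences.
    """
    parts = split_sentences(len(sd), n_splits, seed)
    return [[sd[s] for s in part] for part in parts]

sd = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]
sub_matrices = split_matrix(sd, n_splits=2)
```

Each sub-matrix would then be handed to its own instance, so the number of instances equals `n_splits`.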

For example, let us consider a problem with one [documents/sentences] matrix in which we just want to cluster the documents. We can divide the problem into a collection of smaller matrices, each holding a subset of the sentences. Thus, by using a distributed version of FT-Sim with one core per sub-matrix, we gain in both time and space complexity.

By splitting a matrix, we lose some information. The solution does not compute the co-similarities between all pairs of sentences but only between the words occurring in each sub-matrix. Thanks to the feedback loops of this architecture and to the presence of the common document similarity matrix, we are able to spread the information through the network and alleviate the problem of inter-matrix comparisons.

Thus, by using a parallel version of FT-Sim on several cores, we gain in both time and space complexity: the time complexity decreases [16] and, in the same way, the memory needed to store the similarity matrices between words decreases.

V Conclusion

In this paper, a fuzzy triadic similarity model, called FT-Sim, has been proposed for the co-clustering task. It iteratively takes into account three levels of abstraction: Documents × Sentences × Words. The sentences, consisting of one or more words, are used to determine the fuzzy similarity of two documents. We are able to cluster together documents that have similar concepts based on their shared (or similar) sentences and, in the same way, to cluster together sentences based on words. This also allows us to use any classical clustering algorithm such as Fuzzy-C-Means (FCM) [17] or other fuzzy partition-based clustering approaches [18].

Our proposition has been extended to suit multi-view models. Because the domain of text clustering focuses on documents and their similarities, in our proposition we spread information about document similarities. We have presented three parallel architectures that combine FT-Sim instances to compute similarities from different sources.

Currently, we need to further analyze the three architectures from a theoretical point of view, as well as their behavior in a multi-threaded implementation.

References

  • [1] F. Hussain, X-Sim: A New Cosimilarity Measure: Application to Text Mining and Bioinformatics, Phd Thesis, 2010.
  • [2] H. Chim and X. Deng, Efficient Phrase-Based Document Similarity for Clustering, Knowledge and Data Engineering, IEEE Transactions, vol. 20, pp. 1217-1229, 2008.
  • [3] G. Salton and C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management, vol. 24, pp. 513-523, 1988.
  • [4] S. Kundu, Min-transitivity of fuzzy leftness relationship and its application to decision making, Fuzzy Sets and Systems, vol. 93, pp. 357-367, 1997.
  • [5] L.A. Zadeh, Fuzzy sets, Information and Control 8, pp. 338-353, 1965.
  • [6] W. Tang and Z. Lu and I.S. Dhillon, Clustering with multiple graphs, proceedings of the IEEE International Conference on Data Mining, pp. 1016-1021, 2009.
  • [7] F. de Carvalho and Y. Lechevallier and F.M. de Melo, Partitioning hard clustering algorithms based on multiple dissimilarity matrices, Pattern Recognition 45, pp. 447-464, 2012.

  • [8] G. Bisson and C. Grimal, Co-clustering of Multi-View Datasets: a Parallelizable Approach, IEEE International Conference on Data Mining, pp. 828-833, 2012.
  • [9] I. Drost and S. Bickel and T. Scheffer, Discovering communities in linked data by multi-view clustering, proceedings of the Annual Conference of the German Classification Society, Studies in Classification, Data Analysis, and Knowledge Organization, pp. 342-349, 2005.
  • [10] A. Kumar and H. Daume, A co-training approach for multi-view spectral clustering, proceedings of the International Conference on Machine Learning, pp. 393-400, 2011.
  • [11] H. Chim and X. Deng, Efficient Phrase-Based Document Similarity for Clustering, Knowledge and Data Engineering, IEEE Transactions, vol. 20, pp. 1217-1229, 2008.
  • [12] J.M. Torres-Moreno and P. Velázquez-Morales and J.-G. Meunier, Cortex: Un algorithme pour la condensation automatique de textes, Colloque Interdisciplinaire en Sciences Cognitives, 2001.
  • [13] M. Sven and L. Jorg and N. Hermann, Algorithms for bigram and trigram word clustering, Speech Communication, vol. 24, pp. 19-37, 1998.
  • [14] B. Long and X. Wu and Z.M. Zhang and S.Y. Philip, Unsupervised learning on k-partite graphs, proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 317-326, 2006.
  • [15] S. Alouane, M. Sassi Hidri and K. Barkaoui, Fuzzy Triadic Similarity for Text Categorization: Towards Parallel Computing, International Conference on Web and Information Technologies, pp. 265-274, 2013.
  • [16] G. Bisson and C. Grimal, An Architecture to Efficiently Learn Co-Similarities from Multi-View Datasets, International Conference on Neural Information Processing, pp. 184-193, 2012.
  • [17] J.C. Bezdek, FCM: The Fuzzy C-Means clustering algorithm, Computers & Geosciences, vol. 10(2-3), pp. 191-203, 1984.
  • [18] J.B. MacQueen, Some methods for classification and analysis of multivariate observation, proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, 1967.