C-DLSI: An Extended LSI Tailored for Federated Text Retrieval

10/05/2018 ∙ by Qijun Zhu, et al. ∙ The Hong Kong University of Science and Technology

As the web expands in data volume and in geographical distribution, centralized search methods become inefficient, leading to increasing interest in cooperative information retrieval, e.g., federated text retrieval (FTR). Different from existing centralized information retrieval (IR) methods, in which search is done on a logically centralized document collection, FTR is composed of a number of peers, each of which is a complete search engine by itself. To process a query, FTR requires first the identification of promising peers that host the relevant documents and then the retrieval of the most relevant documents from the selected peers. Most of the existing methods only apply traditional IR techniques that treat each text collection as a single large document and utilize term matching to rank the collections. In this paper, we formalize the problem and identify the properties of FTR, and analyze the feasibility of extending LSI with clustering to adapt to FTR, based on which a novel approach called Cluster-based Distributed Latent Semantic Indexing (C-DLSI) is proposed. C-DLSI distinguishes the topics of a peer with clustering, captures the local LSI spaces within the clusters, and considers the relations among these LSI spaces, thus providing a more precise characterization of the peer. Accordingly, novel descriptors of the peers and a compatible local text retrieval scheme are proposed. The experimental results show that C-DLSI outperforms existing methods.


I Introduction

Due to the highly dynamic nature of the World Wide Web, traditional search engines (SEs) face great challenges in scalability and adaptability. With the limited resources available to a search engine, it is hard to keep up with the fast expansion of the Web and the frequent updates of its contents. Consequently, the overall coverage of search engines with respect to the size of the entire Web decreases with time. We need a scalable and highly efficient search and index mechanism to make the data on the Web accessible in a timely manner.

To overcome these difficulties, in the past decade, various information retrieval (IR) methods based on parallel and distributed computing have been proposed. Among these methods, parallel information retrieval [34], which maintains a single index and employs a server cluster to balance the load, has been well studied and successfully applied in real-world search engines such as Google. However, it does not scale with the size and dynamics of the Web. Furthermore, it cannot handle the hidden deep web because of privacy issues. To alleviate these problems, federated information retrieval (FIR) [28] and meta-search [22] were proposed. They send a query simultaneously to multiple search engines, collect the results from each search engine after the query has been evaluated separately, and finally merge the results together (i.e., re-ranking). In this way, there is no need to directly access the pages or the index of each search engine. FIR makes it possible to take advantage of the power of different search engines and provide large coverage of the Web. Since FIR facilitates cooperation among search engines, it can be more efficient and effective than meta-search. For this reason, FIR has attracted much attention in recent years.

As a promising solution to the scalability and adaptability problems, FIR aims to support search on a large amount of data in a distributed and self-organizing manner. In the FIR framework, each search peer indexes and maintains its own document collection, thus avoiding the management problems associated with large data centers. A broker is introduced to maintain a directory of the peers together with summarization information, called descriptors, about them. For query processing, the broker selects peers that have high potential to return relevant documents for the query according to the peer descriptors. Note that the broker does not need to know the peers' indexes or original document sets. In this paper, we only consider textual documents and content relevance in retrieval, so we name the problem federated text retrieval (FTR).

In conventional centralized IR methods, query processing only focuses on the problem of finding relevant documents using a single index. In contrast, FTR requires a three-phase query processing procedure. First, it identifies promising peers which may return the most relevant documents. Then it submits the query to the selected search engines, each of which retrieves results from its collection. Finally, it merges the results together and returns them to the user. Peer selection plays a key role in FTR and is the major concern of this paper. With peer selection, we can make query evaluation more efficient and, at the same time, save a lot of computing resources (e.g., power, communication bandwidth, CPU time). A number of peer selection approaches [5, 10] have been proposed, but they are mostly based on the word histograms of the peers and traditional term matching techniques.

Obviously, the content structure of a collection is significantly different from that of a document: a document often focuses on one topic, while a collection may contain documents belonging to different topics. To precisely characterize a heterogeneous collection, it is necessary to divide it into smaller but more homogeneous clusters. Inspired by this basic idea, some approaches [35, 27] utilize clustering to partition the document collection into different topics and then rank the peers based on the clusters. Experiments showed that topic-based ranking methods can substantially improve the quality of peer selection. However, these studies were only based on heuristics without much rationale behind them. Besides, they rely heavily on the cluster quality and ignore the relations among the clusters. In this paper, we propose a novel approach called Cluster-based Distributed Latent Semantic Indexing (C-DLSI) based on a formal analysis of the problem. In particular, C-DLSI applies clustering to distinguish the topics of a peer, extends the traditional LSI scheme, and captures delicate semantic features of the peer, thus providing a more precise characterization of the peer. Moreover, our method is scalable and cost-efficient under updates.

We detail our main contributions as follows:

  1. An LSI-based framework (C-DLSI) for text retrieval in distributed environments, encompassing directory maintenance and query processing.

  2. Identification of the properties of FTR and a feasibility analysis of extending LSI with clustering to improve the quality of peer representation. Specifically, the relations among the clusters are considered in C-DLSI to adapt to the properties of FTR. Our method is efficient and adapts to frequent collection updates, since only the clusters affected by the updates need to be reindexed.

  3. Novel descriptors of the peers and a complete federated query processing strategy in FTR, developed from the analysis of C-DLSI.

  4. An extensive performance evaluation of C-DLSI on a TREC dataset, with a full discussion of the impact of different parameter selections.

The rest of the paper is organized as follows. In Section 2, we review related work on FIR and peer selection. The bases of our method, including the FTR framework, Latent Semantic Indexing (LSI), and k-means clustering, are introduced in Section 3. In Section 4, we formalize the problem and present our approach, C-DLSI, in detail. The experimental setup and corresponding results are shown in Section 5. The last section summarizes the results obtained in this paper.

II Related Work

Peer selection is a critical problem in FTR and in distributed information retrieval systems in general. It has been studied for more than a decade, and many methods have been proposed to address it. gGloss (generalized Glossary-Of-Servers Server) [10, 11] is a well-known method. It keeps statistics (document frequencies and total weights) about the servers to estimate which servers are potentially most useful for a given query. In particular, gGloss(0), a special form of gGloss which aggregates all similarity values between the documents and the query, was shown to be the best and has been widely employed for comparison [6, 27, 32]. In this paper, we also use it as a baseline.

The Collection Retrieval Inference Network (CORI) [5, 25] is another important work. It drew an analogy between collection ranking and document ranking and applied a document-style ranking strategy to rank the collections. Specifically, it replaces TF with DF (document frequency), and IDF with ICF (inverse collection frequency), the inverse of the proportion of collections carrying at least one document which contains some query term. Moreover, Yuwono and Lee [36] proposed the cue-validity variance (CVV) method for collection selection. CVV measures the skewness of the distribution of a term across the collections and estimates the usefulness of the term for distinguishing one collection from another; terms with larger variances are then given larger weights in collection ranking. An evaluation of a number of collection selection methods in a Web environment was given in [6]. None of these methods considers the topic space of the peers or utilizes semantic information beyond simple term matching to make a selection.

Latent Semantic Indexing (LSI) [7] was originally proposed to take advantage of the implicit high-order structure in the association of terms with documents, namely the semantic structure, to improve the retrieval of relevant documents. Much effort has been made to improve its performance [19, 15, 13] or broaden its applications [24]. An earlier work which tried to utilize LSI to improve peer selection in FTR is latent semantic database selection (LSDS) [32]. It simply applied LSI to preprocess the document collections and then ran conventional selection methods (e.g., CORI) on the "cleaned" term-document matrices to rank the collections. However, this method did not capture the key properties of FTR and, moreover, inherited the disadvantages of the conventional methods, e.g., ignoring the topic space of the peers and the drawbacks of CORI.

To overcome the deficiencies of traditional methods, cluster-based approaches were proposed to identify the topic space of the peers. Document clustering was applied to organize collections around topics, and language modeling was then used to represent the topics [35]. This method allows the right topics to be effectively identified for a given query; however, it cannot distinguish the documents within a topic. Shen and Lee [27] proposed another cluster-based method, IS-Cluster, which utilizes cluster descriptors to rank the servers for a meta-search engine. We also use this method as a candidate for comparison in our experiments. Term correlation was introduced to further improve cluster-based methods [38]. However, it did not consider the compatibility issue in FTR, which means that peer selection should adapt to the local document ranking functions. Further, it is very difficult to determine the parameters of the method. In this paper, we extend LSI with clustering to specifically adapt to FTR and achieve better retrieval performance.

Recently, uncooperative federated search systems have been studied. In this case, collections do not disclose their index statistics to the broker, so the broker has to sample documents from each collection and use them for collection selection. ReDDE [29] estimates the number of relevant documents in the collections and uses it to improve collection selection. Estimating the information needed for collection selection from an uncooperative peer was addressed in [21]. [23] introduced a decision-theoretic framework (DTF) for collection selection, which tries to minimize the overall costs of federated search, including money, time, and retrieval quality. Similarly, [30] proposed a unified utility maximization framework (UUM) for resource selection, which evaluates queries on a sampled index. Furthermore, an enhanced model called RUM [31] was proposed by considering the search effectiveness of the collections. In general, these methods do not consider the topic space of the peers either and ignore the semantics. Thus, C-DLSI could also be embedded into them with slight changes (e.g., applied to the sampled documents) to adapt to this new scenario.

III Preliminaries

In this section, we introduce some preliminaries which form the bases of our C-DLSI method. In particular, we first present the general FTR framework in Section 3.1. Then we describe two important techniques, Latent Semantic Indexing (LSI) and document clustering, in Section 3.2 and Section 3.3, respectively.

III-A FTR Framework

As a federated information retrieval scheme, FTR provides loose cooperation among search peers: each peer maintains its own local index, and a central broker is employed to coordinate the cooperative text retrieval. Specifically, each peer is a complete search engine in itself; it has its own crawler, index, and search component for information gathering, organizing, and retrieving, respectively. Besides, the peers share a common descriptor publishing scheme to disclose to the broker summaries of the information in their repositories. On the other side, the broker of the FTR system takes charge of query processing by maintaining a centralized directory, which holds the descriptors of the peers' local indexes.

Fig. 1: FTR framework.

In FTR, two basic functions are supported: directory maintenance (or peer descriptor publishing) and query processing. Figure 1 shows the whole framework of FTR. First, each peer summarizes the descriptor for its local index and publishes it to the directory in the central broker. These descriptors are used by the broker to select suitable peers in query processing. This process is known as the peer representation problem [3]. Usually, a descriptor contains connection information together with statistics for each term in the peer or a limited number of sampled documents. In this paper, we will provide a novel solution to peer representation in Section 4.3.

When a query arrives at the broker, the broker selects the most promising peers, i.e., those which may return the most relevant documents, based on the descriptors. This is the peer selection problem [5]. The query is then forwarded to the selected peers. Based on its local index, each peer evaluates the query and returns its results to the broker. Upon receiving the results, the broker employs a re-ranking method to properly merge the results together and present them to the user. This is called the result merging problem. We consider these issues of federated query processing in Section 4.4.

III-B Latent Semantic Indexing

Latent Semantic Indexing (LSI), proposed by Deerwester et al. [7], aims at taking advantage of the semantic structure of a document collection to improve retrieval performance. Its objective is to overcome a fundamental deficiency of conventional keyword-based information retrieval techniques: users are interested in documents which share conceptual content with their queries, but traditional IR techniques only perform keyword matching between queries and documents and thus cannot deal with the synonymy and polysemy problems. To bridge the gap, LSI applies singular value decomposition (SVD) to a term-document matrix to statistically extract the implicit high-order structure in the association of terms with documents, which can be used to find semantic representations of the documents.

LSI is an extension of the vector space model, which approximates the term-document matrix by its truncated SVD. Given an $m \times n$ term-document matrix $A$, the SVD of $A$ is

$$ A = U \Sigma V^{T}, \qquad (1)$$

where $U$ and $V$ have orthonormal columns, $\Sigma$ is a diagonal matrix having the singular values of $A$ in decreasing order ($\sigma_1 \ge \sigma_2 \ge \cdots$) along its diagonal, and $T$ denotes the transpose of a vector or a matrix. LSI decomposes $A$ into a lower dimensional vector space by retaining only the $k$ largest singular values, where $k \ll \min(m, n)$. Specifically,

$$ A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad (2)$$

where $U_k$ and $V_k$ consist of the first $k$ columns of $U$ and $V$ respectively, and $\Sigma_k$ is the diagonal matrix containing the $k$ largest singular values of $A$. Because the number of factors $k$ can be much smaller than the number of unique terms used to construct this space, terms will not be independent, and terms with similar meaning will be located near one another in the LSI space. The relevance score of a document vector $d$ with respect to a query vector $q$ is measured by the cosine or dot product between the projections of $d$ and $q$ in the LSI space. Without loss of generality, we assume that all vectors are normalized. Then the relevance score can be described as

$$ s(d, q) = (U_k^{T} d)^{T} (U_k^{T} q). \qquad (3)$$

In this paper, we apply distributed latent semantic indexing to both peer selection and document ranking.
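To make the construction above concrete, the following is a minimal NumPy sketch of Formulas (1)-(3); the toy matrix, the rank k = 2, and the query are illustrative assumptions rather than the paper's data.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 0., 1.],
    [0., 0., 1., 2.],
])

# Formula (1): full SVD, singular values returned in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Formula (2): keep only the k largest singular values.
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
A_k = U_k @ S_k @ Vt_k  # rank-k approximation of A

def lsi_score(d, q, U_k):
    """Formula (3): dot product of the projections of d and q
    in the k-dimensional LSI space (vectors normalized first)."""
    d_hat = U_k.T @ (d / np.linalg.norm(d))
    q_hat = U_k.T @ (q / np.linalg.norm(q))
    return float(d_hat @ q_hat)

q = np.array([1., 1., 0., 0.])          # query over the same vocabulary
scores = [lsi_score(A[:, j], q, U_k) for j in range(A.shape[1])]
```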

III-C Document Clustering

Since a peer contains a large number of documents, it potentially covers multiple topics, in contrast to a single document. Thus, a word histogram created for the entire peer cannot provide a precise description of it, making it inadequate to directly apply a document ranking approach to peer ranking. A proper peer selection should consider the topics covered in the peers. In our framework, we utilize a clustering technique to group the documents of a peer and treat each cluster as an approximate topic. The peer is then evaluated based on the clusters' relevance scores with respect to a given query.

Although clustering is an offline process in the framework, we still need an efficient clustering method that can handle a large number of documents and adapt to updates on the peers. In this paper, we adopt the widely used k-means clustering algorithm [14]. It uses an iterative procedure to find a partition of the objects that minimizes the total intra-cluster variance (or squared error). Specifically, it starts with k randomly selected objects serving as centroids and assigns the objects according to their distances from them. It then generates new centroids based on the current partition and starts another round of assignment, until a stable state is reached. Empirically, the k-means algorithm converges quickly and is considered very efficient. For the document space, k-means maximizes the following measure:

$$ Q = \sum_{j=1}^{k} \sum_{d \in C_j} d^{T} c_j = \sum_{j=1}^{k} n_j \, c_j^{T} c_j, \qquad (4)$$

where $n_j$ denotes the number of documents in cluster $C_j$ and $c_j$ the centroid of the cluster.

It has been shown that the solution of the k-means clustering method coincides with the principal points solution [Flury], which means it is a point-representation scheme in which the best representative points (i.e., topics) are obtained. In contrast, SVD provides a component representation of the document space and ensures the best representation of the information content in a reduced-dimensional vector space. We will show that extending LSI with clustering is especially suited to FTR.
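As a rough sketch of this clustering step (assuming scikit-learn and placeholder document vectors; the paper itself does not prescribe an implementation), one could write:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
docs = normalize(rng.random((200, 50)))  # placeholder L2-normalized documents

k = 5                                    # preset cluster number
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(docs)

labels = km.labels_                      # cluster assignment per document
centroids = km.cluster_centers_          # c_j in Formula (4)
sizes = np.bincount(labels, minlength=k) # n_j in Formula (4)

# The quantity driven up by k-means on normalized documents (Formula (4)):
Q = sum(float((docs[labels == j] @ centroids[j]).sum()) for j in range(k))
```

For normalized vectors, minimizing the total squared error is equivalent (up to a constant) to maximizing Q, which is why the sketch reports that quantity.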

IV FTR with C-DLSI

In this section, we present the details of our cluster-based distributed latent semantic indexing (C-DLSI) method for FTR. By identifying the special properties of FTR, we extend LSI accordingly to address the peer selection issue, which overcomes the deficiencies of the conventional approaches. In Section 4.1, we first formulate the peer selection problem in FTR, analyze its properties, and identify the deficiencies of the traditional approaches. Then we propose the C-DLSI method especially tailored for FTR in Section 4.2. Based on C-DLSI, the corresponding descriptors and a federated query processing scheme are presented in Section 4.3 and Section 4.4, respectively. Finally, in Section 4.5, we describe an update scheme for a peer in a highly dynamic environment.

IV-A Properties of Peer Selection in FTR

In FTR, if the distribution of relevant documents across the results returned by the peers were known, the peers could be ranked by the number of relevant documents they return, which is known as relevance-based ranking (RBR) [4]. Consider a set of peers $\{p_1, \ldots, p_M\}$. To simplify the explanation, we assume that each peer is required to return its top $N$ results, i.e., $d_{i,1}, \ldots, d_{i,N}$ with decreasing relevance score for peer $p_i$. Let $\Pr(rel \mid d, q)$ denote the probability of relevance for document $d$ given query $q$. The ranking value for peer $p_i$ can be represented as

$$ V(p_i, q) = \sum_{j=1}^{N} \Pr(rel \mid d_{i,j}, q). \qquad (5)$$

Assume that the global weight of a term in each document is given. There are two major issues here. One is how to determine a proper relevance function $s(d, q)$ which approximates $\Pr(rel \mid d, q)$ well. The other is how to summarize the descriptors of the peers, based on which the ranking value can be properly estimated.

A simple solution, called gGloss(0) [10], is to estimate the relevance score by computing the inner product between $d$ and $q$:

$$ s(d, q) = d^{T} q. \qquad (6)$$

By maintaining only the document number $n_i$ and the centroid $c_i$ of peer $p_i$, the broker can estimate the ranking value of peer $p_i$ as

$$ \hat{V}(p_i, q) = n_i \, c_i^{T} q, \qquad (7)$$

which is equal to $\sum_{j} d_{i,j}^{T} q$. To solve the problem of varying weights of a term among peers, CORI [5], on the other hand, estimates the ranking value by using the term frequency of each query term in each peer. A further improvement of CORI is to combine it with LSI [32]. Since these methods do not consider the topic space of the peers, the effect of polysemy, i.e., some terms being common to two conceptually independent topics, is ignored. Peer ranking suffers from the polysemy issue more severely than document ranking, because the accumulation of small semantic deviations of the documents may lead to a big error in peer ranking. For example, consider a collection of two documents. One is related to "apple, fruit" while the other is related to "computation, math". If we ignore polysemy, we may draw the conclusion that the collection is related to "apple computer", even though the individual relevance scores of the documents with respect to the query "apple computer" are not high.
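A small sketch may help here; under the stated assumption that the broker keeps only the document count and centroid, the gGloss(0) estimate of Formula (7) reproduces the summed inner products of Formula (6) exactly (the vectors below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
peer_docs = rng.random((100, 50))      # placeholder document vectors of one peer
q = rng.random(50)                     # placeholder query vector

# Peer-side summary published to the broker: document count and centroid.
n_i = peer_docs.shape[0]
c_i = peer_docs.mean(axis=0)

# Broker-side estimate (Formula (7)) ...
v_hat = n_i * (c_i @ q)
# ... equals the sum of per-document inner products (Formula (6)).
assert np.isclose(v_hat, (peer_docs @ q).sum())
```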

A direct way to address the polysemy issue is to use clustering to identify the topics in a peer. Consider a conceptually homogeneous cluster. If it is regarded as relevant to a query, say, "apple, computer, product", then a document in it which does not contain any query terms, e.g., one talking about "MacBook, OS", is still likely to be relevant to the query. Therefore, we should also consider the synonyms in a cluster. Unfortunately, to the best of our knowledge, none of the existing methods adapts well to this situation. For example, if we directly apply LSI on the whole collection and then cluster the documents, polysemy cannot be effectively identified by LSI. [27] tried to solve this problem by representing a document with the descriptor of its cluster. Specifically, the weight of a term $t$ in the descriptor of a cluster $C$ is computed by

$$ w(t, C) = \frac{1}{n_{t,C}} \sum_{d \in C,\, t \in d} w(t, d), \qquad (8)$$

where $w(t, d)$ denotes the weight of term $t$ in document $d$ and $n_{t,C}$ the number of documents in $C$ which contain term $t$. We can see that, to handle synonyms, the weight of a term in the descriptor is estimated only from the documents which contain it. Then formulas similar to (6) and (7) are utilized to compute the ranking values of the peers. Similarly, [35] employs a language model to determine the relevant clusters, and all of the documents in a relevant cluster are regarded as relevant. However, these methods are restricted by two major drawbacks. First, they rely heavily on the quality of clustering and do not consider the relations among clusters. Second, since they assume that all the documents in a cluster are equally relevant, they may exaggerate the relevance scores of weakly relevant or irrelevant documents, and it is difficult to decide a proper ranking list of the documents, which is known as the compatibility issue. To overcome these limitations and adapt to the properties of FTR, we extend LSI with clustering, which considers the relations among clusters and captures more accurate descriptions of the peers.

IV-B Cluster-based Distributed LSI (C-DLSI)

In our method, the collection of a peer is partitioned into a number of clusters $C_1, \ldots, C_m$ (using k-means clustering). Then, LSI is employed to derive the semantic structure within each cluster, i.e., the LSI space for cluster $C_i$ with term-document matrix $A_i$. To make the LSI spaces of the clusters comparable, we restrict each of them to the singular values larger than a threshold $\tau$. Let $\sigma^{(i)}_j$ denote the $j$-th singular value of $A_i$ and let $k_i$ be the largest number satisfying

$$ \sigma^{(i)}_{k_i} \ge \tau. $$

Then, the LSI space of $C_i$ is redefined as

$$ A_{i,k_i} = U_{i,k_i} \Sigma_{i,k_i} V_{i,k_i}^{T}. \qquad (9)$$

Consider a diagonal block matrix for a peer with the form

$$ A = \begin{pmatrix} A_1 & & \\ & \ddots & \\ & & A_m \end{pmatrix}, \qquad (10)$$

where each block $A_i$ represents a conceptually independent topic, e.g., cluster $C_i$. It is easy to prove the following relation [20]: the singular values of $A$ are the union of the singular values of the blocks $A_i$, so truncating the SVD of $A$ at threshold $\tau$ is equivalent to truncating the SVD of each block at $\tau$, i.e.,

$$ A_\tau = \begin{pmatrix} A_{1,k_1} & & \\ & \ddots & \\ & & A_{m,k_m} \end{pmatrix}. \qquad (11)$$

This means that if a collection is perfectly divided into a number of conceptually independent topics and no polysemy exists, the LSI space of a peer built in our method is equal to the traditional LSI space obtained by applying SVD directly on the whole collection. Obviously, the LSI spaces of the clusters can distinguish and capture the semantics of the documents more precisely than the existing methods. In the rest of this subsection, this idea will be further improved.
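The per-cluster construction of Formula (9) can be sketched as follows; the cluster matrices and the threshold value are placeholder assumptions:

```python
import numpy as np

def cluster_lsi(A_i, tau):
    """Formula (9): truncate the SVD of one cluster's term-document
    matrix, keeping only the singular values >= tau."""
    U, s, Vt = np.linalg.svd(A_i, full_matrices=False)
    k_i = int(np.sum(s >= tau))         # rank retained for this cluster
    return U[:, :k_i], s[:k_i], Vt[:k_i, :]

rng = np.random.default_rng(2)
clusters = [rng.random((50, 30)), rng.random((50, 20))]  # placeholder A_1, A_2
tau = 2.0                                                # placeholder threshold
lsi_spaces = [cluster_lsi(A_i, tau) for A_i in clusters]
```

Using one shared threshold rather than a fixed rank per cluster is what keeps the retained factors comparable across clusters, as the block-diagonal argument above suggests.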

Each peer maintains the semantic structures of its clusters for descriptor generation and federated query processing; thus, we call our method cluster-based distributed LSI (C-DLSI). Generally, the clusters of a peer may be related to each other for several reasons, e.g., some topics are not conceptually independent in nature, or the clustering is imperfect and some documents belonging to one topic are separated. In C-DLSI, a semantic similarity measure between any two clusters (named the paired similarity) is introduced. This measure is estimated based on the similarity of the LSI vector spaces; consequently, a network of clusters is formed from the words shared by each pair of clusters. With this similarity network of clusters, C-DLSI can further exploit synonyms without losing the polysemy information. Therefore, it is especially suited to FTR.

Similar to [Bassu], we define two levels of the paired similarity. Consider two clusters $C_i$ and $C_j$ ($i \neq j$). Let $T_i$ denote the term set of cluster $C_i$ and $T_{ij} = T_i \cap T_j$ the common term set of $C_i$ and $C_j$. The first level of paired similarity, $S^{(1)}_{ij}$, only captures the frequency of occurrence of the common terms. If $C_i$ and $C_j$ have common terms, we say there is a direct link between them, and the proximity $\rho_{ij}$ of $C_i$ and $C_j$ is defined as the minimal number of intermediate clusters which link $C_i$ and $C_j$. Then $S^{(1)}_{ij}$ is defined in terms of the common-term frequencies and $\rho_{ij}$ (we only consider linked clusters):

(12)

Moreover, a further level of the paired similarity captures the semantics of the common terms, denoted as $S^{(2)}_{ij}$. Let $M_i$ denote the term similarity matrix for cluster $C_i$ and $M_i|_{T_{ij}}$ the restriction of $M_i$ to the term set $T_{ij}$. Define the correlation measure between two such matrices as

(13)

Then, $S^{(2)}_{ij}$ is defined as the correlation between the restricted term similarity matrices of the two clusters:

(14)

Based on these definitions, the paired similarity between $C_i$ and $C_j$ is defined as a combination of the two levels:

(15)

Before further explanation, we first consider an example of the semantic structures in a collection, as shown in Figure 2. Obviously, the polysemy of the term "apple" can be identified via clustering, with its senses divided into different clusters. However, some conceptually related terms such as "MacBook" and "Computer" may appear in different clusters. For a given query "computer", if only the LSI space within a cluster is considered, the relevant documents in cluster 3 will be missed. For this reason, the paired similarities among clusters are employed to extend the LSI space built in a cluster.

Fig. 2: An example of the semantic structures in a collection.
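Since Formulas (12)-(15) involve the paper's own measures, the following sketch uses a deliberately simple stand-in (Jaccard overlap of cluster vocabularies) only to illustrate how a pairwise similarity network over clusters could be assembled; it is not the paper's definition:

```python
import numpy as np
from itertools import combinations

# Illustrative stand-in for the paired similarity: Jaccard overlap of
# cluster vocabularies. Formulas (12)-(15) refine this with common-term
# frequencies, proximity, and term-similarity correlations.
def vocab_overlap(terms_i, terms_j):
    common = terms_i & terms_j
    return len(common) / len(terms_i | terms_j) if common else 0.0

cluster_terms = [                      # placeholder vocabularies
    {"apple", "fruit", "pie"},
    {"apple", "computer", "macbook"},
    {"macbook", "os", "laptop"},
]

# Symmetric similarity network over all cluster pairs.
S = np.zeros((len(cluster_terms),) * 2)
for i, j in combinations(range(len(cluster_terms)), 2):
    S[i, j] = S[j, i] = vocab_overlap(cluster_terms[i], cluster_terms[j])
```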

For a cluster $C_i$, let $R(C_i)$ be the set of clusters in the same peer as $C_i$ whose paired similarity with $C_i$ exceeds a threshold $\delta$, above which a cluster is regarded as relevant to $C_i$. We call $R(C_i)$ the relevant clusters of $C_i$. Given a query $q$ (with term set $T_q$), the relevance score of cluster $C_i$ is computed as the sum of the relevance scores of $C_i$ with respect to all of the terms in $T_q$:

$$ s(C_i, q) = \sum_{t \in T_q} s(C_i, t). \qquad (16)$$

The relevance score of cluster $C_i$ with respect to a term $t$, i.e., $s(C_i, t)$, can be estimated in two cases.

1) If $t \in T_i$, we have

$$ s(C_i, t) = n_i \, c_i(t), \qquad (17)$$

where $n_i$ is the number of documents in $C_i$ and $c_i(t)$ is the weight of $t$ in the centroid of the LSI space of $C_i$.

2) If $t \notin T_i$, the relevance score cannot be derived directly from the LSI space of $C_i$, so we rely on the relevant clusters of $C_i$ to estimate it. Let $C_j$ be the first cluster of $R(C_i)$ which satisfies $t \in T_j$. The projection of $t$ into the LSI space of $C_j$ can be represented as

$$ \hat{t} = U_{j,k_j}^{T} e_t, \qquad (18)$$

where $e_t$ denotes the unit term vector of $t$, i.e., $\hat{t}$ is the column of $U_{j,k_j}^{T}$ which corresponds to term $t$. The relevance score of $C_i$ with respect to term $t$ can then be approximated as

(19)

With the relevance scores of the clusters with respect to the query $q$, we can estimate the ranking value of a peer $p$. Let $C_{(1)}, C_{(2)}, \ldots$ be the clusters of $p$ in decreasing order of relevance score. Then the ranking value is estimated by considering the $r$ most relevant clusters:

$$ \hat{V}(p, q) = \sum_{l=1}^{r} s(C_{(l)}, q). \qquad (20)$$

Moreover, C-DLSI is efficient and scalable, since the size of a cluster is substantially smaller than that of the entire collection. With regard to document updates, it only requires reindexing of the affected clusters. Thus, it is suitable for highly dynamic environments in which documents are frequently updated.
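A schematic sketch of this ranking pipeline is given below; the fallback for missing terms is a simplified similarity-weighted stand-in for Formulas (18)-(19), and the dictionary-based cluster representation is an assumption of the sketch:

```python
import numpy as np

def term_score(cluster, t):
    """Formula (17): n_i times the weight of t in the centroid of the
    cluster's LSI space (0.0 if t is absent from the cluster)."""
    return cluster["n"] * cluster["centroid"].get(t, 0.0)

def cluster_score(cluster, query_terms, relevant):
    """Formula (16): sum of per-term scores. For a term missing from the
    cluster, fall back on the first relevant cluster containing it,
    discounted by the paired similarity -- a simplified stand-in for
    Formulas (18)-(19)."""
    score = 0.0
    for t in query_terms:
        if t in cluster["centroid"]:
            score += term_score(cluster, t)
        else:
            for sim, other in relevant:   # ordered by decreasing similarity
                if t in other["centroid"]:
                    score += sim * term_score(other, t)
                    break
    return score

def peer_rank_value(clusters, query_terms, relevant_of, r):
    """Formula (20): sum of the r largest cluster scores of the peer."""
    scores = [cluster_score(c, query_terms, relevant_of.get(c["id"], []))
              for c in clusters]
    return float(np.sum(sorted(scores, reverse=True)[:r]))
```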

IV-C Descriptors of Peers

A peer publishes a descriptor of its content to the broker for peer selection. In C-DLSI, since a collection has been clustered, the descriptor of a peer consists of a set of cluster descriptors. According to Formulas (16)-(20), we need at least the document number $n_i$, the centroid $c_i$ of the LSI space, and the eigen matrix $U_{i,k_i}$ to describe a cluster $C_i$. However, $U_{i,k_i}$ is usually quite large and may cause heavy communication between the broker and the peers. To overcome this difficulty, we rewrite Formula (19) as follows:

(21)

This means we only need a single value, instead of the vector $\hat{t}$, to estimate the relevance score. For any term $t$, a list of vectors $v_{i,1}, v_{i,2}, \ldots$, which correspond to the order of the relevant clusters $R(C_i)$, suffices to find this value. Therefore, the transmission of the matrix $U_{i,k_i}$ can be saved. Based on this, we define the descriptor for cluster $C_i$ in C-DLSI to contain the following aggregate information:

  1. The total number of documents in the cluster, $n_i$.

  2. The centroid of the LSI space of the cluster, $c_i$.

  3. A list of vectors $v_{i,1}, v_{i,2}, \ldots$, corresponding to the order of the relevant clusters $R(C_i)$.

That is,

(22)

Then we define the descriptor for peer $p$ as

(23)

which represents the peer with more fine-grained descriptions of its clusters. Since the number of clusters is extremely small, publishing this descriptor causes little overhead in C-DLSI.
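As a sketch, the descriptor of Formulas (22)-(23) could be represented by data structures along the following lines (the field names are assumptions of the sketch):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ClusterDescriptor:
    """Aggregate information published per cluster (Formula (22))."""
    n_docs: int                          # total number of documents, n_i
    centroid: Dict[str, float]           # centroid of the cluster's LSI space
    relevant_vectors: List[Dict[str, float]] = field(default_factory=list)
    # one precomputed term->value map per relevant cluster, in order,
    # replacing the transmission of the U matrix (Formula (21))

@dataclass
class PeerDescriptor:
    """A peer's directory entry at the broker (Formula (23))."""
    peer_id: str
    clusters: List[ClusterDescriptor]
```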

Note that we assume in this paper that there is little or no overlap among the peers. This is a reasonable assumption in most cases: since each peer is supposed to cover a different part of the web, it corresponds to a distinct database. When this assumption is violated, we can add more aggregate information to the cluster descriptors, as proposed in [1].

IV-D Federated Query Processing

As described in Section 3, federated query processing in FTR contains three steps, namely peer selection, local text retrieval, and result merging. In this subsection, we discuss these three issues in C-DLSI.

As discussed in Section 4.2, the broker computes the ranking values of all the peers according to Formula (20), based on the descriptors. In particular, when computing the relevance score of a cluster $C_i$ with respect to a term $t$ of the query with $t \notin T_i$, the broker scans the list of relevant clusters and finds the first relevant cluster $C_j$ which contains $t$. The relevance score is then computed by

(24)

where the required weight of $t$ is read from the corresponding vector in the list $v_{i,1}, v_{i,2}, \ldots$ of the descriptor. Otherwise, the relevance score can be computed directly according to Formula (17). The broker chooses the $T$ peers with the largest ranking values and forwards the query $q$, together with the IDs of the most relevant clusters, to each of them. The process then enters the phase of local retrieval.

Local retrieval is performed by the peers to retrieve relevant documents from their collections. C-DLSI only considers the LSI spaces of the most relevant clusters specified by the broker. The relevance score of a document $d$ with respect to query $q$ is evaluated based on its LSI vector $\hat{d}$ in the corresponding cluster $C_i$. Similarly to Formula (16), we have

$$ s(d, q) = \sum_{t \in T_q} s(d, t). \qquad (25)$$

If $t \in T_i$, the relevance score of the document with respect to term $t$ can be computed by

(26)

where the weight of $t$ is taken from the representation of $d$ in the LSI space of $C_i$. Otherwise, the first cluster in $R(C_i)$ which contains term $t$ is found, denoted as $C_j$, and the relevance score is estimated as

(27)

Thus, the evaluated documents can be sorted according to their relevance scores, and only the top-ranked documents are returned as results to the broker.

Result merging in FTR tries to provide a uniform ranked list of the documents returned from multiple peers. Assume that each peer has the global weights for all of the terms in the documents and applies the same ranking function. Then the relevance score of a document estimated by a peer is also valid as a global score among all peers. Thus, in C-DLSI we can simply merge the documents according to the relevance scores returned by the corresponding peers.
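Because the per-peer scores are globally comparable under this assumption, result merging reduces to a k-way merge of sorted lists; a minimal sketch (with placeholder result lists) is:

```python
import heapq

# Each selected peer returns (score, doc_id) pairs sorted by decreasing
# score; placeholder result lists from two peers:
peer_results = [
    [(0.92, "p1/d3"), (0.41, "p1/d7")],
    [(0.88, "p2/d1"), (0.35, "p2/d9")],
]

# Scores are globally comparable, so a single k-way merge suffices.
merged = heapq.merge(*peer_results, key=lambda x: x[0], reverse=True)
top10 = [doc_id for _, doc_id in list(merged)[:10]]
```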

Another factor considered in our framework is the compatibility between peer selection and local text retrieval. Since the goal of FTR is to select valuable peers that can return the most relevant documents, this process is also affected by the local text retrieval schemes. It thus requires peer selection and local text retrieval to be compatible and consistent, which is called the compatibility issue. In C-DLSI, we guarantee this property in both the peer selection method and the local text retrieval. Previous research has shown that LSI can help improve document retrieval in a single collection. However, most conventional methods for peer ranking are more likely to select the peers with the largest number of weighted keywords, which does not conform to the basic principle of LSI. Moreover, though several peer selection methods consider the semantic structures of the collections, they either ignore the compatibility issue or make it hard to find a proper local text retrieval method that adapts to their peer ranking schemes.

IV-E Collection Update

Although our C-DLSI method employs LSI in a distributed way and only requires applying SVD at a relatively small scale (i.e., on clusters only), collection update could still be costly for extremely large and dynamic collections. In our framework, we utilize a lazy scheme to handle this problem. In particular, each peer keeps all of the semantic transformation matrices of its own clusters (refer to Section 4.2). When an update occurs, e.g., due to a textual update or newly crawled documents, the peer first assigns the updated documents to the most related clusters, e.g., cluster $C_j$, and then directly uses the transformation matrices of $C_j$ to evaluate the semantic vector according to the following folding-in formula:

$$ \hat{d} = \Sigma_{j,k_j}^{-1} U_{j,k_j}^{T} \, d, \qquad (28)$$

where $d$ is the updated document vector. It can easily be proven that this form is consistent with the original form of Formula (2). If the amount of updates exceeds a threshold for a cluster, the corresponding peer will rebuild its LSI by applying SVD on the cluster again. This update handling scheme is also tested and analyzed in the experiments.
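A sketch of this lazy scheme follows; the folding-in step is the standard SVD fold-in of Formula (28), while the rebuild threshold value is an assumption of the sketch:

```python
import numpy as np

def fold_in(d_new, U_k, s_k):
    """Formula (28): project a new document vector into an existing
    cluster LSI space without recomputing the SVD (standard fold-in)."""
    return (U_k.T @ d_new) / s_k      # elementwise divide = Sigma^{-1}

# Lazy rebuild policy: once the share of folded-in documents exceeds a
# threshold (the value 0.2 here is an assumption), redo the cluster's SVD.
def needs_rebuild(n_folded, n_indexed, threshold=0.2):
    return n_folded / max(n_indexed, 1) > threshold
```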

Type Count Avg. Length
Document 53595 213.455
Query 50 2.740
Term 186319 -
TABLE I: Summary of the experimental data
Fig. 3: Number of relevant documents for each query.

V Experiments

To evaluate the C-DLSI method, we build a simulation platform with one broker and 50 peers. The documents come from the TREC collection Volumes 4 and 5, which consist of over half a million documents totaling about 2 GB. In this section, we first present the setup of the experiments and then show the results of the C-DLSI evaluation.

V-A Experimental Setup

In our experimental platform, we use the documents of the TREC collection Volumes 4 and 5. The queries are extracted from the TREC-6 ad hoc topics (topics 301-350). To simulate short Web queries, we use the terms appearing in the Title field of each topic description as the keywords. In the following experiments, we will also discuss the effect of query length on C-DLSI. Moreover, the standard relevance judgments provided by NIST are used to evaluate the retrieval effectiveness. Since only a portion of the documents are manually judged, we select them as the indexed documents to make the evaluation more reasonable. In particular, the selected documents are uniformly distributed to the 50 indexing collections, which is considered the hardest scenario compared to a skewed distribution [27]. Table I gives a summary of the data set used in the experiments. Figures 3 and 4 show the statistics of the indexed documents for each query. We can see that the number of relevant documents for each query is relatively small, and it is not easy to identify them for most queries.

Fig. 4: The probability that a document containing query keywords is relevant, for each query.
Notation   Description                          Values
m          Cluster number                       -
N          Retrieved documents in each peer     -
T          Selected peers (cast number)         -
τ          Threshold of LSI                     -
r          Number of relevant clusters          -
TABLE II: Parameters used in the experiments
Fig. 5: Performance comparison w.r.t. increasing cast number under the first collection assignment.
Fig. 6: Performance comparison w.r.t. increasing cast number under the second collection assignment.

We use the LogEntropy weighting scheme [8] to compute the weight vector of each document, which is defined as

$$ w_{t,d} = \log(1 + f_{t,d}) \cdot \left( 1 + \frac{1}{\log n} \sum_{d'} p_{t,d'} \log p_{t,d'} \right), \quad p_{t,d'} = \frac{f_{t,d'}}{\sum_{d''} f_{t,d''}}, $$

where $f_{t,d}$ is the frequency of term $t$ in document $d$, $n$ is the total number of documents in the collection, and $p_{t,d'}$ is the share of term $t$'s occurrences that fall in document $d'$. The parameters and their settings used in the experiments are shown in Table II. Generally, the effectiveness of an FTR system is not evaluated by the precision at recall points: since only a subset of the peers is selected, it is usually impossible to retrieve all of the relevant documents. As in other research works [29], we use two metrics to evaluate the quality of the merged results. One is the top-N precision (P@N), which is defined as follows:

$$ P@N = \frac{|R_N|}{N}, \qquad (29)$$

where $R_N$ stands for the set of relevant documents in the top $N$ results. As a complement, we also use another metric named top-N average precision (AP@N) to evaluate the distribution of the relevant documents in the top-N results. AP@N is defined as

$$ AP@N = \frac{1}{|R_N|} \sum_{i:\, d_i \in R_N} P@i, \qquad (30)$$

which indicates that the higher the relevant documents are ranked, the larger AP@N will be. Unless stated to the contrary, the evaluation metrics shown in this paper are averages over all 50 queries. For comparison, as mentioned in Section 2, we also implemented two baseline algorithms, gGloss(0) [10] and IS-Cluster [27], which have been shown to be very effective.
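For reference, a small sketch of the LogEntropy weighting and the two metrics is given below; the AP@N implementation follows the averaged-precision reading of Formula (30), and all inputs are placeholders:

```python
import numpy as np

def log_entropy_weights(F):
    """LogEntropy weighting: F is a terms-x-documents frequency matrix."""
    n = F.shape[1]
    P = F / np.maximum(F.sum(axis=1, keepdims=True), 1e-12)  # p_{t,d}
    PlogP = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0)), 0.0)
    g = 1.0 + PlogP.sum(axis=1) / np.log(n)   # global entropy weight per term
    return np.log1p(F) * g[:, None]           # local log weight x global weight

def precision_at(ranked, relevant, N):
    """Formula (29): fraction of the top-N results that are relevant."""
    return sum(d in relevant for d in ranked[:N]) / N

def avg_precision_at(ranked, relevant, N):
    """Formula (30): mean of P@i over the relevant positions in the top N."""
    hits, precisions = 0, []
    for i, d in enumerate(ranked[:N], start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0
```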

V-B Experimental Results

In the experiments, we first evaluate the performance of C-DLSI and study the impact of each parameter. Then we analyze the compatibility issue for C-DLSI in FTR. Next, we compare our method, denoted as C-DLSI(τ), with another form of C-DLSI which is based on a fixed truncated rank k, namely, C-DLSI(k). Finally, we examine the performance of the collection update algorithm. Since FTR usually selects a small number of peers, we focus more on the performance for small cast numbers in the experiments.

       Comp0 (vs gGloss(0))     Comp1 (vs IS-Cluster)     Avg. Recall
T      C-DLSI+    gGloss(0)+    C-DLSI+    IS-Cluster+    gGloss(0)   IS-Cluster   C-DLSI
5      0.56       0.22          0.44       0.24           0.149       0.165        0.159
10     0.4        0.42          0.4        0.32           0.288       0.282        0.28
15     0.46       0.36          0.42       0.42           0.398       0.403        0.412
20     0.38       0.36          0.38       0.5            0.521       0.52         0.51
25     0.32       0.46          0.36       0.5            0.608       0.624        0.604
30     0.34       0.46          0.36       0.4            0.686       0.702        0.698
35     0.3        0.44          0.4        0.42           0.776       0.782        0.768
40     0.34       0.4           0.42       0.38           0.853       0.859        0.858
45     0.28       0.34          0.36       0.3            0.938       0.932        0.932
50     0.0        0.0           0.0        0.0            1.0         1.0          1.0
TABLE III: Performance comparison of the three methods on peer selection ("C-DLSI+" denotes the portion of queries on which C-DLSI finds more relevant documents than the baseline; the second subcolumn denotes the reverse)

V-B1 Performance Evaluation

At the beginning, the descriptors of all peers are stored in the broker. During the experiments, the broker loads the short queries extracted from the TREC topics and performs peer selection and final result merging. Table III presents the peer selection results for each approach under a fixed setting of the cluster number, LSI threshold, and number of relevant clusters for C-DLSI. The comparison criterion is the number of relevant documents contained in the selected peers. Comp0 and Comp1 compare C-DLSI with gGloss(0) and IS-Cluster respectively, by measuring the portion of the queries on which one method outperforms the other: the first subcolumn gives the portion on which C-DLSI outperforms gGloss(0) (IS-Cluster), while the second gives the reverse. Avg. Recall denotes the average recall of the selected peers over all of the queries. We can see that C-DLSI as a whole outperforms gGloss(0) and is close to IS-Cluster. Similar results can be observed for other settings.

Fig. 7: An example of the gray-scale map of a peer. The points represent the weights of the terms. The vertical lines separate the clusters.

For performance comparison, Figures 5 and 6 show the top-N precision (P@N) and top-N average precision (AP@N) of all three methods for increasing cast number under two different settings. We can see that C-DLSI with a proper threshold (discussed in Section 5.2.2) in general outperforms the other two methods under both evaluation metrics. To understand these results further, we check the characteristics of the peers in the simulation. Specifically, the gray-scale map of a peer is given in Figure 7, in which some popular terms are removed. It shows that the TREC data used in the experiments are relatively sparse, which means it is generally difficult to rank them properly; the sparsity also leads to unsatisfactory clustering results. Given this, we can only expect a modest improvement from applying C-DLSI on this dataset. Note that we use different collection assignments in the two experiments to reach a general conclusion. The determination of the parameters is discussed in Section 5.2.2. Since the performance of C-DLSI under different collection assignments is in general similar, we mainly use the first collection assignment as an example to investigate the properties of C-DLSI in the following experiments.

Fig. 8: C-DLSI performance (P@10) with different values of r for increasing cast number.

V-B2 Impacts of Parameters

Number of relevant clusters (r)

In C-DLSI, only the r most relevant clusters are used to judge the relevance of a collection (see Section 4.2). Figure 8 shows the performance of C-DLSI with different values of r for cast numbers 5, 10, 15, 20, 25 and 30. From the results, we can see that for the smallest cast number, the smallest value of r outperforms the other settings. As the cast number increases, larger values of r gradually become preferable. This supports the fact that clustering is able to identify the documents which contain query terms but are irrelevant to the query, e.g., by keeping them in clusters that refer to other topics. Thus, with smaller r, we can remove the impact of these irrelevant documents and provide a better relevance estimation of the peers. That is why this strategy achieves higher precision for the most relevant peers (i.e., small cast numbers). However, because of the limited clustering quality on the peers which are not very relevant to the query, some relevant documents may be assigned to wrong clusters that refer to other topics. In this case, exploring more clusters, which is achieved by larger values of r, is more effective. Therefore, larger r gains better performance as more peers are considered (i.e., larger cast numbers). In the following experiments, we fix the value of r for each collection assignment when analyzing the performance of C-DLSI.

Fig. 9: C-DLSI performance (P@10) with different values of the LSI threshold τ for increasing cast number.
Fig. 10: C-DLSI performance when peers select their τ values independently. The performance of other methods such as C-DLSI(k) and IS-Cluster is also given for comparison.

Threshold of LSI (τ)

Basically, if the threshold τ of LSI is zero, then C-DLSI degenerates to a pure cluster-based peer selection method. In the experiments, we can see from Figure 9 that the effectiveness of C-DLSI is not monotonically increasing with τ; it always peaks at an intermediate value of the tested range. That is why C-DLSI with a proper threshold can improve the performance of federated query processing in FTR. Another important issue here is how to decide the proper value of τ for each peer. Since each search peer in FTR maintains its own LSI, the threshold has to be decided independently. In Figure 10, we consider this issue in the same situation as in Figure 9 and let each peer select its local optimal τ independently. Based on the AP@10 metric, each peer searches for the optimal threshold in the testing interval between 1 and 9, and the chosen values finally concentrate on either 1 or 5. From the results, we can see that this threshold decision method causes a slight performance decrease compared to the globally optimal τ but still outperforms the other methods such as IS-Cluster. Generally, how to automatically decide the proper value of τ for each peer and achieve a globally optimal performance remains an open question for our future work.

Fig. 11: Performance of the three methods w.r.t. different cluster numbers.

Cluster number (m)

In general, the quality of k-means clustering is determined by the preset cluster number m. To examine how much the FTR approaches rely on the clustering quality, we investigate the performance of the three methods with different cluster numbers, as presented in Figure 11 (each cluster number corresponds to a different collection assignment). The results show that IS-Cluster is more sensitive to the cluster number: for larger cluster numbers, IS-Cluster outperforms gGloss(0), but for small cluster numbers, IS-Cluster may be beaten by gGloss(0). This indicates that IS-Cluster relies more on the clustering quality, and a proper m has to be selected to guarantee good performance. On the contrary, C-DLSI is more stable and substantially outperforms gGloss(0) most of the time. This characteristic matters because the broker usually inclines to keep a small directory for scalability or bandwidth considerations; C-DLSI can better adapt to systems with limited resources.

Fig. 12: Performance comparison between C-DLSI and the combination methods (IS-Cluster or gGloss(0) plus LSI-based IR).

V-B3 Compatibility Issue in FTR

As shown in Table III, C-DLSI exhibits performance very close to IS-Cluster on peer selection; for some cast numbers, IS-Cluster is even better. Given this, one may wonder whether a combination of IS-Cluster and LSI-based text retrieval would be as good a choice as C-DLSI. In this subsection, we examine two combination methods, namely CM1 and CM2. Both retrieve documents in each peer based on LSI, as in C-DLSI; however, for peer selection, CM1 uses gGloss(0) while CM2 uses IS-Cluster. Figure 12 shows the result of this comparison: C-DLSI still outperforms the combined methods. For example, although IS-Cluster and gGloss(0) obtain slightly better peer selection results than C-DLSI at some cast numbers, the combination methods are still inferior to C-DLSI there. From the result, we can see that although IS-Cluster is able to detect the semantic meaning of each document, it does not adapt to LSI in local text retrieval (in fact, it is difficult to find a proper local retrieval scheme for IS-Cluster because of its special property, as pointed out in Section 4.1), and thus achieves little improvement. Accordingly, the performance of the combination method CM2 is only slightly better than that of CM1. The peer selection method of C-DLSI, in contrast, can distinguish each document and obtain proper semantic spaces based on LSI; therefore, it is more compatible with the LSI-based text retrieval.

Fig. 13: Performance comparison between C-DLSI(τ) and C-DLSI(k) under the first collection assignment.
Fig. 14: Performance comparison between C-DLSI(τ) and C-DLSI(k) under the second collection assignment.
Fig. 15: An example of the gray-scale map of a peer under the second collection assignment.

V-B4 C-DLSI(τ) vs. C-DLSI(k)

In this subsection, we compare our method C-DLSI, denoted as C-DLSI(τ), with another possible form of C-DLSI which is based on a fixed truncated rank k, namely, C-DLSI(k). Here we choose the truncated rank with the best performance among all tested values for C-DLSI(k) in each setting. The results are shown in Figures 13 and 14 for the two collection assignments respectively. We can see that C-DLSI(τ) in general beats C-DLSI(k). In particular, the gap becomes largest when all of the indexing peers are selected. We find that the numbers of relevant documents returned by the peers under the two methods are quite similar, which means C-DLSI(τ) makes the LSI spaces of the clusters comparable and thus provides better result merging. Finally, we also examine the C-DLSI scheme without considering the cluster relations, denoted as C-DLSI-NR(τ); its performance is also given in Figures 13 and 14. Generally, by considering the cluster relations, C-DLSI(τ) gains some improvement over C-DLSI-NR(τ). This gain becomes larger under the second collection assignment, because its clustering quality is worse than that of the first (as shown in Figure 15).

V-B5 Collection Update Scheme

Finally, we test our update scheme in FTR. In the experiment, we only simulate one case of collection update, i.e., indexing new documents. Specifically, we first index only a portion of the total documents and build the corresponding LSI. Then we gradually add new documents from the remaining set, in steps of a fixed fraction of the indexed size. Figure 16 shows an example of the performance variation of the three methods during the update procedure. We can see that the performance of C-DLSI still increases as more documents are added. For small cast numbers, C-DLSI outperforms (or is at least comparable to) gGloss(0) and IS-Cluster until the update amount reaches a substantial fraction of the indexed size. For larger cast numbers, this valid update amount for C-DLSI before rebuilding the LSI decreases. We also obtained similar results for other settings. This means that our update scheme is especially applicable for FTR systems, in which the cast number is relatively small.

Fig. 16: Performance comparison with increasing update.

VI Conclusion

In this paper, we proposed a promising solution to the challenges of FTR. Different from the existing methods, our proposed method, Cluster-based Distributed Latent Semantic Indexing (C-DLSI), captures the semantic structure of a peer by identifying the LSI spaces within the clusters and considering the relations among them, thus providing a more precise evaluation of the peer. We analyzed the characteristics of C-DLSI, based on which novel descriptors of the peers and a federated query processing scheme were proposed. Besides, we devised an effective form of C-DLSI, namely C-DLSI(τ), whose performance was studied and verified in the experiments. Our method is efficient since only the clusters affected by updates need to be reindexed. Moreover, we considered the update problem of C-DLSI and provided an update scheme that makes the framework more efficient while guaranteeing its effectiveness. The experimental results confirmed the superiority of our model and update scheme, and showed that our method outperforms existing methods, including the previous cluster-based methods.

References

  • [1] M. Bender, S. Michel, P. Triantafillou, G. Weikum, and C. Zimmer. ”Improving collection selection with overlap awareness in p2p search engines”. SIGIR, 2005.
  • [2] M. Bender, S. Michel, P. Triantafillou, G. Weikum, and C. Zimmer. ”Minerva: Collaborative p2p search”. VLDB, Aug 2005.
  • [3] Y. Bernstein, M. Shokouhi, and J. Zobel. ”Compact features for detection of near-duplicates in distributed retrieval”. SPIRE, pages 110-121, 2006.
  • [4] J. Callan. Distributed information retrieval. Advances in Information Retrieval, pages 127-150, 2000.
  • [5] J. P. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. SIGIR, 1995.
  • [6] N. Craswell, P. Bailey, and D. Hawking. Server selection on the World Wide Web. ACM DL, pages 37-46, 2000.
  • [7] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391-407, 1990.
  • [8] S. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, and Computers, 23:229-236, 1991.
  • [9] S. Gauch, G. Wang, and M. Gomez. Profusion*: Intelligent fusion from multiple, distributed search engines. J. UCS, 2(9):637-649, 1996.
  • [10] L. Gravano, H. Garcia-Molina, and A. Tomasic. The effectiveness of gloss for the text database discovery problem. SIGMOD Conference, pages 126-137, 1994.
  • [11] L. Gravano and H. Garcia-Molina. Generalizing GlOSS to vector-space databases and broker hierarchies. VLDB, pages 78-89, 1995.
  • [12] D. Hawking and P. B. Thistlewaite. Methods for information server selection. ACM Trans. Inf. Syst., 17(1):40-76, 1999.
  • [13] P. Husbands, H. D. Simon, and C. H. Q. Ding. Term norm distribution and its effects on latent semantic indexing. Inf. Process. Manage, 41(4):777-787, 2005.
  • [14] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
  • [15] E. R. Jessup and J. H. Martin. Taking a new look at the latent semantic analysis approach to information retrieval. Computational Information Retrieval, pages 121-144, 2001.
  • [16] R. Khoussainov and N. Kushmerick. Automated index management for distributed web search. CIKM, 2003.
  • [17] S. Lawrence and C. Giles. Accessibility of information on the web. Nature, 400(6740):107-109, 1999.
  • [18] D. L. Lee, D. J. Zhao, and Q. Luo. Information retrieval in a peer-to-peer environment. Infoscale, 2006.
  • [19] T. A. Letsche and M. W. Berry. Large-scale information retrieval with latent semantic indexing. Inf. Sci., 1997.
  • [20] D. Li and C. P. Kwong. Understanding latent semantic indexing: A topological structure analysis using Q-analysis method. IKE '07, 2007.
  • [21] K. L. Liu, C. T. Yu, and W. Meng. Discovering the representative of a search engine. CIKM, 2002.
  • [22] W. Meng, C. T. Yu, and K. L. Liu. Building efficient and effective metasearch engines. ACM Comput. Surv., 2002.
  • [23] H. Nottelmann and N. Fuhr. Evaluating different methods of estimating retrieval quality for resource selection. SIGIR, pages 290-297, 2003.
  • [24] T. T. Pham, N. Maillot, J. H. Lim, and J. P. Chevallet. Latent semantic fusion model for image retrieval and annotation. CIKM, pages 439-444, 2007.

  • [25] A. L. Powell, J. C. French, J. P. Callan, M. E. Connell, and C. L. Viles. The impact of database selection on distributed searching. SIGIR, 2000.
  • [26] D. Puppin, F. Silvestri, and D. Laforenza. Query-driven document partitioning and collection selection. Infoscale, 2006.
  • [27] Y. Shen and D. L. Lee. A meta-search method reinforced by cluster descriptors. WISE, pages 125-132, 2001.
  • [28] M. Shokouhi and J. Zobel. Federated text retrieval from uncooperative overlapped collections. SIGIR, pages 495-502, 2007.
  • [29] L. Si and J. Callan. Relevant document distribution estimation method for resource selection. SIGIR, pages 298-305, 2003.
  • [30] L. Si and J. Callan. Unified utility maximization framework for resource selection. CIKM, pages 32-41, 2004.
  • [31] L. Si and J. Callan. Modeling search engine effectiveness for federated search. SIGIR, pages 83-90, 2005.
  • [32] M. Sogrin, T. Kechadi, and N. Kushmerick. Latent semantic indexing for text database selection. In Proc. Workshop Heterogeneous and Distributed Information Retrieval, 2005.
  • [33] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. SIGCOMM, pages 175-186, 2003.
  • [34] A. Tomasic and H. Garcia-Molina. Issues in parallel information retrieval. IEEE Data Eng. Bull., pages 41-49, 1994.
  • [35] J. Xu and W. B. Croft. Cluster-based language models for distributed retrieval. SIGIR, pages 254-261, 1999.
  • [36] B. Yuwono and D. L. Lee. Server ranking for distributed text retrieval systems on the Internet. DASFAA, 1997.
  • [37] S. Zhang, G. Wu, G. Chen, and L. Xu. On building and updating distributed lsi for p2p systems. ISPA Workshops, 2005.
  • [38] D. J. Zhao, D. L. Lee, and Q. Luo. A meta-search method with clustering and term correlation. DASFAA, pages 543-553, 2004.
  • [39] S. Zhou, Z. Zhang, W. Qian, and A. Zhou. Sipper: Selecting informative peers in structured p2p environment for content-based retrieval. ICDE, 2006.