CURE: Collection for Urdu Information Retrieval Evaluation and Ranking

11/01/2020 · by Muntaha Iqbal, et al.

Urdu is a widely spoken language with 163 million speakers worldwide. Information Retrieval (IR) for Urdu demands special attention from the research community due to the language's rich morphological features and its large number of speakers. In general, the IR evaluation task has not been extensively explored for Urdu, and the most important missing element is a standardized evaluation corpus specific to the language. In this research work, we propose and construct a standard test collection of Urdu documents for IR evaluation, named the Collection for Urdu Retrieval Evaluation (CURE). We select 1,096 unique documents against 50 diverse queries from a collection of 0.5 million crawled documents using two IR models. The purpose of the test collection is the evaluation of IR models, ranking algorithms, and different natural language processing techniques. Next, we perform binary relevance judgment on the selected documents. We also build two other language resources, for lemmatization and query expansion, specific to our test collection. Evaluation of the test collection is carried out using four retrieval models, as well as with a stop-words list, lemmatization, and query expansion. Furthermore, error analysis is performed for each query with different NLP techniques. To the best of our knowledge, this work is the first attempt at preparing a standardized IR evaluation test collection for the Urdu language.







1. Introduction

Urdu is the official language of Pakistan and is widely spoken in countries such as Bangladesh, the United Kingdom, Canada, and India. Urdu has an estimated 163 million speakers worldwide and is amongst the top five most spoken languages of the world (weber2008top). Urdu finds its roots in the Turkic, Persian, and Arabic languages; most of its vocabulary is borrowed from Persian and Arabic (hardie2003developing).

Information retrieval is the process of storing and retrieving information for a given query (madankar2016information). Nowadays, a significant amount of content is available online in many regional languages, and users prefer to search for content in their native languages. Despite the large number of Urdu speakers around the world and the availability of Urdu content online, the task of Information Retrieval (IR) in Urdu has not been explored as much as for other regional and global languages. A standard test collection is an essential requirement for the evaluation of IR. Test collections are available for many European, Indian, and Asian languages (aleahmad2009hamshahri; esmaili2014towards). However, there is a scarcity of such test collections for Urdu. In (riaz2008baseline), a test collection was developed for the evaluation of Urdu IR. This test collection consists of few documents and is not available for research purposes. Furthermore, it was not prepared according to the Text Retrieval Conference (TREC) standards. In this work, we focus on developing a comprehensive test collection to evaluate IR for the Urdu language.

Our major contributions are as follows:

  • Test collection creation: For the development of the test collection, a corpus was first built by crawling documents from Urdu websites. Second, queries were formulated using information needs from different topic categories. For each query, two retrieval models, VSM and BM25, were used to retrieve top documents from the crawled corpus. The initial set of documents obtained against the queries was reduced to 1,096 unique documents. For all documents, binary relevance judgments were provided. CURE is freely available online for the research community.

  • Evaluation of IR models: For the evaluation, all retrieved documents were compared against the relevance judgments. Three evaluation measures, i.e., Precision, Recall, and Mean Average Precision (MAP), were used for the evaluation of four IR models. Experimental results show that more relevant documents are retrieved at top ranks using the BM25 model and the Language Model with Jelinek-Mercer Smoothing (LM-JMS). MAP is 73% for BM25 and LM-JMS, 68% for the Language Model with Dirichlet Smoothing (LM-DS), and 69% for VSM.

  • Evaluation of IR models with NLP techniques: In order to evaluate the impact of different Natural Language Processing (NLP) techniques on the test collection, we select three basic techniques: stop-words removal, lemmatization, and query expansion. An already developed stop-words list was used for the stop-words evaluation, further reduced by removing cardinal and ordinal words. Next, we used a dictionary look-up based approach for lemmatization with a set of word-lemma pairs. Finally, we added inflectional variants for the words in our list. We evaluated all retrieval models with these NLP techniques enabled. With stop-words removal, the results of BM25, VSM, and LM-JMS improved slightly compared to their baselines. With lemmatization, all four retrieval models, i.e., BM25, LM-JMS, LM-DS, and VSM, showed slightly improved performance. Finally, with query expansion, BM25, VSM, and LM-JMS showed improved performance. From the error analysis of the retrieval models, it was observed that the frequency of query terms matters, although term frequency alone is not an indicator of relevance.

The rest of the paper is organized as follows: Section 2 presents related work on previous test collections. Section 3 introduces CURE test collection and explains the data set generation process and its statistics in detail. Our methodology of information retrieval and language processing techniques are presented in Section 4. Section 5 discusses the results of the evaluation of different IR models on CURE. Section 6 concludes the paper.

2. Related Work

The development of resources for evaluation of IR has been the prime focus of the information retrieval research community. Many international platforms are working on the evaluation procedure and have provided the research community with several specialized test collections in different languages. The task of retrieval evaluation is a well-explored one for English and other widely spoken European languages.

Text Retrieval Conference (TREC) is one of the most notable international platforms dedicated to this field (harman1995overview). This platform provides test collections for IR in many languages. In addition to English, many non-English test collections built for retrieval evaluation of regional languages are also made available through this platform, e.g., German, French, Italian, Arabic, Spanish, and Chinese. Other notable platforms are the NII Test Collections for IR systems (NTCIR), for retrieval evaluation of cross-lingual and East Asian language text; the Cross-Language Evaluation Forum (CLEF), which focuses on European languages; and the Forum for Information Retrieval Evaluation (FIRE), which is used for the evaluation of IR for most Indian languages. All these platforms provide test collections for the evaluation of monolingual or cross-lingual IR.

Communications of the ACM (CACM) provides a test corpus consisting of 3,204 documents and 52 queries with relevance judgments for English IR. CACM consists of bibliographic data and was developed for the evaluation of the extended VSM (fox1983characterization). For the evaluation of Slovak IR, a test collection of 3,980 documents and 80 queries was created (hladek2016evaluation); its documents come from an already developed news corpus (hladek2014slovak).

Similarly, for the evaluation of Persian IR, several test collections are available. In (esmaili2007mahak), the Mahak collection was created for Persian IR evaluation using 216 queries and news documents collected from the Iranian Students' News Agency (ISNA). The most prominent effort for the creation of a Persian IR collection is Hamshahri (aleahmad2009hamshahri). For the selection of documents, the authors used the pooling method. Documents were collected from news articles spanning several years; the collection consists of 166,774 documents and 65 queries, with documents selected from major news categories. For relevance judgment, Information Technology (IT) students were asked to mark the top documents as relevant or not relevant to a specific query. For each query, a narration and description were provided. For the evaluation, the authors used four versions of Language Models (LM) and several versions of VSM, and reported that VSM performed well on their test collection.
Pewan (esmaili2014towards) is a test collection for the evaluation of Kurdish IR and is considered the first standard collection for the Kurdish language. The TREC methodology was followed in its construction. Text documents were collected from news articles of Peyamner and Voice of America; overall, 115,430 documents were collected from these two news sources. The authors created an initial set of queries and used them to create a pool from several IR systems. The final collection contains 22 queries, excluding those which had very many or very few relevant documents. They evaluated several retrieval systems on this test collection.

In (shamshed2010novel), the authors developed the first Bangla collection for IR evaluation. Three major newspapers, Protho-alo, Kaler-Konath, and Amer-Desh, were used for document collection. In total, 69 documents were selected from these three sources against 14 queries, and IT students were selected for relevance judgment. Precision and recall measures were used for evaluation, with a reported average precision and recall of 52% and 82%, respectively. A benchmark for Japanese IR, named BMIR-J2, was also developed (sakai1999bmir); it contains 5,080 text documents for 60 queries. Test collections are also available for the evaluation of medical information retrieval: the Medlar collection consists of 18 queries, 273 documents, and selective judgments by the query authors. Other such data sets are the extended Medlar, Ophthalmology1, and Ophthalmology2 collections (salton1973recent).
None of the platforms mentioned above provides a corpus for the Urdu language that can be employed for Urdu-specific IR performance evaluation. This can be attributed to the fact that there has not been much research work in the field of IR for the Urdu language. One foundational work is described in (riaz2008baseline). The authors used the Becker-Riaz corpus (becker2002study), which consists of Urdu news documents. From these documents, a baseline test collection was developed: they selected 200 documents for their dataset, created a relevance judgment for each document, and created 4 queries for baseline evaluation. The work is not comprehensive, and the test collection is not available for public access. Besides this work, there is no notable work related to this task. In this context, our primary intention was to develop a test collection for the evaluation of Urdu IR. Table 1 summarizes the characteristics of existing test collections.

Collection Name | Documents | Relevant Documents | Queries | Domain | Available
TREC-1 | 741,856 | 277/query | 50 | English | Yes
TREC-2 | 741,856 | 210/query | 50 | English | Yes
TREC-3 | 741,856 | 196/query | 50 | English | Yes
TREC-4 | 741,856 | 130/query | 50 | English | Yes
CACM | 3,204 | 796 | 52 | English | Yes
Slovak Collection | 3,980 | 1,097 | 80 | Slovak | No
Mahak | 3,007 | N/A | 216 | Persian | No
Hamshahri | 166,774 | 2,352 | 65 | Persian | No
Pewan | 115,430 | N/A | 22 | Kurdish | No
Bangla Collection | 69 | 43 | 14 | Bangla | No
BMIR-J2 | 5,080 | 28/query | 60 | Japanese | No
Medlar | 273 | N/A | 18 | Medical | No
Extended Medlar | 450 | N/A | 29 | Medical | No
Ophthalmology 1 | 852 | N/A | 29 | Medical | No
Ophthalmology 2 | 852 | N/A | 17 | Medical | No
Becker-Riaz | 200 | N/A | 4 | Urdu | No
CURE | 500,000 | 1,096 | 50 | Urdu | Yes
Table 1. Summary of existing test collections and CURE

3. CURE Test Collection

This section presents an overview of the data collected in this paper for Urdu IR evaluation. First, we explain the guidelines and the process of query creation. Next, we provide details regarding document collection. Finally, we outline the relevance judgment procedure of these collected documents.

3.1. Query creation guidelines

For query creation, we acquired assistance from three native speakers of Urdu to specify information needs and create queries. We trained our subjects to specify an information need and a query as reported in (esmaili2014towards). In addition, some predefined guidelines (Paul2013Evaluating) were provided to subjects for query creation. These guidelines are listed as follows.

  • Formulating queries from information needs: Subjects were first asked to express their information needs regarding a topic. After that, a domain expert provided guidelines to subjects about transforming these information needs to queries. According to guidelines, subjects should identify those words from the information need that represent the main idea and are crucial to retrieve relevant documents.

    Besides, subjects were asked to formulate brief queries. Fig. 1 shows four sample information needs and their respective queries. The description of information needs provided with the query is useful for subjects in the relevance judgment phase (kinney2008evaluator). For instance, for the second query in Fig. 1 “Effects of smog (smog ky asrat)” the user is interested in those documents in which the adverse effects of smog on human beings or environment are discussed.

  • Query length: A query should be longer than one word because the notion of the relevance of a document to a single-word query is ambiguous. An example of a single-word query is shown in the first row of Fig. 1.

  • Total queries: For a standard test collection, a minimum number of queries is required (CURE contains 50); therefore, each subject was asked to contribute queries toward this total.

  • Query categories: Queries should be selected from diverse categories for a content-rich test collection. We provided four pairs of information needs and sample queries to our subjects in order to train them. These queries were from history, sports, business, and health topics.

Figure 1. Information needs and respective queries

3.2. Query creation process

In the first phase, subjects define a set of user information needs - the information the user is seeking on various topics present in our corpus - and prepare a set of corresponding queries. Initially, each subject created queries to form a pool. To keep user information needs precise, we removed all queries longer than seven words.

Subjects were confused when assigning a relevance judgment to a document for a query longer than seven words, so we discarded all such queries. After removing these queries, we had 50 queries in our test collection, which is in accordance with (urbano2016test), where the authors suggest a minimum number of queries for a standard test collection. Queries in our test collection are represented in a standardized format. Fig. 2 shows an example of a four-word query from CURE.

Figure 2. Sample query from CURE
Figure 3. Length distributions of queries

Every query record has five fields: (i) 'QID', a unique query id; (ii) 'totalWords', the total number of words in the query; (iii) 'noOfWordsWithSWR', the number of words after stop-word removal; (iv) 'title', the query itself; and (v) 'description', the actual information need from which the query was formulated.
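As a concrete illustration, the five-field query record described above could be built as follows. This is a hypothetical sketch: the element names follow the fields listed in the paper, but the builder function, the example QID, and the sample stop-words set are invented for illustration.

```python
# Sketch of a CURE-style query record with the five fields described above.
# The function name and example values are illustrative, not from the paper.
import xml.etree.ElementTree as ET

def make_query_record(qid, title, description, stop_words):
    """Build one query element with QID, word counts, title, and description."""
    words = title.split()
    content_words = [w for w in words if w not in stop_words]
    q = ET.Element("query")
    ET.SubElement(q, "QID").text = qid
    ET.SubElement(q, "totalWords").text = str(len(words))
    ET.SubElement(q, "noOfWordsWithSWR").text = str(len(content_words))
    ET.SubElement(q, "title").text = title
    ET.SubElement(q, "description").text = description
    return q

record = make_query_record(
    "Q02", "effects of smog",
    "Documents discussing the adverse effects of smog.",
    stop_words={"of"})
print(ET.tostring(record, encoding="unicode"))
```

In a real pipeline the title and description would be Urdu text; the structure of the record stays the same.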

Next, we present different characteristics of the queries. Fig. 3 shows the length distribution of queries. The queries contain 219 tokens in total over 50 queries (see Table 2), an average length of about 4.4 words, and no query is longer than seven words, since longer queries were discarded. We group queries into different topic categories based on the information needs specified by users. The distribution of documents and queries over topic categories is shown in Fig. 4. We found that 30% of the queries in our test collection belong to the national and international news categories. On average, 21 documents were retrieved for each query.

Figure 4. Distribution of queries and documents over categories

3.3. Document collection

After the successful creation of queries, the second step is to select documents for these queries from a large corpus. To obtain Urdu documents, we employed the open-source web crawler Nutch (version 2.3.1) to fetch Urdu web documents from Urdu websites. Our web crawler fetched 0.5 million documents between August 2017 and November 2017. We note that the majority of these documents were crawled from Urdu news websites that frequently post Urdu content. Next, we indexed these documents in Apache Solr (version 6.2.2) with document id, title, and content.

Figure 5. Sample document from CURE

To select documents from the 0.5 million document corpus, we used the pooling method. Pooling is a standard method in which a subset of 'top k' documents is obtained from a corpus using different retrieval models (manning2008introduction). The top-ranked documents were selected using two retrieval models (VSM and BM25) in Apache Solr to create the initial set of documents. After removing duplicate documents from this initial set, we obtained 1,096 documents from 254 domains. We store our data in a standardized XML format. Each document has: (i) a 'document_ID' field, (ii) a 'title' field carrying the title of the web document, and (iii) a 'body' field containing the content of that document. Fig. 5 shows a sample document from our test collection. The total size of the collection is 6.7 MB.
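The pooling step can be sketched as follows. This is a minimal illustration, not the Solr-based code used in the paper: `rankers` would be stand-ins for the VSM and BM25 runs, and the deduplication corresponds to reducing the initial pool to unique documents.

```python
# Minimal sketch of the pooling method: take the top-k documents from each
# retrieval model per query and keep the unique union, preserving order.
def build_pool(queries, rankers, k=50):
    pool = {}  # query id -> ordered list of unique doc ids
    for qid, qtext in queries.items():
        seen, merged = set(), []
        for rank in rankers:                 # e.g. [run_vsm, run_bm25]
            for doc_id in rank(qtext)[:k]:   # top-k results per model
                if doc_id not in seen:       # drop duplicates across models
                    seen.add(doc_id)
                    merged.append(doc_id)
        pool[qid] = merged
    return pool
```

With two rankers returning overlapping result lists, the merged pool contains each document id only once, which is exactly why the final CURE pool is smaller than the sum of the per-model result lists.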

Figure 6. Document size distribution
Figure 7. Distribution of documents over categories
Figure 8. Most frequent terms in dataset

Fig. 6 shows the size distribution of the 1,096 documents in the collection. Document sizes vary from 1 KB to 143 KB, and more than 80% of the documents fall within a small range at the lower end of this spectrum. Fig. 7 shows the distribution of documents based on source website categories: almost 75% of the documents belong to Urdu news sites, and 18% belong to the general website category.

In addition, documents were also retrieved from blogs, forums, and social media sites. Fig. 8 shows the most frequent words in our test collection and their lengths in characters. Almost all of them are stop words, except "Pakistan", which is frequent because it is used in many queries. Table 2 provides details of different attributes of CURE.

Attribute | Value
Collection Size | 6.7 MB
Document Format | XML
Total Domains | 254
Total Queries | 50
Total Tokens in Queries | 219
Unique Tokens in Queries | 143
Number of Categories | 11
Total Documents | 1,096
Average Documents per Query | 21
Total Tokens in Documents | 859,432
Tokens after Stop-Words Removal | 461,300
Unique Tokens in Documents | 15,418
Largest Document (by size) | 143 KB
Smallest Document (by size) | 1 KB
Table 2. Attributes of CURE

3.4. Relevance judgments

After the creation of queries and the pool of documents, relevance judgments are required for the documents against the queries. For this purpose, we acquired assistance from the same subjects who created the queries. The following guidelines were provided to subjects for relevance judgment:

  • If a document is partially relevant to a query, that document should be considered relevant. For example, in the case of the third query of Fig. 1 (Pakistan dar-aamdat o bar-aamdat, Pakistani imports and exports), all documents which provide information related to either imports or exports, or both, are considered relevant to the query.

  • Documents in which the original query term does not exist, but a synonym of it is used, should be considered relevant as well. For the last query of Fig. 1, a retrieved document does not contain the original word "asbab (reason)" but uses the synonym "waja (reason)"; that document is considered relevant for the query.

Two of the three subjects manually provided binary (relevant, not relevant) judgments for all documents. Table 3 shows a sample relevance judgment. The format first shows the query ID, followed by the document ID and the relevance judgment for this query-document pair. In our notation, a value of '1' means that the document is relevant to the query. After assigning relevance judgments to query-document pairs, the Kappa statistic was used to assess the degree of agreement between the two subjects (cohen1960coefficient).

Query ID | Document ID | Judgment
Q03 | CURE-0070 | 1
Q03 | CURE-0071 | 0
Q03 | CURE-0072 | 0
Q03 | CURE-0073 | 0
Q03 | CURE-0074 | 1
Table 3. Sample relevance judgment from CURE
 | S1: Relevant | S1: Not-Relevant | Total
S2: Relevant | 701 | 31 | 732
S2: Not-Relevant | 112 | 252 | 364
Total | 813 | 283 | 1,096
Table 4. Details of relevance judgment

The Kappa statistic is computed as κ = (P(A) − P(E)) / (1 − P(E)), where P(A) is the proportion of time the subjects agreed on relevance judgments and P(E) is the proportion of time the subjects would be expected to agree by chance. From Table 4, P(A) = (701 + 252)/1096 ≈ 0.87 and P(E) ≈ 0.58, so the agreement score is approximately 70%, which is considered a substantial agreement (gwet2012benchmarking). In our study, subject-1 (S1) and subject-2 (S2) judged 813 and 732 documents, respectively, to be relevant for the 50 queries. Both subjects agreed on 701 documents as relevant. A third subject (S3) was asked to resolve the 143 conflicting judgments to break the ties, marking some of these documents as relevant. Hence, a final set of relevant documents was obtained from the 1,096 documents against the 50 queries. After analyzing the documents with conflicting relevance judgments, we observed two reasons for the conflicts:

  • Mostly, subjects were confused in cases of lengthy queries. Therefore, we revised our guidelines and restricted query length to a maximum of seven words.

  • Subjects provided strict relevance judgments for their own queries; however, their judgments were less strict for queries created by other subjects. This behavior was the main cause of conflicting relevance judgments.
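The agreement computation above can be reproduced directly from the 2x2 counts in Table 4. This is a minimal sketch of Cohen's Kappa; the counts are taken from the table, and the result rounds to about 0.69, in line with the reported agreement of roughly 70%.

```python
# Cohen's Kappa from the 2x2 agreement counts in Table 4:
# 701 both-relevant, 31 S2-only, 112 S1-only, 252 both-not-relevant.
def cohens_kappa(a, b, c, d):
    """a: both relevant, b: S2 rel / S1 not, c: S2 not / S1 rel, d: both not."""
    n = a + b + c + d
    p_a = (a + d) / n                           # observed agreement P(A)
    p_s1_rel, p_s2_rel = (a + c) / n, (a + b) / n
    p_e = (p_s1_rel * p_s2_rel                  # chance agreement on "relevant"
           + (1 - p_s1_rel) * (1 - p_s2_rel))   # and on "not relevant"
    return (p_a - p_e) / (1 - p_e)

kappa = cohens_kappa(701, 31, 112, 252)
print(round(kappa, 2))  # ≈ 0.69, a substantial agreement
```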

4. Methodology

First, we explain the language preprocessing applied to each query. Next, we briefly discuss the different retrieval models selected for the evaluation of our test collection. Finally, we explain the basic language processing techniques that we also evaluated on our test collection.

4.1. Language pre-processing

When a query is received, the first step is to preprocess it. In the preprocessing phase, the query is tokenized using a white-space delimiter. After tokenization, we remove any unnecessary punctuation marks. After this preprocessing, the query is issued to the retrieval models.
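The two preprocessing steps above can be sketched as follows. This is an illustrative implementation, not the paper's actual code; the punctuation set is an assumption, extended with a few common Urdu marks.

```python
# Sketch of query preprocessing: whitespace tokenization followed by
# punctuation stripping. The punctuation set here is illustrative; Urdu
# text also uses marks such as the full stop '۔', comma '،', and '؟'.
import string

URDU_PUNCTUATION = "۔،؟"

def preprocess_query(query):
    tokens = query.split()                        # whitespace tokenization
    strip_chars = string.punctuation + URDU_PUNCTUATION
    tokens = [t.strip(strip_chars) for t in tokens]
    return [t for t in tokens if t]               # drop now-empty tokens

print(preprocess_query("effects of smog?"))  # ['effects', 'of', 'smog']
```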

4.2. Retrieval models

For the evaluation, we chose four retrieval models for comparison: the TF-IDF based VSM (guo2008similarity), the BM25 model (robertson1995okapi), LM-DS, and LM-JMS (zhai2017study). Before applying any retrieval model, we OR together the terms of queries longer than one word to create a first subset of potentially relevant documents. The documents in this subset are then ranked by the relevance scores assigned by the retrieval models. Next, we provide a brief description of the selected IR models.

  • Term Frequency-Inverse Document Frequency (TF-IDF) based Vector Space Model:

    In this type of retrieval model, documents in the corpus and queries are transformed into a vector space using a bag-of-words representation. The weights used for the terms are TF-IDF weights. Term frequency gives the number of occurrences of a term in a document, whereas the IDF value captures how unique a term is to a document. Each term in the index is converted into an n-dimensional vector, where 'n' is the number of documents. After this transformation, the cosine similarity is computed between the query and the document. TF is calculated by taking the square root of the number of occurrences of a word in the document, and the traditional IDF is modified so that it never outputs a zero value. Cosine similarity and TF-IDF are calculated as follows:

    cos(Vq, Vd) = (Vq · Vd) / (||Vq|| × ||Vd||)

    where Vq is the vector representation of the query and Vd is the vector representation of the document.

    tf(t, d) = sqrt(freq(t, d)),  idf(t) = 1 + log(ND / (DF + 1))

    Here, freq(t, d) is the number of times term 't' occurs in document 'd', ND is the total number of documents in the corpus, and DF is the number of documents that contain term 't'.

  • Best Matching (BM25) Model:
    BM25 can be seen as the next generation of the TF-IDF vector space model. BM25 uses saturation parameters to control the contribution of term frequency. Moreover, the length of the document is incorporated into the final score relative to the average length of all documents in the corpus. The final score in the BM25 algorithm is calculated as follows (robertson1995okapi):

    score(q, d) = Σ_{t ∈ q} idf(t) × (tf(t, d) × (k1 + 1)) / (tf(t, d) + k1 × (1 − b + b × dL / avgdL))

    where dL is the document length, avgdL is the average document length in the corpus, and k1 and b are the saturation and length-normalization parameters, respectively.

  • Language Model with Dirichlet Smoothing (LM-DS):
    A language model for IR is a probability distribution learned in the form of a query likelihood estimate for each document independently. Smoothing is used to adjust this estimate for query terms that are absent from a document. Note that the smoothing parameter used in this model depends on the document length. Using a constant μ and the document length N, the smoothing weight λ is determined as:

    λ = μ / (N + μ)

    The overall smoothed probability is calculated as:

    P(t|d) = (c(t, d) + μ × P(t|C)) / (N + μ)

    where N is the document length, c(t, d) is the count of term 't' in document 'd', and P(t|C) is the language model of the collection. The query likelihood P(q|d) is the product of these per-term probabilities.

  • Language Model with Jelinek-Mercer Smoothing (LM-JMS):
    This model employs a linear interpolation of the maximum likelihood estimate of a query word in the document and the language model of the collection. The smoothing parameter λ is defined independently of documents and queries, and documents with greater length tend to provide better estimates. The optimal value of λ needs to be tuned for the dataset under consideration; in our setting, λ is set to 0.7. The overall probability used for ranking is calculated as (zhai2017study):

    P(t|d) = (1 − λ) × Pml(t|d) + λ × P(t|C)

    where P(t|d) is the smoothed probability of a term in the document, Pml(t|d) is the maximum likelihood estimate of term 't' in document 'd', and P(t|C) is the language model of the collection.
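The two best-performing models above, BM25 and LM-JMS, can be sketched in a few lines. This is an illustrative, textbook-style implementation, not the Solr code the paper used: the IDF in `bm25_score` is the common Robertson form, the values k1 = 1.2 and b = 0.75 are conventional defaults rather than values stated in the paper, and λ = 0.7 follows the paper's setting.

```python
# Illustrative per-document scoring for BM25 and the Jelinek-Mercer
# smoothed language model. Parameter defaults are assumptions, except
# lam=0.7, which matches the setting reported in the paper.
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_dl,
               k1=1.2, b=0.75):
    tf, dl, score = Counter(doc_terms), len(doc_terms), 0.0
    for t in query_terms:
        if t not in tf:
            continue
        idf = math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avg_dl))
        score += idf * norm
    return score

def lm_jms_score(query_terms, doc_terms, collection_terms, lam=0.7):
    # log P(q|d) = sum_t log((1 - lam) * P_ml(t|d) + lam * P(t|C))
    tf, cf = Counter(doc_terms), Counter(collection_terms)
    dl, cl, score = len(doc_terms), len(collection_terms), 0.0
    for t in query_terms:
        p = (1 - lam) * tf[t] / dl + lam * cf[t] / cl
        if p == 0:
            return float("-inf")   # term unseen even in the collection
        score += math.log(p)
    return score
```

Both functions score higher for documents that actually contain the query terms; the language model additionally gives a non-zero (smoothed) probability to documents missing a term, as long as the term occurs somewhere in the collection.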

4.3. Natural Language Processing techniques in IR

Basic NLP techniques are essential requirements for any IR system. Our goal is to study the impact of enabling these NLP techniques on our test collection. Next, we describe different language processing techniques such as stop-words elimination, lemmatization, and query expansion that we used in addition to the baseline evaluation for the four retrieval models.

  • Stop-words removal: Stop-words are eliminated to evaluate their effect on our test collection. Stop-words removal slightly improved IR performance on the Kurdish test collection (esmaili2014towards). We removed stop-words from queries using a predefined Urdu stop-words list. A stop-words list consists of very frequent words that do not carry any meaning of their own; we further reduced this list by removing cardinal and ordinal words.

  • Lemmatization: Lemmatization is used to find the root of an input word and is closely related to stemming. In (balakrishnan2014stemming), the authors used stemming and lemmatization to improve precision in IR for the English language, evaluated on the CACM collection; mean average precision over the top 10 and top 20 documents was used as the evaluation metric, and a significant improvement over the baseline was reported. The main hurdle for the Urdu language is the scarcity of basic language modeling tools. Gupta (gupta2016design) developed a rule-based lemmatizer, but it is not available for public use. We apply dictionary-based lemmatization with Stop-Words Removal (SWR) at both the query and index level, using a dictionary of word-lemma pairs. In our test collection, lemmatization affected only some of the 50 queries. An example of lemmatization is shown in Fig. 9.

  • Query Expansion: Query expansion is the process of generating all inflectional forms of the input words in a query (carlberger2001improving). An inflectional morpheme produces a grammatical form of a word (e.g., a plural) without changing its part of speech (khan2011challenges). In information retrieval, performance can be improved by expanding query words, because expansion minimizes the chances of a mismatch between document and query (vechtomova2009query). Retrieval effectiveness can also be improved by expanding the original query with synonyms (carpineto2012survey). CURE is also used for the evaluation of query expansion with Stop-Words Removal (SWR). The most frequent words were selected from our index for adding inflectional variants, giving a set of unique words after stop-words removal. We manually added inflectional forms of these root words. An example of variants is shown in Fig. 9. A query word is first mapped to the root form of the word, and then the variants are added along with the root. For the query word "Amarten (Buildings)", variants of the root word "Amart (Building)" are added, and we retrieve all documents in which "Amarten (Buildings)", "Amart (Building)", or "Amarton (Buildings)" is present.
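The dictionary look-ups behind the lemmatization and query-expansion steps above can be sketched as follows, using Romanized placeholders for the Urdu words from the buildings example. The dictionaries here are tiny illustrative stand-ins for the paper's manually built word-lemma and variant lists.

```python
# Dictionary-lookup lemmatization and inflectional query expansion, using
# the "Amarten / Amart / Amarton" example from the text as placeholder data.
LEMMA_DICT = {"Amarten": "Amart", "Amarton": "Amart"}   # word  -> root
VARIANTS   = {"Amart": ["Amarten", "Amarton"]}          # root  -> inflections

def lemmatize(token):
    # Fall back to the surface form when the word is not in the dictionary;
    # this is the coverage gap discussed in the error analysis below.
    return LEMMA_DICT.get(token, token)

def expand(token):
    root = lemmatize(token)
    return [root] + VARIANTS.get(root, [])  # root plus all its variants

print(expand("Amarten"))  # ['Amart', 'Amarten', 'Amarton']
```

Issuing the expanded term list as an OR query retrieves every document containing the root or any of its inflections.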

Figure 9. Query words with lemma and variants

4.4. Ranking

To retrieve and rank test collection documents, we compute a relevance score for the title field and the content field of each document for every word in the query. The maximum of the title and content scores is taken as the relevance score of that query word for that document. The relevance scores of individual query words are then added in the VSM and BM25 models, and multiplied in the language models, to obtain the final score. Finally, the retrieved results are sorted by relevance score in descending order.
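The field-wise aggregation just described can be sketched as follows. This is an illustrative skeleton: `score_fn` is a hypothetical per-field relevance function standing in for any of the four models above.

```python
# Sketch of the ranking step: per query word, take the max of the title and
# content scores; combine words by sum (VSM/BM25) or product (language
# models); then sort by final score in descending order.
def rank_documents(query_words, docs, score_fn, combine="sum"):
    results = []
    for doc in docs:
        word_scores = [
            max(score_fn(w, doc["title"]), score_fn(w, doc["content"]))
            for w in query_words
        ]
        if combine == "sum":                  # VSM, BM25
            total = sum(word_scores)
        else:                                 # language models
            total = 1.0
            for s in word_scores:
                total *= s
        results.append((doc["id"], total))
    return sorted(results, key=lambda x: x[1], reverse=True)
```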

5. Results

In this section, we describe our experimental results for the evaluation of different information retrieval models based on our test collection. Furthermore, we also discuss the impact of different NLP techniques on the evaluation of these models. For this purpose, retrieved documents were compared against relevance judgment provided in test collection for different use cases.

5.1. Metrics definitions

We use several metrics for the evaluation: Precision@10, Precision@20, Recall@50, and Mean Average Precision (MAP)@50. We briefly define these metrics below:

  • Precision@k in information retrieval is the ratio of relevant results among the top k retrieved results. For instance, if five documents are relevant to a query in the top 10 retrieved results, Precision@10 is 0.5.

  • Recall@k is the ratio of relevant documents in the top k retrieved results to the total number of relevant documents present in the corpus.

  • Mean Average Precision (MAP)@k measures whether the relevant results among the top k retrieved results appear at the top ranks. MAP is a single measure that averages the average precision over all queries.
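The three metrics defined above can be implemented in a few lines. This is a minimal, self-contained sketch; `retrieved` is a ranked list of document ids and `relevant` is the set judged relevant for the query.

```python
# Minimal implementations of Precision@k, Recall@k, and (M)AP@k.
def precision_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def average_precision_at_k(retrieved, relevant, k):
    hits, total = 0, 0.0
    for i, d in enumerate(retrieved[:k], start=1):
        if d in relevant:
            hits += 1
            total += hits / i          # precision at each relevant rank
    return total / min(len(relevant), k) if relevant else 0.0

def mean_average_precision(runs, k):
    # runs: list of (retrieved, relevant) pairs, one per query
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

ranked = ["d1", "d2", "d3", "d4"]
print(precision_at_k(ranked, {"d1", "d3"}, 2))  # 0.5
```

Because average precision rewards relevant documents at early ranks, two systems with the same Precision@10 can still have different MAP values, which is why the paper reports both.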

5.2. Evaluation of IR Models for CURE

Table 5 shows Precision@10, Precision@20, Recall@50, and MAP@50 for the 50 queries. In general, the baseline Precision@10 values for BM25 and LM with Jelinek-Mercer smoothing are 0.73, whereas VSM and LM with Dirichlet smoothing show a lower performance of 0.66 and 0.69, respectively. A Precision@10 of 0.73 means that, on average, about 7 out of the 10 retrieved documents are relevant. We observe a similar trend in Precision@20 for all four retrieval models. We note that, as we increase the number of retrieved documents, precision decreases while recall increases (robertson2000evaluation). Recall values in the top 50 ranked documents are almost the same for every model except VSM, with values of 0.86 and above in every case, meaning that over 85% of the relevant documents for a query in the test collection are retrieved in the top 50 results for each model. Our results are consistent with those already reported in (riaz2008baseline). Similarly, the MAP values for all the retrieval models under consideration are 0.68 or higher; such MAP values indicate that more relevant than non-relevant results appear at the top ranks. According to (zhai2017study), the performance of a particular retrieval model is highly dependent on the test collection under consideration and its attributes: a model performing well on one language's test collection might not perform the same way on another data set. In our setting, BM25 improves the retrieval results compared to VSM. We also see a clear difference in the performance of the two language models, with LM with Jelinek-Mercer smoothing outperforming LM with Dirichlet smoothing in every measure. This may be due to the nature of the language or the smoothing values used; exploration of this phenomenon is left for future work.

Technique                      Model    Precision@10  Precision@20  Recall@50  MAP@50
Baseline                       BM25     0.73          0.62          0.97       0.73
                               VSM      0.66          0.57          0.86       0.69
                               LM-DS    0.69          0.58          0.93       0.68
                               LM-JMS   0.73          0.61          0.97       0.73
Stop-Words Removal (SWR)       BM25     0.73          0.62          0.97       0.74
                               VSM      0.69          0.57          0.86       0.70
                               LM-DS    0.69          0.58          0.93       0.68
                               LM-JMS   0.73          0.61          0.96       0.73
Lemmatization with SWR         BM25     0.74          0.63          0.97       0.75
                               VSM      0.71          0.58          0.87       0.72
                               LM-DS    0.71          0.59          0.93       0.70
                               LM-JMS   0.75          0.63          0.97       0.75
Query Expansion with SWR       BM25     0.74          0.63          0.98       0.75
                               VSM      0.71          0.58          0.88       0.72
                               LM-DS    0.70          0.58          0.94       0.69
                               LM-JMS   0.75          0.63          0.98       0.75
Table 5. Results of retrieval models

5.3. Error Analysis for Each Query

In our study, LM-JMS (P@10) returned at least nine relevant documents for 30 of the 50 queries. Our investigation of the remaining 20 queries showed that irrelevant documents were retrieved due to the retrieval model for 11 queries and due to errors in the NLP techniques for 9 queries. Keeping in view the NLP tools pipeline used for indexing and retrieval in our experiments, we carried out a detailed error analysis for these 20 queries. The query-by-query error analysis is available for download.

For a given query, the order of NLP tools in our IR system pipeline is as follows: 1) word tokenization, 2) stop-word removal, 3) lemmatization, and 4) query expansion. Our query-by-query error analysis showed that overall retrieval performance dropped due to errors in these tools: the word tokenizer (2 queries), lemmatization (2 queries), and query expansion (5 queries). Below we present a detailed error analysis for each NLP tool that decreased overall retrieval performance.
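The four-stage query pipeline above can be sketched as follows. The tokenizer, stop-word list, and lookup dictionaries here are hypothetical English stand-ins; CURE uses the Urdu resources described earlier:

```python
STOP_WORDS = {"the", "of", "in"}                  # stand-in for the Urdu stop-words list
LEMMA_DICT = {"buildings": "building"}            # lookup-based lemmatization dictionary
EXPANSION_DICT = {"ancient": ["old", "antique"]}  # synonym/variant dictionary

def process_query(query):
    # 1) word tokenization (whitespace-based, as in our tokenizer)
    tokens = query.split()
    # 2) stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3) lookup-based lemmatization (fall back to the surface form)
    tokens = [LEMMA_DICT.get(t, t) for t in tokens]
    # 4) query expansion: append dictionary variants of each term
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(EXPANSION_DICT.get(t, []))
    return expanded

print(process_query("the ancient buildings"))
# ['ancient', 'building', 'old', 'antique']
```

An error at any stage propagates to all later stages, which is why the per-tool analysis below follows this order.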

Fig 10 shows examples of errors introduced by the different NLP techniques applied in the IR tools pipeline. We start with an error caused by the word tokenizer. The query was first tokenized into five tokens, of which four remained after stop-word removal; lemmatization and query expansion left them unchanged because our lookup dictionaries did not cover the root words and variants of the query terms. In the sample shown, "Quaid Azam" is a single term; however, our word tokenizer split this query on white space into two tokens, i.e., "Quaid" and "Azam". For this query, a total of five relevant documents were retrieved (underlined in Figure 10). Some relevant documents were not retrieved for two reasons: 1) the keyword "Jinnah" (the actual name) was present in the documents instead of "Quaid Azam", and 2) there was no space between the terms "Quaid" and "Azam".
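The whitespace-tokenization failure mode can be reproduced directly; the transliterated strings below are illustrative stand-ins for the Urdu forms:

```python
query = "Quaid Azam"      # one multi-word proper name
tokens = query.split()    # a whitespace tokenizer splits it in two
print(tokens)             # ['Quaid', 'Azam']

# A document that writes the name without an internal space
# matches neither token exactly:
doc_term = "QuaidAzam"
print(any(t == doc_term for t in tokens))  # False
```

The same exact-match failure occurs when the document uses an alternative name ("Jinnah") that no dictionary in the pipeline links to the query terms.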

Next, we discuss the impact of query expansion. For this query, eight relevant documents were retrieved in the top ten ranks. A detailed inspection of the two irrelevant documents showed that the keywords "jamia (institute)", "jamiat (institutes)", and "institutes" were used instead of "university". The error for this query occurred mainly because our query-expansion dictionary did not contain these synonyms. Similarly, in the case of lemmatization, six relevant and four irrelevant documents were retrieved in the top ten ranks. Here the root word "Science" provided by the lemmatizer matched index terms present in the retrieved documents; as a result, less-relevant documents were pushed to higher ranks after lemmatization for this query.
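The vocabulary-mismatch failure can be sketched in the same lookup style; the transliterated terms and the dictionary contents are illustrative:

```python
EXPANSION_DICT = {"university": []}  # no synonyms recorded for this term

def expand(term):
    """Return the term together with its dictionary variants."""
    return [term] + EXPANSION_DICT.get(term, [])

# Documents use "jamia"/"jamiat" instead of "university"; with an empty
# synonym list the expanded query cannot bridge the gap.
doc_terms = {"jamia", "jamiat"}
print(set(expand("university")) & doc_terms)  # set()

# Adding the synonyms to the dictionary closes the mismatch:
EXPANSION_DICT["university"] = ["jamia", "jamiat"]
print(sorted(set(expand("university")) & doc_terms))  # ['jamia', 'jamiat']
```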

Figure 10. Error Analysis of NLP Techniques

Other than the NLP techniques, our retrieval models also introduced some errors. Fig 11 shows an example of an error caused by the retrieval model. For the last query, mostly those documents were retrieved in which the terms "qadeem (ancient)" or "dunia ke qadeem amarteen (the ancient buildings of the world)" occurred frequently.

Figure 11. Error Analysis of LM-JMS

In general, for the baseline case, all retrieval models mostly retrieved documents containing the original query words, whereas with lemmatization and query expansion the retrieval of top-ranked documents relied on root words and variants of the query terms. In our setting, most errors were observed for query expansion. We believe this performance can be further improved by adding synonyms to the query-expansion dictionary, which would reduce the chance of vocabulary mismatch and lead to more effective retrieval. Our evaluation results indicate that applying lemmatization and query expansion improves results for Urdu IR, which is in accordance with (balakrishnan2014stemming), where the authors use language pre-processing techniques such as stemming and lemmatization at the document level to improve English IR retrieval performance.

6. Conclusion

Standardized evaluation of an IR system cannot be achieved in the absence of a test collection, and such resources are very scarce for the Urdu language. In this research work, we have developed a standard test collection (CURE) for the task of Urdu IR evaluation. We manually designed 50 queries, and for each query the top documents were retrieved from a collection of 0.5 million documents using the VSM and BM25 models. After removing duplicates, our test collection contains 1,096 unique documents. For each document, a binary relevance judgment is provided. In addition, we have developed NLP resources for stop-word removal, lemmatization, and query expansion that are essential for information retrieval.

To evaluate the effect of stop-word elimination, an already developed stop-words list was used. For the evaluation of lemmatization and query expansion, we manually developed resources from the most frequent words of the indexed documents: the lemmatization dictionary contains the root forms of words, and the query-expansion dictionary contains their inflectional forms. Three evaluation measures, Precision, Recall, and MAP, were used to evaluate the performance of the baseline retrieval models as well as the language resources. Results show that the performance of all retrieval models improved when basic language processing techniques were applied. This test collection can be used to evaluate different features of information retrieval systems and the application of different NLP techniques in IR. In the future, we plan to increase the number of documents and queries in our collection. The complete test collection, comprising the set of queries, the document collection, the relevance judgments, and the stop-words list, is available for download.