Accelerating globalization has created strong demand for Cross-Lingual Information Retrieval (CLIR), which has broad applications such as cross-border e-commerce and cross-lingual question answering eco; ruckle2019improved; xu2021artificial. Informally, CLIR is a document retrieval task: given a query in one language, the goal is to rank candidate documents written in another language according to their relevance to the query.
Most existing solutions to the CLIR task are built upon machine translation (MT) systems dwivedi2016survey. One technical route translates either the query or the document into the language of the other side dic-trans1; dic-trans2; cor-trans1; doc-trans1. The other translates both the query and the document into the same intermediate language, e.g. English kishida2003two. Once the query and the documents are in the same language, monolingual retrieval is performed to accomplish the task. Hence, the quality of the MT system and the errors that accumulate through the translate-then-retrieve pipeline can limit the effectiveness of these approaches in CLIR.
Recent studies strive to model CLIR with deep neural networks that encode both the query and the document into a shared space rather than relying on MT systems zhang2019improving; share-repre; hui2018co; eco. Though these approaches achieve remarkable successes, the alignment they learn is only implicit, so the intrinsic differences between languages remain. Meanwhile, queries are typically short, leading to a lack of information when matching against candidate documents.
To tackle these issues, we aim to find a “silver bullet” that simultaneously performs explicit alignment between queries and documents and broadens the information available in queries. The multilingual knowledge graph (KG), e.g. Wikidata vrandevcic2014wikidata, is our answer. As a representative multilingual KG, Wikidata (https://www.wikidata.org/wiki/Wikidata:Main_Page) includes more than 94 million entities and 2 thousand kinds of relations, and most of its entities have aligned multilingual names and descriptions (more than 260 languages are currently supported). With such an external source of knowledge, we can build an explicit bridge between the source language and the target language, starting from the information given in the query. For example, Figure 1 exhibits a query “新冠病毒” in Chinese (“COVID-19” in English) and candidate documents in English. Through the multilingual KG, we can link “新冠病毒” to its aligned entity in English, i.e. “COVID-19”, and then extend to related neighbors such as “Fever”, “SARS-CoV-2” and “Oxygen Therapy”. Both the aligned entity and its local neighborhood may help enrich the sparse query and fill the linguistic gap between the query and the documents.
Along this line, we adopt the multilingual KG as an external source to facilitate CLIR and propose a HIerarchical Knowledge Enhancement (HIKE for short) mechanism to fully integrate the relevant knowledge. Indeed, queries are usually short but rich in entities. HIKE establishes a link between queries and multilingual KG through the entities mentioned in queries, and makes full use of the semantic information of entities and their neighborhood in KG with a hierarchical information fusion mechanism. Specifically, a knowledge-level fusion integrates the information in each individual language in the KG, and a language-level fusion combines the integrated information from different languages. The multilingual KG provides valuable information, which helps to reduce the disparity between different languages and is beneficial to the matching process over queries and documents.
To summarize, the contributions are as follows.
We adopt the external multilingual KG not only as an enhancement for sparse queries but also as an explicit bridge mitigating the gap between the query and the document in CLIR. To the best of our knowledge, this is the first work that utilizes multilingual KG for the neural CLIR task.
We propose HIKE, which makes full use of the entities mentioned in queries as well as their local neighborhoods in the multilingual KG to improve CLIR performance. HIKE contains a hierarchical information fusion mechanism that addresses the sparsity of queries and eases matching over query-document pairs.
Extensive experiments on a number of benchmark datasets in four languages (English, Spanish, French, Chinese) validate the effectiveness of HIKE against state-of-the-art baselines.
Current information retrieval models for cross-lingual tasks can be categorized into two groups: (i) translation-based approaches nie2010cross; zbib2019neural and (ii) semantic alignment approaches bai2010learning; sokolov2013boosting.
Early works mainly focus on translation-based models. One way is to translate queries to the target language of documents query-trans-1, or to translate the documents or corpus to the same language as queries doc-trans-1; doc-trans-2. The other is to translate both queries and documents to the same intermediate language, e.g. English kishida2003two. In both cases, they aim to simplify the process and use the monolingual information retrieval methods to solve the CLIR problem.
Recently, with the development of deep neural networks, semantic alignment approaches, which directly tackle the CLIR task without a translation step, have gained much attention. These methods align queries and documents into the same space with probabilistic or neural network methods and perform query-document matching in the aligned space. sokolov2013boosting proposed a method for learning bilingual n-gram correspondences from relevance rankings. share-repre presented a simple yet effective method using shared representations across CLIR models trained on different language pairs. The release of BERT bert led to breakthroughs in various NLP tasks jiang2020cross, including document ranking. Contextualized Embeddings for Document Ranking (CEDR) cedr is an effective method for using BERT to enhance prevalent neural ranking models such as KNRM knrm, PACRR pacrr and DRMM drmm. clirmatrix utilized a multilingual version of BERT (a.k.a. multilingual BERT or mBERT) for the CLIR task. These BERT-based neural ranking models achieve state-of-the-art results compared with other models.
Besides, due to the fast-growing scale of KGs such as Wikidata vrandevcic2014wikidata and DBpedia auer2007dbpedia, some studies use high-quality KGs as extra knowledge for information retrieval. word-entity presented a word-entity duet framework for utilizing KGs in ad-hoc retrieval. The Entity-Duet Neural Ranking Model (EDRM) entityduet, which introduces KGs to neural search systems, represents queries and documents by their word and entity annotations. Despite the popularity of KGs for information retrieval, work on KGs for CLIR is rare. zhang2016xknowsearch introduced a KG into CLIR systems using standard similarity measures for document ranking. However, this work does not use neural network models. To the best of our knowledge, our work is the first to incorporate multilingual KG information into the neural CLIR task.
In this section, we illustrate the overall framework of our HIKE model, including the model architecture and the detailed description of model components.
CLIR is a retrieval task in which search queries and candidate documents are written in different languages. Since search queries are usually short but rich in entities, HIKE establishes a connection between CLIR and the multilingual KG via the entities mentioned in queries, and leverages the KG information through these entities and their local neighborhoods in the KG. Specifically, for each entity, we obtain the following information from the multilingual KG: (i) the entity label (in large-scale KGs such as Wikidata vrandevcic2014wikidata and DBpedia auer2007dbpedia, the name of an entity is called its label), (ii) the entity description, (iii) the labels of neighboring entities, and (iv) the descriptions of neighboring entities. It is worth noting that all the information in the KG is multilingual, and the information in different languages is aligned. We leverage the above information to facilitate the CLIR task. Consider a query and a document, an entity mentioned in the query, and each of that entity's neighboring entities, all drawn from the entity set of the KG. Both the entity and its neighboring entities carry two kinds of information to incorporate: labels and descriptions. For a specific bilingual retrieval task, the entity has a label and a description in each of the two languages involved, and so does every neighboring entity; a language index indicates whether a given label or description is in the source language or the target language. All of this information, namely the query, the document, and the labels and descriptions of the entity and of its neighbors, is composed of sequences of tokens.
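The four kinds of KG information listed above can be pictured as a small record per linked entity. The following is a minimal sketch only: the class names `EntityInfo` and `LinkedEntity`, the language codes, and the placeholder strings are illustrative assumptions and are not taken from the paper; only the labels “新冠病毒”, “COVID-19” and “Fever” come from the running example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EntityInfo:
    """Label and description of one KG entity, keyed by language code."""
    labels: Dict[str, str]
    descriptions: Dict[str, str]


@dataclass
class LinkedEntity:
    """An entity mentioned in a query plus its local KG neighborhood."""
    entity: EntityInfo
    neighbors: List[EntityInfo] = field(default_factory=list)

    def texts(self, lang: str) -> List[Tuple[str, str]]:
        """Return (label, description) pairs: the entity first, then its neighbors."""
        items = [self.entity] + self.neighbors
        return [(item.labels[lang], item.descriptions[lang]) for item in items]


# Hypothetical record; descriptions and the Chinese neighbor label are placeholders.
covid = LinkedEntity(
    entity=EntityInfo(
        labels={"zh": "新冠病毒", "en": "COVID-19"},
        descriptions={"zh": "<zh description>", "en": "<en description>"},
    ),
    neighbors=[
        EntityInfo(
            labels={"zh": "<zh label>", "en": "Fever"},
            descriptions={"zh": "<zh description>", "en": "<en description>"},
        )
    ],
)

source_texts = covid.texts("zh")  # source-language view of the entity and neighbors
target_texts = covid.texts("en")  # aligned target-language view
```

Keeping both languages in the same record mirrors the statement that the KG information in different languages is aligned.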
HIKE incorporates the multilingual semantic information of the entities and their local neighborhoods from the KG into the CLIR model. The overall architecture of HIKE is shown in Figure 2. HIKE consists of three modules: an encoder module, a hierarchical information fusion module and a query-document matching module. Specifically, in the encoder module, HIKE utilizes multilingual BERT to embed the queries, documents, and semantic information from the KG into low-dimensional vectors. The encoder passes these embeddings to the hierarchical information fusion module, which merges the KG information into the queries and facilitates matching with documents. In particular, the knowledge-level (first-level) fusion integrates the information in the KG using the multi-head attention mechanism attention. We use two separate knowledge-level fusion modules to extract features from the source and target languages. The language-level (second-level) fusion then integrates the two representations of an entity, in the source and target languages, through a multi-layer perceptron. After the hierarchical information fusion, a matching model computes the relevance score of the query-document pair. The higher the score, the more relevant the document is to the query.
The encoder embeds the tokens from queries, documents, entities and neighboring entities. It consists of two parts: the Query and Document Duet Encoder (QD-Duet-Encoder) and the Knowledge Encoder (K-Encoder). QD-Duet-Encoder embeds a query-document pair into a fixed-dimensional vector, and K-Encoder transforms the label and description of an entity into another fixed-dimensional vector.
QD-Duet-Encoder concatenates the tokens from queries and documents into one sequence, using [CLS] and [SEP] as meta-tokens. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token bert. The encoder then sums the token embedding, segment embedding and positional embedding of each token to get the input embedding, and computes the output embedding that represents the semantic and matching information of the query-document pair. Embedding the query and document together lets the ranking model benefit from the deep semantic information of BERT in addition to individual contextualized token matching cedr. For a given query and document, the output of QD-Duet-Encoder, shown in Equation (1), is the [CLS] embedding of the BERT output.
where the encoder is a multilingual BERT model (we used BERT-base, multilingual cased) and the concatenation operator joins two sequences of tokens into one sequence.
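The input layout described for QD-Duet-Encoder can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `build_pair_input`, the trailing [SEP], the segment-id convention (0 for the query side, 1 for the document side) and the document-truncation rule are assumptions based on the usual BERT sentence-pair format, and tokenization itself is left out.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_pair_input(
    query_tokens: List[str], doc_tokens: List[str], max_len: int = 512
) -> Tuple[List[str], List[int]]:
    """Build one token sequence and its segment ids for a query-document pair."""
    # Three positions are reserved for [CLS] and the two [SEP] markers.
    budget = max_len - len(query_tokens) - 3
    if budget < 0:
        raise ValueError("query alone exceeds the maximum sequence length")
    doc_tokens = doc_tokens[:budget]  # assumed rule: truncate the document only

    tokens = [CLS] + query_tokens + [SEP] + doc_tokens + [SEP]
    # Segment 0 covers [CLS], the query and the first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(["a", "b"], ["c", "d", "e"])
```

In a full pipeline the [CLS] position of the encoder output for this sequence would serve as the query-document embedding; the same layout, with the label in place of the query and the description in place of the document, matches what is described for K-Encoder.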
K-Encoder embeds the knowledge information of an entity or a neighboring entity, in each of the two languages, into a feature vector. Inspired by the advantages of embedding the query and document together, we use [CLS] and [SEP] to concatenate the label and the description of an entity and obtain its embedding. For the neighboring entities of an entity, we collect the set of their labels and the set of their descriptions. The entity and its neighbors are all fed into K-Encoder to compute their feature embeddings as
where the encoder is again a multilingual BERT, and the language index indicates whether the parameters are for the source or the target language. We sort the neighboring entities in descending order of their relevance to the central entity and keep the top-ranked ones, where the number kept is a hyper-parameter. Specifically, we first run the popular KG embedding model TransE TransE to get the embeddings of entities, and then calculate the cosine similarity between two entities as the relevance score. The outputs are the [CLS] embeddings of the entity and of each neighboring entity, and the embeddings of the neighboring entities form a set of feature vectors. These outputs, together with the query-document embedding, will be treated as the inputs of the fusion module in the next subsection.
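The neighbor selection step (rank neighbors by cosine similarity to the central entity, keep the top ones) can be sketched in a few lines. The sketch assumes entity embeddings are already available, for instance from a trained TransE model; the function names and the toy two-dimensional vectors below are hypothetical and only illustrate the ranking logic.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_neighbors(
    center: str,
    neighbors: List[str],
    embeddings: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Sort neighbors by descending similarity to the center and keep the first k."""
    ranked = sorted(
        neighbors,
        key=lambda n: cosine(embeddings[center], embeddings[n]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings standing in for pre-trained KG embeddings (illustrative only).
toy_embeddings = {
    "center": [1.0, 0.0],
    "close": [1.0, 0.1],
    "orthogonal": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}
selected = top_k_neighbors(
    "center", ["orthogonal", "opposite", "close"], toy_embeddings, k=2
)
```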
Hierarchical Information Fusion
In this section, we detail the hierarchical information fusion module, which is a two-level fusion mechanism, comprising knowledge-level fusion and language-level fusion.
Knowledge-Level Fusion contains two modules: a multi-head self-attention mechanism and an information aggregator. With both modules, our model can learn the shared semantic information among the entity, its neighboring entities and the query-document pair. In the self-attention mechanism, the embeddings of the query-document pair, the entity and the neighboring entities are gathered together and fed into the attention module to calculate the attention values. The input matrix is denoted as:
where the stacking operation arranges row vectors into a matrix. The input matrix thus contains the embeddings of the query, the document, the entity and the local neighborhood of the entity. To capture more valuable information, we utilize the multi-head attention mechanism attention to learn better latent semantic information. The self-attention module takes three inputs (the query, the key, and the value), whose width is determined by the embedding size. For concreteness, we describe a single head of the multi-head attention mechanism. First, the self-attention model maps each embedding in the input matrix to a query, a key and a value through a linear transformation layer. Then each query embedding attends to each key embedding through the scaled dot-product attention mechanism attention, which yields the attention scores. Finally, the attention scores are applied to the values to calculate a new representation of the input, which is formulated as:
Therefore, each row of the attention output incorporates semantic information from the rows of the input matrix. Furthermore, a layer normalization operation ba2016layer is applied to the output of the attention model to obtain the representation of each head. Next, we pack the multi-head information using the following operation:
Here the packing step uses a parameter matrix, and the number of heads is a hyper-parameter.
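The attention step described above follows the standard scaled dot-product formulation from the cited attention work. The sketch below shows one head in plain Python, under simplifying assumptions: the learned linear projections, the layer normalization and the multi-head packing are omitted, so the same toy matrix is passed as query, key and value. All names and numbers are illustrative and not taken from the paper.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x m) matrix by an (m x p) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(
    q: Matrix, k: Matrix, v: Matrix
) -> Tuple[Matrix, Matrix]:
    """Return (output, weights) with weights = softmax(q k^T / sqrt(d_k))."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: one row per stacked embedding (e.g. pair, entity, one neighbor).
stacked = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = scaled_dot_product_attention(stacked, stacked, stacked)
```

Each output row is a weighted average of all value rows, which is how one row can absorb information from the others.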
Accordingly, we obtain the representation after multi-head attention for each language, with separate parameters for the source and target languages. Its rows are the output vectors of the multi-head self-attention for the query-document pair, the entity and the neighboring entities. Finally, an information aggregator, consisting of the linear transformation layer in Equation (6), computes the final representation of the knowledge-level features.
where the vectorization function concatenates the rows of a matrix into one long vector, and the linear layer is parameterized by a parameter matrix and a fixed-dimensional vector. The resulting knowledge-level representation incorporates the deep semantic information from the KG.
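The aggregator described above (flatten the attention output row by row, then apply a linear transformation) can be sketched as follows. The helper names, the explicit bias term and the toy weights are assumptions for illustration; the paper only states that the aggregator is a linear transformation layer applied to the vectorized matrix.

```python
from typing import List

Matrix = List[List[float]]


def vectorize(matrix: Matrix) -> List[float]:
    """Concatenate the rows of a matrix into one long vector."""
    return [x for row in matrix for x in row]


def aggregate(matrix: Matrix, weight: Matrix, bias: List[float]) -> List[float]:
    """Apply a linear layer (weight @ vec(matrix) + bias) to the flattened matrix."""
    flat = vectorize(matrix)
    if any(len(w_row) != len(flat) for w_row in weight):
        raise ValueError("weight width must equal the flattened input length")
    return [
        sum(w * x for w, x in zip(w_row, flat)) + b
        for w_row, b in zip(weight, bias)
    ]


# Toy 2 x 2 attention output mapped to a 2-dimensional knowledge-level vector.
attended = [[1.0, 2.0], [3.0, 4.0]]
toy_weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
toy_bias = [0.5, -0.5]
knowledge_vector = aggregate(attended, toy_weight, toy_bias)
```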
Language-Level Fusion combines the query-document pair information with the source-language and target-language representations obtained from the knowledge-level fusion. The query-document embedding serves as guidance in the fusion process, as denoted by the blue arrow in Figure 2. These embeddings are then combined by a linear transformation layer followed by an activation function to generate a unified representation as:
where the two language indices denote the source and target languages, and the two parameters of the layer are learned. The output is the unified embedding that incorporates the information from queries, documents, and the multilingual KG.
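One way to picture the language-level fusion is a single dense layer over the concatenation of the three vectors. The sketch below is an assumption-laden illustration rather than the paper's exact formula: the concatenation order, the choice of `tanh` as activation, and the names `h_qd`, `h_src` and `h_tgt` are all hypothetical, since the original equation and activation are not recoverable from the text.

```python
import math
from typing import List

Matrix = List[List[float]]


def linear(x: List[float], weight: Matrix, bias: List[float]) -> List[float]:
    """Dense layer: weight @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def language_level_fusion(
    h_qd: List[float],
    h_src: List[float],
    h_tgt: List[float],
    weight: Matrix,
    bias: List[float],
) -> List[float]:
    """Fuse pair, source-language and target-language vectors into one vector."""
    fused_input = h_qd + h_src + h_tgt  # list concatenation (assumed layout)
    return [math.tanh(z) for z in linear(fused_input, weight, bias)]  # assumed activation


# Toy 2-dimensional vectors and a 2 x 6 weight matrix (illustrative values).
unified = language_level_fusion(
    h_qd=[0.5, 0.0],
    h_src=[0.0, 0.0],
    h_tgt=[0.0, -0.5],
    weight=[[1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]],
    bias=[0.0, 0.0],
)
```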
Finally, HIKE uses the matching function to obtain the score of a query-document pair. In particular, the resulting embeddings are concatenated and fed into another linear layer to obtain the relevance score of the query-document pair:
where the output is the ranking score between the query and the document, the two parameters of the layer are learned, and an activation function converts the result into a probability over the classes.
In the training stage, we use standard pairwise hinge loss to train the model as shown in Equation (9).
In Equation (9), the two document sets are the relevant documents and the irrelevant documents of the query.
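A pairwise hinge loss over relevant and irrelevant documents is commonly written as max(0, margin - s(q, d+) + s(q, d-)). The following sketch uses that common form; the margin of 1.0, the averaging over all pairs, and the function name are assumptions, because the exact form of Equation (9) is not recoverable from the text.

```python
from typing import List


def pairwise_hinge_loss(
    pos_scores: List[float], neg_scores: List[float], margin: float = 1.0
) -> float:
    """Average hinge loss over every (relevant, irrelevant) score pair of one query."""
    losses = [
        max(0.0, margin - pos + neg)
        for pos in pos_scores
        for neg in neg_scores
    ]
    return sum(losses) / len(losses) if losses else 0.0


# The loss vanishes once a relevant document outscores an irrelevant one by the margin.
separated = pairwise_hinge_loss([2.0], [0.0])
violated = pairwise_hinge_loss([0.9], [0.2])
```

The loss only penalizes pairs whose score gap is smaller than the margin, which is what pushes relevant documents above irrelevant ones in the ranking.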
In this section, we describe the details of our experiments, including the dataset, the multilingual KG, baselines, evaluation metrics and implementation details.
We evaluate the HIKE model on the public CLIR dataset CLIRMatrix clirmatrix. Specifically, we use the MULTI-8 set in CLIRMatrix, in which queries and documents are jointly aligned in 8 different languages. The dataset is mined from 49 million unique queries and 34 billion (query, document, relevance label) triplets. The relevance label indicates the relevance of the query-document pair: the higher the value, the more relevant the pair. In MULTI-8, the queries remain the same regardless of the document language. For instance, the three language pairs English-Spanish, English-French and English-Chinese share the same queries. We choose four widely used languages for the bilingual retrieval tasks: English (EN), French (FR), Spanish (ES) and Chinese (ZH). Taking each ordered pair of distinct languages gives 12 language pairs for training, validation and testing. The training set of every language pair contains 10,000 queries, while the validation and test sets contain 1,000 queries each. The number of candidate documents for each query is 100. We use the test1 set in MULTI-8 as our test set to verify the model performance. The statistics of the datasets are summarized in Table 1.
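The count of 12 language pairs follows from taking every ordered (query language, document language) combination of the four chosen languages. A short worked check, using only the figures stated above (the variable names are illustrative):

```python
from itertools import permutations

LANGUAGES = ["EN", "FR", "ES", "ZH"]
TRAIN_QUERIES_PER_PAIR = 10_000
CANDIDATES_PER_QUERY = 100

# Ordered pairs of distinct languages: (query language, document language).
language_pairs = list(permutations(LANGUAGES, 2))

# Query-document candidates available in the training set of one language pair.
train_candidates_per_pair = TRAIN_QUERIES_PER_PAIR * CANDIDATES_PER_QUERY

print(len(language_pairs))        # 12
print(train_candidates_per_pair)  # 1000000
```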
NDCG values of baselines and our model. Numbers in the table are in percentages. * marks statistically significant improvements (t-test with p-value < 0.05) compared with the best baseline.
We use Wikidata vrandevcic2014wikidata, a multilingual KG with entities and relations in a multitude of languages. At the time of writing, Wikidata contains more than 94 million entities and more than 2,000 kinds of relations. The entities mentioned in queries are annotated by mGENRE decao2020multilingual, a multilingual entity linking model with high linking accuracy on 105 languages. Table 2 shows the average number of neighboring entities in each dataset.
To demonstrate the effectiveness of our model, we compare the performance with the following baselines.
Vanilla BERT cedr; clirmatrix: a fine-tuned multilingual BERT model for CLIR.
CEDR cedr: the contextualized embeddings for document ranking (CEDR) model. This model can be applied to various popular neural ranking models, including KNRM knrm, DRMM drmm and PACRR pacrr, to form CEDR-KNRM/DRMM/PACRR.
HIKE variant without hierarchical fusion: a variant of HIKE that concatenates the KG information with the query directly. The only difference from the full HIKE model is that this variant does not use the hierarchical information fusion mechanism.
|HIKE w/o descriptions||85.39||84.29||84.05||77.27||79.09||76.35||77.79||80.95||76.41||68.69||70.31||69.51|
|HIKE w/o labels||85.47||84.86||84.81||78.34||79.57||76.38||78.58||81.36||76.71||69.29||70.59||70.34|
|HIKE w/o neighboring entities||85.33||84.47||84.58||78.03||78.17||76.65||78.15||80.90||76.55||68.65||70.23||69.09|
|HIKE w/o target language information||84.68||83.98||83.84||77.70||78.39||76.22||77.79||81.18||76.25||68.59||69.94||69.09|
Normalized Discounted Cumulative Gain (NDCG) is adopted for evaluation. We report NDCG@1, NDCG@5 and NDCG@10 (which evaluate only the top 1, 5 and 10 returned documents, respectively) as the metrics for all language pairs.
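For reference, NDCG@k can be computed from the relevance labels of a ranked list as shown below. This sketch assumes the widely used exponential gain (2^rel - 1) with a log2 position discount, and an ideal ranking obtained by sorting the same candidate list; the paper does not state which NDCG variant its evaluation uses, so treat these choices as assumptions.

```python
import math
from typing import List


def dcg_at_k(relevances: List[int], k: int) -> float:
    """Discounted cumulative gain of the first k items of a ranked list."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances: List[int], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the returned documents, in the order the model ranked them.
perfect = ndcg_at_k([3, 2, 1], k=3)
missed_top = ndcg_at_k([0, 1], k=1)
```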
In the training stage, the number of heads of the multi-head attention mechanism in knowledge-level fusion is set to 6. To reduce GPU memory usage and training time, we save the embeddings of entity information before training. The number of entities extracted from the KG is 376,785. We only fine-tune the BERT model to obtain textual representations. Two learning rates are used, one for BERT and one for the other modules, set to 1e-5 and 1e-3, respectively. We set the number of neighboring entities in the KG to 3. For entities without enough neighboring entities, we copy the existing neighboring entities to fill the remaining slots. We randomly sample 1600 query-document pairs as training data per epoch. The maximum number of training epochs is 15.
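The rule of copying existing neighbors when an entity has too few can be sketched as below. The cycling order, the truncation of longer lists, and the handling of entities with no neighbors at all (an empty result here) are assumptions; the text only states that existing neighboring entities are copied.

```python
from itertools import cycle, islice
from typing import List


def pad_neighbors(neighbors: List[str], k: int) -> List[str]:
    """Return exactly k neighbors by repeating the existing ones in order.

    Lists longer than k are cut to their first k items; an entity with no
    neighbors yields an empty list (assumed behavior).
    """
    if not neighbors:
        return []
    return list(islice(cycle(neighbors), k))


# With the setting of 3 neighbors per entity used in the experiments:
padded_one = pad_neighbors(["n1"], 3)
padded_two = pad_neighbors(["n1", "n2"], 3)
```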
We conduct three experiments to demonstrate the effectiveness of the HIKE model.
Table 3 summarizes the evaluation results of the different cross-lingual retrieval models, from which we have the following findings. (i) HIKE significantly and consistently outperforms all the baseline models on the 12 language pairs w.r.t. all metrics, which demonstrates the effectiveness of the proposed model. (ii) Compared with Vanilla BERT, the improvement of HIKE reflects the usefulness and importance of the KG. The external KG compensates for the sparsity of queries and provides accurate information for ranking the documents. Moreover, HIKE performs better than its variant without hierarchical fusion, which shows the advantage of our hierarchical fusion mechanism. (iii) HIKE achieves substantial improvements in both NDCG@1 and NDCG@5 on most datasets compared with the other models, which indicates that the knowledge learned from the entities and neighboring entities is highly related to the task. This result shows that HIKE is capable of ranking the most relevant documents at the top.
All these findings indicate that KG information and the hierarchical information fusion facilitate the CLIR task and narrow the gap between different languages.
In this section, we conduct an ablation study to examine the effectiveness of the different kinds of information used in HIKE. Specifically, we run the following experiments:
Remove the labels or the descriptions of entities and neighboring entities to verify their effect.
Remove the information of neighboring entities to study the influence of the local neighborhood.
Remove the target-language information to assess its importance in document ranking.
The results are shown in Table 4. From the results, we observe that (i) the full HIKE model obtains better ranking performance than all the ablated models, indicating that every part of our model contributes to the ranking performance. (ii) The model without entity labels outperforms the one without entity descriptions. We conjecture that entity descriptions carry more information than labels and therefore provide more useful signal for the CLIR task. (iii) The model without target-language information performs worst in most settings of our ablation test. This demonstrates that target-language information plays a significant role in the CLIR task, since it establishes an explicit connection between the query in the source language and the documents in the target language.
The Effect of Neighboring Entity Number
In this subsection, we explore the influence of the number of neighboring entities. We vary the number of neighboring entities from 1 to 7 with a step size of 2 and conduct the experiments over all datasets. Figure 3 shows the results, which are divided into four groups according to the source language. Each group contains three different target languages. The figure shows that there is an optimal number of neighbors for each language pair. The model performance first goes up as the number of neighboring entities increases and then falls once the optimal value is passed. We conjecture that models with few neighbors cannot take full advantage of the local neighborhood information in the KG, resulting in weak NDCG@10 values, while a large number of neighboring entities may bring in unrelated information, leading to unsatisfactory results as well.
In this paper, we presented HIKE, a hierarchical knowledge-enhanced model for the CLIR task. HIKE introduces an external multilingual KG into the CLIR task and is equipped with a hierarchical information fusion mechanism to take full advantage of the KG information. Specifically, the knowledge-level fusion integrates the KG information in each language, and the language-level fusion combines the information from both the source and target languages. The multilingual KG provides valuable information for the CLIR task, which helps bridge the gap between queries and documents in different languages. Finally, extensive experiments on benchmark datasets validated the superiority of HIKE over various state-of-the-art baselines.
This work is supported by Alibaba Group through Alibaba Innovative Research Program. The research work is supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002104, the National Natural Science Foundation of China under Grant No. U1836206, 62176014, U1811461, and the China Postdoctoral Science Foundation under Grant No. 2021M703273. Xiang Ao is also supported by the Project of Youth Innovation Promotion Association CAS and Beijing Nova Program Z201100006820062.