A Study of Neural Matching Models for Cross-lingual IR

by Puxuan Yu, et al.
University of Massachusetts Amherst

In this study, we investigate interaction-based neural matching models for ad-hoc cross-lingual information retrieval (CLIR) using cross-lingual word embeddings (CLWEs). With experiments conducted on the CLEF collection over four language pairs, we evaluate and provide insight into different neural model architectures, different ways to represent query-document interactions and word-pair similarity distributions in CLIR. This study paves the way for learning an end-to-end CLIR system using CLWEs.






1. Introduction

CLIR is the task of retrieving documents in a target language $L_t$ with queries written in a source language $L_s$. The increasing popularity of projection-based weakly-supervised (xing2015normalized; glavavs2019properly; joulin2018loss) and unsupervised (conneau2017word; artetxe2018unsupervised) cross-lingual word embeddings has spurred unsupervised frameworks (litschko2018unsupervised) for CLIR, while in mono-lingual IR, interaction-based neural matching models (xiong2017end; guo2016deep; pang2016text) that exploit the semantics contained in word embeddings have been the dominant force. This study fills the gap by utilizing CLWEs in neural IR models for CLIR.

Traditional CLIR approaches translate either the documents or the queries using an off-the-shelf SMT system so that queries and documents are in the same language. A number of researchers (ture2013flat; ture2014exploiting) later investigated using translation tables to build a probabilistic structured query (darwish2003probabilistic) in the target language. Recently, Litschko et al. showed that CLWEs are good translation resources by experimenting with a CLIR method (dubbed TbT-QT) that translates each query term in the source language to the nearest target language term in the CLWE space (litschko2018unsupervised). CLWEs are obtained by aligning two separately trained embeddings for two languages in the same latent space, where a term in the source language $L_s$ is proximate to its synonyms in $L_s$ and its translations in the target language $L_t$, and vice versa. TbT-QT takes only the top-1 translation of a query term and uses the query likelihood model (ponte1998language) for retrieval. The overall retrieval performance can therefore be damaged by vocabulary mismatch magnified by translation error. Using the closeness between query and document terms in the shared CLWE space as a matching signal for relevance can alleviate the problem, but this area has not been extensively studied.

The reasons for the success of neural IR models for mono-lingual retrieval can be grouped into two categories:

Pattern learning: the construction of word-level query-document interactions enables learning of various matching patterns (e.g., proximity, paragraph match, exact match) via different neural network architectures.

Representation learning: models in which interaction features are built with differentiable operations (e.g., kernel pooling (xiong2017end)) allow customizing word embeddings via end-to-end learning from large-scale training data.

Although representation learning is capable of further improving overall retrieval performance (xiong2017end), it was shown in the same study that updating word embeddings requires large-scale training data to work well (more than 100k search sessions in their case). In CLIR, however, datasets usually have fewer than 200 queries per available language pair and can only support training neural models with smaller capacity. Therefore, we focus on the pattern learning aspect of neural models.

In this study, we formulate the following research questions:

  • RQ1: how should a neural model for mono-lingual retrieval be adapted for CLIR?

  • RQ2: how do neural models compare with each other and with unsupervised models for CLIR?

We answer these two main research questions with analysis (§ 2), experiments (§ 3) and discussions (§ 4) in the rest of the paper.

2. Analysis

2.1. Unsupervised CLIR Methods with CLWEs

Two unsupervised CLIR approaches using CLWEs were proposed by Litschko et al. (litschko2018unsupervised). BWE-Agg ranks documents with respect to a query using the cosine similarity of query and document embeddings, obtained by aggregating the CLWEs of their constituent terms. The simpler version, BWE-Agg-Add, takes the average embedding of all terms for queries and documents, while the more advanced version, BWE-Agg-IDF, builds document embeddings by weighting terms with their inverse document frequencies. TbT-QT, as described in § 1, first translates each query term to its nearest cross-lingual neighbor term and then adopts query likelihood in a mono-lingual setting.

These two approaches represent different perspectives on CLIR using CLWEs. BWE-Agg builds query and document representations out of CLWEs but completely neglects exact matching signals, which play important roles in IR. Also, although document terms can be weighted by IDF, using only one representation for a long document can fail to emphasize the section of the document that is truly relevant to the query. TbT-QT uses CLWEs only as a query translation resource and adopts exact matching in a mono-lingual setting, so its performance depends heavily on the translation accuracy (precision@1) of the CLWEs. Analytically, an interaction-based neural matching model that starts with word-level query-document interactions and considers both exact and similarity matching can make up for the shortcomings of these two methods.
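The two baselines can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of Litschko et al.: the function names, the dictionary-based embedding tables `src_emb` and `tgt_emb`, and the `idf` lookup are hypothetical, and the retrieval step that follows TbT-QT (query likelihood or BM25) is omitted.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def aggregate(tokens, emb, idf=None):
    """Text embedding: plain average (Add) or IDF-weighted average (IDF).

    Assumes at least one token is present in `emb`.
    """
    vecs = [emb[t] * (idf.get(t, 1.0) if idf else 1.0) for t in tokens if t in emb]
    return unit(np.mean(vecs, axis=0))  # cosine ignores the overall scale

def bwe_agg_score(query, doc, src_emb, tgt_emb, idf=None):
    """BWE-Agg: cosine of aggregated query and document embeddings.

    Only the document side is IDF-weighted, as described in the text.
    """
    return float(aggregate(query, src_emb) @ aggregate(doc, tgt_emb, idf))

def tbt_qt(query, src_emb, tgt_emb):
    """TbT-QT translation step: nearest target-language neighbour per query term."""
    tgt_words = list(tgt_emb)
    tgt_mat = np.stack([unit(tgt_emb[w]) for w in tgt_words])
    return [tgt_words[int(np.argmax(tgt_mat @ unit(src_emb[t])))]
            for t in query if t in src_emb]
```

Note that `bwe_agg_score` collapses the whole document into one vector, while `tbt_qt` discards everything except the single nearest neighbour; these are exactly the two limitations discussed above.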

2.2. Neural IR Models

2.2.1. Background

For interaction-based matching models, we select three representative models (MatchPyramid (pang2016text; pang2016study), DRMM (guo2016deep) and KNRM (xiong2017end)) from the literature for analysis and experiments.

MatchPyramid: MatchPyramid (pang2016text; pang2016study) (MP for short) is one of the earliest models that start by capturing word-level matching patterns for retrieval. It casts the ad-hoc retrieval task as a series of image recognition problems, where the “image” is the matching matrix $M$ of a query-document pair $(q, d)$, and each “pixel” $M_{ij}$ is the interaction value of a query term $q_i$ and a document term $d_j$. Typical interaction functions are cosine similarity, dot product, Gaussian kernel, and indicator function (for exact match). The intuition behind hierarchical convolutions and pooling is to model phrase, sentence and even paragraph level matching patterns.

DRMM: The DRMM (guo2016deep) model uses a matching histogram to capture the interactions of a query term with the whole document. The valid interval of cosine similarity (i.e., $[-1, 1]$) is discretized into a fixed number of bins such that a matching histogram is essentially a fixed-length integer vector. Features from different histograms are weighted based on attention calculated on query terms. DRMM is not position-preserving, as the authors claim that relevance matching is not related to term order.
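A matching histogram of this kind can be sketched as follows. This is a simplified illustration, assuming unit-normalised embeddings and equal-width bins over $[-1, 1]$; the function name `log_count_histogram` is hypothetical, and the dedicated exact-match bin of the original DRMM is folded into the last bin here.

```python
import numpy as np

def log_count_histogram(q_vec, doc_mat, n_bins=30):
    """Log-count matching histogram of one query term against a document.

    q_vec:   (dim,) unit-normalised query-term embedding.
    doc_mat: (n_doc_terms, dim) unit-normalised document-term embeddings.
    Returns a (n_bins,) vector of log(1 + count) per cosine-similarity bin.
    """
    sims = np.clip(doc_mat @ q_vec, -1.0, 1.0)
    # Map [-1, 1] onto bin indices 0..n_bins-1; sim == 1 falls in the last bin.
    idx = np.minimum(((sims + 1.0) / 2.0 * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    return np.log1p(counts)
```

Because the histogram only counts similarities, shuffling the rows of `doc_mat` leaves the output unchanged, which is what "not position-preserving" means in practice.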

KNRM: The KNRM (xiong2017end) model takes a matrix representation of the query-document interaction (similar to MP), but “categorizes” interactions into different levels of cosine similarity (similar to DRMM), using Gaussian kernels with different mean values $\mu$. The distinct advantage of KNRM over DRMM is that the former allows gradients to pass through the Gaussian kernels, and therefore supports end-to-end learning of embeddings.
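Kernel pooling can be viewed as a soft, differentiable counterpart of the DRMM histogram. The sketch below is a NumPy illustration under assumptions: the function name `kernel_pooling`, the clipping constant, and the evenly spaced kernel means in the usage comment are illustrative choices, not settings taken from this study.

```python
import numpy as np

def kernel_pooling(sim, mus, sigma=0.1):
    """Soft-count features from a cosine-similarity matrix.

    sim: (n_query_terms, n_doc_terms) matching matrix.
    mus: (n_kernels,) Gaussian kernel means.
    Returns a (n_kernels,) feature vector.
    """
    # Kernel k applied to every cell: exp(-(M_ij - mu_k)^2 / (2 * sigma^2)).
    k = np.exp(-((sim[:, :, None] - mus[None, None, :]) ** 2) / (2 * sigma ** 2))
    per_query_term = k.sum(axis=1)  # soft count per query term and kernel
    # Log of the soft counts, summed over query terms (clipped to avoid log(0)).
    return np.log(np.clip(per_query_term, 1e-10, None)).sum(axis=0)

# Hypothetical usage: 20 kernels spread evenly over the cosine range.
# features = kernel_pooling(sim, np.linspace(-0.95, 0.95, 20))
```

Every operation here is smooth in `sim`, which is why gradients can flow back into the embeddings, unlike with hard histogram bins.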

2.2.2. Mono-lingual to Cross-lingual

According to the results reported in the respective studies (pang2016text; guo2016deep; xiong2017end), the relative performance of the three models for mono-lingual IR should be KNRM > DRMM > MP, even when embedding learning is turned off for KNRM. Adapting a neural model to support CLIR is straightforward: instead of taking the interaction value to be two terms’ similarity in a mono-lingual embedding space, we take the proximity of their representations in the shared cross-lingual embedding space. However, there are several matters to consider when making the transition:

Exact matching signals: The significant difference between cross-lingual and mono-lingual IR is that the former (almost) never encounters an exact match between terms in different languages. However, neglecting exact matching can be costly for models like MP, whose disadvantage compared to the other two models is the inability to capture exact and similarity matching signals at the same time. To this end, we first define the exact matching of two terms (in different languages) in CLIR as their cosine similarity in the CLWE space exceeding a certain threshold value $\eta$. We then implement a hybrid version, namely MP-Hybrid, that joins exact and soft matching signals extracted from interaction matrices built with the indicator function and the cosine similarity function, such that ranking features from the two channels are concatenated for an MLP to predict a ranking score.

Word-pair similarity distribution: The cosine similarities of two terms with close meanings but in different languages are distributed differently from those of terms in the same language. Specifically, the top word-pair similarity distributions of CLWEs tend to have smaller mean and variance. In the example shown in Table 1, the cosine similarity of the five closest words to “telephone” in an English embedding space (https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip) ranges from 0.818 to 0.669, while in the aligned English-Spanish embedding space (https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.es.align.vec) it ranges from 0.535 to 0.520. The similarity distribution affects the histogram construction of DRMM and, similarly, the kernel pooling of KNRM. The distribution also affects the exact matching threshold value for the related variants of MP. Since the cosine similarity of a query term and its perfectly correct translation can be less than 0.6, setting the threshold too high can lead to failure to capture positive matching signals.

EN   phone        telephones   Telephone    landline     rotary-dial
     0.818        0.761        0.720        0.694        0.669
ES   telefónicos  teléfono     telefónica   telefónia    telefóno
     0.535        0.522        0.522        0.520        0.520
Table 1. Cosine similarities of the top-5 closest words to “telephone” in an English embedding space (EN) and in an aligned English-Spanish embedding space (ES).

Vocabulary mismatch and translation error: Query translation based CLIR methods (e.g., TbT-QT (litschko2018unsupervised)) first translate queries from the source language $L_s$ to the target language $L_t$, then use mono-lingual retrieval in $L_t$. Apart from the inherent vocabulary mismatch problem within $L_t$, the translation error from $L_s$ to $L_t$ also has to be taken into account. Looking at the example in Table 1, TbT-QT would look for occurrences of “telefónicos” in the collection, and documents containing only the correct translation (“teléfono”) would be overlooked. Interaction-based neural matching models alleviate this issue by giving partial credit to sub-optimal nearest neighbors, which in many cases are the correct translations. To demonstrate the necessity of directly using cross-lingual word embedding similarity as the interaction for neural models, we conduct comparative experiments where queries are first translated term by term using CLWEs, as in TbT-QT, and then used for retrieval in a mono-lingual setting. Such models are referred to as {MP,DRMM,KNRM}-TbT-QT, respectively.
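The Table 1 example can be worked through directly. The similarity values below are copied from Table 1; the helper names, the toy document, and the choice to score unseen terms as 0 are assumptions made only for illustration.

```python
# Cosine similarities of Spanish neighbours to EN "telephone" (from Table 1).
neighbours = {
    "telefónicos": 0.535,
    "teléfono": 0.522,
    "telefónica": 0.522,
    "telefónia": 0.520,
    "telefóno": 0.520,
}

def top1_translation(sims):
    """Nearest neighbour in the CLWE space, as used by TbT-QT."""
    return max(sims, key=sims.get)

def hard_match(doc_terms, sims):
    """TbT-QT style signal: 1 only if the top-1 translation occurs in the document."""
    return float(top1_translation(sims) in doc_terms)

def soft_match(doc_terms, sims):
    """Interaction style signal: best similarity to any document term.

    Terms missing from `sims` are scored 0 (a simplification for this toy case).
    """
    return max((sims.get(t, 0.0) for t in doc_terms), default=0.0)

doc = ["el", "teléfono", "suena"]  # hypothetical document with the correct translation
```

Here the top-1 translation is "telefónicos", so `hard_match(doc, neighbours)` finds nothing, while `soft_match(doc, neighbours)` still credits "teléfono" with its similarity of 0.522.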

3. Experiments

Datasets: We evaluate the models on the CLEF test suite for the CLEF 2000-2003 campaigns. We select four language pairs: English (EN) queries to {Dutch (NL), Italian (IT), Finnish (FI), Spanish (ES)} documents. All documents for the four languages are used for evaluation, and are truncated to preserve the first 500 tokens for computational efficiency (pang2016study). The statistics of the evaluation datasets are shown in Table 2. The titles of CLEF topics are used as English queries. All queries and documents are lower-cased, with stopwords, punctuation marks and one-character tokens removed.

Cross-lingual word embeddings: We adopt the pre-aligned fastText CLWEs (https://fasttext.cc/docs/en/aligned-vectors.html). Mono-lingual fastText embeddings are trained on the Wikipedia corpus in the respective languages, and aligned using weak supervision from a small bilingual lexicon with the RCSLS loss as the optimization objective (joulin2018loss).


           EN→NL    EN→IT    EN→FI    EN→ES
#queries   160      160      90       160
#docs      42,734   40,320   16,351   46,540
#rel       29.1     19.5     10.9     49.5
#label     375.4    338.3    282.6    372.7
Table 2. Basic statistics of CLEF data for evaluation: number of queries (#queries), number of documents (#docs), average number of relevant documents per query (#rel), and average number of labeled documents per query (#label).

Model specifications: We implemented two CLWE-based unsupervised CLIR algorithms, BWE-Agg and TbT-QT, as baselines (litschko2018unsupervised). In addition to the query likelihood model used in the original study, we pair TbT-QT with BM25 to investigate the influence of the retrieval model on queries translated using CLWEs.

We experiment with five variants of the MP model, two of the DRMM model and two of the KNRM model. As the interaction value of query term $q_i$ and document term $d_j$, {MP,DRMM,KNRM}-Cosine uses the cosine similarity $\cos(\vec{q}_i, \vec{d}_j)$, MP-Gaussian uses $\exp(-\lVert \vec{q}_i - \vec{d}_j \rVert^2)$, and MP-Exact takes $\mathbb{1}[\cos(\vec{q}_i, \vec{d}_j) > \eta]$, where $\eta$ is a pre-defined threshold value (set to 0.3 for Table 3). MP-Hybrid concatenates the flattened features after the dynamic pooling layer from MP-Cosine and MP-Exact into one vector, and uses an MLP to predict a final score. {MP,DRMM,KNRM}-TbT-QT is equivalent to first translating the source language query term by term into a target language query, and then running the {MP,DRMM,KNRM}-Cosine model on it.
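The three ways of filling the MP matching matrix can be sketched as below. This is an illustration rather than the code used in the experiments: the function names and the parameter name `eta` are assumptions, the Gaussian form follows the common MatchPyramid definition, and the default threshold of 0.3 is the value reported for Table 3.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row vector to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cosine_matrix(q, d):
    """q: (n_q, dim) query-term embeddings, d: (n_d, dim) document-term embeddings."""
    return normalize_rows(q) @ normalize_rows(d).T

def gaussian_matrix(q, d):
    """exp(-||q_i - d_j||^2) for every query/document term pair."""
    diff = q[:, None, :] - d[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=2))

def exact_matrix(q, d, eta=0.3):
    """Cross-lingual 'exact match': 1 where cosine similarity exceeds eta."""
    return (cosine_matrix(q, d) > eta).astype(float)
```

Each function returns an `(n_q, n_d)` matrix, so the same convolution and dynamic pooling layers can be applied to any of them; a hybrid model would run two such channels and concatenate the pooled features.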

For the MP model, we adopt one layer convolution with kernel size set to , dynamic pooling size set to , and kernel count set to 64. For the DRMM model, we adopt the log-count-based histogram with bin size set to 30. For the KNRM model, kernel count is set to 20 and standard deviation of each Gaussian kernel is set to 0.1. All decisions made above are based on extensive hyper-parameter tuning that first prioritizes generalizable retrieval performance, then computational efficiency and model simplicity.

Model training: All neural models in the experiments are trained with the pairwise hinge loss. Given a triple $(q, d^+, d^-)$, where document $d^+$ is relevant and document $d^-$ is non-relevant with respect to query $q$, the loss function is defined as:

$$\mathcal{L}(q, d^+, d^-; \Theta) = \max\left(0,\ 1 - s(q, d^+) + s(q, d^-)\right)$$

where $s(q, d)$ denotes the predicted matching score for $(q, d)$, and $\Theta$ represents the learnable parameters in the neural network. Note that we randomly select documents that are explicitly labeled non-relevant (-1) as negative samples for training. Five negative pairs are sampled for each positive pair. We optimize with Adam, a stochastic gradient descent method (learning rate = 1e-3), in mini-batches of size 64. The maximum number of training epochs allowed is 20.
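The training signal described above can be sketched in plain Python. The margin of 1, the label encoding (1 for relevant, -1 for explicitly non-relevant) and both function names are assumptions for illustration; the scoring model itself is left out.

```python
import random

def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    """Zero once the relevant document outscores the non-relevant one by `margin`."""
    return max(0.0, margin - score_pos + score_neg)

def sample_triples(query_id, labels, n_neg=5, rng=random):
    """Build (query, positive, negative) training triples.

    labels: {doc_id: 1 (relevant) or -1 (explicitly non-relevant)}.
    Each positive is paired with `n_neg` randomly drawn explicit negatives.
    """
    pos = [d for d, y in labels.items() if y == 1]
    neg = [d for d, y in labels.items() if y == -1]
    if not neg:
        return []
    return [(query_id, p, rng.choice(neg)) for p in pos for _ in range(n_neg)]
```

With two relevant documents and `n_neg=5`, `sample_triples` yields ten triples; unlabeled documents are never used as negatives, matching the note about explicit non-relevance labels.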


Evaluation: As the CLEF dataset uses binary relevance judgements, we adopt MAP as the evaluation metric. To evaluate on enough queries for conclusions to be statistically significant, we adopt 5-fold cross-validation with validation and test sets. Statistical significance tests are performed using the two-tailed paired t-test.
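For reference, MAP under binary relevance can be computed as in the sketch below. The function names and the dictionary layout of `runs` and `qrels` are hypothetical; standard evaluation tools such as trec_eval compute the same quantity.

```python
def average_precision(ranked_doc_ids, relevant):
    """Average of precision@k over the ranks k at which a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """runs: {query_id: ranked doc ids}; qrels: {query_id: set of relevant doc ids}."""
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)
```

For a ranking `["a", "b", "c"]` with relevant set `{"a", "c"}`, the precisions at the two relevant ranks are 1/1 and 2/3, giving an average precision of 5/6.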


              EN→NL   EN→IT   EN→FI   EN→ES
BWE-Agg-Add   .237    .173    .170    .297
BWE-Agg-IDF   .246    .178    .180    .298
TbT-QT-BM25   .240    .231    .122    .341
TbT-QT-QL     .297    .268    .126    .387
MP-Cosine     .348    .331    .254    .423
MP-Gaussian   .322    .319    .203    .405
MP-Exact      .327    .295    .202    .415
MP-Hybrid     .343    .326    .243    .427
MP-TbT-QT     .327    .300    .195    .409
DRMM-Cosine   .374    .352    .304    .462
DRMM-TbT-QT   .345    .324    .193    .450
KNRM-Cosine   .368    .313    .286    .423
KNRM-TbT-QT   .329    .288    .200    .405
Table 3. MAP performance of all CLIR methods. DRMM-Cosine is the best performer in every language pair. The best MP variant is MP-Cosine for EN→NL, EN→IT and EN→FI, and MP-Hybrid for EN→ES.

4. Discussion and Conclusion

4.1. Parsing Results

The experimental results of CLIR on four language pairs are reported in Table 3. TbT-QT generally works better than BWE-Agg except for EN→FI. This might indicate that the English-Finnish CLWEs are not aligned well enough to provide quality top-1 query term translations. The larger gaps between {MP,DRMM,KNRM}-Cosine and {MP,DRMM,KNRM}-TbT-QT for EN→FI than for the other three language pairs reinforce this argument. All neural models achieve statistically significant improvement over the heuristic baselines.

DRMM-Cosine consistently achieves the best performance for all language pairs. Although DRMM and KNRM are conceptually similar, the former performs significantly better, with KNRM’s embedding layer kept frozen. The attention mechanism applied to query terms in DRMM can be a factor. On EN→{IT,ES}, the MP model performs on par with or better than KNRM. This finding indicates that the convolution plus dynamic pooling architecture can also be an option for learning an end-to-end CLIR model. Comparing different approaches to building query-document interaction matrices for MP, it is clear that the cosine similarity of the source language query term and the target language document term in the CLWE space is the best choice, which contradicts the conclusions of the study of mono-lingual IR (pang2016study), where the Gaussian kernel and the indicator function are found to work better. The exact matching variant MP-Exact we propose works reasonably well, indicating that most decisions of relevance are influenced by top similarity matching signals. The hybrid variant MP-Hybrid we propose improves upon MP-Exact but does not outperform MP-Cosine except for EN→ES. This is expected because the matching signals from MP-Exact are not from truly exact matches of terms, but are derived from cosine similarity matrices as in MP-Cosine, so combining the two models results in redundant information. The fact that {MP,DRMM,KNRM}-TbT-QT outperform the baseline approaches but are not as good as the respective cosine variants demonstrates (1) the effectiveness of pattern learning in neural models; and (2) the necessity of directly building cross-lingual interactions between query and document in two languages, rather than building interactions after translation.

4.2. Word-pair Similarity Distribution

Figure 1. Panels: (a) EN→{NL,ES,IT}; (b) EN→FI; (c) EN→NL. (a,b) Red: percentage of cross-lingual word pairs with similarity above the threshold; Blue: MP-Exact retrieval performance with different similarity threshold values. (c) Similarity distribution of word pairs in the EN→NL collection.

The distribution of word-pair similarities influences the exact matching threshold in MP-Exact, the query translation strategy in TbT-QT, and the embedding fine-tuning for an end-to-end model. We take source language terms in the queries and target language terms in the documents, calculate their pairwise cosine similarities in the aligned CLWE space, and plot the similarity distributions. In Figure 1(a) and 1(b), we show in red the percentage of cross-lingual word pairs with similarity above the threshold. The three distributions in Figure 1(a) are very similar at the tail, and the corresponding MP-Exact performance therefore peaks at the same threshold. EN→FI is distributed differently but shows a similar pattern (Figure 1(b)). The shapes of the cross-lingual similarity distributions for all four language pairs are very similar, so we only plot EN→NL in Figure 1(c) for demonstration. The mono-lingual similarity distribution in Xiong et al.’s study (xiong2017end) has large variance, positive mean, strong positive skewness and high density at large similarity values. In comparison, the cross-lingual similarity distribution (Figure 1(c)) has small variance, negative mean, no obvious skewness to the left or right, and a density that drops low and flat in the high-similarity region, where word pairs are considered highly similar (i.e., quality translations). This provides insight into why the top-1 translation with CLWEs is not necessarily significantly better than translations ranked at slightly lower positions.
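The tail percentages plotted in red in Figure 1 can be computed along the following lines. This is a sketch under assumptions: the function name `fraction_above` is hypothetical, the inputs are taken to be dense matrices of query-term and document-term embeddings from the aligned CLWE space, and the example thresholds are arbitrary.

```python
import numpy as np

def fraction_above(q_emb, d_emb, thresholds):
    """Share of query-term/document-term pairs whose cosine exceeds each threshold.

    q_emb: (n_query_terms, dim) source-language embeddings.
    d_emb: (n_doc_terms, dim) target-language embeddings.
    """
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sims = (q @ d.T).ravel()  # all pairwise cosine similarities
    return {t: float(np.mean(sims > t)) for t in thresholds}
```

Sweeping the threshold over a grid and plotting the resulting fractions gives the tail curve; the same `sims` array can be passed to a histogram routine to obtain a distribution like the one in panel (c).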

4.3. Conclusions

Answer to RQ1: To adapt a neural model for CLIR, exact matching representations, cross-lingual word-pair similarity distribution, and translation error using CLWEs have to be considered. In specific model settings, choices of interaction representations and hyper-parameters (e.g., dynamic pooling size at document side for MP) are found to be different from mono-lingual IR.

Answer to RQ2: The neural matching models examined in this study all outperform the baselines using CLWEs. DRMM achieves the best results across the board, while MP and KNRM perform inconsistently across language pairs.

Moving forward, a worthwhile endeavor will be to investigate an end-to-end neural model that learns from large-scale CLIR data.


This work was supported in part by the Center for Intelligent Information Retrieval and in part by the Air Force Research Laboratory (AFRL) and IARPA under contract #FA8650-17-C-9118 under subcontract #14775 from Raytheon BBN Technologies Corporation. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.