1. Introduction
CLIR is the task of retrieving documents in a target language with queries written in a source language. The increasing popularity of projection-based weakly-supervised (xing2015normalized; glavavs2019properly; joulin2018loss) and unsupervised (conneau2017word; artetxe2018unsupervised) cross-lingual word embeddings (CLWEs) has spurred unsupervised frameworks (litschko2018unsupervised) for CLIR, while in the realm of monolingual IR, interaction-based neural matching models (xiong2017end; guo2016deep; pang2016text) that utilize the semantics contained in word embeddings have been the dominant force. This study fills the gap by utilizing CLWEs in neural IR models for CLIR.
Traditional CLIR approaches translate either the document or the query using an off-the-shelf SMT system such that query and document are in the same language. A number of researchers (ture2013flat; ture2014exploiting) later investigated utilizing a translation table to build a probabilistic structured query (darwish2003probabilistic) in the target language. Recently, Litschko et al. showed that CLWEs are a good translation resource by experimenting with a CLIR method (dubbed TbT-QT) that translates each query term in the source language to its nearest target-language term in the CLWE space (litschko2018unsupervised). CLWEs are obtained by aligning two separately trained embeddings for the two languages in the same latent space, where a term in one language is proximate to its synonyms in that language and to its translations in the other language, and vice versa. TbT-QT takes only the top-1 translation of a query term and uses the query likelihood model (ponte1998language) for retrieval. The overall retrieval performance can be hurt by vocabulary mismatch magnified by translation error. Using a closeness measure between query and document terms in the shared CLWE space as the matching signal for relevance can alleviate the problem, but this area has not been extensively studied.
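The term-by-term translation step of TbT-QT amounts to a nearest-neighbor lookup in the shared CLWE space. A minimal sketch is shown below with toy 2-d vectors; the function name and data are illustrative, not the authors' implementation, and assume all embedding rows are L2-normalized.

```python
import numpy as np

def translate_top1(query_vecs, target_vocab, target_vecs):
    """Translate each query term to its nearest target-language term
    by cosine similarity in a shared CLWE space (TbT-QT-style top-1).
    Assumes rows of query_vecs and target_vecs are L2-normalized."""
    # cosine similarity reduces to a dot product for unit vectors
    sims = query_vecs @ target_vecs.T          # (n_query, n_target)
    nearest = sims.argmax(axis=1)              # top-1 neighbor per query term
    return [target_vocab[i] for i in nearest]

# toy example (hypothetical vocabulary and vectors)
tgt_vocab = ["teléfono", "casa"]
tgt_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
q_vecs = np.array([[0.9, 0.1], [0.2, 0.8]])
q_vecs /= np.linalg.norm(q_vecs, axis=1, keepdims=True)
print(translate_top1(q_vecs, tgt_vocab, tgt_vecs))  # ['teléfono', 'casa']
```

Because only the single nearest neighbor is kept, any error in this lookup propagates directly into retrieval, which is the weakness discussed above.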
The reasons for the success of neural IR models for monolingual retrieval can be grouped into two categories:
Pattern learning: the construction of word-level query–document interactions enables learning of various matching patterns (e.g., proximity, paragraph match, exact match) via different neural network architectures.
Representation learning: models in which interaction features are built with differentiable operations (e.g., kernel pooling (xiong2017end)) allow customizing word embeddings via end-to-end learning from large-scale training data.
Although representation learning is capable of further improving overall retrieval performance (xiong2017end), the same study showed that updating word embeddings requires large-scale training data to work well (more than 100k search sessions in their case). In CLIR, however, datasets usually have fewer than 200 queries per available language pair and can only support training neural models of smaller capacity. Therefore, we focus on the pattern-learning aspect of neural models.
In this study, we formulate the following research questions:

RQ1: How should a neural model for monolingual retrieval be adapted for CLIR?

RQ2: How do neural models compare with each other and with unsupervised models for CLIR?
2. Analysis
2.1. Unsupervised CLIR Methods with CLWEs
Two unsupervised CLIR approaches using CLWEs were proposed by Litschko et al. (litschko2018unsupervised). BWE-Agg ranks documents with respect to a query using the cosine similarity of query and document embeddings, obtained by aggregating the CLWEs of their constituent terms. The simpler version, BWE-Agg-Add, takes the average embedding of all terms in a query or document, while the more advanced version, BWE-Agg-IDF, builds document embeddings by weighting terms with their inverse document frequencies. TbT-QT, as described in § 1, first translates each query term to its nearest cross-lingual neighbor term and then adopts query likelihood in a monolingual setting. These two approaches represent different perspectives on CLIR with CLWEs. BWE-Agg builds query and document representations out of CLWEs but completely neglects exact matching signals, which play an important role in IR. Also, although query and document terms are weighted by IDF, using only one representation for a long document can fail to emphasize the section of the document that is truly relevant to the query. TbT-QT uses CLWEs only as a query translation resource and adopts exact matching in a monolingual setting, so its performance is heavily dependent on the top-1 translation accuracy (precision@1) of the CLWEs. Analytically, an interaction-based neural matching model that starts with word-level query–document interactions and considers both exact and similar matching can make up for the shortcomings of both methods.

2.2. Neural IR Models
2.2.1. Background
For interaction-based matching models, we select three representative models from the literature for analysis and experiments: MatchPyramid (pang2016text; pang2016study), DRMM (guo2016deep), and K-NRM (xiong2017end).
MatchPyramid: MatchPyramid (pang2016text; pang2016study) (MP for short) is one of the earliest models to capture word-level matching patterns for retrieval. It casts the ad-hoc retrieval task as a series of image recognition problems, where the "image" is the matching matrix of a query–document pair and each "pixel" is the interaction value of a query term and a document term. Typical interaction functions are cosine similarity, dot product, the Gaussian kernel, and the indicator function (for exact match). The intuition behind hierarchical convolutions and pooling is to model phrase-, sentence-, and even paragraph-level matching patterns.
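The matching "image" MP operates on can be sketched in a few lines; this is an illustrative construction with cosine interactions, not the authors' code:

```python
import numpy as np

def match_matrix(query_vecs, doc_vecs):
    """Build a MatchPyramid-style interaction matrix: entry (i, j) is
    the cosine similarity of query term i and document term j."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return q @ d.T   # shape: (|query|, |doc|)

# toy query of one term against a two-term document
M = match_matrix(np.array([[1.0, 0.0]]),
                 np.array([[1.0, 0.0], [0.0, 1.0]]))
print(M.shape)  # (1, 2)
```

In the monolingual case this matrix is fed to convolution and pooling layers; for CLIR the same construction applies once the two embedding spaces are aligned.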
DRMM: The DRMM (guo2016deep) model uses a matching histogram to capture the interactions of a query term with the whole document. The valid interval of cosine similarity (i.e., [-1, 1]) is discretized into a fixed number of bins, so a matching histogram is essentially a fixed-length integer vector. Features from different histograms are weighted based on attention calculated over the query terms. DRMM is not position-preserving, as the authors claim that relevance matching is not related to term order.

K-NRM: The K-NRM (xiong2017end) model takes a matrix representation of query–document interactions (similar to MP), but "categorizes" interactions into different levels of cosine similarity (similar to DRMM), using Gaussian kernels with different mean values. The distinct advantage of K-NRM over DRMM is that the former allows gradients to pass through the Gaussian kernels and therefore supports end-to-end learning of embeddings.
2.2.2. Monolingual to Crosslingual
According to the results reported in the respective studies (pang2016text; guo2016deep; xiong2017end), the relative performance of the three models for monolingual IR should be K-NRM > DRMM > MP, even when embedding learning is turned off for K-NRM. Tweaking a neural model to support CLIR is trivial: instead of considering an interaction value as two terms' similarity in a monolingual embedding space, we consider the proximity of their representations in the shared cross-lingual embedding space. However, there are several matters to consider while making the transition:
Exact matching signals: The significant difference between cross-lingual and monolingual IR is that the former (almost) never encounters exact matches of terms in different languages. However, neglecting such signals can be costly for models like MP, whose disadvantage compared to the other two models is the inability to capture exact and similarity matching signals at the same time. To this end, we first define exact matching of two terms (in different languages) in CLIR as their cosine similarity in the CLWE space exceeding a certain threshold value t. We then implement a hybrid version, MP-Hybrid, that joins exact and soft matching signals extracted from interaction matrices built with the indicator function and the cosine similarity function, such that the ranking features from the two channels are concatenated for an MLP to predict a ranking score.
Word-pair similarity distribution: The cosine similarities of two terms with close meanings but in different languages are distributed differently from those in the same language. Specifically, the top word-pair similarity distributions of CLWEs tend to have smaller mean and variance. In the example shown in Table 1, the cosine similarity of the five closest words to "telephone" in the English embedding space^1 ranges from 0.818 to 0.669, while in the aligned English–Spanish embedding space^2 it ranges from 0.535 to 0.520. The similarity distribution affects the histogram construction of DRMM and, similarly, the kernel pooling of K-NRM. The distribution also affects the exact matching threshold value t for the related variants of MP. Since the cosine similarity of a query term and its perfectly correct translation can be less than 0.6, setting t too high can lead to failure to capture positive matching signals.

Table 1. Five nearest neighbors of "telephone" with cosine similarities.

EN | phone | telephones | Telephone | landline | rotary-dial
   | 0.818 | 0.761      | 0.720     | 0.694    | 0.669
ES | telefónicos | teléfono | telefónica | telefónia | telefóno
   | 0.535       | 0.522    | 0.522      | 0.520     | 0.520

^1 https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
^2 https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.es.align.vec
Vocabulary mismatch and translation error: Query-translation-based CLIR methods (e.g., TbT-QT (litschko2018unsupervised)) first translate queries from the source language to the target language, then use monolingual retrieval in the target language. Apart from the inherent vocabulary mismatch problem within the target language, the translation error from source to target also has to be counted. Looking at the example in Table 1, TbT-QT would look for occurrences of "telefónicos" in the collection, and documents containing only the correct translation ("teléfono") would be overlooked. Interaction-based neural matching models alleviate this issue by giving partial credit to suboptimal nearest neighbors, which in many cases are the correct translations. To demonstrate the necessity of directly using cross-lingual word-embedding similarity as the interaction for neural models, we conduct comparative experiments where queries are first translated term by term with CLWEs, as in TbT-QT, and then used for retrieval in a monolingual setting. Such models are referred to as {MP,DRMM,K-NRM}-TbT-QT, respectively.
3. Experiments
Datasets: We evaluate the models on the test suites of the CLEF 2000–2003 campaigns. We select four language pairs: English (EN) queries to {Dutch (NL), Italian (IT), Finnish (FI), Spanish (ES)} documents. All documents for the four languages are used for evaluation and are truncated to the first 500 tokens for computational efficiency (pang2016study). The statistics of the evaluation datasets are shown in Table 2. The titles of CLEF topics are used as English queries. All queries and documents are lowercased, with stopwords, punctuation marks, and one-character tokens removed.
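The preprocessing described above can be sketched as follows; the stopword set here is a placeholder, as the paper does not specify which per-language stopword lists were used:

```python
import string

def preprocess(text, stopwords):
    """Lowercase, strip ASCII punctuation, and drop stopwords and
    one-character tokens, mirroring the preprocessing described
    for the CLEF queries and documents (a sketch, not the exact
    pipeline used by the authors)."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in stopwords and len(t) > 1]

# placeholder stopword list; a real run would use a per-language list
print(preprocess("The Telephone, a history!", {"the", "a"}))
# ['telephone', 'history']
```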
Cross-lingual word embeddings: We adopt the pre-aligned fastText CLWEs^3. The monolingual fastText embeddings are trained on the Wikipedia corpus of each language and aligned using weak supervision from a small bilingual lexicon, with the RCSLS loss as the optimization objective (joulin2018loss).

Table 2. Statistics of the evaluation datasets (#rel and #label are averages per query).

Lang. pair | EN→NL  | EN→IT  | EN→FI  | EN→ES
#queries   | 160    | 160    | 90     | 160
#docs      | 42,734 | 40,320 | 16,351 | 46,540
#rel       | 29.1   | 19.5   | 10.9   | 49.5
#label     | 375.4  | 338.3  | 282.6  | 372.7

^3 https://fasttext.cc/docs/en/aligned-vectors.html
Model specifications: We implemented the two CLWE-based unsupervised CLIR algorithms, BWE-Agg and TbT-QT, as baselines (litschko2018unsupervised). In addition to the query likelihood model of the original study, we pair TbT-QT with BM25 to investigate the influence of the retrieval model on queries translated using CLWEs.
We experiment with five variants of the MP model, two of the DRMM model, and two of the K-NRM model. As the interaction value of a query term q_i and a document term d_j, {MP,DRMM,K-NRM}-Cosine uses the cosine similarity cos(q_i, d_j); MP-Gaussian uses a Gaussian kernel of the two embeddings; and MP-Exact takes the indicator 1[cos(q_i, d_j) >= t], where t is a predefined threshold value (set to 0.3 for Table 3). MP-Hybrid concatenates the flattened features after the dynamic pooling layers of MP-Cosine and MP-Exact into one vector and uses an MLP to predict a final score. {MP,DRMM,K-NRM}-TbT-QT is equivalent to first translating the query into a target-language query term by term and then running the {MP,DRMM,K-NRM}-Cosine model.
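The three interaction functions compared across the MP variants can be sketched as below; the Gaussian form exp(-||q - d||^2) is one common choice and is an assumption here, as the exact kernel used is not spelled out in this section:

```python
import numpy as np

def interactions(q_vec, d_vec, t=0.3):
    """Compute the three word-pair interaction values compared for MP:
    cosine similarity, a Gaussian kernel of the embedding difference,
    and the thresholded 'exact match' indicator proposed for CLIR
    (t = 0.3 as in the experiments). A sketch, not the authors' code."""
    q = q_vec / np.linalg.norm(q_vec)
    d = d_vec / np.linalg.norm(d_vec)
    cos = float(q @ d)
    gauss = float(np.exp(-np.linalg.norm(q - d) ** 2))
    exact = 1.0 if cos >= t else 0.0   # cross-lingual "exact" match
    return cos, gauss, exact

cos, gauss, exact = interactions(np.array([1.0, 0.0]),
                                 np.array([1.0, 0.0]))
print(cos, exact)  # 1.0 1.0 for identical vectors
```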
For the MP model, we adopt a one-layer convolution with tuned kernel size and dynamic pooling size, and the kernel count set to 64. For the DRMM model, we adopt the log-count-based histogram with the bin count set to 30. For the K-NRM model, the kernel count is set to 20 and the standard deviation of each Gaussian kernel is set to 0.1. All of the above decisions are based on extensive hyperparameter tuning that prioritizes generalizable retrieval performance first, then computational efficiency and model simplicity.
Model training: All neural models in the experiments are trained with the pairwise hinge loss. Given a triple (q, d+, d-), where document d+ is relevant and document d- is non-relevant with respect to query q, the loss function is defined as:

L(q, d+, d-; Θ) = max(0, 1 - s(q, d+) + s(q, d-)),

where s(q, d) denotes the predicted matching score for the pair (q, d), and Θ represents the learnable parameters of the neural network. Note that we randomly select documents explicitly labeled non-relevant as negative samples for training. Five negative pairs are sampled for each positive pair. We apply the stochastic gradient descent method Adam (kingma2014adam) (learning rate 1e-3) in mini-batches (of size 64) for optimization. The maximum number of training epochs is 20.
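The pairwise hinge loss above is straightforward to implement; a minimal sketch for a single training triple:

```python
def hinge_loss(score_pos, score_neg, margin=1.0):
    """Pairwise hinge loss on a (relevant, non-relevant) document pair:
    zero once the relevant document outscores the non-relevant one by
    at least the margin, linear in the violation otherwise."""
    return max(0.0, margin - score_pos + score_neg)

print(hinge_loss(2.0, 0.5))  # 0.0: margin satisfied, no gradient signal
print(hinge_loss(0.5, 0.4))  # 0.9: ranking violation is penalized
```

In training, this loss is averaged over mini-batches of such triples and minimized with Adam, as described above.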
Evaluation: As the CLEF datasets use binary relevance judgments, we adopt MAP as the evaluation metric. In order to evaluate on enough queries that conclusions can be statistically significant, we adopt 5-fold cross-validation with validation and test sets. Statistical significance tests are performed using the two-tailed paired t-test.

Table 3. CLIR results (MAP) on the four language pairs.

Lang. pair   | EN→NL | EN→IT | EN→FI | EN→ES
BWE-Agg-Add  | .237  | .173  | .170  | .297
BWE-Agg-IDF  | .246  | .178  | .180  | .298
TbT-QT-BM25  | .240  | .231  | .122  | .341
TbT-QT-QL    | .297  | .268  | .126  | .387
MP-Cosine    | .348  | .331  | .254  | .423
MP-Gaussian  | .322  | .319  | .203  | .405
MP-Exact     | .327  | .295  | .202  | .415
MP-Hybrid    | .343  | .326  | .243  | .427
MP-TbT-QT    | .327  | .300  | .195  | .409
DRMM-Cosine  | .374  | .352  | .304  | .462
DRMM-TbT-QT  | .345  | .324  | .193  | .450
K-NRM-Cosine | .368  | .313  | .286  | .423
K-NRM-TbT-QT | .329  | .288  | .200  | .405
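The MAP metric used throughout Table 3 is standard; for completeness, a minimal sketch of its computation from ranked binary relevance labels:

```python
def average_precision(ranked_rels):
    """Average precision for one query, given the binary relevance
    labels of the ranked result list (1 = relevant, 0 = not)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: mean of per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)

# relevant documents retrieved at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
print(round(average_precision([1, 0, 1, 0]), 3))  # 0.833
```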
4. Discussion and Conclusion
4.1. Parsing Results
The experimental results of CLIR on the four language pairs are reported in Table 3. TbT-QT generally works better than BWE-Agg except on EN→FI. This might indicate that the English–Finnish CLWEs are not aligned well enough to provide quality top-1 query-term translations. The larger gaps between {MP,DRMM,K-NRM}-Cosine and {MP,DRMM,K-NRM}-TbT-QT for EN→FI than for the other three language pairs reinforce this argument. All neural models achieve statistically significant improvement over the heuristic baselines.
DRMM-Cosine consistently achieves the best performance for all language pairs. Although DRMM and K-NRM are conceptually similar, the former performs significantly better when K-NRM's embedding layer is kept frozen. The attention mechanism applied to query terms in DRMM may be a factor. On EN→{IT,ES}, the MP model performs on par with or better than K-NRM. This finding indicates that the convolution-plus-dynamic-pooling architecture can also be an option for learning an end-to-end CLIR model. Comparing the different approaches to building query–document interaction matrices for MP, it is clear that the cosine similarity of a source-language query term and a target-language document term in the CLWE space is the best choice, which contradicts the conclusions of the monolingual IR study (pang2016study), where the Gaussian kernel and the indicator function were found to work better. The exact matching variant MP-Exact that we propose works reasonably well, indicating that most relevance decisions are influenced by top similarity-matching signals. The hybrid variant MP-Hybrid improves upon MP-Exact but does not outperform MP-Cosine except on EN→ES. This is expected, because the matching signals of MP-Exact do not come from truly exact matches of terms but are derived from the same cosine similarity matrices as in MP-Cosine; combining the two models thus yields redundant information. The fact that {MP,DRMM,K-NRM}-TbT-QT outperform the baseline approaches but fall short of their respective Cosine variants demonstrates (1) the effectiveness of the pattern learning of neural models and (2) the necessity of directly building cross-lingual interactions between queries and documents in two languages, rather than building interactions after translation.

4.2. Word-pair Similarity Distribution
The distribution of word-pair similarities influences the exact matching threshold t in MP-Exact, the query translation strategy of TbT-QT, and the embedding fine-tuning of an end-to-end model. We take the source-language terms in the queries and the target-language terms in the documents, calculate their pairwise cosine similarities in the aligned CLWE space, and plot the similarity distributions. In Figures 1a and 1b, we show in red the percentage of cross-lingual word pairs with similarity above the threshold. The three distributions in Figure 1a are very similar at the tail, so the corresponding MP-Exact performance peaks at the same threshold value. EN–FI is distributed differently, but the pattern shown is similar (Figure 1b). The shapes of the cross-lingual similarity distributions of all four language pairs are very similar, so we only plot EN–NL in Figure 1c for illustration. The monolingual similarity distribution in Xiong et al.'s study (xiong2017end) has large variance, positive mean, strong positive skewness, and high density at large similarity values. In comparison, the cross-lingual similarity distribution (Figure 1c) has small variance, negative mean, and no obvious skewness, and its density drops low and flat in the high-similarity region where word pairs are considered quality translations. This provides insight into why top-1 translation with CLWEs is not necessarily significantly better than translations ranked at slightly lower positions.

4.3. Conclusions
Answer to RQ1: To adapt a neural model for CLIR, exact matching representations, the cross-lingual word-pair similarity distribution, and the translation error of CLWEs have to be considered. In specific model settings, the best choices of interaction representations and hyperparameters (e.g., the dynamic pooling size on the document side for MP) are found to differ from monolingual IR.
Answer to RQ2: All the neural matching models experimented with in this study outperform the baselines using CLWEs. DRMM achieves the best results across the board, while MP and K-NRM perform inconsistently across language pairs.
Moving forward, a worthwhile endeavor will be to investigate an endtoend neural model that learns from largescale CLIR data.
Acknowledgements
This work was supported in part by the Center for Intelligent Information Retrieval and in part by the Air Force Research Laboratory (AFRL) and IARPA under contract #FA8650-17-C-9118, under subcontract #14775 from Raytheon BBN Technologies Corporation. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.