
Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization

10/23/2022
by   Nishant Yadav, et al.
Google
University of Massachusetts Amherst

Efficient k-nearest neighbor search is a fundamental task, foundational for many problems in NLP. When the similarity is measured by dot-product between dual-encoder vectors or ℓ_2-distance, there already exist many scalable and efficient search methods. But not so when similarity is measured by more accurate and expensive black-box neural similarity models, such as cross-encoders, which jointly encode the query and candidate neighbor. The cross-encoders' high computational cost typically limits their use to reranking candidates retrieved by a cheaper model, such as dual encoder or TF-IDF. However, the accuracy of such a two-stage approach is upper-bounded by the recall of the initial candidate set, and potentially requires additional training to align the auxiliary retrieval model with the cross-encoder model. In this paper, we present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder. Retrieval is made efficient with CUR decomposition, a matrix decomposition approach that approximates all pairwise cross-encoder distances from a small subset of rows and columns of the distance matrix. Indexing items using our approach is computationally cheaper than training an auxiliary dual-encoder model through distillation. Empirically, for k > 10, our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods that re-rank items retrieved using a dual-encoder or TF-IDF.


1 Introduction

(a) Model architecture
(b) Query-item score distribution
Figure 1: Model architecture and score distribution for three neural scoring functions. Dual-Encoder (DE) models score a query-item pair using independently computed query and item embeddings. The [cls]-CE model computes the score by jointly encoding the query-item pair and passing the joint query-item embedding through a linear layer. Our proposed [emb]-CE model embeds special tokens amongst the query and item tokens, and computes the query-item score using contextualized query and item embeddings extracted at the special tokens after jointly encoding the query-item pair.

Finding top-k scoring items for a given query is a fundamental sub-routine of recommendation and information retrieval systems Kowalski (2007); Das et al. (2017). For instance, in question answering systems, the query corresponds to a question and the item corresponds to a document or a passage. Neural networks are widely used to model the similarity between a query and an item in such applications Zamani et al. (2018); Hofstätter et al. (2019); Karpukhin et al. (2020); Qu et al. (2021). In this work, we focus on efficient k-nearest neighbor search for one such similarity function – the cross-encoder model.

Cross-encoder models output a scalar similarity score by jointly encoding the query-item pair and often generalize better to new domains and unseen data Chen et al. (2020); Wu et al. (2020); Thakur et al. (2021) as compared to dual-encoder models (also referred to as two-tower models or Siamese networks), which independently embed the query and the item in a vector space and use simple functions such as dot-product to measure similarity. However, due to the black-box nature of the cross-encoder based similarity function, the computational cost for brute force search with cross-encoders is prohibitively high. This often limits the use of cross-encoder models to re-ranking items retrieved using a separate retrieval model such as a dual-encoder or a tf-idf-based model Logeswaran et al. (2019); Zhang and Stratos (2021); Qu et al. (2021). The accuracy of such a two-stage approach is upper bounded by the recall of relevant items by the initial retrieval model. Much recent work either attempts to distill information from an expensive but more expressive cross-encoder model into a cheaper student model such as a dual-encoder Wu et al. (2020); Hofstätter et al. (2020); Lu et al. (2020); Qu et al. (2021); Liu et al. (2022), or focuses on cheaper alternatives to the cross-encoder model that still attempt to capture fine-grained interactions between the query and the item Humeau et al. (2020); Khattab and Zaharia (2020); Luan et al. (2021).

In this work, we tackle the fundamental task of efficient k-nearest neighbor search for a given query according to the cross-encoder. Our proposed approach, annCUR, uses CUR decomposition Mahoney and Drineas (2009), a matrix factorization approach, to approximate cross-encoder scores for all items, and retrieves k-nearest neighbor items while only making a small number of calls to the cross-encoder. Our proposed method selects a fixed set of anchor queries and anchor items, and uses scores between the anchor queries and all items to generate latent embeddings for indexing the item set. At test time, we generate a latent embedding for the query using cross-encoder scores between the test query and the anchor items, and use it to approximate scores of all items for the given query and/or retrieve top-k items according to the approximate scores. In contrast to distillation-based approaches, our proposed approach does not involve any additional compute-intensive training of a student model such as a dual-encoder via distillation.

In general, the performance of a matrix factorization-based method depends on the rank of the matrix being factorized. In our case, the entries of the matrix are cross-encoder scores for query-item pairs. To further reduce the rank of the score matrix, and in turn improve the performance of the proposed matrix factorization based approach, we propose [emb]-CE, which uses a novel dot-product based scoring mechanism for cross-encoder models (see Figure 1(a)). In contrast to the widely used [cls]-CE approach of pooling the query-item representation into a single vector followed by scoring using a linear layer, [emb]-CE produces a score matrix with a much lower rank while performing on par with [cls]-CE on the downstream task.

We run extensive experiments with cross-encoder models trained for the downstream task of entity linking. The query and item in this case correspond to a mention of an entity in text and a document with an entity description respectively. For the task of retrieving k-nearest neighbors according to the cross-encoder, our proposed approach presents superior recall-vs-computational cost trade-offs over using dual-encoders trained via distillation as well as over unsupervised tf-idf-based methods (§3.2). We also evaluate the proposed method for various indexing and test-time cost budgets as well as study the effect of various design choices in §3.3 and §3.4.

2 Matrix Factorization for Nearest Neighbor Search

2.1 Task Description and Background

Figure 2: CUR decomposition of a matrix using a subset of its columns and rows. The blue rows correspond to the anchor queries and are used for indexing the items to obtain latent item embeddings, and the green row corresponds to the test query. Note that the test row can be approximated using a subset of its columns (the test query's scores with the anchor items) and the latent representations of the items.

Given a scoring function f that maps a query-item pair to a scalar score, and a query q, the k-nearest neighbor task is to retrieve the top-k scoring items according to f from a fixed item set I.

In NLP, queries and items are typically represented as sequences of tokens and the scoring function is typically parameterized using deep neural models such as transformers Vaswani et al. (2017). There are two popular choices for the scoring function – the cross-encoder (CE) model and the dual-encoder (DE) model. The CE model scores a given query-item pair by concatenating the query and the item using special tokens, passing the concatenated sequence through a model (such as a transformer) to obtain a representation of the input pair, and then computing the score from this representation using learned linear weights.

While effective, computing the similarity between a query-item pair requires a full forward pass of the model, which is often quite computationally burdensome. As a result, previous work uses auxiliary retrieval models such as BM25 Robertson et al. (1995) or a trained dual-encoder (DE) model to approximate the CE. The DE model independently embeds the query and the item in ℝ^d, for instance by using a transformer followed by pooling the final layer representations into a single vector (e.g. using the cls token). The DE score for the query-item pair is computed using the dot-product of the query embedding and the item embedding.
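To make the computational contrast concrete, here is a minimal sketch under hypothetical encoder helpers (de_encode_query, ce_score, and the precomputed item_embs are stand-ins, not a real API): a DE pre-computes item embeddings once and scores a query with dot products, whereas brute-force CE search needs one joint forward pass per query-item pair.

```python
import numpy as np

def de_search(de_encode_query, item_embs, query, k=10):
    # item_embs (|I| x d) can be precomputed and indexed once, offline
    q = de_encode_query(query)            # d-dimensional query embedding
    scores = item_embs @ q                # dot-product similarity for all items
    return np.argsort(-scores)[:k]        # indices of the top-k items

def ce_brute_force_search(ce_score, items, query, k=10):
    # one full joint forward pass per item -- prohibitively expensive for large |I|
    scores = np.array([ce_score(query, it) for it in items])
    return np.argsort(-scores)[:k]
```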

In this work, we propose a method based on CUR matrix factorization that allows efficient retrieval of the top-k items by directly approximating the cross-encoder model rather than using an auxiliary (trained) retrieval model.

CUR Decomposition

In CUR matrix factorization Mahoney and Drineas (2009), a matrix M is approximated using a subset of its rows (stacked into a matrix R), a subset of its columns (stacked into a matrix C), and a joining matrix U as

M ≈ C U R

where the joining matrix U is chosen to minimize the approximation error. In this work, we set U to be the Moore-Penrose pseudo-inverse W^+ of W, the intersection of the matrices C and R (i.e. the submatrix of M at the selected rows and columns), in which case C W^+ R is known as the skeleton approximation of M Goreinov et al. (1997).
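The following minimal numpy sketch illustrates the skeleton approximation described above on a synthetic low-rank matrix; the uniformly random row/column selection here is purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 20)) @ rng.normal(size=(20, 300))   # a low-rank matrix

row_idx = rng.choice(M.shape[0], size=25, replace=False)     # selected rows
col_idx = rng.choice(M.shape[1], size=30, replace=False)     # selected columns

R = M[row_idx, :]                   # subset of rows
C = M[:, col_idx]                   # subset of columns
W = M[np.ix_(row_idx, col_idx)]     # intersection of the selected rows and columns
U = np.linalg.pinv(W)               # Moore-Penrose pseudo-inverse

M_hat = C @ U @ R                   # skeleton approximation of M
rel_err = np.linalg.norm(M - M_hat) / np.linalg.norm(M)
print(f"relative Frobenius error: {rel_err:.3e}")
```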

2.2 Proposed Method Overview

Our proposed method annCUR, which stands for Approximate Nearest Neighbor search using CUR decomposition, begins by selecting a fixed set of anchor queries and anchor items; we denote the number of anchor queries by k_q and the number of anchor items by k_i. It uses scores between the anchor queries and all items to index the item set by generating latent item embeddings. At test time, we compute exact scores between the test query and the anchor items, and use them to approximate the scores of all items for the given query and/or retrieve top-k items according to the approximate scores. We can optionally retrieve k_r ≥ k items, re-rank them using exact scores, and return the top-k items.

Let R be the k_q × |I| matrix containing scores between the anchor queries and all items, m_q the 1 × |I| vector of scores between a test query q and all items, W the k_q × k_i matrix of scores between the anchor queries and the anchor items, and c_q the 1 × k_i vector of scores between the test query and the anchor items.

Using CUR decomposition, we can approximate the query-item score matrix using a subset of its columns (those corresponding to the anchor items) and a subset of its rows (R, corresponding to the anchor queries). In particular, the scores of the test query are approximated as

m̂_q = c_q W^+ R

Figure 2 shows the CUR decomposition of this matrix. At test time, m̂_q, containing approximate item scores for the test query, can be computed from c_q, W^+, and R, where c_q contains the exact scores between the test query and the anchor items. The matrices W^+ and R can be computed offline as they are independent of the test query.

2.3 Offline Indexing

The indexing process first computes R, containing scores between the anchor queries and all items.

We embed all items in ℝ^{k_i} as

E = W^+ R

where W^+ is the pseudo-inverse of W, the matrix of scores between the anchor queries and the anchor items. Each column of E corresponds to a latent item embedding.
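A minimal sketch of this indexing step is shown below; ce_score stands in for a cross-encoder call and anchor_item_idx for the indices of the anchor items within the item set, both hypothetical names rather than the released implementation.

```python
import numpy as np

def build_anncur_index(ce_score, anchor_queries, items, anchor_item_idx):
    """Return latent item embeddings E of shape (k_i, |I|)."""
    # R: scores between the k_q anchor queries and all items (k_q x |I|)
    R = np.array([[ce_score(q, it) for it in items] for q in anchor_queries])
    # W: scores between anchor queries and anchor items (k_q x k_i)
    W = R[:, anchor_item_idx]
    # E = W^+ R; each column is a k_i-dimensional latent item embedding
    E = np.linalg.pinv(W) @ R
    return E
```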

2.4 Test-time Inference

At test time, we embed the test query in ℝ^{k_i} using its scores with the anchor items, i.e. the query embedding is c_q.

We approximate the score for a query-item pair (q, i) using the inner-product of c_q and e_i, where e_i, the i-th column of E, is the embedding of item i.

We can use c_q along with an off-the-shelf nearest-neighbor search method for maximum inner-product search Malkov and Yashunin (2018); Johnson et al. (2019); Guo et al. (2020) and retrieve top-scoring items for the given query according to the approximate query-item scores, without explicitly approximating the scores of all the items.
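Continuing the hypothetical sketch from §2.3, test-time retrieval reduces to one CE call per anchor item, an (approximate) inner-product search, and an optional exact re-ranking step; the argpartition below is a simple stand-in for an off-the-shelf MIPS index over the columns of E.

```python
import numpy as np

def anncur_retrieve(ce_score, query, items, anchor_item_idx, E, k=10, k_r=100):
    # c_q: exact CE scores between the test query and the k_i anchor items
    c_q = np.array([ce_score(query, items[j]) for j in anchor_item_idx])
    approx_scores = c_q @ E                                   # approximate scores for all items
    cand = np.argpartition(-approx_scores, k_r)[:k_r]         # top-k_r items by approximate score
    exact = np.array([ce_score(query, items[j]) for j in cand])  # optional exact re-ranking
    return cand[np.argsort(-exact)[:k]]                       # final top-k items
```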

2.5 Time Complexity

During the indexing stage, we evaluate f on k_q · |I| query-item pairs and compute the pseudo-inverse of a k_q × k_i matrix. The overall time complexity of the indexing stage is O(k_q |I| · t_CE + t_pinv), where t_pinv is the cost of computing the pseudo-inverse of a k_q × k_i matrix and t_CE is the cost of evaluating f on a single query-item pair. For the CE models used in this work, the cross-encoder calls dominate, i.e. t_pinv ≪ k_q |I| · t_CE.

At test time, we need to compute f for k_i query-item pairs (the test query paired with each anchor item), optionally followed by re-ranking k_r items retrieved by maximum inner-product search (MIPS). The overall time complexity of inference is O((k_i + k_r) · t_CE + t_MIPS), where t_MIPS is the time complexity of MIPS over |I| items to retrieve the top-k_r items.
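As a rough worked example using numbers reported elsewhere in the paper: for domain = YuGiOh with roughly 10K items and k_q = 500 anchor queries, indexing requires about 500 × 10,031 ≈ 5M cross-encoder calls, which at the throughput of roughly 140 scores/second reported in the Limitations section is about 10 hours of GPU time (consistent with Table 2(a)); at test time, a budget of 500 CE calls corresponds to roughly 3-4 seconds of cross-encoder computation per query at the same throughput.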

2.6 Improving score distribution of CE models for matrix factorization

The rank of the query-item score matrix, and in turn the approximation error of a matrix factorization method, depends on the scores in the matrix. Figure 1(b) shows a histogram of query-item score distributions (adjusted to have zero mean) for a dual-encoder and a [cls]-CE model. We use [cls]-CE to refer to a cross-encoder model parameterized using transformers that uses the cls token to compute a pooled representation of the input query-item pair. Both models are trained for zero-shot entity linking (see §3.1 for details). As shown in the figure, the query-item score distribution for the [cls]-CE model is significantly skewed, with only a small fraction of items (entities) getting high scores, while the score distribution for the dual-encoder model is less so as it is generated explicitly using the dot-product of query and item embeddings. The skewed score distribution from [cls]-CE leads to a high-rank query-item score matrix, which results in a large approximation error for matrix decomposition methods.

We propose a small but important change to the scoring mechanism of the cross-encoder so that it yields a less skewed score distribution, thus making it much easier to approximate the corresponding query-item score matrix without adversely affecting downstream task performance. Instead of using the cls token representation to score a given query-item pair, we add special tokens amongst the query and the item tokens, and extract contextualized query and item representations at these special tokens after jointly encoding the query-item pair using a model such as a transformer. The final score for the given query-item pair is computed using the dot-product of the contextualized query and item embeddings.

We refer to this model as [emb]-CE. Figure 1(a) shows the high-level model architecture for dual-encoders, [cls]-CE, and [emb]-CE.

As shown in Figure 1(b), the query-item score distribution from an [emb]-CE model resembles that from a DE model. Empirically, we observe that the rank of the query-item score matrix for the [emb]-CE model is much lower than the rank of the corresponding matrix computed using [cls]-CE, thus making it much easier to approximate using matrix decomposition based methods.
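The following is a minimal sketch (not the released implementation) of the [emb]-CE scoring head, assuming the joint query-item sequence has already been encoded by a transformer and that q_pos / i_pos give the positions of the special query and item embedding tokens in each sequence.

```python
import torch

def emb_ce_score(hidden_states: torch.Tensor,
                 q_pos: torch.Tensor,
                 i_pos: torch.Tensor) -> torch.Tensor:
    """hidden_states: (batch, seq_len, dim) final-layer representations of the
    jointly encoded query-item pair; q_pos, i_pos: (batch,) token positions.
    Returns a (batch,) tensor of scores."""
    batch = torch.arange(hidden_states.size(0))
    q_emb = hidden_states[batch, q_pos]    # contextualized query embedding
    i_emb = hidden_states[batch, i_pos]    # contextualized item embedding
    return (q_emb * i_emb).sum(dim=-1)     # dot-product score

# toy usage with random encodings
h = torch.randn(2, 128, 768)
scores = emb_ce_score(h, torch.tensor([1, 1]), torch.tensor([70, 64]))
```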

3 Experiments

In our experiments, we use CE models trained for zero-shot entity linking on the ZeShEL dataset (§3.1). We evaluate the proposed method and various baselines on the task of finding k-nearest neighbors for cross-encoder models in §3.2, and evaluate the proposed method for various indexing and test-time cost budgets as well as study the effect of various design choices in §3.3 and §3.4. All resources for the paper, including code for all experiments and model checkpoints, are available at https://github.com/iesl/anncur

ZeShEL Dataset

The Zero-Shot Entity Linking (ZeShEL) dataset was constructed by Logeswaran et al. (2019) from Wikia. The task of zero-shot entity linking involves linking entity mentions in text to an entity from a list of entities with associated descriptions. The dataset consists of 16 different domains with eight, four, and four domains in training, dev, and test splits respectively. Each domain contains non-overlapping sets of entities, thus at test time, mentions need to be linked to unseen entities solely based on entity descriptions. Table 1 in the appendix shows dataset statistics. In this task, queries correspond to mentions of entities along with the surrounding context, and items correspond to entities with their associated descriptions.

3.1 Training DE and CE models on ZeShEL

Following the precedent set by recent papers Wu et al. (2020); Zhang and Stratos (2021), we first train a dual-encoder model on ZeShEL training data using hard negatives. We train a cross-encoder model for the task of zero-shot entity-linking on all eight training domains using cross-entropy loss with ground-truth entity and negative entities mined using the dual-encoder. We refer the reader to Appendix A.1 for more details.

Results on downstream task of Entity Linking

To evaluate the cross-encoder models, we retrieve 64 entities for each test mention using the dual-encoder model and re-rank them using a cross-encoder model. The top-64 entities retrieved by the DE contain the ground-truth entity for 87.95% of mentions in test data and 92.04% of mentions in dev data. The proposed [emb]-CE model achieves an average accuracy of 65.49 and 66.86 on domains in the test and dev sets respectively, and performs on par with the widely used, state-of-the-art [cls]-CE architecture, which achieves an accuracy of 65.87 and 67.67 on the test and dev sets respectively (we observe that our implementation of [cls]-CE obtains slightly different results as compared to the state-of-the-art, see Table 2 in Zhang and Stratos (2021), likely due to minor implementation/training differences). Since the [emb]-CE model performs on par with [cls]-CE on the downstream task of entity linking, and the rank of the score matrix from [emb]-CE is much lower than that from [cls]-CE, we use [emb]-CE in subsequent experiments.

3.2 Evaluating on k-NN search for CE

(a) Top-k-Recall@k_r for annCUR and baselines when all methods retrieve and re-rank the same number of items k_r. The subscript attached to annCUR refers to the number of anchor items used for embedding the test query.
(b) Top-k-Recall for annCUR and baselines when all methods operate under a fixed test-time cost budget. Recall that cost is the number of CE calls made during inference for re-ranking retrieved items and, in the case of annCUR, also includes the CE calls used to embed the test query by comparing it with the anchor items.
Figure 3: Top-k-Recall results for domain = YuGiOh with 500 training/anchor queries.
Experimental Setup

For all experiments in this section, we use the [emb]-CE model trained on the original ZeShEL training data for the task of zero-shot entity linking, and evaluate the proposed method and baselines on the task of retrieving the k-nearest neighbor entities (items) for a given mention (query) according to the cross-encoder model.

We run experiments separately on five domains from ZeShEL containing 10K to 100K items. For each domain, we compute the query-item score matrix for a subset (or all) of the queries (mentions) and all items (entities) in the domain. We randomly split the query set into a training set and a test set. We use the queries in the training data to train baseline DE models. For annCUR, we use the training queries as anchor queries and use CE scores between the anchor queries and all items for indexing as described in §2.3. All approaches are then evaluated on the task of finding the top-k CE items for queries in the corresponding domain's test split. For a fair comparison, we do not train DE models on multiple domains at the same time.

3.2.1 Baseline Retrieval Methods

tf-idf:

All queries and items are embedded using a tf-idf vectorizer trained on item descriptions, and the top-scoring items are retrieved using the dot-product of query and item embeddings.

DE models:

We experiment with DEbase, the DE model trained on ZeShEL for the task of entity linking (see §3.1), and the following two DE models trained via distillation from the CE.

  • DEbert+ce: DE initialized with BERT Devlin et al. (2019) and trained using only the training signal from the cross-encoder model.

  • DEbase+ce: DEbase model further fine-tuned via distillation using the cross-encoder model.

We refer the reader to Appendix A.3 for hyper-parameter and optimization details.

Evaluation metric

We evaluate all approaches under the following two settings.


  • In the first setting, we retrieve k_r items for a given query, re-rank them using exact CE scores, and keep the top-k items. We evaluate each method using Top-k-Recall@k_r, which is the percentage of the top-k items according to the CE model that are present in the k_r retrieved items (see the sketch after this list).

  • In the second setting, we operate under a fixed test-time cost budget, where the cost is defined as the number of CE calls made during inference. Baselines such as DE and tf-idf use the entire cost budget for re-ranking items using exact CE scores, while our proposed approach has to split the budget between the number of anchor items (k_i) used for embedding the query (§2.4) and the number of items (k_r) retrieved for final re-ranking.
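For concreteness, here is a small sketch of the Top-k-Recall@k_r metric used in the first setting; the helper below is illustrative, not the evaluation code released with the paper.

```python
import numpy as np

def top_k_recall(true_topk, retrieved):
    """true_topk, retrieved: lists with one array of item ids per test query.
    Returns the percentage of exact top-k CE items found among the retrieved items."""
    recalls = [len(set(t) & set(r)) / len(t) for t, r in zip(true_topk, retrieved)]
    return 100.0 * np.mean(recalls)

# e.g. top_k_recall([[1, 2, 3]], [[2, 3, 9, 7]]) is roughly 66.67
```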

We refer to our proposed method with a subscript indicating the number of anchor items when using a fixed set of anchor items chosen uniformly at random, and we refer to it simply as annCUR when operating under a fixed test-time cost budget, in which case different values of k_i and k_r are used in each setting.

3.2.2 Results

Figure 4: Bar plot showing Top-100-Recall@Cost=500 for different methods as we increase the size of the indexing/training data (number of training queries) for five different domains.

Figures 3(a) and 3(b) show recall of top-k cross-encoder nearest neighbors on ZeShEL domain = YuGiOh when using 500 queries for training and evaluating on the remaining 2874 test queries. Figure 3(a) shows recall when each method retrieves the same number of items, and Figure 3(b) shows recall when each method operates under a fixed inference cost budget.

Performance for larger values of k

In Figure 3(a), our proposed approach outperforms all baselines at finding top-k nearest neighbors for larger values of k when all models retrieve the same number of items for re-ranking. In Figure 3(b), when operating under the same cost budget, annCUR outperforms DE baselines at larger cost budgets for these values of k. Recall that at a smaller cost budget, annCUR can retrieve fewer items for exact re-ranking than the baselines, as it needs to spend a fraction of the cost budget, i.e. CE calls, on comparing the test query with anchor items in order to embed the query for retrieving relevant items. Generally, the optimal budget split between the number of anchor items (k_i) and the number of items retrieved for exact re-ranking (k_r) allocates around 40-60% of the budget to k_i and the remaining budget to k_r.

Performance for k = 1

The top-1 nearest neighbor according to the CE is likely to be the ground-truth entity (item) for the given mention (query). Note that DEbase was trained using a massive amount of entity linking data (all eight training domains in ZeShEL, see §3.1) with the ground-truth entity (item) as the positive item. Thus, it is natural for the top-1 nearest neighbors of the CE and DEbase to be aligned. For this reason, we observe that DEbase and DEbase+ce outperform annCUR for k = 1. However, our proposed approach either outperforms or is competitive with DEbert+ce, a DE model trained only using CE scores for 500 queries after initializing with BERT. In Figure 3(a), the annCUR variants outperform DEbert+ce, and in Figure 3(b) annCUR outperforms DEbert+ce at larger cost budgets.

We refer the reader to Appendix B.3 for results on all combinations of top-k values, domains, and training data sizes.

Effect of training data size (number of training queries)

Figure 4 shows Top-100-Recall@Cost=500 on test queries for various methods as we increase the number of queries in the training data. For DE baselines, the trend is not consistent across all domains. On YuGiOh, performance consistently improves with more training queries. However, on Military, the performance of the distilled DE drops when going from 100 to 500 training queries but improves when going from 500 to 2000 training queries. Similarly, on Pro_Wrestling, the performance of the distilled DEbase+ce does not consistently improve with training data size while it does for DEbert+ce. We suspect that this is due to a combination of factors such as overfitting on training data, sub-optimal hyper-parameter configurations, and divergence of model parameters. In contrast, our proposed method, annCUR, always shows consistent improvements as we increase the number of queries in the training data, and avoids the perils of gradient-based training, which often requires large amounts of training data to avoid overfitting as well as expensive hyper-parameter tuning in order to work consistently well across various domains.

Figure 5: Bar plot with Top-100-Recall@Cost=500 for five domains in ZeShEL when using a fixed number of queries for training/indexing, and a line plot showing the number of items (entities) in each domain.
Effect of domain size

Figure 5 shows Top-100-Recall@Cost=500 for annCUR and DE baselines on the primary y-axis, and the size of the domain, i.e. the total number of items, on the secondary y-axis for five different domains in ZeShEL. Generally, as the number of items in the domain increases, the performance of all methods drops.

Indexing Cost

The indexing process starts by computing query-item CE scores for queries in the train split. annCUR uses these scores for indexing the items (see §2.3), while DE baselines use these scores to find the ground-truth top-k items for each query, followed by training DE models using the CE query-item scores. For domain = YuGiOh with roughly 10K items and 500 train/anchor queries, computing query-item scores for the train/anchor queries takes around 10 hours on an NVIDIA GeForce RTX 2080Ti GPU with 12GB memory, and training a DE model takes an additional 4.5 hours on two instances of the same GPU. Both of these costs increase linearly with the domain size and the number of train/anchor queries; however, the query-item score computation can be trivially parallelized. We ignore the time to build a nearest-neighbor search index over item embeddings produced by annCUR or DE as it is negligible in comparison to the time spent on CE score computation and DE training. We refer the reader to Appendix A.3 for more details.

3.3 Analysis of annCUR

(a) [cls]-CE
(b) [emb]-CE
Figure 6: Top-10-Recall@500 of annCUR for non-anchor queries on domain = YuGiOh for two cross-encoder models – [cls]-CE and [emb]-CE.

We compute the query-item score matrix for both [cls]-CE and [emb]-CE and compute the rank of these matrices using numpy Harris et al. (2020) for domain = YuGiOh with 3374 queries (mentions) and 10031 items (entities). The rank of the score matrix for [cls]-CE is 315, which is much higher than the rank of the corresponding matrix for [emb]-CE, 45, due to the query-item score distribution produced by the [cls]-CE model being much more skewed than that produced by the [emb]-CE model (see Fig. 1(b)).

Figures 6(a) and 6(b) show Top-10-Recall@500 on domain = YuGiOh for [cls]-CE and [emb]-CE respectively for different combinations of the number of anchor queries (k_q) and anchor items (k_i). Both anchor queries and anchor items are chosen uniformly at random, and for a given set of anchor queries, we evaluate on the remaining set of queries.

[cls]-CE versus [emb]-CE

For the same choice of anchor queries and anchor items, the proposed method performs better with the [emb]-CE model as compared to [cls]-CE, due to the query-item score matrix for [emb]-CE having a much lower rank, which makes it easier to approximate.

Effect of k_q and k_i

Recall that the indexing time for annCUR is directly proportional to the number of anchor queries (k_q), while the number of anchor items (k_i) influences the test-time inference latency. Unsurprisingly, the performance of annCUR increases as we increase k_q and k_i, and these can be tuned as per the user's requirements to obtain the desired recall-vs-indexing-time and recall-vs-inference-time trade-offs. We refer the reader to Appendix B.2 for a detailed explanation of the drop in performance when k_q = k_i.

3.4 Item-Item Similarity Baselines

Figure 7: Bar plot showing Top-100-Recall for domain=YuGiOh when indexing using 500 anchor items for fixedITEM and itemCUR and 500 anchor queries for annCUR.

We additionally compare with the following baselines that index items by comparing against a fixed set of anchor items instead of anchor queries (see Appendix A.2 for details on computing item-item scores using a CE model trained to score query-item pairs).


  • fixedITEM: Embed all items and the test query using CE scores against a fixed set of anchor items chosen uniformly at random, and retrieve the top-scoring items for the test query based on the dot-product of these embeddings. We use 500 anchor items (see Figure 7).

  • itemCUR: This is similar to the proposed approach except that it indexes the items by comparing them against anchor items instead of anchor queries when computing the R and W matrices in the indexing step of §2.3. At test time, it performs inference just like annCUR (see §2.4) by comparing the test query against a different, fixed set of anchor items. We again use 500 anchor items for indexing.

Figure 7 shows Top-100-Recall for fixedITEM, itemCUR, and annCUR on domain = YuGiOh. itemCUR performs better than fixedITEM, indicating that the latent item embeddings produced using CUR decomposition of the item-item similarity matrix are better than those built by comparing the items against a fixed set of anchor items. itemCUR performs worse than annCUR, apparently because the CE was trained on query-item pairs and is not calibrated for item-item comparisons.

4 Related Work

Matrix Decomposition

Classic matrix decomposition methods such as SVD and QR decomposition have been used for approximating kernel matrices and distance matrices Musco and Woodruff (2017); Tropp et al. (2017); Bakshi and Woodruff (2018); Indyk et al. (2019). Interpolative decomposition methods such as the Nyström method and CUR decomposition allow approximation of the matrix even when given only a subset of its rows and columns. Unsurprisingly, the performance of these methods can be further improved if the entire matrix is available, as this allows for a better selection of the rows and columns used in the decomposition process Goreinov et al. (1997); Drineas et al. (2005); Kumar et al. (2012); Wang and Zhang (2013). Recent work Ray et al. (2022) proposes sublinear Nyström approximations and considers CUR-based approaches for approximating non-PSD similarity matrices that arise in NLP tasks such as coreference resolution and document classification. Unlike previous work, our goal is to use the approximate scores to support retrieval of top-scoring items. Although matrix decomposition methods for sparse matrices based on SVD Berry (1992); Keshavan et al. (2010); Hastie et al. (2015); Ramlatchan et al. (2018) could be used instead of CUR decomposition, such methods would require (a) factorizing a sparse matrix at test time in order to obtain latent embeddings for all items and the test query, and (b) indexing the latent item embeddings to efficiently retrieve top-scoring items for the given query. In this work, we use CUR decomposition because, unlike other sparse matrix decomposition methods, it allows for offline computation and indexing of item embeddings, and the latent embedding for a test query is obtained simply from its cross-encoder scores against the anchor items.

Cross-Encoders and Distillation

Due to their high computational cost, the use of cross-encoders (CE) is often limited to either scoring a fixed set of items or re-ranking items retrieved by a separate (cheaper) retrieval model Logeswaran et al. (2019); Qu et al. (2021); Bhattacharyya et al. (2021); Ayoola et al. (2022). CE models are also widely used for training computationally cheaper models via distillation on the training domain (Wu et al., 2020; Reddi et al., 2021), or for improving the performance of these cheaper models on the target domain Chen et al. (2020); Thakur et al. (2021) by using cross-encoders to score a fixed or heuristically retrieved set of items/datapoints. The DE baselines used in this work, in contrast, are trained using the k-nearest neighbors of each query according to the CE.

Nearest Neighbor Search

For applications where the inputs are described as vectors in ℝ^d, nearest neighbor search has been widely studied for various (dis-)similarity functions such as ℓ_2 distance Chávez et al. (2001); Hjaltason and Samet (2003), inner-product Jegou et al. (2010); Johnson et al. (2019); Guo et al. (2020), and Bregman divergences Cayton (2008). Recent work on nearest neighbor search with non-metric (parametric) similarity functions explores various tree-based Boytsov and Nyberg (2019b) and graph-based Boytsov and Nyberg (2019a); Tan et al. (2020, 2021) nearest neighbor search indices. In contrast, our approach approximates the scores of the parametric similarity function using latent embeddings generated via CUR decomposition and uses off-the-shelf maximum inner-product search methods with these latent embeddings to find k-nearest neighbors for the CE. An interesting avenue for future work would be to combine our approach with tree-based and graph-based approaches to further improve the efficiency of these search methods.

5 Conclusion

In this paper, we proposed annCUR, a matrix factorization-based approach for nearest neighbor search with a cross-encoder model that does not rely on an auxiliary model such as a dual-encoder for retrieval. annCUR approximates the test query's scores for all items by scoring the test query against only a small number of anchor items, and retrieves items using the approximate scores. Empirically, for k > 10, our approach provides test-time recall-vs-computational cost trade-offs superior to the widely-used approach of using cross-encoders to re-rank items retrieved by a dual-encoder or a tf-idf-based model. This work is a step towards enabling efficient retrieval with expensive similarity functions such as cross-encoders, and thus moving beyond using such models merely for re-ranking items retrieved by auxiliary retrieval models such as dual-encoders and tf-idf-based models.

Acknowledgements

We thank members of UMass IESL for helpful discussions and feedback. This work was supported in part by the Center for Data Science and the Center for Intelligent Information Retrieval, in part by the National Science Foundation under Grant No. NSF1763618, in part by the Chan Zuckerberg Initiative under the project “Scientific Knowledge Base Construction”, in part by International Business Machines Corporation Cognitive Horizons Network agreement number W1668553, and in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative. Rico Angell was supported by the NSF Graduate Research Fellowship under Grant No. 1938059. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor(s).

Limitations

In this work, we use cross-encoders parameterized using transformer models. Computing query-item scores using such models can be computationally expensive. For instance, on an NVIDIA GeForce RTX 2080Ti GPU with 12GB memory, we achieve a throughput of approximately 140 scores/second, and computing a score matrix for 100 queries and 10K items takes about two hours. Although this computation can be trivially parallelized, the total number of GPU hours required can be very high. Note, however, that these scores need to be computed even for the distillation-based DE baselines, as we need to identify the k-nearest neighbors of each query according to the cross-encoder model in order to train a dual-encoder model for this task.

Our proposed approach allows for indexing the item set using only scores from the cross-encoder, without any additional gradient-based training, but it is not immediately clear how it can benefit from data on multiple target domains at the same time. Parametric models such as dual-encoders, on the other hand, can benefit from training and knowledge distillation on multiple domains at the same time.

Ethical Consideration

Our proposed approach considers how to speed up the computation of nearest neighbor search for cross-encoder models. The cross-encoder model, which our approach approximates, may have certain biases / error tendencies. Our proposed approach does not attempt to mitigate those biases. It is not clear how those biases would propagate in our approximation, which we leave for future work. An informed user would scrutinize both the cross-encoder model and the resulting approximations used in this work.

References

  • T. Ayoola, S. Tyagi, J. Fisher, C. Christodoulopoulos, and A. Pierleoni (2022) ReFinED: an efficient zero-shot-capable approach to end-to-end entity linking. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, pp. 209–220. Cited by: §4.
  • A. Bakshi and D. Woodruff (2018) Sublinear time low-rank approximation of distance matrices. Advances in Neural Information Processing Systems. Cited by: §4.
  • M. W. Berry (1992) Large-scale sparse singular value computations. The International Journal of Supercomputing Applications 6 (1), pp. 13–49. Cited by: §4.
  • S. Bhattacharyya, A. Rooshenas, S. Naskar, S. Sun, M. Iyyer, and A. McCallum (2021) Energy-based reranking: improving neural machine translation using energy-based models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4528–4537. Cited by: §4.
  • L. Boytsov and E. Nyberg (2019a) Accurate and fast retrieval for complex non-metric data via neighborhood graphs. In International Conference on Similarity Search and Applications, pp. 128–142. Cited by: §4.
  • L. Boytsov and E. Nyberg (2019b) Pruning algorithms for low-dimensional non-metric k-nn search: a case study. In International Conference on Similarity Search and Applications, pp. 72–85. Cited by: §4.
  • L. Cayton (2008) Fast nearest neighbor retrieval for bregman divergences. In International Conference on Machine Learning, pp. 112–119. Cited by: §4.
  • E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín (2001) Searching in metric spaces. ACM computing surveys (CSUR) 33 (3), pp. 273–321. Cited by: §4.
  • J. Chen, L. Yang, K. Raman, M. Bendersky, J. Yeh, Y. Zhou, M. Najork, D. Cai, and E. Emadzadeh (2020) DiPair: fast and accurate distillation for trillion-scale text matching and pair modeling. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2925–2937. Cited by: §1, §4.
  • D. Das, L. Sahoo, and S. Datta (2017) A survey on recommendation system. International Journal of Computer Applications 160 (7). Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: 1st item.
  • P. Drineas, M. W. Mahoney, and N. Cristianini (2005) On the nyström method for approximating a gram matrix for improved kernel-based learning.. The Journal of Machine Learning Research 6 (12). Cited by: §4.
  • S. A. Goreinov, E. E. Tyrtyshnikov, and N. L. Zamarashkin (1997) A theory of pseudoskeleton approximations. Linear Algebra and its Applications 261 (1-3), pp. 1–21. Cited by: §2.1, §4.
  • R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, and S. Kumar (2020) Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pp. 3887–3896. Cited by: §2.4, §4.
  • C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant (2020) Array programming with NumPy. Nature 585 (7825), pp. 357–362. Cited by: §3.3.
  • T. Hastie, R. Mazumder, J. D. Lee, and R. Zadeh (2015) Matrix completion and low-rank svd via fast alternating least squares. The Journal of Machine Learning Research 16 (1), pp. 3367–3402. Cited by: §4.
  • G. R. Hjaltason and H. Samet (2003) Index-driven similarity search in metric spaces (survey article). ACM Transactions on Database Systems (TODS) 28 (4), pp. 517–580. Cited by: §4.
  • S. Hofstätter, S. Althammer, M. Schröder, M. Sertkan, and A. Hanbury (2020) Improving efficient neural ranking models with cross-architecture knowledge distillation. ArXiv abs/2010.02666. Cited by: §1.
  • S. Hofstätter, N. Rekabsaz, C. Eickhoff, and A. Hanbury (2019) On the effect of low-frequency terms on neural-ir models. In ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1137–1140. Cited by: §1.
  • S. Humeau, K. Shuster, M. Lachaux, and J. Weston (2020) Poly-encoders: transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. In International Conference on Learning Representations, ICLR, Cited by: §1.
  • P. Indyk, A. Vakilian, T. Wagner, and D. P. Woodruff (2019) Sample-optimal low-rank approximation of distance matrices. In Conference on Learning Theory, pp. 1723–1751. Cited by: §4.
  • H. Jegou, M. Douze, and C. Schmid (2010) Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 117–128. Cited by: §4.
  • J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data 7 (3), pp. 535–547. Cited by: §2.4, §4.
  • V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781. Cited by: §1.
  • R. H. Keshavan, A. Montanari, and S. Oh (2010) Matrix completion from a few entries. IEEE transactions on information theory 56 (6), pp. 2980–2998. Cited by: §4.
  • O. Khattab and M. Zaharia (2020) Colbert: efficient and effective passage search via contextualized late interaction over bert. In ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48. Cited by: §1.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In International Conference on Learning Representations, ICLR, Cited by: §A.1.
  • G. J. Kowalski (2007) Information retrieval systems: theory and implementation. Vol. 1, Springer. Cited by: §1.
  • S. Kumar, M. Mohri, and A. Talwalkar (2012) Sampling methods for the nyström method. The Journal of Machine Learning Research 13 (1), pp. 981–1006. Cited by: §4.
  • F. Liu, Y. Jiao, J. Massiah, E. Yilmaz, and S. Havrylov (2022) Trans-encoder: unsupervised sentence-pair modelling through self-and mutual-distillations. In International Conference on Learning Representations, ICLR, Cited by: §1.
  • L. Logeswaran, M. Chang, K. Lee, K. Toutanova, J. Devlin, and H. Lee (2019) Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3449–3460. Cited by: §1, §3, §4.
  • W. Lu, J. Jiao, and R. Zhang (2020) Twinbert: distilling knowledge to twin-structured compressed bert models for large-scale retrieval. In ACM International Conference on Information & Knowledge Management, pp. 2645–2652. Cited by: §1.
  • Y. Luan, J. Eisenstein, K. Toutanova, and M. Collins (2021) Sparse, dense, and attentional representations for text retrieval. Transactions of the Association for Computational Linguistics 9, pp. 329–345. Cited by: §1.
  • M. W. Mahoney and P. Drineas (2009) CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences 106 (3), pp. 697–702. Cited by: §B.2, §1, §2.1.
  • Y. A. Malkov and D. A. Yashunin (2018) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (4), pp. 824–836. Cited by: §2.4.
  • C. Musco and D. P. Woodruff (2017) Sublinear time low-rank approximation of positive semidefinite matrices. In IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 672–683. Cited by: §4.
  • Y. Qu, Y. Ding, J. Liu, K. Liu, R. Ren, W. X. Zhao, D. Dong, H. Wu, and H. Wang (2021) RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5835–5847. Cited by: §1, §1, §4.
  • A. Ramlatchan, M. Yang, Q. Liu, M. Li, J. Wang, and Y. Li (2018) A survey of matrix completion methods for recommendation systems. Big Data Mining and Analytics 1 (4), pp. 308–323. Cited by: §4.
  • A. Ray, N. Monath, A. McCallum, and C. Musco (2022) Sublinear time approximation of text similarity matrices. Proceedings of the AAAI Conference on Artificial Intelligence 36 (7), pp. 8072–8080. Cited by: §B.2, §4.
  • S. Reddi, R. K. Pasumarthi, A. Menon, A. S. Rawat, F. Yu, S. Kim, A. Veit, and S. Kumar (2021) Rankdistil: knowledge distillation for ranking. In International Conference on Artificial Intelligence and Statistics, pp. 2368–2376. Cited by: §4.
  • S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al. (1995) Okapi at trec-3. NIST Special Publication 109, pp. 109. Cited by: §2.1.
  • S. Tan, W. Zhao, and P. Li (2021) Fast neural ranking on bipartite graph indices. Proceedings of the VLDB Endowment 15 (4), pp. 794–803. Cited by: §4.
  • S. Tan, Z. Zhou, Z. Xu, and P. Li (2020) Fast item ranking under neural network based measures. In International Conference on Web Search and Data Mining, pp. 591–599. Cited by: §4.
  • N. Thakur, N. Reimers, J. Daxenberger, and I. Gurevych (2021) Augmented SBERT: data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 296–310. Cited by: §1, §4.
  • J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher (2017) Randomized single-view algorithms for low-rank matrix approximation. Cited by: §4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems. Cited by: §2.1.
  • S. Wang and Z. Zhang (2013) Improving cur matrix decomposition and the nyström approximation via adaptive sampling. The Journal of Machine Learning Research 14 (1), pp. 2729–2769. Cited by: §4.
  • L. Wu, F. Petroni, M. Josifoski, S. Riedel, and L. Zettlemoyer (2020) Scalable zero-shot entity linking with dense entity retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6397–6407. Cited by: §1, §3.1, §4.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §A.1.
  • H. Zamani, M. Dehghani, W. B. Croft, E. Learned-Miller, and J. Kamps (2018) From neural re-ranking to neural ranking: learning a sparse representation for inverted indexing. In ACM International Conference on Information and Knowledge Management, pp. 497–506. Cited by: §1.
  • W. Zhang and K. Stratos (2021) Understanding hard negatives in noise contrastive estimation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1090–1101. Cited by: §1, §3.1, footnote 2.

Appendix A Training Details

A.1 Training DE and CE for Entity Linking on ZeShEL

We initialize all models with bert-base-uncased and train using the Adam optimizer Kingma and Ba (2015) with a warm-up proportion of 0.01 for four epochs. We evaluate on the dev set five times during each epoch, and pick the model checkpoint that maximizes accuracy on the dev set. While training the dual-encoder model, we update negatives after each epoch using the latest dual-encoder model parameters to mine hard negatives. We train the cross-encoder with a fixed set of 63 negative items (entities) for each query (mention) mined using the dual-encoder model. We use batch sizes of 8 and 4 for training the dual-encoder and cross-encoder respectively.

The dual-encoder and cross-encoder models took 34 and 44 hours respectively to train on two NVIDIA GeForce RTX 8000 GPUs, each with 48GB memory. The dual-encoder model has 2 × 110M parameters as it consists of separate query and item encoders, while the cross-encoder model has 110M parameters.

Tokenization details

We use word-piece tokenization Wu et al. (2016) with a maximum of 128 tokens, including special tokens, for tokenizing entities and mentions. The mention representation consists of the word-piece tokens of the context surrounding the mention and the mention itself:

[CLS] ctxtl [Ms] ment [Me] ctxtr [SEP]

where ment, ctxtl, and ctxtr are word-piece tokens of the mention, context before and after the mention respectively, and [Ms], [Me] are special tokens to tag the mention.

The entity representation is also composed of word-piece tokens of the entity title and description. The input to our entity model is:

[CLS] title [ENT] description [SEP]

where title, description are word-piece tokens of the entity title and description, and [ENT] is a special token separating the entity title from its description.

Domain | # Entities | # Mentions | # Mentions in k-NN experiments
Training Data
Military | 104520 | 13063 | 2400
Pro Wrestling | 10133 | 1392 | 1392
Doctor Who | 40281 | 8334 | 4000
American Football | 31929 | 3898 | -
Fallout | 16992 | 3286 | -
Star Wars | 87056 | 11824 | -
World of Warcraft | 27677 | 1437 | -
Validation Data
Coronation Street | 17809 | 1464 | -
Muppets | 21344 | 2028 | -
Ice Hockey | 28684 | 2233 | -
Elder Scrolls | 21712 | 4275 | -
Test Data
Star Trek | 34430 | 4227 | 4227
YuGiOh | 10031 | 3374 | 3374
Forgotten Realms | 15603 | 1200 | -
Lego | 10076 | 1199 | -
Table 1: Statistics on the number of entities (items), the total number of mentions (queries), and the number of mentions used in the k-NN experiments of §3.2 for each domain in the ZeShEL dataset.

The cross-encoder model takes as input the concatenated query (mention) and item (entity) representation with the [CLS] token stripped off the item (entity) tokenization as shown below

[CLS] ctxtl [Ms] ment [Me] ctxtr [SEP]
title [ENT] description [SEP]

A.2 Using a query-item CE model for computing item-item similarity

We compute item-item similarity using a cross-encoder trained to score query-item pairs as follows. The query and item in our case correspond to a mention of an entity with surrounding context and an entity with its associated title and description respectively. We feed the first entity in the pair into the query slot by placing mention span tokens around the title of the entity and using the entity description to fill in the right context of the mention. We feed the second entity into the entity slot as usual. The concatenated representation of the entity pair is given by

[CLS] [Ms] te1 [Me] de1 [SEP] te2 [E] de2 [SEP]

where te1, te2 are the tokenized titles of the entities, de1, de2 are the tokenized entity descriptions, [Me], [Ms] are special tokens denoting mention span boundary and [E] is a special token separating entity title from its description.

A.3 Training DE for k-NN retrieval with CE

We train dual-encoder models using the k-nearest neighbor items according to the cross-encoder model for each query, using two loss functions. Let S_ce and S_de be matrices containing the cross-encoder and dual-encoder scores, respectively, of all items for each query in the training data. Let N_ce(q) and N_de(q) be the top-k items for query q according to the cross-encoder and the dual-encoder respectively, and let N'_de(q) be the top-k items for query q according to the dual-encoder that are not present in N_ce(q).

We use the two loss functions described below for training the dual-encoder model using a cross-encoder model.

The first is the cross-entropy loss between the dual-encoder and cross-encoder query-item score distributions, obtained by applying a softmax over the item scores for each query. Due to computational and memory limitations, we minimize this loss over a restricted set of candidate items for each query q rather than over all items.

The second loss treats each item in N_ce(q) as a positive item and pairs it with hard negatives from N'_de(q); minimizing it increases the dual-encoder's scores for items in N_ce(q), thus aligning N_de(q) with N_ce(q) for queries in the training data.
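As an illustration, here is a minimal sketch of the first (cross-entropy) loss under these definitions. It is a generic soft cross-entropy distillation objective, not the authors' exact implementation, and the candidate-set construction (how many and which items per query) is assumed rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(de_scores: torch.Tensor, ce_scores: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between DE (student) and CE (teacher) score distributions.

    de_scores, ce_scores: (num_queries, num_candidates) raw scores over the
    per-query candidate items."""
    target = F.softmax(ce_scores, dim=-1)         # teacher (CE) distribution
    log_pred = F.log_softmax(de_scores, dim=-1)   # student (DE) log-distribution
    return -(target * log_pred).sum(dim=-1).mean()

# toy usage with random scores for 8 queries and 64 candidate items each
loss = soft_cross_entropy(torch.randn(8, 64), torch.randn(8, 64))
print(loss.item())
```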

Training and optimization details

We train all dual-encoder models using the Adam optimizer for 10 epochs. We use a separate set of parameters for the query and item encoders. We use 10% of the training queries for validation and train on the remaining 90%. For each domain and training data size, we train with both loss functions and pick the model that performs best on the validation queries for k-NN retrieval according to the cross-encoder model.

We train models with one of the losses on two NVIDIA GeForce RTX 2080Ti GPUs with 12GB GPU memory and with the other loss on two NVIDIA GeForce RTX 8000 GPUs with 48GB GPU memory, as the latter could not be trained on 2080Tis due to GPU memory limitations. For the loss that uses hard negatives N'_de(q), we update the list of negative items for each query after each epoch by mining hard negative items using the latest dual-encoder model parameters.

# train/anchor queries | Model | CE score matrix (hrs) | DE training (hrs) | Total indexing (hrs)
100 | DE (distilled) | 2 | 2.5 | 4.5
100 | DE (distilled) | 2 | 0.5 | 2.5
100 | annCUR | 2 | - | 2
500 | DE (distilled) | 10 | 4.5 | 14.5
500 | DE (distilled) | 10 | 1 | 11
500 | annCUR | 10 | - | 10
2000 | DE (distilled) | 40 | 11 | 51
2000 | DE (distilled) | 40 | 3 | 43
2000 | annCUR | 40 | - | 40
(a) Indexing time (in hrs) for annCUR and distillation-based DE baselines for different numbers of anchor/train queries on domain = YuGiOh. The two DE rows per setting correspond to the two distillation losses described in §A.3.
Domain (w/ size) | Model | CE score matrix (hrs) | DE training (hrs) | Total indexing (hrs)
YuGiOh-10K | DE (distilled) | 10 | 4.5 | 14.5
YuGiOh-10K | annCUR | 10 | - | 10
Pro_Wrest-10K | DE (distilled) | 10 | 4.4 | 14.4
Pro_Wrest-10K | annCUR | 10 | - | 10
Star_Trek-34K | DE (distilled) | 40 | 5.1 | 45.1
Star_Trek-34K | annCUR | 40 | - | 40
Doctor_Who-40K | DE (distilled) | 40 | 5.2 | 45.2
Doctor_Who-40K | annCUR | 40 | - | 40
Military-104K | DE (distilled) | 102 | 5.1 | 107.1
Military-104K | annCUR | 102 | - | 102
(b) Indexing time (in hrs) for annCUR and distillation-based DE baselines for various domains when using 500 anchor/train queries.
Table 2: Indexing time breakdown for annCUR and DE models trained via distillation.
Training Negatives | [cls]-CE (Dev) | [emb]-CE (Dev) | [cls]-CE (Test) | [emb]-CE (Test)
Random | 59.60 | 57.74 | 58.72 | 56.56
tf-idf | 62.19 | 62.29 | 58.20 | 58.36
DE | 67.67 | 66.86 | 65.87 | 65.49
(a) Macro-average of entity linking accuracy for [cls]-CE and [emb]-CE models on the dev and test sets in ZeShEL.
Training Negatives | [cls]-CE | [emb]-CE
Random | 816 | 354
tf-idf | 396 | 67
DE | 315 | 45
(b) Rank of the mention-entity cross-encoder score matrix for test domain = YuGiOh.
Table 3: Accuracy on the downstream task of entity linking and rank of query-item (mention-entity) score matrix for [cls]-CE and [emb]-CE trained using different types of negatives.
Indexing and Training Time

Table 2(a) shows the overall indexing time for the proposed method annCUR and for DE models trained using the two distillation losses of §A.3 on domain = YuGiOh. The training time differs substantially between the two losses: the loss with the higher GPU memory requirements is trained on more powerful GPUs (two NVIDIA RTX 8000s with 48GB memory each) and has the much lower training time, while the other loss is trained on two NVIDIA 2080Ti GPUs with 12GB memory each. The total indexing time for DE models includes the time taken to compute the CE score matrix, because in order to train a DE model for the task of k-nearest neighbor search for a CE, we first need to find the exact k-nearest neighbor items for the training queries. Note that this is different from the "standard" way of training DE models via distillation, where the DE is often distilled using CE scores on a fixed or heuristically retrieved set of items rather than on the k-nearest neighbor items according to the cross-encoder for a given query.

Table 2(b) shows indexing time for annCUR and for DEs trained via distillation on five domains in ZeShEL. As the size of the domain increases, the time taken to compute cross-encoder scores for the training queries also increases. The time taken to train the dual-encoder via distillation remains roughly the same, since we train with a fixed number of positive and negative items per query during distillation.
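Restating the table columns as a simple accounting identity (this assumes, as in Table 2, that CE scoring and DE training are the only two components of indexing time):

$$T_{\text{index}}^{\text{DE-distill}} = T_{\text{CE-scores}} + T_{\text{DE-train}}, \qquad T_{\text{index}}^{\text{annCUR}} = T_{\text{CE-scores}}$$

For example, for YuGiOh-10K with 500 anchor/train queries, the distilled DE needs 10 + 4.5 = 14.5 hrs, whereas annCUR needs only the 10 hrs of CE scoring.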

Appendix B Additional Results and Analysis

B.1 Comparing [emb]-CE and [cls]-CE

In addition to training cross-encoder models with negatives mined using a dual-encoder, we train both [cls]-CE and [emb]-CE models using random negatives and negatives mined using tf-idf embeddings of mentions and entities. To evaluate the cross-encoder models, we retrieve 64 entities for each test mention using a dual-encoder model and re-rank them using a cross-encoder model.

Table 3(a) shows macro-averaged accuracy on the downstream task of entity linking over the test and dev domains of the ZeShEL dataset, and Table 3(b) shows the rank of the query-item score matrices on domain=YuGiOh for both cross-encoder models. The proposed [emb]-CE model performs on par with the widely used [cls]-CE architecture for all three negative-mining strategies while producing a query-item score matrix of lower rank than [cls]-CE.

Figure 8 shows the approximation error of annCUR for different combinations of the number of anchor queries and anchor items, for [cls]-CE and [emb]-CE. For a given set of anchor queries, the approximation error is evaluated on the remaining queries. The error between a matrix R and its approximation R̂ is measured as ||R − R̂||_F / ||R||_F, where ||·||_F denotes the Frobenius norm. For the same choice of anchor queries and anchor items, the approximation error is lower for the [emb]-CE model than for [cls]-CE. This aligns with the observation that the rank of the query-item score matrix from [emb]-CE is lower than that of the corresponding matrix from [cls]-CE, as shown in Table 3(b).
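For concreteness, this error metric can be computed as follows (a minimal sketch; `R` is the cross-encoder score matrix on non-anchor queries and `R_hat` its CUR-based approximation, both assumed to be available as dense numpy arrays):

```python
import numpy as np

def relative_frobenius_error(R: np.ndarray, R_hat: np.ndarray) -> float:
    """||R - R_hat||_F / ||R||_F, the approximation error plotted in Figure 8."""
    return np.linalg.norm(R - R_hat, ord="fro") / np.linalg.norm(R, ord="fro")
```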

(a) [cls]-CE
(b) [emb]-CE
Figure 8: Matrix approximation error evaluated on non-anchor queries for CUR decomposition on domain = YuGiOh for [cls]-CE and [emb]-CE models. The total number of queries including both anchor and non-anchor (test) queries is 3374 and the total number of items is 10031.

B.2 Understanding the poor performance of annCUR when the number of anchor queries equals the number of anchor items

(a) Top-10-Recall@Cost=500
(b) Matrix Approx. Error
Figure 9: Performance of annCUR on non-anchor/test queries on domain = YuGiOh for the [emb]-CE model. The total number of queries including both anchor and non-anchor (test) queries is 3374 and the total number of items is 10031.

Figure 6 in §3.3 shows Top-10-Recall@500 on domain=YuGiOh for [cls]-CE and [emb]-CE, respectively, for different combinations of the number of anchor queries and anchor items. Note that the performance of annCUR drops significantly when the number of anchor queries equals the number of anchor items. Recall that the indexing step (§2.3) requires computing the pseudo-inverse of the matrix of scores between anchor queries and anchor items. When this matrix is square, i.e., when the number of anchor queries equals the number of anchor items, it tends to be ill-conditioned, with several very small singular values that are 'blown up' in its pseudo-inverse. This, in turn, leads to a significant approximation error Ray et al. (2022). Choosing different numbers of anchor queries and anchor items yields a rectangular matrix whose smallest singular values are unlikely to be near zero, resulting in a much better approximation of the score matrix.
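The following numpy sketch reflects our reading of the test-time approximation in §2.3 (variable names are illustrative, not the paper's notation): a test query is scored against the anchor items, and its scores on all remaining items are filled in through the pseudo-inverse of the anchor-query-by-anchor-item matrix.

```python
import numpy as np

def approx_test_scores(c_test: np.ndarray, S_anchor: np.ndarray, R: np.ndarray) -> np.ndarray:
    """CUR-style approximation of a test query's scores against all items.

    c_test:   (k_i,)         test-query CE scores on the k_i anchor items
    S_anchor: (k_q, k_i)     anchor-query x anchor-item CE score matrix
    R:        (k_q, n_items) anchor-query CE scores against all items
    """
    # pinv(S_anchor) is computed once at indexing time; when k_q == k_i the
    # matrix is square and tends to be ill-conditioned, which is the failure
    # mode discussed above.
    U = np.linalg.pinv(S_anchor)     # (k_i, k_q)
    return c_test @ U @ R            # (n_items,)
```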

Oracle CUR Decomposition

An alternative way of computing the linking matrix U in the CUR decomposition of a matrix M, for a given subset of rows R and columns C, is to set U = C⁺ M R⁺, where ⁺ denotes the Moore-Penrose pseudo-inverse. This can provide a much more stable approximation of M, even when the number of selected rows equals the number of selected columns (Mahoney and Drineas, 2009). However, it requires computing all entries of M before forming its low-rank approximation. In our case, the matrix we approximate also contains scores between test queries and all items, and the whole point is to avoid scoring all items with the CE model at test time, so we cannot use this choice of U in practice. Figure 9 shows results for an oracle experiment where we do use it; as expected, it provides a significant improvement when the numbers of anchor queries and anchor items are equal, and a minor improvement otherwise, over the standard pseudo-inverse of the anchor score matrix.
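A sketch of this oracle variant under the same illustrative names, where `M` is the full query-item score matrix (including test queries), which is exactly what makes it unusable outside of an upper-bound analysis:

```python
import numpy as np

def oracle_linking_matrix(M: np.ndarray, anchor_query_idx, anchor_item_idx) -> np.ndarray:
    """U = pinv(C) @ M @ pinv(R), the stabler linking matrix of Mahoney and
    Drineas (2009); requires every entry of M to be scored up front."""
    C = M[:, anchor_item_idx]        # (n_queries, k_i) columns at anchor items
    R = M[anchor_query_idx, :]       # (k_q, n_items)   rows at anchor queries
    return np.linalg.pinv(C) @ M @ np.linalg.pinv(R)   # (k_i, k_q)
```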

B.3 k-NN experiment results for all domains

For brevity, the main paper reports results for all top-k values only for domain=YuGiOh. For completeness and for interested readers, we include here results for all combinations of top-k value, domain, and training-data size. Figures 10-24 contain results for the domains YuGiOh, Pro_Wrestling, Doctor_Who, Star_Trek, and Military, and for the training-data sizes considered in our experiments. Since the Pro_Wrestling domain contains only 1392 queries, we use a maximum training-data size of 1000 instead of 2000.

Each of Figures 10-24 shares the same two-panel layout:
(a) Top-k-Recall@m when all methods retrieve and re-rank the same number of items m; the subscript in the method name refers to the number of anchor items used for embedding the test query.
(b) Top-k-Recall when all methods operate under a fixed test-time cost budget; cost is the number of CE calls made during inference for re-ranking retrieved items and, in the case of annCUR, also includes the CE calls used to embed the test query by comparing it with the anchor items.
Figures 10-12: domain = YuGiOh. Figures 13-15: domain = Pro_Wrestling. Figures 16-18: domain = Doctor_Who. Figures 19-21: domain = Star_Trek. Figures 22-24: domain = Military. Within each domain, the three figures correspond to the three training-data sizes.