1 Introduction
Finding top-scoring items for a given query is a fundamental subroutine of recommendation and information retrieval systems Kowalski (2007); Das et al. (2017). For instance, in question answering systems, the query corresponds to a question and the item corresponds to a document or a passage. Neural networks are widely used to model the similarity between a query and an item in such applications
Zamani et al. (2018); Hofstätter et al. (2019); Karpukhin et al. (2020); Qu et al. (2021). In this work, we focus on efficient nearest neighbor search for one such similarity function – the cross-encoder model. Cross-encoder models output a scalar similarity score by jointly encoding the query-item pair and often generalize better to new domains and unseen data Chen et al. (2020); Wu et al. (2020); Thakur et al. (2021) as compared to dual-encoder models (also referred to as two-tower models or Siamese networks), which independently embed the query and the item in a vector space and use simple functions such as dot-product to measure similarity. However, due to the black-box nature of the cross-encoder-based similarity function, the computational cost of brute-force search with cross-encoders is prohibitively high. This often limits the use of cross-encoder models to reranking items retrieved using a separate retrieval model such as a dual-encoder or a tf-idf-based model Logeswaran et al. (2019); Zhang and Stratos (2021); Qu et al. (2021). The accuracy of such a two-stage approach is upper-bounded by the recall of relevant items by the initial retrieval model. Much recent work either attempts to distill information from an expensive but more expressive cross-encoder model into a cheaper student model such as a dual-encoder Wu et al. (2020); Hofstätter et al. (2020); Lu et al. (2020); Qu et al. (2021); Liu et al. (2022), or focuses on cheaper alternatives to the cross-encoder model while attempting to capture fine-grained interactions between the query and the item Humeau et al. (2020); Khattab and Zaharia (2020); Luan et al. (2021).
In this work, we tackle the fundamental task of efficient nearest neighbor search for a given query according to the cross-encoder. Our proposed approach, annCUR, uses CUR decomposition Mahoney and Drineas (2009), a matrix factorization approach, to approximate cross-encoder scores for all items, and retrieves the nearest neighbor items while making only a small number of calls to the cross-encoder. Our method selects a fixed set of anchor queries and anchor items, and uses scores between the anchor queries and all items to generate latent embeddings for indexing the item set. At test time, we generate a latent embedding for the query using cross-encoder scores between the test query and the anchor items, and use it to approximate the scores of all items for the given query and/or retrieve the top items according to the approximate scores. In contrast to distillation-based approaches, our approach does not involve any additional compute-intensive training of a student model such as a dual-encoder via distillation.
In general, the performance of a matrix-factorization-based method depends on the rank of the matrix being factorized. In our case, the entries of the matrix are cross-encoder scores for query-item pairs. To reduce the rank of the score matrix, and in turn improve the performance of the proposed matrix factorization approach, we propose [emb]CE, which uses a novel dot-product-based scoring mechanism for cross-encoder models (see Figure 0(a)). In contrast to the widely used [cls]CE approach of pooling the query-item representation into a single vector followed by scoring using a linear layer, [emb]CE produces a score matrix with a much lower rank while performing at par with [cls]CE on the downstream task.
We run extensive experiments with cross-encoder models trained for the downstream task of entity linking. The query and item in this case correspond to a mention of an entity in text and a document with an entity description, respectively. For the task of retrieving the nearest neighbors according to the cross-encoder, our proposed approach presents superior recall-vs-computational-cost trade-offs over using dual-encoders trained via distillation as well as over unsupervised tf-idf-based methods (§3.2). We also evaluate the proposed method for various indexing and test-time cost budgets, and study the effect of various design choices, in §3.3 and §3.4.
2 Matrix Factorization for Nearest Neighbor Search
2.1 Task Description and Background
Given a scoring function that maps a query-item pair to a scalar score, and a query, the k-nearest neighbor task is to retrieve the k top-scoring items from a fixed item set according to the given scoring function.
In NLP, queries and items are typically represented as sequences of tokens, and the scoring function is typically parameterized using deep neural models such as transformers Vaswani et al. (2017). There are two popular choices for the scoring function – the cross-encoder (CE) model and the dual-encoder (DE) model. The CE model scores a given query-item pair by concatenating the query and the item using special tokens, passing the result through a model such as a transformer to obtain a representation for the input pair, and computing the score by applying linear weights to this representation.
While effective, computing the similarity of a query-item pair requires a full forward pass of the model, which is often quite computationally burdensome. As a result, previous work uses auxiliary retrieval models such as BM25 Robertson et al. (1995) or a trained dual-encoder (DE) model to approximate the CE. The DE model independently embeds the query and the item in a d-dimensional vector space, for instance by using a transformer followed by pooling the final-layer representations into a single vector (e.g., using the cls token). The DE score for the query-item pair is computed using the dot-product of the query embedding and the item embedding.
In this work, we propose a method based on CUR matrix factorization that allows efficient retrieval of the top items by directly approximating the cross-encoder model rather than using an auxiliary (trained) retrieval model.
CUR Decomposition
In CUR matrix factorization Mahoney and Drineas (2009), a matrix G is approximated using a subset R of its rows, a subset C of its columns, and a joining matrix U as
G ≈ C U R,
where the rows of R and the columns of C correspond to chosen row and column indices respectively, and the joining matrix U controls the approximation error. In this work, we set U to be the Moore-Penrose pseudo-inverse of W, the intersection of matrices C and R, in which case C W⁺ R is known as the skeleton approximation of G Goreinov et al. (1997).
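The skeleton approximation can be sketched in a few lines of numpy; the synthetic low-rank matrix below is an illustrative stand-in for a score matrix, and the approximation is exact whenever the intersection block captures the full rank of the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic rank-5 "score" matrix: 100 queries x 200 items (illustrative).
G = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 200))

# Choose anchor rows (queries) and anchor columns (items) at random.
rows = rng.choice(100, size=20, replace=False)
cols = rng.choice(200, size=20, replace=False)

C = G[:, cols]                  # subset of columns
R = G[rows, :]                  # subset of rows
W = G[np.ix_(rows, cols)]       # intersection of C and R
U = np.linalg.pinv(W)           # joining matrix (Moore-Penrose pseudo-inverse)

G_hat = C @ U @ R               # skeleton approximation of G
print(np.abs(G - G_hat).max())  # near zero when rank(W) == rank(G)
```

With 20 random anchor rows and columns of a rank-5 matrix, the intersection W almost surely has rank 5, so the reconstruction error is at the level of floating-point noise.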
2.2 Proposed Method Overview
Our proposed method annCUR, which stands for Approximate Nearest Neighbor search using CUR decomposition, begins by selecting a fixed set of anchor queries and anchor items. It uses scores between the anchor queries and all items to index the item set by generating latent item embeddings. At test time, we compute exact scores between the test query and the anchor items, and use them to approximate the scores of all items for the given query and/or retrieve the top items according to the approximate scores. We can optionally retrieve a larger set of items, rerank them using exact scores, and return the top items.
Let R denote the matrix of CE scores between the anchor queries and all items, W its sub-matrix of scores between the anchor queries and the anchor items, and c the vector of scores between a test query and the anchor items. Using CUR decomposition, we can approximate the full query-item score matrix using the subset of its columns corresponding to the anchor items and the subset of its rows corresponding to the anchor queries. Figure 2 shows the CUR decomposition of this matrix. At test time, the approximate item scores for the test query can be computed as c W⁺ R, where c contains exact scores between the test query and the anchor items; W⁺ and R can be computed offline as they are independent of the test query.
2.3 Offline Indexing
The indexing process first computes the matrix R of scores between the anchor queries and all items. We embed all items by computing E = W⁺ R, where W⁺ is the pseudo-inverse of the sub-matrix W of R containing scores between the anchor queries and the anchor items. Each column of E corresponds to a latent item embedding.
2.4 Testtime Inference
At test time, we embed the test query using the vector of exact CE scores between the test query and the anchor items. We approximate the score for a query-item pair by the inner-product of this query embedding and the corresponding latent item embedding computed during indexing.
We can use this query embedding along with an off-the-shelf nearest-neighbor search method for maximum inner-product search Malkov and Yashunin (2018); Johnson et al. (2019); Guo et al. (2020) to retrieve the top-scoring items for the given query according to the approximate query-item scores, without explicitly approximating the scores of all items.
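The offline indexing and test-time inference steps above can be sketched end-to-end as follows. The `ce_score` function is a hypothetical stand-in with hidden low-rank structure; in practice each call would be a full transformer forward pass, and the anchor counts are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mock cross-encoder with hidden low-rank structure (a stand-in only;
# in practice each call to ce_score is an expensive forward pass).
q_feat = rng.normal(size=(500, 8))    # latent query features
i_feat = rng.normal(size=(2000, 8))   # latent item features
def ce_score(q_idx, i_idx):
    return q_feat[q_idx] @ i_feat[i_idx].T

n_items = 2000
anchor_q = rng.choice(400, size=50, replace=False)      # anchor queries
anchor_i = rng.choice(n_items, size=50, replace=False)  # anchor items

# ---- Offline indexing (Sec. 2.3) ----
R = ce_score(anchor_q, np.arange(n_items))  # anchor queries x all items
W = R[:, anchor_i]                          # anchor queries x anchor items
item_emb = np.linalg.pinv(W) @ R            # one latent embedding per column

# ---- Test-time inference (Sec. 2.4) ----
test_q = 499                                        # an unseen query
query_emb = ce_score(np.array([test_q]), anchor_i)  # |anchor_i| CE calls
approx = (query_emb @ item_emb).ravel()             # approximate all scores
top10 = np.argsort(-approx)[:10]  # retrieve; optionally rerank with exact CE
```

Only the 50 CE calls against the anchor items are paid per query; the matrix products over the precomputed item embeddings replace the remaining 1,950 CE calls.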
2.5 Time Complexity
During the indexing stage, we evaluate the CE once for each anchor query paired with each item, and compute the pseudo-inverse of the anchor-query-by-anchor-item score matrix. The overall time complexity of the indexing stage is the cost of these CE calls plus the cost of the pseudo-inverse computation. For the CE models used in this work, we observe that the CE score computation dominates the cost of the pseudo-inverse.
At test time, we compute CE scores for the test query paired with each anchor item, optionally followed by reranking the items retrieved by maximum inner-product search (MIPS). The overall time complexity of inference is the cost of these CE calls, plus the cost of MIPS over the item embeddings to retrieve the top items, plus the optional reranking.
2.6 Improving score distribution of CE models for matrix factorization
The rank of the query-item score matrix, and in turn the approximation error of a matrix factorization method, depends on the scores in the matrix. Figure 0(b) shows a histogram of the query-item score distribution (adjusted to have zero mean) for a dual-encoder and a [cls]CE model. We use [cls]CE to refer to a cross-encoder model parameterized using transformers which uses the cls token to compute a pooled representation of the input query-item pair. Both models are trained for zero-shot entity linking (see §3.1 for details). As shown in the figure, the query-item score distribution for the [cls]CE model is significantly skewed, with only a small fraction of items (entities) receiving high scores, while the score distribution for a dual-encoder model is less skewed, as it is generated explicitly using the dot-product of query and item embeddings. The skewed score distribution from [cls]CE leads to a high-rank query-item score matrix, which results in a large approximation error for matrix decomposition methods. We propose a small but important change to the scoring mechanism of the cross-encoder so that it yields a less skewed score distribution, thus making it much easier to approximate the corresponding query-item score matrix without adversely affecting downstream task performance. Instead of using the cls token representation to score a given query-item pair, we add special tokens amongst the query and the item tokens and extract contextualized query and item representations from these special tokens after jointly encoding the query-item pair using a model such as a transformer.
The final score for the given query-item pair is computed using the dot-product of the contextualized query and item embeddings.
We refer to this model as [emb]CE. Figure 0(a) shows the high-level model architecture for the dual-encoder, [cls]CE, and [emb]CE models.
As shown in Figure 0(b), the query-item score distribution from an [emb]CE model resembles that from a DE model. Empirically, we observe that the rank of the query-item score matrix for the [emb]CE model is much lower than the rank of a similar matrix computed using [cls]CE, thus making it much easier to approximate using matrix-decomposition-based methods.
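The two scoring heads can be contrasted in a minimal numpy sketch. The token representations, special-token positions, and dimensions below are all illustrative assumptions standing in for a transformer's joint encoding of one query-item pair.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768                       # hidden size (illustrative)
# Hypothetical contextualized token representations obtained by jointly
# encoding one pair, e.g. "[cls] [q_emb] query .. [i_emb] item ..".
H = rng.normal(size=(64, d))  # 64 tokens x d
q_pos, i_pos = 1, 20          # assumed positions of the special tokens

# [cls]CE: pool into the cls representation H[0], score with linear weights w.
w = rng.normal(size=d)
cls_score = float(w @ H[0])

# [emb]CE: read contextualized query/item embeddings off the special tokens
# and score with their dot-product -- no separate scoring layer.
emb_score = float(H[q_pos] @ H[i_pos])
```

The change is only in the head: [emb]CE still jointly encodes the pair (unlike a dual-encoder), but its explicit dot-product form is what yields the less skewed, lower-rank score matrix described above.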
3 Experiments
In our experiments, we use CE models trained for zero-shot entity linking on the ZeShEL dataset (§3.1). We evaluate the proposed method and various baselines on the task of finding nearest neighbors for cross-encoder models in §3.2, and evaluate the proposed method for various indexing and test-time cost budgets as well as study the effect of various design choices in §3.3 and §3.4. All resources for the paper, including code for all experiments and model checkpoints, are available at https://github.com/iesl/anncur.
ZeShEL Dataset
The Zero-Shot Entity Linking (ZeShEL) dataset was constructed by Logeswaran et al. (2019) from Wikia. The task of zero-shot entity linking involves linking entity mentions in text to an entity from a list of entities with associated descriptions. The dataset consists of 16 domains, with eight, four, and four domains in the training, dev, and test splits respectively. Each domain contains a non-overlapping set of entities; thus, at test time, mentions need to be linked to unseen entities solely based on entity descriptions. Table 1 in the appendix shows dataset statistics. In this task, queries correspond to mentions of entities along with their surrounding context, and items correspond to entities with their associated descriptions.
3.1 Training DE and CE models on ZeShEL
Following the precedent set by recent papers Wu et al. (2020); Zhang and Stratos (2021), we first train a dual-encoder model on the ZeShEL training data using hard negatives. We then train a cross-encoder model for the task of zero-shot entity linking on all eight training domains using cross-entropy loss with the ground-truth entity and negative entities mined using the dual-encoder. We refer the reader to Appendix A.1 for more details.
Results on downstream task of Entity Linking
To evaluate the cross-encoder models, we retrieve 64 entities for each test mention using the dual-encoder model and rerank them using a cross-encoder model. The top-64 entities retrieved by the DE contain the ground-truth entity for 87.95% of mentions in the test data and 92.04% of mentions in the dev data. The proposed [emb]CE model achieves an average accuracy of 65.49 and 66.86 on domains in the test and dev sets respectively, performing at par with the widely used state-of-the-art [cls]CE architecture, which achieves an accuracy of 65.87 and 67.67 on the test and dev sets respectively. (We observe that our implementation of [cls]CE obtains slightly different results compared to the state of the art – see Table 2 in Zhang and Stratos (2021) – likely due to minor implementation/training differences.) Since the [emb]CE model performs at par with [cls]CE on the downstream task of entity linking, and the rank of the score matrix from [emb]CE is much lower than that from [cls]CE, we use [emb]CE in subsequent experiments.
3.2 Evaluating on NN search for CE
Experimental Setup
For all experiments in this section, we use the [emb]CE model trained on the original ZeShEL training data for the task of zero-shot entity linking, and evaluate the proposed method and baselines on the task of retrieving the nearest neighbor entities (items) for a given mention (query) according to the cross-encoder model.
We run experiments separately on five domains from ZeShEL containing 10K to 100K items. For each domain, we compute the query-item score matrix for a subset or all of the queries (mentions) and all items (entities) in the domain. We randomly split the query set into a training set and a test set. We use the queries in the training set to train baseline DE models. For annCUR, we use the training queries as anchor queries and use CE scores between the anchor queries and all items for indexing, as described in §2.3. All approaches are then evaluated on the task of finding the top-scoring CE items for queries in the corresponding domain's test split. For a fair comparison, we do not train DE models on multiple domains at the same time.
3.2.1 Baseline Retrieval Methods
tf-idf:
All queries and items are embedded using a tf-idf vectorizer trained on item descriptions, and the top items are retrieved using the dot-product of query and item embeddings.
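This baseline can be sketched with scikit-learn; the item descriptions and query string below are illustrative placeholders, not data from ZeShEL.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative item descriptions (hypothetical stand-ins for entity pages).
item_descriptions = [
    "Dark Magician is a spellcaster monster.",
    "Blue-Eyes White Dragon is a dragon monster.",
    "Pot of Greed is a spell card.",
]
vectorizer = TfidfVectorizer().fit(item_descriptions)  # trained on items only
item_vecs = vectorizer.transform(item_descriptions)    # sparse n_items x |V|

query = "which dragon monster"
query_vec = vectorizer.transform([query])              # same vocabulary
scores = (query_vec @ item_vecs.T).toarray().ravel()   # dot-product scores
top = np.argsort(-scores)                              # retrieval order
```

Note that the vectorizer is fit only on item descriptions, mirroring the baseline: query terms outside the item vocabulary are simply ignored.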
DE models:
We experiment with DE_{base}, the DE model trained on ZeShEL for the task of entity linking (see §3.1), and the following two DE models trained via distillation from the CE.

DE_{bert+ce}: DE initialized with BERT Devlin et al. (2019) and trained only using the training signal from the cross-encoder model.

DE_{base+ce}: The DE_{base} model further fine-tuned via distillation using the cross-encoder model.
We refer the reader to Appendix A.3 for hyperparameter and optimization details.
Evaluation metric
We evaluate all approaches under the following two settings.


In the first setting, we retrieve m items for a given query, rerank them using exact CE scores, and keep the top k. We evaluate each method using Top-k-Recall@m, which is the percentage of the top-k items according to the CE model present in the m retrieved items.

In the second setting, we operate under a fixed test-time cost budget, where the cost is defined as the number of CE calls made during inference. Baselines such as DE and tf-idf use the entire cost budget for reranking items using exact CE scores, while our proposed approach has to split the budget between the number of anchor items used for embedding the query (§2.4) and the number of items retrieved for final reranking.
We refer to our proposed method as annCUR. In the first setting, a fixed set of anchor items is chosen uniformly at random; when operating under a fixed test-time cost budget, different numbers of anchor items and reranked items are used for each budget.
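The Top-k-Recall@m metric from the first setting reduces to a set overlap; a minimal sketch with toy scores:

```python
import numpy as np

def top_k_recall_at_m(exact_scores, retrieved, k):
    """Percentage of the top-k items under exact CE scores that appear
    among the m retrieved items (m = len(retrieved))."""
    top_k = np.argsort(-exact_scores)[:k]
    return 100.0 * len(set(top_k) & set(retrieved)) / k

# Toy example with 10 items: exact top-3 are items 9, 8, 7.
exact = np.arange(10, dtype=float)
retrieved = [9, 7, 0, 1]   # m = 4 retrieved items; 2 of the top-3 present
print(top_k_recall_at_m(exact, retrieved, k=3))  # ≈ 66.7
```

Under the budgeted setting, `retrieved` would be the items a method reranks within its CE-call budget rather than a fixed-size list.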
3.2.2 Results
Figures 2(a) and 2(b) show recall of the top-k cross-encoder nearest neighbors on the ZeShEL domain YuGiOh when using 500 queries for training and evaluating on the remaining 2874 test queries. Figure 2(a) shows recall when each method retrieves the same number of items, and Figure 2(b) shows recall when each method operates under a fixed inference cost budget.
Performance for k > 1
In Figure 2(a), our proposed approach outperforms all baselines at finding the top-k nearest neighbors when all models retrieve the same number of items for reranking. In Figure 2(b), when operating under the same cost budget, annCUR outperforms the DE baselines at larger cost budgets. Recall that at a smaller cost budget, annCUR is able to retrieve fewer items for exact reranking than the baselines, as it needs to spend a fraction of the cost budget – i.e., CE calls comparing the test query with the anchor items – to embed the query for retrieving relevant items. Generally, the optimal budget split allocates around 40–60% of the budget to the number of anchor items and the remaining budget to the number of items retrieved for exact reranking.
Performance for k = 1
The top-1 nearest neighbor according to the CE is likely to be the ground-truth entity (item) for the given mention (query). Note that DE_{base} was trained using a massive amount of entity linking data (all eight training domains in ZeShEL, see §3.1) with the ground-truth entity (item) as the positive item. Thus, it is natural for the top-1 nearest neighbors of both models to be aligned. For this reason, we observe that DE_{base} and DE_{base+ce} outperform annCUR for k = 1. However, our proposed approach either outperforms or is competitive with DE_{bert+ce}, a DE model trained only using CE scores for 500 queries after initializing with BERT: in Figure 2(a), our method's variants outperform DE_{bert+ce}, and in Figure 2(b), annCUR outperforms DE_{bert+ce} at larger cost budgets.
We refer the reader to Appendix B.3 for results on all combinations of k values, domains, and training data sizes.
Effect of training data size
Figure 4 shows Top-100-Recall@Cost=500 on test queries for various methods as we increase the number of queries in the training data. For the DE baselines, the trend is not consistent across all domains. On YuGiOh, performance consistently improves with more training queries. However, on Military, the performance of the distilled DE drops when going from 100 to 500 training queries but improves when going from 500 to 2000 training queries. Similarly, on Pro_Wrestling, the performance of the distilled DE_{base+ce} does not consistently improve with training data size, while it does for DE_{bert+ce}. We suspect that this is due to a combination of factors such as overfitting on training data, suboptimal hyperparameter configuration, divergence of model parameters, etc. In contrast, our proposed method, annCUR, always shows consistent improvements as we increase the number of queries in the training data, and avoids the perils of gradient-based training, which often requires large amounts of training data to avoid overfitting as well as expensive hyperparameter tuning in order to work consistently well across domains.
Effect of domain size
Figure 5 shows Top-100-Recall@Cost=500 for annCUR and the DE baselines on the primary y-axis, and the size of the domain, i.e., the total number of items, on the secondary y-axis, for five different domains in ZeShEL. Generally, as the number of items in the domain increases, the performance of all methods drops.
Indexing Cost
The indexing process starts by computing query-item CE scores for the queries in the train split. annCUR uses these scores for indexing the items (see §2.3), while the DE baselines use these scores to find the ground-truth top items for each query, followed by training DE models on the CE query-item scores. For the domain YuGiOh, computing the query-item scores for the train/anchor queries takes on the order of hours on an NVIDIA GeForce RTX 2080Ti GPU with 12GB memory, and training a DE model takes an additional 4.5 hours on two instances of the same GPU. Both quantities increase linearly with the domain size and the number of train queries; however, the query-item score computation can be trivially parallelized. We ignore the time to build a nearest-neighbor search index over the item embeddings produced by annCUR or DE, as it is negligible compared to the time spent on CE score computation and DE training. We refer the reader to Appendix A.3 for more details.
3.3 Analysis of annCUR
We compute the query-item score matrix for both [cls]CE and [emb]CE and compute the rank of these matrices using numpy Harris et al. (2020) for the domain YuGiOh, with 3374 queries (mentions) and 10031 items (entities). The rank of the score matrix for [cls]CE is 315, much higher than the rank of 45 for the corresponding [emb]CE matrix, because the query-item score distribution produced by the [cls]CE model is much more skewed than that produced by the [emb]CE model (see Fig. 0(b)).
Figures 5(a) and 5(b) show Top-10-Recall@500 on the domain YuGiOh for [cls]CE and [emb]CE respectively, for different combinations of the number of anchor queries and anchor items. Both anchor queries and anchor items are chosen uniformly at random, and for a given set of anchor queries, we evaluate on the remaining set of queries.
[cls]CE versus [emb]CE
For the same choice of anchor queries and anchor items, the proposed method performs better with the [emb]CE model than with [cls]CE, because the query-item score matrix for [emb]CE has a much lower rank and is thus easier to approximate.
Effect of the number of anchor queries and anchor items
Recall that the indexing time for annCUR is directly proportional to the number of anchor queries, while the number of anchor items influences the test-time inference latency. Unsurprisingly, the performance of annCUR increases as we increase both, and these can be tuned per the user's requirements to obtain the desired recall-vs-indexing-time and recall-vs-inference-time trade-offs. We refer the reader to Appendix B.2 for a detailed explanation of the drop in performance in certain configurations.
3.4 ItemItem Similarity Baselines
We additionally compare with the following baselines that index items by comparing them against a fixed set of anchor items instead of anchor queries (see Appendix A.2 for details on computing item-item scores using a CE model trained to score query-item pairs).


fixedITEM: Embed all items and the test query using their CE scores against a fixed set of anchor items chosen uniformly at random, and retrieve the top items for the test query based on the dot-product of these embeddings.

itemCUR: This is similar to the proposed approach except that it indexes the items by comparing them against anchor items instead of anchor queries when computing the score matrices in the indexing step in §2.3. At test time, it performs inference just like annCUR (see §2.4) by comparing the test query against a separate fixed set of anchor items.
Figure 7 shows Top-100-Recall for fixedITEM, itemCUR, and annCUR on the domain YuGiOh. itemCUR performs better than fixedITEM, indicating that the latent item embeddings produced using CUR decomposition of the item-item similarity matrix are better than those built by comparing the items against a fixed set of anchor items. itemCUR performs worse than annCUR, apparently because the CE was trained on query-item pairs and is not calibrated for item-item comparisons.
4 Related Work
Matrix Decomposition
Classic matrix decomposition methods such as SVD, QR decomposition have been used for approximating kernel matrices and distance matrices
Musco and Woodruff (2017); Tropp et al. (2017); Bakshi and Woodruff (2018); Indyk et al. (2019). Interpolative decomposition methods such as Nyström method and CUR decomposition allow approximation of the matrix even when given only a subset of rows and columns of the matrix. Unsurprisingly, performance of these methods can be further improved if given the entire matrix as it allows for a better selection of rows and columns on the matrix used in the decomposition process
Goreinov et al. (1997); Drineas et al. (2005); Kumar et al. (2012); Wang and Zhang (2013). Recent work, Ray et al. (2022) proposes sublinear Nyström approximations and considers CURbased approaches for approximating nonPSD similarity matrices that arise in NLP tasks such as coreference resolution and document classification. Unlike previous work, our goal is to use the approximate scores to support retrieval of top scoring items. Although matrix decomposition methods for sparse matrices based on SVD Berry (1992); Keshavan et al. (2010); Hastie et al. (2015); Ramlatchan et al. (2018) can be used instead of CUR decomposition, such methods would require a) factorizing a sparse matrix at test time in order to obtain latent embeddings for all items and the test query, and b) indexing the latent item embeddings to efficiently retrieve topscoring items for the given query. In this work, we use CUR decomposition as, unlike other sparse matrix decomposition methods, CUR decomposition allows for offline computation and indexing of item embeddings and the latent embedding for a test query is obtained simply by using its crossencoder scores against the anchor items.CrossEncoders and Distillation
Due to high computational costs, the use of cross-encoders (CE) is often limited to either scoring a fixed set of items or reranking items retrieved by a separate (cheaper) retrieval model Logeswaran et al. (2019); Qu et al. (2021); Bhattacharyya et al. (2021); Ayoola et al. (2022). CE models are also widely used for training computationally cheaper models via distillation on the training domain (Wu et al., 2020; Reddi et al., 2021), or for improving the performance of these cheaper models on the target domain Chen et al. (2020); Thakur et al. (2021) by using cross-encoders to score a fixed or heuristically retrieved set of items/datapoints. The DE baselines used in this work, in contrast, are trained using the nearest neighbors for a given query according to the CE.
Nearest Neighbor Search
For applications where the inputs are described as vectors in a d-dimensional space, nearest neighbor search has been widely studied for various (dis)similarity functions such as metric distances Chávez et al. (2001); Hjaltason and Samet (2003), inner-product Jegou et al. (2010); Johnson et al. (2019); Guo et al. (2020), and Bregman divergences Cayton (2008). Recent work on nearest neighbor search with non-metric (parametric) similarity functions explores various tree-based Boytsov and Nyberg (2019b) and graph-based nearest neighbor search indices Boytsov and Nyberg (2019a); Tan et al. (2020, 2021). In contrast, our approach approximates the scores of the parametric similarity function using latent embeddings generated via CUR decomposition, and uses off-the-shelf maximum inner-product search methods with these latent embeddings to find the nearest neighbors for the CE. An interesting avenue for future work would be to combine our approach with tree-based and graph-based approaches to further improve the efficiency of these search methods.
5 Conclusion
In this paper, we proposed annCUR, a matrix-factorization-based approach to nearest neighbor search for a cross-encoder model that does not rely on an auxiliary model such as a dual-encoder for retrieval. annCUR approximates the test query's scores for all items by scoring the test query against only a small number of anchor items, and retrieves items using the approximate scores. Empirically, for all but the smallest k, our approach provides test-time recall-vs-computational-cost trade-offs superior to the widely used approach of using cross-encoders to rerank items retrieved using a dual-encoder or a tf-idf-based model. This work is a step towards enabling efficient retrieval with expensive similarity functions such as cross-encoders, and thus moving beyond using such models merely for reranking items retrieved by auxiliary retrieval models such as dual-encoders and tf-idf-based models.
Acknowledgements
We thank members of UMass IESL for helpful discussions and feedback. This work was supported in part by the Center for Data Science and the Center for Intelligent Information Retrieval, in part by the National Science Foundation under Grant No. NSF1763618, in part by the Chan Zuckerberg Initiative under the project “Scientific Knowledge Base Construction”, in part by International Business Machines Corporation Cognitive Horizons Network agreement number W1668553, and in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative. Rico Angell was supported by the NSF Graduate Research Fellowship under Grant No. 1938059. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor(s).
Limitations
In this work, we use cross-encoders parameterized by transformer models. Computing query-item scores with such models can be computationally expensive. For instance, on an NVIDIA GeForce RTX 2080Ti GPU with 12GB memory, we achieve a throughput of approximately 140 scores/second, so computing a score matrix for 100 queries and 10K items takes about two hours. Although this computation can be trivially parallelized, the total amount of GPU hours required can be very high. However, note that these scores also need to be computed for the distillation-based DE baselines, as we need to identify the nearest neighbors for each query according to the cross-encoder model in order to train a dual-encoder on this task.
Our proposed approach allows for indexing the item set using only scores from the cross-encoder, without any additional gradient-based training, but it is not immediately clear how it can benefit from data on multiple target domains at the same time. Parametric models such as dual-encoders, on the other hand, can benefit from training and knowledge distillation on multiple domains at the same time.
Ethical Consideration
Our proposed approach considers how to speed up the computation of nearest neighbor search for crossencoder models. The crossencoder model, which our approach approximates, may have certain biases / error tendencies. Our proposed approach does not attempt to mitigate those biases. It is not clear how those biases would propagate in our approximation, which we leave for future work. An informed user would scrutinize both the crossencoder model and the resulting approximations used in this work.
References
ReFinED: an efficient zero-shot-capable approach to end-to-end entity linking. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, pp. 209–220.
Sublinear time low-rank approximation of distance matrices. In Advances in Neural Information Processing Systems.
Large-scale sparse singular value computations. The International Journal of Supercomputing Applications 6(1), pp. 13–49.
Energy-based reranking: improving neural machine translation using energy-based models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4528–4537.
Accurate and fast retrieval for complex non-metric data via neighborhood graphs. In International Conference on Similarity Search and Applications, pp. 128–142.
Pruning algorithms for low-dimensional non-metric k-NN search: a case study. In International Conference on Similarity Search and Applications, pp. 72–85.
Fast nearest neighbor retrieval for Bregman divergences. In International Conference on Machine Learning, pp. 112–119.
Searching in metric spaces. ACM Computing Surveys (CSUR) 33(3), pp. 273–321.
DiPair: fast and accurate distillation for trillion-scale text matching and pair modeling. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2925–2937.
A survey on recommendation system. International Journal of Computer Applications 160(7).
BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
On the Nyström method for approximating a Gram matrix for improved kernel-based learning. The Journal of Machine Learning Research 6(12).
A theory of pseudoskeleton approximations. Linear Algebra and its Applications 261(1–3), pp. 1–21.
Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pp. 3887–3896.
Array programming with NumPy. Nature 585(7825), pp. 357–362.
Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research 16(1), pp. 3367–3402.
Index-driven similarity search in metric spaces (survey article). ACM Transactions on Database Systems (TODS) 28(4), pp. 517–580.
Improving efficient neural ranking models with cross-architecture knowledge distillation. arXiv preprint arXiv:2010.02666.
On the effect of low-frequency terms on neural-IR models. In ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1137–1140.
Poly-encoders: transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. In International Conference on Learning Representations (ICLR).
Sample-optimal low-rank approximation of distance matrices. In Conference on Learning Theory, pp. 1723–1751.
Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1), pp. 117–128.
Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7(3), pp. 535–547.
Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781.
Matrix completion from a few entries. IEEE Transactions on Information Theory 56(6), pp. 2980–2998.
ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48.
Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
Information retrieval systems: theory and implementation. Vol. 1, Springer.
Sampling methods for the Nyström method. The Journal of Machine Learning Research 13(1), pp. 981–1006.
Trans-Encoder: unsupervised sentence-pair modelling through self- and mutual-distillations. In International Conference on Learning Representations (ICLR).
Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3449–3460.
TwinBERT: distilling knowledge to twin-structured compressed BERT models for large-scale retrieval. In ACM International Conference on Information & Knowledge Management, pp. 2645–2652.
Sparse, dense, and attentional representations for text retrieval. Transactions of the Association for Computational Linguistics 9, pp. 329–345.
CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences 106(3), pp. 697–702.
Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(4), pp. 824–836.
Sublinear time low-rank approximation of positive semidefinite matrices. In IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 672–683.
RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5835–5847.
A survey of matrix completion methods for recommendation systems. Big Data Mining and Analytics 1(4), pp. 308–323.
Sublinear time approximation of text similarity matrices. Proceedings of the AAAI Conference on Artificial Intelligence 36(7), pp. 8072–8080.
RankDistil: knowledge distillation for ranking. In International Conference on Artificial Intelligence and Statistics, pp. 2368–2376.
Okapi at TREC-3. NIST Special Publication 109, pp. 109.
Fast neural ranking on bipartite graph indices. Proceedings of the VLDB Endowment 15(4), pp. 794–803.
Fast item ranking under neural network based measures. In International Conference on Web Search and Data Mining, pp. 591–599.
Augmented SBERT: data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 296–310.
Randomized single-view algorithms for low-rank matrix approximation.
Attention is all you need. In Advances in Neural Information Processing Systems.
Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. The Journal of Machine Learning Research 14(1), pp. 2729–2769.
Scalable zero-shot entity linking with dense entity retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6397–6407.
Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
From neural re-ranking to neural ranking: learning a sparse representation for inverted indexing. In ACM International Conference on Information and Knowledge Management, pp. 497–506.
Understanding hard negatives in noise contrastive estimation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1090–1101.
Appendix A Training Details
A.1 Training DE and CE for Entity Linking on ZeShEL
We initialize all models with bert-base-uncased and train using the Adam Kingma and Ba (2015) optimizer with warmup proportion 0.01 for four epochs. We evaluate on the dev set five times per epoch and pick the model checkpoint that maximizes dev-set accuracy. While training the dual-encoder model, we update negatives after each epoch, using the latest dual-encoder parameters to mine hard negatives. We train the cross-encoder with a fixed set of 63 negative items (entities) for each query (mention), mined using the dual-encoder model. We use batch sizes of 8 and 4 for training the dual-encoder and cross-encoder respectively.
The dual-encoder and cross-encoder models took 34 and 44 hours respectively to train on two NVIDIA GeForce RTX 8000 GPUs, each with 48GB memory. The dual-encoder model has 2×110M parameters, as it consists of separate query and item encoders, while the cross-encoder model has 110M parameters.
Tokenization details
We use wordpiece tokenization Wu et al. (2016) with a maximum of 128 tokens, including special tokens, for tokenizing entities and mentions. The mention representation consists of the wordpiece tokens of the mention and of the context surrounding it:
[CLS] ctxt_{l} [M_{s}] ment [M_{e}] ctxt_{r} [SEP] 
where ment, ctxt_{l}, and ctxt_{r} are the wordpiece tokens of the mention and of the context before and after it respectively, and [M_{s}], [M_{e}] are special tokens tagging the mention span.
The entity representation is similarly composed of the wordpiece tokens of the entity title and description. The input to our entity model is:
[CLS] title [ENT] description [SEP] 
where title and description are the wordpiece tokens of the entity title and description, and [ENT] is a special token separating the entity title from its description.
Domain               # Entities   # Mentions   # Mentions (eval)
Training Data
Military                 104520        13063                2400
Pro Wrestling             10133         1392                1392
Doctor Who                40281         8334                4000
American Football         31929         3898                   –
Fallout                   16992         3286                   –
Star Wars                 87056        11824                   –
World of Warcraft         27677         1437                   –
Validation Data
Coronation Street         17809         1464                   –
Muppets                   21344         2028                   –
Ice Hockey                28684         2233                   –
Elder Scrolls             21712         4275                   –
Test Data
Star Trek                 34430         4227                4227
YuGiOh                    10031         3374                3374
Forgotten Realms          15603         1200                   –
Lego                      10076         1199                   –
The cross-encoder model takes as input the concatenated query (mention) and item (entity) representations, with the [CLS] token stripped from the item (entity) tokenization, as shown below
[CLS] ctxt_{l} [M_{s}] ment [M_{e}] ctxt_{r} [SEP]  
title [ENT] description [SEP] 
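The three input formats above can be sketched as simple token-list constructors. This is an illustrative sketch only: the function names are ours, and the token strings stand in for the actual wordpiece special tokens.

```python
from typing import List

MAX_LEN = 128  # maximum input length, including special tokens

def mention_input(ctxt_l: List[str], ment: List[str],
                  ctxt_r: List[str]) -> List[str]:
    """[CLS] ctxt_l [Ms] ment [Me] ctxt_r [SEP]"""
    toks = ["[CLS]"] + ctxt_l + ["[Ms]"] + ment + ["[Me]"] + ctxt_r
    return toks[:MAX_LEN - 1] + ["[SEP]"]

def entity_input(title: List[str], description: List[str]) -> List[str]:
    """[CLS] title [ENT] description [SEP]"""
    toks = ["[CLS]"] + title + ["[ENT]"] + description
    return toks[:MAX_LEN - 1] + ["[SEP]"]

def cross_encoder_input(mention_toks: List[str],
                        entity_toks: List[str]) -> List[str]:
    """Concatenate the two, dropping the entity's leading [CLS]."""
    return mention_toks + entity_toks[1:]
```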
A.2 Using a query-item CE model for computing item-item similarity
We compute item-item similarity with a cross-encoder trained to score query-item pairs as follows. In our setting, the query is a mention of an entity with its surrounding context, and the item is an entity with its associated title and description. We place the first entity of the pair in the query slot by wrapping its title in the mention-span tokens and using its description as the right context of the mention, and place the second entity in the entity slot as usual. The concatenated representation of the entity pair (e_{1}, e_{2}) is given by
[CLS] [M_{s}] t_{e1} [M_{e}] d_{e1} [SEP] t_{e2} [E] d_{e2} [SEP] 
where t_{e1}, t_{e2} are the tokenized entity titles, d_{e1}, d_{e2} are the tokenized entity descriptions, [M_{s}], [M_{e}] are special tokens denoting the mention-span boundary, and [E] is a special token separating an entity title from its description.
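The entity-pair input above can be sketched the same way (hypothetical helper name, with plain strings standing in for the special tokens):

```python
from typing import List

def entity_pair_input(t_e1: List[str], d_e1: List[str],
                      t_e2: List[str], d_e2: List[str]) -> List[str]:
    """[CLS] [Ms] t_e1 [Me] d_e1 [SEP] t_e2 [E] d_e2 [SEP]"""
    return (["[CLS]", "[Ms]"] + t_e1 + ["[Me]"] + d_e1 + ["[SEP]"]
            + t_e2 + ["[E]"] + d_e2 + ["[SEP]"])
```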
A.3 Training DE for NN retrieval with CE
We train dual-encoder models using the top-k nearest neighbor items of each training query according to the cross-encoder model, with two loss functions. Let S^{ce} and S^{de} be the matrices containing cross-encoder and dual-encoder scores respectively for all items for each query in the training data, with rows S^{ce}_{q} and S^{de}_{q} for query q. Let N^{ce}_{k}(q) and N^{de}_{k}(q) be the top-k items for query q according to the cross-encoder and dual-encoder respectively, and let Ñ^{de}(q) be top-scoring items for q according to the dual-encoder that are not present in N^{ce}_{k}(q). We use the two loss functions ℓ_{distill} and ℓ_{rank}, described below, to train the dual-encoder model using a cross-encoder model.
ℓ_{distill}(q) = H( σ(S^{ce}_{q}), σ(S^{de}_{q}) )
where H is the cross-entropy function and σ is the softmax function. In words, ℓ_{distill} is the cross-entropy loss between the dual-encoder and cross-encoder query-item score distributions over all items. Due to computational and memory limitations, we train by minimizing ℓ_{distill} using only the items in N^{ce}_{k}(q) for each query q.
ℓ_{rank} treats each item in N^{ce}_{k}(q) as a positive item and pairs it with hard negatives from Ñ^{de}(q); minimizing ℓ_{rank} increases the dual-encoder's score for items in N^{ce}_{k}(q), thus aligning N^{de}_{k}(q) with N^{ce}_{k}(q) for queries in the training data.
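The two training signals described above can be sketched in numpy for a single query. This is our reconstruction for illustration, not the paper's code; the function names and the restriction to a small candidate set are assumptions.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()

def distill_loss(de_scores: np.ndarray, ce_scores: np.ndarray) -> float:
    """Cross-entropy between the cross-encoder (target) and dual-encoder
    score distributions over the query's top-k cross-encoder items."""
    p = softmax(ce_scores)   # target distribution (cross-encoder)
    q = softmax(de_scores)   # model distribution (dual-encoder)
    return float(-(p * np.log(q + 1e-12)).sum())

def ranking_loss(pos_score: float, neg_scores: np.ndarray) -> float:
    """Treat one top cross-encoder item as the positive, paired with
    DE-mined hard negatives (multi-class log-loss on DE scores)."""
    logits = np.concatenate([[pos_score], neg_scores])
    return float(-np.log(softmax(logits)[0] + 1e-12))
```

Minimizing the first loss pushes the dual-encoder score distribution toward the cross-encoder's; minimizing the second pushes the positive item's score above the mined negatives'.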
Training and optimization details
We train all dual-encoder models using the Adam optimizer for 10 epochs, with a separate set of parameters for the query and item encoders. We use 10% of the training queries for validation and train on the remaining 90%. For each domain and training-data size, we train with both loss functions and pick the model that performs best at NN retrieval according to the cross-encoder model on the validation queries.
We train models with one of the losses on two NVIDIA GeForce RTX 2080Ti GPUs with 12GB GPU memory each, and with the other, more memory-intensive loss on two NVIDIA GeForce RTX 8000 GPUs with 48GB GPU memory each, as the latter could not be trained on the 2080Tis due to GPU memory limitations. For the loss that uses mined hard negatives, we update the list of negative items for each query after each epoch by mining hard negatives with the latest dual-encoder parameters.
Indexing and Training Time
Table 1(a) shows overall indexing time for the proposed method annCUR and for DE models trained using the two distillation losses on domain YuGiOh. Training time for one loss is much less than for the other because the former is trained on more powerful GPUs (two NVIDIA RTX 8000s with 48GB memory each) due to its GPU memory requirements, while the latter is trained on two NVIDIA 2080Ti GPUs with 12GB memory each. The total indexing time for DE models includes the time taken to compute the CE score matrix because, in order to train a DE model for the task of nearest neighbor search for a CE, we first need to find the exact nearest neighbor items for the training queries. Note that this differs from the "standard" way of training DE models via distillation, where the DE is often distilled using CE scores on a fixed or heuristically retrieved set of items, not on the nearest neighbor items according to the cross-encoder for a given query.
Table 1(b) shows indexing time for annCUR and for DEs trained via distillation on five domains in ZeShEL. As the size of the domain increases, the time taken to compute cross-encoder scores for the training queries also increases. The time taken to train the dual-encoder via distillation remains roughly the same, as we train with a fixed number of positive and negative items during distillation.
Appendix B Additional Results and Analysis
B.1 Comparing [emb]CE and [cls]CE
In addition to training cross-encoder models with negatives mined using a dual-encoder, we train both [cls]CE and [emb]CE models using random negatives and negatives mined using tf-idf embeddings of mentions and entities. To evaluate the cross-encoder models, we retrieve 64 entities for each test mention using a dual-encoder model and rerank them using the cross-encoder model.
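The retrieve-and-rerank evaluation loop above can be sketched as follows; the scorer callables are placeholders standing in for the trained dual-encoder and cross-encoder, not real models.

```python
import numpy as np

def retrieve_then_rerank(query, items, de_score, ce_score, k: int = 64):
    """Retrieve the top-k items by the cheap scorer, then rerank them
    with the expensive one; returns item indices in reranked order."""
    de = np.array([de_score(query, it) for it in items])
    cand = np.argsort(-de)[:k]                       # dual-encoder retrieval
    reranked = sorted(cand, key=lambda i: -ce_score(query, items[i]))
    return [int(i) for i in reranked]                # cross-encoder order
```

Note that final accuracy is capped by the first stage: an entity the dual-encoder fails to place in the top-k can never be recovered by the reranker.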
Table 2(a) shows macro-averaged accuracy on the downstream task of entity linking over test and dev domains in the ZeShEL dataset, and Table 2(b) shows the rank of the query-item score matrices on domain YuGiOh for both cross-encoder models. The proposed [emb]CE model performs on par with the widely used [cls]CE architecture for all three negative-mining strategies, while producing a query-item score matrix of lower rank than [cls]CE.
Figure 8 shows the approximation error of annCUR for different combinations of the number of anchor queries and anchor items for [cls]CE and [emb]CE. For a given set of anchor queries, the approximation error is evaluated on the remaining queries. The error between a matrix M and its approximation M̂ is measured as ||M − M̂||_{F} / ||M||_{F}, where ||·||_{F} is the Frobenius norm. For the same choice of anchor queries and anchor items, the approximation error is lower for the [emb]CE model than for [cls]CE. This aligns with the observation that the rank of the query-item score matrix from [emb]CE is lower than that of the corresponding matrix from [cls]CE, as shown in Table 2(b).
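The relative Frobenius-norm error used above is a one-line numpy helper (illustrative, hypothetical function name):

```python
import numpy as np

def rel_frobenius_error(M: np.ndarray, M_hat: np.ndarray) -> float:
    """Relative Frobenius-norm error between M and its approximation."""
    return float(np.linalg.norm(M - M_hat, "fro") / np.linalg.norm(M, "fro"))
```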
B.2 Understanding the poor performance of annCUR with equal numbers of anchor queries and anchor items
Figure 6 in §3.3 shows Top-10-Recall@500 on domain YuGiOh for [cls]CE and [emb]CE respectively, for different combinations of the number of anchor queries and anchor items. Note that the performance of annCUR drops significantly when the number of anchor queries equals the number of anchor items. Recall that the indexing step (§2.3) requires computing the pseudo-inverse of the matrix containing scores between the anchor queries and anchor items. When this matrix is square, it tends to be ill-conditioned, with several very small eigenvalues that are 'blown up' in its pseudo-inverse. This, in turn, leads to a significant approximation error Ray et al. (2022). Choosing different numbers of anchor queries and anchor items yields a rectangular matrix whose eigenvalues are unlikely to be small, resulting in a much better approximation of the score matrix.
Oracle CUR Decomposition
An alternate way of computing the matrix U in the CUR decomposition A ≈ CUR, for a given subset of rows R and columns C, is to set U = C⁺AR⁺, where ⁺ denotes the pseudo-inverse. This can provide a much more stable approximation of the matrix even when the number of anchor queries equals the number of anchor items Mahoney and Drineas (2009). However, it requires computing all entries of A before computing its low-rank approximation. In our case, the matrix being approximated also contains scores between test queries and all items, and the whole point is to avoid scoring all items with the CE model at test time, so we cannot use this oracle U. Figure 9 shows results for an oracle experiment where we do use U = C⁺AR⁺; as expected, it provides a significant improvement when the number of anchor queries equals the number of anchor items, and a minor improvement otherwise.
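The two choices of U can be compared numerically on synthetic data. The sketch below uses toy matrices (not ZeShEL scores); the oracle U = C⁺AR⁺ minimizes the Frobenius error given the chosen rows and columns, so it is never worse than U = W⁺ and is typically far better when the intersection matrix W is square.

```python
import numpy as np

rng = np.random.default_rng(0)
# A matrix with a rapidly decaying spectrum, mimicking a near-low-rank score matrix
A = (rng.standard_normal((50, 60))
     @ np.diag(0.5 ** np.arange(60))
     @ rng.standard_normal((60, 60)))

rows = rng.choice(50, size=10, replace=False)   # "anchor queries"
cols = rng.choice(60, size=10, replace=False)   # "anchor items" (square W)
C, R = A[:, cols], A[rows, :]
W = A[np.ix_(rows, cols)]                       # intersection submatrix

def rel_err(A_hat):
    return np.linalg.norm(A - A_hat, "fro") / np.linalg.norm(A, "fro")

err_w = rel_err(C @ np.linalg.pinv(W) @ R)                               # U = W+
err_oracle = rel_err(C @ np.linalg.pinv(C) @ A @ np.linalg.pinv(R) @ R)  # U = C+ A R+
# err_oracle <= err_w always holds, since C+ A R+ is the Frobenius-optimal U.
```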
B.3 NN experiment results for all domains
For brevity, we show results for all top-k values only for domain YuGiOh in the main paper. For the sake of completeness and for interested readers, we include results for all combinations of top-k values, domains, and training-data sizes. Figures 10–24 contain results for domains YuGiOh, Pro_Wrestling, Doctor_Who, Star_Trek, and Military, and for the training-data sizes used in our experiments. For Pro_Wrestling, since the domain contains only 1392 queries, we use a maximum training-data size of 1000 instead of 2000.