Retrieval of the most relevant items for a particular query is a key ingredient of a wide range of machine learning applications, e.g., recommender services, retrieval-based chatbot systems, and web search engines. In these applications, item relevance for a particular query is usually predicted by pretrained models, such as deep neural networks (DNNs) or gradient boosted decision trees (GBDTs). Typically, relevance models take a query-item pair as input and output the relevance of the item to the query. In the most general case, queries and items are described by different sets of features and belong to different spaces. For instance, in recommender systems, queries are typically described by user gender, age, and usage history, while the item features mostly contain information describing the content. Let us denote the query space by $Q$ and the item space by $I$. Then the problem of maximal relevance retrieval can formally be stated as follows. Let us have a large finite set of items $S \subset I$, a query $q \in Q$, and a function $f$ that maps query-item pairs to relevance values:

$$f: Q \times I \to \mathbb{R} \qquad (1)$$

For a given query $q$ we aim to find an item from $S$ that maximizes the relevance function $f$:

$$\arg\max_{v \in S} f(q, v) \qquad (2)$$

or, more generally, to find the top-$K$ items from $S$ that provide maximal relevance values. An important special case of problem (2) is the well-known nearest neighbor search (NNS) problem, which arises when

$$Q = I = \mathbb{R}^D \quad \text{and} \quad f(q, v) = -\|q - v\|_2 \qquad (3)$$
However, current applications typically use more complex and highly nonlinear relevance functions. For instance, many modern recommender services [10] use deep neural networks for relevance prediction, while chatbots [21] and many other applications use GBDT models. Naive exhaustive search requires a relevance function computation for every item in the database, which is not feasible for million-scale databases and computationally expensive models. In this paper, we propose a method that provides an approximate solution of high quality while computing the relevance function only for a small fraction of the items.
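For reference, the exhaustive baseline mentioned above can be sketched in a few lines. This is only an illustration: `exhaustive_top_k` and `toy_relevance` are hypothetical names, and the toy relevance function stands in for an expensive DNN or GBDT model.

```python
import heapq


def exhaustive_top_k(query, items, relevance, k=1):
    """Score every item with the relevance model and return the k best (index, score) pairs."""
    # One relevance computation per item: this is the cost that becomes
    # prohibitive for million-scale item sets and expensive models.
    scored = ((relevance(query, item), idx) for idx, item in enumerate(items))
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]


def toy_relevance(query, item):
    """Hypothetical stand-in for a DNN/GBDT model: negative absolute difference."""
    return -abs(query - item)


items = [0.0, 2.5, 4.0, 7.0, 9.5]
top2 = exhaustive_top_k(4.2, items, toy_relevance, k=2)
print(top2)  # indices of the two most relevant items, best first
```

The number of calls to `relevance` equals the number of items regardless of `k`, which is exactly the cost a non-exhaustive method tries to avoid.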
The proposed method is based on the similarity graph framework for nearest neighbor search. This approach organizes the set of items in a graph, where close items are connected by edges, and the search process is performed via a greedy exploration of this graph. In this paper, we extend similarity graphs to the setting where there is no similarity measure defined on item pairs. Specifically, we describe each item by a vector of relevance values for a representative subset of queries, and experimentally show that the graph exploration can be successfully guided by the DNN/GBDT models. Below we refer to our method as Relevance Proximity Graphs (RPG).
The contributions of the paper can be summarized as follows:
We tackle a new problem of general relevance retrieval, which has so far received little attention from the community.
We extend the similarity graphs framework to the setting without a similarity measure defined in the item space.
We open-source the implementation (https://github.com/stanis-morozov/rpg) of the proposed method as well as two million-scale datasets and state-of-the-art pretrained models for further research on general relevance retrieval.
The rest of the paper is organized as follows: in Section 2, we briefly review prior work related to the proposed approach. Section 3 formally describes the construction and the usage of RPG. In Section 4, we perform an extensive evaluation of RPG on several datasets and several models and empirically confirm its practical usefulness. Section 5 concludes the paper.
2 Related work
Relevance retrieval problem. Probably the closest work to ours is [22], which also notes that large-scale relevance retrieval is computationally infeasible for multi-layer DNN models. They tackle this problem by learning a hierarchical model with a specific structure, which organizes the set of items into a tree during the training stage. While this model allows non-exhaustive retrieval, their approach cannot be directly applied to existing models and requires training the model from scratch. Another work [20] proposes a cascade scheme in which cheap auxiliary models provide short-lists of promising candidates and expensive relevance models are used only for candidate reranking. While being efficient, such schemes can result in low recall if the capacity of the auxiliary models is insufficient to extract precise candidate lists. In contrast, the proposed RPG approach directly maximizes the relevance given by arbitrary off-the-shelf models. We confirm this claim for several GBDT and DNN models in the experimental section.
Nearest neighbor search. As mentioned above, problem (2) generalizes the well-known problem of nearest neighbor search (NNS). Overall, the machine learning community pursues three separate lines of research on NNS: locality-sensitive hashing (LSH) [11, 1], partition trees [3, 5, 6], and similarity graphs. While LSH-based and tree-based methods provide solid theoretical guarantees, the performance of graph-based methods has been shown to be much higher. The proposed approach is based on the similarity graph framework, hence we provide a brief review of its main ideas below.
For a set of items $S = \{v_1, \dots, v_N\}$, the directed similarity graph has a vertex corresponding to each of the items. Vertices $i$ and $j$ are connected by an edge if $v_j$ belongs to the set of nearest neighbors of $v_i$ in terms of the similarity function $s(v_i, v_j)$. The usage of similarity graphs for the NNS problem was initially proposed in the seminal work [16]. This approach constructs the similarity graph and then performs a greedy walk on this graph at the retrieval stage. The search process starts from a random vertex and then on each step moves from the current vertex to the neighbor that appears to be the closest to the query. The process terminates when we reach a local minimum, i.e., when there are no adjacent vertices closer to the query. Since then, the general pipeline described above has been extended with a large set of additional heuristics [14, 7, 8], outperforming LSH-based and tree-based methods.
The proposed approach for the relevance retrieval problem could be based on any state-of-the-art similarity graph. In this paper, we employ the publicly available implementation of Hierarchical Navigable Small World (HNSW) graphs [14]. In HNSW, the search is performed in a semi-greedy fashion, via a variant of beam search [18], as described in detail in Algorithm 1. During the construction stage, HNSW builds the graph by incrementally adding vertices one by one. A parameter $M$ denotes the maximum degree of vertices in the graph. For each new vertex $v$ we perform the search described in Algorithm 1 and connect $v$ to the $M$ closest vertices that already exist in the graph. HNSW builds a nested hierarchy of graphs, where the lowest layer contains all vertices, and each higher layer contains only a subset of the vertices of the lower layer. The search is performed from top to bottom, and the result from each upper layer is used as an entry point to the lower layer.
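Algorithm 1 is not reproduced in this text, so the following is only a minimal sketch of a semi-greedy beam search over a single-layer graph, in the spirit of the description above. The names `beam_search` and `beam_width` and the toy chain graph are assumptions made for illustration; the HNSW layer hierarchy and its construction heuristics are omitted.

```python
import heapq


def beam_search(graph, relevance, query, entry, beam_width, k):
    """Explore `graph` from `entry`, keeping the `beam_width` most relevant vertices seen."""
    visited = {entry}
    entry_score = relevance(query, entry)
    evaluations = 1
    candidates = [(-entry_score, entry)]  # max-heap on relevance (scores negated)
    beam = [(entry_score, entry)]         # min-heap holding the current beam
    while candidates:
        neg_score, vertex = heapq.heappop(candidates)
        # Stop once the best unexpanded candidate is worse than the whole beam.
        if len(beam) >= beam_width and -neg_score < beam[0][0]:
            break
        for neighbor in graph[vertex]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            score = relevance(query, neighbor)
            evaluations += 1
            if len(beam) < beam_width or score > beam[0][0]:
                heapq.heappush(candidates, (-score, neighbor))
                heapq.heappush(beam, (score, neighbor))
                if len(beam) > beam_width:
                    heapq.heappop(beam)  # drop the least relevant beam member
    best = sorted(beam, reverse=True)[:k]
    return [vertex for _, vertex in best], evaluations


# Toy setup (hypothetical): ten items on a line, each linked to its neighbors.
values = [float(i) for i in range(10)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
         for i in range(len(values))}


def toy_relevance(query, vertex):
    return -abs(query - values[vertex])


found, evaluations = beam_search(graph, toy_relevance, 6.3, entry=0, beam_width=2, k=1)
print(found, evaluations)
```

With `beam_width=1` this reduces to the plain greedy walk; larger beams trade extra relevance computations for a lower chance of stopping in a poor local optimum.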
Beyond the NNS problem, similarity graphs were also shown to provide decent performance for more general similarity functions, e.g., inner product, Kullback-Leibler divergence, cosine distance, Itakura-Saito distance, and others. In this paper, we show that this framework can be extended to the setting where no similarity measure is specified between item space elements and where the relevance functions are defined by state-of-the-art ML models, such as DNNs or GBDTs.
3 Relevance Proximity Graphs (RPG)
The key idea of our approach is to represent the set of items as a graph and to perform the search on this graph using the given relevance function. The retrieval stage remains unchanged, that is, we perform semi-greedy graph exploration, guided by the relevance function (Algorithm 1). As will be shown in the experiments, the state-of-the-art DNN and GBDT models successfully guide the graph exploration process given that the item set is organized in an appropriate graph.
However, the question "How to construct an appropriate graph?" becomes nontrivial as items and queries belong to different spaces. Moreover, in some scenarios, there is no similarity defined in the item space, hence the existing approaches to graph construction cannot be directly applied.
Relevance-aware similarity in the item space
For graph construction, we exploit the natural idea that two items $u$ and $v$ are similar if the corresponding functions $f_u(q) = f(q, u)$ and $f_v(q) = f(q, v)$ are "close", i.e., the items are both relevant or both irrelevant for the same queries. As a straightforward way to define the similarity between functions, we use the $L_2$ distance over some measure $\mu$ defined on the query space $Q$ (we put a minus sign as similarity search is traditionally formulated as a maximization problem):

$$s(u, v) = -\int_Q \left(f_u(q) - f_v(q)\right)^2 d\mu(q) \qquad (4)$$

We choose the proper measure over the query space based on the following intuition. Let us have a probability space defined on the query space, with query density $p(q)$. For most applications, it is natural to force the functions $f_u$ and $f_v$ to be closer in the regions where the density of the query distribution is high. This corresponds to the following similarity function:

$$s(u, v) = -\int_Q \left(f_u(q) - f_v(q)\right)^2 p(q)\, dq \qquad (5)$$

that is equivalent to the expectation over the probability space:

$$s(u, v) = -\mathbb{E}_{q}\left[\left(f_u(q) - f_v(q)\right)^2\right] \qquad (6)$$

In practice, we use the Monte-Carlo estimate of this value. Let us have a random sample of size $d$ from the query distribution. We enumerate this random sample:

$$q_1, \dots, q_d \qquad (7)$$

Then we define a vector $r_u$ corresponding to the item $u$ in the following way:

$$r_u = \left(f(q_1, u), \dots, f(q_d, u)\right) \in \mathbb{R}^d \qquad (8)$$

We refer to the vector $r_u$ as a relevance vector, as it contains the relevance values for the item $u$ and the queries in the sample. Note that we choose the sample only once and it remains the same for all the items. Then the similarity between items $u$ and $v$ can be defined as:

$$\hat{s}(u, v) = -\|r_u - r_v\|_2^2 \qquad (9)$$
Given this similarity measure, we can apply the existing graph construction method from [14]. Note that, for a fair evaluation, only hold-out training queries were used to obtain the relevance vectors in all the experiments, while the relevance retrieval accuracy was calculated for a separate set of test queries.
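To make the link between the expectation and the distance between relevance vectors explicit, the Monte-Carlo step can be written out as follows. The notation is an assumption of this sketch: $f(q, u)$ is the relevance of item $u$ for query $q$, $q_1, \dots, q_d$ is the query sample, and $r_u = (f(q_1, u), \dots, f(q_d, u))$ is the relevance vector of item $u$.

```latex
\mathbb{E}_{q}\left[\left(f(q, u) - f(q, v)\right)^2\right]
  \approx \frac{1}{d} \sum_{i=1}^{d} \left(f(q_i, u) - f(q_i, v)\right)^2
  = \frac{1}{d} \, \| r_u - r_v \|_2^2
```

The factor $1/d$ is the same for every pair of items, so dropping it does not change which items are nearest neighbors of each other.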
We summarize the graph construction scheme more formally. Let us have the item set $S$ and the train query set. The main parameter of our scheme is the dimensionality of the relevance vectors, which is denoted by $d$.
1. Select $d$ queries $q_1, \dots, q_d$ from the train query set, which will be used to construct the relevance vectors.
2. Compute the relevance vectors for the items from $S$: $r_u = (f(q_1, u), \dots, f(q_d, u))$ for each $u \in S$.
3. Build a similarity graph on $S$ via the HNSW method [14], using the $L_2$ distance on the relevance vectors as a similarity measure.
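The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: a brute-force k-nearest-neighbor graph stands in for HNSW, and `build_relevance_graph`, `toy_relevance`, and the toy data are hypothetical.

```python
import random


def build_relevance_graph(items, train_queries, relevance, d, max_degree, seed=0):
    """Build a k-NN graph over items using distances between relevance vectors."""
    rng = random.Random(seed)
    # Step 1: select d train queries; the same sample is reused for all items.
    sample = rng.sample(train_queries, d)
    # Step 2: relevance vector of each item = its relevance to every sampled query.
    vectors = [[relevance(q, item) for q in sample] for item in items]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Step 3: connect each item to its closest items in relevance-vector space.
    # (Brute force here; the paper uses HNSW for this step.)
    graph = {}
    for u, r_u in enumerate(vectors):
        others = [v for v in range(len(items)) if v != u]
        others.sort(key=lambda v: sq_dist(r_u, vectors[v]))
        graph[u] = others[:max_degree]
    return graph, sample


def toy_relevance(query, item):
    """Hypothetical stand-in for a pretrained relevance model."""
    return -abs(query - item)


items = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
train_queries = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
graph, sample = build_relevance_graph(items, train_queries, toy_relevance,
                                      d=4, max_degree=2)
print(graph)
```

Note that the construction never compares item features directly: two items become neighbors only because the model scores them similarly on the sampled queries.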
4 Experiments
In this section, we present the experimental evaluation of the proposed RPG approach for the top-K relevance retrieval problem on three real-world datasets. Our code is written in C++, and the implementation of similarity graphs is based on the open-source HNSW implementation (https://github.com/yurymalkov/hnsw). In our experiments, we use two standard performance measures. The commonly used Recall measure is defined as the rate of successfully found neighbors, averaged over a set of queries. The second measure is Average relevance, defined as the average of the relevance values for the query and the retrieved top-K items, averaged over the set of queries.
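The two measures can be sketched as below. The function names and the toy numbers are assumptions for illustration; in the experiments, the ground-truth top-K lists would come from exhaustive search with the relevance model.

```python
def recall(retrieved_ids, true_top_ids):
    """Fraction of the true top-K items that the search actually returned."""
    true_set = set(true_top_ids)
    return len(true_set.intersection(retrieved_ids)) / len(true_set)


def average_relevance(query, retrieved_items, relevance):
    """Mean relevance of the retrieved items for one query."""
    scores = [relevance(query, item) for item in retrieved_items]
    return sum(scores) / len(scores)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical results for two test queries: (retrieved ids, exact top-K ids).
per_query_recall = [recall([3, 7, 9], [3, 9, 11]), recall([1, 2], [1, 2])]
mean_recall = mean(per_query_recall)

toy_avg = average_relevance(2.0, [1.0, 4.0], lambda q, v: -abs(q - v))
print(mean_recall, toy_avg)
```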
We report experimental results obtained on the three datasets described below. To the best of our knowledge, there are no publicly available large-scale benchmarks for relevance retrieval with highly nonlinear models and without a similarity measure between item space elements; therefore, we collect and open-source two datasets. We expect that these datasets will be valuable for the community, given the abundance of relevance retrieval problems in applications.
Collections dataset. This dataset originates from a proprietary image recommendation service. Here we also sampled one million most-viewed images and random users. Each user-item pair is characterized by features, where of them are item features, are user features and are pairwise user-item features. Then we trained the state-of-the-art GBDT model [17] on these features, which we open-source along with the dataset.
Video dataset. This dataset originates from a proprietary video recommendation service. We sampled one million most-viewed videos and random users. Each user-item pair is characterized by features, where of them are item features, are user features and are pairwise user-item features. We trained the GBDT model [17] on these features, which we open-source as well.
Pinterest. We also evaluate our approach on the medium-scale Pinterest dataset [9] with the deep neural network model for relevance prediction proposed in [10]. This dataset contains items and queries without any feature representations, i.e., only a rating matrix is provided.
For all datasets we randomly selected users as train queries and users as test queries. We use the train queries for the relevance vector computation, and we average the evaluation measures over the test queries.
To demonstrate the reasonableness of the proposed scheme, we evaluate RPG on two common nearest neighbor benchmarks, SIFT1M [13] and DEEP1M [2], with the Euclidean distance between queries and items, that is, $Q = I = \mathbb{R}^D$ and $f(q, v) = -\|q - v\|_2$.
In Figure 1(a) and Figure 1(b) we compare RPG with HNSW [14], using the same vertex degree for both methods. As expected, the graphs constructed based on distances between relevance vectors (9) are less accurate but still provide decent retrieval performance. In particular, only a few thousand distance evaluations are sufficient to achieve a high recall level.
We conjecture that the reason for the decent performance even with suboptimal graphs is that, at the graph exploration stage, the search process is "guided" by the correct similarity measure, which is the negative distance between the original data vectors in this experiment. Furthermore, as we show in the experiments below, the graphs constructed on relevance vectors perform exceptionally well even when the relevance function is based on highly nonlinear DNN or GBDT models.
Ablation and preliminary experiments
Now we investigate RPG performance with varying database sizes and different parameter values.
RPG vertex degree. First, we investigate the dependence of RPG performance on the vertex degree $M$. For the Collections dataset, the recall-vs-complexity curves for different values of $M$ are shown in Figure 3. Here, the length of the relevance vectors is the same for all values of $M$. Surprisingly, Figure 3 demonstrates that the best results are obtained for a quite small degree, which is smaller than the typical vertex degrees in graphs for metric nearest neighbor search [14]. In all the experiments below we use a fixed degree for all datasets.
Length of relevance vectors. Next, we investigate how the RPG accuracy depends on the length of the relevance vectors $d$. We used a random sample of $d$ queries for the computation of relevance vectors and evaluated recall for several values of $d$ on all three datasets. The results are shown in Figure 4, which plots Recall against the number of model computations. As expected, a higher $d$ results in more accurate retrieval, because the Monte Carlo estimates of (6) become more accurate. On the other hand, Figure 4 demonstrates diminishing returns from a higher $d$, as the performance difference between the larger values of $d$ is only marginal.
Search scalability. Finally, to investigate the empirical scalability of RPG, we varied the Collections database size and determined the number of relevance function computations required to achieve a fixed Recall level for a fixed top size. The results, shown in Figure 2, imply a power-law dependence of the number of computations on the database size. Note, however, that the exponent of the power law is less than $1$, hence the empirical scalability of RPG is sublinear in the database size.
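One common way to estimate such an exponent is a least-squares line fit in log-log coordinates, sketched below. The measurements in the example are synthetic and chosen only to exercise the code; they are not the values from Figure 2, and `fit_power_law_exponent` is a hypothetical helper.

```python
import math


def fit_power_law_exponent(sizes, costs):
    """Slope of the least-squares line through (log size, log cost) points."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic example: costs generated from cost = 3 * n^0.5 (illustrative only).
sizes = [10 ** 4, 10 ** 5, 10 ** 6]
costs = [3.0 * n ** 0.5 for n in sizes]
exponent = fit_power_law_exponent(sizes, costs)
print(exponent)  # an exponent below 1 indicates sublinear scaling
```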
Comparison with baselines
We compare the proposed RPG method with several baselines for general relevance retrieval. In particular, we evaluate the following methods:
- Top scored
For every item, we compute its average relevance value over the train queries and select a shortlist of items with the maximal global query-independent relevance. Then we rerank these items based on the actual relevance value, computed by the query-item relevance model. We vary the shortlist size to achieve a runtime/accuracy trade-off. Intuitively, this baseline selects the most "popular" items.
- Item-based graph
is the baseline that uses the similarity graph, constructed on the item features only, instead of the relevance vectors. Let us denote by the -normalized vector of features of item . Then the similarity between two items can be defined as
Note, that compared to RPG, the item-based graph has two crucial deficiencies:
The item-based graph construction does not use any information about the query distribution or the relevance prediction model.
The dataset could lack item-only features (e.g., Pinterest), hence in such cases the item-based graph could not be constructed.
In practice, one could use a less accurate, computationally cheaper model (e.g., linear) to produce a list of candidates that are then reranked by the expensive GBDT/DNN model. To compare the proposed RPG framework with such two-stage approaches we propose the following baseline:
- Two-tower
We learn a "two-tower" DNN that encodes query and item features into 50-dimensional embeddings. The DNN has separate query and item branches, each consisting of three fully-connected layers with ELU non-linearity and Batch Normalization; the number of neurons per layer is chosen separately for Collections and Video. The relevance for a query-item pair is predicted as the dot product of the corresponding embeddings. We train this model with the same target as the original GBDT model, with the Adam optimizer [12] and the OneCycle [19] learning rate schedule. During the retrieval stage, we select a shortlist of items that provide the maximum dot product [4] with a given query and rerank them based on the actual relevance value. We vary the shortlist size for a runtime/accuracy trade-off. An important weakness of the Two-tower baseline is that it ignores the query-item pairwise features when producing candidates, and we show that this weakness can be crucial.
Note, however, that the usage of cheaper models for candidate selection could be nicely combined with the RPG search, as described in the following RPG+ modification of our approach.
- RPG+
The pure RPG uses the same predefined entry vertex to start the graph exploration. However, if a promising candidate from an auxiliary model is available, we can use it as the entry point instead. In RPG+ we start from the best candidate obtained with the DNN from the Two-tower model. Note that we do not need any relevance function computations to obtain this candidate. Intuitively, starting from a sufficiently relevant entry vertex, the graph exploration in RPG+ requires far fewer hops to reach the "relevant region" of the database.
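The candidate-then-rerank baselines and the RPG+ entry-point choice described in this list can be sketched as follows. The embeddings, the `toy_scores` table, and all function names are hypothetical; a real setup would use the trained two-tower embeddings and the GBDT/DNN relevance model.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cheap_candidates(query_emb, item_embs, n):
    """Ids of the n items with the largest embedding dot product (no relevance model calls)."""
    return heapq.nlargest(n, range(len(item_embs)),
                          key=lambda i: dot(query_emb, item_embs[i]))


def rerank(query, candidate_ids, relevance, k):
    """Two-stage baseline: score only the shortlist with the expensive model."""
    return heapq.nlargest(k, candidate_ids, key=lambda i: relevance(query, i))


def rpg_plus_entry(query_emb, item_embs):
    """RPG+ entry vertex: the single best candidate of the cheap model."""
    return cheap_candidates(query_emb, item_embs, 1)[0]


# Toy data (hypothetical): 2-D embeddings and a lookup table as the "expensive" model.
item_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query_emb = [0.6, 0.8]
toy_scores = {0: 0.1, 1: 0.5, 2: 0.9, 3: 0.0}

shortlist = cheap_candidates(query_emb, item_embs, 2)
best = rerank(None, shortlist, lambda q, i: toy_scores[i], k=1)
entry = rpg_plus_entry(query_emb, item_embs)
print(shortlist, best, entry)
```

In the two-stage baselines the final answer can only come from the shortlist, whereas in RPG+ the cheap model merely chooses where the graph exploration starts, so items outside the shortlist remain reachable.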
|Dataset||Item features||User features||Pairwise features|
Figure 5 and Figure 6 present the dependence of Recall and Average relevance, respectively, on the number of relevance function computations. Figure 6 also reports the ideal values of Average relevance, obtained via exhaustive search. For all datasets, these plots show that the proposed RPG method outperforms all baselines by a large margin in the high-recall regions. Furthermore, RPG+ can boost the performance at low-recall operating points, given a cheap candidate selection model. Note that RPG reaches almost ideal average relevance within a small number of model computations. In these experiments, we report the performance for a fixed number $K$ of retrieved items, but RPG consistently outperforms the baselines for larger $K$ as well: Figure 7 presents the dependence of Recall on the number of relevance function computations for a larger $K$ and confirms the superiority of the proposed technique over the baselines.
Interestingly, the baselines perform differently on different datasets. In particular, the Two-tower baseline is quite competitive on Collections, while giving poor results on Video. To explain this observation, we compare the feature importance computed by the GBDT model (https://catboost.ai/docs/concepts/fstr.html). In a nutshell, for every feature, the importance value shows how the loss function, computed on the train set, changes if this feature is removed. Then we sum the importances across all item, user, and pairwise features and report them in Table 1. Note that for the Collections dataset the item features are more important, while for the Video dataset the pairwise features contain most of the signal. Consequently, the Top scored and Two-tower baselines show decent performance on Collections, as they can capture the signal from the user and item features and provide precise candidate lists for reranking. Meanwhile, they are not competitive on Video, because they lose the information from the pairwise features, which are the most important on this dataset. RPG and RPG+ provide top performance on both datasets.
Reducing to matrix factorization problem
The problem of maximal relevance retrieval can potentially be solved by matrix factorization methods [15]. Let us have a fixed set of queries $q_1, \dots, q_m$. Then one can construct embedding vectors for all items and embedding vectors for all the queries, obtained via a low-rank decomposition $R \approx U V^\top$ of the full relevance matrix $R$, where $R_{ij} = f(q_i, v_j)$, the rows of $U$ are the query embeddings, the rows of $V$ are the item embeddings, and $r$ denotes the decomposition rank. Then for a given query we can retrieve the best items in terms of dot product and then rerank these top items exhaustively based on the values of the original relevance function $f$. We evaluate the described baseline, performing approximate matrix factorization via the Alternating Least Squares (ALS) implementation from the Implicit library (https://github.com/benfred/implicit). The comparison of ALS with the graph-based methods for two datasets is presented in Figure 8. In this figure, the number following "ALS-" is the number of items that we randomly selected for each query from the fixed query set; we computed the corresponding relevance values and performed ALS on the resulting sparse relevance matrix. Note that the described approach is able to retrieve the relevant items only for queries from the fixed set and does not directly generalize to unseen queries. As operating points, we use and for Video and and for Pinterest. Figure 8 demonstrates that ALS cannot reach the quality of the graph-based methods.
As an upper bound for baselines that construct dot-product-based embeddings for items and users, we implemented SVD for the full relevance matrix. Note that this baseline is practically infeasible, as it requires an explicit computation of the full matrix, which is as computationally hard as precomputing the answers for all users by exhaustive search. Despite this, SVD still cannot reach the accuracy of the graph methods. In particular, for the Video dataset and , SVD achieves recall and for the Pinterest dataset and it achieves recall .
5 Conclusion
In this paper, we have proposed and evaluated the Relevance Proximity Graph (RPG) framework for non-exhaustive maximal relevance retrieval with highly-nonlinear models. Our approach generalizes similarity graphs to the scenario, where the relevance function is given for query-item pairs, and there may be no similarity measure for items. Our framework can be applied to a comprehensive class of relevance models, including deep neural networks and gradient boosted decision trees. While being conceptually simple, RPG successfully solves the relevance retrieval problem for million-scale databases and state-of-the-art models, as demonstrated by extensive experiments. As an additional contribution, we open-source the implementation of our method as well as two large-scale relevance retrieval datasets to support further research in this area.
References
[1] (2008) Near-optimal hashing algorithms for near neighbor problem in high dimension. Communications of the ACM 51 (1), pp. 117–122.
[2] (2016) Efficient indexing of billion-scale datasets of deep descriptors. pp. 2055–2063.
[3] (1975) Multidimensional binary search trees used for associative searching. Communications of the ACM 18 (9), pp. 509–517.
[4] (2017) Efficient and accurate non-metric k-nn search with applications to text matching. Ph.D. Thesis, Carnegie Mellon University.
[5] Random projection trees and low dimensional manifolds. Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC ’08.
[6] (2013) Randomized partition trees for exact nearest neighbor search. In Conference on Learning Theory, pp. 317–337.
[7] Efanna: an extremely fast approximate nearest neighbor search algorithm based on knn graph. arXiv preprint arXiv:1609.07228.
[8] (2017) Fast approximate nearest neighbor search with the navigating spreading-out graph. arXiv preprint arXiv:1707.00143.
[9] (2015) Learning image and user features for recommendation in social networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4274–4282.
[10] (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pp. 173–182.
[11] Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23-26, 1998, pp. 604–613.
[12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[13] (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110.
[14] (2016) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv preprint arXiv:1603.09320.
[15] (2017) A review on matrix factorization techniques in recommender systems. In 2017 2nd International Conference on Communication Systems, Computing and IT Applications (CSCITA).
[16] (2002) Searching in metric spaces by spatial approximation. The VLDB Journal 11 (1), pp. 28–46.
[17] (2018) CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 6639–6649.
[18] Encyclopedia of artificial intelligence.
[19] (2017) Super-convergence: very fast training of neural networks using large learning rates. arXiv preprint arXiv:1708.07120.
[20] (2011) A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 105–114.
[21] Modelling domain relationships for transfer learning on retrieval-based question answering systems in e-commerce. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, pp. 682–690.
[22] (2018) Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1079–1088.