1 Introduction
Retrieval of the most relevant items for a particular query is a key ingredient of a wide range of machine learning applications, e.g., recommender services, retrieval-based chatbot systems, and web search engines. In these applications, the relevance of an item to a particular query is usually predicted by pretrained models, such as deep neural networks (DNNs) or gradient boosted decision trees (GBDTs). Typically, a relevance model takes a query–item pair as input and outputs the relevance of the item to the query. In the most general case, queries and items are described by different sets of features and belong to different spaces. For instance, in recommender systems, queries are typically described by user gender, age, and usage history, while the item features mostly describe the content. Let us denote the query space by Q and the item space by I. Then the problem of maximal relevance retrieval can be formally stated as follows. Let us have a large finite set of items S ⊂ I, a query q ∈ Q, and a function f that maps query–item pairs to relevance values:

f : Q × I → ℝ.   (1)

For a given query q, we aim to find an item i* ∈ S that maximizes the relevance function f:

i* = argmax_{i ∈ S} f(q, i),   (2)
or, more generally, to find the top-K items from S that provide maximal relevance values. An important special case of problem (2), when Q = I = ℝ^d and

f(q, i) = −‖q − i‖_2,   (3)

is the well-known problem of nearest neighbor search, which has been investigated by the machine learning community for decades [3, 11, 6].
However, current applications typically use more complex and highly non-linear relevance functions f. For instance, many modern recommender services [10] use deep neural networks for relevance prediction, while chatbots [21] and many other applications use GBDT models. Naive exhaustive search requires |S| relevance function computations, which is not feasible for million-scale databases and computationally expensive models. In this paper, we propose a method that provides an approximate solution of high quality while computing f(q, i) only for a small fraction of S.
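For reference, the naive baseline can be sketched as follows; `exhaustive_top_k` and the quadratic scoring function here are illustrative stand-ins (not part of the released code), and the point is simply that one model call is made per database item:

```python
import heapq

def exhaustive_top_k(items, query, relevance, k):
    """Naive retrieval: evaluates the relevance model on every item."""
    # heapq.nlargest keeps only the k highest-scoring items.
    return heapq.nlargest(k, items, key=lambda item: relevance(query, item))

# Toy stand-in for an expensive DNN/GBDT model: negative squared 1-D distance.
relevance = lambda q, i: -(q - i) ** 2

items = list(range(1_000_000))            # million-scale database
top = exhaustive_top_k(items, 123_456.2, relevance, k=3)
# |S| model calls per query: infeasible when each call is a deep network.
```

With an expensive model, the cost of this loop is exactly what the non-exhaustive method below avoids.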
The proposed method builds on the approach of similarity graphs, which was shown to provide exceptional performance for the nearest neighbor search problem [16, 14, 8]. This approach organizes the set of items into a graph, where close items are connected by edges, and the search is performed via greedy exploration of this graph. In this paper, we extend similarity graphs to the setting where no similarity measure is defined on item pairs. Specifically, we describe each item by a vector of its relevance values for a representative subset of queries, and experimentally show that graph exploration can be successfully guided by DNN/GBDT models. Below we refer to our method as Relevance Proximity Graphs (RPG). The contributions of the paper can be summarized as follows:

We tackle a new problem of general relevance retrieval, which has so far received little attention from the community.

We extend the similarity graph framework to the setting where no similarity measure is defined in the item space.

We open-source the implementation of the proposed method (https://github.com/stanismorozov/rpg), as well as two million-scale datasets and state-of-the-art pretrained models, for further research on general relevance retrieval.
The rest of the paper is organized as follows: in Section 2, we briefly review prior work related to the proposed approach. Section 3 formally describes the construction and usage of RPG. In Section 4, we perform an extensive evaluation of RPG on several datasets and models and empirically confirm its practical usefulness. Section 5 concludes the paper.
2 Related work
Relevance retrieval problem. Probably the closest work to ours is [22], which also notes that large-scale relevance retrieval is computationally infeasible for multi-layer DNN models. They tackle this problem by learning a hierarchical model with a specific structure, which organizes the set of items into a tree during the training stage. While this model allows non-exhaustive retrieval, the approach cannot be directly applied to existing models and requires training the model from scratch. Another work [20] proposes a cascade scheme, where cheap auxiliary models provide shortlists of promising candidates and expensive relevance models are used only for candidate reranking. While efficient, such schemes can result in low recall if the capacity of the auxiliary models is insufficient to extract precise candidate lists. In contrast, the proposed RPG approach directly maximizes the relevance given by arbitrary off-the-shelf models. We confirm this claim for several GBDT and DNN models in the experimental section.
Nearest neighbor search. As mentioned above, problem (2) generalizes the well-known problem of nearest neighbor search (NNS). Overall, the machine learning community has pursued three separate lines of research on NNS: locality-sensitive hashing (LSH) [11, 1], partition trees [3, 5, 6], and similarity graphs [16]. While LSH-based and tree-based methods provide solid theoretical guarantees, the performance of graph-based methods was shown to be much higher [14]. The proposed approach is based on the similarity graph framework, hence we briefly review its main ideas below.
For a set of items S, the directed similarity graph contains a vertex corresponding to each item. Vertices i_1 and i_2 are connected by an edge if i_2 belongs to the set of nearest neighbors of i_1 in terms of a similarity function s(·, ·). The usage of similarity graphs for the NNS problem was initially proposed in the seminal work [16]. This approach constructs the similarity graph and then performs a greedy walk on it at the retrieval stage. The search starts from a random vertex and, on each step, moves from the current vertex to the neighbor that appears to be the closest to the query. The process terminates at a local minimum, i.e., when no adjacent vertex is closer to the query. Since [16], the general pipeline described above has been extended with a large set of additional heuristics [14, 7, 8], outperforming LSH-based and tree-based methods. The proposed approach to the relevance retrieval problem could be based on any state-of-the-art similarity graph. In this paper, we employ the publicly available implementation of Hierarchical Navigable Small World (HNSW) [14] graphs. In HNSW, the search is performed in a semi-greedy fashion, via a variant of beam search [18], as described in detail in Algorithm 1. During the construction stage, HNSW builds the graph by incrementally adding vertices one by one. A parameter M denotes the maximum degree of vertices in the graph. For each new vertex, we perform the search described in Algorithm 1 and connect the vertex to the M closest vertices that already exist in the graph. HNSW builds a nested hierarchy of graphs, where the lowest layer contains all vertices and each higher layer contains a subset of the vertices of the layer below. The search proceeds from top to bottom, and the result from each upper layer is used as an entry point to the layer below.
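A simplified, single-layer sketch of such a semi-greedy beam search is given below; the function names and the toy path graph are illustrative, and the actual HNSW implementation adds the layer hierarchy and many further optimizations:

```python
import heapq

def graph_search(graph, score, entry, ef):
    """Semi-greedy beam search over a similarity graph.
    `graph` maps vertex -> list of neighbor vertices,
    `score(v)` is the similarity of vertex v to the current query,
    `ef` is the beam width. Returns the ef best (score, vertex) pairs."""
    visited = {entry}
    candidates = [(-score(entry), entry)]     # max-heap via negated scores
    best = [(score(entry), entry)]            # min-heap of current results
    while candidates:
        neg_s, v = heapq.heappop(candidates)
        if -neg_s < best[0][0] and len(best) == ef:
            break  # best remaining candidate is worse than all kept results
        for u in graph[v]:
            if u in visited:
                continue
            visited.add(u)
            s = score(u)
            if len(best) < ef or s > best[0][0]:
                heapq.heappush(candidates, (-s, u))
                heapq.heappush(best, (s, u))
                if len(best) > ef:
                    heapq.heappop(best)       # drop the worst kept result
    return sorted(best, reverse=True)

# Path graph 0-1-...-9; the query's relevance peaks at vertex 7.
graph = {v: [u for u in (v - 1, v + 1) if 0 <= u <= 9] for v in range(10)}
result = graph_search(graph, score=lambda v: -abs(v - 7), entry=0, ef=3)
# the best-first walk climbs from vertex 0 towards vertex 7
```

The same routine serves both retrieval (scored by the query) and construction (scored by the new vertex being inserted).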
Beyond the NNS problem, similarity graphs were also shown to provide decent performance for more general similarity functions, e.g., inner product, Kullback–Leibler divergence, cosine distance, Itakura–Saito distance, and others [4]. In this paper, we show that this framework can be extended to the setting where no similarity measure is specified between elements of the item space and the relevance function is defined by a state-of-the-art ML model, such as a DNN or a GBDT.

3 Relevance Proximity Graphs (RPG)
The key idea of our approach is to represent the set of items as a graph and to perform the search on this graph using the given relevance function. The retrieval stage remains unchanged: we perform semi-greedy graph exploration, guided by the relevance function (Algorithm 1). As shown in the experiments, state-of-the-art DNN and GBDT models successfully guide the graph exploration process, provided that the item set is organized in an appropriate graph.
However, the question "How to construct an appropriate graph?" becomes non-trivial, as items and queries belong to different spaces. Moreover, in some scenarios there is no similarity defined in the item space, hence the existing approaches to graph construction cannot be directly applied.
Relevance-aware similarity in the item space
For graph construction, we exploit the natural idea that two items i_1 and i_2 are similar if the corresponding functions f(·, i_1) and f(·, i_2) are "close", i.e., the items are both relevant or both irrelevant for the same queries. As a straightforward way to define the similarity between functions, we use the L_2 distance with respect to some measure μ defined on the query space (we put a minus sign, as similarity search is traditionally formulated as a maximization problem):

s(i_1, i_2) = −‖f(·, i_1) − f(·, i_2)‖_{L_2(μ)}^2.   (4)

We choose the proper measure over the query space based on the following intuition. Let us have a probability density p(q) defined on the query space. For most applications, it is natural to force the functions f(·, i_1) and f(·, i_2) to be closer in the regions where the density of the query distribution is high. This corresponds to the following similarity function:

s(i_1, i_2) = −∫_Q (f(q, i_1) − f(q, i_2))^2 p(q) dq,   (5)

which is equivalent to an expectation over the query distribution:

s(i_1, i_2) = −E_{q∼p(q)} [(f(q, i_1) − f(q, i_2))^2].   (6)
In practice, we use a Monte Carlo estimate of this value. Let us have a random sample of size m from the query distribution. We enumerate this random sample:

Q_m = {q_1, …, q_m}.   (7)

Then we define a vector r_i corresponding to the item i in the following way:

r_i = (f(q_1, i), …, f(q_m, i)).   (8)
We refer to r_i as a relevance vector, as it contains the relevance values for the item i and the queries in the sample Q_m. Note that we choose the sample only once, and it remains the same for all items. The similarity between items i_1 and i_2 can then be defined as:

s(i_1, i_2) = −‖r_{i_1} − r_{i_2}‖_2.   (9)
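A minimal sketch of relevance vectors (8) and the induced similarity (9); the scalar relevance model below is only a toy stand-in for a DNN/GBDT:

```python
import math
import random

def relevance_vector(item, queries, f):
    """Relevance vector r_i = (f(q_1, i), ..., f(q_m, i)), cf. eq. (8)."""
    return [f(q, item) for q in queries]

def similarity(r1, r2):
    """Negative Euclidean distance between relevance vectors, cf. eq. (9)."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))

# Toy relevance model over scalar queries/items.
f = lambda q, i: -abs(q - i)

random.seed(0)
sample = [random.uniform(0, 10) for _ in range(64)]  # fixed query sample Q_m
r_a = relevance_vector(3.0, sample, f)
r_b = relevance_vector(3.1, sample, f)
r_c = relevance_vector(9.0, sample, f)
# Items with similar relevance profiles end up closer:
assert similarity(r_a, r_b) > similarity(r_a, r_c)
```

Note that the sample is drawn once and reused for every item, so all relevance vectors live in the same m-dimensional space.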
Given this similarity measure, we can apply the existing graph construction method from [14]. Note that, for fair evaluation, only held-out training queries were used to obtain the relevance vectors in all the experiments, while the relevance retrieval accuracy was computed on a separate set of test queries.
RPG construction
We summarize the graph construction scheme more formally. Let us have the item set S and the train query set. The main parameter of our scheme is the dimensionality of the relevance vectors, denoted by m.

Select m queries q_1, …, q_m from the train query set; these will be used to construct the relevance vectors.

Compute the relevance vectors r_i (8) for all items from S.

Build a similarity graph on S, using the negative L_2 distance between relevance vectors (9) as the similarity measure, via the HNSW method [14].
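The three steps above can be sketched end-to-end as follows; for brevity, this illustrative version connects each vertex by brute-force nearest neighbor search in relevance-vector space, whereas the paper uses HNSW [14] for the final step:

```python
import math

def build_rpg(items, train_queries, f, m, M):
    """Build a simplified relevance proximity graph:
    1) take m train queries for the relevance vectors,
    2) compute r_i for every item,
    3) connect each vertex to its M nearest vertices in relevance-vector
       space (brute force here; the paper uses HNSW instead)."""
    sample = train_queries[:m]                             # step 1
    vectors = [[f(q, i) for q in sample] for i in items]   # step 2
    dist = lambda a, b: math.dist(vectors[a], vectors[b])
    graph = {}
    for a in range(len(items)):                            # step 3
        others = [b for b in range(len(items)) if b != a]
        graph[a] = sorted(others, key=lambda b: dist(a, b))[:M]
    return graph

f = lambda q, i: -abs(q - i)          # toy relevance model
items = [0.0, 0.1, 5.0, 5.1, 9.9]
graph = build_rpg(items, train_queries=[1.0, 4.0, 8.0], f=f, m=3, M=2)
```

The construction never compares item features directly: all distances are taken between relevance vectors, so the scheme also applies when no item-space similarity exists.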
4 Experiments
In this section, we present an experimental evaluation of the proposed RPG approach for the top-K relevance retrieval problem on three real-world datasets. Our code is written in C++, and the implementation of similarity graphs is based on the open-source HNSW implementation (https://github.com/yurymalkov/hnsw). In our experiments, we use two standard performance measures. The first is the commonly used Recall, defined as the rate of successfully found neighbors, averaged over a set of queries. The second is Average relevance, defined as the average of the relevance values of the query and the retrieved top-K items, averaged over the set of queries.
Datasets
We report experimental results obtained on the three datasets described below. To the best of our knowledge, there are no publicly available large-scale benchmarks for relevance retrieval with highly non-linear models and no similarity measure between item space elements, therefore we collect and open-source two datasets. We expect that these datasets will be valuable for the community, given the abundance of relevance retrieval problems in applications.
Collections dataset. This dataset originates from a proprietary image recommendation service. We sampled one million most-viewed images and a set of random users. Each user–item pair is characterized by a set of features, comprising item features, user features, and pairwise user–item features. We then trained a state-of-the-art GBDT model [17] on these features, which we open-source along with the dataset.
Video dataset. This dataset originates from a proprietary video recommendation service. We sampled one million most-viewed videos and a set of random users. Each user–item pair is characterized by a set of features, comprising item features, user features, and pairwise user–item features. We trained a GBDT model [17] on these features, which we open-source as well.
Pinterest. We also evaluate our approach on the medium-scale Pinterest dataset [9] with the deep neural network model for relevance prediction proposed in [10]. This dataset contains items and queries without any feature representations, i.e., only a rating matrix is provided.
For all datasets, we randomly split the users into train queries and test queries. We use the train queries for the relevance vector computation, and we average the evaluation measures over the test queries.
Sanity check
To demonstrate the reasonableness of the proposed scheme, we evaluate RPG on two common nearest neighbor benchmarks, SIFT1M [13] and DEEP1M [2], with the Euclidean distance between queries and items, that is, Q = I = ℝ^d and

f(q, i) = −‖q − i‖_2.   (10)
Figure 1(a) and Figure 1(b) provide the results of the comparison of RPG with HNSW [14], using the same graph parameters for both methods and a fixed relevance vector dimensionality m for RPG. Indeed, the graphs constructed based on distances between relevance vectors (9) are less accurate, but still provide decent retrieval performance. In particular, it is sufficient to perform only a few thousand distance evaluations to achieve a high recall level.
We conjecture that the reason for the decent performance even with suboptimal graphs is that, during the graph exploration stage, the search process is "guided" by the correct similarity measure, which in this experiment is the negative distance between the original data vectors. Furthermore, as we show in the experiments below, the graphs constructed on relevance vectors perform exceptionally well even when the relevance function is based on highly non-linear DNN or GBDT models.
Ablation and preliminary experiments
Now we investigate RPG performance with varying database sizes and different parameter values.
RPG vertex degree. First, we investigate the dependence of RPG performance on the maximum vertex degree M. For the Collections dataset, the recall-vs-complexity curves for different values of M are shown in Figure 3. Here, the length of the relevance vectors m is fixed for all values of M. Surprisingly, Figure 3 demonstrates that the best results are obtained for a quite small degree, smaller than the typical vertex degrees in graphs for metric nearest neighbor search [14]. We use this small value of M for all datasets in all the experiments below.
Length of relevance vectors. Next, we investigate how the RPG accuracy depends on the length of the relevance vectors m. We used a random sample of train queries for the computation of relevance vectors and evaluated recall for several values of m on all three datasets. The results are shown in Figure 4, which illustrates Recall for different numbers of model computations. As expected, higher m results in more accurate retrieval, as the Monte Carlo estimates in (6) become more accurate. On the other hand, Figure 4 demonstrates diminishing returns from higher m: the performance difference between the two largest values of m is only marginal.
Search scalability. Finally, to investigate the empirical scalability of RPG, we varied the Collections database size and determined the number of relevance function computations required to achieve a fixed Recall level for a fixed top size. The results, shown in Figure 2, imply a power-law dependence of the search cost on the database size. Note, however, that the exponent of the power law is less than one, hence the empirical scalability of RPG is sublinear in the database size.
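The exponent of such a power law can be estimated as the slope of the cost-vs-size curve in log-log coordinates; the sketch below uses synthetic numbers, not the paper's measurements:

```python
import math

def power_law_exponent(sizes, costs):
    """Least-squares slope of log(cost) vs log(size):
    if cost ~ a * size**alpha, the slope recovers alpha."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in costs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Illustrative sublinear growth (alpha < 1), not the measured values.
sizes = [10_000, 100_000, 1_000_000]
costs = [2 * n ** 0.6 for n in sizes]
alpha = power_law_exponent(sizes, costs)   # recovers 0.6 here
```

An exponent below one means that doubling the database less than doubles the number of model computations at a fixed recall level.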
Comparison with baselines
We compare the proposed RPG method with several baselines for general relevance retrieval. In particular, we evaluate the following methods:

Top-scored

For every item, we compute its average relevance over the train queries and select the C items with the maximal global query-independent relevance. We then rerank these items based on the actual relevance value, computed by the query–item relevance model. We vary C to achieve a runtime/accuracy trade-off. Intuitively, this baseline selects the most "popular" items.
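A sketch of this baseline (illustrative names; the averaging and reranking follow the description above):

```python
def top_scored(items, train_queries, f, query, C, K):
    """Top-scored baseline: shortlist the C items with the highest average
    relevance over the train queries (query-independent "popularity"),
    then rerank the shortlist with the actual model f for the given query."""
    popularity = {
        i: sum(f(q, i) for q in train_queries) / len(train_queries)
        for i in items
    }
    shortlist = sorted(items, key=popularity.get, reverse=True)[:C]
    # Only C (not |S|) relevance computations happen at query time.
    return sorted(shortlist, key=lambda i: f(query, i), reverse=True)[:K]
```

The popularity scores are precomputed once offline; at query time the expensive model is evaluated only on the shortlist.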

Item-based graph

is a baseline that uses a similarity graph constructed on the item features only, instead of the relevance vectors. Let us denote by v_i the normalized feature vector of item i. Then the similarity between two items can be defined as

s(i_1, i_2) = ⟨v_{i_1}, v_{i_2}⟩.   (11)

Note that, compared to RPG, the item-based graph has two crucial deficiencies:

The item-based graph construction does not use any information about the query distribution or the relevance prediction model.

The dataset may lack item-only features (e.g., Pinterest); in such cases, the item-based graph cannot be constructed.

In practice, one could use a less accurate but computationally cheaper model (e.g., a linear one) to produce a list of candidates that are then reranked by the expensive GBDT/DNN model. To compare the proposed RPG framework with such two-stage approaches, we propose the following baseline:

Two-tower

We learn a "two-tower" DNN that encodes query and item features into 50-dimensional embeddings. The DNN has separate query and item branches, each consisting of three fully-connected layers with ELU nonlinearity and Batch Normalization. The relevance of a query–item pair is predicted as the dot product of the corresponding embeddings. We train this model with the same target as the original GBDT model, using the Adam optimizer [12] and the OneCycle [19] learning rate schedule. During the retrieval stage, we select the C items that provide the maximum dot product [4] with a given query and rerank them based on the actual relevance value. We vary C for the runtime/accuracy trade-off. An important weakness of the Two-tower baseline is that it ignores the pairwise query–item features when producing candidates, and we show that this weakness can be crucial.
Note, however, that the usage of cheaper models for candidate selection can be nicely combined with the RPG search, as described in the following RPG+ modification of our approach.

RPG+

The pure RPG uses the same predefined entry vertex to start the graph exploration. However, if a promising candidate from an auxiliary model is given, we can use it as an entry point instead. In RPG+, we start from the best candidate produced by the DNN from the Two-tower baseline. Note that we do not need any relevance function computations to obtain this candidate. Intuitively, when starting from a sufficiently relevant entry vertex, the graph exploration in RPG+ requires many fewer hops to reach the "relevant region" of the database.
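The only change relative to plain RPG is the choice of entry vertex, which can be sketched as follows (toy embeddings; `encode_query` stands in for the query tower of the DNN):

```python
def rpg_plus_entry(query, encode_query, item_embeddings):
    """RPG+: pick the graph entry point as the item whose two-tower
    embedding has the highest dot product with the query embedding.
    No relevance-model computations are spent here; the expensive model
    is only used during the subsequent graph exploration."""
    q = encode_query(query)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return max(range(len(item_embeddings)),
               key=lambda v: dot(q, item_embeddings[v]))

# Toy 2-D item embeddings; the identity encoder is a stand-in for the DNN.
embeddings = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
entry = rpg_plus_entry((0.1, 0.9), lambda x: x, embeddings)
# graph exploration would then start from `entry` instead of a fixed vertex
```

Since the item embeddings are precomputed, the dot products are far cheaper than relevance-model calls, so the improved entry point comes almost for free.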
Table 1: Summed feature importances for each feature group.

Dataset      | Item features | User features | Pairwise features
Collections  | 0.1466        | 0.0260        | 0.0642
Video        | 0.0099        | 0.0027        | 0.4114
Figure 5 and Figure 6 present the dependence of Recall and Average relevance, respectively, on the number of relevance function computations. Figure 6 also reports the ideal values of Average relevance, obtained via exhaustive search. For all datasets, these plots show that the proposed RPG method outperforms all baselines by a large margin in the high-recall regions. Furthermore, RPG+ can boost the performance at the low-recall operating points, given a cheap candidate selection model. Note that RPG reaches almost ideal average relevance within a small number of model computations. In these experiments, we report the performance when the top-K items are retrieved, and we claim that RPG consistently outperforms the baselines for larger K as well. Figure 7 presents the dependence of Recall on the number of relevance function computations for a larger K and confirms the superiority of the proposed technique over the baselines.
Interestingly, the baselines perform differently on different datasets. In particular, the Two-tower baseline is quite competitive on Collections, while giving poor results on Video. To explain this observation, we compare the feature importances computed by the GBDT model (https://catboost.ai/docs/concepts/fstr.html). In a nutshell, for every feature, the importance value shows how the loss function, computed on the train set, changes if this feature is removed. We then sum the importances across all item, user, and pairwise features and report them in Table 1. Note that for the Collections dataset the item features are more important, while for the Video dataset the pairwise features carry most of the signal. Consequently, the Top-scored and Two-tower baselines show decent performance on Collections, as they can capture the signal from the user and item features and provide precise candidate lists for reranking. Meanwhile, they are not competitive on Video, because they lose the information from the pairwise features, which are the most important on this dataset. RPG/RPG+ provides top performance on both datasets.

Reducing to a matrix factorization problem
The problem of maximal relevance retrieval can potentially be solved by matrix factorization methods [15]. Let us have a fixed set of queries Q. Then one can construct embedding vectors for all items and all queries via a low-rank decomposition of the full relevance matrix R ≈ UV^T, where R ∈ ℝ^{|Q|×|S|}, U ∈ ℝ^{|Q|×r}, V ∈ ℝ^{|S|×r}, and r denotes the decomposition rank. Then, for a given query, we can retrieve the best items in terms of the dot product and rerank these top items exhaustively based on the values of the original relevance function f. We evaluate the described baseline, performing approximate matrix factorization via the Alternating Least Squares (ALS) implementation from the Implicit library (https://github.com/benfred/implicit). The comparison of ALS with the graph-based methods for two datasets is presented in Figure 8. There, ALS means that we randomly selected a subset of items for each query from Q, computed the corresponding relevance values, and performed ALS on the obtained sparse relevance matrix. Note that the described approach can retrieve relevant items only for queries from Q and does not directly generalize to unseen queries. As operating points, we use two settings for Video and two for Pinterest. Figure 8 demonstrates that ALS cannot reach the quality of the graph-based methods.
As an upper bound for baselines that construct dot-product-based embeddings for items and users, we implemented the SVD of the full matrix R. Note that this is an extremely infeasible baseline, as it requires an explicit computation of the full matrix R, which is as computationally hard as precomputing the answers for all users by exhaustive search. Despite this, SVD still cannot reach the accuracy of the graph methods: it achieves lower recall on both the Video and Pinterest datasets.
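The factorize-shortlist-rerank pipeline can be sketched as follows; for brevity, this illustrative version uses a truncated SVD via NumPy instead of ALS on a sparse matrix:

```python
import numpy as np

def low_rank_retrieve(R, rank, query_idx, C, K, f):
    """Low-rank baseline: factorize the relevance matrix R (queries x items)
    with a rank-r truncated SVD, shortlist C items by dot product of the
    embeddings, then rerank the shortlist with the original relevance
    function f. (The paper's baseline uses ALS on a sparse R instead.)"""
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    q_emb = U[:, :rank] * S[:rank]          # query embeddings
    i_emb = Vt[:rank].T                     # item embeddings
    scores = i_emb @ q_emb[query_idx]       # cheap dot-product scores
    shortlist = np.argsort(scores)[::-1][:C]
    return sorted(shortlist, key=lambda i: f(query_idx, i), reverse=True)[:K]

# Toy relevance matrix; f reads the exact value back for the rerank step.
R = np.array([[5.0, 1.0, 0.0], [0.0, 2.0, 4.0]])
f = lambda q, i: R[q, i]
top = low_rank_retrieve(R, rank=2, query_idx=0, C=2, K=1, f=f)
```

The sketch also makes the two limitations above concrete: the query must be a row of R (no generalization to unseen queries), and computing R fully is as expensive as exhaustive search.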
5 Conclusion
In this paper, we have proposed and evaluated the Relevance Proximity Graph (RPG) framework for non-exhaustive maximal relevance retrieval with highly non-linear models. Our approach generalizes similarity graphs to the scenario where the relevance function is given for query–item pairs and there may be no similarity measure for items. Our framework can be applied to a comprehensive class of relevance models, including deep neural networks and gradient boosted decision trees. While conceptually simple, RPG successfully solves the relevance retrieval problem for million-scale databases and state-of-the-art models, as demonstrated by extensive experiments. As an additional contribution, we open-source the implementation of our method as well as two large-scale relevance retrieval datasets to support further research in this area.
References
[1] (2008) Near-optimal hashing algorithms for near neighbor problem in high dimension. Communications of the ACM 51(1), pp. 117–122.
[2] (2016) Efficient indexing of billion-scale datasets of deep descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2055–2063.
[3] (1975) Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), pp. 509–517.
[4] (2017) Efficient and accurate non-metric k-NN search with applications to text matching. Ph.D. thesis, Carnegie Mellon University.
[5] (2008) Random projection trees and low dimensional manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC '08.
[6] (2013) Randomized partition trees for exact nearest neighbor search. In Conference on Learning Theory, pp. 317–337.
[7] (2016) EFANNA: an extremely fast approximate nearest neighbor search algorithm based on kNN graph. arXiv preprint arXiv:1609.07228.
[8] (2017) Fast approximate nearest neighbor search with the navigating spreading-out graph. arXiv preprint arXiv:1707.00143.
[9] (2015) Learning image and user features for recommendation in social networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4274–4282.
[10] (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pp. 173–182.
[11] (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613.
[12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[13] (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), pp. 91–110.
[14] (2016) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv preprint arXiv:1603.09320.
[15] (2017) A review on matrix factorization techniques in recommender systems. In 2017 2nd International Conference on Communication Systems, Computing and IT Applications (CSCITA).
[16] (2002) Searching in metric spaces by spatial approximation. The VLDB Journal 11(1), pp. 28–46.
[17] (2018) CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 6639–6649.
[18] (1987) Encyclopedia of Artificial Intelligence.
[19] (2017) Super-convergence: very fast training of neural networks using large learning rates. arXiv preprint arXiv:1708.07120.
[20] (2011) A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 105–114.
[21] (2018) Modelling domain relationships for transfer learning on retrieval-based question answering systems in e-commerce. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, pp. 682–690.
[22] (2018) Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1079–1088.