
A Novel User Representation Paradigm for Making Personalized Candidate Retrieval

Candidate retrieval is a crucial part of a recommendation system, where quality candidates need to be selected in realtime for a user's recommendation request. Conventional methods make direct use of feature similarity for highly scalable retrieval, yet their retrieval quality can be limited by inferior user interest modeling. In contrast, deep learning-based recommenders model user interest precisely, but they are difficult to scale for efficient candidate retrieval. In this work, a novel paradigm, Synthonet, is proposed for both precise and scalable candidate retrieval. With Synthonet, a user is represented as a compact vector known as the retrieval key. An Actor-Critic learning framework is developed to optimize the generation of the retrieval key, such that the similarity between the retrieval key and an item's representation accurately reflects the user's interest in that item. Consequently, quality candidates can be acquired in realtime on top of highly efficient similarity search methods. Comprehensive empirical studies are carried out to verify the proposed methods, where consistent and remarkable improvements are achieved over a series of competitive baselines, including representative variations on metric learning.



1. Introduction

Recommendation systems play a crucial role in modern web services, e.g., online advertisement and e-commerce, as items of interest to users can be delivered automatically based on the analysis of their extensive behavioral data. Instead of selecting suitable items in one shot, a typical recommendation system consecutively executes two fundamental operations (illustrated in Figure 1): candidate retrieval and ranking (Cheng et al., 2016; Covington et al., 2016). Given a recommendation request from a user, the retrieval module selects a small set of relevant candidates in realtime from the tremendous pool of items; then, the ranking module further refines the candidates with higher precision, and the top-ranked candidates are returned as the final recommendation result. The candidate retrieval operation therefore strongly affects the overall recommendation quality, and its performance has to be optimized within the integral system.

Figure 1. Sketch of the typical workflow for a recommendation system (better viewed in colour).

Because of its distinct role in the recommendation system, candidate retrieval should satisfy two properties. For one thing, considering the tremendous scale of the item set in reality, candidate retrieval must be temporally scalable so as to maintain a tolerable running cost. For another, to obtain high-quality retrieval results, user interest must be precisely modeled while searching for appropriate candidates.

Conventionally, it is a common practice (Chen et al., 2018) to represent user and item with a shared group of features (e.g., keywords, categorical tags or LDA vectors), so that relevant candidates for a user can be retrieved based on their feature similarity. For such an operation, index structures like LSH (locality sensitive hashing) and KNN graphs are pre-constructed for all the items, which significantly shortcut the retrieval process thanks to their superior efficiency in similarity search. Following the same candidate retrieval paradigm, more proficient algorithms are developed on top of metric learning, such as DSSM (Huang et al., 2013) and CML (Hsieh et al., 2017), where more expressive latent features are learned for user and item to better characterize their mutual relevance. Regardless of their different formulations, all the above approaches have to represent user and item in the same space, and retrieve the candidates purely based on feature similarity (e.g., Cosine similarity or Euclidean distance between the user's and item's feature vectors). Such a highly restricted workflow is likely to give rise to inferior modeling of user interest, thus harming the retrieval quality.

By comparison, various types of deep learning-based recommenders have been developed recently, such as Wide&Deep (Cheng et al., 2016), Deep&Cross (Wang et al., 2017a), DeepFM (Guo et al., 2018) and xDeepFM (Lian et al., 2018), where user interest can be captured with high precision. However, such approaches would rely on complex functional relationship between user and item’s features; in other words, user interest is no longer reflected by feature similarity. As a result, it is hard to index the items w.r.t. the user interest learned by such recommenders. Therefore, they are mainly used for the ranking operation rather than candidate retrieval.

In short, conventional methods based on feature similarity are scalable for realtime retrieval, but their retrieval quality can be limited by inferior user interest modeling. Meanwhile, deep learning-based recommenders are precise in modeling user interest, yet they are difficult to scale for candidate retrieval.

In this work, we propose a novel personalized candidate retrieval paradigm, Synthonet, which makes the best of both worlds. On the one hand, user interest is precisely captured by an arbitrary type of deep recommendation model; on the other hand, relevant candidates can be directly obtained via similarity search, which enables items to be indexed for efficient retrieval. The underlying idea of Synthonet is quite intuitive. Particularly, a deep recommendation model $F$ (e.g., (Cheng et al., 2016; Wang et al., 2017a; Guo et al., 2018; Lian et al., 2018)) is employed for the precise modeling of user interest; meanwhile, another model $G$ is learned to synthesize a “virtual item representation” for a user, referred to as the retrieval key. Importantly, it is expected that the user interest captured by $F$ can be well approximated by the similarity between the retrieval key and the real items’ representations. In other words, given user $u$, who prefers item $i$ over item $j$, $u$’s retrieval key will always be more similar to $i$’s representation:

$$\mathrm{sim}(k_u, \theta_i) > \mathrm{sim}(k_u, \theta_j), \tag{1}$$

where $\mathrm{sim}(\cdot, \cdot)$ is a certain similarity function, e.g., Cosine similarity, $k_u$ indicates the retrieval key, and $\theta_i$ stands for item $i$’s representation. Given the satisfaction of the above relationship, the user’s interested items will be confined within the neighborhood of $k_u$; thus, high-relevance candidates can be retrieved efficiently via similarity search.

To realize the above candidate retrieval paradigm, an Actor-Critic style learning framework (Konda and Tsitsiklis, 2000; Bhatnagar et al., 2009) is developed. With the Actor module, the retrieval key is synthesized for each user, together with representations generated for each item, based on both parties’ inherent information. And with the Critic module, supervision signals are generated for Actor, such that the relationship in Eq. 1 can be achieved. In Critic, a compound reward is integrated from three sources: evaluator, validator, and referencer, whose individual functionality is elaborated as follows.

(1) The evaluator is an arbitrary form of deep recommendation model, which is pre-trained for the accurate modeling of the user’s underlying interest. Since the retrieval key can be regarded as the representation of a virtual item, the well-trained evaluator is employed to measure the user’s degree of interest towards such a virtual item.

(2) Because of deep models’ inherent lack of robustness (Goodfellow et al., [n. d.]; Carlini and Wagner, 2017; Madry et al., 2018), the evaluator might falsely reward an inferior synthesization. To get rid of this potential defect, an auxiliary discriminative model called the validator is introduced, which prevents the evaluator from being fooled by adversarial cases.

(3) The referencer is a certain type of similarity measurement, e.g., Cosine or Euclidean, which is used to encourage the maximization of similarity between the retrieval key and the candidates’ representations.

Figure 2. Illustration of Synthonet’s objective: to have the similarity between the retrieval key and item’s representation aligned with user interest.

With the maximization of the rewards in (1) and (2), the retrieval key will become a local optimum of user interest; in addition, with the maximization of the reward in (3), the representations of the user’s interested items will be anchored in the neighborhood of the retrieval key. Finally, by maximizing all the rewards simultaneously, the “concentric diagram” is formed (shown in Figure 2), with the user’s most interested point, i.e., the retrieval key, being the center (the red point), the representations of the user’s interested items forming its neighborhood (the green points), and the representations of the less interested items distributed away from it (the blue points). In other words, user interest becomes almost positively related to the similarity between retrieval key and item representation. Therefore, the relationship stated by Eq. 1 will hold.

To summarize, the major contribution of this work is highlighted as follows.

  1. A novel paradigm Synthonet is proposed in this work, which learns to synthesize the virtual item representation (i.e., retrieval key) for both precise and temporally scalable candidate retrieval.

  2. An Actor-Critic infrastructure is developed for Synthonet, which enables user interest to be well approximated by the similarity between retrieval key and item representation.

  3. Extensive empirical studies are conducted with a series of real-world datasets, where consistent and remarkable improvements are achieved over representative baseline methods, such as those based on metric learning.

The rest of this work is organized as follows. First of all, the preliminaries and formulation of our problem are presented in Section 2. Secondly, Synthonet’s architecture is overviewed in Section 3, followed by its instantiation in Section 4. The experimental studies are reported in Section 5, and the related works are reviewed in Section 6. Finally, the paper is concluded in Section 7.

Figure 3. Schematic illustration of Synthonet’s infrastructure. (I) shows Synthonet’s offline construction; (II) illustrates Synthonet’s online workflow (better viewed in colour).

2. Problem Formulation

In this section, preliminaries on recommendation systems and candidate retrieval are introduced first, on top of which the definition and quantitative formulation of the personalized candidate retrieval problem are presented.

2.1. Preliminaries and Definition

2.1.1. Typical Workflow of Recommendation

As sketched by Figure 1, a typical recommendation system operates with the following two consecutive steps.

Candidate Retrieval. Given a recommendation request, which consists of user and context information, a small group of relevant candidates are retrieved in realtime from the whole item set.

Candidate Ranking. The retrieved candidates will be further refined by the ranking module, which is of higher precision; the top ranked candidates will be returned as the recommendation result.

Given the two parts’ distinct functionality, different techniques have to be developed in practice. Particularly, the ranking module mainly emphasizes accuracy, so its algorithm should be as precise as possible; in contrast, the candidate retrieval module has to jointly consider precision and scalability, given that the scale of the whole item set could be huge in reality.

2.1.2. Candidate Retrieval

Despite diverse formulations in detail, mainstream candidate retrieval methods share a common framework. Particularly, they first determine the way of representing the items (e.g., bag-of-keywords), along with their similarity measurement (e.g., keyword co-occurrence). Then the items are organized with a certain index structure (e.g., hash tables), where similar items are grouped in common units. Once a recommendation request is issued by a user, the retrieval key is generated for her, which follows the same representation format as that of the items. Finally, the retrieval key is used to search the pre-constructed index, where items with similar representations are retrieved as candidates. Thanks to the sub-linear time complexity of such a similarity search paradigm, candidate retrieval can be highly scalable and thus completed in realtime.
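The index-then-search framework above can be illustrated with a minimal sketch. It assumes random-hyperplane hashing as the index structure and Cosine similarity as the measurement; the item vectors are random stand-ins, and all names (`signature`, `retrieve`, `planes`) are hypothetical rather than taken from this work.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_items, n_bits = 16, 1000, 8

# Hypothetical item representations (random stand-ins for real features).
items = rng.normal(size=(n_items, dim))
# Random hyperplanes shared by items and retrieval keys.
planes = rng.normal(size=(n_bits, dim))

def signature(vec):
    """Sign pattern of the projections onto the hyperplanes, as a hashable tuple."""
    return tuple((planes @ vec > 0).astype(int))

# Offline: group items whose signatures collide into the same bucket.
index = defaultdict(list)
for item_id, vec in enumerate(items):
    index[signature(vec)].append(item_id)

def retrieve(key, top_k=10):
    """Online: inspect only the key's bucket, then rank its members by cosine similarity."""
    bucket = index.get(signature(key), [])
    if not bucket:
        return []
    cand = items[bucket]
    sims = cand @ key / (np.linalg.norm(cand, axis=1) * np.linalg.norm(key))
    order = np.argsort(-sims)[:top_k]
    return [bucket[j] for j in order]

# A retrieval key that is a scaled copy of item 42 hashes to item 42's bucket.
key = 1.5 * items[42]
candidates = retrieve(key)
```

Only one bucket is scanned per request, which is where the sub-linear cost comes from; production systems typically use several hash tables or a graph index to reduce missed neighbours.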

However, how to design better schemes so that high-relevance candidates can be comprehensively acquired remains an open question. Particularly, the similarity search paradigm must be well aligned with the user’s underlying interest, such that the user’s most interesting items are also those whose representations are highly similar to the retrieval key. With the above concern in mind, the optimal personalized candidate retrieval (OPCR) problem is defined as follows.

Definition 2.1 (OPCR). In OPCR, the user’s most interesting items are also those whose representations are most similar to the retrieval key; therefore, the most relevant candidates can be retrieved efficiently via similarity search.

2.2. Quantitative Formulation

In this part, the quantitative formulation of OPCR is presented, which also helps to illustrate our intuition for solving it in practice.

Let $u$/$i$ be an arbitrary user/item, respectively. Suppose there is an “almost perfect” deep recommendation model $F$, which precisely measures user interest as $F(\theta_u, \theta_i)$, where $\theta$ stands for the representation of the corresponding entity. In addition, there is a synthesization model $G$, which generates $k_u$ as the retrieval key. With a recommendation request from user $u$, the following optimization problem is formulated, which specifies the retrieval of the user’s most interesting candidates:

$$\max_{G,\,\theta} \sum_{i \in C_u} F(\theta_u, \theta_i) \quad \text{s.t.} \quad \theta_i \in \mathrm{KNN}(k_u)\ \ \forall i \in C_u; \quad I_u^{+} \subseteq C_u, \tag{2}$$

where $C_u$ stands for the retrieved candidates and $I_u^{+}$ for the items consumed by the user in reality. The first constraint indicates that every candidate’s representation is among the top-K nearest neighbours of $k_u$. Meanwhile, the second constraint requires the incorporation of the “ground truth”, i.e., the user’s consumed items (whose validity as candidates is self-evident) need to be included in the selected candidates.

As formulated by Eq. 2, OPCR looks for the optimal configuration of the synthesization model $G$ and the item representations $\theta$, which gives rise to the retrieval of the best candidates. However, because of the combinatorial nature of Eq. 2, it will probably be difficult to obtain the optimal solution. As a result, a few mild relaxations are introduced for its approximate solution.

Firstly, we let the KNN requirement be replaced by a threshold constraint on similarity:

$$\max_{G,\,\theta} \sum_{i \in C_u} F(\theta_u, \theta_i) \quad \text{s.t.} \quad \mathrm{sim}(k_u, \theta_i) \geq \epsilon\ \ \forall i \in C_u; \quad I_u^{+} \subseteq C_u, \tag{3}$$

where $\mathrm{sim}(\cdot, \cdot)$ indicates the similarity measurement, and $\epsilon$ stands for the similarity threshold which filters high-relevance candidates from the whole item set. Since $k_u$ and $\theta_i$ are required to be highly similar, we may adapt the objective function by approximation:

$$\max_{G,\,\theta} F(\theta_u, k_u) \quad \text{s.t.} \quad \mathrm{sim}(k_u, \theta_i) \geq \epsilon\ \ \forall i \in C_u; \quad I_u^{+} \subseteq C_u. \tag{4}$$

Intuitively, it requires the maximization of the user’s interest towards $k_u$; thus, the user’s interest in the candidates can be maximized as well, thanks to their representations’ high similarity with $k_u$. Moreover, the similarity constraint is relaxed from $C_u$ to $I_u^{+}$, and the above problem is transformed into its Lagrangian relaxation form:

$$\max_{G,\,\theta} F(\theta_u, k_u) + \lambda \sum_{i \in I_u^{+}} \big(\mathrm{sim}(k_u, \theta_i) - \epsilon\big), \quad \lambda \geq 0. \tag{5}$$

Finally, the problem is constraint-free and fully differentiable (given that a common similarity measurement, e.g., Cosine or Euclidean, is adopted). Therefore, it becomes solvable via gradient ascent.

Now we may come to the following high-level framework for OPCR’s solution.

Firstly, a certain type of deep recommendation model is selected and pre-trained based on user history, so that user interest can be accurately predicted;

Secondly, the synthesization model and the item representations are trained to maximize the objective function in Eq. 5 via gradient ascent. Here, both the deep recommendation model and the similarity function are employed as the “critics” of the synthesization model’s and the item representations’ performance, thereby providing the supervision signals for both parties’ iterative updates.

Finally, we obtain a near-optimal solution to the original optimization problem in Eq. 2, which realizes our proposed objectives in OPCR.
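As a toy illustration of the second step, the sketch below runs gradient ascent on an objective of the form "interest in the key plus a weighted similarity to a consumed item". It makes several simplifying assumptions that are not part of this work: the pre-trained interest model is replaced by a frozen logistic scorer, the retrieval key itself is optimized directly instead of the parameters of a synthesization model, the item representation is kept fixed, and gradients are taken numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lam, lr = 8, 0.5, 0.1

w = rng.normal(size=dim)          # frozen stand-in for the pre-trained interest model
theta_pos = rng.normal(size=dim)  # representation of one consumed ("ground truth") item

def interest(k):
    """Stand-in critic: predicted interest in a (virtual) item representation k."""
    return 1.0 / (1.0 + np.exp(-(w @ k)))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def objective(k):
    """Interest term plus lambda-weighted similarity term."""
    return interest(k) + lam * cosine(k, theta_pos)

def num_grad(f, x, eps=1e-5):
    """Central-difference gradient, used here only to keep the sketch dependency-free."""
    g = np.zeros_like(x)
    for d in range(x.size):
        step = np.zeros_like(x)
        step[d] = eps
        g[d] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

k = rng.normal(size=dim)          # initial retrieval key
start = objective(k)
for _ in range(200):
    k = k + lr * num_grad(objective, k)   # ascent step
end = objective(k)
```

In a real system the gradient would flow through the synthesization model and the item encoder via automatic differentiation; the numerical gradient here only shows that the relaxed objective can be climbed directly.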

3. General Infrastructure of Synthonet

Synthonet is an abstract candidate retrieval paradigm, whose concrete realization can be flexibly adapted to specific application scenarios. In this section, an overview is given of Synthonet’s general infrastructure, which is illustrated in Figure 3. Our discussion is partitioned into two parts: 1) Synthonet’s offline construction and 2) its online workflow.

3.1. Synthonet’s Construction

Synthonet is constructed in the offline stage. The construction is carried out in an Actor-Critic pipeline shown as Figure 3 (I).

3.1.1. Actor’s Role

The Actor generates the synthesized retrieval key and the item representation based on the user’s and item’s raw features. Particularly, the Actor includes three components: (1) the synthesization module, where the user’s recommendation request is encoded as the retrieval key; (2) the user encoder and (3) the item encoder, where latent vectors are generated for user and item as their representations. It is worth noting that the retrieval key and the user representation are generated for different purposes: the former is for candidate retrieval and thus needs to follow the same format as the item representation, while the latter is for the deep recommendation model, so its format can be chosen flexibly for the best performance.

3.1.2. Critic’s Role

The Critic is to provide supervision signals for the Actor’s generation process. Particularly, a compound reward is integrated from three sources: evaluator, validator, and referencer.

The evaluator is a pre-trained deep recommendation model, which determines the user’s interest towards an item given both parties’ representations. Once deployed in the Critic, it is used to calculate the user’s degree of interest towards the retrieval key (i.e., the representation of a virtual item). Such a reward corresponds to the user-interest term of the objective in Eq. 5.

The validator is a discriminative model, which tells whether the synthesized retrieval key is within the distributed scope of the real items’ representations. Such a reward helps to eliminate the adversarial cases, where the evaluator would falsely reward a meaningless synthesization. (Footnote 1: The evaluator will become ineffective and generate unreliable rewards once the retrieval key is out of its working domain, which is also the distributed scope of real items’ representations.)

The referencer compares the similarity between the retrieval key and the representation of a candidate item. Such a reward corresponds to the similarity term of the objective in Eq. 5, which encourages the retrieval key and its candidates to be located in the same neighborhood.

3.1.3. Construction Process

Given the above framework, Synthonet’s construction is carried out via the interaction between Actor and Critic. First of all, the evaluator and validator are learned so that the Critic can be deployed beforehand. Secondly, given the user’s historical behaviors, retrieval keys and user/item representations are generated by the Actor, which are then rewarded by the Critic. The Actor is then updated by gradient ascent w.r.t. its acquired reward. Finally, the above generation-reward-update process is conducted iteratively until convergence, where the compound reward is maximized.

3.2. Synthonet’s Workflow

The Actor is isolated from the well-constructed Synthonet and deployed for the candidate retrieval operation. Particularly, all the items are encoded as their representations and organized with a certain index structure in the offline stage. Once a recommendation request is issued by the online service, the retrieval key is synthesized and used for the ANN (approximate nearest neighbor) search over the pre-constructed index. Finally, the top-K similar items are identified and returned as the retrieved candidates.
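The online workflow can be sketched as follows. This is a simplified illustration, not the deployed system: exact brute-force Cosine search stands in for the ANN index, the item representations and the retrieval key are random placeholders for the Actor's outputs, and the function name `retrieve` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, dim, top_k = 500, 32, 5

# Offline stage: encode all items (random stand-ins here) and L2-normalise them once.
item_reprs = rng.normal(size=(n_items, dim))
item_index = item_reprs / np.linalg.norm(item_reprs, axis=1, keepdims=True)

def retrieve(retrieval_key, k=top_k):
    """Return the ids of the k items with the highest cosine similarity to the key."""
    q = retrieval_key / np.linalg.norm(retrieval_key)
    sims = item_index @ q
    top = np.argpartition(-sims, k)[:k]      # unordered top-k in O(n)
    return top[np.argsort(-sims[top])]       # order the k hits by decreasing similarity

# Online stage: a synthesized retrieval key (random placeholder) queries the index.
key = rng.normal(size=dim)
candidates = retrieve(key)
```

Swapping the brute-force scan for an ANN library changes only the body of `retrieve`; the offline/online split stays the same.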

4. Synthonet’s Instantiation

As introduced, Synthonet is a general paradigm for personalized candidate retrieval, whose formulation can be flexibly adapted for different recommendation scenarios. To better demonstrate how Synthonet works in practice, it is instantiated for the “text-rich” scenario, which is common in a wide variety of applications, such as online advertisement, e-commerce and news recommendation (e.g., (Zhai et al., 2016; Tay et al., 2018; Wu et al., 2019)). The settings for the discussed scenario are briefly introduced as follows.

Item. Each item is associated with its text feature, which is organized as a sequence of words (e.g., the title of an ad/news article).

User. Each user is associated with her historical behaviors, where each behavior stands for a specific item consumed by the user (e.g., all the ads/news articles clicked in history).

In the next part, concrete structures are designed so that Synthonet is instantiated for the above scenario. (Footnote 2: Although there can be various alternative structures for Synthonet, only one representative is discussed here for demonstration. However, more necessary alternatives are to be analyzed in the empirical studies.) To facilitate comprehension, the frequently used notations are summarized in Table 1.

Table 1. Frequently Used Notations.

4.1. Actor

4.1.1. Item Encoder

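The item encoder’s structure is shown in Figure 4. First of all, the item’s word sequence $\{w_1, \dots, w_n\}$ is transformed into the word embedding sequence by mapping each token to its embedding vector:

$$\{w_1, \dots, w_n\} \rightarrow \{\mathbf{e}_1, \dots, \mathbf{e}_n\}. \tag{6}$$

The word embedding sequence is further processed by a 1-D convolutional neural network (CNN) (Kim, 2014) so as to better extract its local information:

$$\{\mathbf{h}_1, \dots, \mathbf{h}_n\} = \mathrm{CNN}\big(\{\mathbf{e}_1, \dots, \mathbf{e}_n\}\big). \tag{7}$$

To highlight the meaningful information, the attention vector $\mathbf{a}$ is introduced, which attentively aggregates the whole sequence:

$$\alpha_t = \frac{\exp(\mathbf{a}^\top \mathbf{h}_t)}{\sum_{t'=1}^{n} \exp(\mathbf{a}^\top \mathbf{h}_{t'})}, \qquad \theta_i = \sum_{t=1}^{n} \alpha_t \mathbf{h}_t. \tag{8}$$

Finally, the aggregated vector $\theta_i$ is used as the item’s representation.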
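The three steps of the item encoder (embedding lookup, 1-D convolution, attentive aggregation) can be sketched in NumPy as below. The sizes, the window width of 3, the ReLU non-linearity, the zero padding and the dot-product attention scores are all assumptions made for illustration; the parameters are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, emb_dim, hid_dim, window = 100, 16, 8, 3

E = rng.normal(scale=0.1, size=(vocab, emb_dim))             # word embedding table
W = rng.normal(scale=0.1, size=(hid_dim, window * emb_dim))  # 1-D convolution filters
b = np.zeros(hid_dim)
a = rng.normal(scale=0.1, size=hid_dim)                      # attention vector

def encode_item(token_ids):
    """Map a sequence of token ids to a single item representation."""
    x = E[token_ids]                                  # (n, emb_dim) embedding lookup
    pad = np.zeros((window // 2, emb_dim))
    xp = np.vstack([pad, x, pad])                     # zero padding keeps the length n
    h = np.stack([
        np.maximum(0.0, W @ xp[t:t + window].ravel() + b)  # ReLU over each local window
        for t in range(len(token_ids))
    ])                                                # (n, hid_dim)
    scores = h @ a
    alpha = np.exp(scores - scores.max())             # softmax attention weights
    alpha /= alpha.sum()
    return alpha @ h                                  # weighted sum, shape (hid_dim,)

theta_i = encode_item(np.array([3, 17, 42, 8]))
```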

4.1.2. User Encoder

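The structure of the user encoder is shown in Figure 5. Particularly, the user representation ($\theta_u$) is generated by attentively aggregating the representations of the user’s consumed items $\{\theta_{i_1}, \dots, \theta_{i_m}\}$. Here, multi-head attentive pooling (Lin et al., 2017) is employed for the aggregation. First of all, a total of K pooling heads are employed, each of which is used to generate a unique aggregated vector via attention with its own attention vector $\mathbf{a}_k$:

$$\alpha^{(k)}_j = \frac{\exp(\mathbf{a}_k^\top \theta_{i_j})}{\sum_{j'=1}^{m} \exp(\mathbf{a}_k^\top \theta_{i_{j'}})}, \qquad \mathbf{v}_k = \sum_{j=1}^{m} \alpha^{(k)}_j \theta_{i_j}, \quad k = 1, \dots, K. \tag{9}$$

As a result, a total of K aggregated vectors are obtained. All these vectors are concatenated along the column and multiplied by a $d \times Kd$ ($d$ is the dimension of $\theta_i$) mapping matrix $\mathbf{W}$, so that the original representation dimension is kept:

$$\theta_u = \mathbf{W}\,[\mathbf{v}_1; \dots; \mathbf{v}_K]. \tag{10}$$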


4.1.3. Synthesizer

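The retrieval key ($k_u$) is synthesized via two consecutive steps: firstly, the user representation $\theta_u$ is generated with the user encoder; secondly, an M-layer feed-forward network (FFN) is employed to translate the user representation into the retrieval key:

$$\mathbf{h}^{(0)} = \theta_u, \qquad \mathbf{h}^{(m)} = \phi\big(\mathbf{W}_m \mathbf{h}^{(m-1)} + \mathbf{b}_m\big),\ \ m = 1, \dots, M-1, \qquad k_u = \mathbf{W}_M \mathbf{h}^{(M-1)} + \mathbf{b}_M, \tag{11}$$

where $\mathbf{W}_m$ and $\mathbf{b}_m$ stand for the mapping matrix and bias of the $m$-th perception layer, and $\phi$ is the activation function. Notice that the activation function is removed for the last layer so that the retrieval key can be an arbitrary real-valued vector (otherwise it would be confined to a certain range and probably unable to approach its candidate’s embedding).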
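The user encoder of Section 4.1.2 and the synthesizer of Section 4.1.3 can be sketched together as below. The sizes, the tanh activation and the dot-product attention scores are assumptions for illustration, and the parameters are random rather than learned; the only structural points taken from the text are the K pooling heads, the mapping back to the original dimension, and the activation-free last layer.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_heads, n_layers = 8, 4, 3

A = rng.normal(scale=0.1, size=(n_heads, d))         # one attention vector per head
W_o = rng.normal(scale=0.1, size=(d, n_heads * d))   # maps the concatenation back to d dims
layers = [(rng.normal(scale=0.1, size=(d, d)), np.zeros(d)) for _ in range(n_layers)]

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def encode_user(item_reprs):
    """Multi-head attentive pooling over the representations of consumed items."""
    alpha = softmax(A @ item_reprs.T, axis=1)        # (n_heads, n_items), rows sum to 1
    heads = alpha @ item_reprs                       # (n_heads, d), one vector per head
    return W_o @ heads.ravel()                       # concatenate, then project to (d,)

def synthesize(theta_u):
    """Feed-forward network whose last layer has no activation."""
    h = theta_u
    for m, (W, b) in enumerate(layers):
        h = W @ h + b
        if m < len(layers) - 1:
            h = np.tanh(h)                           # assumed activation for hidden layers
    return h

history = rng.normal(size=(6, d))   # stand-in for six consumed items' representations
theta_u = encode_user(history)
k_u = synthesize(theta_u)
```

Because the last layer is linear, `k_u` is not confined to the range of the activation, which is the property the text asks for.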

4.2. Critic

4.2.1. Evaluator

A bi-channel deep recommendation model is pre-trained for evaluation. Particularly, the item representation $\theta_i$ and the user representation $\theta_u$ are delivered to two different multi-layer feed-forward networks, $\mathrm{FFN}_i$ and $\mathrm{FFN}_u$, and the user’s interest towards the item is measured with the weighted summation of both outputs’ element-wise product:

$$F(\theta_u, \theta_i) = \sigma\Big(\mathbf{w}^\top \big(\mathrm{FFN}_u(\theta_u) \odot \mathrm{FFN}_i(\theta_i)\big)\Big). \tag{12}$$

Here, $\sigma$ and $\odot$ indicate the sigmoid function and element-wise product, and $\mathbf{w}$ is the learnable weighting vector. The well-trained recommendation model is deployed as the evaluator. It treats the retrieval key $k_u$ as a virtual item’s representation, and calculates the evaluation reward as:

$$R_{eval} = F(\theta_u, k_u). \tag{13}$$

Obviously, by maximizing the value of $R_{eval}$, $k_u$ will get aligned with user interest as much as possible.
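A minimal sketch of the bi-channel evaluator follows. The layer sizes, the ReLU activation and the random parameters are assumptions for illustration (a real evaluator would be pre-trained); what the sketch takes from the text is the two separate feed-forward channels, the element-wise product, the weighting vector and the sigmoid.

```python
import numpy as np

rng = np.random.default_rng(4)
d, h = 8, 16

def make_ffn(sizes):
    """Random (weight, bias) pairs for a feed-forward network with the given layer sizes."""
    return [(rng.normal(scale=0.3, size=(o, i)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def ffn(params, x):
    for W, b in params:
        x = np.maximum(0.0, W @ x + b)   # assumed ReLU activation
    return x

ffn_user, ffn_item = make_ffn([d, h, h]), make_ffn([d, h, h])
w = rng.normal(scale=0.3, size=h)        # weighting vector

def evaluator(theta_u, theta_i):
    """Sigmoid of the weighted sum of the two channels' element-wise product."""
    z = ffn(ffn_user, theta_u) * ffn(ffn_item, theta_i)
    return 1.0 / (1.0 + np.exp(-(w @ z)))

theta_u, theta_i, k_u = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
score_real = evaluator(theta_u, theta_i)   # interest in a real item
reward_eval = evaluator(theta_u, k_u)      # evaluation reward: the key scored as a virtual item
```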

Figure 4. Item encoder’s structure.
Figure 5. User encoder’s structure.

4.2.2. Validator

Inspired by the idea of the generative adversarial network, a binary classifier $D$ is trained for validation, which determines whether a vector comes from the real items’ representations (with label 1) or from the synthesized retrieval keys (with label 0). Once deployed, the validator takes a retrieval key $k_u$ and calculates its log-likelihood of being positive:

$$R_{val} = \log D(k_u). \tag{14}$$

Apparently, $k_u$ will be within the valid scope of real items’ representations when $R_{val}$ is close to zero, thereby resisting false rewards from the evaluator.

One more special thing about the validator is that it needs to be iteratively adapted along with the training progress of the Actor, as the retrieval key’s distribution changes over time. Particularly, every time one round of training is completed for the Actor, the validator is refined with the retrieval keys and item representations produced by the up-to-date Actor.

4.2.3. Referencer

The referencer is a parameter-free function, which measures the similarity (e.g., Cosine) between the retrieval key and the representation of its high-relevance candidate. The range of similarity is mapped to (0, 1) so as to keep it consistent in scale with the other rewards:

$$R_{ref} = \mathrm{norm}\big(\mathrm{sim}(k_u, \theta_{i^+})\big), \tag{15}$$

where $\mathrm{norm}(\cdot)$ denotes the mapping of the similarity’s range to (0, 1). Here $i^+$ indicates an item consumed by the user in reality, which thus qualifies as a high-relevance candidate. By maximizing the value of $R_{ref}$, the retrieval key and the candidate’s representation will get as close to each other as possible w.r.t. the chosen similarity.
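The validator and referencer rewards can be sketched as below. Two points are assumptions rather than details from this work: the validator is reduced to a logistic-regression classifier with hypothetical parameters `v` and `c`, and the Cosine similarity is rescaled with the affine map (1 + cos) / 2, which is only one possible way to bring the similarity into the unit range.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8
v, c = rng.normal(size=d), 0.0   # hypothetical logistic-regression validator parameters

def validator_reward(k_u):
    """Log-likelihood that the key is classified as a real item representation."""
    logit = v @ k_u + c
    return -np.logaddexp(0.0, -logit)    # log(sigmoid(logit)), numerically stable

def referencer_reward(k_u, theta_pos):
    """Cosine similarity to a consumed item, rescaled from [-1, 1] to [0, 1]."""
    cos = k_u @ theta_pos / (np.linalg.norm(k_u) * np.linalg.norm(theta_pos))
    return 0.5 * (1.0 + cos)

k_u = rng.normal(size=d)          # stand-in retrieval key
theta_pos = rng.normal(size=d)    # stand-in representation of a consumed item
r_val = validator_reward(k_u)
r_ref = referencer_reward(k_u, theta_pos)
```

The validator reward is never positive and approaches zero as the classifier becomes confident that the key looks like a real item, matching the description of the validator above.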

Algorithm 1. Synthonet’s Training Process
(1) Pre-training:
while not converged do
       for each user do
             for each item consumed by the user do
                   encode the item into its representation as in Eq. 7 and 8;
                   encode the user into her representation as in Eq. 9 and 10;
                   encode the negative samples into their representations;
                   calculate the overall binary cross-entropy loss based on Eq. 12;
                   update the recommendation model (evaluator, user encoder, item encoder) w.r.t. the loss;
(2) Validator’s Initialization:
while not converged do
       for each item do
             encode the item into its representation as in Eq. 7 and 8;
             synthesize a retrieval key as in Eq. 11 based on a piece of randomly sampled user history;
             calculate the overall binary cross-entropy loss based on Eq. 14;
             update the validator w.r.t. the loss;
(3) Actor’s Training:
while not converged do
       for each user do
             for each item consumed by the user do
                   generate the retrieval key as in Eq. 11;
                   calculate the compound reward from the evaluator, validator and referencer as in Eq. 13, 14 and 15;
                   update the Actor w.r.t. the compound reward;
       refine the validator as in step (2);

4.3. Training of Synthonet

Putting together every component of Actor and Critic, Synthonet’s training process is summarized as Algorithm 1.

Firstly, the pre-training step is carried out, where our deep recommendation model is learned to capture user interest. Notice that the item encoder, user encoder and evaluator all participate in the pre-training process; therefore, they are all learned in this step.

Secondly, the validator is initialized, which learns to distinguish the real items’ representations generated by the item encoder from the retrieval keys generated by the initial synthesizer.

Thirdly, the Actor is trained: given each of the user’s consumed items, the retrieval key is generated based on her history before the consumption; then the compound reward is produced by the Critic so that the Actor can be updated w.r.t. its gradient. Since the updates of the synthesizer and the item encoder change the distribution of retrieval keys and item representations, the original validator will gradually expire; therefore, it is adapted iteratively along with the Actor’s updates.

Table 2. Dataset Statistics: total number of users and items, average number of item per user, and average number of word per item.
Table 3. Experiment Result On News Dataset (top values marked in bold).

5. Experimental Studies

5.1. Experiment Settings

Experimental studies are carried out for the evaluation of retrieval quality. All the methods in comparison follow the same general retrieval pipeline: the retrieval key and item representations are first generated based on both parties’ inherent features; then the items are sorted according to their representations’ similarities with the retrieval key; finally, the items with the top-K similarities are selected as the candidates. The detailed configuration of the experiments is introduced as follows.

5.1.1. Baselines.

Two groups of baselines are compared in our experiments. On the one hand, the metric-learning based methods are taken into comparison, which learn to represent user and item in the common latent space. Particularly, two representative approaches are adopted: one is based on Deep Structured Semantic Model (DSSM) (Huang et al., 2013), the other one is adapted from Collaborative Metric Learning (CML) (Hsieh et al., 2017).
In DSSM, user and item representations are generated by encoding their raw features via two independent networks: the user encoder and item encoder.
In CML, the user is represented by the embedding vector associated with her ID, while the item is still represented with the item encoder as in DSSM. (Footnote 3: An adaptation is made here, as item embedding is used by the original CML. However, the item embedding incurs huge information loss in our experiment, which severely limits its performance.)
For the sake of fair comparison, user/item encoders in DSSM/CML will use the same structures as our proposed methods, which are illustrated in Section 4.

On the other hand, it is still popular in practice to derive candidate relevance directly from raw features (e.g., based on keyword similarity). As a result, we also consider learning-free methods, where user/item representations are acquired beforehand instead of being specifically learned for candidate retrieval. Two representative approaches are adopted: one is based on average word embedding (AVG), the other uses the sentence-level embedding from BERT (SLE).
In AVG, an item is represented as the average vector of its words’ embeddings (GloVe-300 is used in our experiment), and a user is represented as the average of her items’ embeddings. Despite its simplicity, such a method is a common and effective baseline in many NLP tasks (Socher et al., 2013a; Socher et al., 2013b; Le and Mikolov, 2014).
In SLE, an item is represented via the sentence-level embedding of BERT (Devlin et al., 2018), i.e., the embedding of the [CLS] token is used as the item’s embedding. Meanwhile, a user is still represented as the average of her items’ embeddings, as in AVG.
Notice that the context feature in recommendation request is not directly comparable with item’s text feature, thus they are ignored in AVG and SLE, where user embedding is used as the retrieval key.

Table 4. Experiment Result On E-Commerce Dataset (top values marked in bold).

5.1.2. Variations of Our Approach

As introduced, Synthonet can be instantiated in different ways. Therefore, alternative implementations are systematically tested so as to verify its generalizability.
Variational Similarity Measurement. Cosine similarity (Eq. 15) is chosen as the default similarity measurement; in addition, Euclidean similarity is also included in the experiments.
Variational Form of Evaluator. The bi-channel feed-forward network (BI) introduced in Section 4.2.1 is used as the default evaluator; meanwhile, the fully-connected feed-forward network (FC) is also taken into account, where user and item representations are concatenated along the column and processed by a 2-layer feed-forward network for the final logit.
Variational Form of User Encoder. In addition to our default user encoder introduced in Section 4.1.2, the way of user representation in CML is also considered, where users are represented by the embedding vectors associated with their IDs.
Variational Input. Apart from user history, there can be other available information when a recommendation is to be made, such as context and the user’s intent. The auxiliary information is encoded in parallel with the user history, and the encoded vectors of both parts are concatenated along the column to form the final user representation.
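To make the choice of similarity measurement concrete, the sketch below ranks items against a retrieval key under both measures. It is a minimal illustration, not the paper's implementation: the negative Euclidean distance is assumed here as the "Euclidean similarity", and the toy vectors and the `top_k` helper are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def euclidean_similarity(u, v):
    """Assumed form: negative Euclidean distance, so larger means more similar."""
    return -math.dist(u, v)

def top_k(key, items, k, similarity):
    """Return the ids of the k items most similar to the retrieval key."""
    ranked = sorted(items, key=lambda item_id: similarity(key, items[item_id]), reverse=True)
    return ranked[:k]

key = [1.0, 0.0]
items = {"x": [10.0, 0.0], "y": [1.0, 0.2], "z": [0.0, 1.0]}
by_cosine = top_k(key, items, 1, cosine_similarity)
by_euclid = top_k(key, items, 1, euclidean_similarity)
```

In this toy case the two measures select different items ("x" has the same direction as the key, "y" is closest in distance), which is why robustness to the similarity measurement is worth testing separately.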

5.1.3. Metrics.

The following three evaluation measurements are considered in our experiments.
Recall Performance. The retrieval quality is directly reflected by the recall rate, as the ultimate goal of retrieval is to obtain all the quality candidates rather than to give the final recommendation list. The recall rate is measured with Recall@K, where K is the size of the retrieval set.
Ranking Performance. To learn more about the retrieval precision of different methods, the ranking performance is further compared, which is measured with MRR and NDCG@K.
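For reference, the metrics named above can be computed as in the following sketch. It uses the common textbook definitions with binary relevance for a single request; the paper does not spell out its exact formulas, so the function names and the normalization choices here are assumptions. MRR is then the mean of `reciprocal_rank` over all requests.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item (0.0 if none is retrieved)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the list divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["a", "b", "c", "d"]   # retrieved items, best first
relevant = {"b", "d"}           # held-out items the user actually consumed
recall = recall_at_k(ranked, relevant, k=2)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant, k=4)
```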

5.1.4. Datasets.

Two real-world datasets are used in our experiments. One is the industrial dataset from MSN News, which records users’ news browsing behaviors. In this dataset, each user is associated with her browsed news articles in history, and each article is associated with its title; additionally, the user’s intent is explicitly specified for each of her browsing behaviors, i.e., the type of news (e.g., political or financial news) she is looking for. In our experiment, user history is used by default, and the user’s intent is used when evaluating the effect of auxiliary information. The other is a public dataset of Amazon reviews on Movies and TV (referred to as E-Commerce), which records users’ online shopping behaviors: each user is associated with her purchased items in history, and each item is associated with its title and description. Detailed statistics about both datasets are given in Table 2.

Table 5. Experiment Result Using Auxiliary Input (top values marked in bold).
Table 6. Experiment Result Using Different User Encoder; Default: with default user encoder, Alter: with user encoder from CML (top values marked in bold).

5.2. Experiment Analysis

5.2.1. Findings From Main Results

Main experimental results on both datasets are shown in Tables 3 and 4. As can be observed, the variations of Synthonet, SYN (B) and SYN (F), consistently outperform all the baselines in terms of recall performance, indicating that better candidates can be retrieved with our proposed method. It may also be inferred that Synthonet’s superiority in recall performance results from its higher capability of identifying quality candidates, as it also consistently achieves the top ranking scores. Besides, it can be observed that all the learned representations (with SYN (*), DSSM, CML) outperform the learning-free methods significantly. The explanation of this phenomenon is twofold. For one thing, learning-free approaches like AVG are too simple to exploit the raw features of users and items effectively, and are thus unable to identify candidates’ relevance at fine granularity. For another, although some other learning-free methods, like SLE, are sophisticated enough to fully utilize raw features, their generated user/item representations may not be relevant in terms of similarity, and are thus unsuitable for the candidate retrieval task.

Apart from the above observations, some other interesting phenomena can also be derived from the main results.
Effect of Similarity Measurement. For both similarity measurements (COS and EUC), SYN (*) consistently gives rise to the highest recall/ranking performances in all the testing cases; in addition, Synthonet’s fluctuations across different similarity measurements are comparatively smaller than those of the baselines. These phenomena indicate that Synthonet is robust to the change of similarity measurement. One probable explanation is that Synthonet takes advantage of multiple supervision signals in its training process, thus making it less sensitive to each individual one. A more detailed analysis is made in Section 5.2.3.
Effect of Evaluator’s Form. For both forms of evaluator (BI and FC), consistent improvements are achieved over the baselines under the same setting, indicating that both forms of evaluator contribute substantially to Synthonet’s performance. However, different forms of evaluator may produce distinct results, which suggests that Synthonet’s performance can be optimized by selecting a more effective evaluator.

Table 7. Experiment Result For Component Analysis (top values marked in bold).

5.2.2. Additional Studies

A series of additional studies are carried out to complement the main results. Because the observations are redundant across datasets, the analysis is only carried out for the results on the News dataset.
Synthonet with Auxiliary Input. As introduced before, the user’s intent is available for the News dataset, and it is adopted as our auxiliary input. The intent is essentially a categorical variable specifying the type of news the user is looking for; therefore, it is represented by the embedding vector associated with its ID. According to the experiment results in Table 5, SYN (*) still gives rise to the best recall/ranking performances in contrast to the baselines (the learning-free methods are omitted because they cannot use auxiliary input). Besides, recall/ranking performances are remarkably improved (compared with those in Table 3) thanks to the presence of additional information. This validates that Synthonet is able to effectively exploit auxiliary input for better retrieval performance.
Synthonet with Different User Encoder. Performances with different user encoders are shown in Table 6. As can be observed, SYN with our default user encoder consistently outperforms SYN with CML’s user encoder. Together with our conclusion in Section 5.2.1, we may draw the following conclusion: as a general candidate retrieval paradigm, Synthonet consistently outperforms the metric-learning baselines under the same settings (similarity measurement, user/item encoders); meanwhile, the performance of Synthonet itself can be further enhanced by selecting more appropriate configurations, such as the forms of evaluator and user/item encoders.
Alignment with User Interest. In addition to our metrics on recall/ranking performance, we would also like to know the user’s degree of interest towards the retrieved candidates, which can be measured as the negative log-likelihood of the top-K retrieved candidates (denoted as NLL@K). The likelihood is computed from the user’s probability of being interested in each retrieved item. A smaller NLL@K indicates a larger degree of interest. Since there is no way to acquire the user’s exact interest, the well-trained evaluator used by SYN (B) is employed for approximation.
According to the results in Table 8, the degree of interest is almost aligned with the recall/ranking performances reported in Table 3, even though the employed user model is not fully accurate. This indicates that the retrieved candidates from Synthonet better meet the user’s interest.
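A possible way to compute NLL@K is sketched below. The exact formula is not recoverable from the text here, so this assumes the plain sum of negative log-probabilities over the top-K candidates, with no averaging; the function name `nll_at_k`, the clamping constant `eps`, and the example probabilities are all hypothetical. In the paper's setting the probabilities would come from the trained evaluator.

```python
import math

def nll_at_k(interest_probs, k, eps=1e-12):
    """Assumed form of NLL@K: sum of -log p over the top-k retrieved items.

    interest_probs: estimated interest probabilities of the retrieved items,
    ordered by retrieval rank. Probabilities are clamped at eps to avoid log(0).
    """
    return -sum(math.log(max(p, eps)) for p in interest_probs[:k])

# Hypothetical evaluator outputs for the top-3 candidates of two retrievers.
well_aligned = nll_at_k([0.9, 0.8, 0.7], k=3)
poorly_aligned = nll_at_k([0.5, 0.4, 0.3], k=3)
```

Under this form, the retriever whose candidates receive higher interest probabilities obtains the smaller NLL@K, which matches the reading that smaller values indicate a larger degree of interest.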

5.2.3. Component Analysis for Critic

Experiments are conducted to evaluate each individual component’s effect in Critic, where EVA, VAL, REF indicate the presence of the evaluator, validator and references in Critic (Footnote 7: The combination of Val-Ref is not considered, as the validator needs to work along with the evaluator.), while Compound stands for the inclusion of all these components. Because of duplicated observations, results are only reported for SYN with evaluator BI.

As demonstrated in Table 7, top performances are achieved by Compound in most testing cases. Meanwhile, improvements can be observed when the evaluator works jointly with either the validator or the referencer, which indicates that both components contribute substantially to Synthonet’s performance. Moreover, Synthonet’s performance is already no lower than the best baseline (DSSM) with the evaluator alone; and in some cases, top retrieval results can be obtained with only the evaluator and validator. Both phenomena suggest that maximizing the user’s interest towards the retrieval key is crucial for the retrieval quality. In fact, this point is also consistent with our problem formulation in Eq. 5.

5.2.4. Summarization

Major findings of the experimental studies are summarized with the following points.

  • Consistent and remarkable improvements are achieved by Synthonet in terms of recall/ranking performances, thereby validating its effectiveness in retrieving quality candidates.
  • As a general candidate retrieval paradigm, Synthonet can be tuned flexibly for optimal performance by selecting the most suitable configuration for each specific scenario.
  • All the components in Critic contribute substantially to Synthonet’s performance, and jointly give rise to its superior candidate retrieval quality.

6. Related Work

In this section, related studies are reviewed from two perspectives: deep recommendation algorithms and candidate retrieval.

6.1. Deep Recommendation Algorithms

Leveraging the recent progress of deep learning, today’s recommendation algorithms have become increasingly proficient in capturing the user’s underlying interest. Roughly speaking, deep learning techniques contribute to the development of recommendation algorithms in two ways. For one thing, thanks to deep neural networks’ superior capability in function approximation, complex user-item relationships, e.g., high-order feature interactions (Cheng et al., 2016; Guo et al., 2018; Lian et al., 2018; Zhou et al., 2018; Wang et al., 2017a) and temporal behavioral patterns (Zhu et al., 2017; Zhou et al., [n. d.]), can be effectively learned from the user’s behavioral data. For another, the employment of deep neural networks facilitates the effective exploitation of diverse data, such as textual (Tay et al., 2018; Wu et al., 2019), visual (Wang et al., 2017b; Joonseok et al., 2018; Xu et al., 2018), and relational (Wang et al., 2018a; Chang et al., 2015; Ying et al., 2018) information, as well as common-sense knowledge from KB (Zhang et al., 2016; Wang et al., 2018b). It is worth noting that most of these advanced algorithms mainly emphasize ranking efficacy, yet contribute little to candidate retrieval due to their inherent limitations in temporal scalability.

Table 8. Degree of User Interest (top values marked in bold).

6.2. Candidate Retrieval

6.2.1. Conventional Way of Candidate Retrieval.

As discussed, candidates have to be selected in realtime from a tremendous pool of items. Therefore, mainstream candidate retrieval approaches take advantage of structured data so as to achieve feasible running efficiency. Among the early approaches, one of the most well-known is based on the inverted index (Büttcher et al., 2016), which is still widely applied in practice: items are indexed w.r.t. a certain type of raw feature (e.g., keywords), and an item is retrieved as a candidate for a user if the two share a feature value. Later on, improved methods were successively proposed, where multiple raw features can be jointly utilized for candidate retrieval, and personalized feature weights are learned for more precise retrieval (Borisyuk et al., 2016; Anagnostopoulos et al., 2008; Asadi and Lin, 2013). In a more generalized setting, the relevance between user and item is derived from their feature similarity. As a result, candidate retrieval can be conducted with even higher flexibility; meanwhile, thanks to superior index structures like LSH and KNN graph (Chen et al., 2018; Wang and Li, 2012; Sugawara et al., 2016), the retrieval operation can be efficiently conducted with O(1) time complexity.
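The inverted-index style of retrieval described above can be illustrated with a short sketch. It is a minimal version under simple assumptions: a single keyword feature, set-based postings, and hypothetical item ids and function names; production systems add weighting, compression and ranking on top of this.

```python
from collections import defaultdict

def build_inverted_index(items):
    """Map each keyword to the set of item ids that contain it."""
    index = defaultdict(set)
    for item_id, keywords in items.items():
        for keyword in keywords:
            index[keyword].add(item_id)
    return index

def retrieve(index, user_keywords):
    """Return every item sharing at least one keyword with the user."""
    candidates = set()
    for keyword in user_keywords:
        candidates |= index.get(keyword, set())
    return candidates

# Hypothetical news items described by title keywords.
items = {
    "n1": {"election", "senate"},
    "n2": {"stock", "market"},
    "n3": {"market", "election"},
}
index = build_inverted_index(items)
candidates = retrieve(index, {"election"})
```

Each lookup touches only the postings of the user's own keywords, which is what makes this approach fast, but relevance is reduced to exact feature overlap.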

6.2.2. Metric Learning.

Metric learning is a general machine learning paradigm (Lebanon, 2006; Weinberger and Saul, 2009), which is developed to represent entities such that those from the same class are close to each other in the representation space. It is therefore a natural choice for candidate retrieval, as user and item can be represented in a common space and their relevance can be measured by the similarity of their representations (Schroff et al., 2015; Hsieh et al., 2017; Joonseok et al., 2018). In contrast to the conventional methods, metric learning is able to identify quality candidates more accurately, as representations are carefully learned from user-item interactions, and raw features can be better exploited on top of more advanced structures.

In fact, both Synthonet and metric learning represent the user as a retrieval key for candidate retrieval. However, there is a fundamental distinction in how the retrieval key is generated. Metric learning merely cares about the overall similarity between the retrieval key and the user’s consumed items in history, whereas the user’s interest towards the retrieval key itself is not taken into account. As a result, the retrieval key may stray away from the user’s region of interest, which will introduce inaccurate candidates and impair the retrieval quality. Synthonet, on the other hand, makes the user’s interest towards the retrieval key a priority, which is optimized simultaneously with the similarity part. Therefore, more accurate candidate retrieval can be delivered. A toy example is presented here for better illustration.

Figure 6. Toy example for the comparison of learning to generate retrieval key based on (I) metric learning and (II) Synthonet, respectively (better viewed in colour).
Example 6.1.

Suppose that the representations of a user’s interested items are confined within the blue region S in Figure 6. Meanwhile, vertex a indicates the representation of the user’s consumed item in history. In metric learning, similarity (or distance) is the only factor to be considered, so the whole neighborhood of a (i.e., the circle centered at a) is treated as the potential region for the user representation. Consequently, the user’s representation could be falsely mapped to vertex b, which lies outside the user’s region of interest despite its similarity with a, and all the items within b’s neighbourhood (i.e., the circle centered at b) will be retrieved as candidates. In this case, only a limited part of the user’s interested area can be covered (i.e., s1), and many of the relevant candidates could be left out of the retrieval result. On the other hand, by taking user interest into account, Synthonet will identify vertex c as a much more appropriate user representation, as it is not only similar to a but also accurately aligned with user interest. Therefore, items within the circle centered at c will be selected as the retrieval result, where many more of the user’s interested items (i.e., those within s2) can be obtained.
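The intuition of the toy example can be reproduced numerically. In the sketch below all coordinates, the radius, and the helper `covered` are hypothetical and are not read off Figure 6; they are chosen only so that keys b and c lie at a similar distance from the consumed item a, while only c lies inside the region of interest.

```python
import math

def covered(key, points, radius):
    """Count the points that fall within the given radius of the retrieval key."""
    return sum(1 for p in points if math.dist(key, p) <= radius)

a = (0.0, 0.0)  # consumed item in history
# Items the user is interested in (stand-in for region S), including a itself.
interested = [a, (1.0, 0.0), (1.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
# Items outside the user's interest.
others = [(-1.0, 1.0), (-2.0, 0.0), (-1.0, -1.0)]

b = (-1.0, 0.0)  # close to a, but outside the region of interest
c = (1.0, 0.5)   # about as close to a, and inside the region of interest
radius = 1.2

covered_b = covered(b, interested, radius)  # relevant items reached from b
covered_c = covered(c, interested, radius)  # relevant items reached from c
noise_b = covered(b, others, radius)        # irrelevant items reached from b
noise_c = covered(c, others, radius)        # irrelevant items reached from c
```

With these numbers, key b reaches one interested item and three irrelevant ones, whereas key c reaches all five interested items and no irrelevant one, even though both keys are roughly equally similar to a.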

7. Conclusion

In this work, a novel paradigm Synthonet is proposed for personalized candidate retrieval, where the virtual item representation (i.e., retrieval key) is synthesized optimally for the efficient acquisition of high-quality candidates. With the developed Actor-Critic infrastructure, user’s underlying interest becomes accurately aligned with the similarity between retrieval key and item representation. Therefore, high-quality candidates can be effectively identified and efficiently retrieved via similarity search. Extensive empirical studies are carried out with real-world datasets, whose results validate the effectiveness of our proposed method.


  • Anagnostopoulos et al. (2008) Aris Anagnostopoulos, Andrei Broder, and Kunal Punera. 2008. Effective and efficient classification on a search-engine model. In Knowledge and information systems. Springer.
  • Asadi and Lin (2013) Nima Asadi and Jimmy Lin. 2013. Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In SIGIR. ACM, 997–1000.
  • Bhatnagar et al. (2009) Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee. 2009. Natural actor-critic algorithms. In Automatica.
  • Borisyuk et al. (2016) Fedor Borisyuk, Krishnaram Kenthapadi, David Stein, and Bo Zhao. 2016. CaSMoS: A framework for learning candidate selection models over structured queries and documents. In KDD.
  • Büttcher et al. (2016) Stefan Büttcher, Charles LA Clarke, and Gordon V Cormack. 2016. Information retrieval: Implementing and evaluating search engines. MIT Press.
  • Carlini and Wagner (2017) Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy.
  • Chang et al. (2015) Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. 2015. Heterogeneous Network Embedding via Deep Architectures. In KDD.
  • Chen et al. (2018) Qi Chen, Haidong Wang, Mingqin Li, Gang Ren, Scarlett Li, Jeffery Zhu, Jason Li, Chuanjie Liu, Lintao Zhang, and Jingdong Wang. 2018. SPTAG: A library for fast approximate nearest neighbor search.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems.
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In RecSys.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. In arXiv preprint arXiv:1810.04805.
  • Goodfellow et al. ([n. d.]) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. [n. d.]. Explaining and harnessing adversarial examples. In arXiv preprint arXiv:1412.6572.
  • Guo et al. (2018) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He, and Zhenhua Dong. 2018. DeepFM: An End-to-End Wide & Deep Learning Framework for CTR Prediction. In IJCAI.
  • Hsieh et al. (2017) Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin. 2017. Collaborative metric learning. In WWW. 193–201.
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In CIKM.
  • Joonseok et al. (2018) Lee Joonseok, Abu-El-Haija Sami, Varadarajan Balakrishnan, and Natsev Apostol. 2018. Collaborative Deep Metric Learning for Video Understanding. In KDD.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • Konda and Tsitsiklis (2000) Vijay Konda and John Tsitsiklis. 2000. Actor-Critic Algorithms. In NIPS.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML.
  • Lebanon (2006) Guy Lebanon. 2006. Metric learning for text documents. In IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Lian et al. (2018) Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In KDD.
  • Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In arXiv preprint arXiv:1703.03130.
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In ICLR.
  • Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering.
  • Socher et al. (2013a) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013a. Reasoning with neural tensor networks for knowledge base completion.

  • Socher et al. (2013b) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
  • Sugawara et al. (2016) Kohei Sugawara, Hayato Kobayashi, and Masajiro Iwasaki. 2016. On approximately searching for similar word embeddings. In ACL.
  • Tay et al. (2018) Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018. Multi-Pointer Co-Attention Networks for Recommendation. In KDD.
  • Wang et al. (2018b) Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018b. DKN: Deep knowledge-aware network for news recommendation. In WWW.
  • Wang et al. (2018a) Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. 2018a. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba. In KDD.
  • Wang and Li (2012) Jingdong Wang and Shipeng Li. 2012. Query-driven iterated neighborhood graph search for large scale indexing. In ACM Multimedia 2012. 179–188.
  • Wang et al. (2017a) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017a. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17. ACM, 12.
  • Wang et al. (2017b) Suhang Wang, Yilin Wang, Jiliang Tang, Kai Shu, Suhas Ranganath, and Huan Liu. 2017b. What Your Images Reveal: Exploiting Visual Contents for Point-of-Interest Recommendation. In WWW.
  • Weinberger and Saul (2009) Kilian Q Weinberger and Lawrence K Saul. 2009. Distance metric learning for large margin nearest neighbor classification. In Journal of Machine Learning Research.
  • Wu et al. (2019) Chuhan Wu, Fangzhao Wu, Junxin Liu, Shaojian He, Yongfeng Huang, and Xing Xie. 2019. Neural Demographic Prediction using Search Query.

  • Xu et al. (2018) Xiaoran Xu, Laming Chen, Songpeng Zu, and Hanning Zhou. 2018. Hulu video recommendation: from relevance to reasoning. In RecSys.
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems.

  • Zhai et al. (2016) Shuangfei Zhai, Keng-hao Chang, Ruofei Zhang, and Zhongfei Mark Zhang. 2016. Deepintent: Learning attentions for online advertising with recurrent neural networks.

  • Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative knowledge base embedding for recommender systems. In KDD.
  • Zhou et al. ([n. d.]) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. [n. d.]. Deep Interest Evolution Network for Click-Through Rate Prediction. In arXiv preprint arXiv:1809.03672.
  • Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In KDD.
  • Zhu et al. (2017) Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2017. What to do next: Modeling user behaviors by time-lstm. In IJCAI.