1. Introduction
Recommender Systems (RS) have attracted much attention with the booming of information on the Internet. Typical RS algorithms include collaborative methods (Sarwar et al., 2001; Koren et al., 2009; He et al., 2016), content-based methods, and hybrids of the two (Adomavicius and Tuzhilin, 2015). Among those studies, several topics have drawn particular attention in recent years: Context-Aware Recommendation (Adomavicius and Tuzhilin, 2015) seeks to model more precisely the scenes in which users are fed content; Time-Aware RS (Campos et al., 2014) considers both the drift of user interest and the changing value of news content; and Diversified Ranking (Wilhelm et al., 2018) focuses on addressing intra-list correlations.
We focus on the last issue, which concerns the arrangement of items within one exposure. In many realistic RS it is important to consider the correlations between the items in order to maximize the overall utility of the recommended list. While there are rich studies (Sarwar et al., 2001; Koren et al., 2009; He et al., 2016) on improving the precision of recommending a single item at a time, research on intra-list correlations usually collapses to the problem of diversity (Wilhelm et al., 2018) in previous works. Yet we think the investigation of this correlation remains insufficient in several aspects. Firstly, the evaluation of diversity has long lacked a gold standard. Though there are different metrics such as Coverage and Intra-List Similarity (Ziegler et al., 2005; Shani and Gunawardana, 2011), those evaluations are typically subjective and not directly related to the user feedback. Secondly, previous works impose strong assumptions on how the items are correlated. Algorithms such as DPP (Wilhelm et al., 2018) and submodular ranking use handcrafted kernels or functions for diversity, and thus cannot effectively model all possible forms of correlation. Thirdly, the traditional step-wise greedy ranking method for the K-item RS cannot guarantee the global optimum. Take submodular ranking as an example: even if its assumptions hold exactly, the greedy choice is only guaranteed to reach a (1 − 1/e) fraction of the global optimum (Yue and Guestrin, 2011). Moreover, the loss of the local optimum relative to the global optimum is totally unclear for realistic intra-list correlations, which can take many forms beyond submodularity.
In this paper, we propose a new solution framework for addressing intra-list correlations: the Evaluator-Generator framework. On one side of this framework, the Evaluator aims at fully modeling all possible forms of intra-list interaction. It is trained on user logs to precisely predict the user feedback on each single item conditioned on all its surrounding items. The Evaluator can thus predict the utility of the whole list, so that the global optimum is pursued; in the meantime, diversity and other user requirements are satisfied to a proper extent as a side effect of seeking this target. On the other side, the Generator is a sequence generating model trained on the user logs or against the Evaluator. Traditional beam search or sampling methods are then applied to generate different sequences, from which the Evaluator chooses one as the final recommendation. Our offline analysis and online tests demonstrate the consistency of the proposed framework between offline and online environments: the generated sequences preferred by the Evaluator tend to perform better in our realistic online RS.
Our work makes the following contributions:

In order to fully consider the intra-list correlations in the recommended list, we propose a generalized Evaluator model which is directly optimized to maximize the overall utility of the whole list. Requirements such as diversity are satisfied naturally as a side effect.

We propose the Evaluator-Generator framework as a practical large-scale recommender system, which recommends the best combination of contents to the user given a candidate set of sequences. With this design, our framework is able to pursue the global optimum by improving the Generator or by applying heuristic search algorithms.

We show that the Evaluator can be used not only as an offline simulator, but also as an online selector. The experimental results both offline and online show that the Evaluator has a very high correlation with the online performance. We also conducted thorough investigations to understand the intra-list correlations. This framework is fully launched in our system, influencing hundreds of millions of online visits every single day.
2. Related Works
Our work is closely related to diversified ranking algorithms. Recent works on this topic include submodularity-based methods (Yan et al., 2017; Azar and Gamzu, 2011), graph-based methods (Zhu et al., 2007), and the Determinantal Point Process (DPP) (Borodin, 2009; Kulesza et al., 2012; Wilhelm et al., 2018). However, as mentioned above, those previous works either treat diversity as an independent goal (such as Coverage), or impose strong hypotheses on the form of item correlations. E.g., diversified ranking with predefined submodular functions supposes that diversity is homogeneous across different topics and independent of the user. DPP and submodular ranking also suppose that co-exposed items always have a negative impact on the local item. In contrast to those assumptions, the correlations in realistic online RS are often found to violate those rules. Our statistics show that some combinations of items yield better performance than the same single item appearing in other combinations. Other works have reported similarly intriguing phenomena, such as Serpentining (Zhao et al., 2017): users are found to prefer discontinuous clicks when browsing a list, so it is better to spread high-quality items uniformly over the list instead of clustering them in the top positions. Such phenomena show that intra-list correlations are complex and blended with position bias.
The position bias is also an important and practical issue in RS. Click models are widely studied in Information Retrieval systems, such as the Cascade Click Model (Craswell et al., 2008) and the Dynamic Bayesian Network (Chapelle and Zhang, 2009). It is found that the position bias is not only related to the user's personal habits, but also to the layout design, etc. (Chuklin et al., 2015). Thus click models often need to be considered case by case. Few previous works have studied the intra-list correlation and the position bias together. Set-based recommendation algorithms, including submodular ranking and DPP, do not consider the position bias at all. In contrast, our framework naturally addresses the position bias and intra-list correlations altogether.

We also borrow ideas from applying Reinforcement Learning to Combinatorial Optimization (CO) problems. The pointer network (Vinyals et al., 2015) was proposed as a neural tool for universal CO. (Bello et al., 2016) further extended the pointer network by using policy gradients for learning. (Dai et al., 2017) applied Q-Learning to CO problems on graphs. The aforementioned works focus on universal CO problems including the Travelling Salesman Problem, Minimum Vertex Cover, etc.

Some recent works have addressed long-term rewards in recommender systems. (Feng et al., 2018) also work on intra-list correlations, applying policy gradient and Monte Carlo Tree Search (MCTS) to optimize the NDCG in diversified ranking toward the global optimum. Other works pursue long-term rewards in inter-list recommendations (Zhao et al., 2018; Zheng et al., 2018). Though (Zhao et al., 2018) also proposed treating a page of recommendations as a whole, the intra-list correlations are not well modelled or analyzed in their work, and their method is not tested on a realistic online RS. In this paper, we investigate the general form of intra-list correlations more thoroughly. We formalize the utility of the whole list as our final target, while also using item-level feedback. This has not been sufficiently studied in the above works.
3. Backgrounds
3.1. Problem Setup
We formulate the K-item RS problem as follows. The environment exposes the user profile u and a candidate set C = {c_1, ..., c_N} to the RS, where N = |C| is the cardinality of the candidate set and c_i denotes the i-th item. The system picks a recommendation sequence S = (a_1, ..., a_K), where a_t ∈ C and a_i ≠ a_j for i ≠ j. S is exposed to the user, and the user returns the feedback r_t ∈ {0, 1} for each position t. Furthermore, we denote S_{<t} = (a_1, ..., a_{t-1}) as the preceding recommendations above the t-th position, and S_{>t} = (a_{t+1}, ..., a_K) as the recommendations that follow. We define the final objective as maximizing the expected overall utility of the sequence S, written as max_S R(u, S), where the utility R is defined by Eq. 1.
R(u, S) = Σ_{t=1}^{K} E[ r_t | u, a_t, S_{<t}, S_{>t} ]    (1)
In Eq. 1 we assume that r_t is related not only to the user u and the item a_t at position t, but also to the preceding and following items S_{<t} and S_{>t}. Also notice that we make no assumption about the form of the correlation, which means there are not only possibilities of negative and positive correlations, but also higher-order correlations (such as correlations among three items). This is essentially different from many previously proposed diversified ranking algorithms.
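To make Eq. 1 concrete, here is a minimal Python sketch. The `reward` function is a toy stand-in (not our trained model) for E[r_t | u, a_t, S_{<t}, S_{>t}], with a purely illustrative same-category adjacency penalty; the utility of a list is the sum of per-position rewards. Note how the globally best ordering separates the two same-category items:

```python
import itertools

def reward(user, item, preceding, following):
    """Toy stand-in for E[r_t | u, a_t, S_<t, S_>t]: base appeal of the item,
    halved when the immediately preceding item shares its category
    (an illustrative negative adjacency correlation)."""
    base = user["appeal"][item["id"]]
    if preceding and preceding[-1]["cat"] == item["cat"]:
        base *= 0.5
    return base

def list_utility(user, seq):
    """R(u, S) = sum over positions t of the contextual reward (Eq. 1)."""
    return sum(
        reward(user, item, seq[:t], seq[t + 1:])
        for t, item in enumerate(seq)
    )

user = {"appeal": {"a": 0.9, "b": 0.8, "c": 0.3}}
items = [{"id": "a", "cat": "sports"}, {"id": "b", "cat": "sports"},
         {"id": "c", "cat": "tech"}]
# The globally optimal ordering does not place the two best items adjacently:
best = max(itertools.permutations(items), key=lambda s: list_utility(user, s))
```

Exhaustive search over permutations is only feasible for a toy K; the rest of the paper is concerned with approximating this argmax at scale.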
3.2. Item Independent Ranking
Most previous RS are based on step-by-step greedy ranking, which proceeds one item at a time without considering the item correlations, and can be written as
f_θ(u, c_i, t) ≈ E_{S_{<t}, S_{>t}}[ r_t | u, a_t = c_i ]    (2)
We use the formulation A ≈ B to represent that A is optimized to approximate B. The parameterized function f_θ with trainable parameters θ is optimized to approximate the click-through rate (CTR) of item c_i at position t. The subscript of the expectation means that the item could appear with any S_{<t} and S_{>t}; Eq. 2 simply averages out all that information. We name this the Item Independent Prediction (IIP).
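An IIP-style ranker reduces to scoring each candidate once and taking the top-K. A minimal sketch (the per-item `score` function is hypothetical):

```python
def iip_rank(user, candidates, K, score):
    """Item Independent Prediction ranking (Eq. 2): score each candidate
    in isolation, ignoring every other item in the list, then take the top-K."""
    return sorted(candidates, key=lambda c: score(user, c), reverse=True)[:K]

# Hypothetical per-item CTR model: a plain lookup table for the sketch.
score = lambda user, c: user["ctr"].get(c, 0.0)
user = {"ctr": {"x": 0.30, "y": 0.25, "z": 0.10, "w": 0.05}}
top = iip_rank(user, ["w", "x", "y", "z"], 3, score)
```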
3.3. Preceding Items Aware Prediction
It is natural to improve Eq. 2 by taking S_{<t} into account. We define the Preceding Items Aware Prediction (PIAP) as Eq. 3.
f_θ(u, c_i, S_{<t}) ≈ E_{S_{>t}}[ r_t | u, a_t = c_i, S_{<t} ]    (3)
Here we omit the position t and simply write f_θ(u, c_i, S_{<t}), because the position can be recovered from S_{<t}, and thus encoding S_{<t} automatically incorporates the position bias. Many previous diversified recommendation algorithms can be considered as specific paradigms of Eq. 3, e.g., linear submodular recommendation (Azar and Gamzu, 2011). However, we have made no assumption about the formula of f_θ yet. We argue that the expression of Eq. 3 essentially accounts for all kinds of correlations with the preceding items, including positive, negative and higher-order correlations.
However, two defects of Eq. 3 need to be mentioned. Firstly, the one-by-one ranking paradigm faces the issue of local optima, and the performance of the sequence as a whole is not guaranteed. Secondly, the following items S_{>t} are not taken into account, although results from click models (Wang et al., 2015) and mouse tracking studies (Diaz et al., 2013) reveal that S_{>t} does have an impact on the overall performance at position t. Knowing S_{>t} in advance, however, is not feasible in a one-by-one ranking process. In the next part we introduce the Evaluator-Generator framework, which solves both issues properly.
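The first defect can be seen in a few lines. The sketch below implements PIAP-style stepwise greedy ranking with a hypothetical scorer that halves an item's CTR when the previous item shares its category; the greedy list lands on a local optimum (total 0.525), while a different ordering of the same items reaches 0.65:

```python
def piap_rank(user, candidates, K, score):
    """Preceding-Items-Aware greedy ranking (Eq. 3): each step picks the best
    candidate conditioned on the items already placed. Greedy, so the list
    as a whole may still be a local optimum."""
    chosen, pool = [], list(candidates)
    for _ in range(K):
        best = max(pool, key=lambda c: score(user, c, chosen))
        chosen.append(best)
        pool.remove(best)
    return chosen

# Hypothetical scorer: base CTR, halved when the previous item shares a category.
CATS = {"x": "sports", "y": "sports", "z": "tech"}
BASE = {"x": 0.30, "y": 0.25, "z": 0.10}

def toy_score(user, c, preceding):
    penalty = 0.5 if preceding and CATS[preceding[-1]] == CATS[c] else 1.0
    return BASE[c] * penalty

greedy_list = piap_rank(None, ["x", "y", "z"], 3, toy_score)
# Greedy yields x, y, z with total 0.525; the ordering x, z, y would total 0.65,
# illustrating the local-optimum defect discussed above.
```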
4. Evaluator-Generator Framework
4.1. Overall Framework
In order to fully capture the intra-list correlations and pursue the global optimum, we propose to separate the list generation from the evaluation process. The evaluation process is concerned only with predicting the value of a known sequence, which we call the Evaluator. The generative process is responsible for generating high-value sequences, which we call the Generator. During inference, it is possible to use multiple generators to generate multiple lists, from which the Evaluator selects the best one (Fig. 1). We argue that this workflow is beneficial in the following aspects.

Thorough consideration of intra-list correlations. The Evaluator encodes the information of the whole sequence together, so the intra-list correlations and the position bias are incorporated effectively. It thus achieves higher item-level accuracy than either IIP or PIAP.

Light-weight Generators. Any single generator is limited, as the generative process does not have full access to the whole sequence's information. The Evaluator-Generator framework allows generating multiple lists and choosing the best one, so light-weight generators can be used to explore high-quality candidate lists.

The global optimum is pursued. The Evaluator pursues the global optimum explicitly by directly comparing the utility of whole lists. E.g., by encoding the whole sequence S, the utility score not only encourages items that achieve a higher click-through rate themselves, but also encourages items that attract further examination and clicks to the other items on the result page.
4.2. The Evaluator
4.2.1. Contextual Items Aware Prediction
We propose the Contextual Items Aware Prediction (CIAP), which encodes both S_{<t} and S_{>t} to predict the feedback at the t-th position, written as
f_θ(u, a_t, S_{<t}, S_{>t}) ≈ E[ r_t | u, a_t, S_{<t}, S_{>t} ]    (4)
We sum up the predictions at every position of a sequence as an estimation of the overall utility, that is,

R_Eval(u, S) = Σ_{t=1}^{K} f_θ(u, a_t, S_{<t}, S_{>t})    (5)
We can see that R_Eval(u, S) is an approximation of the expectation in Eq. 1. Given candidate sequences S^1, ..., S^n, the Evaluator can be used to select a sequence greedily by argmax_i R_Eval(u, S^i).
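The selection step can be sketched as follows; `toy_f` is a hypothetical CIAP-style model (not our trained Evaluator) in which an item's CTR is halved by a same-category neighbor on either side:

```python
def evaluator_score(user, seq, f):
    """R_Eval(u, S) = sum_t f(u, a_t, S_<t, S_>t)  (Eq. 5)."""
    return sum(f(user, item, seq[:t], seq[t + 1:]) for t, item in enumerate(seq))

def pick_best(user, sequences, f):
    """Greedy selection over candidate sequences: argmax_i R_Eval(u, S^i)."""
    return max(sequences, key=lambda s: evaluator_score(user, s, f))

# Hypothetical contextual model: CTR halved by a same-category neighbor on
# either side (a contextual effect, not just a preceding-item effect).
CTR = {"a": 0.4, "b": 0.4, "c": 0.2}
CAT = {"a": "s", "b": "s", "c": "t"}

def toy_f(user, item, before, after):
    p = CTR[item]
    if before and CAT[before[-1]] == CAT[item]:
        p *= 0.5
    if after and CAT[after[0]] == CAT[item]:
        p *= 0.5
    return p

chosen = pick_best(None, [["a", "b", "c"], ["a", "c", "b"]], toy_f)
```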
4.2.2. Determinantal Point Process
Among previous works, the Determinantal Point Process (DPP, (Kulesza et al., 2012)) and submodular ranking (Borodin et al., 2012) are the most closely related to ours. We briefly review DPP models for RS. A DPP models the probability P(Y) that a subset Y of the exposed list is clicked. It explicitly sets the correlation between items i and j to L_{ij} = k(c_i, c_j), where k is a kernel function that maps an item pair symmetrically to a real value. Let L denote the full kernel matrix and L_Y the submatrix of L with only the rows and columns indexed by Y. DPP proposes that P(Y) obeys the following rule:

P(Y) ∝ det(L_Y)    (6)
With minor mathematical calculation it can be shown that log det(L_Y) in Eq. 6 satisfies submodularity. (Wilhelm et al., 2018) proposed a practical framework for deploying DPP in diversified ranking (Deep DPP), in which each element of L is decomposed into the product of a quality term and a correlation term, both calculated from a deep neural network.

DPP can be interpreted as one special type of Evaluator, which estimates the value based on a full set (or sequence) instead of a single item. Note that DPP only models the set-wise click probability P(Y), which is not strictly correlated with the sum of clicks Σ_t r_t. However, DPP maximizes det(L_Y) in its ranking process, just as the Evaluator maximizes R_Eval(u, S); thus we treat det(L_Y) as its evaluation score in the experiments.
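The following sketch computes the (unnormalized) DPP score det(L_Y) for a toy 3-item kernel with unit qualities; as expected, the diverse pair {0, 2} scores higher than the near-duplicate pair {0, 1}:

```python
def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        if abs(m[p][i]) < 1e-12:
            return 0.0
        if p != i:
            m[i], m[p] = m[p], m[i]
            d = -d
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return d

def dpp_score(L, Y):
    """Unnormalized DPP probability of subset Y: det(L_Y)  (Eq. 6)."""
    return det([[L[i][j] for j in Y] for i in Y])

# Toy kernel: unit qualities; items 0 and 1 highly similar, item 2 dissimilar.
L = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.1],
     [0.1, 0.1, 1.0]]
```

Larger determinants correspond to more diverse subsets, which is exactly the negative-correlation assumption discussed above.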
4.3. The Generators
We denote the Generator policy as π_θ (with slight abuse of notation, θ is again the parameter of the generator), which represents the probability of generating the sequence S given the user u and the candidate set C. Though various paradigms for generating a list exist, here we mainly use a sequential decision policy, defined as Eq. 7.
π_θ(S | u, C) = Π_{t=1}^{K} π_θ(a_t | u, C, S_{<t})    (7)
4.3.1. Softmax Sampling
The simplest way is to use some heuristic value to generate multiple sequences by applying softmax sampling, in which the decision probability is written as

π(a_t = c | u, C, S_{<t}) = exp(f(c) / τ) / Σ_{c' ∈ C∖S_{<t}} exp(f(c') / τ)    (8)
We call the heuristic value f the priority score. It is straightforward to use IIP or PIAP as f, resulting in the IIP + Softmax and PIAP + Softmax policies. Also notice that if τ → 0, Eq. 8 degenerates to the greedy policy.
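A minimal sketch of the softmax-sampling Generator (the priority scores are hypothetical constants here): sampling proceeds position by position without replacement, and a very small τ reproduces the greedy policy:

```python
import math
import random

def softmax_sample(scores, tau, rng):
    """Draw an index with probability proportional to exp(score / tau)  (Eq. 8)."""
    mx = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - mx) / tau) for s in scores]
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(scores) - 1

def softmax_generate(user, candidates, K, priority, tau, rng):
    """Build a list step by step, sampling without replacement at each position."""
    pool, seq = list(candidates), []
    for _ in range(K):
        scores = [priority(user, c, seq) for c in pool]
        seq.append(pool.pop(softmax_sample(scores, tau, rng)))
    return seq

rng = random.Random(0)
prio = lambda u, c, seq: {"x": 0.3, "y": 0.2, "z": 0.1}[c]  # hypothetical priorities
near_greedy = softmax_generate(None, ["z", "y", "x"], 3, prio, 1e-6, rng)
```

Calling `softmax_generate` many times with a moderate τ yields the diverse candidate lists from which the Evaluator picks.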
4.3.2. Reinforcement Learning
It is possible to interpret Eq. 1 as a Combinatorial Optimization (CO) problem with stochastic evaluation. The pointer network and Reinforcement Learning approaches to CO have recently been found promising ((Vinyals et al., 2015), (Bello et al., 2016)). We formulate the generation of the ranking list as a Markov Decision Process, described as follows: at each step t, the state is decided by the user profile u, the items that have already been chosen, and the candidates C, i.e., s_t = (u, S_{<t}, C). The policy π(a_t | s_t) is represented similarly to Eq. 8. Each time after choosing the item at the t-th step, a new state s_{t+1} and a reward r_t are exposed.

Vanilla policy gradient often suffers from large variance and low data efficiency. Recently the more sample-efficient Q-Learning has been applied to combinatorial optimization problems (Khalil et al., 2017). We denote the value function of a state-action pair under the generative policy π as Q^π(s_t, a_t), written as

Q^π(s_t, a_t) = E_π[ Σ_{t'=t}^{K} r_{t'} | s_t, a_t ]    (9)
We use the function approximation Q_θ(s_t, a_t) to approximate the value function of the optimal policy, Q*(s_t, a_t). The aforementioned IIP and PIAP models can be used as the function approximator. Q-Learning minimizes the loss in Eq. 10.
L(θ) = E[ ( r_t + max_{a'} Q_{θ'}(s_{t+1}, a') − Q_θ(s_t, a_t) )² ]    (10)
Here θ' denotes the parameters of the target model, which follows θ with a delay. Also notice that the discount factor of reinforcement learning is 1 in our problem, as the horizon is finite. A detailed explanation of Deep Q-Learning can be found in (Mnih et al., 2013). Q-Learning is an off-policy learning method, thus we can train on offline user interactions.
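As an illustration, the sketch below runs tabular Q-learning (a deliberate simplification of the neural function approximators used in practice) on a toy environment where an item's reward is halved when the previous item shares its category; with discount factor 1, the learned greedy rollout reaches the global optimum of 0.65 that stepwise greedy ranking misses:

```python
import random

def q_learning_generator(reward, candidates, K, episodes=5000, alpha=0.1,
                         eps=0.2, seed=7):
    """Tabular Q-learning sketch of the list-building MDP (Eq. 9-10):
    state = tuple of items chosen so far, discount factor 1.  `reward` stands
    in for (real or Evaluator-estimated) per-position feedback."""
    rng = random.Random(seed)
    Q = {}
    q = lambda s, a: Q.get((s, a), 0.0)
    for _ in range(episodes):
        s, pool = (), list(candidates)
        for _ in range(K):
            # epsilon-greedy behavior policy
            a = rng.choice(pool) if rng.random() < eps else max(pool, key=lambda c: q(s, c))
            r, s2 = reward(s, a), s + (a,)
            pool.remove(a)
            target = r + (max(q(s2, c) for c in pool) if pool else 0.0)
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s = s2
    # greedy rollout with the learned Q values
    s, pool, out = (), list(candidates), []
    for _ in range(K):
        a = max(pool, key=lambda c: q(s, c))
        out.append(a)
        pool.remove(a)
        s = s + (a,)
    return out

# Toy environment: base CTR, halved after a same-category item.  Stepwise
# greedy ranking on these scores totals 0.525; the global optimum is 0.65.
CATS = {"x": "sports", "y": "sports", "z": "tech"}
BASE = {"x": 0.30, "y": 0.25, "z": 0.10}
toy_reward = lambda s, a: BASE[a] * (0.5 if s and CATS[s[-1]] == CATS[a] else 1.0)

ranked = q_learning_generator(toy_reward, ["x", "y", "z"], 3)
```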
4.3.3. Reinforcement Learning with Virtual Reward
As RL often requires a large amount of data, we propose using the Evaluator as an offline estimator and reward function, i.e., replacing the real user feedback r_t with the estimated rewards of Eq. 5. This is similar to the counterfactual estimators (Gilotte et al., 2018) and the reward model (Leike et al., 2018). We call this learning from Virtual Reward. There are risks in learning directly from the offline estimator: neural networks are known to be vulnerable to adversarial attacks (Lin et al., 2017), and the learning process of the Generator resembles a black-box adversarial attack. If there is a discrepancy between the Evaluator and the realistic environment, the Generator may deviate from the real target.
However, learning from the Evaluator is more data efficient. In our offline experiments, we also evaluate the Generator using the Evaluator as the gold standard; in other words, we treat the Evaluator as a simulator (or environment). We also carried out online experiments on the Generator trained with virtual reward in order to test its validity. We return to this in the experiments section.
5. Model Architectures
Notice that until now we have made no assumption on the form of the policy or the prediction model. A great many neural network models have been proposed to tackle different aspects of RS, including the well-known YouTube RS (Covington et al., 2016) and Google's Wide & Deep model (Cheng et al., 2016). The model structure is highly coupled with the problem and the features. In this paper, in order to compare the frameworks themselves, we propose several simple and comparable model architectures. It is worth mentioning that we could surely add more complexity to those structures to account for more features and factors (such as dynamic user interest), but those are not the main focus of this paper.
DNN for IIP. Considering Eq. 2, where the prediction is only related to the user, the item and its position, we use Multi-Layer Perceptrons (MLP) to build the mapping. We represent the user with the embedding e_u, the item with e_c, and the position with e_t. In our dataset, the representation of an item is the concatenation of the item id embedding and other features (such as categories and titles). We use a two-layer MLP to process the concatenated information (Fig. 2), thus we have

f_θ(u, c, t) = MLP([e_u; e_c; e_t])    (11)
GRNN for PIAP. To model the context items S_{<t}, we use recurrent neural structures (such as the Gated Recurrent Neural Network, GRNN). After encoding the preceding items into a hidden state, we concatenate it with the representations of the user and the current item, and apply the two-layer MLP (Fig. 2). The model can be written as

h_{t-1} = GRU(e_{a_1}, ..., e_{a_{t-1}}),  f_θ(u, c, S_{<t}) = MLP([e_u; e_c; h_{t-1}])    (12)
Notice that the position t does not appear explicitly in the GRNN model. We rely on the model to encode the position implicitly through the forward pass of the GRU.
BiGRNN for CIAP. For the Evaluators, to fully take S_{>t} into account, we apply a second GRNN in the reversed direction, followed by the same two-layer MLP (Fig. 2), which we call a Bidirectional ((Ma and Hovy, 2016)) GRNN (BiGRNN).

h→_{t-1} = GRU(e_{a_1}, ..., e_{a_{t-1}}),  h←_{t+1} = GRU(e_{a_K}, ..., e_{a_{t+1}}),  f_θ(u, a_t, S_{<t}, S_{>t}) = MLP([e_u; e_{a_t}; h→_{t-1}; h←_{t+1}])    (13)
Transformer for CIAP. The Transformer was proposed by (Vaswani et al., 2017) for neural machine translation. Recently there have been a great number of successes in using the Transformer to encode sequences. We use the Transformer as an alternative structure in place of the BiGRNN. We first concatenate the user descriptor with the item representation and the position embedding, and then apply a 2-layer Transformer to predict the probability of a click at each position, as shown in Eq. 14. In our attention models, we use single-head attention.

x_t = [e_u; e_{a_t}; e_t],  f_θ(u, a_t, S_{<t}, S_{>t}) = MLP(Transformer(x_1, ..., x_K)_t)    (14)
Deep DPP. We use a similar kernel to model the correlation between each pair of items as proposed by (Wilhelm et al., 2018). An MLP head is used to predict the quality term q_i for the user and item, and a kernel matrix K is introduced to represent the correlations:

L_{ij} = q_i K_{ij} q_j    (15)

where K_{ij} is derived from the Jaccard distance between item i and item j.
Pointer Network (Vinyals et al., 2015). The Pointer Network is designed to copy and reorder its input sequence. Here we use it to generate the sequence S from the original unordered set C: the network encodes the candidate set C with a recurrent structure and applies a decoder to generate the sequence S. We write the generation process as

π_θ(a_t = c_i | u, C, S_{<t}) = Attention(Decoder(S_{<t}), Encoder(C))_i    (16)

The vanilla pointer network outputs the attention weights over the candidates. When trained with Q-Learning, we instead use the output to approximate the Q values, and take the argmax at inference time. Moreover, the vanilla pointer network is not devised to take other context information (such as the user u) into account. To address this, we inject the user feature both at the end of the encoding process and at each decoding step, so that the pointer network can account for the user preference thoroughly.
6. Experiments
In our experiments, we use the user interaction records of 100 million lists for training and 1 million lists for testing, collected from the Baidu App News Feed System (BANFS), one of the largest RS in China. BANFS serves hundreds of millions of clicks online each day. The length of each sequence may differ. To reduce the cost of the experiments, our offline dataset contains only a subset of the features, including the user id, item id, item category and layout (the visual appearance of the item). The dataset is not yet published due to privacy issues; once the data is fully desensitized, publishing the dataset will be considered.
6.1. Evaluation Metrics
Traditional IR evaluation metrics (such as NDCG, MAP, ERR) are not suitable for evaluating intra-list correlations. Firstly, those metrics assume that user feedback is fully observable for all items, or that oracle rankings exist, which is no longer true for interactive systems such as news feed RS. Secondly, those metrics assume that the items are independent of each other.
To evaluate combinatorial RS, (Yue and Guestrin, 2011) uses an artificial simulator, while others (Wilhelm et al., 2018) use online experiments. Counterfactual evaluation techniques have also been reported ((Dudík et al., 2011), (Li et al., 2015)), but applying those methods to a realistic RS with millions of candidate items and billions of users is often intractable.
In this work, we evaluate our ideas from the following aspects.

Firstly, three metrics are used to evaluate the precision of the Evaluator against realistic user feedback. The Area Under the ROC Curve (AUC) is used to evaluate the precision of the prediction for each item in each sequence. RMSES and Correlation are used to evaluate the overall clicks of a sequence. The Root Mean Square Error of Sequences (RMSES) over M sequences is defined as

RMSES = sqrt( (1/M) Σ_{m=1}^{M} ( R_Eval(u^m, S^m) − Σ_t r_t^m )² )    (17)

Since some methods, such as DPP, do not predict on the right scale, we also evaluate the Correlation between the total real clicks and the evaluation score, defined as

Correlation = Cov( R_Eval(u, S), Σ_t r_t ) / ( σ(R_Eval(u, S)) · σ(Σ_t r_t) )    (18)
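The two sequence-level metrics can be computed as follows, where `pred[i]` is the Evaluator's predicted total clicks of the i-th list and `real[i]` the observed total:

```python
import math

def rmses(pred, real):
    """Root Mean Square Error of Sequences (Eq. 17)."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, real)) / len(pred))

def correlation(pred, real):
    """Pearson correlation between evaluation scores and real total clicks (Eq. 18)."""
    n = len(pred)
    mp, mr = sum(pred) / n, sum(real) / n
    cov = sum((p - mp) * (r - mr) for p, r in zip(pred, real))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sr = math.sqrt(sum((r - mr) ** 2 for r in real))
    return cov / (sp * sr)
```

Correlation is scale-free, which is why it remains usable for scores such as DPP determinants that do not live on the click scale.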
Secondly, we compare different Generators by regarding the Evaluator itself as a simulator (or environment). We also present some of our findings on the intuitive perception of the generated patterns.

Finally, we present the results of comparing different ranking frameworks in online A/B tests. As we are more cautious about online experiments, we did not carry them out on all possible ranking frameworks. It is worth noticing that our online experiments use a larger feature set and more data to achieve better performance, thus the online and offline performances are not directly comparable.
6.2. Results on The Evaluators
To evaluate the precision of the Evaluator models, we use AUC, RMSES and Correlation as the criteria for comparison. We compare different models including DNN (IIP), GRNN (PIAP), BiGRNN (CIAP), Transformer (CIAP) and DPP. From Tab. 1, we can see that S_{<t} and S_{>t} do have an impact on the click at the t-th position; the BiGRNN and Transformer perform best on all three criteria; the Transformer is slightly better than the BiGRNN, as expected; and the performance of DPP is below the baseline, mainly because it misses the position bias. To further study the impact of context items, we randomly replace all the preceding items S_{<t} or the following items S_{>t} in BiGRNN and Transformer. The performance is shown as BiGRNN (Disturb) and Transformer (Disturb) in Tab. 1.
Algorithm  AUC  RMSES  Correlation 
DNN(IIP)  0.7658  0.2781  0.4685 
GRNN(PIAP)  0.7730  0.2769  0.4838 
BiGRNN(CIAP)  0.7793  0.2760  0.4925 
BiGRNN(Disturb S_{<t})  0.7708     
BiGRNN(Disturb S_{>t})  0.7714     
Transformer(CIAP)  0.7802  0.2760  0.4969 
Transformer(Disturb S_{<t})  0.7719     
Transformer(Disturb S_{>t})  0.7738     
Deep DPP      0.3810 
To visualize the intra-list correlations in BANFS, we plot the heatmap of the average attention strength of the single-head self-attention layer of a 1-layer Transformer (Fig. 3). We mention several phenomena that are coherent with our intuition or investigation. Firstly, each item is more dependent on its preceding items, especially those at the top of the list. BANFS pushes 10 to 50 items each time, but only 2 to 4 items can be displayed on a single screen (Fig. 3), so a user needs to slide downward to examine the whole list. Thus, whether the items at the top of the sequence attract the user affects the probability that the items below are examined, and thus clicked. Secondly, the attention between adjacent positions t and t+1 is not as strong as that between positions t and t+2, which makes the heatmap interweave (like a chessboard). To study this phenomenon further, we also plot the correlation of clicks between any two positions. Fig. 3 shows that the correlation of user clicks interweaves as well: adjacent positions are less likely to be clicked together, while positions t and t+2 are more likely to be clicked together. This is consistent with the Serpentining phenomenon mentioned by (Zhao et al., 2017). It further shows that the intra-list correlations are much more complicated than many previously proposed position-bias hypotheses or unordered set-wise hypotheses.
6.3. Results on the Generators
To evaluate the Generators, we randomly sample 1 million candidate sets from the user interaction logs; those are regarded as pools of candidates C. In each iteration, we randomly sample a user u, a candidate set C and the length K of the final list. The length follows the real distribution of sequence length online, which varies between 10 and 50, and the candidate set is sampled to be larger than K. Then, the Generator is required to generate one or multiple sequences of length K, and we use the Evaluator to pick the sequence with the highest overall score (if only 1 list is generated, as in greedy picking, no Evaluator is needed). We compare the evaluation scores of the finally picked lists. There are mainly three types of Generators to be compared:

Supervised Learning (SL) + Greedy. The Generator is trained with the normal log-likelihood loss, using user feedback from the logs. We generate the list with the greedy policy, as in classical ranking.

SL + Softmax. We use the predicted CTR score as the priority score, then apply the softmax sampling of Eq. 8, generating lists with temperature τ. We then use the corresponding Evaluator to pick the one with the highest score.

Reinforcement Learning (RL) with Virtual Rewards. We apply RL to the Generator using rewards from the corresponding Evaluator. At test time, the Generator generates only one list greedily.
We choose three Evaluators: GRNN, BiGRNN and Transformer. The results are shown in Tab. 2. Several remarks can be made. GRNN + SL + Greedy cannot compete with GRNN + SL + Softmax (100 samples), even when the Evaluator is GRNN itself; this shows that the gap between the local and the global optimum is considerably large. As expected, RL + Greedy outperforms SL + Greedy, since it learns more combinatorial information from the Evaluator. Additionally, Softmax (100 samples) achieves the best performance under all 3 Evaluators, which indicates that it is hard for a single sequence to beat multiple different sequences, even if that sequence is generated by a strong Generator such as PointerNet + RL.
Generator  with Different Evaluators  
GRNN  BiGRNN  Transformer  
DNN + SL + Greedy    1.1416  1.5887 
GRNN + SL + Greedy  1.7162  1.5322  1.7019 
GRNN + SL + Softmax()  1.8134  1.7429  1.8368 
GRNN + SL + Softmax()  1.8676  1.8179  1.9032 
PointerNet + RL(Virtual Reward) + Greedy    1.6372  1.8967 
To illustrate that the Evaluator-Generator framework indeed yields better item combinations, we inspected the generated patterns. The combinations of item ids or high-dimensional features are far too sparse to give insightful results, thus we focus on the layouts of short local segments. The layout marks the visual arrangement within each item shown to the user (Fig. 3). More concretely, we denote the layout of the t-th item as l_t ∈ {1, ..., 6}, as BANFS has 6 types of layouts, e.g., "text-only", "one-image", "three-images", etc. We then consider the layouts of a segment of three consecutive items starting at position t, (l_t, l_{t+1}, l_{t+2}), as the local pattern to look into; there are 6³ = 216 distinct layout patterns in total. For a recommendation list of length K, we count all K − 2 possible segments. The procedure for analyzing local patterns works as follows. Firstly, the average sum of clicks of each segment pattern is counted from the user log. Under the assumption that a higher click rate indicates a better pattern, we use the expected clicks to measure the quality of a local layout pattern, and we regard the patterns that rank top-N in expected clicks among the 216 possible patterns as "good patterns". To evaluate our proposed framework, we calculate the ratio of top-N pattern segments in the lists generated by different Generators. The results are shown in Tab. 3, where we use BiGRNN as the Evaluator. Consistent with Tab. 2, the Softmax and RL groups generate more "good patterns", as well as scoring higher with the Evaluator, than the SL + Greedy methods. These results demonstrate that our proposed Evaluator-Generator framework is consistent with this kind of intuitive indicator, i.e., a higher-scoring RS generates more of the patterns that users prefer to click.
Generator  Top3 Pattern(%)  Top5 Patterns(%)  Top10 Patterns(%) 
DNN + SL + Greedy  1.46  1.85  3.43 
GRNN + SL + Greedy  1.43  2.43  7.32 
DNN + SL + Softmax ()  1.31  1.72  4.50 
GRNN + SL + Softmax ()  1.80  2.88  8.54 
PointerNet + RL(Virtual Reward) + Greedy  2.45  2.92  6.67 
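The pattern-counting procedure above can be sketched as follows; the log format (a list of (layouts, clicks) pairs, aligned per position) is a simplification of our real logs:

```python
from collections import Counter

def pattern_quality(logs):
    """Expected total clicks of each 3-consecutive-layout segment, estimated
    from user logs of (layouts, clicks) pairs."""
    clicks, seen = Counter(), Counter()
    for layouts, cks in logs:
        for t in range(len(layouts) - 2):
            seg = tuple(layouts[t:t + 3])
            clicks[seg] += sum(cks[t:t + 3])
            seen[seg] += 1
    return {seg: clicks[seg] / seen[seg] for seg in seen}

def good_pattern_ratio(generated, good):
    """Fraction of 3-item segments in generated lists that are 'good patterns'."""
    hits = total = 0
    for layouts in generated:
        for t in range(len(layouts) - 2):
            total += 1
            hits += tuple(layouts[t:t + 3]) in good
    return hits / total
```

In our analysis, `good` would be the set of top-N patterns ranked by `pattern_quality` over the 216 possibilities.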
6.4. Performance Online
Correlation between Evaluators and Online Performance. The previous results show that the Evaluator correlates well with the sum of clicks of a list. But is the predicted sum of clicks related to the final online performance? Is it appropriate to treat the Evaluator as a simulator? We performed an additional investigation on the correlation between the evaluation scores of lists and the performance in A/B tests. Typically, we judge whether a new policy is superior to the baseline by the increase of some critical metric, such as total user clicks, in an A/B test. For two experiment groups, experimental (a) and baseline (b), the Relative Gain in total clicks is defined as the relative increase of clicks compared with the baseline. We retrieve the logs of past experiments and re-predict the clicks of each sequence in the records by inference with our Evaluator model. We calculate the Predicted Relative Gain by

Predicted Relative Gain = ( Σ_{S ∈ group a} R_Eval(u, S) − Σ_{S ∈ group b} R_Eval(u, S) ) / Σ_{S ∈ group b} R_Eval(u, S)    (19)
We collect over 400 A/B testing experiments conducted during 2018, including not only new ranking strategies with various policies, models and features, but also human rules. Some of the new strategies were tested to be positive, others negative. We compute the correlation between the predicted relative gain and the real relative gain, using the DNN and the BiGRNN Evaluator for comparison. The correlation between DNN (IIP) and online performance among the 400 experiments is 0.2282, while the correlation between BiGRNN (CIAP) and real performance is as high as 0.9680. This shows, to some extent, that the Evaluator can evaluate a strategy before an online A/B test is run, with relatively high confidence.
|  | DNN | BiGRNN |
|---|---|---|
| Correlation with Online Performance | 0.2282 | 0.9680 |
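The replay procedure behind Eq. (19) reduces to a few lines. This sketch assumes we have, per past experiment, the Evaluator's predicted click sums for the experimental and baseline groups; the function names are illustrative.

```python
import numpy as np

def predicted_relative_gain(pred_clicks_exp, pred_clicks_base):
    """Relative increase of predicted total clicks over the baseline group."""
    return (sum(pred_clicks_exp) - sum(pred_clicks_base)) / sum(pred_clicks_base)

def correlation_with_online(pred_gains, real_gains):
    """Pearson correlation between predicted and real relative gains
    across a collection of past A/B experiments."""
    return float(np.corrcoef(pred_gains, real_gains)[0, 1])
```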
Online A/B Test of the Ranking Framework. We have conducted a series of online experiments on BANFS; we report the comparison of the following methods.

- IIP. A multi-layer perceptron using user, item, user-item, and position features only.
- PIAP. Adding … to the model; a list is generated by greedily ranking the highest-scored items.
- Evaluator-Generator. We use BiGRNN as the Evaluator, and supervised-learning PIAP with softmax sampling as the Generator. The Generator generates 20 lists and the Evaluator picks the one with the highest score.
- PointerNet + RL (Virtual Reward). We use PointerNet as the Generator and Reinforcement Learning to optimize its parameters, with the Evaluator's prediction as the offline reward in RL training. This Generator generates only 1 list for recommendation.
- PointerNet + RL (Real Click). We use PointerNet as the Generator and Reinforcement Learning to optimize its parameters, with real user clicks as the reward and off-policy learning methods. It also generates only 1 list for recommendation.
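The serving step of the Evaluator-Generator variant above can be sketched as follows, under assumed interfaces (`generator.sample` and `evaluator.score` are hypothetical names): the Generator samples several candidate lists and the Evaluator keeps the highest-scored one.

```python
def generate_and_select(generator, evaluator, candidates, k, n_lists=20):
    """Sample n_lists candidate lists of k items and serve the best one.

    generator.sample(candidates, k): returns one ranked list of k items,
        e.g. via softmax sampling over model scores (assumed interface).
    evaluator.score(list_): predicts the expected total clicks of a list.
    """
    lists = [generator.sample(candidates, k) for _ in range(n_lists)]
    return max(lists, key=evaluator.score)  # keep the highest-scored list
```

The RL variants skip the selection step entirely: they call the Generator once and serve its single output, which is what makes their inference cost lower.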
| Recommender Framework | Relative Gain (%) | Coverage of Categories (%) |
|---|---|---|
| PIAP vs. IIP | … | … |
| Evaluator-Generator vs. PIAP | … | … |
| PointerNet + RL (Virtual Reward) vs. Evaluator-Generator | … | … |
| PointerNet + RL (Real Click) vs. Evaluator-Generator | … | … |
Our evaluation so far shows that the Evaluator-Generator with a softmax-sampling Generator has state-of-the-art online performance. The RL group shows performance comparable to the Evaluator-Generator. Whereas the Evaluator-Generator generates and evaluates 20 lists, RL generates only one; thus RL has a lower inference cost. Though we believe that RL should be more cost-efficient and more straightforward for solving the intra-list correlation, our experiments show that the performance of RL is yet to be improved. We have also compared the Coverage ((Ziegler et al., 2005)) over distinct categories (we have a classification system that classifies all news into 40+ categories). It is verified that PIAP and the Evaluator-Generator indeed improve the diversity of exposure, even though diversity is never an explicit target in our framework.
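The Coverage comparison can be reproduced with a simple helper. This is a sketch: the mapping from items to the 40+ categories is assumed to be available as a dict, and the function name is illustrative.

```python
def category_coverage(exposed_lists, item_category, n_categories):
    """Fraction of all known categories that appear in the exposed lists.

    exposed_lists: lists of item ids shown to users.
    item_category: dict mapping item id -> category label.
    n_categories:  total number of categories in the classification system.
    """
    seen = {item_category[item] for lst in exposed_lists for item in lst}
    return len(seen) / n_categories
```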
7. Discussion
In this paper, we propose a recommender framework that optimizes the K items in the result page as a whole. We propose the Evaluator-Generator framework to solve this combinatorial optimization problem, and show that, compared with other diversified ranking algorithms, the proposed framework is capable of capturing various possible correlations as well as the position bias. In this section, we offer further comments on this framework and its possible future extensions.
Exploration and Exploitation. Exploration is very important for interactive systems. Greedy ranking in RS typically ends up in mediocre or outdated recommendations and eventually jeopardizes performance. Exploration for sequence-optimizing RS includes not only the item level, but also the combination level. To explain the latter: if the user always saw a perfect combination, the Evaluator would never be able to learn that a lack of diversity harms performance. We have also observed this phenomenon online; BANFS keeps a small fraction of its traffic for exploration.
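The traffic split for combination-level exploration can be sketched as a simple router. This is a hypothetical illustration: `serve`, `greedy_list_fn`, and `sampled_list_fn` are invented names, and the 2% rate is illustrative, not BANFS's actual setting.

```python
import random

def serve(greedy_list_fn, sampled_list_fn, explore_rate=0.02, rng=random):
    """Route a small fraction of requests to an exploratory list."""
    if rng.random() < explore_rate:
        return sampled_list_fn()   # exploratory (e.g. softmax-sampled) combination
    return greedy_list_fn()        # exploit the current best list
```

Serving imperfect combinations to a small slice of users is what lets the Evaluator keep learning the cost of, for example, a non-diverse list.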
Evaluator-Generator vs. RL only. Some might argue that a well-designed RL Generator alone is enough to handle intra-list correlation, making the Evaluator redundant. In our case, the Evaluator works as both the offline estimator and the online selector. The Evaluator-Generator framework not only has much more flexibility in a realistic online system, it is also close to the "reward modeling" proposed in (Leike et al., 2018). The Evaluator is able to approximate real user preference better, with less effort, than the Generator. We think the difference between Generator-only and Evaluator-Generator approaches deserves further investigation.
Synthesizing Intra-list Correlation and Inter-list Evolution. Though some works have shed light on the "ultimate" RS that incorporates both intra-list and inter-list correlations ((Zhao et al., 2018)), incorporating both effects in a real RS remains very tricky, and few approaches have been validated in real online systems. We believe this is a challenging but promising topic for the future.
References
 Adomavicius and Tuzhilin [2015] Gediminas Adomavicius and Alexander Tuzhilin. Context-aware recommender systems. In Recommender systems handbook, pages 191–226. Springer, 2015.
 Azar and Gamzu [2011] Yossi Azar and Iftah Gamzu. Ranking with submodular valuations. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages 1070–1079. SIAM, 2011.
 Bello et al. [2016] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
 Borodin et al. [2012] Allan Borodin, Hyun Chul Lee, and Yuli Ye. Max-sum diversification, monotone submodular functions and dynamic updates. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems, pages 155–166. ACM, 2012.
 Borodin [2009] Alexei Borodin. Determinantal point processes. arXiv preprint arXiv:0911.1153, 2009.
 Campos et al. [2014] Pedro G Campos, Fernando Díez, and Iván Cantador. Time-aware recommender systems: a comprehensive survey and analysis of existing evaluation protocols. User Modeling and User-Adapted Interaction, 24(1-2):67–119, 2014.
 Chapelle and Zhang [2009] Olivier Chapelle and Ya Zhang. A dynamic bayesian network click model for web search ranking. In Proceedings of the 18th international conference on World wide web, pages 1–10. ACM, 2009.

 Cheng et al. [2016] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10. ACM, 2016.
 Chuklin et al. [2015] Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. Click models for web search. Synthesis Lectures on Information Concepts, Retrieval, and Services, 7(3):1–115, 2015.
 Covington et al. [2016] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198. ACM, 2016.
 Craswell et al. [2008] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 international conference on web search and data mining, pages 87–94. ACM, 2008.
 Dai et al. [2017] Hanjun Dai, Elias B Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. arXiv preprint arXiv:1704.01665, 2017.
 Diaz et al. [2013] Fernando Diaz, Ryen White, Georg Buscher, and Dan Liebling. Robust models of mouse movement on dynamic web search results pages. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 1451–1460. ACM, 2013.
 Dudík et al. [2011] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.
 Feng et al. [2018] Yue Feng, Jun Xu, Yanyan Lan, Jiafeng Guo, Wei Zeng, and Xueqi Cheng. From greedy selection to exploratory decision-making: Diverse ranking with policy-value networks. 2018.
 Gilotte et al. [2018] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline a/b testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 198–206. ACM, 2018.
 He et al. [2016] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 549–558. ACM, 2016.
 Khalil et al. [2017] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6351–6361, 2017.
 Koren et al. [2009] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.

 Kulesza et al. [2012] Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3):123–286, 2012.
 Leike et al. [2018] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.
 Li et al. [2015] Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web, pages 929–934. ACM, 2015.
 Lin et al. [2017] Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748, 2017.
 Ma and Hovy [2016] Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354, 2016.
 Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Sarwar et al. [2001] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, pages 285–295. ACM, 2001.
 Shani and Gunawardana [2011] Guy Shani and Asela Gunawardana. Evaluating recommendation systems. In Recommender systems handbook, pages 257–297. Springer, 2011.
 Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 Vinyals et al. [2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700, 2015.
 Wang et al. [2015] Chao Wang, Yiqun Liu, Meng Wang, Ke Zhou, Jian-Yun Nie, and Shaoping Ma. Incorporating non-sequential behavior into click models. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 283–292. ACM, 2015.
 Wilhelm et al. [2018] Mark Wilhelm, Ajith Ramanathan, Alexander Bonomo, Sagar Jain, Ed H Chi, and Jennifer Gillenwater. Practical diversified recommendations on youtube with determinantal point processes. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 2165–2173. ACM, 2018.
 Yan et al. [2017] Yan Yan, Gaowen Liu, Sen Wang, Jian Zhang, and Kai Zheng. Graph-based clustering and ranking for diversified image search. Multimedia Systems, 23(1):41–52, 2017.
 Yue and Guestrin [2011] Yisong Yue and Carlos Guestrin. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems, pages 2483–2491, 2011.
 Zhao et al. [2017] Qian Zhao, Gediminas Adomavicius, F Maxwell Harper, Martijn Willemsen, and Joseph A Konstan. Toward better interactions in recommender systems: cycling and serpentining approaches for top-n item lists. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW 2017), 2017.
 Zhao et al. [2018] Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. Deep reinforcement learning for pagewise recommendations. arXiv preprint arXiv:1805.02343, 2018.
 Zheng et al. [2018] Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. Drn: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 167–176. International World Wide Web Conferences Steering Committee, 2018.
 Zhu et al. [2007] Xiaojin Zhu, Andrew Goldberg, Jurgen Van Gael, and David Andrzejewski. Improving diversity in ranking using absorbing random walks. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 97–104, 2007.
 Ziegler et al. [2005] Cai-Nicolas Ziegler, Sean M McNee, Joseph A Konstan, and Georg Lausen. Improving recommendation lists through topic diversification. In Proceedings of the 14th international conference on World Wide Web, pages 22–32. ACM, 2005.