1. Introduction
The explosive growth and variety of information (e.g. movies, commodities, news) available on the web frequently overwhelms users, and Recommender Systems (RS) are a valuable means of coping with this information overload. RS usually provide the target user with a list of items selected from an overwhelming number of candidates in order to best satisfy his/her current demand. In the most traditional RS scenarios, especially on mobile devices, recommended items are shown in a waterfall flow, i.e. users scroll the screen and items are presented one by one. Due to the QPS (Queries Per Second) pressure of users interacting with RS servers, it is common to return a large number of ranked items (e.g. 50 items in Taobao RS) based on, for example, CTR (Click-Through Rate) estimation, and present them from top to bottom. (Here we take CTR as an example; other preference scores can also be used, e.g. Movie Rating, CVR (Conversion Rate) or GMV (Gross Merchandise Volume).) That is to say, we believe the top-ranked items have the best chance of being clicked or preferred, so that when users scroll the screen and see items top-down, the overall clicking efficiency is optimized. This can be seen as top-K recommendation (Cremonesi et al., 2010), because the ranking of the item list is important.
However, in many real-world recommendation applications, an exact number of items is shown to the user all at once. In other words, users do not need to scroll the screen, and the combination of items is shown as a whole card. Taking two popular RS on the homepages of Taobao and YouTube as examples (illustrated in Fig. 1), they recommend cards with exactly 4 commodities and 6 videos respectively. Note that items in the same card may interact with each other, e.g. in Taobao, the co-occurrence of “hat” and “scarf” performs better than that of “shoe” and “scarf”, even though “shoe” and “scarf” can each be optimal individually. We call this exact-K recommendation, whose key challenge is to maximize the chance of the whole card being clicked or satisfying the target user. Meanwhile, items in a card usually maintain some constraints between each other to guarantee the user experience in RS, e.g. the recommended commodities in E-commerce should have some diversity rather than all being similar, so that they complement each other. In a word, top-K recommendation can be seen as a ranking optimization problem which assumes that “better” items should be put into top positions, while exact-K recommendation is a (constrained) combinatorial optimization problem which tries to maximize the joint probability of the set of items in a card.
Top-K recommendation has been well studied for decades in the information retrieval (IR) research community. Among existing approaches, listwise models are the most related to our problem, as they also perform optimization over the whole item list. However, they either target ranking refinement or do not consider constraints in the ranking list, which makes them suboptimal for exact-K recommendation (refer to Sec. 2.1.2 for more discussion). Our work focuses on solving the exact-K recommendation problem end-to-end, and its main contributions can be summarized as follows.

We take the first step to formally define the exact-K recommendation problem and reduce it to a graph-based Maximal Clique Optimization problem.

To solve it, we propose Graph Attention Networks (GAttN) with an Encoder-Decoder framework, which can learn the joint distribution of the K items end-to-end and generate an optimal card containing K items. The encoder utilizes Multi-head Self-attention to encode the constructed undirected graph into node embeddings that take the correlations between nodes into account. Based on the node embeddings, the decoder generates a clique consisting of K items with an RNN and an attention mechanism, which can capture the combinational characteristics of the K items well. Beam search with masking is applied to meet the constraints. We then adopt a well-designed Reinforcement Learning from Demonstrations (RLfD) scheme, which combines the advantages of behavior cloning and reinforcement learning, making the training of GAttN both sufficient and efficient.

We conduct extensive experiments on three datasets (two constructed from the public MovieLens dataset and one collected from Taobao). Both quantitative and qualitative analyses justify the effectiveness and rationality of our proposed GAttN with RLfD for exact-K recommendation. Specifically, our method outperforms several strong baselines, with average improvements of 7.7% and 4.7% in Precision and Hit Ratio respectively.
2. Related Works
2.1. Top-K Recommendation
Top-K recommendation refers to recommending a list of ranked items to a user. It is related both to how the recommendation problem is formulated and to learning-to-rank methods.
2.1.1. The Recommendation Problem
The key problem of a recommender system lies in how to generate the list of items a user prefers most. Some previous works (Koren et al., 2009) model the recommendation problem as a regression task (i.e. predicting users’ ratings on items) or a classification task (i.e. predicting whether the user will click/purchase/… the item). Items are then ranked by the regression scores or classification probabilities to form the recommendation list. Other works (Rendle et al., 2009; Wang et al., 2017; He et al., 2017) directly model the recommendation problem as a ranking task, where many pairwise/listwise ranking methods are exploited to generate users’ top-K preferred items. Learning to rank is surveyed in detail in the next subsection.
2.1.2. Learning to Rank
Learning to Rank (LTR) refers to a group of techniques that attempt to solve ranking problems with machine learning algorithms. It can be broadly classified into three categories: pointwise, pairwise, and listwise models. Pointwise models (Koren et al., 2009; He et al., 2017) treat the ranking task as a classification or regression task. However, pointwise models do not consider the interdependency among instances in the final ranked list. Pairwise models (Rendle et al., 2009) assume that the relative order between two instances is known and transform ranking into a pairwise classification task. Note that their loss functions only consider the relative order between two instances, while the position of instances in the final ranked list can hardly be derived. Listwise models provide the opportunity to directly optimize ranking criteria and achieve whole-page ranking optimization. Recently, Ai et al. (2018) proposed the Deep Listwise Context Model (DLCM) to fine-tune the initial ranked list generated by a base model, which achieves SOTA performance. Other whole-page ranking optimization methods can be found in (Jiang et al., 2018), which mainly focus on ranking refinement. Listwise models are the most related to our problem. However, they either target ranking refinement or do not consider constraints in the ranking list, so they are not well suited to exact-K recommendation.
2.2. Neural Combinatorial Optimization
Even though machine learning (ML) and combinatorial optimization have each been studied for decades, there are few investigations into applying ML methods to combinatorial optimization problems. Current related works mainly focus on two types of ML methods: supervised learning and reinforcement learning. On the supervised learning side, Vinyals et al. (2015b) made the first successful attempt to solve a combinatorial optimization problem. They propose a special attention mechanism named Pointer-Net to tackle a classical combinatorial optimization problem, the Traveling Salesman Problem (TSP). Reinforcement learning (RL) aims to transform the combinatorial optimization problem into a sequential decision problem and has recently become increasingly popular. Based on the Pointer Network, Bello et al. (2016) develop a neural combinatorial optimization framework with RL, which performs excellently on some classical problems, such as TSP and the Knapsack Problem. RL has also been applied to RS (Zheng et al., 2018), but it is still designed for traditional top-K recommendation. In this work, we focus on exact-K recommendation, which we transfer into the maximal clique optimization problem. Some studies (Ion et al., 2011) also try to solve this problem, but they mostly focus on the estimation of node weights. The main difference between them and our work is that we aim to directly select an optimal clique rather than search for the clique composed of the maximum-weighted nodes, which poses a serious challenge.
3. Problem Definition
In this section, we first give a formal definition of exact-K recommendation, and then discuss how to transfer it to the Maximal Clique Optimization problem. Finally, we provide a baseline approach to tackle the above problem.
3.1. Exact-K Recommendation
Given a set of N candidate items, our goal is to recommend exactly K of them, shown as a whole card, so that the combination of items has the highest chance of being clicked by, or satisfying, a user. (Here we suppose that the permutation of the K items in a card is not considered in exact-K recommendation.) The quantity to maximize is the probability of the card being clicked/satisfied. The items in the card may additionally have to obey constraints between each other, or there may be no constraint at all; each constraint is expressed by a boolean indicator that is true if the two items satisfy it. Overall, the problem of exact-K recommendation can be regarded as a (constrained) combinatorial optimization problem, and is defined formally as follows:
(1)  
(2) 
where is the parameters for function of generating from given user , and donates relevance/preference indicator.
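As a sketch, the constrained problem described above can be written as follows. All symbols here are illustrative assumptions and not necessarily the notation of Eq. 1 and Eq. 2: $S$ is the set of $N$ candidates, $A \subseteq S$ is the card with $|A| = K$, $u$ is the user, $r = 1$ is the click/satisfaction event, $\theta$ denotes the parameters of the function that generates $A$, and $c(\cdot,\cdot) \in \{0, 1\}$ is the pairwise constraint indicator.

```latex
\begin{align}
\max_{A \subseteq S,\ |A| = K} \quad & P(A,\, r = 1 \mid S, u;\, \theta) \\
\text{s.t.} \quad & c(a_i, a_j) = 1 \quad \forall\, a_i, a_j \in A,\ i \neq j
\end{align}
```

When there is no constraint, the second line is simply dropped and every size-$K$ subset of the candidates is feasible.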
From another perspective, we construct a graph containing N nodes, in which each node represents an item in the candidate item set, and each edge connecting two nodes represents that the two corresponding items satisfy the constraints, or that there is no constraint (in which case the graph is complete). The graph is therefore undirected. Intuitively, we can transfer exact-K recommendation into the maximal clique optimization problem (Garey et al., 1974; Gong et al., 2016), which can be generalized according to the optimization objective. That is to say, we aim to select a clique with K nodes from the graph such that the combination of the corresponding items achieves the maximal objective defined in Eq. 1. A clique is a subset of nodes of an undirected graph such that every two distinct nodes in the clique are adjacent, that is, its induced subgraph is complete; this property ensures the constraints defined in Eq. 2. Fig. 2 gives an example. Furthermore, the maximal clique problem is proved to be NP-hard (Garey et al., 1974), so no polynomial-time algorithm is known for it.
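The graph construction above can be illustrated with a minimal Python sketch. The names `build_graph`, `is_clique`, and the `satisfies` callback are hypothetical helpers introduced only for illustration; the callback plays the role of the boolean constraint indicator.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable, Sequence, Set


def build_graph(
    items: Sequence[Hashable],
    satisfies: Callable[[Hashable, Hashable], bool],
) -> Dict[Hashable, Set[Hashable]]:
    """Build an undirected graph: one node per item, one edge per compatible pair."""
    graph: Dict[Hashable, Set[Hashable]] = {item: set() for item in items}
    for a, b in combinations(items, 2):
        if satisfies(a, b):
            # Undirected edge, so record it in both adjacency sets.
            graph[a].add(b)
            graph[b].add(a)
    return graph


def is_clique(graph: Dict[Hashable, Set[Hashable]], card: Sequence[Hashable]) -> bool:
    """A card is feasible when every pair of its items is connected by an edge."""
    return all(b in graph[a] for a, b in combinations(card, 2))
```

If `satisfies` always returns `True`, the graph is complete and every set of K items is a clique, which corresponds to the unconstrained case.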
3.2. Naive Node-Weight Estimation Method
A baseline method is to estimate, for each node in the graph, a weight that is related to the optimization objective. In exact-K recommendation, our goal is to maximize the probability that the recommended item set is clicked or satisfies the user, as in Eq. 1, so we regard the weight of each node as the CTR of the corresponding item. After getting the weight of each node, we can reduce the Maximal Clique Optimization problem to finding a clique in the graph with the maximal sum of node weights. We can then apply heuristic methods such as greedy search to solve it. Specifically, we modify Eq. 1 as follows:
(3)
where each term of the sum can be regarded as a node weight in the graph. Here we focus on how to estimate the node weights, which can be formulated as a standard item CTR estimation task in IR. A large number of LTR-based methods for CTR estimation can be adopted as our strong baselines. Refer to Sec. 2.1.2 for more details.
We call the adapted baseline the Naive Node-Weight Estimation Method; its detailed implementation is shown in Algorithm 1. (In our problem, we ignore the case of an infeasible solution: we argue that in real-world applications with small K and large N, a clique with K nodes can always be found greedily in a graph with N nodes.) However, this method has three obvious weaknesses: (1) the CTR estimation for each item is independent, (2) the combinational characteristics of the items in a card are not considered, and (3) the problem objective is not optimized directly but replaced with a reduced heuristic objective, which leads to suboptimal solutions. In contrast, in Sec. 4 we utilize a Neural Combinatorial Optimization framework (see related works in Sec. 2.2) to directly optimize the problem objective.
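A greedy search over node weights, as described above, might look like the following Python sketch. It is not a transcription of Algorithm 1; the function name `greedy_exact_k` and the `compatible` callback are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional


def greedy_exact_k(
    weights: Dict[str, float],
    k: int,
    compatible: Optional[Callable[[str, str], bool]] = None,
) -> List[str]:
    """Pick up to k items by descending weight while keeping the picks a clique.

    `weights` maps each candidate item to its estimated node weight (e.g. CTR).
    `compatible(a, b)` is True when the pair satisfies the constraint; when it
    is None, the graph is treated as complete (no constraint).
    """
    card: List[str] = []
    # Visit candidates from the largest weight to the smallest.
    for item in sorted(weights, key=weights.get, reverse=True):
        if compatible is None or all(compatible(item, chosen) for chosen in card):
            card.append(item)
        if len(card) == k:
            break
    # May hold fewer than k items if no feasible clique is found greedily.
    return card
```

Because each item is scored on its own, this sketch exhibits exactly the weaknesses listed above: interactions between the items in a card never enter the selection.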
4. Approach
In Sec. 3.1, we formally defined the exact-K recommendation problem as searching for the maximal-scoring clique with K nodes in a specially constructed graph with N nodes. The score of a clique is the overall probability of a user clicking, or being satisfied by, the corresponding card of items, as in Eq. 1. To tackle this specific problem, we first propose Graph Attention Networks (GAttN), which follow the Encoder-Decoder Pointer framework (Vinyals et al., 2015b) with an attention mechanism (Vaswani et al., 2017; Bahdanau et al., 2014). We then adopt Reinforcement Learning from Demonstrations (RLfD), which combines the advantages of behavior cloning (Torabi et al., 2018) and reinforcement learning, making the training of the networks both sufficient and efficient.
4.1. Graph Attention Networks
The traditional encoder-decoder framework (Sutskever et al., 2014) usually encodes an input sequence into a fixed vector with a recurrent neural network (RNN) and decodes a new output sequence from that vector using another RNN. For our exact-K recommendation problem, the encoder produces representations of all input nodes in the graph, and the decoder selects a clique among the input nodes by pointer, with the clique constraint enforced by masking.
4.1.1. Input
We first define the input representation of each node in the graph. Specifically, given the candidate item set and the user, we represent the input of a node by combining the features of the corresponding item and the user. Here we use a simple fully connected neural network with the nonlinear ReLU activation:
(4) 
where and are feature vectors for item and user (e.g trainable embeddings of corresponding item and user IDs), represents the concatenation of vector and , and are training parameters for input representation.
4.1.2. Encoder
First of all, since the order of nodes in an undirected graph is meaningless, the encoder network architecture should be permutation invariant, i.e. any permutation of the inputs results in the same output representations. A traditional encoder usually uses an RNN to convey sequential information (e.g. in text translation the relative position of words must be captured), which is not appropriate in our case. Secondly, the representation of a node should take the other nodes in the graph into account, as there can be underlying structures in the graph through which nodes influence each other. It is therefore helpful to represent a node with some attention to other nodes. As a result, we use a model based on self-attention, a special case of the attention mechanism that requires only a single sequence to compute its representation. Self-attention has been successfully applied to many NLP tasks (Yang et al., 2018); here we utilize it to encode the graph and produce node representations.
In this paper, the encoder we use is similar to the encoder of the Transformer architecture (Vaswani et al., 2017) with multi-head self-attention. Fig. 3(b) depicts the computation graph of the encoder. From the input feature vector of each node, the encoder first computes an initial graph node embedding (a.k.a. representation) through a learned linear projection:
(5) 
The node embeddings are then updated through selfattention layers, each consisting of two sublayers: a multihead selfattention (MHSA) layer followed by a feedforward (FF) layer. We denote with the graph node embeddings produced by layer , and the final output graph node embeddings as .
The basic component of MHSA is the scaled dot-product attention, which is a variant of dot-product (multiplicative) attention (Luong et al., 2015). Given a matrix of input embedding vectors, the scaled dot-product attention computes the self-attention scores based on the following equation:
(6) 
where the softmax is row-wise. More specifically, MHSA sublayers employ multiple attention heads to capture multiple kinds of attention information; the results from each head are concatenated and followed by a parameterized linear transformation to produce the sublayer outputs. Fig. 3(a) shows the computation graph of MHSA. Specifically, each layer operates on the output embedding matrix of the previous layer and produces the MHSA sublayer outputs as:
(7)
where 
where is the number of heads, is parameter for each head, is parameter for linear transformation output, and is the output dimension in each head. In addition to MHSA sublayers, FF sublayers consist of two linear transformations with a ReLU activation in between.
(8) 
where are parameter matrices, and represent embedding outputs of node in MHSA and FF sublayers correspondingly. We emphasize that those trainable parameters mentioned above are unique per layer. Furthermore, each sublayer adds a skipconnection (He et al., 2016) and layer normalization (Ba et al., 2016).
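To make the scaled dot-product self-attention concrete, here is a minimal pure-Python sketch. It is a deliberate simplification and an assumption on our part: a single head with queries, keys, and values all equal to the input embeddings, and without the learned per-head projections, feed-forward sublayer, skip-connections, or layer normalization described above.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Single-head scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    out: Matrix = []
    for q in x:
        # Score of this node against every node, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        attn = softmax(scores)
        # Output is the attention-weighted average of all node embeddings.
        out.append([sum(w * v[j] for w, v in zip(attn, x)) for j in range(d)])
    return out
```

Reordering the input nodes only reorders the output rows in the same way, so each node receives the same representation regardless of where it appears in the input, which is the order-independence property the encoder needs.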
4.1.3. Decoder
For the exact-K recommendation problem, the output represents a clique (card) with K nodes (items) in the graph, which can be interrelated with each other. Recently, RNNs (Vinyals et al., 2015b; Gong et al., 2018b; Gong et al., 2018a) have been widely used to map encoder embeddings to a correlated output sequence, and we do the same in our proposed framework. We call this RNN architecture the decoder, which decodes the output nodes. Recall that our goal is to optimize the probability defined in Sec. 3.1 (here we omit the relevance score); it is a joint probability and can be decomposed by the chain rule as follows:
(9)  
where we represent encoder as with trainable parameters , and decoder trainable parameters as . The last term in above Eq. 9 is estimated with RNN by introducing a state vector, , which is a function of the previous state , and the previous output node , i.e.
(10) 
where is computed as follows:
(11) 
where is usually a nonlinear function (e.g. cell in LSTM (Zhu et al., 2017) or GRU (Zhu et al., 2018)) that combines the previous state and previous output (embedding of the corresponding node from encoder) in order to produce the current state.
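For concreteness, the chain-rule factorization and the recurrent state update described above can be sketched as follows. The symbols are assumptions made for illustration and may differ from those of Eq. 9 to Eq. 11: $a_1, \dots, a_K$ are the decoded nodes, $h$ stands for the encoder output given the candidates and the user, $d_i$ is the decoder state at step $i$, and $g$ is the recurrent cell.

```latex
\begin{align}
p(a_1, \dots, a_K \mid h) &= \prod_{i=1}^{K} p(a_i \mid a_1, \dots, a_{i-1}, h), \\
p(a_i \mid a_1, \dots, a_{i-1}, h) &= p(a_i \mid d_i, h), \qquad d_i = g(d_{i-1}, a_{i-1}).
\end{align}
```

The first line is the exact chain rule; the second line is the modeling step in which the history $a_1, \dots, a_{i-1}$ is summarized by the recurrent state.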
Decoding happens sequentially: at each timestep, the decoder outputs one node based on the output embeddings from the encoder and the already generated outputs, which are embedded by the RNN hidden state. See Fig. 3(c) for an illustration of the decoding process. During decoding, the output distribution is implemented by a specific attention mechanism named pointer (Vinyals et al., 2015b), which attends to each node in the encoded graph and calculates the attention scores before applying the softmax function to get the probability distribution. It allows the decoder to look at the whole input graph at any time and select a member of the input nodes as the final output.
Given the decoder output hidden states and the encoder output graph node embeddings, at each timestep the decoder first glimpses (Vinyals et al., 2015a) the whole encoder outputs, and then computes a representation of the decoding so far together with attention to the encoded graph, as follows:
(12) 
where the weights of the attention are trainable parameters. After getting the representation of the decoder at the current timestep, we apply an attentive pointer with a masking scheme to generate a feasible clique from the graph. In our case, we use the following masking procedures: (1) nodes already decoded are not allowed to be pointed to, and (2) nodes are masked if they violate the clique constraint with respect to the already decoded subgraph. We compute the attention values as follows:
(13) 
where , , and are trainable parameters. Then softmax function is applied to get the pointing distribution towards input nodes, as follows:
(14) 
Note that the attention mechanism adopted in Eq. 12 and Eq. 13 follows Bahdanau et al. (2014). At inference time, we apply beam search in the decoder (Vinyals et al., 2015c). It expands the search space and tries more combinations of nodes in a clique (a.k.a. items in a card) to reach a better solution.
To summarize, the decoder receives the embedding representations of the nodes in the graph from the encoder, and selects a clique of K nodes with an attention mechanism. With the help of the RNN and beam search, the decoder in our proposed GAttN framework is able to capture the combinational characteristics of the items in a card.
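The interaction of masking and beam search during decoding can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the GAttN decoder itself: `logits_fn` is a hypothetical stand-in for the pointer attention scores given the partial card, and `adjacent` stands in for the edge test of the constraint graph.

```python
import math
from typing import Callable, List, Sequence, Tuple


def masked_softmax(logits: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Softmax over positions where mask is True; masked positions get 0."""
    allowed = [l for l, ok in zip(logits, mask) if ok]
    if not allowed:
        return [0.0] * len(logits)
    m = max(allowed)
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def beam_search_card(
    n: int,
    k: int,
    logits_fn: Callable[[Tuple[int, ...]], Sequence[float]],
    adjacent: Callable[[int, int], bool],
    beam_size: int = 3,
) -> Tuple[int, ...]:
    """Decode k node indices out of n, keeping every partial card a clique."""
    beams: List[Tuple[float, Tuple[int, ...]]] = [(0.0, ())]
    for _ in range(k):
        expanded = []
        for logp, partial in beams:
            # Mask (1) nodes already decoded and (2) nodes that would break
            # the clique constraint with the nodes decoded so far.
            mask = [
                i not in partial and all(adjacent(i, j) for j in partial)
                for i in range(n)
            ]
            probs = masked_softmax(logits_fn(partial), mask)
            for i, p in enumerate(probs):
                if p > 0.0:
                    expanded.append((logp + math.log(p), partial + (i,)))
        if not expanded:
            break  # no feasible extension; return the best partial card
        expanded.sort(key=lambda b: b[0], reverse=True)
        beams = expanded[:beam_size]
    return beams[0][1]
```

With `beam_size=1` this reduces to greedy decoding; larger beams keep several partial cards alive, so a slightly less likely first item can still win if it allows a better overall combination.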
4.2. Reinforcement Learning from Demonstrations
4.2.1. Overall
In our proposed GAttN framework, the encoder output can be seen as the state in RL, and the decoder can be seen as the policy. A Reinforcement Learning from Demonstrations (RLfD) agent, possessing both an objective reward signal and demonstrations, is able to combine the best of both fields. This framework was first proposed in the domain of robotics (Nair et al., 2018). Learning from demonstrations is much more sample efficient and can speed up learning, because the demonstrations direct what would otherwise be uniform random exploration. However, the demonstration trajectories may be noisy or suboptimal, so a policy supervised by such demonstrations will be suboptimal too. Moreover, learning from demonstrations does not directly target the objective, which can leave the policy stuck in a local minimum. On the other hand, training the policy by reinforcement learning can directly optimize the objective reward signal, which allows such an agent to eventually outperform a potentially suboptimal demonstrator.
4.2.2. Learning from Demonstrations
Learning from demonstrations can be seen as behavior cloning imitation learning (Torabi et al., 2018): it applies supervised learning to the policy (mapping states to actions), using the demonstration trajectories as ground truth. We collect the ground-truth clicked/satisfied cards, given the user and the candidate item set, as demonstration trajectories, and define the following loss function based on the cross entropy between the generated cards and the demonstrated cards:
(15)
where the state vector in the last term is estimated by the RNN defined in Eq. 11, taking the previous state and the previous ground-truth item as inputs. This means that the decoder model focuses on learning to output the next item of the card given the current state of the model and the previous ground-truth items.
SUPERVISE with Policy-sampling.
During inference, the model can generate a full card given the state by generating one item at a time until we get K items. In this process, at each timestep, the model needs the output item from the previous timestep as input in order to produce the next one. Since we do not have access to the true previous item, we can instead either select the most likely one under our model or sample from it. In both cases, if a wrong decision is taken at some timestep, the model can end up in a part of the state space that is very different from those visited under the training distribution, and for which it does not know what to do. Worse, this can easily lead to an accumulation of bad decisions. We call this problem the discrepancy between training and inference (Bengio et al., 2015).
In our work, we propose SUPERVISE with Policy-sampling to bridge the gap between training and inference of the policy. We change the training process from being fully guided by the true previous item to using the item generated by the trained policy instead. The loss function for Learning from Demonstrations now becomes:
(16)  
where the state vector is now computed by Eq. 11 from the previous state and an item sampled from the trained policy, rather than the previous ground-truth item.
4.2.3. Learning from Rewards
Reward Estimator.
The objective of exact-K recommendation is to maximize the chance that the selected card is clicked or satisfies the user, given the candidate item set and the user, as defined in Sec. 3.1 and Eq. 1. Leveraging the advantage of reinforcement learning, we can directly optimize this objective by regarding it as the reward function in RL. As there is no explicit reward in our problem, we estimate the reward function from the teacher’s demonstrations, following the idea of Inverse Reinforcement Learning (Abbeel and Ng, 2004). The reward function can then generalize better than supervision by demonstrations alone. In our problem, there is a large amount of explicit feedback data in which users click cards (labeled as 1) or not (labeled as 0), which forms our training dataset. We then cast the estimation of the reward function as CTR estimation for a card given a user, and the loss function for training it is as follows:
To model the reward function, we follow the work of PNN (Qu et al., 2016) and consider the feature crosses between the items of the card and the user. We define the reward function by the following equation:
(18)  
Here the equation uses vector concatenation, the inner product, and the sigmoid function; the item and user feature vectors are those defined in Sec. 4.1.1, and the remaining weights are the trainable parameters of the reward function.
REINFORCE with Hill-climbing.
After we get the optimized reward function represented as , we use policy gradient based reinforcement learning (REINFORCE) (Wang et al., 2017) to train the policy. And its loss function given previously defined dataset is derived as follows:
where is previously defined encoder state, is the delayed reward (Yu et al., 2017) obtained after the whole card is generated and is estimated by the following equation:
(20) 
here we rescale the value of reward between to .
One problem in training a REINFORCE policy is that the reward is delayed and sparse, so the policy may rarely receive a positive reward signal; the training procedure then becomes unstable and eventually falls into a local minimum. In order to effectively avoid non-optimal local minima and steadily increase the reward throughout training, we borrow the idea of Hill Climbing (HC), a heuristic search used for mathematical optimization problems (Duan et al., 2019). Instead of directly sampling a single card from the policy, we first stochastically sample a buffer of solutions (cards) from the policy and select the best one, then train the policy on it according to the REINFORCE loss above (Eq. 19). In this way, we always learn from a better solution to maximize the reward, train on it, and use the newly trained policy to generate an even better one.
4.2.4. Combination
To benefit from both Learning from Demonstrations and Learning from Rewards, we simply apply a linear combination of their loss functions and obtain the final loss as:
(21) 
where the two losses are formulated by Eq. 16 and Eq. 19 respectively, and the combination weight is a hyperparameter that should be tuned. The overall learning process is shown in Algorithm 2.
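The hill-climbing sampling of Sec. 4.2.3 and the linear combination above can be sketched in Python as follows. This is not Algorithm 2: `sample_card` and `reward` are hypothetical stand-ins for sampling from the policy and for the learned reward estimator, and the convex form `alpha * a + (1 - alpha) * b` is an assumed instance of a linear combination.

```python
import random
from typing import Callable, Tuple

Card = Tuple[int, ...]


def hill_climb_sample(
    sample_card: Callable[[random.Random], Card],
    reward: Callable[[Card], float],
    buffer_size: int,
    rng: random.Random,
) -> Tuple[Card, float]:
    """Draw `buffer_size` cards from the policy and keep the best-rewarded one."""
    buffer = [sample_card(rng) for _ in range(buffer_size)]
    best = max(buffer, key=reward)
    return best, reward(best)


def combined_loss(loss_demo: float, loss_reward: float, alpha: float) -> float:
    """Linear combination of the two losses (convex form assumed here)."""
    return alpha * loss_demo + (1.0 - alpha) * loss_reward
```

With `buffer_size=1` the first function degenerates to plain policy sampling; larger buffers trade extra reward evaluations for a training target that is at least as good as any single sample in the buffer.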
5. Experiments
In this section, we conduct experiments with the aim of answering the following research questions:

RQ1: Does our proposed GAttN with RLfD method outperform the baseline methods on the exact-K recommendation problem?

RQ2: How does our proposed Graph Attention Networks (GAttN) framework work for modeling the problem?

RQ3: How does our proposed optimization framework, Reinforcement Learning from Demonstrations (RLfD), work for training the model?
5.1. Experimental Settings
5.1.1. Datasets
We experiment with three datasets, whose statistics are summarized in Tab. 3. The first two datasets are constructed from MovieLens, and the last dataset is collected from a real-world exact-K recommendation system on the Taobao platform. The implementation details and parameter settings can be found in Appx. C.2.
MovieLens.
This movie rating dataset (http://grouplens.org/datasets/movielens/100k/) has been widely used to evaluate collaborative filtering algorithms in RS. As there is no public dataset for the exact-K recommendation problem, we construct one based on MovieLens. Since MovieLens contains explicit feedback, we first transform it into implicit data, where we regard 5-star ratings as positive feedback and treat all other entries as unknown feedback (Wang et al., 2017). Then we construct recommended cards for each user, each with a set of K items, where cards containing a positive item are regarded as positive cards (labeled as 1) and cards without any positive item are negative cards (labeled as 0). (The items in a card are randomly permuted; as we suppose in Sec. 3.1, the permutation of the items is not considered.) Meanwhile, the positive item in a card can be seen as the item in that card that the user actually clicked or preferred. Finally, we construct a candidate set with N items for each card of a user, where the items are randomly sampled from the whole item set for this user and must include all the items in that card. Examples of what the dataset looks like are shown in Tab. 4. In our experiments, we construct two datasets: 1) cards with 4 items along with 20 candidate items, and 2) cards with 10 items along with 50 candidate items. We call these two datasets MovieLens(K=4,N=20) and MovieLens(K=10,N=50). Notice that there is no constraint between items in the output card (as defined in Sec. 3.1) for these two datasets.
Taobao.
The two MovieLens-based datasets above are synthetic rather than real-world production data. In contrast, we also collect a dataset from the exact-K recommendation system on the Taobao platform, in which two days of samples are used for training and the samples of the following day for testing, with K=4 and N=50. We call it Taobao(K=4,N=50). Notice that this dataset has a required constraint between items in the output card: the normalized edit distance (NED) (Marzal and Vidal, 1993) between the titles of any two items must be larger than a given threshold (the constraint defined in Sec. 3.1), which guarantees the diversity of the items in a card.
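A title-diversity check of this kind can be sketched in Python as below. Note the assumptions: `simple_ned` divides the Levenshtein distance by the longer title length, which is a common simplification and not the exact normalization of Marzal and Vidal (1993), and `threshold` is a hypothetical parameter whose value is left to the caller.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def simple_ned(a: str, b: str) -> float:
    """Edit distance divided by the longer length (simplified normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def satisfies_diversity(a: str, b: str, threshold: float) -> bool:
    """True when two titles are dissimilar enough to share a card."""
    return simple_ned(a, b) > threshold
```

Such a predicate can serve directly as the edge test of the constraint graph: two items are connected only when their titles pass the check.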
5.1.2. Evaluation Protocol
For evaluation, we cannot use traditional ranking metrics such as nDCG or MAP. These metrics either require prediction scores for individual items or assume that “better” items should appear in top ranking positions, which makes them unsuitable for the exact-K recommendation problem.
Hit Ratio
Hit Ratio (HR) is a recall-based metric, measuring how many of the ground-truth items of the test card are in the predicted card with exactly K items. For exact-K recommendation, we use HR@K, formulated as follows:
(22) 
where is the number of testing samples, represents the number of items in a set.
Precision
Precision (P) measures whether the actually clicked (positive) item in the ground-truth card is also included in the predicted card with exactly K items, and is formulated as follows:
(23) 
where is the indicator function.
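The two metrics can be sketched in Python as follows. The exact formulas of Eq. 22 and Eq. 23 are not reproduced here; in particular, dividing the overlap by K in `hit_ratio_at_k` is an assumption consistent with the description above, and the function names are hypothetical.

```python
from typing import Hashable, Sequence, Set


def hit_ratio_at_k(
    predicted: Sequence[Set[Hashable]],
    truth: Sequence[Set[Hashable]],
    k: int,
) -> float:
    """Average fraction of ground-truth card items found in the predicted card."""
    total = sum(len(p & t) / k for p, t in zip(predicted, truth))
    return total / len(predicted)


def precision_at_k(
    predicted: Sequence[Set[Hashable]],
    clicked: Sequence[Hashable],
) -> float:
    """Fraction of test samples whose clicked item appears in the predicted card."""
    hits = sum(1 for p, c in zip(predicted, clicked) if c in p)
    return hits / len(predicted)
```

For example, with two test samples where the first predicted card shares 2 of 4 items with its ground-truth card and the second shares all 4, `hit_ratio_at_k` returns 0.75.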
5.1.3. Baselines
Our baseline methods are based on the Naive Node-Weight Estimation Method (Sec. 3.2), adapted to exact-K recommendation. The central part is to estimate the node weight, which can be seen as the CTR of the corresponding item. Therefore, LTR-based methods are applied, and we compare with the following:
Pointwise Model.
DeepRank model is a popular ranking method in production which applies DNNs and a pointwise ranking loss (a.k.a MLE) (He et al., 2017).
Pairwise Model.
Listwise Model.
The GRU-based listwise model (ListwiseGRU), a.k.a. DLCM (Ai et al., 2018), is a SOTA model for whole-page ranking refinement. It applies a GRU to encode the candidate items with a listwise ranking loss. In addition, we also compare with a listwise model based on the Multi-head Self-attention of Sec. 4.1.2, denoted ListwiseMHSA.
5.2. Performance Comparison (RQ1)
Tab. 1 shows the P@K and HR@K performance of the different methods on the three datasets. First, we can see that our method with the best setting (GAttN with RLfD) achieves the best performance on all three datasets, significantly outperforming the SOTA methods ListwiseMHSA and ListwiseGRU by a large margin (on average over the three datasets, the relative improvements for P@K and HR@K are 7.7% and 4.7%, respectively). Secondly, the results show that listwise methods (both ListwiseMHSA and ListwiseGRU) outperform the pointwise and pairwise baselines significantly. Listwise methods are therefore more suitable for exact-K recommendation, because they use context information to represent an item (node in the graph), as we claimed in Sec. 4.1.2. Moreover, ListwiseMHSA performs better than ListwiseGRU, which indicates the effectiveness of our proposed MHSA method for encoding the candidate items (graph nodes). More detailed analysis of our method GAttN with RLfD can be found in the following two subsections (RQ2 and RQ3).
Model          MovieLens (K=4,N=20)   MovieLens (K=10,N=50)   Taobao (K=4,N=50)
               P@4      HR@4          P@10     HR@10          P@4      HR@4
DeepRank       0.2120   0.1670        0.0854   0.1320         0.6857   0.6045
BPR            0.3040   0.2050        0.2350   0.1801         0.7357   0.6582
ListwiseGRU    0.4142   0.2423        0.4041   0.2144         0.7645   0.6942
ListwiseMHSA   0.4272   0.2465        0.4384   0.2168         0.7789   0.7176
Ours (best)    0.4743   0.2611        0.4815   0.2245         0.7958   0.7488
Impv.          11.0%    6.1%          9.8%     3.6%           2.2%     4.3%
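The arithmetic behind the P@K entries of the Impv. row can be retraced with a short sketch. It assumes that Impv. denotes the relative improvement of Ours (best) over the strongest baseline, ListwiseMHSA, which is our reading of Tab. 1 rather than a definition given in the text; the function and variable names are illustrative.

```python
# P@K values copied from Tab. 1.
baseline_p = {  # ListwiseMHSA
    "MovieLens (K=4,N=20)": 0.4272,
    "MovieLens (K=10,N=50)": 0.4384,
    "Taobao (K=4,N=50)": 0.7789,
}
ours_p = {  # Ours (best)
    "MovieLens (K=4,N=20)": 0.4743,
    "MovieLens (K=10,N=50)": 0.4815,
    "Taobao (K=4,N=50)": 0.7958,
}


def relative_improvement(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


improvements = {name: relative_improvement(ours_p[name], baseline_p[name])
                for name in baseline_p}
average_improvement = sum(improvements.values()) / len(improvements)

for name, value in improvements.items():
    print(f"{name}: {value:.1f}%")
print(f"average: {average_improvement:.1f}%")
```

Under this assumption the three P@K improvements round to 11.0%, 9.8% and 2.2%, and their average rounds to the 7.7% quoted in Sec. 5.2.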
5.3. Analysis for GAttN (RQ2)
Tab. 1 shows that ListwiseMHSA performs better than ListwiseGRU on both P@K and HR@K on all datasets, which indicates the effectiveness of applying MHSA to encode the candidate items (graph nodes). As we claimed in Sec. 4.1.2, the representation of a node should take the other nodes in the graph into account, since underlying structures in the graph may cause nodes to influence each other. Here we give an illustrative case, based on the Taobao dataset, of how selfattention works in the encoder (see Fig. 4). Taking the item “hat” in the graph as an example, the items with larger attention weights to it are kinds of “scarf”, “glove” and “hat”. This is reasonable, since users focusing on “hat” tend to prefer “scarf” rather than “umbrella”. So to represent the item “hat” in the graph, it is helpful to attend more to the features of items like “scarf”.
Beam search is used in the GAttN decoder (see Sec. 4.1.3) to expand the search space and try more combinations of items in a card in order to reach a better solution. A critical hyperparameter for beam search is the beam size, which indicates how many solutions are searched at each decoding timestep. We tune the beam size in Fig. 5 and find that a larger beam size leads to better performance on both P@K and HR@K (we only report the results on MovieLens(K=4,N=20); the other datasets follow the same conclusion). However, once the beam size exceeds 3 the improvement becomes minor, so for efficiency we set the beam size to 3 in our experiments.
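To make the role of the beam size concrete, the following is a minimal, generic beam search sketch for picking K distinct items from a candidate set. It is not the GAttN decoder itself: the scoring function `toy_log_prob` is a hypothetical, unnormalized stand-in for the decoder's step-wise log-probabilities, and all names are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer: log-score of appending `item` to the partial card.
ScoreFn = Callable[[Tuple[int, ...], int], float]


def beam_search_card(candidates: Sequence[int], k: int, beam_size: int,
                     log_prob: ScoreFn) -> Tuple[int, ...]:
    """Pick k distinct items, keeping the beam_size best partial cards per step."""
    if k > len(set(candidates)):
        raise ValueError("k exceeds the number of distinct candidates")
    beams: List[Tuple[float, Tuple[int, ...]]] = [(0.0, ())]
    for _ in range(k):
        expanded = []
        for score, card in beams:
            for item in candidates:
                if item in card:  # an item may appear only once in a card
                    continue
                expanded.append((score + log_prob(card, item), card + (item,)))
        # Keep only the highest-scoring partial cards for the next step.
        expanded.sort(key=lambda entry: entry[0], reverse=True)
        beams = expanded[:beam_size]
    return beams[0][1]


def toy_log_prob(card: Tuple[int, ...], item: int) -> float:
    """Toy scores: item 0 looks best alone, but the pair (1, 2) is best overall."""
    if not card:
        return math.log({0: 0.5, 1: 0.4}.get(item, 0.05))
    if card == (1,) and item == 2:
        return math.log(0.9)
    return math.log(0.3)


greedy_card = beam_search_card([0, 1, 2, 3], k=2, beam_size=1, log_prob=toy_log_prob)
wider_card = beam_search_card([0, 1, 2, 3], k=2, beam_size=2, log_prob=toy_log_prob)
```

With a beam size of 1 the search commits to item 0 and ends with the card (0, 1), while a beam size of 2 keeps item 1 alive long enough to find the higher-scoring card (1, 2). This mirrors, on toy numbers, why a larger beam can find better combinations at extra cost.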
5.4. Analysis for RLfD (RQ3)
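To verify how our proposed optimization framework RLfD works, we conduct the following ablation tests:
Apply Learning from Rewards with or without hillclimbing (Sec. 4.2.3).
Apply Learning from Demonstrations with or without policysampling (Sec. 4.2.2).
Finetune the coefficient in Eq. 21 and examine its influence on the combination of SL with RL.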
Here we abbreviate Learning from Demonstrations and Learning from Rewards as SL and RL, respectively. Tab. 2 gives the overall results; for brevity we only report on the dataset MovieLens(K=4,N=20).
Row  Settings in RLfD                                MovieLens (K=4,N=20)
                                                     P@4      HR@4
1    RL(w/o hillclimbing)                            0.3340   0.2314
2    RL(w/ hillclimbing)                             0.3573   0.2330
3    SL(w/o policysampling)                          0.4095   0.2401
4    SL(w/ policysampling)                           0.4272   0.2465
5    RL(w/o hillclimbing) + SL(w/ policysampling)    0.4495   0.2514
6    RL(w/ hillclimbing) + SL(w/o policysampling)    0.4472   0.2534
7    RL(w/ hillclimbing) + SL(w/ policysampling)     0.4743   0.2611
Fig. 6 shows the learning curves with respect to Reward (defined in Eq. 20), P@4 and HR@4 for RL with (w/) or without (w/o) the hillclimbing proposed in Sec. 4.2.3. From the curves, we find that with the help of hillclimbing, REINFORCE training becomes more stable, steadily improves the performance, and finally achieves a better solution (row 1 vs. 2 and row 5 vs. 7 in Tab. 2). Another insight from Fig. 6 is that the learning curve of Reward moves in step with those of P@4 and HR@4, which verifies that our reward function effectively directs training towards the objective of the problem.
Fig. 7 shows the learning curves with respect to P@4 and HR@4 for SL with (w/) or without (w/o) the policysampling proposed in Sec. 4.2.2. We observe that in the first 50 iterations SL with policysampling may perform worse than SL without it. We believe that in the first steps of training the learned policy can be poor, so feeding the output sampled from such a policy to the next decoder timestep as input leads to worse performance. However, as training goes on, SL with policysampling converges better because it corrects the inconsistency between training and inference of the policy, and it finally achieves better performance (row 3 vs. 4 and row 6 vs. 7 in Tab. 2).
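The difference between the two SL variants discussed above comes down to what is fed to the decoder at the next timestep. The sketch below illustrates that choice only, under stated assumptions: the `Policy` interface and the uniform toy policy are hypothetical placeholders, and the details of Sec. 4.2.2 may differ.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical policy interface: maps the items fed so far to a probability
# distribution over the next item.
Policy = Callable[[List[int]], Dict[int, float]]


def decoder_inputs(demo_card: Sequence[int], policy: Policy,
                   policy_sampling: bool, rng: random.Random) -> List[int]:
    """Items fed back into the decoder after each of the first K-1 steps."""
    fed: List[int] = []
    for step in range(len(demo_card) - 1):
        if policy_sampling:
            # Feed an item sampled from the current policy, as at inference.
            probs = policy(fed)
            items = list(probs)
            weights = [probs[item] for item in items]
            fed.append(rng.choices(items, weights=weights, k=1)[0])
        else:
            # Feed the demonstrated (groundtruth) item.
            fed.append(demo_card[step])
    return fed


def uniform_policy(fed: List[int]) -> Dict[int, float]:
    """Toy policy: uniform over candidate items 1..6 that were not fed yet."""
    remaining = [item for item in range(1, 7) if item not in fed]
    return {item: 1.0 / len(remaining) for item in remaining}


demo = [3, 1, 4, 2]  # hypothetical demonstrated card with K = 4
without_sampling = decoder_inputs(demo, uniform_policy, False, random.Random(0))
with_sampling = decoder_inputs(demo, uniform_policy, True, random.Random(0))
```

Without policysampling the decoder always sees the demonstrated prefix, whereas with policysampling it sees its own samples. This is why an immature policy can hurt early in training yet reduce the mismatch with inference later on.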
In Fig. 8 we tune the hyperparameter defined in Eq. 21, which represents the tradeoff between SL and RL in the training process. We observe that the best results on both P@4 and HR@4 are achieved at an intermediate value (see Fig. 8). The performance increases as the hyperparameter is tuned from 0 to the optimal value and drops afterwards, which indicates that properly combining the SL and RL losses yields the best solution. Furthermore, we find that applying only the SL loss gives a preliminary suboptimal policy, and that involving some degree of RL loss then directs the policy towards better solutions, which verifies the sufficiency and efficiency of our proposed Reinforcement Learning from Demonstrations for training the policy.
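As a rough illustration of such a tradeoff, the sketch below mixes two scalar losses with a single weight. It assumes a convex combination, which is a common choice but not necessarily the exact form of Eq. 21; the name `sl_weight`, its relation to the paper's coefficient, and the loss values are all hypothetical.

```python
def combined_loss(sl_loss: float, rl_loss: float, sl_weight: float) -> float:
    """Convex combination of an SL loss and an RL loss (assumed form)."""
    if not 0.0 <= sl_weight <= 1.0:
        raise ValueError("sl_weight must lie in [0, 1]")
    return sl_weight * sl_loss + (1.0 - sl_weight) * rl_loss


# Hypothetical loss values, swept over three settings of the weight.
sweep = {weight: combined_loss(2.0, 4.0, weight) for weight in (0.0, 0.5, 1.0)}
```

In this parameterization a weight of 1 keeps only the SL loss and a weight of 0 keeps only the RL loss; intermediate weights blend the two signals.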
6. Conclusion and Future Work
This work targets a practical recommendation problem named exactK recommendation, and we prove that it is different from traditional topK recommendation. We first give a formal problem definition and then reduce it to a Maximal Clique Optimization problem, which is a combinatorial optimization problem and NP-hard. To tackle this specific problem, we propose a novel approach, GAttN with RLfD. In our evaluation, we perform extensive analysis to demonstrate the strongly positive effect of the proposed method on the exactK recommendation problem. In future work, we plan to adopt adversarial training for the Reward Estimator and REINFORCE learning components, regarding them as the discriminator and the generator from a GAN perspective (Wang et al., 2017). Moreover, further online A/B testing in production will be conducted.
References
 Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In ICML.
 Ai et al. (2018) Qingyao Ai, Keping Bi, Jiafeng Guo, and W Bruce Croft. 2018. Learning a Deep Listwise Context Model for Ranking Refinement. In SIGIR.
 Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. In arXiv.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In arXiv.
 Bello et al. (2016) Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. 2016. Neural Combinatorial Optimization with Reinforcement Learning. In arXiv.
 Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS.
 Cremonesi et al. (2010) Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In RecSys.
 Duan et al. (2019) Lu Duan, Haoyuan Hu, Yu Qian, Yu Gong, Xiaodong Zhang, Yinghui Xu, and Jiangwen Wei. 2019. A Multitask Selected Learning Approach for Solving 3D Flexible Bin Packing Problem. In AAMAS.
 Garey et al. (1974) Michael R Garey, David S Johnson, and Larry Stockmeyer. 1974. Some simplified NP-complete problems. In STOC.
 Gong et al. (2018a) Yu Gong, Xusheng Luo, Kenny Q Zhu, Wenwu Ou, Zhao Li, and Lu Duan. 2018a. Automatic generation of chinese short product titles for mobile display. In arXiv.
 Gong et al. (2018b) Yu Gong, Xusheng Luo, Yu Zhu, Wenwu Ou, Zhao Li, Muhua Zhu, Kenny Q Zhu, Lu Duan, and Xi Chen. 2018b. Deep Cascade Multitask Learning for Slot Filling in Online Shopping Assistant. In arXiv.
 Gong et al. (2016) Yu Gong, Kaiqi Zhao, and Kenny Qili Zhu. 2016. Representing verbs as argument concepts. In AAAI.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
 He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW.
 Ion et al. (2011) Adrian Ion, João Carreira, and Cristian Sminchisescu. 2011. Image segmentation by figure-ground composition into maximal cliques. In ICCV.
 Jiang et al. (2018) Ray Jiang, Sven Gowal, Yuqiu Qian, Timothy Mann, and Danilo J Rezende. 2018. Beyond Greedy Ranking: Slate Optimization via List-CVAE. In arXiv.
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer (2009).
 Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In arXiv.
 Marzal and Vidal (1993) Andres Marzal and Enrique Vidal. 1993. Computation of normalized edit distance and applications. PAMI (1993).
 Nair et al. (2018) Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. 2018. Overcoming exploration in reinforcement learning with demonstrations. In ICRA.
 Qu et al. (2016) Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In ICDM.
 Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
 Torabi et al. (2018) Faraz Torabi, Garrett Warnell, and Peter Stone. 2018. Behavioral Cloning from Observation. In arXiv.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
 Vinyals et al. (2015a) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015a. Order matters: Sequence to sequence for sets. In arXiv.
 Vinyals et al. (2015b) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015b. Pointer networks. In NIPS.
 Vinyals et al. (2015c) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015c. Show and tell: A neural image caption generator. In CVPR.
 Wang et al. (2017) Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. Irgan: A minimax game for unifying generative and discriminative information retrieval models. In SIGIR.
 Yang et al. (2018) Yunlun Yang, Yu Gong, and Xi Chen. 2018. Query Tracking for Ecommerce Conversational Search: A Machine Comprehension Perspective. In CIKM.
 Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In AAAI.
 Zheng et al. (2018) Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A Deep Reinforcement Learning Framework for News Recommendation. In WWW.
 Zhu et al. (2017) Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2017. What to Do Next: Modeling User Behaviors by Time-LSTM. In IJCAI.
 Zhu et al. (2018) Yu Zhu, Junxiong Zhu, Jie Hou, Yongliang Li, Beidou Wang, Ziyu Guan, and Deng Cai. 2018. A brand-level ranking system with the customized attention-GRU model. In IJCAI.
Appendix A Naive NodeWeight Estimation Method
Appendix B Reinforcement Learning from Demonstrations
Appendix C Experimental Settings
C.1. Datasets
Dataset                 User#    Card#    Item#     Sample#
MovieLens(K=4,N=20)     817      40036    1630      40036
MovieLens(K=10,N=50)    485      33196    1649      33198
Taobao(K=4,N=50)        581055   310509   3148550   1116582
           user  card     candidate items  card label  positive item
sample#1   1     1,2,3,4  1,2,3,4,…,20     1           2
sample#2   1     1,4,5,6  1,2,3,4,…,20     0           /
…
(We take K=4 and N=20 as an example. Items and users are represented as IDs here. The card label indicates whether the card is clicked or satisfied by the user (labeled as 1) or not (labeled as 0). The positive item is the item in the card actually clicked by the user.)
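The sample layout illustrated above can be mirrored by a small record type. The sketch below is an illustrative assumption rather than the released data format: the class and field names are hypothetical, the unclicked card's "/" is represented as `None`, and the helper shows one way to perform a 4:1 random split of the kind mentioned in Appx. C.2.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    user: int
    card: Tuple[int, ...]         # the K items shown together
    candidates: Tuple[int, ...]   # the N candidate items
    card_label: int               # 1 if the card was clicked/satisfied, else 0
    positive_item: Optional[int]  # clicked item, or None for an unclicked card


def split_4_to_1(samples: Sequence[Sample],
                 seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle the samples and split them 4:1 into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


candidates = tuple(range(1, 21))  # N = 20
rows = [
    Sample(1, (1, 2, 3, 4), candidates, 1, 2),     # sample#1
    Sample(1, (1, 4, 5, 6), candidates, 0, None),  # sample#2
]
# Repeat the two illustrated rows to obtain a toy set of 10 samples.
train, test = split_4_to_1(rows * 5)
```

With 10 toy samples the split yields 8 training and 2 test samples; the fixed seed only serves to make the sketch reproducible.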
C.2. Implementation and Parameter Settings
Here we report implementation details for the three datasets (two MovieLens based datasets and one Taobao based dataset; see https://github.com/pangolulu/exactkrecommendation). Our implementation is based on TensorFlow (https://www.tensorflow.org/). To construct the training and test sets, we perform a 4:1 random splitting as in (Wang et al., 2017) for all the datasets.
C.2.1. MovieLens
Note that MovieLens(K=4,N=20) and MovieLens(K=10,N=50) share the same parameter settings. For a fair comparison, all models use the same embedding size for item and user IDs and are optimized using the minibatch Adam optimizer with the same batch size and learning rate, and all models are trained for the same number of epochs. All trainable feedforward parameter matrices share the same input and output dimension (this includes DeepRank, BPR, and all the RNN cells in ListwiseGRU, ListwiseMHSA and ours). Specifically for our GAttN model, the decoder (Sec. 4.1.3) uses LSTM cells with 32 units and a beam size of 3 (as chosen in Sec. 5.3), while the number of heads in the encoder MHSA layer (Sec. 4.1.2) and the coefficient in the loss function (Sec. 4.2.4) are the remaining model-specific settings. The number of layers in both the encoder and the decoder is set to 2. For the reward estimator model (Sec. 4.2.3), we set the hidden size of the fully connected layer to 128.
C.2.2. Taobao
In this dataset, the feature vectors for user and item are statistical features with sizes of 40 and 52 respectively, instead of ID features. Example statistical features are PV (page view), IPV (item page view), GMV (gross merchandise volume), CTR (click through rate) and CVR (conversion rate) over 1 day, 7 days and 14 days, etc. For this dataset, we first project the input representations of user and item to 32 dimensions, i.e., we set the corresponding user and item dimensions in Sec. 4.1.1 to 32. All the other hyperparameters are set the same as those on the MovieLens based datasets (refer to Appx. C.2.1).