Maximizing Cumulative User Engagement in Sequential Recommendation: An Online Optimization Perspective

06/02/2020 ∙ by Yifei Zhao, et al. ∙ Microsoft

To maximize cumulative user engagement (e.g., cumulative clicks) in sequential recommendation, one often needs to trade off two potentially conflicting objectives: pursuing higher immediate user engagement (e.g., click-through rate) and encouraging user browsing (i.e., more items exposed). Existing works often study these two tasks separately and thus tend to produce sub-optimal results. In this paper, we study this problem from an online optimization perspective, and propose a flexible and practical framework to explicitly trade off longer user browsing length against high immediate user engagement. Specifically, by considering items as actions, the user's requests as states, and user leaving as an absorbing state, we formulate each user's behavior as a personalized Markov decision process (MDP), and the problem of maximizing cumulative user engagement reduces to a stochastic shortest path (SSP) problem. Meanwhile, with immediate user engagement and quit probability estimated, the SSP problem can be solved efficiently via dynamic programming. Experiments on real-world datasets demonstrate the effectiveness of the proposed approach. Moreover, this approach has been deployed on a large E-commerce platform, achieving an improvement of over 7% in cumulative clicks.




1. Introduction

In recent years, sequential recommendation has drawn attention due to its wide application in various domains (Chen et al., 2018; Tang and Wang, 2018; Huang et al., 2018; Donkers et al., 2017; Ebesu et al., 2018), such as E-commerce, social media, and digital entertainment. It even props up popular stand-alone products such as Toutiao and Tik Tok.

Figure 1. Sequential Recommendation Procedure

Different from traditional recommender systems, where it is often assumed that the number of recommended items is fixed, the most important feature of sequential recommendation is that it iteratively recommends items until the user quits (as depicted in Figure 1), which means that users can browse an unbounded number of items if they want. Its goal is to maximize cumulative user engagement in each session, such as cumulative clicks, cumulative dwell time, etc. To this end, the recommender system needs to achieve two objectives simultaneously:

  1. Attracting users to have a longer session such that more items can be browsed;

  2. Capturing user interests such that higher immediate user engagement can be achieved.

In traditional recommender systems, since the number of recommended items is fixed, most effort is spent on improving immediate user engagement, often measured by click-through rate, etc. However, when such a strategy is adopted for sequential recommendation, it tends to yield sub-optimal cumulative user engagement, due to the limited number of browsed items. Moreover, due to their inherent conflict, it is not a trivial task to achieve a longer session and higher immediate user engagement simultaneously (as demonstrated in the experiments). For example, to achieve a longer session, it is generally necessary to explore diverse recommendation results, which almost surely sacrifices immediate user engagement. Therefore, how to trade off a longer session against higher immediate user engagement becomes critical for achieving higher cumulative user engagement, and this is essentially the key problem of sequential recommendation.

Generally speaking, existing works on sequential recommendation fall into two groups. The first group tries to leverage sequential information (e.g., users' interaction behavior) to estimate the probability of user engagement (e.g., click-through rate) more accurately (Chen et al., 2018; Tang and Wang, 2018; Huang et al., 2018; Donkers et al., 2017; Ebesu et al., 2018), for example, by using recurrent neural networks or their variants (Chen et al., 2018; Wu et al., 2017; Donkers et al., 2017). By exploiting sequential behavior patterns, these methods focus on capturing user interests more accurately, but they do not consider extending the session length and thus may lead to sub-optimal results. Based on the observation that diverse results tend to attract users to browse more items, the second group of methods explicitly considers the diversity of recommendation results (Teo et al., 2016; Devooght and Bersini, 2017; Carbonell and Goldstein, 1998). However, the relationship between diversity and user browsing length is mostly empirical, so optimizing diversity directly is not well founded, especially since there are still no well-accepted diversity measures. Therefore, optimizing cumulative user engagement in sequential recommendation remains a challenge.

In this paper, we consider the problem of maximizing cumulative user engagement in sequential recommendation from an online optimization perspective, and propose a flexible and practical framework to solve it. Specifically, by considering different items as different actions, the user's successive requests as states, and user leaving as an absorbing state, we model the user browsing process in the Markov decision process (MDP) framework; as a consequence, the problem of maximizing cumulative user engagement reduces to a stochastic shortest path (SSP) problem. To make this framework practical, at each state (except the absorbing state), we need to know two probabilities for each possible action: the probability of achieving user engagement (e.g., a click) and the probability of transitioning to the absorbing state, i.e., of the user quitting the browsing process. The problem of estimating the probability of user engagement has been well studied, and many existing machine learning methods can be employed. Meanwhile, we propose a multi-instance learning method to estimate the probability of transitioning to the absorbing state (i.e., a user quit model). With this framework and the corresponding probabilities effectively estimated, the SSP problem can be solved efficiently via dynamic programming. Experiments on real-world datasets and an online E-commerce platform demonstrate the effectiveness of the proposed approach.

In summary, our main contributions are listed as follows:

  • We solve the problem of maximizing cumulative user engagement within an online optimization framework. Within this framework, we explicitly trade off a longer user browsing session against high immediate user engagement to maximize cumulative user engagement in sequential recommendation.

  • Within the online optimization framework, we propose a practical approach which is efficient and easy to implement in real-world applications. In this approach, existing works on user engagement estimation can be exploited, a new multi-instance learning method is used for user quit model estimation, and the corresponding optimization problem can be efficiently solved via dynamic programming.

  • Experiments on real-world datasets demonstrate the effectiveness of the proposed approach, and detailed analysis shows the correlation between user browsing and immediate user engagement. Moreover, the proposed approach has been deployed on a large E-commerce platform and achieves over 7% improvement on cumulative clicks.

The rest of the paper is organized as follows. In Section 2, we discuss related works. The problem statement is given in Section 3. Section 4 presents the framework, named MDP-SSP, and the related algorithms. Experiments are reported in Section 5, and finally we conclude in Section 6.

2. Related work

2.1. Sequential Recommendation

In recent years, conventional recommendation methods, e.g., RNN models (Hidasi et al., 2015; Donkers et al., 2017; Huang et al., 2018) and memory networks with attention (Chen et al., 2018; Ebesu et al., 2018; Huang et al., 2018), have been widely applied in sequential recommendation scenarios. To find the next item to recommend, RNN models capture the user's sequential patterns by utilizing historic sequential information. One can also train a memory network and introduce an attention mechanism to weight sequential elements. (Tang and Wang, 2018; Donkers et al., 2017) show that these methods significantly outperform classic ones that ignore the sequential information. Essentially, however, they still estimate the immediate user engagement (i.e., click-through rate) on the next item, without considering the quit probability. Therefore, further improvements are necessary to maximize cumulative user engagement.

2.2. MDP and SSP

Stochastic Shortest Path (SSP) is a stochastic version of the classical shortest path problem: for each node of a graph, we must choose a probability distribution over the set of successor nodes so as to reach a certain destination node with minimum expected cost (Bertsekas and Tsitsiklis, 1991; Polychronopoulos and Tsitsiklis, 1996). The SSP problem is essentially a Markov Decision Process (MDP) problem under the assumption that there exist an absorbing state and a proper policy. Variants of dynamic programming can be adopted to solve the problem (Bonet and Geffner, 2003; Kolobov et al., 2011; Trevizan et al., 2016; Barto et al., 1995). Real-Time Dynamic Programming (RTDP) is an algorithm for solving non-deterministic planning problems with full observability, which can be understood either as a heuristic search or as a dynamic programming (DP) procedure (Barto et al., 1995). Labeled RTDP (Bonet and Geffner, 2003) is a variant of RTDP whose key idea is to label a state as solved once the state and its successors have converged; solved states are not updated further.

2.3. Multi-Instance Learning

In multi-instance learning (MIL) tasks, each example is represented by a bag of instances (Dietterich et al., 1997). A bag is positive if it contains at least one positive instance, and negative otherwise. According to (Amores, 2013), MIL approaches fall into three paradigms: the instance-space paradigm, the bag-space paradigm, and the embedded-space paradigm. For our sequential recommendation setting, the need to model the transition probability at the item level matches the instance-space paradigm. Several SVM-based methods have been proposed for instance-level MIL tasks (Maron and Lozano-Pérez, 1998; Zhang and Goldman, 2002; Andrews et al., 2003; Bunescu and Mooney, 2007). MI-SVM is one such SVM-based MIL approach; its main idea is to treat, in each iteration, the instance farthest from the decision hyperplane (with the largest margin) in each positive bag as that bag's positive witness.

3. Problem Statement

We model each browsing process as a personalized Markov Decision Process (MDP) including an absorbing state, and consider the problem of maximizing cumulative user engagement as a stochastic shortest path (SSP) problem.

3.1. Personalized MDP Model

The MDP consists of a tuple with four elements (S, A, R, P):

  • State space S. Here we take each step in the recommendation sequence as an individual state and define S = {s_1, s_2, ..., s_T, s_A}, where t is the step index. Since only one item is shown to the user in each step, t is also the sequence number of the browsed items. T is the upper limit of the browsing session length, chosen large enough for the recommendation scenarios. s_A is defined as the absorbing state, meaning that the user has left.

  • Action space A. The action space contains all candidate items that could be recommended in the present session.

  • Reward R. Denote s_t as a state in S and a as an action in A; then r(s_t, a) is the reward after taking action a in state s_t. Specifically, r(s_t, a) is the immediate user engagement (e.g., click-through rate) in the t-th step.

  • Transition probability P = {p(s' | s, a)}, where p(s' | s, a) is the probability of transiting from state s to state s' after taking action a.

Since the states in S are sequential, we impose a restriction on P: from every state except s_T and s_A, the user can only transit to the next state (go on browsing) or jump into the absorbing state (quit). Moreover, from the last browsing step s_T, the user can only jump into the absorbing state. Formally, writing p(s_t, a_t) for the probability that the user continues browsing after taking action a_t in state s_t, we have

    p(s_{t+1} | s_t, a_t) = p(s_t, a_t),      1 <= t < T,
    p(s_A | s_t, a_t) = 1 - p(s_t, a_t),      1 <= t < T,        (1)
    p(s_A | s_T, a_T) = 1.
The finite-state machine of the procedure is shown in Figure 2. Furthermore, it should be emphasized that the proposed MDP model is personalized: we infer a new MDP model for each online session. An efficient algorithm for generating MDP models will be presented later.
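The restricted transition structure above can be sketched in a few lines of code. This is an illustrative snippet, not the paper's code; the continue-probability list `p_cont` is a hypothetical per-step estimate for a fixed action sequence.

```python
# Sketch: transitions of the personalized MDP. States are 1..T plus an
# absorbing state 'absorb'; from every step t < T the user either moves
# to step t+1 (continues browsing) or quits, and from step T the user
# always quits. `p_cont[t-1]` is the (hypothetical) continue probability
# after the item shown at step t.
def transition_probs(p_cont, t, T):
    """Return {next_state: probability} from state s_t."""
    if t < T:
        return {t + 1: p_cont[t - 1], "absorb": 1.0 - p_cont[t - 1]}
    return {"absorb": 1.0}

probs = transition_probs([0.9, 0.8, 0.7], t=1, T=3)
```

Every row of the implied transition matrix has at most two non-zero entries, which is what later makes backward induction cheap.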

Figure 2. MDP Finite-State Machine

3.2. SSP Problem

Based on the MDP model, the optimization of cumulative rewards in sequential recommendation can be formally formulated as an SSP problem: given an MDP, the objective is to find a policy π: S → A that plans a path with maximal expected cumulative rewards, i.e.,

    max_π  E[ Σ_{t=1}^{L} r(s_t, a_t) ],        (2)

where L is the actual browse length and a_t = π(s_t). The distribution of L can be derived as

    P(L = l) = (1 - p(s_l, a_l)) ∏_{t=1}^{l-1} p(s_t, a_t),  1 <= l < T,
    P(L = T) = ∏_{t=1}^{T-1} p(s_t, a_t).        (3)

Thus the expected cumulative rewards in Equation (2) can be represented as

    E[ Σ_{t=1}^{L} r(s_t, a_t) ] = Σ_{l=1}^{T} P(L = l) Σ_{t=1}^{l} r(s_t, a_t).        (4)

Finally, by introducing Equation (1) into Equation (4), we have

    E[ Σ_{t=1}^{L} r(s_t, a_t) ] = Σ_{t=1}^{T} ( ∏_{i=1}^{t-1} p(s_i, a_i) ) r(s_t, a_t).        (5)
3.2.1. Remark 1.

Maximizing Equation (5) simultaneously optimizes the two points mentioned in the Introduction: 1) the user browse length, through the continuation terms ∏_{i=1}^{t-1} p(s_i, a_i), and 2) the immediate user engagement, through the reward terms r(s_t, a_t).

According to the formulation, we should first estimate r(s_t, a) and p(s_t, a) in Equation (5), which essentially amounts to generating a personalized MDP model. Then we optimize a policy by maximizing Equation (5), which can be used to plan a recommendation sequence (called a path in SSP) for the corresponding user.
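As a sketch of how the objective in Equation (5) can be evaluated for a fixed recommendation sequence, the snippet below accumulates the reach probability ∏_{i<t} p_i alongside the rewards. It is illustrative only; `p_cont` and `rewards` are hypothetical per-step estimates.

```python
# Sketch: expected cumulative reward of a fixed sequence, Equation (5).
def expected_cumulative_reward(p_cont, rewards):
    total, stay = 0.0, 1.0   # stay = P(user reaches step t) = prod_{i<t} p_i
    for p, r in zip(p_cont, rewards):
        total += stay * r    # reward at step t, discounted by reach probability
        stay *= p            # user continues with probability p
    return total
```

For example, with continue probabilities [0.5, 0.5] and rewards [1.0, 1.0], the expected cumulative reward is 1.0 + 0.5 · 1.0 = 1.5.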

4. The Proposed Approach

In this section, we first propose an online optimization framework named MDP-SSP, which considers the browsing session length and the immediate user engagement simultaneously and maximizes the cumulative user engagement directly. Then the related algorithms are presented in detail.

4.1. MDP-SSP Framework

In order to maximize the expected cumulative rewards, as mentioned previously, we should learn an MDP generator from which the personalized MDP model can be generated online, and then plan the recommendation sequence with the personalized MDP model. Therefore, the proposed MDP-SSP framework consists of two parts: an offline MDP Generator and an online SSP Planner, as shown in Figure 3.

4.1.1. MDP Generator

is designed to generate personalized MDPs for each online session. There are two submodules in this part: Model Worker and Calibration Worker. Model Worker learns models from offline historic data, aiming to provide the necessary elements of the personalized MDP. Specifically, the reward function r(s_t, a) and the continuation probability p(s_t, a) in Equation (5) are needed. Here r(s_t, a) could be an immediate user engagement, e.g., an immediate click, so Model Worker contains the corresponding estimation model, e.g., a click model. Meanwhile, p(s_t, a) is determined by a quit model, which governs the browse session length and is an important component of Model Worker.

Moreover, since the efficiency of SSP planning depends on the accuracy of the generated MDP model, we introduce an additional Calibration Worker to calibrate the ranking scores obtained from the learned model to the real value (Niculescu-Mizil and Caruana, 2005; Lin et al., 2007; He et al., 2014).

4.1.2. SSP Planner

plans a shortest path (i.e., a path with maximal cumulative rewards) consisting of sequentially recommended items. It also contains two submodules: MDP Producer and SSP Solver. Based on the models learned by the offline MDP Generator, MDP Producer generates a personalized MDP for the user of the present session. Then SSP Solver plans an optimal path for the user based on the personalized MDP.

Figure 3. MDP-SSP Framework

4.2. Offline MDP Generator Algorithm

In this section, we present an offline algorithm to learn the reward function r(s_t, a) and the continuation probability p(s_t, a), which are required to generate online personalized MDPs. We will see that modeling p(s_t, a) is the more critical and difficult problem. In practice, the historic data we obtain is often organized as item sets, each containing the items seen by the user up to a point in the session, after which the user either quits or goes on browsing. However, it is hard to know which item in the set is exactly the chief cause. In order to estimate the quit probability for each item, we therefore adopt the Multi-Instance Learning (MIL) framework by taking each item set as a bag and each item as an instance. In detail, if the set causes a quit, the user dislikes all the items in this set; if the set causes continued browsing, at least one item in the set is accepted by the user, which is consistent with the MIL setting.

4.2.1. Remark 2.

The standard MIL assumption states that all negative bags contain only negative instances, and that positive bags contain at least one positive instance.

By utilizing some classical MIL techniques, we can obtain the following user quit model.

4.2.2. User Quit Model

Based on the users' browse history, we can get sequences consisting of bags, and one may verify that only the last bag in a browse session fails to keep the user browsing. We regard a bag that keeps the user browsing as a positive bag, written as B^+, and the last one as the negative bag, written as B^-, so a browse session is (B_1^+, ..., B_{n-1}^+, B_n^-). Our job is to construct a model to predict the quit probability for each new instance (item). However, there exists a gap: the training labels we have are at the bag level, while the predictions we need are at the instance level. To cope with this problem, we introduce MI-SVM (Andrews et al., 2003) to train an instance-level model with the bag-level data, which, to the best of our knowledge, is a novel application of MIL to recommendation. The process for quit model training is shown in Algorithm 1.

0:  Input: historic browse session set D = {(B_1^+, ..., B_{n-1}^+, B_n^-)}
1:  Convert the bag-level data to instance-level data by NSK (Gärtner et al., 2002), and obtain an initial model f
2:  repeat
3:     for all sessions in D do
4:        for all positive bags B^+ do
5:           Select the instance with maximum score according to f as the witness of B^+
6:        end for
7:     end for
8:     Train f on the selected witnesses and all instances of the negative bags
9:  until the witness selections no longer change
10: return f
Algorithm 1 User Quit Model
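The witness-selection step of the MI-SVM iteration can be sketched as follows. Everything here is illustrative: `score` stands in for the current model f, and the toy feature-vector bags are hypothetical.

```python
# Sketch of MI-SVM-style witness selection: in each iteration, the
# highest-scoring instance of every positive bag is taken as that bag's
# positive witness; all instances of negative bags stay negative.
def select_witnesses(positive_bags, score):
    """Return one witness instance per positive bag."""
    return [max(bag, key=score) for bag in positive_bags]

# Two toy bags of 2-d instances, scored by a stand-in linear model.
bags = [[(0.1, 0.2), (0.9, 0.4)], [(0.3, 0.3), (0.2, 0.8)]]
witnesses = select_witnesses(bags, score=lambda x: x[0] + x[1])
```

The full algorithm alternates this selection with retraining the classifier on the witnesses plus all negative instances, until the selections stabilize.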

4.2.3. Model Calibration

In an industrial recommendation system, the ranking scores provided by the click model and the quit model are not equivalent to the reward and the transition probability of the MDP. Thus it is necessary to calibrate the model outputs to real probabilities. Interested readers may refer to (Niculescu-Mizil and Caruana, 2005; Lin et al., 2007; He et al., 2014) for details. In this paper, denoting the predicted score as s, the calibrated probability value is represented as follows:

    p = 1 / (1 + exp(A·s + B)),        (6)

where A and B are two scalar parameters that can be learned from historic data.
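A minimal sketch of this Platt-style calibration is given below. The fitting procedure (plain gradient descent on the logistic loss) and the (score, label) pairs are assumptions for illustration; the paper does not specify how A and B are fit.

```python
import math

# Sketch of Platt scaling, Equation (6): map a raw ranking score s to a
# calibrated probability 1 / (1 + exp(A*s + B)).
def platt(s, A, B):
    return 1.0 / (1.0 + math.exp(A * s + B))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit A, B by gradient descent on the logistic loss (illustrative)."""
    A, B = 0.0, 0.0
    for _ in range(steps):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            err = platt(s, A, B) - y   # d(loss)/d(logit), logit = -(A*s + B)
            gA += err * (-s)
            gB += err * (-1.0)
        A -= lr * gA / len(scores)
        B -= lr * gB / len(scores)
    return A, B

# Hypothetical raw scores and binary engagement labels.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 0, 1, 1, 1]
A, B = fit_platt(scores, labels)
```

After fitting, higher raw scores map to higher calibrated probabilities, which is what the SSP planner consumes.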

4.3. Online SSP Planner Algorithm

Based on the MDP Generator discussed in the last subsection, we formally introduce SSP Planner, which consists of MDP Producer and SSP Solver.

4.3.1. MDP Producer

When a new session comes, MDP Producer receives online information about the user and the items from the server, and then feeds it into the models derived from MDP Generator. After that, the reward and transition probability can be obtained, and a personalized MDP is produced in real time. It is worth noting that interactive information, such as how many items the user has browsed and how many times the item's category has been shown to and clicked by the user, should be considered. Intuitively, these interactive features play an important role in whether the user goes on browsing or quits.

4.3.2. SSP Solver

From MDP Producer we get a personalized MDP for the present session, and the next job is to find a path with maximal cumulative rewards. Except for the absorbing state, the corresponding MDP has T states, so the optimal state value function can be computed by dynamic programming in T steps. Furthermore, it is easy to verify that the transition matrix of our specifically designed MDP has an upper triangular structure, as shown in Equation (7), where p_t denotes p(s_t, a_t) and the rows and columns are ordered s_1, ..., s_T, s_A:

    P = [ 0  p_1  0    ...  0   1-p_1
          0  0    p_2  ...  0   1-p_2
          ...
          0  0    0    ...  0   1
          0  0    0    ...  0   1   ].        (7)
Based on this specially structured transition matrix, it is easy to see that later state values do not change when we update the current state value. Therefore backward induction can be adopted: one may start from the absorbing state and iteratively obtain the optimal policy as well as the related optimal state value function. We formally summarize this procedure as follows:

    V(s_A) = 0.        (8)

Furthermore, when t = T, we have

    π(s_T) = argmax_a r(s_T, a),        (9)
    V(s_T) = max_a r(s_T, a);        (10)

when t < T, we have

    π(s_t) = argmax_a [ r(s_t, a) + p(s_t, a) V(s_{t+1}) ],        (11)
    V(s_t) = max_a [ r(s_t, a) + p(s_t, a) V(s_{t+1}) ].        (12)
Based on Equations (8)-(12), we can plan an optimal path (π(s_1), ..., π(s_T)). The optimization procedure is shown in Algorithm 2. The whole planning procedure is simple and clear, which benefits the online application of the proposed method. Specifically, assuming there are N candidates, the complexity of SSP is O(T·N).

0:  Input: user request, MDP Generator
1:  Generate a personalized MDP for the current user
2:  Initialize a value vector V and a path Path of length T
3:  Obtain the optimal policy at step T, i.e., π(s_T), according to Equation (9)
4:  Obtain the optimal state value V(s_T) according to Equation (10)
5:  Update Path[T] = π(s_T)
6:  for t = T-1, ..., 2, 1 do
7:     Obtain the optimal policy π(s_t) according to Equation (11)
8:     Obtain the optimal state value V(s_t) according to Equation (12)
9:     Update Path[t] = π(s_t)
10: end for
11: return Path = (π(s_1), ..., π(s_T))
Algorithm 2 SSP Solver
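The backward induction of Equations (8)-(12) can be sketched as follows. This is illustrative, not the deployed code; `r[t][a]` and `p[t][a]` are hypothetical reward and continuation-probability estimates for candidate a at step t+1 (0-based).

```python
# Sketch of the SSP solver: backward induction over steps T, ..., 1.
def ssp_solve(r, p):
    T = len(r)
    V_next = 0.0                 # V(s_A) = 0; also serves as V(s_{t+1})
    path = [None] * T
    for t in range(T - 1, -1, -1):
        best_a = max(range(len(r[t])),
                     key=lambda a: r[t][a] + p[t][a] * V_next)
        path[t] = best_a
        V_next = r[t][best_a] + p[t][best_a] * V_next
    return path, V_next          # optimal items and V(s_1)

# Step 1 offers a high-CTR/low-continuation item (0.9, 0.1) and a
# lower-CTR/high-continuation item (0.5, 0.9); step 2 has one item.
path, value = ssp_solve([[0.9, 0.5], [1.0]], [[0.1, 0.9], [0.0]])
```

In this toy case the solver prefers the item with lower immediate reward at step 1, because keeping the user browsing yields 0.5 + 0.9 · 1.0 = 1.4 > 0.9 + 0.1 · 1.0, illustrating the tradeoff the paper targets.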

5. Experiments

The experiments are conducted on a large E-commerce platform. We first analyze the characteristics of the data, which demonstrate the necessity of applying SSP, and then evaluate SSP offline and online.

5.1. Data Set

Dataset 1: This dataset is for the MDP Generator. It consists of 15 days of historic data of user-item interactions, based on which we learn models for predicting the click-through rate and quit probability of any user-item pair.
Dataset 2: This dataset is for SSP offline evaluation. We collect active users and their corresponding browse sessions, and discard those that are inactive or excessively active; concretely, we keep users whose browse session length is between 50 and 100 items. Finally, we get 1,000 users and their corresponding browse sessions. The average length of the browse sessions is 57.
Dataset 3: This dataset is for SSP online evaluation. It is the actual online environment, with about ten million users and a hundred million items each day.

Various strategies (including SSP) will be deployed to rerank the personalized candidate items for each user in Dataset 2 and Dataset 3, to validate their effect on maximizing cumulative user engagement. Before that, we should first verify that the datasets exhibit the following characteristics:

  • Discrimination: Different items should yield different quit probabilities, with significant discrimination among them. Otherwise the quit probability need not be considered when making recommendations.

  • Weakly related: The quit probability of an item for a user should be only weakly correlated with its click-through rate. Otherwise SSP and Greedy will coincide.

5.2. Evaluation Measures

In the experiments, we consider cumulative clicks as the cumulative user engagement. We abbreviate cumulative clicks as IPV, which means Item Page View and is commonly used in industry. Browse length (BL, for short) is also a measurement, since IPV can be increased by making users browse more items.

In offline evaluation, assuming that the recommended sequence length is T, with Equations (1)-(5) we have

    IPV = Σ_{t=1}^{T} ( ∏_{i=1}^{t-1} p(s_i, a_i) ) r(s_t, a_t).        (13)

Furthermore, define the BL of the recommended sequence as

    BL = Σ_{t=1}^{T} ∏_{i=1}^{t-1} p(s_i, a_i).        (14)

In online evaluation, IPV can be counted from the actual online logs as follows:

    IPV = Σ_{t=1}^{L} c_t,        (15)

where c_t indicates the click behavior in step t, and L is the actual browse length.
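The two offline measures in Equations (13) and (14) can be computed together in one pass over the planned sequence. The sketch below is illustrative, with hypothetical per-step CTR and continuation-probability lists.

```python
# Sketch: expected IPV (Eq. 13) and expected browse length (Eq. 14)
# for a planned sequence.
def offline_ipv_bl(ctr, p_cont):
    ipv = bl = 0.0
    stay = 1.0                  # P(user reaches step t) = prod_{i<t} p_i
    for r, p in zip(ctr, p_cont):
        ipv += stay * r         # expected clicks contributed at step t
        bl += stay              # E[L] = sum_t P(L >= t)
        stay *= p
    return ipv, bl
```

Dividing the two gives the per-step CTR of the sequence, which is the third quantity reported in the offline tables.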

5.3. Compared Strategies

  • Greedy: The key difference between our method and traditional methods is that we take the user's quit probability into consideration and plan a path by directly maximizing IPV, while most other methods try to estimate each step's reward r(s_t, a) as accurately as possible and, when planning, simply rank the items greedily according to r(s_t, a), ignoring that p(s_t, a) is also crucial to IPV. Greedy is the first compared strategy, in which the quit probability is not involved. Assuming that there are N candidates and the length of the planning path is T, the complexity is O(T·N).

  • Beam Search: A search algorithm that balances performance and computational cost, whose purpose is to decode near-optimal paths in sequences. It is chosen as a compared strategy because the quit probability can be involved. We calculate each beam path's score according to Equation (13), so that Beam Search applied here directly optimizes IPV. Assuming that there are N candidates, the length of the planning path is T, and the beam size is b, the complexity is O(b·T·N).
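The Beam Search baseline can be sketched as follows: expand the b best partial paths one step at a time, scoring each partial path by its expected cumulative clicks as in Equation (13). Inputs are hypothetical and duplicates are allowed here (the deduplicated variant is discussed later).

```python
# Sketch of the Beam Search baseline over a candidate set with per-item
# CTR `ctr[a]` and continue-probability `p_cont[a]` (step-independent
# for simplicity of the toy example).
def beam_search(ctr, p_cont, T, beam=2):
    # Each entry: (expected IPV so far, reach probability, chosen items).
    beams = [(0.0, 1.0, [])]
    n = len(ctr)
    for _ in range(T):
        cand = [(ipv + stay * ctr[a], stay * p_cont[a], path + [a])
                for ipv, stay, path in beams for a in range(n)]
        beams = sorted(cand, key=lambda x: x[0], reverse=True)[:beam]
    return beams[0]

best_ipv, _, best_path = beam_search([0.9, 0.5], [0.1, 0.9], T=2, beam=2)
```

With beam size 2 on this toy instance, the search keeps the low-CTR/high-continuation item alive and ends up with the same path the exact solver would pick.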

5.4. MDP Generator Learning

We first describe MDP Generator learning in Sections 5.4.1 and 5.4.2, with which we verify the characteristics of the datasets in Sections 5.4.3 and 5.4.4.

5.4.1. Model Learning

In model learning, we make full use of user attributes and item attributes. Furthermore, we add interactive features, for example how many times the item's category has been shown to and clicked by the user, which intuitively play an important role in making the user go on browsing. Area Under the Curve (AUC), a frequently-used metric in industry and research, is adopted to measure the learned models, and the results are shown in Table 1.

Model AUC
CTR Model 0.7194
Quit Model 0.8376
Table 1. CTR Model and Quit Model

Here we briefly describe the Quit Model testing method. Since in practice we do not know which item makes the user go on browsing, AUC cannot be directly calculated at the instance level. It is more rational to calculate AUC at the bag level with instance predictions: a bag is treated as positive if it contains at least one positive instance, and negative if all its instances are negative.
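This bag-level evaluation can be sketched as scoring each bag by the maximum of its instance predictions and computing a plain pairwise AUC over the bag scores. The snippet is illustrative; `predict` stands in for the learned instance-level quit model.

```python
# Sketch: bag-level AUC from instance-level predictions.
def bag_level_auc(bags, labels, predict):
    # A bag's score is its best instance score (MIL: positive iff it
    # holds at least one positive instance).
    scores = [max(predict(x) for x in bag) for bag in bags]
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Pairwise AUC: P(score_pos > score_neg), ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Using the max-over-instances bag score keeps the evaluation consistent with the standard MIL assumption stated in Remark 2.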

Furthermore, we run a comparison experiment on the Quit Model to show the necessity of adopting MIL. As bag labels are known, the most intuitive idea is to use each bag's label as the label of all its instances. Based on this idea, we obtain a baseline quit model, and AUC is again calculated at the bag level. The results are shown in Table 2, from which we can see that adopting MIL improves Quit Model learning.

Model AUC
Quit Model 0.8376
Table 2. Quit Model Comparison

5.4.2. Model Calibration

Calibration maps the ranking scores obtained from the models to real probability values. It is especially important here since errors accumulate through the backward induction; see Equations (11)-(12). We apply the Platt scaling method, and Root Mean Square Error (RMSE) is adopted as the measurement. The results are shown in Table 3.

RMSE Before After
CTR 0.0957 0.0179
Quit 0.6046 0.0077
Table 3. Model Calibration

From Table 3, it can be seen that a significant improvement has been achieved after calibration; the curves of real value and calibrated value are shown in Figure 4 and Figure 5. The abscissa axis is the items sorted by predicted score, and the vertical axis shows the calibrated score and the real score. The real score is obtained by binning the items.

Figure 4. Calibration of Click Model.
Figure 5. Calibration of Quit Model

5.4.3. Discrimination

In Dataset 2, for each user we get the quit probabilities of the corresponding candidates (i.e., the items in the user's browse session) from the MDP Generator. A quit probability list (q_1, ..., q_n) is then obtained for each user, where q_i is the quit probability when recommending item i to the user. The standard deviation (STD) and mean are calculated for each list, and the statistics over the dataset are shown in Table 4. From the table, it can be seen that, for each user, different candidates make different contributions to keeping the user browsing, and they show significant discrimination.

0.1963 0.7348 0.3135
Table 4. Discrimination of Quit Probabilities

5.4.4. Weakly Related

We further study the correlation between quit probability and immediate user engagement (i.e., each step's reward). For each user, we form two item lists L_r and L_p of length K, greedily according to r(s_t, a) and p(s_t, a) respectively. If r and p were perfectly positively correlated, L_r and L_p would be the same, which would make SSP and Greedy equal. We use the Jaccard Index and NDCG to measure the similarity of L_r and L_p, and the average results over the dataset are shown in Table 5. From the table, we find that in this dataset the quit probability and the immediate user engagement are only weakly related.

Mean Length List Length Jaccard Index NDCG
57 20 0.33 0.52
Table 5. The Correlation between Quit Probability and CTR

5.5. SSP Planner: Offline Evaluation

Now we deploy the strategies in Dataset 2.

5.5.1. SSP Plan

We plan a sequence of T steps according to each strategy mentioned above. The revenue of the planned sequence can be calculated according to Equations (13)-(15).

The detailed results are shown in Table 6, and we can find that:

  • Greedy achieves the best CTR, while SSP achieves the best IPV and BL. This supports our idea that IPV can be improved by making users browse more: SSP does not aim to optimize each step's effectiveness; its purpose is to improve the total amount of cumulative clicks.

  • The more steps planned, the greater the advantages in IPV and BL. Comparing T=20 and T=50: when T grows 2.5 times, from 20 to 50, the improvements of SSP in both IPV and BL grow by more than 2.5 times. This result is in line with our expectation, as planning more steps allows the quit probability to have a bigger effect on users.

Method | T=20: IPV / BL / CTR | T=50: IPV / BL / CTR
GREEDY | 168.10 / 455.36 / 0.37 | 280.63 / 765.79 / 0.37
Beam Search | 321.91 / 977.78 / 0.33 | 660.90 / 2039.88 / 0.32
SSP | 392.97 / 1347.57 / 0.29 | 1066.08 / 4045.47 / 0.26
Table 6. IPV, BL and CTR of Offline Evaluation

5.5.2. SSP Plan with Duplicate Removal

In some practical scenarios, items are forbidden from being displayed repeatedly. We therefore adapt the three strategies as follows.

  • Greedy: The items selected in the previous steps are removed from the candidate set of the current step.

  • Beam Search: The items selected in the previous steps of a beam are removed from the candidate set of the current step.

  • SSP: When planning, we plan backwards from step T to step 1 according to the upper bound of the state value of each step, and keep the optimal items as each step's candidates. When selecting, we select forwards from step 1 to step T: at each step we choose the optimal item and simultaneously remove it from the remaining steps' candidate sets.
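As a minimal illustration of the duplicate-removal compromise, the Greedy variant above amounts to repeatedly taking the best remaining candidate. The CTR list is hypothetical and the snippet is illustrative only.

```python
# Sketch: Greedy planning with duplicate removal. At every step, pick
# the best remaining candidate and drop it from the pool.
def greedy_dedup(ctr, T):
    remaining = set(range(len(ctr)))
    path = []
    for _ in range(T):
        a = max(remaining, key=lambda i: ctr[i])
        path.append(a)
        remaining.discard(a)
    return path
```

The SSP variant is analogous but, as described above, plans backwards first and then resolves conflicts while selecting forwards.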

From the detailed results in Table 7, we find that although the compromises hurt the ideal effects, SSP still outperforms Greedy and Beam Search.

Method | T=20: IPV / BL / CTR | T=50: IPV / BL / CTR
GREEDY | 68.06 / 216.38 / 0.31 | 68.23 / 217.93 / 0.31
Beam Search | 105.52 / 427.80 / 0.25 | 107.06 / 439.59 / 0.24
SSP | 189.11 / 999.51 / 0.19 | 242.77 / 1632.56 / 0.15
Table 7. IPV, BL and CTR of Offline Evaluation with Deduplication

5.5.3. SSP Plan with Noise

Since there may exist a gap between the offline and online environments, which makes the click-through rate and quit probability predicted offline not exactly equal to the real online values, we run a set of noise experiments before deploying MDP-SSP online.

The experiments are conducted in the following way: we add random noise to the click-through rate and quit probability given by the offline environment. Assuming the noise ε follows a uniform distribution U(-x%, +x%), we define r̃ = r·(1 + ε) and p̃ = p·(1 + ε), where the noise level x is an integer. We plan according to the noisy values and calculate the final revenue with the real values. The results are shown in Figure 6; the horizontal axis represents the noise level x, and the vertical axis is the revenue, i.e., cumulative clicks.
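The perturbation step of this experiment can be sketched as follows; the helper, its clipping to [0, 1], and the fixed seed are assumptions for illustration, not the paper's code.

```python
import random

# Sketch: multiply each offline estimate by (1 + eps), eps ~ U(-x%, +x%),
# clipping the result to a valid probability. Planning then uses the
# perturbed values, while scoring uses the originals.
def perturb(values, x_percent, rng=random.Random(0)):
    lo, hi = -x_percent / 100.0, x_percent / 100.0
    return [min(1.0, max(0.0, v * (1.0 + rng.uniform(lo, hi))))
            for v in values]
```

Re-planning with the perturbed estimates and evaluating with the clean ones measures how sensitive each strategy is to model error.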

(a) T=20
(b) T=50
Figure 6. Noise Experiments

From Figure 6 we find that although SSP is more sensitive to noise, it still performs better than Greedy and Beam Search. This demonstrates that considering the quit probability plays a very important role in maximizing IPV.

5.6. SSP Planner: Online Evaluation

For online evaluation, we deployed the SSP and Greedy strategies on a real E-commerce APP. For further comparison, we also ran an experiment with a quit model that does not use MIL; we refer to this strategy as SSP (w/o MIL). The three strategies ran online with the same traffic for one week, and the results shown in Table 8 (the data has been processed for business reasons) demonstrate that:

  • For cumulative clicks, the quit probability cannot be ignored in sequential recommendation (compare SSP with Greedy).

  • The accuracy of the quit probability directly influences the results (compare SSP with the variant without MIL).

Method       | IPV    | BL
Greedy       | 0.9296 | 0.9440
SSP w/o MIL  | 0.9638 | 0.9789
SSP          | 1      | 1
Table 8. IPV of Online Evaluation

6. Conclusions

In this paper, we studied the problem of maximizing cumulative user engagement in sequential recommendation when the browse length is not fixed. We proposed an online optimization framework in which the problem reduces to an SSP problem, together with a practical approach that is easy to implement in real-world applications; the corresponding optimization problem can be solved efficiently via dynamic programming. The advantage of our method in generating optimal personalized recommendations is verified with both offline and online experiments.

In the future, we will study MDP-SSP with deduplication. While the current MDP-SSP yields good sequential recommendations, it does not model the item-duplication constraint, and duplicate items are usually not allowed in practice. Although our compromise strategy outperforms the Beam Search and Greedy strategies commonly used in practice, it is not optimal under the deduplication constraint; modeling this constraint directly in the planner would bring further insights.


  • Amores (2013) Jaume Amores. 2013. Multiple instance classification: review, taxonomy and comparative study. Artificial intelligence 201 (2013), 81–105.
  • Andrews et al. (2003) Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. 2003. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems. Vancouver, Canada, 577–584.
  • Barto et al. (1995) Andrew G Barto, Steven J Bradtke, and Satinder P Singh. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72, 1-2 (1995), 81–138.
  • Bertsekas and Tsitsiklis (1991) Dimitri P Bertsekas and John N Tsitsiklis. 1991. An analysis of stochastic shortest path problems. Mathematics of Operations Research 16, 3 (1991), 580–595.
  • Bonet and Geffner (2003) Blai Bonet and Hector Geffner. 2003. Labeled RTDP: improving the convergence of real-time dynamic programming.. In Proceedings of the Thirteenth International Conference on Automated Planning and Scheduling, Vol. 3. Trento, Italy, 12–21.
  • Bunescu and Mooney (2007) Razvan C Bunescu and Raymond J Mooney. 2007. Multiple instance learning for sparse positive bags. In Proceedings of the Twenty-Fourth International Conference on Machine Learning. ACM, Corvallis, Oregon, USA, 105–112.
  • Carbonell and Goldstein (1998) Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the Twentiy-First Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, Melbourne, Australia, 335–336.
  • Chen et al. (2018) Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential recommendation with user memory networks. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, Los Angeles, California, USA, 108–116.
  • Devooght and Bersini (2017) Robin Devooght and Hugues Bersini. 2017. Long and short-term recommendations with recurrent neural networks. In Proceedings of The Twenty-Fifth Conference on User Modeling, Adaptation and Personalization. ACM, Bratislava, Slovakia, 13–21.
  • Dietterich et al. (1997) Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence 89, 1-2 (1997), 31–71.
  • Donkers et al. (2017) Tim Donkers, Benedikt Loepp, and Jürgen Ziegler. 2017. Sequential user-based recurrent neural network recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, Como, Italy, 152–160.
  • Ebesu et al. (2018) Travis Ebesu, Bin Shen, and Yi Fang. 2018. Collaborative memory network for recommendation systems. arXiv preprint arXiv:1804.10862 (2018).
  • Gärtner et al. (2002) Thomas Gärtner, Peter A Flach, Adam Kowalczyk, and Alexander J Smola. 2002. Multi-instance kernels. In Proceedings of the Nineteenth International Conference on Machine Learning, Vol. 2. Sydney, Australia, 179–186.
  • He et al. (2014) Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, New York, NY, USA, 1–9.
  • Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
  • Huang et al. (2018) Jin Huang, Wayne Xin Zhao, Hongjian Dou, Ji-Rong Wen, and Edward Y Chang. 2018. Improving sequential recommendation with knowledge-enhanced memory networks. In Proceedings of the Forty-First International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, Ann Arbor Michigan, USA, 505–514.
  • Kolobov et al. (2011) Andrey Kolobov, Mausam Mausam, Daniel S Weld, and Hector Geffner. 2011. Heuristic search for generalized stochastic shortest path MDPs. In Proceedings of the Twenty-First International Conference on Automated Planning and Scheduling. Freiburg, Germany.
  • Lin et al. (2007) Hsuan-Tien Lin, Chih-Jen Lin, and Ruby C Weng. 2007. A note on Platt’s probabilistic outputs for support vector machines. Machine learning 68, 3 (2007), 267–276.
  • Maron and Lozano-Pérez (1998) Oded Maron and Tomás Lozano-Pérez. 1998. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems. Massachusetts, USA, 570–576.
  • Niculescu-Mizil and Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. 2005. Predicting good probabilities with supervised learning. In Proceedings of the Twenty-Second International Conference on Machine Learning. ACM, Bonn, Germany, 625–632.
  • Polychronopoulos and Tsitsiklis (1996) George H Polychronopoulos and John N Tsitsiklis. 1996. Stochastic shortest path problems with recourse. Networks: An International Journal 27, 2 (1996), 133–143.
  • Tang and Wang (2018) Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, Los Angeles, California, USA, 565–573.
  • Teo et al. (2016) Choon Hui Teo, Houssam Nassif, Daniel Hill, Sriram Srinivasan, Mitchell Goodman, Vijai Mohan, and SVN Vishwanathan. 2016. Adaptive, personalized diversity for visual discovery. In Proceedings of The Tenth ACM Conference on Recommender Systems. ACM, Boston, MA, USA, 35–38.
  • Trevizan et al. (2016) Felipe W Trevizan, Sylvie Thiébaux, Pedro Henrique Santana, and Brian Charles Williams. 2016. Heuristic search in dual space for constrained stochastic shortest path problems.. In Proceedings of the Thirteenth International Conference on Automated Planning and Scheduling. London, UK, 326–334.
  • Wu et al. (2017) Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J Smola, and How Jing. 2017. Recurrent recommender networks. In Proceedings of The Tenth ACM International Conference on Web Search and Data Mining. ACM, Cambridge, UK, 495–503.
  • Zhang and Goldman (2002) Qi Zhang and Sally A Goldman. 2002. EM-DD: An improved multiple-instance learning technique. In Advances in Neural Information Processing Systems. Vancouver, British Columbia, Canada, 1073–1080.