1 Introduction
Recommender systems have become ubiquitous, transforming user interactions with products, services and content in a wide variety of domains. In content recommendation, recommenders generally surface relevant and/or novel personalized content based on learned models of user preferences (e.g., as in collaborative filtering [Breese et al., 1998, Konstan et al., 1997, Srebro et al., 2004, Salakhutdinov and Mnih, 2007]) or predictive models of user responses to specific recommendations. Well-known applications of recommender systems include video recommendations on YouTube [Covington et al., 2016], movie recommendations on Netflix [Gomez-Uribe and Hunt, 2016] and playlist construction on Spotify [Jacobson et al., 2016]. It is increasingly common to train deep neural networks (DNNs) [van den Oord et al., 2013, Wang et al., 2015, Covington et al., 2016, Cheng et al., 2016] to predict user responses (e.g., click-through rates, content engagement, ratings, likes) to generate, score and serve candidate recommendations.

Practical recommender systems largely focus on myopic prediction—estimating a user's immediate response to a recommendation—without considering the long-term impact on subsequent user behavior. This can be limiting: modeling a recommendation's stochastic impact on the future affords opportunities to trade off user engagement in the near term for longer-term benefit (e.g., by probing a user's interests, or improving satisfaction). As a result, research has increasingly turned to the sequential nature of user behavior using temporal models, such as hidden Markov models and recurrent neural networks [Rendle et al., 2010, Campos et al., 2014, He and McAuley, 2016, Sahoo et al., 2012, Tan et al., 2016, Wu et al., 2017], and to long-term planning using reinforcement learning (RL) techniques (e.g., [Shani et al., 2005, Taghipour et al., 2007, Gauci et al., 2018]). However, the application of RL has largely been confined to restricted domains due to the complexities of putting such models into practice at scale.

In this work, we focus on the use of RL to maximize the long-term value (LTV) of recommendations to the user, specifically, long-term user engagement. We address two key challenges facing the deployment of RL in practical recommender systems, the first algorithmic and the second methodological.
Our first contribution focuses on the algorithmic challenge of slate recommendation in RL. One challenge in many recommender systems is that, rather than a single item, multiple items are recommended to a user simultaneously; that is, users are presented with a slate of recommended items. This induces an RL problem with a large combinatorial action space, which in turn imposes significant demands on exploration, generalization and action optimization. Recent approaches to RL with such combinatorial actions [Sunehag et al., 2015, Metz et al., 2017] make inroads into this problem, but are unable to scale to problems of the size encountered in large, real-world recommender systems, in part because of their generality. In this work, we develop a new slate decomposition technique called SlateQ that estimates the long-term value (LTV) of a slate of items by directly using the estimated LTV of the individual items on the slate. This decomposition exploits certain assumptions about the specifics of user choice behavior—i.e., the process by which user preferences dictate selection of and/or engagement with items on a slate—but these assumptions are minimal and, we argue below, very natural in many recommender settings.
More concretely, we first show how the SlateQ decomposition can be incorporated into temporal difference (TD) learning algorithms, such as SARSA and Q-learning, so that LTVs can be learned at the level of individual items despite the fact that items are always presented to users in slates. This is critical for both generalization and exploration efficiency. We then turn to the optimization problem required to build slates that maximize LTV, a necessary component of policy improvement (e.g., in Q-learning) at training time and of selecting optimal slates at serving time. Despite the combinatorial (and fractional) nature of the underlying optimization problem, we show that it can be solved in polynomial time by a two-step reduction to a linear program (LP). We also show that simple top-k and greedy approximations, while having no theoretical guarantees in this formulation, work well in practice.

Our second contribution is methodological. Despite the recent successes of RL afforded by deep Q-networks (DQNs) [Mnih et al., 2015, Silver et al., 2016], the deployment of RL in practical recommenders is hampered by the need to construct relevant state and action features for DQN models, and to train models that serve millions to billions of users. In this work, we develop a methodology that allows one to exploit existing myopic recommenders to: (a) accelerate RL model development; (b) reuse existing training infrastructure to a great degree; and (c) reuse the same serving infrastructure for scoring items based on their LTV. Specifically, we show how temporal difference (TD) learning can be built on top of existing myopic pipelines to allow the training and serving of DQNs.
Finally, we demonstrate our approach with both offline simulation experiments and online live experiments on the YouTube video recommendation system. We show that our techniques are scalable and offer significant improvements in user engagement over myopic recommendations. The live experiment also demonstrates how our methodology supports the relatively straightforward deployment of TD and RL methods that build on the learning infrastructure of extant myopic systems.
The remainder of the paper is organized as follows. In Section 2, we briefly discuss related work on the use of RL for recommender systems, choice modeling, and RL with combinatorial action spaces. We formulate the LTV slate recommendation problem as a Markov decision process (MDP) in Section 3 and briefly discuss standard value-based RL techniques, in particular, SARSA and Q-learning.

We introduce our SlateQ decomposition in Section 4, discussing the assumptions under which the decomposition is valid, and how it supports effective TD-learning by allowing the long-term value (Q-value) of a slate to be decomposed into a function of its constituent item-level LTVs (Q-values). We pay special attention to the form of the user choice model, i.e., the random process by which a user's preferences determine the selection of an item from a slate. The decomposition affords item-level exploration and generalization for TD methods like SARSA and Q-learning, thus obviating the need to construct value or Q-functions explicitly over slates. For Q-learning itself to be feasible, we must also solve the combinatorial slate optimization problem—finding a slate with maximum LTV given the Q-values of individual items. We address this problem in Section 5, showing that it can be solved effectively by first developing a fractional mixed-integer programming formulation for slate optimization, then deriving a reformulation and relaxation that allows the problem to be solved exactly as a linear program. We also describe two simple heuristics, top-k and greedy slate construction, that have no theoretical guarantees, but perform well in practice.

To evaluate these methods systematically, we introduce a recommender simulation environment, RecSim, in Section 6 that allows the straightforward configuration of an item collection (or vocabulary), a user (latent) state model and a user choice model. We describe specific instantiations of this environment suitable for slate recommendation, and in Section 7 we use these models in the empirical evaluation of our SlateQ learning and optimization techniques.

The practical application of RL in the estimation of LTV in large-scale, practical recommender systems often requires integration of RL methods with production machine-learning training and serving infrastructure. In Section 8, we outline a general methodology by which RL methods like SlateQ can be readily incorporated into the typical infrastructure used by many myopic recommender systems. We use this methodology to test the SlateQ approach, specifically using SARSA to effect one-step policy improvements, in a live experiment for recommendations on the YouTube homepage. We discuss the results of this experiment in Section 9.

2 Related Work
We briefly review select related work in recommender systems, choice modeling and combinatorial action optimization in RL.
Recommender Systems
Recommender systems have typically relied on collaborative filtering (CF) techniques [Konstan et al., 1997, Breese et al., 1998]. These exploit user feedback on a subset of items (either explicit, e.g., ratings, or implicit, e.g., consumption) to directly estimate user preferences for unseen items. CF techniques include methods that explicitly cluster users and/or items, methods that embed users and items in a low-dimensional representation (e.g., LSA, probabilistic matrix factorization), or combinations of the two [Krestel et al., 2009, Moshfeghi et al., 2011].
Increasingly, recommender systems have moved beyond explicit preference prediction to capture more nuanced aspects of user behavior, for instance, how they respond to specific recommendations, such as pCTR (predicted click-through rate), degree of engagement (e.g., dwell/watch/listen time), ratings, social behavior (e.g., comments, sharing) and other behaviors of interest. DNNs now play a significant role in such approaches [van den Oord et al., 2013, Wang et al., 2015, Covington et al., 2016, Cheng et al., 2016] and often use CF-inspired embeddings of users and items as inputs to the DNN itself.
Sequence Models and RL in Recommender Systems
Attempts to formulate recommendation as an RL problem have been relatively uncommon, though the problem has attracted more attention recently. Early models included an MDP model for shopping recommendation [Shani et al., 2005] and Q-learning for page navigation [Taghipour et al., 2007], but were limited to very small-scale settings (hundreds of items, a few thousand users). More recently, biclustering has been combined with RL algorithms for recommendation [Choi et al., 2018], while Gauci et al. [2018] describe the use of RL in several applications at Facebook. Chen et al. [2018] also explored a novel off-policy policy-gradient approach that is very scalable, and was shown to be effective in a large-scale commercial recommender system. Their approach does not explicitly compute LTV improvements (as we do by developing Q-value models), nor does it model the slate effects that arise in practical recommendations.
Zhao et al. [2018] explicitly consider RL in slate-based recommendation systems, developing an actor-critic approach for recommending a page of items, tested using a simulator trained on user logs. While similar in motivation to our approach, their method differs in several important dimensions: it makes no significant structural assumptions about user choice, instead using a CNN to model the spatial layout of items on a page; as a result, it does not address the action-space combinatorics w.r.t. generalization, exploration, or optimization (though it allows additional flexibility in capturing user behavior). Finally, their method focuses on online training, and their evaluation with offline data is limited to item re-ranking.
Slate Recommendation and Choice Modeling
Accounting for slates of items in recommender systems is quite common [Deshpande and Karypis, 2004, Boutilier et al., 2003, Viappiani and Boutilier, 2010, Le and Lauw, 2017], and the extension introduces interesting modeling questions (e.g., involving metrics such as diversity [Wilhelm et al., 2018]) and computational issues due to the combinatorics of slates themselves. Swaminathan et al. [2017] explored off-policy evaluation and optimization using inverse propensity scores in the context of slate interactions. Mehrotra et al. [2019] developed a hierarchical model for understanding user satisfaction in slate recommendation.
The construction of optimal recommendation slates generally depends on user choice behavior. Models of user choice from sets of items are studied under the banner of choice modeling in econometrics, psychology, statistics, operations research, and marketing and decision science [Luce, 1959, Louviere et al., 2000]. Probably the most common model of user choice is the multinomial logit (MNL) model and its extensions (e.g., the conditional logit model, the mixed logit model, etc.)—we refer to Louviere et al. [2000] for an overview. For example, the conditional logit model assumes a set of user-item characteristics (e.g., a feature vector x_{si} for user s and item i), which determines the (random) utility u(s, i) of the item for the user. Typically, this utility is linear, so u(s, i) = β^T x_{si}, though we consider the use of DNN regressors to estimate these logits below. The probability of the user selecting item i from a slate A of items is

    P(i | s, A) = e^{u(s,i)} / Σ_{j ∈ A} e^{u(s,j)} .    (1)

The choice model is justified under specific independence and extreme value assumptions [McFadden, 1974, Train, 2009]. Various forms of such models are used to model consumer choice and user behavior in wide-ranging domains, together with specific methods for model estimation, experiment design and optimization. Such models form the basis of optimization procedures in revenue management [Talluri and van Ryzin, 2004, Rusmevichientong and Topaloglu, 2012], product line design [Chen and Hausman, 2000, Schön, 2010], assortment optimization [Martínez-de Albéniz and Roels, 2011, Honhon et al., 2012] and a variety of other areas—we exploit connections with this work in Section 5 below.
The conditional logit model is an instance of a more general conditional choice format in which a user in state s selects item i ∈ A with unnormalized probability v(i), for some function v:

    P(i | s, A) = v(i) / Σ_{j ∈ A} v(j) .    (2)

In the case of the conditional logit, v(i) = e^{u(s,i)}, but any arbitrary non-negative v can be used.
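As a concrete illustration, the general conditional choice rule of Eq. (2)—with the conditional logit recovered by taking v = exp—can be sketched as follows; the function name and the utility values are illustrative, not from the paper:

```python
import math

def choice_probabilities(slate_utilities, v=math.exp):
    """General conditional choice rule: item i is chosen from the slate
    with probability v(u_i) / sum_j v(u_j).  With v = exp this is the
    conditional logit (MNL) model; any positive v may be substituted."""
    weights = [v(u) for u in slate_utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Conditional logit over a 3-item slate (utilities are illustrative):
probs = choice_probabilities([1.0, 0.0, -1.0])
```

Note that only utility differences matter under the logit form: adding a constant to every utility leaves the probabilities unchanged.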
Within the ML community, including recommender systems and learning-to-rank, other choice models are used to explain user choice behavior. For example, cascade models [Joachims, 2002, Craswell et al., 2008, Kveton et al., 2015] have proven popular as a means of explaining user browsing behavior through (ordered) lists of recommendations, search results, etc., and are especially effective at capturing position bias. The standard cascade model assumes that a user has some affinity (e.g., perceived utility) u(i) for any item i; sequentially scans a list of items i_1, …, i_k in order; and will select (e.g., click) an inspected item i_j with probability φ(u(i_j)) for some non-decreasing function φ. If an item is selected when inspected, no items following it will be inspected/selected; and if the last item is inspected but not selected, then no selection is made. Thus the probability of item i_j being selected is:

    P(i_j | i_1, …, i_k) = φ(u(i_j)) · Π_{t < j} (1 − φ(u(i_t))) .    (3)
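A minimal sketch of the cascade selection probabilities in Eq. (3), assuming a hypothetical clipping function φ (any non-decreasing φ mapping into [0, 1] would do; all affinity values are illustrative):

```python
def cascade_selection_probs(affinities, phi):
    """Selection probability of each position under the standard cascade
    model: the user scans items in order, clicks position k with
    probability phi(u_k) when inspected, and stops at the first click."""
    probs = []
    p_reach = 1.0  # probability the user inspects the current position
    for u in affinities:
        p_click = phi(u)
        probs.append(p_reach * p_click)
        p_reach *= (1.0 - p_click)  # user continues only if no click
    return probs  # the residual p_reach is the no-click probability

# Illustrative phi: clip affinity into [0, 1].
phi = lambda u: min(max(u, 0.0), 1.0)
probs = cascade_selection_probs([0.5, 0.4, 0.3], phi)
```

Running this gives selection probabilities [0.5, 0.2, 0.09], illustrating the position bias the model captures: later positions are reached, and hence selected, less often.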
Various mechanisms for model estimation, optimization and exploration have been proposed for the basic cascade model and its variations. Recently, DNN and sequence models have been developed to explain user choice behavior in a more general, nonparametric fashion [Ai et al., 2018, Bello et al., 2018]. As one example, Jiang et al. [2019] proposed a slate-generation model using conditional variational autoencoders (CVAEs) to model the distribution of slates conditioned on user response; however, its scalability in large domains of the type we consider requires the use of a pretrained item embedding. The CVAE model does offer considerable flexibility in capturing item interactions, position bias, and other slate effects that might impact user response behavior.
RL with Combinatorial Action Spaces
Designing tractable RL approaches for combinatorial actions—of which slate recommendations are an example—is itself quite challenging. Some recent work in recommender systems considered slate-based recommendations (see, e.g., the discussion of Zhao et al. [2018] above, though they do not directly address the combinatorics), though most is more general. Sequential DQN [Metz et al., 2017] decomposes k-dimensional actions into a sequence of k atomic actions, inserting fictitious states between them so a standard RL method can plan a trajectory giving the optimal action configuration. While demonstrated to be useful in some circumstances, the approach trades off the exponential size of the action space with a corresponding exponential increase in the size of the state space (with fictitious states corresponding to possible sequences of sub-actions).
Sunehag et al. [2015] proposed Slate MDPs, which consider slates of primitive actions, using DQN to learn the value of item slates and a greedy procedure to construct slates. In fact, they develop three DQN methods for the problem, two of which manage the combinatorics of slates by assuming the primitive actions can be executed in isolation. In our setting, this amounts to the unrealistic assumption that we could "force" a user to consume a specific item (rather than present them with a slate, from which no item might be consumed). Their third approach, Generic Full Slate, makes no such assumption, but maintains an explicit Q-function over slates of items. This means it fails to address the exploration and generalization problems, and while the greedy optimization (action selection) method used is tractable, it comes with no guarantee of optimality.
3 An MDP Model for Slate Recommendation
In this section, we develop a Markov decision process (MDP) model for content recommendation with slates. We consider a setting in which a recommender system is charged with presenting a slate to a user, from which the user selects zero or more items for consumption (e.g., listening to selected music tracks, reading content, watching video content). Once items are consumed, the user can return for additional slate recommendations or terminate the session. The user’s response to a consumed item may have multiple dimensions. These may include the immediate degree of engagement with the item (e.g., consumption time); ratings feedback or comments; sharing behavior; subsequent engagement with the content provider beyond the recommender system’s direct control. In this work, we use degree of engagement abstractly as the reward without loss of generality, since it can encompass a variety of metrics, or their combinations.
We focus on session optimization to make the discussion concrete, but our decomposition applies equally well to any long-term horizon. (Dealing with very extended horizons, such as lifetime value [Theocharous et al., 2015, Hallak et al., 2017], is often problematic for any RL method; but such issues are independent of the slate formulation and decomposition we propose.) Session optimization with slates can be modeled as an MDP with states S, actions A, reward function R and transition kernel P, together with a discount factor γ ∈ [0, 1].
The states S typically reflect user state. This includes relatively static user features such as demographics, declared interests, and other user attributes, as well as more dynamic user features, such as user context (e.g., time of day). In particular, summaries of relevant user history and past behavior play a key role, such as past recommendations made to the user; and past user responses, such as recommendations accepted or passed over, the specific items consumed, and the degree of user engagement with those items. The summarization of history is often domain-specific (see our discussion of methodology in Section 8) and can be viewed as a means of capturing certain aspects of user latent state in a partially observable MDP. The state may also reflect certain general (user-independent) environment variables. We develop our model assuming a finite state space for ease of exposition, though our experiments and our methodology admit both countably infinite and continuous state features.
The action space is simply the set of all possible recommendation slates. We assume a fixed catalog I of recommendable items, so actions are the subsets A ⊆ I such that |A| = k, where k is the slate size. We assume that each item and each slate is recommendable at each state for ease of exposition. However, our methods apply readily when certain items cannot be recommended at particular states, by specifying a feasible item set I(s) for each state s and restricting actions to subsets of I(s). If additional constraints are placed on slates, so that the feasible slates form a strict subset of those defined over I(s), these can be incorporated into the slate optimization problem at both training and serving time. (We briefly describe where relevant adjustments are needed in our algorithms when we present them. We also note that our methods work equally well when the feasible set of slates is stochastic (but stationary) as in [Boutilier et al., 2018].) We do not account for positional bias or ordering effects within the slate in this work, though such effects can be incorporated into the choice model (see below).
To account for the fact that a user may select no item from a slate, we assume that every slate includes a (k+1)st null item, denoted ⊥. This is standard in most choice modeling work and makes it straightforward to specify all user behavior as if induced by a choice from the slate.
Transition probability P(s' | s, A) reflects the probability that the user transitions to state s' when action A is taken at user state s. This generally reflects uncertainty in both user response and the future contextual or environmental state. One of the most critical points of uncertainty pertains to the probability with which a user will consume a particular recommended item from the slate. As such, choice models play a critical role in evaluating the quality of a slate, as we detail in the next section.
Finally, the reward function R(s, A) captures the expected reward of a slate A, which measures the expected degree of user engagement with items on the slate. Naturally, this expectation must account for the uncertainty in user response.
Our aim is to find an optimal slate recommendation as a function of the state. A (stationary, deterministic) policy π : S → A dictates the action π(s) to be taken (i.e., slate to recommend) at any state s. The value function V^π of a fixed policy π is:

    V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V^π(s') .    (4)

The corresponding action value, or Q-function, reflects the value of taking an action A at state s and then acting according to π:

    Q^π(s, A) = R(s, A) + γ Σ_{s'} P(s' | s, A) V^π(s') .    (5)

The optimal policy π* maximizes expected value uniformly over S, and its value—the optimal value function V*—is given by the fixed point of the Bellman equation:

    V*(s) = max_A [ R(s, A) + γ Σ_{s'} P(s' | s, A) V*(s') ] .    (6)

The optimal Q-function is defined similarly:

    Q*(s, A) = R(s, A) + γ Σ_{s'} P(s' | s, A) max_{A'} Q*(s', A') .    (7)

The optimal policy satisfies π*(s) = argmax_A Q*(s, A).
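To make the Bellman fixed point concrete, here is a minimal value-iteration sketch for a small explicit MDP; the two-state problem, its rewards and its transitions are purely illustrative:

```python
def value_iteration(states, actions, R, P, gamma=0.9, iters=200):
    """Fixed-point iteration of the Bellman equation, Eq. (6), for a
    small explicit MDP.  R[s][a] is the expected reward of action a in
    state s; P[s][a][s2] is the transition probability to s2."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(R[s][a] + gamma * sum(P[s][a][s2] * V[s2]
                                          for s2 in states)
                    for a in actions)
             for s in states}
    return V

# Two-state toy problem (illustrative numbers only): action 'y' moves to
# state 1, action 'x' moves to state 0.
states, actions = [0, 1], ['x', 'y']
R = {0: {'x': 1.0, 'y': 0.0}, 1: {'x': 0.0, 'y': 2.0}}
P = {0: {'x': {0: 1.0, 1: 0.0}, 'y': {0: 0.0, 1: 1.0}},
     1: {'x': {0: 1.0, 1: 0.0}, 'y': {0: 0.0, 1: 1.0}}}
V = value_iteration(states, actions, R, P)
```

Here the optimal policy repeatedly takes 'y' to collect reward 2 in state 1, so V converges to V(1) = 2/(1 − γ) = 20 and V(0) = γ·V(1) = 18.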
When transition and reward models are both provided, optimal policies and value functions can be computed using a variety of methods [Puterman, 1994], though these generally require approximation in large state/action problems [Bertsekas and Tsitsiklis, 1996]. With sampled data, RL methods such as TD-learning [Sutton, 1988], SARSA [Rummery and Niranjan, 1994, Sutton, 1996] and Q-learning [Watkins and Dayan, 1992] can be used (see [Sutton and Barto, 1998] for an overview). Assume training data of the form (s, A, r, s', A') representing observed transitions and rewards generated by some policy π. The Q-function can be estimated using SARSA updates of the form:

    Q^{(t+1)}(s, A) ← (1 − α) Q^{(t)}(s, A) + α [ r + γ Q^{(t)}(s', A') ] ,    (8)

where Q^{(t)} represents the t-th iterative estimate of Q^π and α is the learning rate. SARSA, Eq. (8), is on-policy and estimates the value of the data-generating policy π. However, if the policy has sufficient exploration or other forms of stochasticity (as is common in large recommender systems), acting greedily w.r.t. the estimated Q-function, and using the data so generated to train a new Q-function, will implement a policy improvement step [Sutton and Barto, 1998]. With repetition—i.e., if the updated Q-function is used to make recommendations (with some exploration), from which new training data is generated—the process will converge to the optimal Q-function. Note that acting greedily w.r.t. the estimated Q-function requires the ability to compute optimal slates at serving time. In what follows, we use the term SARSA to refer to the (on-policy) estimation of the Q-function of a fixed policy π, i.e., the TD-prediction problem on state-action pairs. (SARSA is often used to refer to the on-policy control method that includes making policy improvement steps. We use it simply to refer to the TD method based on SARSA updates as in Eq. (8).)

The optimal Q-function can be estimated directly in a related fashion:

    Q^{(t+1)}(s, A) ← (1 − α) Q^{(t)}(s, A) + α [ r + γ max_{A'} Q^{(t)}(s', A') ] ,    (9)

where Q^{(t)} represents the t-th iterative estimate of Q*. Q-learning, Eq. (9), is off-policy and directly estimates the optimal Q-function (again, assuming suitable randomness in the data-generating policy). Unlike SARSA, Q-learning requires that one compute optimal slates at training time, not just at serving time.
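For concreteness, the tabular forms of updates (8) and (9) can be sketched as follows, treating each slate as an atomic, opaque action (which is precisely the combinatorial difficulty addressed in the next section); all state/action encodings and values are illustrative:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy TD update, Eq. (8): bootstrap on the action a2 actually
    taken by the data-generating policy in the next state."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update, Eq. (9): bootstrap on the maximizing action
    in the next state, regardless of what the policy actually did."""
    best = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

Q = defaultdict(float)  # tabular Q-values, default 0
sarsa_update(Q, s=0, a='A', r=1.0, s2=1, a2='B')
q_learning_update(Q, s=1, a='B', r=0.5, s2=0, actions=['A', 'B'])
```

When the actions are slates, both the table size and the `max` over `actions` grow combinatorially, motivating the decomposition that follows.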
4 SlateQ: Slate Decomposition for RL
One key challenge in the formulation above is the combinatorial nature of the action space, consisting of all (ordered) k-sets over I. This poses three key difficulties for RL methods. First, the sheer size of the action space makes sufficient exploration impractical. It will generally be impossible to execute all slates even once at any particular state, let alone satisfy the sample complexity requirements of TD methods. Second, generalization of Q-values across slates is challenging without some compressed representation. While a slate could be represented as the collection of features of its constituent items, this imposes greater demands on sample complexity; we may further desire greater generalization capabilities. Third, we must solve the combinatorial optimization problem of finding a slate with maximum Q-value—this is a fundamental part of Q-learning and a necessary component of any form of policy improvement. Without significant structural assumptions or approximations, such optimization cannot meet the real-time latency requirements of production recommender systems (often on the order of tens of milliseconds).
In this section, we develop SlateQ, a model that allows the Qvalue of a slate to be decomposed into a combination of the itemwise Qvalues of its constituent items. This decomposition exposes precisely the type of structure needed to allow effective exploration, generalization and optimization. We focus on the SlateQ decomposition in this section—the decomposition itself immediately resolves the exploration and generalization concerns. We defer discussion of the optimization question to Section 5.
Our approach depends to some extent on the nature of the user choice model, but critically on the interaction it has with subsequent user behavior, specifically, how it influences both expected engagement (i.e., reward) and user latent state (i.e., state transition probabilities). We require two assumptions to derive the SlateQ decomposition.

Single choice (SC): A user consumes a single item from each slate (which may be the null item ⊥).

Reward/transition dependence on selection (RTDS): The realized reward (user engagement) depends (perhaps stochastically) only on the item i consumed by the user (which may also be the null item ⊥). Similarly, the state transition depends only on the consumed item i.
Assumption SC implies that the user's selection from slate A, including the null item ⊥, consists of exactly one item. While potentially limiting in some settings, in our application (see Section 9), users can consume only one content item at a time. Returning to the slate for a second item is modeled and logged as a separate event, with the user making a selection in a new state that reflects engagement with the previously selected item. As such, SC is valid in our setting. (Domains in which the user can select multiple items without first engaging with them (i.e., without inducing some change in state) would be more accurately modeled by allowing multiple selection. Our SlateQ model can be extended with a simple correction term to accurately model user selection of multiple items by assuming conditional independence of item-choice probabilities given the state.) Letting R(s, A, i) denote the reward when a user in state s, presented with slate A, selects item i, and P(s' | s, A, i) the corresponding probability of a transition to s', the SC assumption allows us to express immediate rewards and state transitions as follows:
    R(s, A) = Σ_{i ∈ A} P(i | s, A) R(s, A, i) ,    (10)
    P(s' | s, A) = Σ_{i ∈ A} P(i | s, A) P(s' | s, A, i) .    (11)
The RTDS assumption is also realistic in many recommender systems, especially with respect to immediate reward. It is typically the case that a user's engagement with a selected item is not influenced to a great degree by the options in the slate that were not selected. The transition assumption also holds in recommender systems where direct user interaction with items drives user utility, overall satisfaction, new interests, etc., and hence is the primary determinant of the user's underlying latent state. Of course, in some recommender domains, unconsumed items in the slate (say, impressions of content descriptions, thumbnails, clips, etc.) may themselves create, say, future curiosity, which should be reflected by changes in the user's latent state. But even in such cases, RTDS may be treated as a reasonable simplifying assumption, especially where such impressions have significantly less impact on the user than the consumed items themselves. The RTDS assumption can be stated as:
    R(s, A, i) = R(s, i) ,    (12)
    P(s' | s, A, i) = P(s' | s, i) .    (13)
Our decomposition of (on-policy) Q-functions for a fixed data-generating policy π relies on an item-wise auxiliary function Q̄^π(s, i), which represents the LTV of a user consuming item i, i.e., the LTV of i conditional on it being clicked. Under RTDS, this function is independent of the slate A from which i was selected. We define:

    Q̄^π(s, i) = R(s, i) + γ Σ_{s'} P(s' | s, i) V^π(s') .    (14)
Incorporating the SC assumption, we immediately have:
Proposition 1.
Under SC and RTDS, the Q-function of policy π decomposes as:

    Q^π(s, A) = Σ_{i ∈ A} P(i | s, A) Q̄^π(s, i) .    (15)

Proof.
Substituting Eqs. (10)–(13) into Eq. (5) gives:

    Q^π(s, A) = Σ_{i ∈ A} P(i | s, A) R(s, i) + γ Σ_{s'} Σ_{i ∈ A} P(i | s, A) P(s' | s, i) V^π(s')    (16)
              = Σ_{i ∈ A} P(i | s, A) [ R(s, i) + γ Σ_{s'} P(s' | s, i) V^π(s') ]    (17)
              = Σ_{i ∈ A} P(i | s, A) Q̄^π(s, i) .    (18)
∎
This simple result gives a complete decomposition of slate Q-values into Q-values for individual items. Thus, the combinatorial challenges disappear if we can learn the item-wise Q̄^π using TD methods. Notice also that the decomposition exploits the existence of a known choice function. But apart from knowing it (and using it in the Q-updates that follow), we make no particular assumptions about the choice model apart from SC. We note that learning choice models from user selection data is generally quite routine. We discuss specific choice functions in the next section and how they can be exploited in optimization.
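As a numerical sanity check of the decomposition, one can build a toy instance and verify that the directly computed slate Q-value matches the choice-weighted sum of item-level values. All quantities below, including the uniform choice model and the fixed policy, are illustrative:

```python
from itertools import combinations

states, items, gamma = [0, 1], ['a', 'b', 'c'], 0.9
# Illustrative item-level rewards R(s, i) and transitions P(s' | s, i):
R = {(0, 'a'): 1.0, (0, 'b'): 0.5, (0, 'c'): 0.0,
     (1, 'a'): 0.2, (1, 'b'): 0.8, (1, 'c'): 0.4}
P = {(s, i): {0: 0.5, 1: 0.5} for s in states for i in items}
choice = lambda s, A: {i: 1.0 / len(A) for i in A}   # uniform choice model
pi = {0: ('a', 'b'), 1: ('b', 'c')}                  # fixed slate policy

# Evaluate V^pi by iterating Eq. (4), with R(s,A), P(.|s,A) from (10)-(13):
V = {s: 0.0 for s in states}
for _ in range(500):
    V = {s: sum(choice(s, pi[s])[i] *
                (R[s, i] + gamma * sum(P[s, i][s2] * V[s2] for s2 in states))
                for i in pi[s])
         for s in states}

# Item-level LTVs, Eq. (14):
qbar = {(s, i): R[s, i] + gamma * sum(P[s, i][s2] * V[s2] for s2 in states)
        for s in states for i in items}

def q_direct(s, A):       # Eq. (5), with slate-level R and P via (10)-(11)
    r = sum(choice(s, A)[i] * R[s, i] for i in A)
    pv = sum(choice(s, A)[i] * P[s, i][s2] * V[s2] for i in A for s2 in states)
    return r + gamma * pv

def q_decomposed(s, A):   # Proposition 1, Eq. (18)
    return sum(choice(s, A)[i] * qbar[s, i] for i in A)

for s in states:
    for A in combinations(items, 2):
        assert abs(q_direct(s, A) - q_decomposed(s, A)) < 1e-6
```

The assertions pass for every state and slate, as Proposition 1 requires; the equality holds for any choice model satisfying SC, not just the uniform stand-in used here.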
TD-learning of the item-level Q̄ function can be realized using a very simple Q-update rule. Given a consumed item i at state s with observed reward r, a transition to s', and selection of slate A' at s', we update as follows:

    Q̄(s, i) ← (1 − α) Q̄(s, i) + α [ r + γ Σ_{j ∈ A'} P(j | s', A') Q̄(s', j) ] .    (19)
The soundness of this update follows immediately from Eq. (14).
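A tabular sketch of update (19); the choice model is supplied as a function, with a uniform stand-in here in place of a learned model, and all names and values are illustrative:

```python
from collections import defaultdict

def expected_next_value(qbar, s2, slate2, choice_probs):
    """Expectation over the user's choice from the next slate,
    sum_j P(j | s', A') * Qbar(s', j), as in Eq. (19)."""
    p = choice_probs(s2, slate2)
    return sum(p[i] * qbar[(s2, j)] for i, j in enumerate(slate2))

def slateq_sarsa_update(qbar, s, item, r, s2, slate2, choice_probs,
                        alpha=0.1, gamma=0.9):
    """Item-level SlateQ (SARSA) update, Eq. (19): on observing that the
    user consumed `item` in state s with reward r, bootstrap on the
    expected item-level value of the slate actually shown at s'."""
    target = r + gamma * expected_next_value(qbar, s2, slate2, choice_probs)
    qbar[(s, item)] += alpha * (target - qbar[(s, item)])

qbar = defaultdict(float)
uniform = lambda s, slate: [1.0 / len(slate)] * len(slate)  # stand-in model
slateq_sarsa_update(qbar, s=0, item='x', r=2.0, s2=1,
                    slate2=['x', 'y'], choice_probs=uniform)
```

Note that the table is indexed by (state, item) pairs, not (state, slate) pairs, which is the source of the exploration and generalization gains discussed next.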
Our decomposed SlateQ update facilitates more compact Qvalue models, using items as action inputs rather than slates. This in turn allows for greater generalization and data efficiency. Critically, while SlateQ learns itemlevel Qvalues, it can be shown to converge to the correct slate Qvalues under standard assumptions:
Proposition 2.
Under standard assumptions on learning rate schedules and state-action exploration [Sutton, 1988, Dayan, 1992, Sutton and Barto, 1998], and the assumptions on user choice probabilities, state transitions, and rewards stated in the text above, SlateQ—using update (19) and the definition of slate value (18)—will converge to the true slate Q-function Q^π and support greedy policy improvement of π.
Proof.
(Brief sketch.) Standard proofs of convergence of TD(0), applied to the state-action function Q̄, apply directly, with the exception of the introduction of the direct expectation over user choices, i.e., Σ_{j ∈ A'} P(j | s', A') Q̄(s', j), rather than the use of sampled choices. (We note that sampled choice could also be used in the full on-policy setting, but it is problematic for optimization/action maximization, as we discuss below.) However, it is straightforward to show that incorporating the explicit expectation does not impact the convergence of TD(0) (see, for example, the analysis of expected SARSA [Van Seijen et al., 2009]). There is some additional impact of user choice on exploration policies as well—if the choice model is such that some item i has choice probability zero for all slates containing it in some state s, we will not experience user selection of item i at state s under π (for value prediction of Q^π this is not problematic, but it is for learning an optimal Q̄). Thus the exploration policy must account for the choice model, either by sampling all slates at each state (which is very inefficient), or by configuring exploratory slates that ensure each item is sampled sufficiently often. For most common choice models (see discussion below), every item has non-zero probability of selection, in which case standard action-level exploration conditions apply. ∎
Notice that update (19) requires the use of a known choice model. Such choice models are quite commonly learned in ML-based recommender systems, as we discuss further in Section 8. The introduction of this expectation, rather than relying on sampled user choices, can be viewed as reducing variance in the estimates, much like expected SARSA, as discussed by Sutton and Barto [1998] and analyzed formally by Van Seijen et al. [2009]. Furthermore, it is straightforward to show that the standard SARSA(0) algorithm (with policy improvement steps) will converge to the optimal Q-function, subject to the considerations mentioned above, using standard techniques [Singh et al., 2000, Van Seijen et al., 2009].

The decomposition can be applied to Q-learning of the optimal Q-function as well, requiring only a straightforward modification of Eq. (14) to obtain $\bar{Q}^*(s, i)$, the optimal (off-policy) conditional-on-click item-wise Q-function; specifically, we replace the successor value under the fixed policy $\pi$ with the maximum over successor slates, $\max_{A' \in \mathcal{A}} Q(s', A')$ (the proof is analogous to that of Proposition 1):
Proposition 3.
$Q^*(s, A) = \sum_{i \in A} P(i \mid s, A)\,\bar{Q}^*(s, i)$.
Likewise, extending the decomposed update Eq. (19) to full Q-learning requires only that we introduce the usual maximization:

$\bar{Q}(s, i) \leftarrow (1 - \alpha)\,\bar{Q}(s, i) + \alpha \Big( r + \gamma \max_{A' \in \mathcal{A}} \sum_{j \in A'} P(j \mid s', A')\,\bar{Q}(s', j) \Big) \qquad (20)$
As above, it is not hard to show that Q-learning using this update will converge, using standard techniques [Watkins and Dayan, 1992, Van Seijen et al., 2009] and with considerations similar to those discussed in the proof sketch of Proposition 2:
Proposition 4.
Under standard assumptions on learning rate schedules and sufficient exploration [Sutton and Barto, 1998], and the assumptions on user choice probabilities, state transitions, and rewards stated in the text above, SlateQ—using update (20) and the definition of slate value in Proposition 3—will converge to the optimal slate Q-function $Q^*$.
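To make the decomposed update concrete, the following is a minimal tabular sketch of the Q-learning variant (Eq. (20)). The choice-model interface `choice_probs`, the tabular `q_bar` dictionary, and all parameter values are illustrative assumptions, not the paper's implementation.

```python
import itertools
from collections import defaultdict

def slate_value(q_bar, choice_probs, s, slate):
    """Decomposed slate value: sum_j P(j | s, slate) * Qbar(s, j)."""
    probs = choice_probs(s, slate)  # dict: item -> selection probability
    return sum(probs[j] * q_bar[(s, j)] for j in slate)

def slateq_qlearning_update(q_bar, s, i, r, s_next, items, k,
                            choice_probs, alpha=0.1, gamma=0.9):
    """One update of the item-wise Q-value for the clicked item i:
    the target bootstraps on the best successor slate (Eq. (20))."""
    best_next = max(slate_value(q_bar, choice_probs, s_next, a)
                    for a in itertools.combinations(items, k))
    q_bar[(s, i)] = ((1 - alpha) * q_bar[(s, i)]
                     + alpha * (r + gamma * best_next))

# Toy conditional choice model: uniform over slate items plus a null option.
def uniform_choice(s, slate):
    p = 1.0 / (len(slate) + 1)
    return {j: p for j in slate}

q_bar = defaultdict(float)
slateq_qlearning_update(q_bar, s=0, i=5, r=1.0, s_next=1,
                        items=[5, 6], k=1, choice_probs=uniform_choice)
```

In a realistic setting the tabular dictionary would be replaced by a DNN over (state, item) features; the enumeration of all $k$-item slates inside the max is exactly the combinatorial step that Section 5 replaces with the LP, top-k, or greedy methods.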
The decomposition of both the policy-based and optimal Q-functions above accomplishes two of our three desiderata: it circumvents the natural combinatorics of both exploration and generalization. But we still face the combinatorics of action maximization: the LTV slate optimization problem is the combinatorial problem of selecting the optimal slate from $\mathcal{A}$, the space of all possible (ordered) $k$-sets over the item set $\mathcal{I}$. This maximization is required during training with Q-learning (Eq. (9)) and when engaging in policy improvement using SARSA. One also needs to solve the slate optimization problem at serving time when executing the induced greedy policy (i.e., presenting slates with maximal LTV to users given a learned Q-function). In the next section, we show that exact optimization is tractable and also develop several heuristic approaches to this problem.
5 Slate Optimization with Q-values
We address LTV slate optimization in this section. We develop an exact linear programming formulation of the problem in Section 5.1 using (a generalization of) the conditional logit model. In Section 5.2, we describe two computationally simpler heuristics for the problem, the top-k and greedy algorithms.
5.1 Exact Optimization
We formulate the LTV slate optimization problem as follows:

$\max_{A \in \mathcal{A}} \; \sum_{i \in A} P(i \mid s, A)\,\bar{Q}(s, i) \qquad (21)$
Intuitively, a user makes her choice from the slate based on the perceived properties (e.g., attractiveness, quality, topic, utility) of the constituent items. In the LTV slate optimization problem, we value the selection of an item from the slate based on its LTV (Q-value) rather than its immediate appeal to the user. As discussed above, we assume access to the choice model $P(i \mid s, A)$, since models (e.g., pCTR models) predicting user selection from a slate are commonly used in myopic recommenders. Of course, the computational solution of the slate optimization problem depends on the form of the choice model. We discuss the use of the conditional logit model (CLM) in SlateQ (and the more general form, Eq. (2)) in this subsection.
When using the conditional logit model (see Eq. (1)), the LTV slate optimization problem is analogous in a formal sense to assortment optimization or product line design [Chen and Hausman, 2000, Schön, 2010, Rusmevichientong and Topaloglu, 2012], in which a retailer designs or stocks a set of products whose expected revenue or profit is maximized, assuming that consumers select products based on their appeal (and not their value to the retailer). [Footnote 6: Naturally, there are more complex variants of assortment optimization, including the choice of price, inclusion of fixed production or inventory costs, etc. There are other conceptual differences with our model as well. While not a formal requirement, the LTV of an item in our setting reflects user engagement, hence some form of user satisfaction, as opposed to direct value to the recommender. In addition, many assortment optimization models are designed for consumer populations, hence choice probabilities often reflect diversity in the population (though random selection by individual consumers is sometimes considered as well); by contrast, in the recommender setting, choice probabilities usually depend on the features of individual users and typically reflect the recommender's uncertainty about a user's immediate intent or context.]
Our optimization formulation is suited to any general conditional choice model of the form Eq. (2) (of which the conditional logit is an instance). [Footnote 7: We note that since the ordering of items within a slate does not impact choice probabilities in this model, the action (or slate) space consists of the unordered $k$-sets in this case.] We assume a user in state $s$ selects item $i$ with unnormalized probability $v(s, i)$, for some function $v$. In the case of the conditional logit, $v(s, i)$ is the exponentiated utility of the item. We can express the optimization Eq. (21) w.r.t. such a $v$ as a fractional mixed-integer program (MIP), with binary variables $x_i$ for each item $i$ indicating whether $i$ occurs in slate $A$ (here $\perp$ denotes the null, no-click option):

max $\dfrac{\sum_{i \in \mathcal{I}} x_i\, v(s, i)\, \bar{Q}(s, i)}{v(s, \perp) + \sum_{i \in \mathcal{I}} x_i\, v(s, i)}$  (22)

s.t. $\sum_{i \in \mathcal{I}} x_i = k, \qquad x_i \in \{0, 1\}$  (23)
This is a variant of a classic product-line (or assortment) optimization problem [Chen and Hausman, 2000, Schön, 2010]. Our problem is somewhat simpler since there are no fixed resource costs or per-item costs.
Chen and Hausman [2000] show that the binary indicators in this MIP can be relaxed to obtain the following fractional linear program (LP):

max $\dfrac{\sum_{i \in \mathcal{I}} x_i\, v(s, i)\, \bar{Q}(s, i)}{v(s, \perp) + \sum_{i \in \mathcal{I}} x_i\, v(s, i)}$  (24)

s.t. $\sum_{i \in \mathcal{I}} x_i = k, \qquad 0 \le x_i \le 1$  (25)
The constraint matrix in this relaxed problem is totally unimodular, so the optimal (vertex) solution is integral and standard methods for fractional programs can be used. However, since it is a fractional LP, it is directly amenable to the Charnes and Cooper [1962] transformation and can be recast as a (non-fractional) LP. To do so, we introduce an additional variable $t$ that implicitly represents the inverse of the total choice weight of the selected items (including the null option), and auxiliary variables $y_i$ that represent the products $t\,x_i$, giving the following LP:
max $\sum_{i \in \mathcal{I}} y_i\, v(s, i)\, \bar{Q}(s, i)$  (26)

s.t. $t\, v(s, \perp) + \sum_{i \in \mathcal{I}} y_i\, v(s, i) = 1$  (27)

$\sum_{i \in \mathcal{I}} y_i = k\,t, \qquad 0 \le y_i \le t$  (28)
The optimal solution to this LP yields the optimal assignment in the fractional LP Eq. (24) via $x_i = y_i / t$, which in turn gives the optimal slate in the original fractional MIP Eq. (22): simply add to the slate any item $i$ with $x_i = 1$. This formulation applies equally well to the MNL model and related random utility models. The slate optimization problem is now immediately proven to be polynomial-time solvable.
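As a concreteness check, the transformed LP can be solved with an off-the-shelf solver. The sketch below uses `scipy.optimize.linprog` with hypothetical scores and Q-values; it is illustrative under the stated variable encoding, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_slate(v, q, v_null, k):
    """Solve the LP of Eqs. (26)-(28) over variables (y_1..y_n, t) and
    recover the optimal slate via x_i = y_i / t (vertices are integral)."""
    n = len(v)
    v, q = np.asarray(v, float), np.asarray(q, float)
    c = np.append(-v * q, 0.0)                       # maximize sum_i y_i v_i q_i
    A_eq = np.array([np.append(v, v_null),           # t*v_null + sum_i y_i v_i = 1
                     np.append(np.ones(n), -k)])     # sum_i y_i = k*t
    b_eq = [1.0, 0.0]
    A_ub = np.hstack([np.eye(n), -np.ones((n, 1))])  # y_i <= t
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=b_eq)
    y, t = res.x[:n], res.x[n]
    return sorted(i for i in range(n) if y[i] > t / 2)

# Hypothetical instance (k = 2): two identical high-value items beat
# one item with a higher individual score * Q-value product.
slate = optimal_slate(v=[2, 1, 1], q=[0.8, 1, 1], v_null=1.0, k=2)
```

Default `linprog` bounds keep all variables nonnegative, and the HiGHS backend returns a vertex solution, so the recovered `x` assignment is integral as the total-unimodularity argument predicts.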
Observation 5.
The LTV slate optimization problem Eq. (21), under the general conditional choice model (including the conditional logit model), is solvable in polynomial time in the number of items (assuming a fixed slate size $k$).
Thus full Q-learning with slates using the SlateQ decomposition imposes at most a small polynomial-time overhead relative to item-wise Q-learning, despite its combinatorial nature. We also note that many production recommender systems limit the set of items to be ranked using a separate retrieval policy, so the set of items to consider in the LP is usually much smaller than the complete item set. We discuss this further in Section 8.
5.2 Top-k and Greedy Optimization
While the exact maximization of slates under the conditional choice model can be accomplished in polynomial time using the LP above, we may wish to avoid solving an LP at serving time. A natural heuristic for constructing a slate is to simply add the $k$ items with the highest scores. In our top-k optimization procedure, we insert items into the slate in decreasing order of the product $v(s, i)\,\bar{Q}(s, i)$. [Footnote 8: Top-k slate construction is quite common in slate-based myopic recommenders. It has recently been used in LTV optimization as well [Chen et al., 2018].] This incurs only a small sorting overhead relative to the time required for maximization in item-wise Q-learning.
One problem with top-k optimization is that, when considering the item to add to the $l$th slot (for $l > 1$), item scores are not updated to reflect the items already added to the slate. Greedy optimization, instead of scoring each item ab initio, updates item scores with respect to the current partial slate. Specifically, given a partial slate $A'$ of size $l - 1$, the $l$th item added is that with maximum marginal contribution:

$\arg\max_{i \notin A'} \; \sum_{j \in A' \cup \{i\}} P\big(j \mid s, A' \cup \{i\}\big)\,\bar{Q}(s, j)$
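The two heuristics can be sketched as follows. The helper `value` evaluates the decomposed objective under a generic conditional choice model with unnormalized item weights `v`, item Q-values `q`, and null weight `v_null`; the data layout and names are illustrative assumptions.

```python
def topk_slate(items, v, q, k):
    """Top-k: rank items once by v(s,i) * Qbar(s,i) and take the best k."""
    return sorted(items, key=lambda i: v[i] * q[i], reverse=True)[:k]

def greedy_slate(items, v, q, v_null, k):
    """Greedy: grow the slate one item at a time, always adding the item
    whose inclusion maximizes the value of the resulting partial slate."""
    def value(slate):
        return (sum(v[i] * q[i] for i in slate)
                / (v_null + sum(v[i] for i in slate)))
    slate = []
    for _ in range(k):
        rest = [i for i in items if i not in slate]
        slate.append(max(rest, key=lambda i: value(slate + [i])))
    return slate
```

On the hypothetical instance `v = [2, 1, 1]`, `q = [0.8, 1, 1]` with `v_null = 1` and `k = 2`, both heuristics place item 0 first and return the slate {0, 1} with value 0.65, while the slate {1, 2} has value 2/3, illustrating why neither heuristic is exact.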
We compare top-k and greedy optimization with the LP solution in our offline simulation experiments below.
Under the general conditional choice model (including the conditional logit model), neither top-k nor greedy optimization will find the optimal solution, as the following counterexample illustrates:
Item  Score $v(s, i)$  Q-value $\bar{Q}(s, i)$
$\perp$  1  0
$a$  2  0.8
$b, b'$  1  1

The null item $\perp$ is always on the slate. Items $b$ and $b'$ are identical w.r.t. their behavior. We have $v(s, a)\,\bar{Q}(s, a) = 1.6$, greater than $v(s, b)\,\bar{Q}(s, b) = 1$. Both top-k and greedy will place $a$ on the slate first. However, $Q(s, \{a, b\}) = 2.6/4 = 0.65$, whereas the optimal slate $\{b, b'\}$ is valued at $2/3$. So for slate size $k = 2$, neither top-k nor greedy finds the optimal slate.
While one might hope that the greedy algorithm provides some approximation guarantee, the slate-value set function is not submodular, which prevents standard analyses (e.g., [Nemhauser et al., 1978, Feige, 1998, Buchbinder et al., 2014]) from being applied. The following example illustrates this.
Item  Score $v(s, i)$  Q-value $\bar{Q}(s, i)$
$a$  1  10
$b$  1  10
$c$  2  

Comparing the expected values of the relevant item sets shows that the marginal value of adding an item can increase with the size of the set, demonstrating lack of submodularity (the set function is also not monotone). [Footnote 9: It is worth observing that without our exact cardinality constraint (slates must have size $k$), the optimal set under MNL can be computed in a greedy fashion [Talluri and van Ryzin, 2004] (the analysis also applies to the conditional logit model).]
While we currently have no performance guarantees for greedy and top-k optimization, it is not hard to show that top-k can perform arbitrarily poorly.
Observation 6.
The approximation ratio of top-k optimization for slate construction is unbounded.
The following example demonstrates this.
Item  Score ()  Qvalue 
0  
1  
1 
Top-k scores one item higher than another, yet the resulting slate's value can be made an arbitrarily small fraction of the optimal slate's value.
5.3 Algorithm Variants
With a variety of slate optimization methods at our disposal, many variants of our RL algorithms exist, depending on the optimization method used during training and serving. Given a trained SlateQ model, we can serve users using the top-k, greedy, or LP-based optimal method to generate recommended slates. Below we use the designations TS, GS, and OS to denote these serving protocols, respectively. These designations apply equally to (off-policy) Q-learned models, (on-policy) SARSA models, and even (non-RL) myopic models. [Footnote 10: A myopic model is equivalent to a Q-learned model with $\gamma = 0$.]
During Q-learning, slate optimization is also required at training time to compute the maximum successor-state Q-value (Eq. (20)). This can likewise use any of the three optimization methods, which we designate TT, GT, and OT for top-k, greedy, and optimal (LP) training, respectively. This designation is not applicable when training a myopic model or SARSA (since SARSA is trained only on-policy). This gives us the following collection of algorithms. For Q-learning, we have:
                    Serving
Training    Top-k     Greedy    LP (Opt)
Top-k       QL-TT-TS  QL-TT-GS  QL-TT-OS
Greedy      QL-GT-TS  QL-GT-GS  QL-GT-OS
LP (Opt)    QL-OT-TS  QL-OT-GS  QL-OT-OS
For SARSA and myopic recommenders, we have:

Serving   SARSA     Myopic
Top-k     SARSA-TS  MYOP-TS
Greedy    SARSA-GS  MYOP-GS
LP (Opt)  SARSA-OS  MYOP-OS
In our experiments below we also consider two other baselines: Random, which recommends random slates from the feasible set; and full-slate Q-learning (FSQ), which is a standard, non-decomposed Q-learning method that treats each slate atomically (i.e., holistically) as a single action. The latter is a useful baseline to test whether the SlateQ decomposition provides leverage for generalization and exploration.
5.4 Approaches for Other Choice Models
The SlateQ decomposition works with any choice model that satisfies assumptions SC and RTDS, though the form of the slate optimization problem depends crucially on the choice model. To illustrate, we consider the cascade choice model outlined in Section 2 (see, e.g., Eq. (3)). Notice that the cascade model, unlike the general conditional choice model, has position-dependent effects (i.e., reordering the items in the slate changes selection probabilities and the expected LTV of the slate). However, it is not hard to show that the cascade model exhibits a form of "ordered submodularity" if we assume that the LTV or conditional Q-value of not selecting from the slate is no greater than the Q-value of selecting any item on the slate, i.e., if $\bar{Q}(s, \perp) \le \bar{Q}(s, i)$ for all $i$. [Footnote 11: The statements to follow hold under the weaker condition that, for all states $s$, there are at least $k$ items $i$ such that $\bar{Q}(s, i) \ge \bar{Q}(s, \perp)$ (where $k$ is the slate size).] Specifically, the marginal increase in value induced by adding item $i$ to the end of an (ordered) partial slate is no greater than the increase in value of adding $i$ to any prefix of that slate. Thus top-k optimization can be used to support training and serving of the SlateQ approach under the cascade model. [Footnote 12: It is also not hard to show that top-k is not optimal for the cascade model.]
While the general conditional choice model is order-independent, in practice users may incorporate some elements of a cascade-like model into the conditional choice model. For example, users may devote a random amount of time or effort to inspecting a slate of recommended items, compare the top $m$ items, where $m$ is some function of the available time, and select (perhaps noisily) the most preferred item from among those inspected. This model would be a reasonable approximation of user behavior in recommenders with scrolling interfaces, for example. In such a case, we end up with a distribution over effective slate sizes. A natural heuristic for the conditional choice model would be, once the slate is selected, to order the items on the slate by their top-k or greedy scores, to increase the odds that the random (effective) slate actually observed by the user contains the items that induce the highest expected long-term engagement.
6 User Simulation Environment
We discuss experiments with the various SlateQ algorithms in Section 7, using a simulation environment that, though simplified and stylized, captures several essential elements of a typical recommender system that drive the need for the long/short-term tradeoffs captured by RL methods. In this section, we describe in detail the simulation environment and models used to test SlateQ. We describe the environment setup in a fairly general way, as well as the specific instantiations used in our experiments, since the simulation environment may be of broader interest.
6.1 Document and Topic Model
We assume a set $D$ of documents representing the content available for recommendation, and a set $T$ of topics (or user interests) that capture fundamental characteristics of interest to users; topics are indexed $1, \dots, |T|$. Each document $d \in D$ has an associated topic vector $T_d \in [0, 1]^{|T|}$, where $T_d(t)$ is the degree to which $d$ reflects topic $t$.
In our experiments, for simplicity, each document $d$ has only a single topic $t(d)$, so $T_d(t(d)) = 1$ and $T_d(t') = 0$ for $t' \ne t(d)$ (i.e., we have a one-hot encoding of the document topic). Documents are drawn from a content distribution over topic vectors, which in our one-hot topic experiments is simply a distribution over individual topics. Each document also has a length $l$ (e.g., the length of a video, music track, or news article). This is sometimes used as one factor in assessing potential user engagement. While the model supports documents of different lengths, in our experiments we assume each document has the same constant length $l$.
Documents also have an inherent quality $L_d$, representing topic-independent attractiveness to the average user. Quality varies randomly across documents, with document $d$'s quality distributed according to $\mathcal{N}(\mu_{t(d)}, \sigma^2)$, where $\mu_t$ is a topic-specific mean quality for each topic $t$. For simplicity, we assume a fixed variance across all topics. In general, quality can be estimated over time from user responses, as we discuss below; but in our experiments, we assume quality is observable to the recommender system (but not to the user a priori; see below). Quality may also be user-dependent, though we do not consider that here, since the focus of our stylized experiments is on the ability of our RL methods to learn average quality at the topic level. Both the topic and quality of a consumed document impact long-term user behavior (see Section 6.4 below).
In our experiments, we use a fixed set of topics, while the precise number of documents is immaterial, as we will see. A majority of the topics are low quality, with their mean qualities evenly distributed across an interval of negative values; the remaining topics are high quality, with their mean qualities evenly distributed across an interval of positive values.
6.2 User Interest and Satisfaction Models
Users have various degrees of interest in topics, ranging from $-1$ (completely uninterested) to $1$ (fully interested), with each user $u$ associated with an interest vector $I_u \in [-1, 1]^{|T|}$. User $u$'s interest in document $d$ is given by the dot product $I(u, d) = I_u \cdot T_d$. We assume some prior distribution over user interest vectors, but user $u$'s interest vector is dynamic, i.e., influenced by her document consumption (see below). To focus on how our RL methods learn to influence user interests and the quality of documents consumed, we treat a user's interest vector as fully observable to the recommender system. In general, user interests are latent, and a partially observable/belief-state model would be more appropriate.
A user's satisfaction $S(u, d)$ with a consumed document is a function of user $u$'s interest $I(u, d)$ and document $d$'s quality $L_d$. While this function may be quite complex in general, we assume a simple convex combination $S(u, d) = (1 - \alpha)\,I(u, d) + \alpha\,L_d$. Satisfaction influences user dynamics as we discuss below.
In our experiments, a new user's prior interest vector is sampled uniformly from $[-1, 1]^{|T|}$; specifically, there is no prior correlation across topics. We use the extreme value $\alpha = 1$, so that a user's satisfaction with a consumed document is fully dictated by document quality. This leaves user interest only to drive the selection of the document from the slate, which we describe next.
6.3 User Choice Model
When presented with a slate of documents, a user choice model determines which document (if any) from the slate is consumed by the user. We assume that a user can observe any recommended document's topic prior to selection, but cannot observe its quality before consumption. However, the user observes the true document quality after consuming it. While somewhat stylized, this treatment of topic and quality observability (from the user's perspective) is reasonably well-aligned with the situation in many recommendation domains.
The general simulation environment allows arbitrary choice functions to be defined as a function of the user's state (interest vector, satisfaction) and the features of the documents (topic vector, quality) in the slate. In our experiments, we use the general conditional choice model (Eq. (2)) as the main model for our RL methods. User $u$'s interest in document $d$, $I(u, d)$, defines the document's relative appeal to the user and serves as the basis of the choice function. For slates of size $k$, the null document $\perp$ is always added as a $(k+1)$st element, which (for simplicity) has a fixed utility across all users.
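A simple way to simulate this choice process is to sample proportionally to appeal, treating the null document as one extra candidate. The sketch below is a hypothetical simulator fragment (nonnegative appeal weights assumed), not the environment's actual code.

```python
import random

def conditional_choice(appeals, null_appeal, rng=random):
    """Sample a slate index (or None for the null document) with
    probability proportional to its nonnegative appeal weight."""
    weights = list(appeals) + [null_appeal]
    r = rng.random() * sum(weights)  # uniform point in [0, total)
    for idx, w in enumerate(weights):
        if r < w:
            return idx if idx < len(appeals) else None
        r -= w
    return None  # unreachable up to floating-point leftovers
```

For example, `conditional_choice([1.0, 0.0], 0.0)` always returns index 0, and `conditional_choice([0.0], 1.0)` always returns `None` (a no-click).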
We also use a second choice model in our experiments, an exponential cascade model [Joachims, 2002], that accounts for document position on the slate. This choice model assumes "attention" is given to one document at a time, with exponentially decreasing attention as the user moves down the slate. The probability that the document in position $j$ is inspected is $p\,\beta^{j}$, where $p$ is a base inspection probability and $\beta$ is the inspection decay. If a document is given attention, it is selected with its base choice probability; if the document in position $j$ is not examined or selected/consumed, the user proceeds to the $(j+1)$st document. Letting $q(d_j)$ denote the base choice probability of the document $d_j$ in position $j$, the probability that $d_j$ is consumed is:

$P(\text{consume } d_j) = p\,\beta^{j}\, q(d_j) \prod_{j' < j} \big(1 - p\,\beta^{j'} q(d_{j'})\big)$
While we do not optimize for this model, we do run experiments in which the recommender learns a policy that assumes the general conditional choice model, but users behave according to the cascade model. In this case, the base choice probability for a document in the cascade model is set to its normalized probability in the conditional choice model. While the cascade model allows for the possibility of no click, even without the fictitious null document $\perp$, we keep the null document to allow the probabilities to remain calibrated relative to the conditional model. In our experiments, we use fixed values of the base inspection probability $p$ and decay $\beta$.
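Under this description, the position-wise consumption probabilities can be computed by walking down the slate. The function below is a sketch of that computation; the parameter values in the example are arbitrary, not those used in the experiments.

```python
def cascade_consumption_probs(base_probs, p, beta):
    """Consumption probability for each slate position j under the
    exponential cascade model: position j is inspected with probability
    p * beta**j and, if inspected, consumed with its base choice
    probability; otherwise the user moves on to position j + 1."""
    probs, reach = [], 1.0  # reach = P(nothing consumed so far)
    for j, q in enumerate(base_probs):
        consume_here = p * beta ** j * q
        probs.append(reach * consume_here)
        reach *= 1.0 - consume_here
    return probs

probs = cascade_consumption_probs([0.5, 0.5], p=1.0, beta=0.5)
```

Here `probs` is `[0.5, 0.125]`; the residual mass (0.375) is the probability of no click, which the cascade model permits even without a null document.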
6.4 User Dynamics
To allow a non-myopic recommendation algorithm—in our case, RL methods—to positively impact overall user engagement, we adopt a simple but natural model of session termination. We assume each user $u$ has an initial budget $B_u$ of time to engage with content during an extended session. This budget is not observable to the recommender system, and is randomly realized at session initiation using some prior distribution. [Footnote 13: Naturally, other models that do not use terminating sessions are possible, and could emphasize amount of engagement per period.] Each document consumed reduces user $u$'s budget by the fixed document length $l$. But after consumption, the quality of the document (partially) replenishes the used budget: the budget decreases by the fixed document length $l$ less a bonus $b$ that increases with the user's satisfaction $S(u, d)$. In effect, more satisfying documents decrease the time remaining in a session at a lower rate. In particular, for any fixed topic, documents with higher quality have a higher positive impact on cumulative engagement (they reduce the budget less quickly) than lower-quality documents. A session ends once $B_u$ reaches 0. Since sessions terminate with probability 1, discounting is unnecessary.
In our experiments, each user's initial budget is a fixed number of time units; each consumed document uses $l$ units; and if a slate is recommended but no document is clicked, a smaller fixed amount of time is consumed. The bonus $b$ is set to increase with the satisfaction $S(u, d)$.
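The budget dynamics can be summarized in a few lines. The linear bonus form and the coefficient below are assumptions for illustration; the text specifies only that the bonus increases with satisfaction.

```python
def update_budget(budget, doc_length, satisfaction, bonus_coef=0.9):
    """Consume one document: the budget drops by the document length,
    partially offset by a satisfaction-dependent bonus (assumed linear
    here). With bonus_coef * satisfaction < 1 the budget still strictly
    decreases, so sessions terminate with probability 1."""
    bonus = bonus_coef * doc_length * satisfaction
    return budget - doc_length + bonus
```

For instance, with a budget of 200 units, a document of length 4, and satisfaction 0.5, the budget drops to 197.8 rather than 196: higher-quality documents extend the session.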
The second aspect of user dynamics allows user interests to evolve as a function of the documents consumed. When user $u$ consumes document $d$, her interest in topic $t(d)$ is nudged stochastically, biased slightly towards increasing her interest, but with some chance of decreasing it. Thus, a recommender faces a short-/long-term tradeoff: it can nudge a user's interests toward topics that tend to have higher quality, at the expense of short-term consumption of user budget.
We use the following stylized model to set the magnitude of the adjustment (how much the interest in topic $t$ changes) and its polarity (whether the user's interest in topic $t$ increases or decreases). Let $t$ be the topic of the consumed document and $I_t$ be user $u$'s interest in topic $t$ prior to consumption of document $d$. The (absolute) change in user $u$'s interest is $\Delta(I_t) = y\,(1 - |I_t|)$, where $y$ denotes the fraction of the distance between the current interest level and the maximum level ($\pm 1$) that the update moves user $u$'s interest. This ensures that more entrenched interests change less than neutral interests.
In our experiments we use a fixed fraction $y$. A positive change in interest, $I_t \leftarrow I_t + \Delta(I_t)$, occurs with probability $(1 + I_t)/2$, and a negative change, $I_t \leftarrow I_t - \Delta(I_t)$, with probability $(1 - I_t)/2$. Thus positive (resp., negative) interests are more likely to be reinforced, i.e., become more positive (resp., negative), with the odds of such reinforcement increasing with the degree of entrenchment.
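These interest dynamics can be sketched as follows. The reinforcement probability $(1 + I_t)/2$ and the default fraction are concrete assumptions chosen to match the qualitative description above, not values taken from the text.

```python
import random

def nudge_interest(interest, fraction=0.3, rng=random):
    """Stochastically nudge a topic interest in [-1, 1]. The magnitude
    shrinks as the interest becomes entrenched (|interest| near 1); the
    polarity is biased toward reinforcing the current sign."""
    delta = fraction * (1.0 - abs(interest))
    if rng.random() < 0.5 * (1.0 + interest):  # reinforce positively
        interest += delta
    else:
        interest -= delta
    return max(-1.0, min(1.0, interest))
```

A fully entrenched interest ($\pm 1$) has zero-magnitude updates and stays fixed, while a neutral interest (0) moves by the full fraction in either direction with equal probability.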
6.5 Recommender System Dynamics
At each stage of interaction with a user, $m$ candidate documents are drawn from the content distribution, from which a slate of size $k$ must be selected for recommendation. This reflects the common situation in many large-scale commercial recommenders, in which a variety of mechanisms subselect a small set of candidates from a massive corpus, which are in turn scored using more refined (and computationally expensive) predictive models of user engagement.
In our simulation experiments, we use small values of $m$ and $k$. The small candidate set and slate size allow explicit enumeration of all slates, which lets us compare SlateQ to RL methods, like standard Q-learning, that do not decompose the Q-function. In our live experiments on the YouTube platform (see Section 9), slates are of variable size and the number of candidates is considerably larger.
7 Empirical Evaluation: Simulations
We now describe several sets of results designed to assess the impact of the SlateQ decomposition. Our simulation environment is implemented in a general fashion, supporting many of the general models and behaviors described in the previous sections. Our RL algorithms, both those using SlateQ and FSQ, are implemented using Dopamine [Castro et al., 2018]. We use a standard two-tower architecture with stacked fully connected layers to represent user state and document. Updates to the Q-models are made online by batching experiences from user simulations. Each training-serving strategy is evaluated over 5000 simulated users for statistical significance; all reported results lie within 95% confidence intervals.
7.1 Myopic vs. Nonmyopic Recommendations
We first test the quality of (non-myopic) LTV policies learned using SlateQ to optimize engagement ($\gamma > 0$), using a selection of the SlateQ algorithms (SARSA vs. Q-learning, different slate optimizations for training/serving). We compare these to myopic scoring (MYOP) ($\gamma = 0$), which optimizes only for immediate reward, as well as a Random policy. The goal of these comparisons is to identify whether optimizing for long-term engagement using RL (either Q-learning or one-step policy improvement via SARSA) provides benefit over myopic recommendation.
The following table compares several key metrics of the final trained algorithms (all methods use 300K training steps):
Strategy   Avg. Return (%)  Avg. Quality (%)
Random     159.2            -0.5929
MYOP-TS    166.3 (4.46%)    -0.5428 (8.45%)
MYOP-GS    166.3 (4.46%)    -0.5475 (7.66%)
SARSA-TS   168.4 (5.78%)    -0.4908 (17.22%)
SARSA-GS   172.1 (8.10%)    -0.3876 (34.63%)
QL-TT-TS   168.4 (5.78%)    -0.4931 (16.83%)
QL-GT-GS   172.9 (8.61%)    -0.3772 (36.38%)
QL-OT-TS   169.0 (6.16%)    -0.4905 (17.27%)
QL-OT-GS   173.8 (9.17%)    -0.3408 (42.52%)
QL-OT-OS   174.6 (9.67%)    -0.3056 (48.46%)
The LTV methods (SARSA and Q-learning) using SlateQ offer overall improvements in average return per user session. The magnitude of these improvements only tells part of the story: percentage improvements relative to Random are shown in parentheses; Random gives a sense of the baseline level of cumulative reward achievable without any user modeling at all. For instance, relative to the Random baseline, QL-OT-GS provides a 105.6% greater improvement than MYOP. The LTV methods all learn to recommend documents of much higher quality than MYOP, which has a positive impact on overall session length, which in turn explains the improved return per user.
We also see that LP-based slate optimization during training (OT) provides improvements over top-k and greedy optimization (TT, GT) in Q-learning when comparing similar serving regimes (e.g., QL-OT-GS vs. QL-GT-GS, and QL-OT-TS vs. QL-TT-TS). Optimal serving (OS) also shows consistent improvement over top-k and greedy serving—and greedy serving (GS) improves significantly over top-k serving (TS)—when compared under the same training regime. However, the combination of optimal training with top-k or greedy serving performs well, and is especially useful when serving latency constraints are tight, since training is generally done offline.
Finally, optimizing using Q-learning gives better results than on-policy SARSA (i.e., one-step improvement) under comparable training and serving regimes. But SARSA itself has significantly higher returns than MYOP, demonstrating the value of on-policy RL for recommender systems. Indeed, repeatedly serving-then-training (with some exploration) using SARSA would implement a natural, continual policy improvement. These results demonstrate, in this simple synthetic recommender environment, that using RL to plan long-term interactions can provide significant value in terms of overall engagement.
7.2 SlateQ vs. Holistic Optimization
Next we compare the quality of policies learned using the SlateQ decomposition to FSQ, the non-decomposed Q-learning method that treats each slate atomically as a single action. We choose the candidate-set size and slate size to be small enough that we can enumerate all slates for FSQ maximization. Note that the Q-function for FSQ requires representation of all slates as actions, which can impede both exploration and generalization. For SlateQ we test only SARSA-TS (since this is the method tested in our live experiment below). The following table shows our results:
Strategy   Avg. Return (%)  Avg. Quality (%)
Random     160.6            -0.6097
FSQ        164.2 (2.24%)    -0.5072 (16.81%)
SARSA-TS   170.7 (6.29%)    -0.5340 (12.41%)
While FSQ, an off-policy Q-learning method, is guaranteed in theory to converge to the optimal slate policy with sufficient exploration, we see that, even using an on-policy method like SARSA with a single step of policy improvement, SlateQ performs significantly better than FSQ, offering a 180% greater improvement over Random. This is the case despite SlateQ using no additional training-serving iterations to continue policy improvement. The gap arises because FSQ must learn Q-values for 1140 distinct slates, making it difficult to explore and generalize. FSQ also takes roughly 6X the training time of SlateQ over the same number of events. These results demonstrate the considerable value of the SlateQ decomposition.
Improved representations could help FSQ generalize somewhat better, but the approach is inherently unscalable, while SlateQ suffers from no such limitations (see live experiment below). Interestingly, FSQ does converge quickly to a policy that offers recommendations of greater average quality than SlateQ, but fails to make an appropriate tradeoff with user interest.
7.3 Robustness to User Choice
Finally, we test the robustness of SlateQ to changes in the underlying user choice model. Instead of the assumed choice model defined above, users select items from the recommended slate using a simple (exponential) cascade model, in which items on the slate are inspected from top to bottom with a position-specific probability and, if inspected, consumed with probability proportional to their appeal. If an item is not consumed, the next item is inspected. Though users act in this fashion, SlateQ is trained using the original conditional choice model, and the same decomposition is also used to optimize slates at serving time.
The following table shows results:
Strategy   Avg. Return (%)  Avg. Quality (%)
Random     159.9            -0.5976
MYOP-TS    163.6 (2.31%)    -0.5100 (14.66%)
SARSA-TS   166.8 (4.32%)    -0.4171 (30.20%)
QL-TT-TS   166.5 (4.13%)    -0.4227 (29.27%)
QL-OT-TS   167.5 (4.75%)    -0.3985 (33.32%)
QL-OT-OS   167.6 (4.82%)    -0.3903 (34.69%)
SlateQ continues to outperform MYOP, even when the choice model does not accurately reflect the true environment, demonstrating its relative robustness. SlateQ can be used with other choice models. For example, SlateQ can be trained by assuming the cascade model, with only the optimization formulation requiring adaptation (see our discussion in Section 5.4). But since any choice model will generally be an approximation of true user behavior, this form of robustness is critical.
Notice that QL-TT and SARSA have inverted relative performance compared to the experiments above. This is because Q-learning exploits the (incorrect) choice model to optimize during training, while SARSA, being on-policy, uses the choice model only to compute expectations at serving time. This suggests that an on-policy control method like SARSA (with continual policy improvement) may be more robust than Q-learning in some settings.
8 A Practical Methodology
The deployment of a recommender system that uses RL or TD methods to optimize long-term user engagement presents a number of practical challenges. In this section, we identify several of these and suggest practical techniques to resolve them, including ways to exploit an existing myopic, item-level recommender to facilitate the deployment of a non-myopic system.
Many (myopic) item-level recommender systems [Liu et al., 2009, Covington et al., 2016] have the following components:
(a) logging of impressions and user feedback;
(b) training of some regression model (e.g., a DNN) to predict user responses for user-item pairs, which are then aggregated by some scoring function;
(c) serving of recommendations, ranking items by score (e.g., returning the top-k items for recommendation).
Such a system can be exploited to quickly develop a non-myopic recommender system based on Q-values, representing some measure of long-term engagement, by addressing several key challenges.
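For concreteness, the serving component (c) above can be sketched as a simple score-and-rank routine. The response model, scoring function, and item data below are illustrative placeholders, not the actual production system:

```python
def serve_slate(user, candidates, predict_responses, score, k):
    """Rank candidates by an aggregate score over predicted per-item user
    responses (e.g., pCTR, expected engagement) and return the top-k slate."""
    scored = [(score(predict_responses(user, item)), item) for item in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]

# Hypothetical responses: (pCTR, expected engagement) per item; the score
# aggregates them by multiplication.
responses = {"a": (0.5, 2.0), "b": (0.9, 3.0), "c": (0.1, 10.0)}
slate = serve_slate("user-1", ["a", "b", "c"],
                    lambda u, i: responses[i],
                    lambda r: r[0] * r[1], k=2)
# slate == ["b", "a"]: scores are a=1.0, b=2.7, c=1.0 (stable sort keeps a before c)
```

A non-myopic variant needs only to swap the aggregate score for one incorporating an LTV estimate; the logging, training, and serving scaffolding is unchanged.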
8.1 State Space Construction
A critical part of any RL modeling effort is the design of the state space, that is, the development of a set of features that adequately capture a user's past history to allow prediction of long-term value (e.g., engagement) in response to a recommendation. For the underlying process to be an MDP, the feature set should be (at least approximately) predictive of immediate user response (e.g., immediate engagement, hence reward) and self-predictive (i.e., it summarizes user history in a way that renders the implied dynamics Markovian).
The features of an extant myopic recommender system typically satisfy both of these requirements, meaning that an RL or TD model can be built using the same logged data (organized into trajectories) and the same featurization. The engineering, experimentation and experience that go into developing state-of-the-art recommender systems mean that they generally capture (almost) all aspects of history required to predict immediate user responses (e.g., pCTR, listening time, other engagement metrics); i.e., they form a sufficient statistic. In addition, the core input features (e.g., static user properties, summary statistics of past behavior and responses) are often self-predictive (i.e., no further history could significantly improve next-state prediction). This can often be verified by inspection and semantic interpretation of the (input) features. Thus, using the existing state definition provides a natural, practical way to construct TD or RL models. We provide experimental evidence supporting this assertion in Section 9.
8.2 Generalization across Users
In the MDP model of a recommender system, each user should be viewed as a separate environment, or separate MDP. However, it is critical to allow generalization across users, since few if any users generate enough experience to allow reasonable recommendations otherwise. Of course, such generalization is a hallmark of almost any recommender system. In our case, we must generalize the (implicit) MDP dynamics across users. The state representation afforded by an extant myopic recommender system is already intended to do just this, so by learning a Q-function that depends on the same user features as the myopic system, we obtain the same form of generalization.
8.3 User Response Modeling
As noted in Sections 4 and 5, SlateQ takes advantage of some measure of immediate item appeal or utility (conditioned on a specific user or state) to determine user choice behavior. In practice, since myopic recommender systems often predict these immediate responses, for example using pCTR models, we can use such models directly to assess the immediate appeal required by our SlateQ choice model. For instance, we can use a myopic model's pCTR predictions directly as (unnormalized) choice probabilities for items on a slate, or we can use the logits of such a model in the conditional logit choice model. Furthermore, by using the same state features (see above), it is straightforward to build a multi-task model [Zhang and Yang, 2017] that incorporates our long-term engagement prediction alongside other user response predictions.
8.4 Logging, Training and Serving Infrastructure
The training of long-term values requires logging of user data and live serving of recommendations based on these LTV scores. The model architecture we detail below exploits the same logging, (supervised) training and serving infrastructure used by the myopic recommender system.
Fig. 1 illustrates the structure of our LTV-based recommender system—here we focus on SARSA rather than Q-learning, since our long-term experiment in Section 9 uses SARSA. In a myopic recommender system, the regression model predicts immediate user response (e.g., clicks, engagement); in our non-myopic recommender system, label generation provides LTV labels, allowing the regressor to model Q-values.
Models are trained periodically and pushed to the server. The ranker uses the latest model to recommend items and logs user feedback, which is used to train new models. Using LTV labels, this iterative model training and pushing can be viewed as a form of generalized policy iteration [Sutton and Barto, 1998]: each trained DNN represents the value of the policy that generated the prior batch of training data, so training is effectively policy evaluation; the ranker acts greedily with respect to this value function, thus performing policy improvement.
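This train-and-push cycle can be sketched as a generalized policy iteration loop. All function arguments below are placeholders standing in for the infrastructure components just described:

```python
def policy_iteration_loop(serve_and_log, train_value_model, push_to_server, rounds):
    """Each round: the ranker serves greedily w.r.t. the current value model
    (policy improvement) and logs user feedback; training a new model on
    those logs evaluates the policy that generated them (policy evaluation)."""
    model = None  # the first round is served by whatever initial policy exists
    for _ in range(rounds):
        logs = serve_and_log(model)      # greedy serving + feedback logging
        model = train_value_model(logs)  # evaluate the data-generating policy
        push_to_server(model)            # deploy for the next round
    return model
```

The loop never requires off-policy corrections: each model is trained only on data generated by the previously pushed policy.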
LTV label generation is similar to DQN training [Mnih et al., 2015]. A main network learns the LTV of individual items; this network is easily extended from the existing myopic DNN. For stability, bootstrapped LTV labels (Q-values) are generated using a separate label network: we periodically copy the weights of the main network to the label network and use the (fixed) label network to compute LTV labels between copies. LTV labels are generated using Eq. (19).
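A sketch of this label-network scheme follows. Eq. (19) itself is not reproduced here; the SARSA-style label below (reward plus the discounted, choice-weighted next-item values under the label network) stands in for it, and all names are ours:

```python
import copy

class LTVLabeler:
    """Bootstrapped LTV labels with a periodically refreshed label network,
    in the style of DQN target networks."""

    def __init__(self, main_net, copy_every):
        self.main_net = main_net
        self.label_net = copy.deepcopy(main_net)  # frozen snapshot for labels
        self.copy_every = copy_every
        self.steps = 0

    def ltv_label(self, reward, gamma, next_state, next_slate, choice_prob):
        # label = r + gamma * sum_i P(i | s', A') * Q_label(s', i)
        future = sum(choice_prob(next_state, item, next_slate)
                     * self.label_net(next_state, item)
                     for item in next_slate)
        self.steps += 1
        if self.steps % self.copy_every == 0:  # refresh the label network
            self.label_net = copy.deepcopy(self.main_net)
        return reward + gamma * future
```

Holding the label network fixed between copies keeps the regression targets stationary while the main network is updated.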
9 Empirical Evaluation: Live Experiments
We tested the SlateQ decomposition (specifically, the SARSA-TS algorithm) on YouTube (https://www.youtube.com/), a large-scale video recommender with very large numbers of users and items in its corpus. The system is typical of many practical recommender systems, with two main components. A candidate generator retrieves a small subset (hundreds) of items from the large corpus that best match a user context. The ranker scores/ranks the candidates using a DNN with both user context and item features as input, optimizing a combination of several objectives (e.g., clicks, expected engagement, several other factors).
The extant recommender system's policy is myopic, scoring items for the slate using their immediate (predicted) expected engagement. In our experiments, we replace the myopic engagement measure with an LTV estimate in the ranker's scoring function. We retain other predictions and incorporate them into candidate scoring as in the myopic model. Our non-myopic recommender system maximizes cumulative expected engagement, with user trajectories capped at a fixed number of days. Since homepage visits can be spaced arbitrarily in time, we use time-based rather than event-based discounting to handle credit assignment across large time gaps. If consecutive visits occur at times t₁ and t₂, respectively, the relative discount of the reward at t₂ is γ^((t₂−t₁)/c), where c is a parameter that controls the time scale for discounting.
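The time-based discount just described can be computed directly. A minimal sketch (the time-scale parameter name c and the day units are ours; any consistent time unit works):

```python
def time_discount(gamma, t1, t2, c):
    """Relative discount for a reward at time t2 given the previous visit at
    time t1: gamma ** ((t2 - t1) / c). A gap of exactly c time units incurs
    one full factor of gamma; larger gaps are discounted more heavily."""
    return gamma ** ((t2 - t1) / c)

# With c = 1 day, a visit one day later is discounted by gamma,
# two days later by gamma squared, and so on.
```

Unlike event-based discounting, which applies one factor of gamma per visit regardless of spacing, this assigns less credit across long gaps between visits.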
Our model extends the myopic ranker using the practical methodology outlined in Section 8. Specifically, we learn a multi-task feedforward deep network [Zhang and Yang, 2017] that predicts the long-term engagement of each item (conditional on its being clicked) in a given state, as well as the immediate appeal used for pCTR/user choice computation (several other response predictions are learned, identical to those used by the myopic model). The multi-task feedforward DNN has 4 hidden layers of sizes 2048, 1024, 512 and 256, with ReLU activations on each hidden layer. Apart from the LTV/Q-value head, other heads include pCTR and other user responses. To validate our methodology, the DNN structure and all input features are identical to those of the production model, which optimizes for short-term (myopic) immediate reward. The state is defined by user features (e.g., the user's past history, behavior and responses, plus static user attributes). This also makes the comparison with the baseline fair.
The full training algorithm used in our live experiment is shown in Algorithm 1. The model is trained using TensorFlow [Abadi et al., 2015] in a distributed training setup, using stochastic gradient descent. We train on-policy over pairs of consecutive start-page visits, with LTV labels computed using Eq. (19), and use top-k optimization for serving—i.e., we test SARSA-TS. The existing myopic recommender system (baseline) also builds slates greedily—i.e., MYOP-TS. We note that at serving time, we do not just choose the slate using the top-k method; we also order the slate presented to the user according to the items' scores. The reason for this is twofold. First, we expect that the user experience is positively impacted by placing more appealing items, those likely to induce longer-term engagement, earlier in the slate. Second, the scrolling nature of the interface means that the slate size is not fixed at serving time—the number of inspected items varies per user event (see discussion in Section 5.4).
We experimented with live traffic for three weeks, treating a small, but statistically significant, fraction of users with recommendations generated by our SARSA-TS LTV model. The control is a highly optimized production machine learning model that optimizes for immediate engagement (MYOP-TS). Fig. 2 shows the percentage increase in aggregate user engagement using LTV relative to the control over the course of the experiment, and indicates that our model outperformed the baseline on the key metric under consideration, consistently and significantly. Specifically, users presented with recommendations by our model had sessions with greater engagement time relative to the baseline.
Fig. 3 shows the change in the distribution of cumulative engagement originating from items at different positions in the slate. Recall that the number of items viewed in any user event varies, i.e., experienced slates are of variable size; we show the first ten positions in the figure. The results show that users under treatment have more engaging sessions (larger LTVs) from items ranked higher in the slate compared to users in the control group, which suggests that top-k slate optimization performs reasonably in this domain. (The apparent increase in expected engagement at position 10 is a statistical artifact due to the small number of events at that position: the number of observed events at each position decreases roughly exponentially, and position 10 has roughly two orders of magnitude fewer observed events than any of the first three positions.)
10 Conclusion
In this work, we addressed the problem of optimizing long-term user engagement in slate-based recommender systems using reinforcement learning. Two key impediments to the use of RL in large-scale, practical recommenders are (a) handling the combinatorics of slate-based action spaces; and (b) constructing the underlying representations.
To handle the first, we developed SlateQ, a novel decomposition technique for slate-based RL that allows effective TD and Q-learning using LTV estimates for individual items. It requires relatively innocuous assumptions about user choice behavior and system dynamics, appropriate for many recommender settings. The decomposition reduces the complexity of generalization and exploration to that of learning for individual items—a problem routinely addressed by practical myopic recommenders. Moreover, for certain important classes of choice models, including the conditional logit, the slate optimization problem can be solved tractably using optimal LP-based methods and heuristic greedy and top-k methods. Our results show that SlateQ is relatively robust in simulation and can scale to large commercial recommender systems like YouTube's.
Our second contribution was a practical methodology for introducing RL into extant, myopic recommenders. We proposed the use of existing myopic models to bootstrap the development of Q-function-based RL methods in a way that allows substantial reuse of current training and serving infrastructure. Our live experiment in YouTube recommendation exemplified the utility of this methodology and the scalability of SlateQ. It also demonstrated that LTV estimation can significantly improve user engagement in practice.
There are a variety of future research directions that can extend this work. First, our methodology can be extended by relaxing some of the assumptions we made regarding the interaction between user choice and system dynamics. For instance, we are interested in models that allow unconsumed items on the slate to influence a user's latent state, and choice models that allow multiple items on a slate to be consumed/clicked. Further analysis of, and the development of corresponding optimization procedures for, additional choice models under SlateQ remains of great interest (e.g., hierarchical models such as the nested logit). In a related vein, methods for simultaneously learning choice models, or their parameters, while learning Q-values would be of great practical value. Finally, our simulation environment has the potential to serve as a platform for additional research on the application of RL to recommender systems. We hope to release a version of it to the research community in the near future.
Acknowledgments.
Thanks to Larry Lansing for system optimization and to the IJCAI-2019 reviewers for helpful feedback.
References
Abadi et al. [2015] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
Ai et al. [2018] Qingyao Ai, Keping Bi, Jiafeng Guo, and W. Bruce Croft. Learning a deep listwise context model for ranking refinement. In Proceedings of the 41st Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR-18), pages 135–144, 2018.
Bello et al. [2018] Irwan Bello, Sayali Kulkarni, Sagar Jain, Craig Boutilier, Ed Chi, Elad Eban, Xiyang Luo, Alan Mackey, and Ofer Meshi. Seq2Slate: Re-ranking and slate optimization with RNNs. arXiv:1810.02019 [cs.IR], 2018.
Bertsekas and Tsitsiklis [1996] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena, Belmont, MA, 1996.

Boutilier et al. [2003] Craig Boutilier, Richard S. Zemel, and Benjamin Marlin. Active collaborative filtering. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI-03), pages 98–106, Acapulco, 2003.
Boutilier et al. [2018] Craig Boutilier, Alon Cohen, Avinatan Hassidim, Yishay Mansour, Ofer Meshi, Martin Mladenov, and Dale Schuurmans. Planning and learning with stochastic action sets. In Proceedings of the Twenty-seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pages 4674–4682, Stockholm, 2018.
Breese et al. [1998] Jack S. Breese, David Heckerman, and Carl Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-98), pages 43–52, Madison, WI, 1998.
Buchbinder et al. [2014] Niv Buchbinder, Moran Feldman, Joseph Seffi Naor, and Roy Schwartz. Submodular maximization with cardinality constraints. In Proceedings of the Twenty-fifth Annual ACM–SIAM Symposium on Discrete Algorithms (SODA-14), pages 1433–1452, 2014.
Campos et al. [2014] Pedro G. Campos, Fernando Díez, and Iván Cantador. Time-aware recommender systems: A comprehensive survey and analysis of existing evaluation protocols. User Modeling and User-Adapted Interaction, 24(1–2):67–119, 2014.
Castro et al. [2018] Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv:1812.06110 [cs.LG], 2018.
Charnes and Cooper [1962] Abraham Charnes and William W. Cooper. Programming with linear fractional functionals. Naval Research Logistics Quarterly, 9(3–4):181–186, 1962.
Chen and Hausman [2000] Kyle D. Chen and Warren H. Hausman. Mathematical properties of the optimal product line selection problem using choice-based conjoint analysis. Management Science, 46(2):327–332, 2000.
Chen et al. [2018] Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed Chi. Top-k off-policy correction for a REINFORCE recommender system. In 12th ACM International Conference on Web Search and Data Mining (WSDM-19), pages 456–464, Melbourne, Australia, 2018.

Cheng et al. [2016] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10, Boston, 2016.
Choi et al. [2018] Sungwoon Choi, Heonseok Ha, Uiwon Hwang, Chanju Kim, Jung-Woo Ha, and Sungroh Yoon. Reinforcement learning-based recommender system using biclustering technique. arXiv:1801.05532 [cs.IR], 2018.
 Covington et al. [2016] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198, Boston, 2016.
Craswell et al. [2008] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM-08), pages 87–94. ACM, 2008.
Dayan [1992] Peter Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8:341–362, 1992.
Deshpande and Karypis [2004] Mukund Deshpande and George Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems (TOIS), 22(1):143–177, 2004.
Feige [1998] Uriel Feige. A threshold of ln(n) for approximating set cover. Journal of the ACM, 45(4):634–652, 1998.
Gauci et al. [2018] Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, and Xiaohui Ye. Horizon: Facebook's open source applied reinforcement learning platform. arXiv:1811.00260 [cs.LG], 2018.
Gomez-Uribe and Hunt [2016] Carlos A. Gomez-Uribe and Neil Hunt. The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems, 6(4):13:1–13:19, 2016.
Hallak et al. [2017] Assaf Hallak, Yishay Mansour, and Elad Yom-Tov. Automatic representation for lifetime value recommender systems. arXiv:1702.07125 [stat.ML], 2017.

He and McAuley [2016] Ruining He and Julian McAuley. Fusing similarity models with Markov chains for sparse sequential recommendation. In Proceedings of the IEEE International Conference on Data Mining (ICDM-16), Barcelona, 2016.
Honhon et al. [2012] Dorothee Honhon, Sreelata Jonnalagedda, and Xiajun Amy Pan. Optimal algorithms for assortment selection under ranking-based consumer choice models. Manufacturing and Service Operations Management, 14(2):279–289, 2012. doi: 10.1287/msom.1110.0365. URL http://pubsonline.informs.org/doi/abs/10.1287/msom.1110.0365.
Ie et al. [2019] Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Tushar Chandra, and Craig Boutilier. SlateQ: A tractable decomposition for reinforcement learning with recommendation sets. In Proceedings of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macau, 2019. To appear.
Jacobson et al. [2016] Kurt Jacobson, Vidhya Murali, Edward Newett, Brian Whitman, and Romain Yon. Music personalization at Spotify. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys-16), pages 373–373, Boston, Massachusetts, USA, 2016.
Jiang et al. [2019] Ray Jiang, Sven Gowal, Timothy A. Mann, and Danilo J. Rezende. Beyond greedy ranking: Slate optimization via List-CVAE. In Proceedings of the Seventh International Conference on Learning Representations (ICLR-19), New Orleans, 2019.
 Joachims [2002] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-02), pages 133–142, 2002.
 Konstan et al. [1997] Joseph A. Konstan, Bradley N. Miller, David Maltz, Jonathan L. Herlocker, Lee R. Gordon, and John Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77–87, 1997.
Krestel et al. [2009] Ralf Krestel, Peter Fankhauser, and Wolfgang Nejdl. Latent Dirichlet allocation for tag recommendation. In Proceedings of the 3rd ACM Conference on Recommender Systems (RecSys-09), pages 61–68, New York, 2009.
Kveton et al. [2015] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the Thirty-second International Conference on Machine Learning (ICML-15), pages 767–776, 2015.
Le and Lauw [2017] Dung D. Le and Hady W. Lauw. Indexable Bayesian personalized ranking for efficient top-k recommendation. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM-17), pages 1389–1398, 2017.
Liu et al. [2009] Tie-Yan Liu et al. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
 Louviere et al. [2000] Jordan J. Louviere, David A. Hensher, and Joffre D. Swait. Stated Choice Methods: Analysis and Application. Cambridge University Press, Cambridge, 2000.
 Luce [1959] R. Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. Wiley, 1959.
Martínez-de Albéniz and Roels [2011] Victor Martínez-de Albéniz and Guillaume Roels. Competing for shelf space. Production and Operations Management, 20(1):32–46, 2011. doi: 10.1111/j.1937-5956.2010.01126.x. URL http://dx.doi.org/10.1111/j.1937-5956.2010.01126.x.
McFadden [1974] Daniel McFadden. Conditional logit analysis of qualitative choice behavior. In Paul Zarembka, editor, Frontiers in Econometrics, pages 105–142. Academic Press, 1974.
Mehrotra et al. [2019] Rishabh Mehrotra, Mounia Lalmas, Doug Kenney, Thomas Lim-Meng, and Golli Hashemian. Jointly leveraging intent and interaction signals to predict user satisfaction with slate recommendations. In 2019 World Wide Web Conference (WWW'19), pages 1256–1267, San Francisco, 2019.
 Metz et al. [2017] Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of continuous actions for deep RL. arXiv:1705.05035 [cs.LG], 2017.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Moshfeghi et al. [2011] Yashar Moshfeghi, Benjamin Piwowarski, and Joemon M. Jose. Handling data sparsity in collaborative filtering using emotion and semantic based features. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-11), pages 625–634, Beijing, 2011.
 Nemhauser et al. [1978] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978.
 Puterman [1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.
Rendle et al. [2010] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In Proceedings of the 19th International World Wide Web Conference (WWW-10), pages 811–820, Raleigh, NC, 2010.
Rummery and Niranjan [1994] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. Technical Report TR 166, University of Cambridge, Department of Engineering, Cambridge, UK, 1994.
 Rusmevichientong and Topaloglu [2012] Paat Rusmevichientong and Huseyin Topaloglu. Robust assortment optimization in revenue management under the multinomial logit choice model. Operations Research, 60(4):865–882, 2012.
 Sahoo et al. [2012] Nachiketa Sahoo, Param Vir Singh, and Tridas Mukhopadhyay. A hidden Markov model for collaborative filtering. Management Information Systems Quarterly, 36(4), 2012.
 Salakhutdinov and Mnih [2007] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20 (NIPS-07), pages 1257–1264, Vancouver, 2007.
 Schön [2010] Cornelia Schön. On the optimal product line selection problem with price discrimination. Management Science, 56(5):896–902, 2010.
 Shani et al. [2005] Guy Shani, David Heckerman, and Ronen I. Brafman. An MDPbased recommender system. Journal of Machine Learning Research, 6:1265–1295, 2005.
 Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Singh et al. [2000] Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári. Convergence results for singlestep onpolicy reinforcement learning algorithms. Machine Learning, 38(3):287–308, 2000.
Srebro et al. [2004] Nathan Srebro, Jason Rennie, and Tommi Jaakkola. Maximum margin matrix factorization. In Advances in Neural Information Processing Systems 17 (NIPS-04), pages 1329–1336, Vancouver, 2004.
Sunehag et al. [2015] Peter Sunehag, Richard Evans, Gabriel Dulac-Arnold, Yori Zwols, Daniel Visentin, and Ben Coppin. Deep reinforcement learning with attention for slate Markov decision processes with high-dimensional states and actions. arXiv:1512.01124 [cs.AI], 2015.
 Sutton [1988] Richard S. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3:9–44, 1988.
 Sutton [1996] Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 9 (NIPS-96), pages 1038–1044, 1996.
 Sutton and Barto [1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
Swaminathan et al. [2017] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems 30 (NIPS-17), pages 3632–3642, Long Beach, CA, 2017.
Taghipour et al. [2007] Nima Taghipour, Ahmad Kardan, and Saeed Shiry Ghidary. Usage-based web recommendations: A reinforcement learning approach. In Proceedings of the First ACM Conference on Recommender Systems (RecSys-07), pages 113–120, Minneapolis, 2007. ACM.
 Talluri and van Ryzin [2004] Kalyan Talluri and Garrett van Ryzin. Revenue management under a general discrete choice model of consumer behavior. Management Science, 50(1):15–33, 2004.
Tan et al. [2016] Yong Kiam Tan, Xinxing Xu, and Yong Liu. Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 17–22, Boston, 2016.
Theocharous et al. [2015] Georgios Theocharous, Philip S. Thomas, and Mohammad Ghavamzadeh. Personalized ad recommendation systems for lifetime value optimization with guarantees. In Proceedings of the Twenty-fourth International Joint Conference on Artificial Intelligence (IJCAI-15), pages 1806–1812, Buenos Aires, 2015.
 Train [2009] Kenneth E. Train. Discrete Choice Methods with Simulation. Cambridge University Press, Cambridge, 2009.
 van den Oord et al. [2013] Aaron van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In Advances in Neural Information Processing Systems 26 (NIPS-13), pages 2643–2651, Lake Tahoe, NV, 2013.
 Van Seijen et al. [2009] Harm Van Seijen, Hado Van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of expected SARSA. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 177–184, 2009.
 Viappiani and Boutilier [2010] Paolo Viappiani and Craig Boutilier. Optimal Bayesian recommendation sets and myopically optimal choice query sets. In Advances in Neural Information Processing Systems 23 (NIPS), pages 2352–2360, Vancouver, 2010.
Wang et al. [2015] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In Proceedings of the Twenty-first ACM International Conference on Knowledge Discovery and Data Mining (KDD-15), pages 1235–1244, Sydney, 2015.
Watkins and Dayan [1992] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
Wilhelm et al. [2018] Mark Wilhelm, Ajith Ramanathan, Alexander Bonomo, Sagar Jain, Ed H. Chi, and Jennifer Gillenwater. Practical diversified recommendations on YouTube with determinantal point processes. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM-18), pages 2165–2173, Torino, Italy, 2018.
Wu et al. [2017] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. Recurrent recommender networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM-17), pages 495–503, Cambridge, UK, 2017.
Zhang and Yang [2017] Yu Zhang and Qiang Yang. A survey on multi-task learning. arXiv:1707.08114 [cs.LG], 2017.
Zhao et al. [2018] Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys-18), pages 95–103, Vancouver, 2018.