1 Introduction
Reinforcement Learning (RL) [SB98] is a paradigm for learning through trial-and-error while interacting with an unknown environment. The interaction happens in cycles during which the agent chooses an action and the environment returns an observation together with a real-valued reward. The agent’s goal is to maximize long-term accumulated reward. RL has had many successes, including autonomous helicopter control [NCD04] and, recently, mastering a wide range of Atari games [MKS15] (with a single agent) and a range of physics control tasks [LHP15].
Although these are impressive accomplishments, the Atari games only contain actions and while the physics control tasks have continuous action spaces they are of limited dimensionality (below ). Our work addresses combinatorial action spaces represented by feature vectors of dimensionality up to , but with useful extra structure naturally present in the applications of interest. We consider the application of RL to problems such as recommendation systems [PKCK12] in which a whole slate (tuple) of actions is chosen at each time point. While these problems can be modeled with MDPs with combinatorial action spaces, the extra structure that is naturally present in the applications allows for tractable approximate value maximization of the slate.
Slate Markov Decision Processes (Figure 1): We address RL problems in which, at each time point, the agent picks a fixed number of actions from a finite set . We refer to a tuple of actions as a slate. The slates are ordered in our formalization, but as a special case one can have an environment that is invariant to this order. In our environments only one action from the slate is executed. For example, in the recommendation system case, the user’s choice among the given recommendations is the execution of an action. We assume that we are given an underlying traditional RL problem, e.g., a Markov Decision Process (MDP) [Put94].
In an MDP, we observe a state $s_t$ at each time point $t$ and an action $a_t$ is executed; given $s_t$ and $a_t$, the next state and the reward (here nonnegative) received are independent of what happened before time $t$. In other words, we assume that $s_t$ summarizes all relevant information up to time $t$.
The key point of slate-MDPs is that, instead of taking one action, the agent chooses a whole slate of actions and then the environment chooses which one to execute in the underlying MDP; see Figure 1. The state information received tells the agent what action was executed, but not what would have happened for other actions. Slate-MDPs have important extra structure compared to the situations in which all the actions are executed and each full slate is its own discrete action in an enormous action space.
We investigate model-free agents that directly learn the slate-MDP value function for full slates. A simpler approach that is often deployed in large-scale applications is to learn the values of individual actions and then combine the individually best actions. We present experiments that show serious shortcomings of this simple approach, which completely ignores the combinatorial aspect of the tasks. When extra actions are added to a slate, these might interfere with the execution of the highest-value action. An agent that learns a slate value function is less harmed by this and can in principle learn beneficial slate patterns such as diversity without, unlike methods such as maximum marginal relevance [CG98] in information retrieval, being given a mathematical definition and a constant specifying the amount of diversity to introduce.
The main drawback of full slate agents is the number of evaluations of the value function needed for producing a slate based on them. Therefore, we also investigate the option of learning a parameterized policy (a neural network) using deterministic policy gradients [SLH14, LHP15, DAES16] to guide attention towards areas of the action space in which the value function is high. The neural network policy is combined with a nearest-neighbor lookup and an evaluation of the value function on this restricted set.
Related work: As far as we are aware, [FP11] is the only work that picks multiple actions at a time for an MDP. They studied known MDPs, aiming to provide as many actions as possible while remaining nearly optimal in the worst case. Besides assuming a known MDP, their work differs critically from ours in that they always execute an action from the slate and that they focus on the worst-case choice. We work with action-execution probabilities and do not assume that any action from the slate will be executed. Further, we work with high-dimensional feature representations and aim for a scale at which achieving guaranteed near-optimal worst-case behavior is not feasible. Other work on slate actions [KRS10, YG11, KWA14] has focused on the bandit setting in which there are no state transitions. In these articles, rewards are received and summed for each action in the slate. In our slate-MDPs, the reward is only received for the executed action and we do not know what the reward would have been for the other actions in the slate. In the recommendation systems literature [PKCK12], the focus of most work is on the immediate probability of having a recommendation accepted or being relevant, and not on expected value, which we successfully optimize here. [Sha05] is an exception and optimizes for long-term value within an MDP framework, but treats individual recommendations as independent, just like the agents we employ as a baseline and outperform with our full slate agents. [VHW09] used continuous control methods for discrete-action problems in which the actions are embedded in a feature space and the nearest discrete action is executed. Here we use this continuous-control approach to discrete reinforcement learning in a different way, as an attention mechanism; we present it at greater length for RL with large action spaces in [DAES16].
2 Reinforcement Learning with Slate Actions
As is common in reinforcement learning [SB98], we work with an agent-environment framework [RN10] in which an agent chooses actions $a_t$ executed in an environment and receives observations $o_t$ and real-valued rewards $r_t$. A sequence of such interactions is generated cyclically at times $t = 1, 2, 3, \ldots$. If the Markov property is satisfied, then the environment is called a Markov Decision Process (MDP) [Put94] and the observations $o_t$ can be viewed as states $s_t$.
An MDP is defined by a tuple $\langle S, A, T, R \rangle$ of a state space $S$, an action space $A$, a reward function $R : S \times A \to \Delta_{\mathbb{R}}$, where $\Delta_X$ denotes the probability distributions over the set $X$, and a transition function $T : S \times A \to \Delta_S$. We will use $r(s,a)$ to denote the expected value of $R(s,a)$. A stationary policy is a function $\pi : S \to \Delta_A$ from states to probability distributions over $A$.
The agent is designed to accumulate as much reward as possible, which is often addressed by aiming to maximize expected discounted reward for a discount factor $\gamma$. In this article we primarily work with episodic environments where the discounted rewards are summed to the end of the episode. Our agents learn $Q^\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{t_{\mathrm{end}}} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$ for a policy $\pi$, where $a_t \sim \pi(s_t)$ for $t > 0$ and $t_{\mathrm{end}}$ is the end of the episode. Ideally, we want to find $Q^* = \max_\pi Q^\pi$ because then acting according to $\pi^*(s) = \arg\max_a Q^*(s,a)$ is optimal. One method for achieving this is Q-learning, which is based on the update $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \eta\,\bigl(r_t + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t)\bigr)$, where $\eta$ is the learning rate. The update also translates to a parameter update rule for the case with a parameterized instead of a tabular value function. We employ the $\epsilon$-greedy approach to action selection based on a value function, which means that with probability $1-\epsilon$ we pick $\arg\max_a Q(s,a)$ and with probability $\epsilon$ a random action. Our study focuses on deep architectures for the value function similar to those used by [MKS15, LHP15], and our approach incorporates the key techniques of target networks and experience replay employed there.
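As a concrete illustration of the Q-learning update and the epsilon-greedy action selection described above, the following is a minimal tabular sketch. It is not the deep architecture used in this article: the toy chain environment, the function names, the random tie-breaking, and all hyperparameter values are assumptions chosen only to keep the example self-contained.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon pick a uniformly random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties at random so that an all-zero table does not favor one action.
    return rng.choice([a for a, v in enumerate(q_row) if v == best])


def q_learning_update(q, s, a, r, s_next, done, lr, gamma):
    """One tabular step: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')."""
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += lr * (target - q[s][a])


def run_toy_chain(episodes=500, n_states=5, lr=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Hypothetical chain MDP: action 1 moves right, action 0 moves left.

    Reaching the right-most state gives reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(50):  # cap on episode length
            a = epsilon_greedy(q[s], epsilon, rng)
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            q_learning_update(q, s, a, r, s_next, done, lr, gamma)
            s = s_next
            if done:
                break
    return q


if __name__ == "__main__":
    for state, row in enumerate(run_toy_chain()):
        print(state, [round(v, 3) for v in row])
```

With a parameterized value function, the same temporal-difference target would drive a gradient step on the parameters instead of a table entry.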
Slate Markov Decision Processes (slate-MDPs): In this section we formally introduce slate-MDPs as well as some important special cases that enable more efficient inference.
Definition 1 (slate-MDP).
Let be an MDP. Let . Define and by
The tuple is called a slate-MDP with underlying MDP and action-execution function . We assume that the previously executed action can be derived from the state through a function .
Note that any slate-MDP is itself an MDP with a special structure. In particular, the probability distribution of the next state and the reward conditional on the current state and action can be factored as:
The expected reward for a slate-MDP can be computed as
If we let , we have the following identity for the state-slate value function of the slate-MDP:
(1) 
for any slate policy .
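The displayed equations for this factorization are not reproduced in this text, so the following sketch only illustrates the structure the prose describes: the environment picks which action is executed, and slate-level quantities are mixtures of the corresponding per-action quantities of the underlying MDP. The function name, the dictionary representation, and the numbers are assumptions for illustration.

```python
def slate_expectation(execution_probs, per_action_values):
    """Mix per-action quantities with the action-execution probabilities.

    execution_probs: dict mapping each action that may be executed for a given
        state and slate to its execution probability (assumed to sum to 1 and
        allowed to include actions outside the slate).
    per_action_values: dict mapping actions to a quantity of the underlying
        MDP, e.g. the expected reward or the value of executing that action.
    """
    return sum(p * per_action_values[a] for a, p in execution_probs.items())


if __name__ == "__main__":
    # Hypothetical numbers: two slate actions plus one fallback action.
    probs = {"a1": 0.5, "a2": 0.3, "fallback": 0.2}
    rewards = {"a1": 1.0, "a2": 2.0, "fallback": 0.0}
    print(slate_expectation(probs, rewards))
```

The same helper applies to expected rewards and to value estimates, since both are linear in the execution probabilities.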
We do not require that the executed action is an element of , but in the environment we create that will be the case for “good” slates (Execution-Is-Best in Definition 2). In the recommendation system setting, means that a recommendation was selected by the user. We formally define what it means that having actions from the slate executed is the best outcome, by saying that the value-order between policies coincides with that of a modified version of the environment in which implies that the episode ends with zero reward. We call the latter property the fatal failure property.
Definition 2 (Value-order, Fatal Failure, Execution-Is-Best (EIB)).
Let and be two environments with the same state space and action space .
Value-order: If, for any pair of policies and ,
then we say that and have the same value-order.
Further, suppose there is such that .
Fatal Failure: If whenever , then we say that has fatal failure.
Suppose that if , i.e., the environments coincide for executed slates.
EIB: If has fatal failure and has the same value-order as , then we say that has the Execution-Is-Best (EIB) property.
To be able to identify a value-maximizing slate in a large-scale setting, we need to avoid a combinatorial search. The first step is to note that if an environment has the fatal failure property, then can be replaced by in (1). In other words, only terms corresponding to actions in the slate are nonzero. While this condition does not hold in our environments, the EIB assumption is natural and implies that one can perform training for an environment modified as in Definition 2. Although the sum with fewer terms is easier to optimize, the problem is still combinatorial and does not scale. Therefore, monotonicity and submodularity are interesting to us since if is monotonic and submodular, we can sequentially greedily choose a slate and [Fuj05].
Definition 3 (Monotonic and Submodular).
We say that a function for is
Monotonic if it holds that and
Submodular if (diminishing returns)
holds for all .
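To make the role of monotonicity and diminishing returns concrete, here is a small sketch of sequentially greedy selection on a toy coverage function. This is an illustration on sets rather than the ordered slates of this article; the item names, the `TOPICS` table, and the coverage objective are assumptions. For monotone submodular set functions under a cardinality constraint, greedy selection is classically known to achieve at least a (1 - 1/e) fraction of the optimal value.

```python
# Hypothetical items, each covering a set of topics.
TOPICS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}, "d": {5}}


def coverage(items):
    """Number of distinct topics covered; monotone and submodular in the item set."""
    return len(set().union(*(TOPICS[i] for i in items)))


def greedy_select(candidates, value, k):
    """Pick k items one at a time, each maximizing the value of the extended selection."""
    chosen = []
    for _ in range(k):
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        chosen.append(max(remaining, key=lambda c: value(chosen + [c])))
    return chosen


if __name__ == "__main__":
    picked = greedy_select(list(TOPICS), coverage, 2)
    print(picked, coverage(picked))
    # Diminishing returns: adding "c" helps less once "a" is already chosen.
    print(coverage(["c"]) - coverage([]), coverage(["a", "c"]) - coverage(["a"]))
```

Greedy selection needs only on the order of k times the number of candidates evaluations, instead of one evaluation per possible slate.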
To guarantee monotonicity and submodularity we introduce a further assumption that we call sequential presentation, since it is satisfied if the action-selection happens sequentially in the environment, e.g., if recommendations are presented one-by-one to a user or if the users are assumed to inspect them in such a manner. Although our environments do not have sequential presentation, the sequentially greedy procedure works well. When we evaluate the choice of a first recommendation we look at it in the presence of the other recommendations provided by a default strategy. This brings our setting closer to sequential recommendations.
Definition 4 (Sequential Presentation).
We say that a slate-MDP has sequential presentation if for all states its action-execution probabilities satisfy
(2) 
and
Proposition 1.
If a slate-MDP has sequential presentation and satisfies the fatal failure property then its state-slate value function is monotonic and submodular for all .
Proof.
Let . For any vector and any scalar , let denote the vector constructed by concatenating to . Assume that . Also the rewards are nonnegative. Then, we have that
Sequential presentation immediately implies that . This establishes that is indeed submodular in . Monotonicity follows from (2). ∎
The next section introduces agents based on the theory of this section. They learn the value of full slates and select a slate through a sequentially greedy procedure which, under the sequential presentation assumption combined with EIB, may perform slightly worse than combinatorial search. Further, motivated by EIB, training is performed on a modified environment for which fatal failure is satisfied.
Slate agents
We consider model-free agents that directly learn either the value of an individual action (Algorithm 1) or the value of a full slate (Algorithm 2). We perform the action selection for the latter in a way that only considers dependence on the actions in slots above. However, we still learn a value function which depends on a whole slate by using value function approximators that take both the features of the state and all the actions as arguments. We perform the maximization in a sequentially greedy manner and fill slots following the one being maximized with the same action that is being evaluated, while keeping previous ones fixed. Both Algorithm 1 and Algorithm 2 are stated in a generic manner, while in our experiments we include useful techniques from [MKS15, LHP15] that stabilize and speed up learning, namely experience replay and target networks as in Algorithm 3. Algorithm 1 is presented in two phases: one with training using slate size and one with testing using slate size . In our experiments we interleave test and training phases.
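The slot-by-slot maximization described above can be sketched as follows. This is a minimal illustration, not a reproduction of Algorithm 2: the function names and the toy slate value function `toy_q_slate` (which counts each distinct action once and discounts by position) are assumptions standing in for a learned value function.

```python
BASE_VALUE = {"x": 1.0, "y": 0.8, "z": 0.1}  # hypothetical per-action values


def toy_q_slate(state, slate):
    """Stand-in for a learned slate value function: duplicates add nothing."""
    seen, total = set(), 0.0
    for position, action in enumerate(slate, start=1):
        if action not in seen:
            total += BASE_VALUE[action] / position
            seen.add(action)
    return total


def greedy_slate(state, candidates, slate_size, q_slate):
    """Fill slots in order; earlier slots stay fixed while the slot being
    maximized and all later slots are filled with the action under evaluation."""
    slate = []
    for slot in range(slate_size):
        def padded_value(action):
            trial = slate + [action] * (slate_size - slot)
            return q_slate(state, tuple(trial))
        slate.append(max(candidates, key=padded_value))
    return tuple(slate)


if __name__ == "__main__":
    print(greedy_slate(None, ["x", "y", "z"], 2, toy_q_slate))
```

Each slot costs one value-function evaluation per candidate, so a slate is built with a number of evaluations linear in the slate size rather than exponential.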
Deterministic Policy Gradient (DPG) Learning of Slate Policies to Guide Attention: To decrease the number of evaluations of the value function when choosing a slate, we attempt to learn an attention-guiding policy that is trained to produce a slate that maximizes the learned value function as seen in Algorithm 3, which generalizes Algorithm 2 in which candidate actions are used as the nearest neighbors. The policy is optimized using gradient ascent on its parameters for as a function of those and the state.
The main extra issue, besides the much higher dimensionality, compared to existing deterministic policy gradient work [SLH14, LHP15], is that instead of using the continuous action produced by the neural network, we must choose from a discrete subset. We resolve this by performing a nearest neighbor lookup among the available actions and either execute the nearest or evaluate for all the identified neighbors and pick the highest valued action. We introduce this approach in fuller detail, and develop it further for large action spaces, in [DAES16]. The policy is still updated in the same way since we simply want it to produce vectors with as high values as possible. However, when we also learn we update based on for the action actually taken. As in [SLH14, LHP15], the next action used for the TD-error is the action produced by the current target policy. To perform a nearest neighbor lookup for slates we focus on one slot at a time. We use the sequentially greedy maximization defined in Algorithm 2, but for each slot the choices are further restricted to only consider the result of that lookup.
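The lookup-then-evaluate step for a single slot can be sketched as below. This is a brute-force illustration under stated assumptions: Euclidean distance is assumed as the metric, and the feature table, the value table `TOY_Q`, and the function names are hypothetical stand-ins for the learned policy output and value function.

```python
import math

# Hypothetical action features and a stand-in for the learned value function.
ACTION_FEATURES = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 5.0)}
TOY_Q = {"a": 0.1, "b": 0.9, "c": 2.0}


def k_nearest(proto_action, action_features, k):
    """Return the k action ids whose features are closest to the policy output."""
    by_distance = sorted(
        action_features,
        key=lambda a: math.dist(proto_action, action_features[a]),
    )
    return by_distance[:k]


def select_action(state, proto_action, action_features, k, q_value):
    """Evaluate the value function only on the k neighbors and pick the best."""
    neighbors = k_nearest(proto_action, action_features, k)
    return max(neighbors, key=lambda a: q_value(state, a))


if __name__ == "__main__":
    q = lambda state, action: TOY_Q[action]
    for k in (1, 2, 3):
        print(k, select_action(None, (0.2, 0.0), ACTION_FEATURES, k, q))
```

With k equal to 1 the policy output alone decides the action, while k equal to the number of candidates recovers full evaluation; intermediate k trades evaluation cost against reliance on the policy.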
DPG+kNN
3 Experimental comparison
We perform an experimental comparison of a range of agents described in the previous section on a test environment of a generic template. The examples used in this study have, respectively, , , and states and actions represented by dimensional vectors for and and dimensional for .
The Test Environment: The test environments are such that and varies with the environment. An environment is defined by a transition weight matrix (from a real recommendation system) such that for each state and each action there is a real valued nonnegative weight indicating how common it is that follows . For each , only a limited number (larger than zero and at most ) of are nonzero and the magnitude of a typical nonzero weight is . We refer to those for which as the candidate actions for state . Further, there is a weight .
The weight matrix represents a weighted directed graph, which is extracted as a subgraph from a very large full graph of the system by choosing a seed node, performing a breadth-first traversal to a limited depth, and then pruning childless nodes in an iterative manner. There is also a reward for each state that is received upon transition to that state.
When an agent in state produces a slate , each action has a probability of being executed that is proportional (not counting duplicates) to (standard discount in information retrieval [JK02, CMS10]) where is the position in the slate and the probability that no action is executed is proportional to . If no action from the slate is executed, the environment transitions to a uniformly random next state and the agent receives the corresponding reward. After this transition, there is a probability (here ) of the episode ending. If was executed, the environment transitions to , the reward is received and the episode ends with a fixed probability (here ).
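The action-execution rule just described can be sketched as follows. The exact proportionality expression is not reproduced in this text, so the sketch assumes a logarithmic position discount, weight divided by log2(position + 1), in the style of discounted cumulated gain; the function name, argument names, and numbers are likewise illustrative assumptions.

```python
import math


def execution_probabilities(slate, weights, no_action_weight):
    """Return a dict of execution probabilities for one state and slate.

    slate: ordered tuple of actions (position 1 is the first slot).
    weights: nonnegative transition weights of the actions for the current state.
    no_action_weight: positive weight of the outcome in which no slate action
        is executed; it is stored under the key None.
    """
    scores = {}
    for position, action in enumerate(slate, start=1):
        if action in scores:
            continue  # duplicates are not counted twice
        # Assumed discount: weight / log2(position + 1).
        scores[action] = weights.get(action, 0.0) / math.log2(position + 1)
    total = no_action_weight + sum(scores.values())
    probs = {action: score / total for action, score in scores.items()}
    probs[None] = no_action_weight / total
    return probs


if __name__ == "__main__":
    print(execution_probabilities(("x", "y", "x"), {"x": 2.0, "y": 1.0}, 1.0))
```

Under this rule, an action with a high weight can crowd out a more valuable action placed elsewhere in the slate.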
The Agents: For all agents’ functions, we use function approximators that are feedforward neural networks with two hidden layers, each with units. The policies are feedforward neural networks with two hidden layers with hidden units each. Fewer units suffice since we only need an approximate location of high values. We use learning rate and target network update rate . In line with the theory presented in the previous section, training is performed on a modified version of the environment in which the episode ends with zero reward when an action not from the slate is executed. The update routine is a gradient step on a squared () loss between and , where is the target network and the action produced by the target policy network at state . The target network parameters slowly track through the update and similarly for the target policy network. Algorithm 3 details these procedures.
Evaluation: We evaluate our full slate agents and simple top agents on the three environments with slate sizes , and . The full slate agents are evaluated in three variations with different numbers of actions in the slate. The cheapest version (in number of evaluations) immediately picks the action, for each slot, whose features are nearest (in distance) to the vector produced by the policy. The most expensive agent considers all candidate actions, and we also evaluate an agent that, for each slot, only considers the nearest. We ran each experiment times with different random seeds and plotted the average total reward per episode (averaged both over seeds and episodes at evaluation with ) in Figures 2–7, for which a moving average with window length has also been employed. The error bars show one standard deviation. Figures , and compare different numbers of neighbors for , and . Figures , and compare full and simple agents at different slate sizes for the same environments.
Results: We see the full slate agents performing much better overall than the simple top agents that we employ as a baseline. The baseline is relevant since agents used in recommendation systems are often of that form (based on an unrealistic independence assumption as in [Sha05]), although they typically focus on recommendations being accepted [PKCK12]. Unlike the simple agents, full slate agents always perform well for larger slate sizes. The simple agents are unable to learn to avoid including actions with high weight but with lower value than the top pick. For slate size , the simple top agent coincides with the full slate agent which evaluates all candidate actions, hence these two agents are shown as one agent. Further, we can see that the curve for agents that only evaluate of the candidate actions is almost identical to the one for agents that evaluate all. The nearest neighbor agent that simply picks the nearest neighbor is slightly worse and has larger variability than the other two. However, as we demonstrate in a further experiment that also highlights the ability to learn non-myopically, the nearest neighbor agent can outperform the other agents. The variability of the nearest neighbor agent aids exploration and the attention can help when is not estimated well everywhere.
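The slow tracking of the online parameters by the target network, mentioned in the description of the agents, is commonly implemented as an exponential moving average of parameters, as in [LHP15]. The update expression itself is not reproduced in this text, so the sketch below assumes that common form; the flat-list parameter representation and the rate value are illustrative.

```python
def soft_update(target_params, online_params, tau):
    """Move every target parameter a fraction tau toward its online counterpart."""
    return [
        (1.0 - tau) * target + tau * online
        for target, online in zip(target_params, online_params)
    ]


if __name__ == "__main__":
    target = [0.0, 1.0]
    online = [1.0, 1.0]
    for _ in range(3):
        target = soft_update(target, online, tau=0.1)
        print([round(p, 4) for p in target])
```

A small rate keeps the regression target for the value function nearly stationary between updates, which is the stabilizing effect that target networks are used for.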
Risk-Seeking: In the case of the environment in particular, it is possible to perform much better than we have already seen. In fact, the performance seen in Figure 4 only reaches the performance of the optimal myopic policy. For this environment there are far better policies. We make a simple modification to our agent (slate size , all neighbors) to make it more likely to discover multi-step paths to high-reward outcomes. We transform the reward that the agent is training on by replacing with , while still evaluating with the original reward. For a wide range of exponents we eventually see far superior performance compared to . We refer to the agents with as risk-seeking, in line with prospect theory [KT79].
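The reward transformation is described only in words here (the symbols are missing from this text), so the following sketch assumes it is a power transform with an exponent, applied to the nonnegative training reward only. The function name and the example values are assumptions for illustration.

```python
def training_reward(reward, exponent):
    """Power-transform a nonnegative reward for training; evaluation keeps the raw reward."""
    if reward < 0:
        raise ValueError("rewards are assumed to be nonnegative")
    return reward ** exponent


if __name__ == "__main__":
    rewards = [1.0, 2.0, 4.0]
    for exponent in (1.0, 2.0, 3.0):
        print(exponent, [training_reward(r, exponent) for r in rewards])
```

An exponent above 1 makes the transform convex, so rare high-reward outcomes dominate the training signal more strongly than they do in the raw reward.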
4 Conclusions
We introduced agents that successfully address sequential decision problems with high-dimensional combinatorial slate-action spaces, found in important applications including recommendation systems. We focus on slate Markov Decision Processes introduced here, providing a formal framework for such applications. The new agents’ superiority over relevant baselines was demonstrated on a range of environments derived from real-world data in a live recommendation system.
References
 [CG98] Jaime Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR, pages 335–336, 1998.
 [CMS10] W. Bruce Croft, Donald Metzler, and Trevor Strohman. Search engines: information retrieval in practice. Addison-Wesley, Boston, 1st edition, February 2010.
 [DAES16] Gabriel Dulac-Arnold, Richard Evans, and Peter Sunehag. Fast reinforcement learning in large discrete action spaces. In preparation, 2016.
 [FP11] M.M. Fard and J. Pineau. Non-deterministic policies in Markovian decision processes. J. Artif. Intell. Res. (JAIR), 40:1–24, 2011.
 [Fuj05] S. Fujishige. Submodular Functions and Optimization: Second Edition. Annals of Discrete Mathematics. Elsevier Science, 2005.
 [JK02] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.
 [KRS10] S. Kale, L. Reyzin, and R. Schapire. Non-Stochastic Bandit Slate Problems. In J. Lafferty, C. K. I. Williams, R. Zemel, J. Shawe-Taylor, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1045–1053. 2010.
 [KT79] Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decisions under risk. Econometrica, pages 263–291, 1979.
 [KWA14] B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson. Matroid bandits: Fast combinatorial optimization with learning. CoRR, abs/1403.5045, 2014.
 [LHP15] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
 [MKS15] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015.
 [NCD04] A. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Autonomous inverted helicopter flight via reinforcement learning. In Experimental Robotics IX, pages 363–372, 2004.
 [PKCK12] Deuk Hee Park, Hyea Kyeong Kim, Il Young Choi, and Jae Kyeong Kim. A literature review and classification of recommender systems research. Expert Systems with Applications, 39(11):10059–10072, 2012.
 [Put94] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.
 [RN10] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ, 3rd edition, 2010.
 [SB98] R. Sutton and A. Barto. Reinforcement Learning. The MIT Press, 1998.
 [Sha05] G. Shani, R. I. Brafman, and D. Heckerman. An MDP-based recommender system. J. Mach. Learn. Res., 6:1265–1295, December 2005.
 [SLH14] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin A. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, volume 32 of JMLR Proceedings, pages 387–395. JMLR.org, 2014.
 [VHW09] Hado Van Hasselt, Marco Wiering, et al. Using continuous action spaces to solve discrete problems. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 1149–1156. IEEE, 2009.
 [YG11] Y. Yue and C. Guestrin. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems 24, pages 2483–2491, 2011.