1 Introduction
In the user cold-start problem a new user is introduced to a recommendation system. Here, the system often has little to no information about this new user and must provide reasonable recommendations nonetheless. A good recommendation system must on one hand provide quality (initially based on item popularity) recommendations to such users in order to keep them engaged, and on the other hand learn the new users’ personal preferences as quickly as possible. The initial session of a user with a recommendation system is critical, since during it the user decides whether to terminate the session, and possibly never return, or to register to the site or become a regular visitor of the system. We refer to this phenomenon as that of a one-shot session. This emphasizes the need to obtain guarantees not only for a long horizon but also for a very short one.
The one-shot session framework leads to a highly natural objective: maximize the session length, i.e. the number of items consumed by the user until terminating the session. Indeed, the longer the user engages with the system the more likely she is to register and become a regular user. Our focus is on recommendation systems in which we present multiple items in each round. The user will either choose a single item and proceed to the next round, or choose to terminate the session. Presenting multiple items allows us to learn about the user’s preferences based on the items chosen, versus those that were skipped.
A typical session is quite short, as it consists of a handful of rounds. This translates to us having very little data to learn from in order to personalize our recommendations. Due to the limited amount of information we are forced to restrict ourselves to a very simple model. For this reason we take a similar approach to that in [Agrawal, Teneketzis, and Anantharam1989, Salomon and Audibert2011, Maillard and Mannor2014] and assume that each user belongs to one of a fixed number of user types (in the mentioned works these were called user clusters), such as man/woman, low/high income, or latent types based on previously observed sessions. The simplicity of the model translates into the number of user types being a small integer. We assume that the model associated with each of the user types is known (footnote 1: Learning the correct model for a user type can be done, for example, from data collected from different users whose identity is known. In either case this can be handled independently, hence we do not deal with this issue.). That is, for any tuple of items, the probability of each of the items being chosen, and the probability of the session terminating, given the user type, are known. We emphasize the fact that a complete recommendation system will start with the simple model, with the number of user types being a small constant, and for users that are ‘hooked’, i.e. remain for a long period or register, we may move to a more complex model where, for example, a user is represented by a high dimensional vector. We do not discuss the latter, more complex system, aimed at users with a long history, as it is outside the scope of our paper.
The problem we face can be formulated as a Markov Decision Process (MDP; [Bertsekas and Tsitsiklis1995, Sutton and Barto1998]). In each round the state is a distribution over the user types, reflecting our knowledge about the user. We choose an action consisting of a fixed number of different items from the item set. The user either terminates the session, leading to the end of the game, or chooses an item, moving us to a different state as we gained some knowledge as to her identity. Notice that any available context, e.g. time of day, gender, or basic information available to us can be used in order to set the initial state. The formulated MDP can be solved in order to obtain the optimal strategy; the computational cost scales with the size of the action space and the state space. Since the number of user types is restricted to be small, the size of the state space does not present a real challenge. However, the action space, consisting of all item subsets that can be presented in a round, is typically huge. The number of available items can be in the hundreds if not thousands, and a system presenting even a handful of items will have at the very least billions of possible actions. For this reason we seek a solution that scales with the number of available items rather than with the number of possible actions.
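To make the size of the action space concrete, the following minimal sketch counts the item subsets of a given size. The catalogue sizes and the number of presented items are hypothetical values chosen for illustration only; they are not taken from the paper.

```python
from math import comb


def num_actions(num_items: int, slots: int) -> int:
    """Number of distinct item subsets of size `slots`, i.e. one per action."""
    return comb(num_items, slots)


if __name__ == "__main__":
    # Hypothetical catalogue sizes and slot counts (assumptions for illustration).
    for num_items in (200, 1000):
        for slots in (3, 5):
            print(num_items, slots, num_actions(num_items, slots))
```

Even with a few hundred items and five slots the count is already in the billions, whereas a greedy procedure that adds one item at a time only needs on the order of (number of items) times (number of slots) evaluations per state.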
To this end we require an additional mild assumption, which can be viewed as a quantitative extension of the irrelevant alternatives axiom (see Section 4). With this assumption we are able to provide a solution (Section 5) based on a greedy approach that scales as required and has a constant competitive ratio with the computationally unbounded counterpart. The main component of the proof is an analysis showing that the submodularity and monotonicity of the immediate reward in a round translate into monotone and submodular-like properties of the so-called Q function in a modified value iteration procedure we denote by Greedy Value Iteration (GVI). Given these properties we are able to show, via an easy adaptation of the tools provided in [Nemhauser, Wolsey, and Fisher1978] for dealing with submodular monotone functions, that the greedy approach admits a constant approximation guarantee. We emphasize that in general, a monotone submodular reward function does not in any way translate into a monotone submodular Q function, and we exploit specific properties of our model in order to prove our results; to demonstrate this we show in Appendix G an example of a monotone submodular immediate reward function with a corresponding Q function that is neither monotone nor submodular. We complement the theoretical guarantees of our solution in Section 6 with experimental results on synthetic data showing that in practice, our algorithm has performance almost identical to that of the computationally unbounded algorithm.
2 Related Work
Many previous papers provide adaptive algorithms for managing a recommendation system, yet to the best of our knowledge, none of them deal with one-shot sessions. The tools used include Multi-armed Bandits [Radlinski, Kleinberg, and Joachims2008], Multi-armed bandits with submodularity [Yue and Guestrin2011], MDPs [Shani, Heckerman, and Brafman2005], and more. A common property shared by these results is the assumption of an infinite horizon. Specifically, a poor recommendation given in one round cannot cause the termination of the session, as in one-shot sessions, but only results in a small reward in the same single round. This crucial difference in the ‘cost’ of a single bad round in the setups of these papers versus ours is very likely to cause these methods to fail in our setup. A paper that partially avoids this drawback is [Deshpande and Montanari2012], where other than a guarantee for an infinite horizon the authors provide a multiplicative approximation to the optimal strategy at all times. A notable difference from our setup is the fact that the recommendations there consist of a single item rather than multiple items as required here. This, along with the somewhat vague connection to our one-shot session setup, excludes their methods from being a possible solution to our problem.
Our work can be cast as a Partially Observable MDP (POMDP; [Kaelbling, Littman, and Cassandra1998]), where the agent only has partial (sometimes stochastic) knowledge over the current state. Our problem stated as a POMDP instance admits one state for each user type and an additional state reflecting the session end. The benefit of such an approach is the ability to significantly reduce the size of the state space, from the potentially large set of belief states down to the number of user types plus one. Nevertheless, we did not choose this approach as the gain is rather insignificant due to the number of user types being a small constant, while the inherent complication to the analysis and algorithm makes it difficult to deal with the large action space, which forms the main challenge in our setting. Recently, [Satsangi, Whiteson, and Oliehoek2015] presented a result dealing with a combinatorial action space in a POMDP framework, when designing a dynamic sensor selection algorithm. They analyze a specific reward function that is affected only by the level of uncertainty of the current state, thereby pushing towards a variant of pure exploration. The specific properties of their reward function and MDP translate into a monotone and submodular Q function. These properties are not present in our setup, in particular due to the fact that a session may terminate, hence their methods cannot be applied. Furthermore, our greedy VI variant is slightly more complex than the counterpart in [Satsangi, Whiteson, and Oliehoek2015] as it is tailored to ensure the (approximate) monotonicity of the Q function; this is an issue that was not encountered in the problem setup of [Satsangi, Whiteson, and Oliehoek2015].
Another area related to our work is that of the “Combinatorial Multi-Armed Bandits” setup (CMAB; see [Chen, Wang, and Yuan2013] and references within). Here, similarly to our setup, in each round the set of actions available to us can be described as subsets of a set of options (denoted by arms in the CMAB literature). These methods cannot directly be applied to our setting due to the infinite horizon property mentioned above. Furthermore, the methods given there that help deal with the combinatorial nature of the problem cannot be applied in our setting since the majority of our efforts lie in characterizing properties of the Q function; an object that has no meaning in MAB settings but only in MDPs.
3 Problem Formulation
In this section we provide the formal definition of our problem. We first provide the definition of a Markov Decision Process (MDP). We continue to describe our setup and its different notations, and then formulate it as an MDP.
Markov Decision Processes
An MDP is defined by a tuple $\langle X, U, P, R \rangle$ where X is a state space, U is a set of actions, $P$ is a mapping from state-action pairs to a probability distribution over the next states, and $R$ is a mapping from state-action-next-state triples to the reward. The MDP defines a process of rounds. In each round $t$ we are at a state $x_t \in X$ and must choose an action $u_t$ from U. According to our action, the following state and the reward are determined according to $P$ and $R$. The objective of an MDP is to maximize the cumulative sum of rewards with a future discount of $\gamma \in (0,1)$, i.e. $\mathbb{E}\left[\sum_{t} \gamma^{t} r(x_t, u_t)\right]$, where $u_t$ is the action taken at time $t$, $x_t$ is the state at time $t$, and $r(x_t, u_t)$ is the expected reward given the state-action pair. For this objective we seek a policy $\pi: X \to U$ mapping each state to an action. The objective of planning in an MDP is to find a policy maximizing the value function $V^{\pi}$, where the value $V^{\pi}(x)$ is the long-term accumulated reward obtained by following the policy $\pi$, starting in state $x$. We denote the optimal value function by $V^{*}$. A policy is optimal if its corresponding value function is $V^{*}$ (see [Bertsekas and Tsitsiklis1995] for details).
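The Bellman operator (or DP operator) $T$ maps a function $V: X \to \mathbb{R}_{+}$ (where $\mathbb{R}_{+}$ is the set of non-negative reals) to another function $TV: X \to \mathbb{R}_{+}$ and is defined as follows.
$$(TV)(x) = \max_{u \in U} \sum_{x' \in X} P(x' \mid x, u) \left[ R(x, u, x') + \gamma V(x') \right] \qquad (1)$$
where $x$ and $x'$ denote the current and next state, respectively, $P$ is the transition probability, $R$ is the reward and $\gamma$ is the discount factor.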
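The DP operator (1) and its repeated application can be sketched on a tiny tabular MDP. This is a minimal illustration rather than the paper's implementation: the dictionary layout of `transitions`, the toy two-state session MDP in the usage note, and the discount value are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# transitions[state][action] is a list of (probability, next_state, reward).
# A state with no actions is treated as terminal (value zero).
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]


def bellman_update(values: Dict[str, float], transitions: Transitions,
                   gamma: float) -> Dict[str, float]:
    """Apply the DP operator once: maximize expected reward plus discounted value."""
    new_values = {}
    for state, actions in transitions.items():
        if not actions:
            new_values[state] = 0.0
            continue
        new_values[state] = max(
            sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)
            for outcomes in actions.values()
        )
    return new_values


def value_iteration(transitions: Transitions, gamma: float,
                    iterations: int) -> Dict[str, float]:
    """Start from the zero function and apply the DP operator repeatedly."""
    values = {state: 0.0 for state in transitions}
    for _ in range(iterations):
        values = bellman_update(values, transitions, gamma)
    return values
```

For instance, with a hypothetical state `"session"` whose action `"good"` continues with probability 0.8 (reward 1) and otherwise moves to a terminal state `"end"`, and discount 0.9, the iterates approach the fixed point 0.8 / (1 - 0.72).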
Under mild conditions, the equation $V = TV$ is known to have a unique solution, which is the fixed point of the equation and equals $V^{*}$. A known method for finding $V^{*}$ is the Value Iteration (VI; [Bertsekas and Tsitsiklis1995, Sutton and Barto1998]) algorithm, which is defined by applying the DP operator (1) repeatedly on an initial function $V_0$ (e.g. the constant function mapping all states to zero). More precisely, applying (1) $n$ times on $V_0$ yields $V_n = T^{n} V_0$, and the VI method consists of estimating $V^{*} = \lim_{n \to \infty} T^{n} V_0$. The VI algorithm is known to converge to $V^{*}$. However, computational difficulties arise for large state and action spaces.
Notations
Let us first formally define the rounds of the user-system interaction and our objective. When a new user arrives to the system (e.g., content provider) we begin a session. At each round, we present the user with a subset of a bounded number of items from the set of available items. The user either terminates the session, in which case the session ends, or chooses a single item from the set, in which case we continue to the next round. The reward is either 1 if the user chose an item (footnote 2: It is an easy task to extend our results to a setting where different items incur different rewards. For simplicity, however, we assume equality between items in terms of rewards.) or 0 otherwise. Following a common framework for MDPs, our objective is to maximize the sum of rewards with future rewards discounted by a constant factor. That is, by denoting the reward of each round and
the random variable (or random time) describing the total number of rounds, we aim to maximize
(2) 
The reason for considering a discount factor smaller than one is the fact that the difference between a session of, say, length 10 and length 5 is not the same as that between length 6 and length 1. Indeed, in the user cold-start problem one can think of a model where every additional item observed by the user increases the probability of her registering, yet this function is not linear but rather monotone increasing and concave.
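The concavity argument can be checked numerically with a short sketch. It assumes a unit reward per consumed item and a discount exponent starting at zero; the discount value 0.9 in the usage note is a hypothetical choice.

```python
def discounted_session_value(length: int, gamma: float) -> float:
    """Discounted reward of a session in which `length` items were consumed.

    Assumes reward 1 per consumed item and discount gamma**t for the item
    consumed in round t = 0, 1, ..., length - 1.
    """
    return sum(gamma ** t for t in range(length))


if __name__ == "__main__":
    gamma = 0.9  # hypothetical discount factor
    early_gain = discounted_session_value(6, gamma) - discounted_session_value(1, gamma)
    late_gain = discounted_session_value(10, gamma) - discounted_session_value(5, gamma)
    print(early_gain, late_gain)
```

Both gains correspond to five additional items, yet the early gain is larger, which is exactly the monotone and concave behaviour described above.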
We continue to describe the modeling of users. Recall that users are assumed to be characterized by one of the members of the set of user types. Our input contains, for every set of items, every user type, and any item in the set, the probability of a user of that type choosing the item when presented with the set. In the session dynamics described above we maintain at all times a belief regarding the user type, which is a member of the set of distributions over the user types (footnote 3: Eventually we consider a discretization of the simplex, but for clarity we discuss this issue only at a later stage.). Notice that given this distribution we may compute, for every set and item, the probability of the user choosing the item.
Assume now that at some round our belief state is given, we presented the user with a set of items, and the user chose an item from it. The following observation provides the posterior probability over the user types. The proof is based on Bayes’ rule; as it is quite simple we defer it to Appendix A in the supplementary material.
Observation 1
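The posterior type-probability for a prior, an action and a chosen item is obtained by Bayes’ rule. Writing $b$ for the prior over user types, $S$ for the presented set, $i$ for the chosen item and $p_m(i \mid S)$ for the probability that a user of type $m$ chooses $i$ from $S$, the posterior $b'$ satisfies
$$b'_m = \frac{b_m \, p_m(i \mid S)}{\sum_{m'} b_{m'} \, p_{m'}(i \mid S)} \qquad (3)$$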
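The belief update of Observation 1 is a direct application of Bayes' rule and can be sketched as follows. The function name, the list representation of the belief and the callable `choice_prob` are assumptions made for this illustration.

```python
from typing import Callable, Hashable, List, Sequence


def posterior_belief(
    prior: Sequence[float],
    choice_prob: Callable[[int, Sequence[Hashable], Hashable], float],
    shown: Sequence[Hashable],
    chosen: Hashable,
) -> List[float]:
    """Posterior over user types after `chosen` was picked from `shown`.

    choice_prob(type_index, shown, chosen) is the known probability that a
    user of the given type picks `chosen` when presented with `shown`.
    """
    weights = [prior[m] * choice_prob(m, shown, chosen) for m in range(len(prior))]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("the chosen item has zero probability under the prior")
    return [w / total for w in weights]
```

As a usage example, with two equally likely types that pick item `"a"` with probabilities 0.6 and 0.2 respectively, observing `"a"` moves the belief to (0.75, 0.25).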
Formulating the Problem as an MDP
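We formulate our problem as an MDP as follows. The state space consists of the belief states, i.e. the distributions over user types, together with an additional termination state. The action space consists of all item subsets of the allowed cardinality. The reward function depends only on the target state and is defined as 1 for any non-termination state and zero for the termination state. As a result of Observation 1, we are able to define the transition function: the probability of moving from a belief state to a given posterior belief state under an action is the total probability, under the current belief, of choosing an item from the set of items for which (3) yields that posterior. The final missing piece of the transition function is the probability of moving to the termination state, which defines the session end; it is the remaining probability mass, i.e. one minus the probability that some item of the presented set is chosen.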
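The transition structure just described can be sketched as a function that, given a belief and a presented set, lists the possible outcomes. This is an illustrative sketch under assumed names and data layout; for simplicity it does not merge items that happen to lead to the same posterior.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def belief_transitions(
    belief: Sequence[float],
    choice_prob: Callable[[int, Sequence[Hashable], Hashable], float],
    shown: Sequence[Hashable],
) -> Tuple[List[Tuple[Hashable, float, List[float]]], float]:
    """Return ([(item, probability, posterior belief)], termination probability)."""
    outcomes = []
    continue_prob = 0.0
    for item in shown:
        # Joint probability of (type m, item chosen) for every type m.
        joint = [b * choice_prob(m, shown, item) for m, b in enumerate(belief)]
        p_item = sum(joint)
        if p_item > 0.0:
            outcomes.append((item, p_item, [w / p_item for w in joint]))
            continue_prob += p_item
    # Whatever mass is not assigned to an item choice ends the session.
    return outcomes, 1.0 - continue_prob
```

Each returned outcome yields reward 1 and moves to the listed posterior, while the termination outcome yields reward 0.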
4 User Modeling Assumptions
In order to obtain our theoretical guarantees we use assumptions regarding the user behavior. Specifically, we assume a certain structure in the function mapping an item set and an item to the probability that a user of a given type (any type) will choose the item when presented with the item set. To assess the validity of the below assumptions consider an example standard model (footnote 4: An example of where this modeling is implicitly made is the setting of Multinomial Logistic Regression.) where each item in the set (and the empty item) has a positive value for the user and the chosen item is drawn with probability proportional to its value. We note that the below assumptions hold for this model.
The first assumption essentially states that at all states there is a constant probability, bounded away from zero, to reach the termination state. In our setup this translates into an assumption that even given knowledge of the user type, the probability of the user ending the session remains nonzero. Needless to say this is a highly practical assumption.
Assumption 1
For a constant , any set of content items where , any types vector and a content item , it holds that
In what follows, our approximation guarantee will depend on this constant, that is, on how much the best-case-scenario probability of ending a session is bounded away from zero. The second assumption asserts independence between the probabilities of choosing different content items.
Assumption 2
For every , a set of content items and a content item it holds that
(4) 
The above assumption is related to the independence of irrelevant alternatives axiom (IIA) [Saari2001] of decision theory, stating that “If A is preferred to B out of the choice set {A,B}, introducing a third option X, expanding the choice set to {A,B,X}, must not make B preferable to A”. Our assumption is simply a quantitative version of the above.
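The example model mentioned in this section, where an item is chosen with probability proportional to its value and the empty item corresponds to terminating the session, can be sketched as follows. The function name and the scores in the usage note are hypothetical; the sketch illustrates the IIA-style ratio property of this model and is not a statement of the exact inequality of Assumption 2.

```python
from typing import Dict, Hashable, Sequence, Tuple


def choice_probabilities(
    scores: Dict[Hashable, float],
    termination_score: float,
    shown: Sequence[Hashable],
) -> Tuple[Dict[Hashable, float], float]:
    """Proportional-to-score choice model for a single user type.

    Returns ({item: probability of choosing it}, probability of terminating).
    All scores are assumed to be strictly positive.
    """
    total = termination_score + sum(scores[item] for item in shown)
    probs = {item: scores[item] / total for item in shown}
    return probs, termination_score / total


if __name__ == "__main__":
    scores = {"a": 2.0, "b": 1.0, "c": 3.0}  # hypothetical item values
    two, stop_two = choice_probabilities(scores, 1.0, ["a", "b"])
    three, stop_three = choice_probabilities(scores, 1.0, ["a", "b", "c"])
    # The ratio between "a" and "b" is unaffected by adding "c".
    print(two["a"] / two["b"], three["a"] / three["b"])
    # The termination probability stays positive but shrinks.
    print(stop_two, stop_three)
```

In this model adding an item never increases the probability of any previously shown item, and the termination probability is positive as long as the termination score is.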
5 Approximation Of the Value Function
In this section we develop a computationally efficient approximation of the value function for the setup described above. We begin with dealing with the action space, and later we also take into consideration the continuity of the state space.
Addressing the Largeness of the Action Space by Submodularity
In this section we provide a greedy approach dealing with the large action space, leading to a running time that does not scale with the size of the action space. For clarity we ignore the fact that X is infinite and defer its discretization to the next subsection. The outline of the section is as follows: We first mention that the immediate reward function, when viewed as a function of the action, is monotone and submodular. Next, we define a modified value-iteration procedure we denote by greedy value iteration (GVI), resulting in a sequence of approximate value functions and Q functions, obtained in the iterations of the procedure. We show that these functions are approximately monotone and approximately submodular and that for functions with these approximate monotone-submodular properties, the greedy approach provides a constant approximation for maximization; we are not aware of papers using the exact same definitions for approximate monotonicity and submodularity, yet we do not consider this contribution as major since the proofs regarding the greedy approach are straightforward given existing literature. Finally, we tie the results together and obtain an approximation of the true Q function, as required.
Since it is mainly technical and due to space limitations, we defer the proof that the reward function is monotone and submodular to Appendix B. We now turn to describe the process GVI. We start by defining our approximate maximum operator
Definition 2
Let be a set, , and let be an integer. We denote by the set of subsets of of size . The operators (the superscript “g” for greedy) are defined as follows
Informally, the operator maximizes the value of a function over subsets of restricted size by greedily adding elements to a subset, each time adding the element that maximizes the function value. For a value function we define the Q function as
(5) 
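The greedy approximate maximization described informally above can be sketched generically. The function name, the tie-breaking rule (first element in the given order) and the example set function in the usage note are assumptions of this sketch.

```python
from typing import Callable, FrozenSet, Hashable, Sequence, Tuple


def greedy_max(
    ground_set: Sequence[Hashable],
    k: int,
    f: Callable[[FrozenSet[Hashable]], float],
) -> Tuple[FrozenSet[Hashable], float]:
    """Build a subset of size at most k by repeatedly adding the best element."""
    chosen: FrozenSet[Hashable] = frozenset()
    remaining = list(ground_set)
    for _ in range(min(k, len(remaining))):
        # Pick the element whose addition gives the largest function value.
        best = max(remaining, key=lambda element: f(chosen | {element}))
        chosen = chosen | {best}
        remaining.remove(best)
    return chosen, f(chosen)
```

For example, with hypothetical weights `{"a": 4, "b": 1, "c": 9, "d": 2}` and the monotone submodular function given by the square root of the total weight, two greedy steps select `"c"` and then `"a"`. The procedure evaluates the function roughly `k` times the size of the ground set, instead of once per subset.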
When it is clear from context which value function is referred to, we omit the subscript. Recall that the standard DP operator is defined in (1). Using our greedy-based approximate max we define two greedy-based approximate DP operators. The first is denoted as the simple-greedy approach where
(6) 
As it turns out, the simple-greedy approach does not necessarily converge to a quality value function. In particular, the function obtained by it does not exhibit the monotone-submodular-like qualities that we require for our analysis. We hence define a second DP operator, which we call the greedy operator.
Definition 3
For a function we define
(7) 
where the set is defined in the following statement,
In words, we take advantage of the fact that the number of states is small (as opposed to the number of actions) and use the greedy maximization operator not to associate actions with states but rather to reduce the number of actions to be at most the same as the number of states. We then choose the actual action for each state from this small subset of actions. Notice that the computational complexity of the greedy operator differs from that of the simple-greedy operator because of this additional comparison step. In Section 6, we explore whether there is a need for the further complication involved with using the greedy operator rather than the simple-greedy one, or whether its use is needed only for the analysis. We show that in simulations, the system using the greedy operator significantly outperforms that using the simpler operator.
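The difference between the two operators can be sketched for a finite state set. This is a schematic illustration under assumptions: `q_value(state, action)` stands for the Q function computed from the current value function, and the function names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Sequence

Action = FrozenSet[Hashable]
QFunction = Callable[[Hashable, Action], float]


def greedy_action(state: Hashable, items: Sequence[Hashable], k: int,
                  q_value: QFunction) -> Action:
    """Greedily build an action of size at most k for a single state."""
    chosen: Action = frozenset()
    remaining = list(items)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda item: q_value(state, chosen | {item}))
        chosen = chosen | {best}
        remaining.remove(best)
    return chosen


def simple_greedy_step(states: Sequence[Hashable], items: Sequence[Hashable],
                       k: int, q_value: QFunction) -> Dict[Hashable, float]:
    """Each state uses only its own greedy action."""
    return {s: q_value(s, greedy_action(s, items, k, q_value)) for s in states}


def greedy_step(states: Sequence[Hashable], items: Sequence[Hashable],
                k: int, q_value: QFunction) -> Dict[Hashable, float]:
    """Collect one greedy action per state, then let every state pick the
    best action among this reduced candidate set."""
    candidates = {greedy_action(s, items, k, q_value) for s in states}
    return {s: max(q_value(s, a) for a in candidates) for s in states}
```

Since the candidate set contains each state's own greedy action, the value produced by `greedy_step` is never below that of `simple_greedy_step` for the same Q function, at the price of extra Q evaluations per state (one for each candidate).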
Recall that the value iteration (VI) procedure consists of starting with an initial value function, commonly the zero function, then applying the DP operator to it multiple times until convergence. Our GVI process is essentially the same, but with the greedy operator. Specifically, we initialize the value function to be the zero function and analyze the properties of the functions obtained in the subsequent iterations. In our analysis we manage to tie the value function computed by GVI w.r.t. one decay value (Equation (5)) to the value function of the true VI procedure, computed w.r.t. a different decay value. To display our result we use a notation for the iterated DP operator applied w.r.t. a given decay value. The proof is given in Appendix C.
Theorem 4
To better understand the meaning of the above expression we estimate the value of for the initial state in reasonable settings. Specifically, we would like to estimate
In cases where is a small constant we get a constant multiplicative approximation of the value function obtained via the optimal, computationally inefficient maximization.
In the supplementary material (Lemma 19) we provide the bound
The proof is purely technical. Notice that is in fact the probability of the user, given the state and us choosing the best possible action, choosing a link rather than terminating the session. Assuming a large number of content items (compared to ) it is most likely that for every type there are much more than favorable items. This informally means that either the probability of choosing any item among a set is roughly or is a poor choice of links and the probability of ending the session when presenting is significantly lower than . It is thus reasonable to assume that
Hence
For example, for and . Then we have , hence , meaning we get a multiplicative approximation compared to the optimal operator with .
Addressing Both The Continuity of State Space and The Largeness of the Action Space
Recall that the state space of our model is continuous. As our approach requires scanning the state space we present here an analysis of our approach taken over a discretized state space. That is, rather than working over the entire simplex of distributions over user types, our finite state space X is taken to be an $\epsilon$-net, w.r.t. a fixed norm, over it.
As before, the value iteration we suggest takes the greedy approach where the only difference is in the definition of the function.
Definition 5
The function, based on a function mapping a state to a value is defined as follows:
(11) 
where each posterior belief is replaced by the closest point in X to it.
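The discretization step can be sketched as follows: build a finite grid over the simplex and map any belief to its closest grid point. The regular grid with entries that are multiples of `1 / resolution` and the use of the L1 norm are assumptions of this sketch; the paper's specific net and norm may differ.

```python
from itertools import product
from typing import List, Sequence, Tuple


def simplex_grid(num_types: int, resolution: int) -> List[Tuple[float, ...]]:
    """All belief vectors whose entries are multiples of 1 / resolution."""
    points = []
    for counts in product(range(resolution + 1), repeat=num_types - 1):
        if sum(counts) <= resolution:
            last = resolution - sum(counts)
            points.append(tuple(c / resolution for c in counts + (last,)))
    return points


def nearest_grid_point(belief: Sequence[float],
                       grid: Sequence[Tuple[float, ...]]) -> Tuple[float, ...]:
    """Closest grid point under the L1 norm (the norm is an assumption here)."""
    return min(grid, key=lambda p: sum(abs(a - b) for a, b in zip(belief, p)))
```

With three user types and resolution 4 the grid has 15 points, and a posterior such as (0.26, 0.49, 0.25) is mapped to (0.25, 0.5, 0.25). A finer resolution gives a smaller discretization error at the cost of more states to scan.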
Analogously to before, we define the operator over a value function as
(12) 
with being defined w.r.t. the finite state set X. In Appendix E we prove the following theorem, giving the analysis of the above value iteration procedure.
Theorem 6
For sufficiently small $\epsilon$, the result is essentially the same as that of the previous subsection.
6 Experiments
(Footnote 5: Additional experiments are provided in Section F of the supplementary material.)
In this section we investigate numerically the algorithms suggested in Section 5. We examine four types of content provider (CP) policies:
1. Random policy, where the CP provides a (uniformly) random set of content items at each round.
2. Regular DP operator policy, namely as in (1), in which the maximum is computed exactly. The computational complexity of each iteration of the VI with the original DP operator is of order of .
3. Greedy Operator policy, namely following the operator as in (12). In this case the computational complexity of each iteration of the GVI is of order of .
4. Simple Greedy CP, namely following the operator as in (6). No theoretical guarantees are provided for this CP, but given its low per-iteration computational complexity and its similarity to the greedy CP, we are interested in its performance.
We conducted our experiments on synthetic data. The users’ policy implemented the following model relating the scores to the users’ choice,
where is a score expressing the subjective value of item for users of type and where expresses the tendency of user of a type to terminate the session. It is easy to verify that for large enough compared to the scores, Assumption 1 holds, and that Assumption 2 holds for any value assigned to and .
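A session under such a score-based user model can be simulated with a few lines of code. This is a hedged sketch, not the experimental code of the paper: it handles a single user type, and the `policy` callable, the `max_rounds` cap and all score values in the usage note are assumptions.

```python
import random
from typing import Callable, Dict, Hashable, Sequence


def simulate_session(
    scores: Dict[Hashable, float],
    termination_score: float,
    policy: Callable[[int], Sequence[Hashable]],
    rng: random.Random,
    max_rounds: int = 1000,
) -> int:
    """Return the number of items consumed before the session terminates.

    In each round the user picks an item with probability proportional to its
    score, or terminates with probability proportional to termination_score.
    policy(round_index) returns the items shown in that round.
    """
    length = 0
    for _ in range(max_rounds):
        shown = list(policy(length))
        weights = [termination_score] + [scores[item] for item in shown]
        pick = rng.choices(range(len(weights)), weights=weights)[0]
        if pick == 0:  # index 0 stands for terminating the session
            break
        length += 1
    return length


if __name__ == "__main__":
    rng = random.Random(0)
    scores = {"a": 1.0, "b": 1.0}  # hypothetical scores
    lengths = [simulate_session(scores, 2.0, lambda t: ["a", "b"], rng)
               for _ in range(10000)]
    print(sum(lengths) / len(lengths))
```

With the hypothetical scores above the user continues with probability 1/2 in every round, so the average session length is close to 1.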
For the experiments, we considered the case of , , and . The scores were chosen as follows: For all types, the termination score was . Four items were chosen i.i.d. uniformly at random from the interval . The remaining items were chosen such that for each user type, items are uniformly distributed in (strongly related to this type), while the others are drawn uniformly from . We repeated the experiment times, where for each repetition a different set of scores was generated and sessions were generated (a total of sessions).
In Figure 1 we present the average session length under the optimal, greedy and simple greedy CPs for different numbers of iterations executed for computing the value function. The average length that was achieved by the random CP is , much lower than that of the other methods. The standard deviation is smaller than in all of our measurements. As shown in Figure 1, the extra comparison step in the greedy CP compared to the simple greedy CP substantially improves the performance.
7 Discussion and Conclusions
In this work we developed a new framework for analyzing recommendation systems using the MDP framework. The main contribution is twofold. First, we provide a model for the user cold-start problem with one-shot sessions, where a single round with low quality recommendations may end the session entirely. We formulate a problem where the objective is to maximize the session length (footnote 7: Another problem, which is somewhat related to the cold-start problem, is that of devices that are shared between several users [White et al.2014, File2013]. In this scenario, several people share the same device while the content provider is aware only of the identity of the device and not of the identity of the user. This phenomenon typically occurs with devices in the same household, shared by the members of the family. The methods developed in this work can be easily adapted to solve this problem as well.). Second, we suggest a greedy algorithm overcoming the computational hardship involved with the combinatorial action space present in recommendation systems that recommend several items at a time. The effectiveness of our theoretical results is demonstrated with experiments on synthetic data, where we see that our method performs practically as well as the computationally unbounded one.
As future work we plan to generalize our techniques for dealing with the combinatorial action space to setups other than the user cold-start problem, and aim to characterize the conditions in which the Q function is (approximately) monotone and submodular. In particular we will also consider an extension to POMDPs that may deal with similar settings in which the number of user types can take larger values.
References

[Agrawal, Teneketzis, and Anantharam1989] Agrawal, R.; Teneketzis, D.; and Anantharam, V. 1989. Asymptotically efficient adaptive allocation schemes for controlled Markov chains: Finite parameter space. Automatic Control, IEEE Transactions on 34(12):1249–1259.
[Bertsekas and Tsitsiklis1995] Bertsekas, D. P., and Tsitsiklis, J. N. 1995. Neuro-dynamic programming: an overview. In Decision and Control, 1995., Proceedings of the 34th IEEE Conference on, volume 1, 560–564. IEEE.
[Chen, Wang, and Yuan2013] Chen, W.; Wang, Y.; and Yuan, Y. 2013. Combinatorial multi-armed bandit: General framework, results and applications. In Proceedings of the 30th International Conference on Machine Learning (ICML 13), Atlanta, Georgia, USA.
[Deshpande and Montanari2012] Deshpande, Y., and Montanari, A. 2012. Linear bandits in high dimension and recommendation systems. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, 1750–1754. IEEE.
[File2013] File, T. 2013. Computer and internet use in the United States. Population Characteristics.
[Kaelbling, Littman, and Cassandra1998] Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1):99–134.
[Maillard and Mannor2014] Maillard, O., and Mannor, S. 2014. Latent bandits. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, 136–144.
[Nemhauser, Wolsey, and Fisher1978] Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming 14(1):265–294.
[Radlinski, Kleinberg, and Joachims2008] Radlinski, F.; Kleinberg, R.; and Joachims, T. 2008. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, 784–791. ACM.
[Saari2001] Saari, D. 2001. Decisions and elections: explaining the unexpected. Cambridge University Press.
[Salomon and Audibert2011] Salomon, A., and Audibert, J.-Y. 2011. Deviations of stochastic bandit regret. In Algorithmic Learning Theory, 159–173. Springer.
[Satsangi, Whiteson, and Oliehoek2015] Satsangi, Y.; Whiteson, S.; and Oliehoek, F. A. 2015. Exploiting submodular value functions for faster dynamic sensor selection. In AAAI 2015: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.
[Shani, Heckerman, and Brafman2005] Shani, G.; Heckerman, D.; and Brafman, R. I. 2005. An MDP-based recommender system. In Journal of Machine Learning Research, 1265–1295.
[Sutton and Barto1998] Sutton, R. S., and Barto, A. G. 1998. Introduction to reinforcement learning. MIT Press.
[White et al.2014] White, R. W.; Hassan, A.; Singla, A.; and Horvitz, E. 2014. From devices to people: Attribution of search activity in multi-user settings. In Proceedings of the 23rd International Conference on World Wide Web, 431–442. International World Wide Web Conferences Steering Committee.
[Yue and Guestrin2011] Yue, Y., and Guestrin, C. 2011. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems, 2483–2491.
Appendix A Missing Proofs
Proof of Observation 1
Here we use as a short for . By Bayes’ theorem, for any , and , it follows that
(13)
where stands for the probability that the user type is .
So, the result is obtained.
Appendix B Additional Propositions and Lemmas
In the following propositions and lemmas we derive some results related to greedy maximization of submodular functions. These results are used for the proofs of Theorems 4 and 6.
Model Properties
In the following proposition we show the monotonicity and submodularity properties of the chosen model.
Proposition 7
Under Assumption 2, for any two sets of content items and a content item , it holds that (monotonicity)
(14) 
and (submodularity)
(15) 
for any type .
Almost Submodular Maximization
In this Section we provide three Lemmas: Lemma 8 is the main result, which generalizes the classical result proposed in [Nemhauser, Wolsey, and Fisher1978] to "almost"-monotone and "almost"-submodular functions.
Lemma 8
Let be a function mapping subsets of to nonnegative reals with the following properties:


for all ,

for all and ,
for some scalar .
Then, it is obtained that
where is obtained by the Greedy Algorithm and
(18) 
Proof: (Based on Nemhauser et al. 1978) By Lemma 10, for we have
where is the set that is obtained by the greedy algorithm after iterations and the set attains the optimal value, namely, .
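The classical guarantee that Lemma 8 generalizes, namely that greedy selection achieves at least a (1 - 1/e) fraction of the optimum for a monotone submodular function under a cardinality constraint, can be checked on a small coverage instance. The instance below is a hypothetical example, and the brute-force search is only feasible because it is tiny.

```python
from itertools import combinations
from math import e
from typing import Dict, List, Sequence, Set


def coverage(sets: Dict[str, Set[int]], chosen: Sequence[str]) -> int:
    """Number of elements covered by the chosen sets (monotone and submodular)."""
    covered: Set[int] = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)


def greedy(sets: Dict[str, Set[int]], k: int) -> List[str]:
    """Pick k sets, each time adding the one with the largest coverage."""
    chosen: List[str] = []
    for _ in range(min(k, len(sets))):
        best = max((n for n in sets if n not in chosen),
                   key=lambda n: coverage(sets, chosen + [n]))
        chosen.append(best)
    return chosen


def brute_force(sets: Dict[str, Set[int]], k: int) -> Sequence[str]:
    """Exhaustive search over all subsets of size k."""
    return max(combinations(sets, k), key=lambda c: coverage(sets, c))


if __name__ == "__main__":
    # Hypothetical instance in which greedy is suboptimal.
    sets = {"G": {1, 2, 4, 5}, "P": {1, 2, 3}, "Q": {4, 5, 6}}
    greedy_value = coverage(sets, greedy(sets, 2))
    optimal_value = coverage(sets, brute_force(sets, 2))
    print(greedy_value, optimal_value, greedy_value >= (1 - 1 / e) * optimal_value)
```

Here greedy first takes the largest set and ends with coverage 5, while the optimum is 6, which is consistent with the (1 - 1/e) bound.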
In the following Lemma we bound the loss of adding greedily one item to a given set. This lemma is used for the proof of Lemma 10 (which is used for the proof of Lemma 8).
Lemma 9
Under the conditions of Lemma 8, after applying the Greedy Algorithm, it holds that
where the set attains the optimal value, namely, , and the set is the set that is obtained by the greedy algorithm after iterations.
Proof: For every set of content items and , we denote and . So, we have,
Then, since for every and
and
it is obtained that
Therefore,
Then, for the choice of , since , we have
In the following Lemma we bound the loss that is incurred by adding greedily a certain number of items to a set. This lemma is used for the proof of Lemma 8.
Lemma 10
Under the Greedy Algorithm, it holds that
where the set attains the optimal value, namely, , and the set is the set that is obtained by the greedy algorithm after iterations.
Appendix C Proof of Theorem 4
In this Section we provide the proof of Theorem 4. Here, we use and as shorthand for and , respectively. We begin with a Lemma that upper bounds the value function obtained by the operator. Then, in Lemmas 12 and 13 we show a monotonic increasing property of the value function. In Lemmas 14 and 15 we show the convexity of the value function. In Lemma 16 we show the “almost” submodularity of the Q-function, while in Lemma 17 we show the monotonicity of the Q-function. Lemma 18 shows the direct relation between a larger set of items and a larger long-term cumulative reward. We conclude this section with the proof of Theorem 4, which is based on Lemmas 11-18.
Lemma 11
For every , and zero initialization of the value function (namely, ), it holds that
Proof: It is obtained easily by Assumption 1, that for every it holds that
where is an upper bound on for every . So, since , we have that
In the next two lemmas we show a monotonic increasing property of the value function that is obtained by the operator .
Lemma 12
Let and let and be a pair of positive constants. Assume that
(19) 
for all . Then it holds that
Proof:
The result is immediate since for every .
Lemma 13
Let and let and be a pair of positive constants. Assume that
(20) 
for all . Then, we have for any positive integer
Proof: We prove the claim by induction over . The base case for holds due to Lemma 12. Assume that the lemma is satisfied for . Recall Equation (3) characterizing
By plugging in with Equation (20) we get that
for any , and , as . Therefore, by the induction assumption applied for
(21) 
for every and . Furthermore, by Equation (20)
(22) 
for every and . So, by the fact that
and also respectively for and , it is obtained by Equations (21) and (22) that
(23) 
for any .
So, by Definition 3 the result is obtained.
In the following two lemmas we show a convexity property of the value function that is obtained by the operator.
Lemma 14
Let and let , and be a tuple of positive constants. Assume that
(24) 
for all . Then it holds that
Proof:
The claim holds for the initial value function for every .
Lemma 15 (Convexity)
Let and let , and be a tuple of positive constants. Assume that
(25) 
for all . We have that for any positive integer it holds that