In recent times Reinforcement Learning (RL) has seen great success in many domains. In particular, Q-learning [Watkins and Dayan] and its deep learning extension DQN [Mnih et al.] have shown great performance in challenging domains such as the Arcade Learning Environment [Bellemare et al.]. Two features of DQN are worth highlighting: the use of neural networks (NNs) to approximate $Q$-functions; and the ability to learn off-policy and use replay buffers [Lin], which allows DQN to be very sample efficient. Traditional RL focuses on the interaction between one agent and an environment. However, in many cases of interest, multiple agents need to interact with a common environment and with each other. This is the object of study of Multi-agent RL (MARL), which goes back to the early work of Tan and has seen renewed interest of late (for an updated survey see Zhang et al.). In this paper we consider the particular case of cooperative MARL, in which the agents form a team and share a single goal. We are interested in tasks where collaboration is fundamental and a high degree of coordination is necessary to achieve good performance. In particular, we consider two scenarios.
In the first scenario, the global state and all actions are visible to all agents (one example of this situation could be a team of robots that collaborate to move a big and heavy object). It is well known that in this scenario the team can be regarded as one single agent whose aggregate action consists of the joint actions of all agents [Littman]. The fundamental drawback of this approach is that the joint action space grows exponentially in the number of agents and the problem quickly becomes intractable [Kok and Vlassis, Guestrin et al. 2002b]. Another important inconvenience of this approach is that it cannot cope with a changing number of agents (for example, if the system is trained with 4 agents, it cannot be executed by a team of 5 agents; we expand on this point in a later section). One well-known and popular approach to solve these issues is to consider each agent as an independent learner (IL) [Tan]. However, this approach has a number of issues. First, from the point of view of each IL, the environment is non-stationary (due to the changing policies of the other agents), which jeopardizes convergence. Second, replay buffers cannot be used due to the changing nature of the environment, and therefore even in cases where this approach might work, the data efficiency of the algorithm is negatively affected. Ideally, it is desirable to derive an algorithm with the following features: i) it learns individual policies (and is therefore scalable); ii) local actions chosen greedily with respect to these individual policies result in an optimal team action; iii) it can be combined with NNs; iv) it works off-policy and can leverage replay buffers (for data efficiency); and v) it enjoys theoretical guarantees of convergence to team optimal policies, at least in the dynamic programming scenario. Indeed, the main contribution of this work is the introduction of Logical Team Q-learning (LTQL), an algorithm that has all these properties.
We start in the dynamic programming setting and derive equations that characterize the desired solution. We use these equations to define the Factored Team Optimality Bellman Operator and provide a theorem that characterizes the convergence properties of this operator. A stochastic approximation of the dynamic programming setting is used to obtain the tabular and deep versions of our algorithm. For the single agent setting, these steps reduce to: the Bellman optimality equation, the Bellman optimality operator (and the theorem which states the linear convergence of repeated application of this operator) and Q-learning (in its tabular form and DQN).
In the second scenario, we consider the centralized training and decentralized execution paradigm. During execution, agents only have access to observations which we assume provide enough information to play an optimal team policy. An example of this case would be a soccer team in which the attackers have the ball and see each other but do not see the goalkeeper or the defenders of their own team (arguably this information is enough to play optimally and score a goal). The techniques we develop for the previous scenario can be applied to this case without modification.
1.1 Relation to prior work
Some of the earliest works on MARL are Tan and Claus and Boutilier. Tan studied Independent Q-learning (IQL) and identified that IQL learners in a MARL setting may fail to converge due to the non-stationarity of the perceived environment. Claus and Boutilier compared the performance of IQL and joint action learners (JAL), where all agents learn the $Q$-values for all the joint actions, and identified the problem of coordination during decentralized execution when multiple optimal policies are available. Littman later provided a proof of convergence for JALs. Recently, Tampuu et al. did an experimental study of ILs using DQNs in the Atari game Pong. None of these approaches can use experience replay, due to the non-stationarity of the perceived environment. Following Hyper Q-learning [Tesauro], Foerster et al. addressed this issue to some extent using fingerprints as proxies to model other agents’ strategies.
Lauer and Riedmiller introduced Distributed Q-learning (DistQ), which in the tabular setting has guaranteed convergence to an optimal policy for deterministic MDPs. However, this algorithm performs very poorly in stochastic scenarios and becomes divergent when combined with function approximation. Later, Hysteretic Q-learning (HystQ) was introduced in Matignon et al. to address these two limitations. HystQ is based on a heuristic and can be thought of as a generalization of DistQ. These works also consider the scenario where agents cannot perceive the actions of other agents. They are related to LTQL (from this work) in that they can be considered approximations to our algorithm in the scenario where agents do not have information about other agents’ actions. Recently, Omidshafiei et al. introduced Dec-HDRQNs for multi-task MARL, which combine HystQ with Recurrent NNs and experience replay (which they recognize is important to achieve high sample efficiency) through the use of Concurrent Experience Replay Trajectories.
Wang and Sandholm introduced OAB, the first algorithm that converges to an optimal Nash equilibrium with probability one in any team Markov game. OAB considers the team scenario where agents observe the full state and joint actions. The main disadvantage of this algorithm is that it requires estimation of the transition kernel and rewards for the joint action state space and also relies on keeping count of state-action visitation, which makes it impractical for MDPs of even moderate size and prevents it from being combined with function approximators.
Guestrin et al. [2002a, b] and Kok and Vlassis introduced the idea of factoring the joint $Q$-function to handle the scalability issue. These papers have the disadvantage that they require coordination graphs that specify how agents affect each other (the graphs require significant domain knowledge). The main shortcoming of these papers is the factoring model they use: they model the optimal $Q$-function (which depends on the joint actions) as a sum of local $Q$-functions (one per agent, each of which considers only the action of its corresponding agent). The main issue with this factorization model is that the optimal $Q$-function cannot always be factored in this way. In fact, the tasks for which this model does not hold are typically the ones that require a high degree of coordination, which happen to be the tasks where one is most interested in applying specific MARL approaches as opposed to ILs. Moreover, even if the $Q$-function can be accurately modeled in this way, there is no guarantee that if individual agents select their optimum strategies by maximizing their local $Q$-functions the resulting joint action maximizes the global $Q$-function. The approach we introduce in this paper also considers learning factored $Q$-functions. However, the fundamental difference is that the factored relations we estimate always exist and the joint action that results from maximizing these individual $Q$-functions is optimal. VDN [Sunehag et al.] and QMIX [Rashid et al.] are two recent deep methods that also factorize the optimal $Q$-function, assuming additivity and monotonicity, respectively. This factoring is their main limitation, since many MARL problems of interest satisfy neither of these two assumptions. Indeed, Son et al. showed that these methods are unable to solve a simple matrix game. Furthermore, the individual policies cannot be used for prediction, since the individual values are not estimates of the return.
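The limitation of the additive model can be checked numerically on a one-shot, two-agent game. The sketch below is illustrative only: the payoff matrices are hypothetical (they are not the game used by Son et al.), and `additive_fit_residual` is a helper name introduced here. It computes the least-squares additive fit (row mean plus column mean minus grand mean) and reports the largest error, which is zero exactly when the payoff can be written as a sum of one local term per agent.

```python
def additive_fit_residual(payoff):
    """Largest absolute error of the least-squares fit f[i] + g[j] to payoff[i][j]."""
    rows, cols = len(payoff), len(payoff[0])
    row_mean = [sum(payoff[i]) / cols for i in range(rows)]
    col_mean = [sum(payoff[i][j] for i in range(rows)) / rows for j in range(cols)]
    grand_mean = sum(row_mean) / rows
    return max(
        abs(payoff[i][j] - (row_mean[i] + col_mean[j] - grand_mean))
        for i in range(rows)
        for j in range(cols)
    )


# Hypothetical payoffs: the first is a sum of local terms, the second is a
# coordination game in which the reward depends on both actions matching.
separable = [[1.0, 2.0], [3.0, 4.0]]
coordination = [[1.0, 0.0], [0.0, 1.0]]

separable_error = additive_fit_residual(separable)
coordination_error = additive_fit_residual(coordination)
```

For the coordination payoff the best additive fit is constant, so it carries no information about which joint action is good; this is the kind of task where a sum of local terms is not expressive enough.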
To improve on the representation limitation due to the factoring assumption, Son et al. introduced QTRAN, which factors the $Q$-function in a more general manner and therefore allows for a wider applicability. The main issue with QTRAN is that although it can approximate a wider class of $Q$-functions than VDN and QMIX, the algorithm resorts to other approximations, which degrade its performance in complex environments (see Rashid et al.).
Recently, actor-critic strategies have been explored [Foerster et al., Gupta et al.]. However, these methods have the inconvenience that they are on-policy and therefore do not enjoy the data efficiency that off-policy methods can achieve. This is of significant importance in practical MARL settings, since the state-action space is very large.
2 Problem formulation
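We consider a situation where multiple agents form a team and interact with an environment and with each other. We model this interaction as a Team Markov Decision Process (TMDP), which we define by the tuple $(\mathcal{S}, \mathcal{T}, K, o^{\tau}, \mathcal{A}^{\tau}, \mathcal{P}, r)$. (Footnote 1: Prior works use definitions such as Dec-POMDP [Oliehoek et al.], Multi-agent MDPs (MAMDP) [Lauer and Riedmiller] or Team Markov Games [Wang and Sandholm]. However, these definitions are different from ours, which is why we opted for the alternative name of TMDP. In particular, TMDPs include the notion of types of agents.) Here, $\mathcal{S}$ is a set of global states shared by all agents; $\mathcal{T}$ is the set of types of agents; $K$ is the total number of agents, each of some type $\tau \in \mathcal{T}$; $o^{\tau}$ is the observation function for agents of type $\tau$, whose output lies in some set of observations $\mathcal{O}^{\tau}$ (Footnote 2: in other words, the observation is the agent’s description of the global state from its own perspective); $\mathcal{A}^{\tau}$ is the set of actions available to agents of type $\tau$; $\mathcal{P}(s' \mid s, \bar{a})$ specifies the probability of transitioning to state $s'$ from state $s$ having taken joint actions $\bar{a}$; and $r$ is a global reward function. Specifically, $r(s, \bar{a}, s')$ is the reward when the team transitions to state $s'$ from state $s$ having taken actions $\bar{a}$. The reward can be a random variable following some distribution. We clarify that from now on we will refer to the collection of all individual actions as the team’s action, denoted as $\bar{a} = (a^1, \dots, a^K)$. Furthermore, we will use $a^{-k}$ to refer to the actions of all agents except for action $a^k$. Therefore we can write $\mathcal{P}(s' \mid s, \bar{a}) = \mathcal{P}(s' \mid s, a^k, a^{-k})$ and $r(s, \bar{a}, s') = r(s, a^k, a^{-k}, s')$. The goal of the team is to maximize the team’s return:

$$J(\pi) = \mathbb{E}_{\pi, \mathcal{P}, d}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \boldsymbol{r}(\boldsymbol{s}_t, \bar{\boldsymbol{a}}_t, \boldsymbol{s}_{t+1})\right] \qquad (1)$$

where $\boldsymbol{s}_t$ and $\bar{\boldsymbol{a}}_t$ are the state and actions at time $t$, respectively, $\pi$ is the team’s policy, $d$ is the distribution of initial states, and $\gamma$ is the discount factor. We clarify that we use bold font to denote random variables, and the subscript on the expectation makes explicit the distribution with respect to which the expectation is taken. From now on, we will only make the distributions explicit in cases where doing so makes the equations clearer. Accordingly, the team’s optimal state-action value function ($Q^\star$) and optimal policy ($\pi^\star$) are given by [Sutton and Barto]:

$$Q^\star(s, \bar{a}) = \mathbb{E}\left[\boldsymbol{r}(s, \bar{a}, \boldsymbol{s}') + \gamma \max_{\bar{a}'} Q^\star(\boldsymbol{s}', \bar{a}')\right], \qquad \pi^\star(s) \in \arg\max_{\bar{a}} Q^\star(s, \bar{a}) \qquad (2)$$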
As already mentioned, a team problem of this form can be addressed with any single-agent algorithm. The fundamental inconvenience with this approach is that the joint action space scales exponentially with the number of agents; more specifically, $|\bar{\mathcal{A}}| = \prod_{k} |\mathcal{A}^{k}|$ (where $\bar{\mathcal{A}}$ is the joint action space and $\mathcal{A}^{k}$ is the action set of agent $k$). Another problem with this approach is that the learned $Q$-function cannot be executed in a decentralized manner using the agents’ observations. Furthermore, the learned quantities (value functions or policies) are useless if the number of agents changes. However, if factored policies are learned, then these could be executed by teams with a different number of agents (as long as the extra agents are of the same "type" as the agents used for learning; in section 5.3 we provide one example of this scenario). For these reasons, in the next sections we concern ourselves with learning factored quantities.
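To make the scaling argument concrete, the following sketch compares the number of entries per state in a joint action-value table with the number needed when each agent keeps its own table. The function names are introduced here for illustration, and the example sizes (4 agents with 5 actions each) are borrowed from the cowboy bull game of section 5.3.

```python
from itertools import product


def joint_actions(per_agent_actions):
    """Enumerate every joint action as a tuple with one entry per agent."""
    return list(product(*per_agent_actions))


def entries_per_state(num_agents, actions_per_agent):
    """Entries needed per state: (joint table, sum of individual tables)."""
    joint = actions_per_agent ** num_agents
    factored = num_agents * actions_per_agent
    return joint, factored


small_team = joint_actions([range(2), range(3)])  # 2 agents with 2 and 3 actions
joint_size, factored_size = entries_per_state(4, 5)
```

With 4 agents and 5 actions each, the joint table needs 625 entries per state while four individual tables need 20, and the gap widens by a factor of 5 with every extra agent.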
We assume that if for two states and we have , then .
Agents of the same type are assumed to be homogeneous. Mathematically, if two agents and are homogeneous, then for every state there is another equivalent state such that:
In simple terms assumption 1 means that even though observations are not full descriptions of the state, they provide enough information to know the effect of individual actions assuming everybody else in the team acts optimally (intuitively this is a reasonable requirement if the agents are expected to be able to play a team optimum strategy using only their partial observations). Assumption 2 means that if two agents of the same type are swapped (while other agents remain unchanged), then the value functions of the corresponding states are equal independently of the policy being executed by the team (as long as the agents swap their corresponding policies as well).
3 Factored Bellman relations and dynamic programming
Similarly to the way that relations (2) are used to derive $Q$-learning in the single agent setting, the goal of this section is to derive relations in the dynamic programming setting from which we can derive a MARL algorithm. The following two lemmas take the first steps in this direction.
See Appendix 6.1.
A simple interpretation of equation (4c) is that is the expected return starting from state when agent takes action while the rest of the team acts in an optimal manner.
Lemmas 1 and 2 are important because they show that if the agents learn factored functions that satisfy (4) and act greedily with respect to them, then the resulting team policy is guaranteed to be optimal, and hence they are not subject to the coordination problem identified in Lauer and Riedmiller (we show this in section 5.1). (Footnote 3: This problem arises in situations in which the TMDP has multiple deterministic team optimal policies and the agents learn factored functions that are not the same as the ones characterized by (4).) Therefore, an algorithm that learns such functions would satisfy the first two of the five desired properties that were enumerated in the introduction. As a sanity check, note that for the case where there is only one agent, equation (4c) simplifies to the Bellman optimality equation. Furthermore, Lemma 2 can be seen as an extension to the TMDP case of the well-known result that every MDP has at least one deterministic optimal policy [Puterman]. Although in the single agent case the Bellman optimality equation can be used to obtain the optimal $Q$-function (by repeatedly applying the operator of the same name), we cannot do the same with (4c). The fundamental reason for this is that the optimal factored functions are not the only functions that satisfy relation (4c).
We prove this with an example. See appendix 6.2.
Note that Remark 1 implies that relation (4c) is not sufficient to derive a learning algorithm capable of obtaining a team optimal policy, because such an algorithm can find sub-optimal Nash equilibria. To avoid this inconvenience, it is necessary to find another relation that is only satisfied by the optimal factored functions. We can obtain one such relation by combining (4a) and (4c):
The sub-optimal Nash fixed points mentioned in Remark 1 do not satisfy relation (6), since by definition the right-hand side is equal to the optimal joint value obtained when the agent’s own action is fixed and the teammates’ actions are maximized over. Intuitively, equation (6) is not satisfied by these sub-optimal strategies because the maximization considers all possible team actions (while Nash equilibria only consider unilateral deviations).
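The difference between unilateral deviations and a search over all team actions can be seen in a one-shot, common-payoff game. The sketch below uses a hypothetical payoff matrix (it is not the example of appendix 6.2), and the function names are introduced here. A joint action is a pure Nash equilibrium if neither agent can improve the shared payoff by changing only its own action; a team optimum maximizes the payoff over all joint actions.

```python
def pure_nash_equilibria(payoff):
    """Joint actions (i, j) from which no single agent gains by deviating alone."""
    rows, cols = len(payoff), len(payoff[0])
    equilibria = []
    for i in range(rows):
        for j in range(cols):
            row_agent_ok = all(payoff[i][j] >= payoff[other][j] for other in range(rows))
            col_agent_ok = all(payoff[i][j] >= payoff[i][other] for other in range(cols))
            if row_agent_ok and col_agent_ok:
                equilibria.append((i, j))
    return equilibria


def team_optima(payoff):
    """Joint actions that maximize the shared payoff over all joint actions."""
    best = max(max(row) for row in payoff)
    return [
        (i, j)
        for i, row in enumerate(payoff)
        for j, value in enumerate(row)
        if value == best
    ]


# Hypothetical shared payoff: action 0 is "do nothing", action 1 is "commit".
# Committing alone is penalized, committing together is rewarded.
payoff = [[0.0, -1.0], [-1.0, 2.0]]
equilibria = pure_nash_equilibria(payoff)
optima = team_optima(payoff)
```

Here both (0, 0) and (1, 1) are Nash equilibria, but only (1, 1) is team optimal: escaping (0, 0) requires both agents to change their actions at once, which is precisely the kind of deviation that a check over unilateral changes never considers.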
where is the indicator function, is a Boolean variable, is the logical not operator applied on , and is some distribution that assigns strictly positive probability to every . Note that operator is stochastic: every time it is applied to , is sampled according to .
A simple interpretation of the operator is the following. Consider a basketball game in which player A has the ball and passes it to teammate B. If B gets distracted, misses the ball, and the opposing team ends up scoring, should A learn from this experience and modify its policy to not pass the ball? The answer is no, since the poor outcome was player B’s fault. In plain English, from the point of view of some player, what the first term of (7) means is "I will only learn from experiences in which my teammates acted according to what I think is the optimal team strategy". It is easy to see why this kind of stubborn rationale cannot escape Nash equilibria (i.e., agents do not learn when the team deviates from its current best strategy, which obviously is a necessary condition to learn better strategies). The interpretation of the full operator is "I will learn from experiences in which: a) my teammates acted according to what I think is the optimal team strategy; or b) my teammates deviated from what I believe is the optimal strategy and the outcome of such deviation was better than I expected if they had acted according to what I thought was optimal", which arguably is what a logical player would do (this is the origin of the algorithm’s name).
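The verbal rule above can be turned into a small sketch. This is only an illustration of the interpretation given in the text, written under simplifying assumptions: the game is one-shot (so the learning target is just the observed reward and there is no bootstrapping), each agent keeps a table indexed by its own action, and all names (`greedy_action`, `logical_update`, `step_size`) are hypothetical. It is not the paper's Algorithm 1.

```python
def greedy_action(q_row):
    """Index of the largest value (first index on ties)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def logical_update(q, k, joint_action, reward, step_size=0.1):
    """Update agent k's estimate in place; return True if the sample was used.

    q[n][a] is agent n's estimate for its own action a in a one-shot game.
    """
    # a) did every teammate act greedily with respect to its current estimate?
    teammates_greedy = all(
        joint_action[n] == greedy_action(q[n])
        for n in range(len(q))
        if n != k
    )
    own_action = joint_action[k]
    # b) was the outcome better than what agent k currently expects?
    better_than_expected = reward > q[k][own_action]
    if teammates_greedy or better_than_expected:
        q[k][own_action] += step_size * (reward - q[k][own_action])
        return True
    return False


q = [[0.0, 0.0], [1.0, 0.0]]  # hypothetical tables; agent 1 currently prefers action 0
ignored = logical_update(q, 0, (1, 1), reward=-1.0)  # teammate deviated, bad outcome
surprise = logical_update(q, 0, (1, 1), reward=2.0)  # teammate deviated, good outcome
on_plan = logical_update(q, 0, (0, 0), reward=-1.0)  # teammate followed the plan
```

In the first call agent 0 discards the sample (the basketball case: the bad outcome is attributed to the teammate's deviation), while the second and third calls both lead to an update.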
Repeated application of the operator to any initial -functions converge to set with probability one. Mathematically:
The mean convergence rate is exponential with constant lower bounded by , where is the lowest probability assigned to any by (i.e. ).
See appendix 6.3.
As a sanity check, notice that in the single agent case the operator reduces to the Bellman optimality operator and Theorem 1 reduces to the well-known result that repeated application of the Bellman optimality operator to any initial $Q$-function converges at an exponential rate (with constant $\gamma$) to $Q^\star$.
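The single-agent result invoked in this sanity check can be verified numerically. The sketch below applies the Bellman optimality operator to two arbitrary tables on a small deterministic MDP and checks that their sup-norm distance shrinks by at least the discount factor. The MDP (transitions and rewards) and the function names are hypothetical and chosen only for illustration.

```python
def bellman_optimality_backup(q, next_state, reward, gamma):
    """One application of the Bellman optimality operator on a deterministic MDP.

    q[s][a] is the current table, next_state[s][a] the successor state and
    reward[s][a] the immediate reward.
    """
    return [
        [reward[s][a] + gamma * max(q[next_state[s][a]]) for a in range(len(q[s]))]
        for s in range(len(q))
    ]


def sup_distance(q1, q2):
    """Largest absolute entry-wise difference between two tables."""
    return max(abs(x - y) for r1, r2 in zip(q1, q2) for x, y in zip(r1, r2))


# Hypothetical two-state, two-action MDP.
next_state = [[0, 1], [0, 1]]
reward = [[0.0, 1.0], [2.0, 0.0]]
gamma = 0.9

q_a = [[0.0, 0.0], [0.0, 0.0]]
q_b = [[5.0, -3.0], [1.0, 7.0]]
before = sup_distance(q_a, q_b)
after = sup_distance(
    bellman_optimality_backup(q_a, next_state, reward, gamma),
    bellman_optimality_backup(q_b, next_state, reward, gamma),
)

# Iterating the operator from q_a approaches a fixed point.
q = q_a
for _ in range(500):
    q = bellman_optimality_backup(q, next_state, reward, gamma)
fixed_point_gap = sup_distance(q, bellman_optimality_backup(q, next_state, reward, gamma))
```

Because each application contracts distances by a factor of at most `gamma`, the iterates converge geometrically regardless of the starting table.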
4 Reinforcement learning
In this section we present LTQL (see Algorithm 1), which we obtain as a stochastic approximation to the Factored Team Optimality Bellman operator. Note that the algorithm uses two estimates for each agent type: a biased one and an unbiased one, each with its own parameters. In the listing of Algorithm 1 we use a constant step-size; however, this can be replaced with decaying step-sizes or other schemes such as AdaGrad [Duchi et al.] and Adam [Kingma and Ba]. Note that the target of the unbiased network is used to calculate the target values for both functions; this prevents the bias in the estimates (which arises due to condition $c_2$, i.e., learning from deviations only when the outcome is better than expected) from propagating through bootstrapping. The target parameters of the biased estimates are used solely to evaluate condition $c_2$. We have found that this stabilizes the training of the networks, as opposed to just using the current parameters of the biased estimates. A weighting hyperparameter treats samples that satisfy condition $c_2$ differently from those that satisfy $c_1$ (the condition that the teammates acted greedily). Intuitively, since the purpose of condition $c_2$ is to escape Nash equilibria, this weight should be chosen as small as possible as long as the algorithm does not get stuck in such equilibria. As we remarked in the introduction, LTQL reduces to DQN for the case where there is a single agent. In appendix 6.6 we include the tabular version of the algorithm along with a brief discussion.
Note that LTQL works off-policy and there is no need for synchronization during exploration. Therefore, it can be implemented in a fully decentralized manner as long as all agents have access to all observations (and therefore to the full state) and actions of other agents (so that they can evaluate $c_1$, the condition that their teammates acted greedily). Interestingly, if condition $c_1$ were omitted (to eliminate the requirement that agents have access to all this information), the resulting algorithm would be exactly DistQ [Lauer and Riedmiller]. However, as the proof of Theorem 1 indicates, the resulting algorithm would only converge in situations where it could be guaranteed that overestimation of the values is not possible during learning (i.e., the tabular setting applied to deterministic MDPs; this remark was already made in Lauer and Riedmiller). In the case where this condition could not be guaranteed (i.e., when using function approximation and/or stochastic MDPs), some mechanism to decrease overestimated values would be necessary, as this is the main task of the updates due to $c_1$. One possible way to do this would be to use all transitions to update the estimates but use a smaller step-size for the ones that do not satisfy $c_2$ (the condition that the observed outcome exceeds the current estimate). Notice that the resulting algorithm would be exactly HystQ [Matignon et al.].
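For reference, the two update rules mentioned in this paragraph can be written side by side for a single table entry. This is a minimal sketch of the standard tabular forms of Distributed Q-learning and Hysteretic Q-learning; the function and parameter names are chosen here, and `target` stands for the usual bootstrapped target (reward plus discounted maximum of the next-state values).

```python
def distq_update(q_value, target):
    """Distributed Q-learning: only ever move the estimate upwards."""
    return max(q_value, target)


def hystq_update(q_value, target, lr_increase, lr_decrease):
    """Hysteretic Q-learning: smaller step-size when the estimate would decrease."""
    delta = target - q_value
    step = lr_increase if delta >= 0 else lr_decrease
    return q_value + step * delta


optimistic = distq_update(1.0, 0.5)           # target below estimate: unchanged
raised = distq_update(1.0, 2.0)               # target above estimate: adopt it
slow_down = hystq_update(1.0, 0.0, 0.5, 0.1)  # decrease uses the small step
fast_up = hystq_update(1.0, 2.0, 0.5, 0.1)    # increase uses the large step
```

Setting `lr_increase = 1` and `lr_decrease = 0` in the hysteretic rule recovers the distributed rule, which is one way to read the earlier statement that HystQ generalizes DistQ.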
5.1 Matrix game
The first experiment is a simple matrix game (figure 1 shows the payoff structure) with multiple team optimum policies, used to evaluate the resilience of the algorithm to the coordination issue mentioned in section 3. In this case, we implemented LTQL and DistQ in tabular form (we do not include HystQ because in deterministic environments with tabular representation this algorithm is dominated by DistQ) and we also implemented Qmix (note that this algorithm cannot be implemented in tabular form due to the use of the mixing network). In all cases we used uniform exploratory policies and we did not use a replay buffer. DistQ converges to (12), which clearly shows why DistQ has a coordination issue. LTQL, however, converges to either of the two possible solutions shown in (13) (depending on the seed), for which individual greedy policies result in team optimal policies. Qmix converges to (14). Note that Qmix fails at identifying an optimum team policy, and the resulting joint $Q$-function obtained using the mixing network also fails at predicting the rewards. The full joint $Q$-function is shown in appendix 6.7, where we also include the learning curves of all algorithms for the reader’s reference along with a brief discussion.
5.2 Stochastic finite TMDP
In this experiment we use a tabular representation in a stochastic episodic TMDP. The environment is a linear grid with 4 positions and 2 agents. At the beginning of the episode, the agents are initialized in the far right. Agent 1 cannot move and has 2 actions (push button or not push), while agent 2 has 3 actions (stay, move left or move right). If agent 2 is located in the far left and chooses to stay while agent 1 chooses push, the team receives a reward. If the button is pushed while agent 2 is moving left, the team receives a negative reward. This negative reward is also obtained if agent 2 stays still in the leftmost position and agent 1 does not push the button. All rewards are subject to additive Gaussian noise with mean 0 and standard deviation equal to 1. Furthermore, if agent 2 tries to move beyond an edge (left or right), it stays in place and the team receives a Gaussian reward. The TMDP finishes after 5 timesteps or if the team gets the reward (whichever happens first). We ran the simulation multiple times with different seeds. Figure 0(a) shows the average test return (without the added noise) of LTQL, HystQ, DistQ and Qmix. (Footnote 4: The average test return is the return following a greedy policy averaged over multiple games.) As can be seen, LTQL is the only algorithm capable of learning the optimal team policy. In appendix 6.8 we specify the hyperparameters and include the learning curves of the $Q$-functions along with a discussion on the performance of each algorithm.
5.3 Cowboy bull game
In this experiment we use a more complex environment. The TMDP is a challenging predator-prey type game, in which 4 (homogeneous) cowboys try to catch a bull (see figure 1). The position of all players is a continuous variable (and hence the state space is continuous). The space is unbounded and the bull can move faster than the cowboys. The bull follows a fixed stochastic policy, which is handcrafted to mimic natural behavior and evade capture. Due to the unbounded space and the fact that the bull moves faster than the cowboys, it cannot be captured unless all agents develop a coordinated strategy (the bull can only be caught if the agents first surround it and then close in evenly). The task is episodic and ends after 75 timesteps or when the bull is caught. Each agent has 5 actions (the four moves plus stay). When the bull is caught a reward is obtained, and the team also receives a small penalty for every agent that moves. Note that due to the reward structure there is a very easily attainable Nash equilibrium, which is for every agent to stay still (since in this way they do not incur the penalties associated with movement). In this game, since all agents are homogeneous, only one $Q$-function is learned, whose input is the agent’s observation and whose outputs are the $Q$-values corresponding to the 5 possible actions. Figure 1 shows the test win percentage and figure 1 shows the average test return for LTQL, HystQ and Qmix. (Footnote 5: The test win percentage is the percentage of games, out of 50, in which the team succeeds in catching the bull following a greedy policy.) The best performing algorithm is LTQL. HystQ learns a policy that does catch the bull, although it fails at obtaining returns higher than zero. We believe that the poor performance of Qmix in this task is a consequence of its limited representation capacity due to its monotonic factoring model. As we mentioned in the introduction, we can test the learned policy on teams with a different number of agents; figure 1 shows the results.
The policy scores above for teams of all sizes bigger than . Note that the policy can be improved for any particular team size by further training if necessary. In the appendix we provide all hyperparameters and implementation details, we detail the bull’s policy and the observation function. All code, a pre-trained model and a video of the policy learned by LTQL are included as supplementary material.
In this article we have introduced theoretical groundwork for cooperative MARL. We also introduced LTQL, which has the desirable properties mentioned in the introduction. Furthermore, it does not impose constraints on the learned individual $Q$-functions and hence it can solve environments where previous algorithms that are considered to be state of the art, such as Qmix [Rashid et al.], fail. The algorithm fits in the centralized training and decentralized execution paradigm. It can also be implemented in a fully distributed manner in situations where all agents have access to each other’s observations and actions.
This paper introduces novel concepts and algorithms to MARL theory. We believe the material we present does not introduce any societal or ethical considerations worth mentioning in this section.
- The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
- The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI 1998 (746-752), pp. 2.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, pp. 2121–2159.
- Counterfactual multi-agent policy gradients. In AAAI Conference on Artificial Intelligence.
- Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings International Conference on Machine Learning, pp. 1146–1155.
- Multiagent planning with factored MDPs. In Advances in Neural Information Processing Systems, pp. 1523–1530.
- Coordinated reinforcement learning. In ICML, Vol. 2, pp. 227–234.
- Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, Sao Paulo, Brazil, pp. 66–83.
- Adam: a method for stochastic optimization. arXiv:1412.6980.
- Sparse cooperative Q-learning. In Proceedings International Conference on Machine Learning, pp. 61.
- An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proc. International Conference on Machine Learning (ICML), Palo Alto, USA, pp. 535–542.
- Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8 (3-4), pp. 293–321.
- Value-function reinforcement learning in Markov games. Cognitive Systems Research 2 (1), pp. 55–66.
- Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, USA, pp. 64–69.
- Playing Atari with deep reinforcement learning. arXiv:1312.5602.
- A concise introduction to decentralized POMDPs. Vol. 1, Springer.
- Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings International Conference on Machine Learning, pp. 2681–2690.
- Markov decision processes: discrete stochastic dynamic programming. Wiley, NY.
- Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv:2003.08839.
- QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4295–4304.
- QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 5887–5896.
- Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings International Conference on Autonomous Agents and Multiagent Systems, pp. 2085–2087.
- Reinforcement learning: an introduction. MIT Press.
- Multiagent cooperation and competition with deep reinforcement learning. PloS one 12 (4).
- Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings International Conference on Machine Learning, pp. 330–337.
- Extending Q-learning to general adaptive multi-agent systems. In Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, pp. 871–878.
- Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, pp. 1603–1610.
- Q-learning. Machine Learning 8 (3-4), pp. 279–292.
- Multi-agent reinforcement learning: a selective overview of theories and algorithms. arXiv:1911.10635.
6.1 Proof of Lemma 1
We assume that if for two states and we have , then .
We start by rewriting equation (2b) for convenience:
Now assume that we have some team optimal policy . We define as follows:
In simple terms is the -value if agent takes action while the rest of the agents act optimally (note that this is not the same as ). Due to assumption 1, can be written as a function of the observations. Therefore, we define as:
Note that by construction we get:
Note that (18), (20) and (23) satisfy equations (4a), (4b) and (4c), respectively. However, functions are defined on a per agent basis (which means that there are such functions) while functions depend on the type (and hence there are such functions). Therefore, we still have to prove that functions corresponding to agents of the same type are equal. From assumption 2, it follows that choosing in (3) to be we get:
which completes the proof.
6.2 Proof of remark 1
Consider the matrix game with two homogeneous agents, each of which has two actions () and the following reward structure:
For this case , , and are given by:
6.3 Proof of Theorem 1
We start by defining the following auxiliary constants and operators:
where is the maximum mean reward, is the minimum mean reward and is the minimum probability assigned to any by the discrete distribution .
Repeated application of operator to any converges to set with probability one. The mean rate of convergence is exponential with Lipschitz constant .
See appendix 6.4.
Repeated application of operator to any converges to set with probability one. The mean rate of convergence is exponential with Lipschitz constant .
See appendix 6.5.
6.4 Proof of Lemma 3
We start defining , where . The first part of the proof consists in showing that any sequence of the form is equal to where and is the number of times that operator is applied in the aforementioned sequence. Applying operator to we get:
Therefore, for any . Applying operator we get:
where to simplify notation we defined . Further application of we get: