1 Introduction
1.1 Background and Related Work
1.1.1 Reinforcement Learning
Reinforcement learning (RL) methods seek to learn a policy (a function that takes an observation and returns an action) that achieves the maximum expected total reward within a given environment. Single-agent environments are traditionally modeled as a Markov Decision Process (“MDP”) or a partially-observable MDP (“POMDP”) Boutilier [1996]. An MDP models decision-making as a process where an agent repeatedly takes a single action, receives a reward, and transitions to a new state (receiving complete knowledge of the state). A POMDP extends this to include environments where the agent may not be able to observe the whole state.
MDPs can be formally defined as:
Definition 1.
A Markov Decision Process (MDP) is a 4-tuple $\langle S, A, T, R \rangle$, where:

$S$ is a finite set of states.

$A$ is the (finite) set of actions available for the agent to take.

$T : S \times A \times S \to [0,1]$ is the transition function. It has the property that $\forall s \in S, \forall a \in A$, $\sum_{s' \in S} T(s, a, s') = 1$.

$R : S \times A \times S \to \mathbb{R}$ is the reward function.
According to this definition, the agent is in one of several possible states, given by $S$, and performs one of the possible actions, given by $A$. When the agent is in state $s \in S$ and takes action $a \in A$, the transition function $T(s, a, s')$ gives the probability that the agent will transition into state $s' \in S$ (i.e., if $s_t$ and $a_t$ denote the state and action, respectively, at time $t$, $T(s, a, s') = \Pr[s_{t+1} = s' \mid s_t = s, a_t = a]$), while the reward function $R$ specifies the agent’s reward. Classical reinforcement learning aims to devise a strategy for the agent to take that maximizes its expected reward.
Definition 2.
A Partially-Observable Markov Decision Process (POMDP) is a 6-tuple $\langle S, A, T, R, \Omega, O \rangle$, where:

$S$ is the (finite) set of states.

$A$ is the set of actions available to the agent.

$T : S \times A \times S \to [0,1]$ is the transition function. It has the same property as in a standard MDP, i.e., $\forall s \in S, \forall a \in A, \sum_{s' \in S} T(s, a, s') = 1$.

$R : S \times A \times S \to \mathbb{R}$ is the reward function.

$\Omega$ is the set of possible observations the agent may perceive.

$O : A \times S \times \Omega \to [0,1]$ is the observation function. It has the property that $\forall a \in A, \forall s \in S, \sum_{\omega \in \Omega} O(a, s, \omega) = 1$.
The view of the POMDP is similar to that of the standard MDP: at each time step $t$, the agent begins in state $s_t$ and selects an action $a_t$ to perform; $T(s, a, s')$ specifies the probability of transitioning into state $s'$ from state $s$ when selecting action $a$; and $R(s, a, s')$ is the reward the agent receives when selecting action $a$ in state $s$ and transitioning into state $s'$. However, the agent does not see the full state $s_t$ at time $t$, but only sees an observation $\omega_t \in \Omega$. The observation function $O(a, s', \omega)$ specifies the probability that the agent sees observation $\omega$ after selecting action $a$ and transitioning into the true state $s'$. At any point in time, the agent does not know the true state, but instead can only compute a probability distribution on possible states, called the belief distribution, based on the observations it has seen and the actions it has performed.
A policy $\pi$ gives a strategy for the agent: given an observation $\omega$, the policy gives a probability distribution $\pi(\cdot \mid \omega)$ on actions (and thus, it has the property that $\sum_{a \in A} \pi(a \mid \omega) = 1$) for the agent to take. The goal of reinforcement learning is to learn an optimal (i.e., one which maximizes expected reward) policy, denoted $\pi^*$.
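The belief distribution mentioned above can be made concrete with a small sketch. It applies the standard Bayes update: the new belief in a state is proportional to the probability of the observation in that state times the predicted probability of reaching it. The door scenario, the dictionary encodings of the transition and observation functions, and every name in the code are illustrative assumptions, not part of the model definitions.

```python
def update_belief(belief, action, observation, T, O):
    """Bayes update of a belief distribution after acting and observing.

    belief: dict state -> probability
    T: dict (s, a, s_next) -> transition probability
    O: dict (a, s_next, obs) -> observation probability
    """
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        # Probability of landing in s_next, averaged over the current belief.
        predicted = sum(T[(s, action, s_next)] * belief[s] for s in states)
        unnormalized[s_next] = O[(action, s_next, observation)] * predicted
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical two-state example: a door is "open" or "closed"; the single
# action "listen" leaves the state unchanged and yields a noisy reading.
STATES = ["open", "closed"]
T = {(s, "listen", s2): (1.0 if s == s2 else 0.0) for s in STATES for s2 in STATES}
O = {
    ("listen", "open", "hear_open"): 0.8,
    ("listen", "open", "hear_closed"): 0.2,
    ("listen", "closed", "hear_open"): 0.3,
    ("listen", "closed", "hear_closed"): 0.7,
}

prior = {"open": 0.5, "closed": 0.5}
posterior = update_belief(prior, "listen", "hear_open", T, O)
```

After hearing "hear_open", the belief shifts toward the door being open, although the agent still never observes the true state directly.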
1.1.2 Multi-Agent Reinforcement Learning
In multi-agent reinforcement learning, there are several models similar to MDPs (or POMDPs) that account for multiple agents. Perhaps the simplest model for multi-agent learning is the aptly named Multi-agent MDP (or “MMDP”) introduced by Boutilier [1996].
Definition 3.
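A Multi-agent Markov Decision Process (MMDP) is a tuple $\langle S, N, \{A_i\}, T, R \rangle$, where:

$S$ is a finite set of states.

$N$ is the number of agents. Then, the set of agents is $[N] = \{1, \dots, N\}$.

$\{A_i\}_{i \in [N]}$ is a family of sets where $A_i$ is the set of actions available for agent $i$ to take.

$T : S \times A_1 \times \dots \times A_N \times S \to [0,1]$ is the transition function. It has the property that $\forall s \in S, \forall a \in A_1 \times \dots \times A_N$, $\sum_{s' \in S} T(s, a, s') = 1$.

$R : S \times A_1 \times \dots \times A_N \times S \to \mathbb{R}$ is the reward function.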
The work of Boutilier [1996] studied optimal planning and learning in coordinated decision-making processes. In such settings, agents work cooperatively toward a single shared goal, so it was natural to have a single reward function that specifies the shared reward. However, there are many other natural RL settings in which it makes sense for agents to have separate rewards.
The Stochastic Games model (sometimes called Markov Games) was introduced by Shapley [1953] in the study of game theory. This model differs from MMDPs in that it has a separate reward function for each agent.
Definition 4 (Stochastic Game).
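A Stochastic Game is a tuple $\langle S, N, \{A_i\}, T, \{R_i\} \rangle$, where:

$S$ is the set of possible states.

$N$ is the number of agents. The set of agents is $[N] = \{1, \dots, N\}$.

$A_i$ is the set of possible actions for agent $i$.

$T : S \times A_1 \times \dots \times A_N \times S \to [0,1]$ is the transition function. It has the property that $\forall s \in S, \forall a \in A_1 \times \dots \times A_N$, $\sum_{s' \in S} T(s, a, s') = 1$.

$R_i : S \times A_1 \times \dots \times A_N \times S \to \mathbb{R}$ is the reward function for agent $i$.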
Both MMDPs and Stochastic Games assume that the current state is fully observable. In this sense, they can be seen as multi-agent generalizations of standard MDPs. To model settings with partial observability (as with POMDPs), we can instead utilize a decentralized POMDP, or “Dec-POMDP”, introduced in Bernstein et al. [2002] as a partially-observable generalization of MMDPs. For situations in which agents each have their own reward function, we can instead define a partially-observable Stochastic Game, or POSG (see, e.g., Lowe et al. [2017]). We present the definition of the latter below.
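Definition 5 (Partially-Observable Stochastic Game).
A Partially-Observable Stochastic Game (POSG) is a tuple $\langle S, N, \{A_i\}, T, \{R_i\}, \{\Omega_i\}, \{O_i\} \rangle$, where:

$S$ is the set of possible states.

$N$ is the number of agents. The set of agents is $[N] = \{1, \dots, N\}$.

$A_i$ is the set of possible actions for agent $i$.

$T : S \times A_1 \times \dots \times A_N \times S \to [0,1]$ is the transition function. It has the property that $\forall s \in S, \forall a \in A_1 \times \dots \times A_N$, $\sum_{s' \in S} T(s, a, s') = 1$.

$R_i : S \times A_1 \times \dots \times A_N \times S \to \mathbb{R}$ is the reward function for agent $i$.

$\Omega_i$ is the set of possible observations for agent $i$.

$O_i : A_i \times S \times \Omega_i \to [0,1]$ is the observation function. $\sum_{\omega \in \Omega_i} O_i(a, s, \omega) = 1$ for all $a \in A_i$ and $s \in S$.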
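The multi-agent definitions above share one structural requirement: for every state and every joint action, the transition probabilities over next states must sum to one. The following sketch checks that property for a toy two-agent game and attaches one reward function per agent, as in a Stochastic Game. The game itself, the dictionary encoding of the transition function, and the helper name `check_transition_function` are hypothetical choices made for illustration.

```python
from itertools import product


def check_transition_function(states, action_sets, T, tol=1e-9):
    """Return True if T(s, a, .) sums to 1 for every state s and joint action a.

    states: list of states
    action_sets: list of per-agent action lists (agent i uses action_sets[i])
    T: dict (s, joint_action, s_next) -> probability; missing keys count as 0
    """
    for s in states:
        for joint_action in product(*action_sets):
            total = sum(T.get((s, joint_action, s2), 0.0) for s2 in states)
            if abs(total - 1.0) > tol:
                return False
    return True


# Hypothetical two-agent game: the state flips only when both agents "push".
states = ["left", "right"]
action_sets = [["push", "wait"], ["push", "wait"]]
other = {"left": "right", "right": "left"}
T = {}
for s in states:
    for a in product(*action_sets):
        s_next = other[s] if a == ("push", "push") else s
        T[(s, a, s_next)] = 1.0

# One reward function per agent, as in a Stochastic Game
# (an MMDP would instead share a single reward function).
rewards = [
    lambda s, a, s2: 1.0 if s2 == "right" else 0.0,  # agent 0 prefers "right"
    lambda s, a, s2: 1.0 if s2 == "left" else 0.0,   # agent 1 prefers "left"
]

valid = check_transition_function(states, action_sets, T)
```

Extending the sketch to a POSG would add, for each agent, an observation set and an observation function subject to the analogous sum-to-one check.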
1.1.3 Nonstationarity
In single-agent RL, a nonstationary environment is one that changes during learning [Choi et al., 2000, Chades et al., 2012]. In multi-agent reinforcement learning (MARL), the term typically refers to nonstationarity arising from the behaviors of other agents (which must be learned as parts of the environment) changing as they learn [Matignon et al., 2012]. This often results in agents having to iteratively relearn each other’s policies to achieve a globally optimal set of policies.
In single-agent learning, Choi et al. [2000] considered nonstationary systems such as elevator control, where the environmental conditions might change over the course of a day or week. They introduced hidden-mode MDPs (HMMDPs) to model nonstationary environments; however, this model is not well suited to the nonstationarity of multi-agent learning, since it assumes the environment will only take on a small number of possible “modes” (a single agent’s view of the environment depends on all other agents’ policies, so the number of possible modes would be massive) and that the mode does not change often (as agents learn, their strategies may change constantly and frequently). The HMMDP model can be seen as a more restricted kind of POMDP, since every hidden-mode MDP has an equivalent POMDP. Later work by Chades et al. [2012] introduced another model, called the Mixed Observability MDP (MOMDP), which can also be seen as a special kind of POMDP; they showed that HMMDPs and MOMDPs are both PSPACE-complete to solve.
In MARL, the nonstationarity problem caused by having multiple agents learning simultaneously was further studied in the context of deep MARL by Omidshafiei et al. [2017], who formally defined what it means for an individual agent’s process to be stationary.
1.2 Our Contributions
While models like POSGs describe the environment in which agents act, they do not describe the process by which these agents learn how they will act. We propose a new model, the Multi-Agent Informational Learning Process (MAILP), which does not attempt to model the games or environments in which agents will act, but rather attempts to model how algorithms gradually learn policies over time. We do this by considering learning as a process of acquiring information about the environment and other agents. The Multi-Agent Informational Learning Process attempts to provide a way of modeling how much information agents have learned about different aspects of their environment (and the other agents in the environment), so as to provide a new perspective with which to view how quickly different algorithms are able to learn policies. Our model attempts to capture the problem of nonstationarity by considering an agent as losing information when another agent’s behavior changes due to learning a more effective policy.
2 Model Overview
2.1 Coordination Tensor Sets
Here, we present the notion of a coordination tensor set $\mathcal{C}$ of a game. The coordination tensor set is a collection of values that describe how much certain agents must coordinate their strategies in an optimal policy.
For convenience we define $[N]_{-i} = [N] \setminus \{i\}$. Formally, we define for an agent $i \in [N]$ and a group of agents $G \subseteq [N]_{-i}$ the value $\mathcal{C}_{i,G}$ as the “coordination value” for $i$ with respect to $G$. This value describes how much an optimal policy for agent $i$ depends on the collective behavior of the agents in $G$. In our model, $\mathcal{C}_{i,G}$ will correspond to the amount of information agent $i$ must learn about the behavior of $G$ as a group. We normalize the total amount of information any single agent must learn to $1$ so that for every agent $i$, $\sum_{G \subseteq [N]_{-i}} \mathcal{C}_{i,G} \le 1$. The inequality here is to account for the information agent $i$ must learn about the stationary portions of its environment (e.g., states/observations, and its reward function); we define $\mathcal{C}_{i,\mathrm{env}} = 1 - \sum_{G \subseteq [N]_{-i}} \mathcal{C}_{i,G}$ to be the amount of information that agent $i$ must learn that does not depend on the behavior of any other agent or group of agents. The collection of all coordination values $\mathcal{C}_{i,G}$ is the coordination tensor set.
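As a rough illustration of the bookkeeping described above, the sketch below stores coordination values in a dictionary keyed by an agent and a group of other agents, and derives the environment share as one minus the sum of that agent's coordination values. The three-agent numbers are made up, the helper name `environment_coordination` is hypothetical, and treating coordination values as non-negative is an assumption of the sketch.

```python
def environment_coordination(coordination, agent):
    """Information the agent must learn that depends on no other agent.

    coordination: dict (agent, frozenset_of_other_agents) -> coordination value
    Returns 1 minus the sum of the agent's coordination values.
    """
    total = 0.0
    for (i, group), value in coordination.items():
        if i != agent:
            continue
        if agent in group:
            raise ValueError("a group must not contain the agent itself")
        if value < 0:  # assumption of this sketch: values are non-negative
            raise ValueError("coordination values must be non-negative")
        total += value
    if total > 1.0 + 1e-12:
        raise ValueError("coordination values for an agent must sum to at most 1")
    return 1.0 - total


# Hypothetical 3-agent game (agents 0, 1, 2); all numbers are made up.
coordination = {
    (0, frozenset({1})): 0.25,     # agent 0 must model agent 1 alone
    (0, frozenset({1, 2})): 0.25,  # ... and agents 1 and 2 as a group
    (1, frozenset({0})): 0.5,
    (2, frozenset({0, 1})): 0.125,
}

env_values = {i: environment_coordination(coordination, i) for i in range(3)}
```

In this toy instance agent 2 depends only weakly on the others, so most of what it must learn concerns the stationary environment.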
2.1.1 Modeling Learning through Sequential Information Updates
Our model describes the process of learning via an iterative update process that describes how much information is learned by an agent in each (discrete) time step of the learning process. First, we define $\mathcal{I}_i(t)$ to be the amount of information agent $i$ has at time $t$, so that $\mathcal{I}_i(t) = 1$ corresponds to the situation where agent $i$ has learned all the information necessary to have learned an optimal policy.
We further break down the information each agent has in terms of how much information it has about each group of agents as well as how much information it has about the environment. Formally, we define $\mathcal{I}_{i,G}(t)$ to be the amount of information agent $i$ has about the collective behavior of $G$ (relative to the total information $i$ needs to learn in order to have learned an optimal policy); when $\mathcal{I}_{i,G}(t)$ equals the coordination value $\mathcal{C}_{i,G}$, $i$ has learned enough about the joint behavior of the agents in $G$ for an optimal policy. We similarly define $\mathcal{I}_{i,\mathrm{env}}(t)$ to be the amount of information agent $i$ has learned about the stationary portions of its environment.
(1) $\mathcal{I}_i(t) = \mathcal{I}_{i,\mathrm{env}}(t) + \sum_{G} \mathcal{I}_{i,G}(t)$
For any agent $i$ and any $\chi$ (either a group $G$ of other agents or the environment), define $\Delta^\uparrow \mathcal{I}_{i,\chi}(t)$ to be the amount of new information agent $i$ learns about $\chi$ at time $t$. In order to give an explicit formula for $\Delta^\uparrow \mathcal{I}_{i,\chi}(t)$, we note that the amount of new information gained should be a function of the amount of information left to learn about $\chi$. We denote this function by $\Lambda$ in (2).
(2) $\Delta^\uparrow \mathcal{I}_{i,\chi}(t) = \mathcal{K}_{i,\chi} \, \Lambda\big(\mathcal{C}_{i,\chi} - \mathcal{I}_{i,\chi}(t-1)\big)$
In (2), the value $\mathcal{K}_{i,\chi}$ (which we refer to as the learning rate coefficient) allows for scaling $\Lambda$ separately for each pair $(i, \chi)$ to account for differences in learning rate. Aside from scaling by $\mathcal{K}_{i,\chi}$, $\Lambda$ should be the same for all agents, because the shape of this “learning curve” will depend on factors that are independent of the agents involved, such as properties of the learning algorithm itself; across different agents, only the rate of learning (i.e., the “steepness” of this curve) may change, which is accounted for by the scaling factor $\mathcal{K}_{i,\chi}$.
Next, consider a situation where $\mathcal{K}_{i,\chi} = 1$ (a very optimistic learning rate, for the sake of example), and $\Lambda(x) = x$. In this case $\Delta^\uparrow \mathcal{I}_{i,\chi}(t) = \mathcal{C}_{i,\chi} - \mathcal{I}_{i,\chi}(t-1)$; that is, agent $i$ learns all the remaining unlearned information about $\chi$. Learning everything in a single time step is the best any algorithm can hope for. Thus, we restrict our interest to functions where $\Lambda(x) \le x$ for all $x$ (in practice, $\Lambda(x)$ will likely be much smaller).
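The gain rule described above (a learning rate coefficient times a shape function of the information left to learn) can be sketched in a few lines. The function names, the restriction of the rate to [0, 1], and the halving shape are illustrative assumptions; only the requirement that the shape function never returns more than its input comes from the text.

```python
def information_gain(remaining, rate, shape):
    """New information learned in one step: rate * shape(remaining).

    remaining: information still to learn about the target (>= 0)
    rate: learning rate coefficient, assumed here to lie in [0, 1]
    shape: learning shape function, required to satisfy shape(x) <= x
    """
    gain = rate * shape(remaining)
    if gain > remaining + 1e-12:
        raise ValueError("cannot learn more than the information that remains")
    return gain


def identity_shape(x):
    """Most optimistic admissible shape: learn everything that is left."""
    return x


def halving_shape(x):
    """Hypothetical slower shape: learn half of what is left."""
    return x / 2


# With rate 1 and the identity shape, one step closes the whole gap.
one_shot = information_gain(remaining=0.25, rate=1.0, shape=identity_shape)

# With the halving shape, the remaining gap shrinks geometrically.
remaining = 0.5
history = []
for _ in range(3):
    remaining -= information_gain(remaining, rate=1.0, shape=halving_shape)
    history.append(remaining)
```

The second loop shows why the shape function matters: with the same rate, a less optimistic shape leaves a gap that only closes asymptotically.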
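We will find it useful to refer to the total information gain of agent $i$ at time $t$, which we shall write as $\Delta^\uparrow \mathcal{I}_i(t)$.
(3) $\Delta^\uparrow \mathcal{I}_i(t) = \Delta^\uparrow \mathcal{I}_{i,\mathrm{env}}(t) + \sum_{G} \Delta^\uparrow \mathcal{I}_{i,G}(t)$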
In the following section, we will refer to a quantity denoted , and similarly to . This quantity represents the amount of information the agents of a group jointly have about each other.
(4)  
(5) 
We define and in equations (4) and (5) respectively. This ensures the desirable property that (and similarly for ). As a further example, the “group information” of a two agent group is
2.1.2 Modeling Nonstationarity Via Information Loss
We now address how to model the nonstationarity issue in our informational model. From the view of an individual agent $i$, as other agents learn new knowledge, their behavior changes, thus rendering invalid some of the “knowledge” (or information) agent $i$ had about those agents. We model this phenomenon as information loss. Just as we defined, for any agent $i$ and group $G$, the information gain at time $t$ as $\Delta^\uparrow \mathcal{I}_{i,G}(t)$, we define the information loss as $\Delta^\downarrow \mathcal{I}_{i,G}(t)$. Note, however, that we do not define $\Delta^\downarrow \mathcal{I}_{i,\mathrm{env}}(t)$, since information about the environment will never be invalidated (the environment is stationary), and so there is no environmental information loss.
(6) 
Similar to information gain, we can write the total amount of information lost by agent $i$ at time step $t$ as $\Delta^\downarrow \mathcal{I}_i(t) = \sum_{G} \Delta^\downarrow \mathcal{I}_{i,G}(t)$.
2.1.3 The Full Update Process
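With $\Delta^\uparrow \mathcal{I}_i(t)$ and $\Delta^\downarrow \mathcal{I}_i(t)$ defined, we can now fully specify the information update process. Specifically, we define $\mathcal{I}_i(t)$ as a recurrence relation below.
$\mathcal{I}_i(t) = \mathcal{I}_i(t-1) + \Delta^\uparrow \mathcal{I}_i(t) - \Delta^\downarrow \mathcal{I}_i(t)$
The process unfolds as follows. Agent $i$ begins with some small amount of information $\mathcal{I}_i(0)$. At each time step $t$, agent $i$ gains information equal to $\Delta^\uparrow \mathcal{I}_i(t)$ and loses information according to $\Delta^\downarrow \mathcal{I}_i(t)$. We would like to know at which time step all agents have learned (nearly) all the information they need. We allow for some small tolerance $\epsilon > 0$, and we wish to find the first time step $t$ for which $\mathcal{I}_i(t) \ge 1 - \epsilon$ for all agents $i$. For convenience, we denote by $\mathcal{I}_i^{-1}(x)$ the minimum time step $t$ for which $\mathcal{I}_i(t) \ge x$.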
Our model is very general, as it depends on many parameters that specify properties of both the game to be learned and the learning algorithm used to learn the policies of agents. An instance of this model (which we call a Multi-Agent Informational Learning Process) can be fully specified by all of the parameters described above, which we summarize in Definition 6 below.
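Definition 6 (MAILP).
A Multi-Agent Informational Learning Process (MAILP) is fully specified by:

A number of agents $N$.

The coordination tensor set $\mathcal{C} = \{\mathcal{C}_{i,G}\}$, where $i \in [N]$ and $G \subseteq [N] \setminus \{i\}$, with the property that for all agents $i$, we have $\sum_{G} \mathcal{C}_{i,G} \le 1$. The dependence on the environment is then defined for agent $i$ as $\mathcal{C}_{i,\mathrm{env}} = 1 - \sum_{G} \mathcal{C}_{i,G}$.

The collection of learning rates $\mathcal{K} = \{\mathcal{K}_{i,\chi}\}$, where $i \in [N]$ and $\chi$ is either a group $G \subseteq [N] \setminus \{i\}$ or the environment.

The learning shape function $\Lambda$, which should be an increasing function such that $\Lambda(x) \le x$ for all $x$.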
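To show how the ingredients of Definition 6 fit together, the sketch below iterates a gain-minus-loss update until every agent's total information is within a tolerance of 1. It is a toy illustration under stated assumptions rather than the model's actual update: agents start with zero information, a group's gain is taken to be the sum of its members' total gains, and the loss rule `proportional_loss` is a hypothetical stand-in supplied by the caller, not the loss formula of the model. All function and variable names are likewise illustrative.

```python
ENV = "env"  # marker for the stationary part of the environment


def simulate_mailp(num_agents, coordination, rates, shape, loss_fn,
                   tolerance=0.01, max_steps=1000):
    """Return the first step at which every agent's total information
    reaches 1 - tolerance, or None if that never happens within max_steps.

    coordination: dict (agent, group) -> value, group a frozenset of agents
    rates: dict (agent, target) -> learning rate, target a group or ENV
    shape: learning shape function with shape(x) <= x
    loss_fn(info, group_gain) -> information lost about a group this step
    """
    # Environment share: whatever the coordination values leave over.
    needed = dict(coordination)
    for i in range(num_agents):
        group_total = sum(v for (a, _), v in coordination.items() if a == i)
        needed[(i, ENV)] = 1.0 - group_total
    info = {key: 0.0 for key in needed}  # assumed start: no information

    for step in range(1, max_steps + 1):
        gains = {key: rates[key] * shape(needed[key] - info[key])
                 for key in needed}
        total_gain = [sum(g for (a, _), g in gains.items() if a == i)
                      for i in range(num_agents)]
        for (i, target), gain in gains.items():
            loss = 0.0
            if target != ENV:  # environment information is never lost
                group_gain = sum(total_gain[j] for j in target)
                loss = loss_fn(info[(i, target)], group_gain)
            info[(i, target)] += gain - loss
        totals = [sum(v for (a, _), v in info.items() if a == i)
                  for i in range(num_agents)]
        if all(total >= 1.0 - tolerance for total in totals):
            return step
    return None


def halving_shape(x):
    """Hypothetical shape function: learn half of what is left."""
    return x / 2


def no_loss(info, group_gain):
    """Stationary baseline: nothing learned is ever invalidated."""
    return 0.0


def proportional_loss(info, group_gain):
    """Hypothetical rule: lose a share of what was known about the group,
    in proportion to how much the group itself just learned."""
    return info * min(1.0, group_gain)


# Two symmetric agents, each needing half its information about the other.
coordination = {(0, frozenset({1})): 0.5, (1, frozenset({0})): 0.5}
rates = {(i, t): 1.0 for i in range(2) for t in (frozenset({1 - i}), ENV)}

steps_stationary = simulate_mailp(2, coordination, rates, halving_shape, no_loss)
steps_nonstationary = simulate_mailp(2, coordination, rates, halving_shape,
                                     proportional_loss)
```

Comparing the two runs gives a feel for the quantity the model is meant to expose: under the assumed loss rule, the same learners need more steps to reach the tolerance than they do when nothing is invalidated.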
Justin Terry was supported by the QinetiQ Fundamental Machine Learning Fellowship.
References
Bernstein et al. [2002]. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27 (4), pp. 819–840. https://doi.org/10.1287/moor.27.4.819.297
Boutilier [1996]. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, pp. 195–210.
Chades et al. [2012]. MOMDPs: a solution for modelling adaptive management problems. In AAAI Conference on Artificial Intelligence.
Choi et al. [2000]. An environment model for nonstationary reinforcement learning. In Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K. Müller (Eds.), pp. 987–993.
Lowe et al. [2017]. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390.
Matignon et al. [2012]. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review 27 (1), pp. 1–31.
Omidshafiei et al. [2017]. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2681–2690.
Shapley [1953]. Stochastic games. Proceedings of the National Academy of Sciences 39 (10), pp. 1095–1100. ISSN 0027-8424.