# Information State Embedding in Partially Observable Cooperative Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) under partial observability has long been considered challenging, primarily due to the requirement for each agent to maintain a belief over all other agents' local histories – a domain that generally grows exponentially over time. In this work, we investigate a partially observable MARL problem in which agents are cooperative. To enable the development of tractable algorithms, we introduce the concept of an information state embedding that serves to compress agents' histories. We quantify how the compression error influences the resulting value functions for decentralized control. Furthermore, we propose three natural embeddings, based on finite-memory truncation, principal component analysis, and recurrent neural networks. The outputs of these embeddings are then used as information states, and can be fed into any MARL algorithm. The proposed embed-then-learn pipeline opens the black box of existing MARL algorithms, allowing us to establish some theoretical guarantees (error bounds of value functions) while still achieving competitive performance with many end-to-end approaches.


## Code Repositories

### marl-embedding

Information State Embedding in Partially Observable MARL https://arxiv.org/abs/2004.01098


## 1 Introduction

Multi-agent reinforcement learning (MARL) is a prominent and practical paradigm for modeling multi-agent sequential decision making under uncertainty, with applications in a wide range of domains including robotics [duan2012multi], cyber-physical systems [wang2016towards], and finance [lee2007multiagent]. Many practical problems require agents to make decisions based on only a partial view of the environment and the (private) information of other agents, e.g., intention of pedestrians in autonomous driving settings, or location of surveillance targets in military drone swarm applications. This partial information generally precludes agents from making optimal decisions due to the requirement that each agent maintains a belief over all other agents’ local histories – a domain that, in general, grows exponentially in time [nayyar2014common].

We consider a partially observable setting, but restrict attention to problems in which the agents are cooperative, that is, they share the same objective. (This is in contrast with the more challenging general case of non-cooperative agents, in which agents may act strategically to achieve their individual goals.) Even under this simpler (cooperative) setting, the primary challenge still exists: due to the lack of explicit communication, agents possess noisy and asymmetric information about the environment, yet their rewards depend on the joint actions of all agents. As in the general (non-cooperative) setting, this informational coupling requires agents to maintain a growing amount of information.

Many approaches have been proposed to address this challenge; we roughly categorize them into two classes: concurrent learning and centralized learning. In concurrent learning approaches, agents learn and update their own control policies simultaneously. However, since the reward and state processes are coupled, the environment becomes non-stationary from the perspective of each agent, and hence concurrent solutions do not converge in general [gupta2017cooperative]. On the other hand, centralized learning approaches, as the name suggests, reformulate the problem from the perspective of a virtual coordinator [nayyar2013decentralized]. Despite its popularity [dibangoye2016optimally, dibangoye2018learning], the centralized approach suffers from high computational complexity: a centralized algorithm needs to assign an action to each possible history sequence of the agents, and the cardinality of such sequences grows exponentially over time. In fact, decentralized partially observable Markov decision processes (Dec-POMDPs), an instance of the general decentralized control model, are known to be NEXP-complete [bernstein2002complexity].

In this paper, we propose to address the computational issues in the centralized approach by extracting a summary, termed an information state embedding, from the history space, then learning control policies in the compressed embedding space that possess some quantifiable performance guarantee. This procedure, which we term the embed-then-learn pipeline, is depicted in Figure 1.

In the first stage, an embedding scheme with a quantifiable compression error is extracted from the history space. (The exact nature of this extraction process is instance-specific: for some instances of embeddings, the embedding of a history sequence can be calculated a priori, while for others, the embedding itself must be learned from data, as discussed in Section 4.) Our metric of compression error, to be defined in Section 3, favors an embedding with higher predictive ability. In the second stage, we learn a policy in this compressed embedding space. We prove how the embedding error propagates over time, and our theoretical results provide an overall performance bound for the policy in terms of the compression error. In this paradigm, therefore, the cooperative MARL problem reduces to finding an information state embedding with a small compression error.

We also introduce three empirical instances of information state embeddings, and demonstrate how to extract an embedding from data. Although end-to-end learning (in deep learning, end-to-end learning generally means training one single neural network that takes the raw data as input and directly outputs the ultimate goal task) generally leads to state-of-the-art empirical performance [hausknecht2015deep], our approach breaks a partially observable MARL problem into two stages: embedding followed by learning. It provides a new angle that still enjoys fair performance, while allowing for more theoretical justification. This approach opens the black box of end-to-end approaches, and takes an initial step towards understanding their great empirical successes.

Related Work. In the seminal work of [witsenhausen1973standard], it was proved that a decentralized problem can be converted to a centralized standard form and hence solved via a dynamic program. Following this paradigm, the common information approach [nayyar2013decentralized] formulates the decentralized control problem as a single-agent POMDP based on the common information of all the agents. A similar work [dibangoye2016optimally] transforms the Dec-POMDP into a continuous-state MDP, and again demonstrates that standard techniques from POMDPs are applicable. More recently, [tavafoghi2018unified] introduced a sufficient information approach that investigates the sufficient conditions under which an information state is optimal for decision making. However, this approach does not offer a constructive algorithm for the sufficient information state. In fact, it is generally difficult, in multi-agent settings, to determine whether an information state more compact than the whole history even exists. Our work extends this approach, in that we learn or extract a compact information state from data by approximately satisfying these sufficient conditions for optimal decision making.

Learning a state representation that is more compact than an explicit history is also of great interest even in single agent partially observable systems. When the underlying system has an inherent state, as in POMDPs, it is helpful to directly learn a generative model of the system [ma2017pflstm, moreno2018neural]. When an inherent system state is not available, it is still possible to learn an internal state representation that captures the system dynamics. Two well-known approaches are predictive state representation (PSR) [littman2002predictive, downey2017predictive] and causal state representation learning [zhang2019learning]. A recent work [subramanian2019approximate] points out that PSR is not sufficient for RL problems, and proposes to extend PSR by introducing a set of sufficient conditions for performance evaluation. Drawing a comparison to the present paper, we generalize their analysis to the multi-agent setting, where our work can be regarded as learning a state representation in a decentralized partially observable system.

There is also no lack of empirical/heuristic approaches for MARL under partial observability. In concurrent learning, a learning scheme is proposed in [banerjee2012sample] where agents take turns learning the optimal response to the others' policies. The authors in [gupta2017cooperative] extend single-agent RL algorithms to the multi-agent setting and empirically evaluate their performance. Another empirical work [omidshafiei2017deep] studies multi-task MARL and introduces a multi-agent extension of experience replay [mnih2015human]. A centralized learning algorithm, termed oSARSA, is proposed in [dibangoye2018learning] to solve a centralized continuous-state MDP formulated from the MARL problem. For a more detailed survey on MARL, please refer to [zhang2019multi]. Compared with these more end-to-end solutions, our embed-then-learn pipeline enables more theoretical analysis.

Contribution. We summarize our contributions as follows: 1) We address the computational issue in partially observable MARL, by compressing each agent’s local history to an information state embedding that lies in a fixed domain. 2) Given the compression error of an embedding, we prove that the value function defined on the embedding space has bounded approximation error. 3) We propose three embeddings, and empirically evaluate their performance on some benchmark Dec-POMDP tasks.

Outline. The remainder of the paper is organized as follows: In Section 2, we present the mathematical model of our problem and introduce technical results that we rely on. In Section 3, we define the notion of an information state embedding and present our associated theoretical results. In Section 4, we propose three instances of embeddings and demonstrate how to learn such embeddings from data. Numerical evaluations of the proposed embeddings are presented and discussed in Section 5. Finally, we conclude and propose some future directions in Section 6.

Notation. We use uppercase letters (e.g., $X$) to denote random variables, lowercase letters (e.g., $x$) for their corresponding realizations, and calligraphic letters (e.g., $\mathcal{X}$) for the corresponding spaces where the random variables reside. Subscripts denote time indices, whereas superscripts index agents. The notation $a{:}b$ denotes the index range from $a$ to $b$; e.g., $X_{1:t}$ is a shorthand for $(X_1, \ldots, X_t)$, with a similar convention for superscripts. Finally, $\Delta(\mathcal{X})$ denotes the set of all probability measures over $\mathcal{X}$.

## 2 Model and Preliminaries

We adopt a similar model as the one developed in [tavafoghi2018unified] and for completeness we state it here. Consider a team of $N$ non-strategic, i.e., cooperative, agents indexed by $i \in \mathcal{N}$ over a finite horizon $T$. The state of the environment $X_t$, the private observation $Y^i_t$ of agent $i$, and the common observation $Z_t$ of all agents (this definition of common observations can be equivalently considered as the innovations defined in the partial history sharing setting [zhang2019online, nayyar2013decentralized], where each agent sends a subset of its local history to a shared memory according to a predefined sharing protocol) follow the dynamics:

$$X_{t+1} = f_t(X_t, A_t, W^x_t), \qquad Y^i_{t+1} = l^i_{t+1}(X_{t+1}, A_t, W^i_{t+1}), \qquad Z_{t+1} = l^c_{t+1}(X_{t+1}, A_t, W^c_{t+1}),$$
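To make the dynamics concrete, the following toy simulator instantiates the three update maps for two agents with a binary environment state; the specific functions and the `NOISE` level are illustrative assumptions, not part of the paper's model.

```python
import random

NOISE = 0.1  # probability that a private observation is flipped (illustrative)

def step(x, a1, a2, rng):
    """One transition of (X_t, A_t) -> (X_{t+1}, Y^1_{t+1}, Y^2_{t+1}, Z_{t+1})."""
    # X_{t+1} = f_t(X_t, A_t, W^x_t): the state flips when the agents disagree
    x_next = x ^ (a1 != a2)
    # Y^i_{t+1} = l^i_{t+1}(X_{t+1}, A_t, W^i_{t+1}): noisy private view of the state
    y1 = int(x_next ^ (rng.random() < NOISE))
    y2 = int(x_next ^ (rng.random() < NOISE))
    # Z_{t+1} = l^c_{t+1}(X_{t+1}, A_t, W^c_{t+1}): here, the joint action is public
    z = (a1, a2)
    return x_next, y1, y2, z

rng = random.Random(0)
x = 0
for t in range(3):
    x, y1, y2, z = step(x, rng.randint(0, 1), rng.randint(0, 1), rng)
```

The independent noise variables $W^x_t, W^i_{t+1}, W^c_{t+1}$ are realized here by independent draws from `rng`.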

where $W^x_t$, $W^i_{t+1}$, and $W^c_{t+1}$ are mutually independent random variables for all $t$. We also assume that the state, observation, and action spaces are all finite sets. The initial state $X_1$ follows a fixed distribution. Note that the Dec-POMDP model [oliehoek2016concise] considers the special case where the agents share no common information, i.e., $Z_t$ is empty.

Define common information $C_t$ as the aggregate of common observations from time $1$ to $t$. Similarly, let $P^i_t$ denote each agent's private information, assumed to be unknown to the other agents $-i$, where $\mathcal{P}^i_t$ is the space of agent $i$'s private information at time $t$. Define the joint history $H_t$ to be the collection of all agents' actions and observations up to and including time $t$. Accordingly, define each agent's local history $H^i_t$ to be agent $i$'s information at time $t$. Under the assumption of perfect recall, each agent's strategy maps its local history to a distribution over its actions, i.e., $g^i_t : \mathcal{H}^i_t \to \Delta(\mathcal{A}^i)$. We also refer to $g^i_t$ as agent $i$'s control policy at time $t$ and $g^i = (g^i_1, \ldots, g^i_T)$ as agent $i$'s control law.

We consider stochastic policies rather than deterministic policies throughout the paper. To understand the underlying reason, consider the single agent case, a standard POMDP. Even though deterministic policies are known to be optimal in POMDPs, their existence relies on knowing the belief state with certainty. In any learning approach, the belief state may not be known with certainty at any time. Any approximation to the true belief state may give rise to stochastic policies that outperform deterministic policies (an extreme case is when the action is based on only the most recent observation as in [singh1994learning]). Since a Dec-POMDP can be translated to an equivalent (centralized/single-agent) POMDP [nayyar2013decentralized], the requirement to consider stochastic policies when only an approximate belief state is available carries over to the multi-agent setting as well.

At each time $t$, all the agents receive a joint reward $R_t(X_t, A^{1:N}_t)$. The agents' joint objective is to maximize the expected total reward over the whole horizon:

$$\max_{g^{1:N}} \; \mathbb{E}^{g^{1:N}}\Big[\sum_{t=1}^{T} R_t(X_t, A^{1:N}_t)\Big],$$

where the expectation is taken with respect to the probability distribution on the states induced by the joint control law $g^{1:N}$.

It has been shown in [nayyar2013decentralized] that this decentralized control problem can be formulated as a centralized POMDP from the perspective of a virtual coordinator. The coordinator only has access to the common information $C_t$, not the agents' private information $P^{1:N}_t$. The centralized state corresponds to the environment state and the agents' private information in the decentralized problem. The centralized actions, termed prescriptions (to be defined in Definition 1), are mappings from each agent's private information to a local control action. The information state (also referred to as belief state) in the centralized POMDP is a belief over the environment state and all the agents' private information, conditional on the common information and joint strategies (see, e.g., Lemma 1 in [nayyar2013decentralized]). For optimal decision making, the coordinator needs to maintain a belief over all the agents' private information, the domain of which grows exponentially over time. Therefore, the centralized approach is generally intractable.

In the following, we review relevant definitions and structural results from the literature.

**Definition (Sufficient private and common information [tavafoghi2018unified]).** We say $S^i_t \in \mathcal{S}^i_t$, which is a function of the local history $H^i_t$, denoted by $S^i_t = \zeta^i_t(H^i_t)$, is sufficient private information if it satisfies the following conditions:

(a) Recursive update:

$$S^i_t = \phi^i_t(S^i_{t-1}, H^i_t \setminus H^i_{t-1}; g), \quad \forall t \in \mathcal{T} \setminus \{1\};$$

(b) Sufficient to predict future observations:

$$\mathbb{P}^{g}\{S^{1:N}_{t+1} = s^{1:N}_{t+1}, Z_{t+1} = z_{t+1} \mid P^{1:N}_t, C_t, A^{1:N}_t\} = \mathbb{P}^{g}\{S^{1:N}_{t+1} = s^{1:N}_{t+1}, Z_{t+1} = z_{t+1} \mid S^{1:N}_t, C_t, A^{1:N}_t\},$$

for all realizations of the conditioning variables;

(c) Sufficient to predict future reward: for any $t \in \mathcal{T}$,

$$\tilde{\mathbb{E}}^{g}[R_t \mid C_t, P^i_t, A^{1:N}_t] = \tilde{\mathbb{E}}^{g}[R_t \mid C_t, S^i_t, A^{1:N}_t];$$

(d) Sufficient to predict others' private information:

$$\tilde{\mathbb{P}}^{g}\{S^{-i}_t = s^{-i}_t \mid P^i_t, C_t\} = \tilde{\mathbb{P}}^{g}\{S^{-i}_t = s^{-i}_t \mid S^i_t, C_t\},$$

for all realizations of the conditioning variables.

Sufficient common information is defined as the conditional distribution over the environment state and the joint sufficient private information of all the agents, given the current common information $C_t$ and previous strategies $g_{1:t-1}$:

$$\Pi_t = \mathbb{P}^{g_{1:t-1}}(X_t, S^{1:N}_t \mid C_t).$$

Due to the above definition, an agent can make decisions using only its sufficient private information and common information. Such a decision making strategy is termed a sufficient information based (SIB) strategy for agent $i$ at time $t$.

Let $\sigma_t$ denote the mapping from common information to sufficient common information, i.e., $\Pi_t = \sigma_t(C_t)$. The authors of [tavafoghi2018unified] showed that the SIB belief can be updated recursively via Bayes' rule, using only the previous common belief $\Pi_{t-1}$ and the new common observation $Z_t$. That is, there exists a mapping $\psi^{\sigma_{t-1}}_t$, such that:

$$\Pi_t = \psi^{\sigma_{t-1}}_t(\Pi_{t-1}, Z_t). \tag{1}$$
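The recursive update in (1) is a Bayes filter over the (finite) centralized state; a minimal sketch, with a hypothetical transition kernel and common-observation likelihoods standing in for the model, is:

```python
def belief_update(belief, z, transition, obs_likelihood):
    """belief: dict state -> prob; returns the updated belief after observing z."""
    # Predict: push the current belief through the transition kernel.
    states = transition[next(iter(transition))]
    predicted = {s2: sum(p * transition[s][s2] for s, p in belief.items())
                 for s2 in states}
    # Correct: reweight by the likelihood of the new common observation z.
    unnorm = {s: p * obs_likelihood[s][z] for s, p in predicted.items()}
    total = sum(unnorm.values())
    return {s: p / total for s, p in unnorm.items()}

# Illustrative two-state model (not from the paper).
transition = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
obs_likelihood = {0: {"lo": 0.7, "hi": 0.3}, 1: {"lo": 0.2, "hi": 0.8}}
b = belief_update({0: 0.5, 1: 0.5}, "hi", transition, obs_likelihood)
```

The key structural point mirrors (1): only the previous belief and the newest common observation enter the update, never the full history.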

When the belief $\Pi_t$ is fixed, the SIB strategy only depends on the sufficient private information $S^i_t$. This induces another function, termed a prescription.

**Definition (Prescriptions [nayyar2013decentralized]).** A prescription $\gamma^i_t : \mathcal{S}^i_t \to \Delta(\mathcal{A}^i)$ is a mapping from agent $i$'s sufficient private information to a distribution over its actions at time $t$.

According to Theorem 3 in [tavafoghi2018unified], given perfect information of the system dynamics (i.e., the transition and observation models), the optimal planning solution to the decentralized control problem can be found via the following dynamic program:

$$V_{T+1}(\pi_{T+1}) = 0, \quad \forall \pi_{T+1} \in \Pi_{T+1}, \tag{2}$$

and at every $t \in \mathcal{T}$,

$$V_t(\pi_t) = \max_{\gamma^{1:N}_t : \mathcal{S}^{1:N}_t \to \Delta(\mathcal{A}^{1:N}_t)} Q_t(\pi_t, \gamma^{1:N}_t), \quad \forall \pi_t \in \Pi_t,$$

$$Q_t(\pi_t, \gamma^{1:N}_t) = \mathbb{E}\big[R_t(X_t, \gamma^{1:N}_t(S^{1:N}_t)) + V_{t+1}(\psi^{\gamma_t}_t(\pi_t, Z_{t+1})) \,\big|\, \Pi_t = \pi_t\big], \quad \forall \pi_t \in \Pi_t. \tag{3}$$
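The backward induction in (2)-(3) can be sketched on a toy problem in which the coordinator's belief is discretized to finitely many points and prescriptions are collapsed to plain actions; all names and numbers below are illustrative stand-ins, not the paper's algorithm.

```python
T = 3
beliefs = [0, 1]                  # discretized belief points
actions = [0, 1]                  # prescriptions, collapsed to plain actions
reward = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
# kernel[b][a]: distribution over successor belief points (toy stand-in for psi)
kernel = {0: {0: {0: 1.0}, 1: {1: 1.0}},
          1: {0: {0: 0.5, 1: 0.5}, 1: {1: 1.0}}}

V = {T + 1: {b: 0.0 for b in beliefs}}        # boundary condition V_{T+1} = 0
policy = {}
for t in range(T, 0, -1):                     # backward in time: t = T, ..., 1
    V[t] = {}
    for b in beliefs:
        # Q_t(b, a) = immediate reward + expected value of the updated belief
        q = {a: reward[(b, a)]
             + sum(p * V[t + 1][b2] for b2, p in kernel[b][a].items())
             for a in actions}
        best = max(q, key=q.get)
        V[t][b], policy[(t, b)] = q[best], best
```

The real dynamic program maximizes over prescriptions (functions of sufficient private information) rather than primitive actions, but the recursion has the same shape.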

In a learning problem, the system model is unknown to the agents. Agents must infer this information through interaction with the environment. The remainder of the paper is dedicated to solving the learning problem.

## 3 Information State Embedding

The definition of sufficient information (see Definition 2) characterizes a compression of history that is sufficient for optimal decision making. However, it does not offer an explicit way to construct such an information state, nor to learn it from data. In this section, we define the notion of an information state embedding, an approximate version of sufficient information, and analyze the approximation error it introduces. In the next section, we provide explicit algorithms to learn this information state embedding from data.

Since a Dec-POMDP can be formulated as a centralized POMDP [nayyar2013decentralized], it might seem reasonable to apply (single-agent) state representation techniques, e.g., [subramanian2019approximate], directly to the centralized problem to learn an appropriate information state embedding. This, however, is not very helpful, because the “state” in the centralized POMDP is derived from the common information in the decentralized problem, so single-agent state representation techniques would only compress the common information. Yet the intractability of the decentralized problem arises from the exponential growth of private information. To address the computational bottleneck in decentralized problems, what we really need is a compact embedding of the agents' private information.

Let $\hat{S}^i_t$ denote a compression of agent $i$'s sufficient private information at time $t$, where the detailed properties of this compression will become clear later. This compression mapping is denoted by $\hat{\zeta}^i_t$, and we assume that $\hat{\zeta}^i_t$ is injective (a function $f$ is injective if $f(x) \neq f(y)$ for all $x \neq y$). We note that the injectivity assumption does not contradict the fact that $\hat{\zeta}^i_t$ is a compression: injectivity restricts the cardinality of the domain and co-domain, whereas compression concerns their dimensionality. For computational reasons, we assume throughout the paper that $\hat{S}^i_t$ has a fixed domain (a fixed domain is not a theoretical requirement in our analysis; it is simply desirable from a computational perspective). Given its private information embedding, an agent makes decisions using an embedded strategy $\hat{g}^i_t$ that maps its embedding to a distribution over its actions.

Define the compressed common belief $\hat{\Pi}_t$ to be the conditional distribution over the current environment state and the joint private information embeddings of all the agents, given the current common information $C_t$ and previous embedded strategies $\hat{g}_{1:t-1}$:

$$\hat{\Pi}_t = \mathbb{P}^{\hat{g}_{1:t-1}}(X_t, \hat{S}^{1:N}_t \mid C_t). \tag{4}$$

Following Definition 1, we define the common information compression mapping $\hat{\sigma}_t$ and the embedded prescriptions $\hat{\gamma}^i_t$ accordingly. Analogous to (3), we can also define a dynamic program based on our embedded information:

$$\hat{V}_{T+1}(\hat{\pi}_{T+1}) = 0, \quad \forall \hat{\pi}_{T+1} \in \hat{\Pi}_{T+1}, \tag{5}$$

and at every $t \in \mathcal{T}$,

$$\hat{V}_t(\hat{\pi}_t) = \max_{\hat{\gamma}^{1:N}_t : \hat{\mathcal{S}}^{1:N}_t \to \Delta(\mathcal{A}^{1:N}_t)} \hat{Q}_t(\hat{\pi}_t, \hat{\gamma}^{1:N}_t), \quad \forall \hat{\pi}_t \in \hat{\Pi}_t, \tag{6}$$

$$\hat{Q}_t(\hat{\pi}_t, \hat{\gamma}^{1:N}_t) = \mathbb{E}\big[R_t(X_t, \hat{\gamma}^{1:N}_t(\hat{S}^{1:N}_t)) + \hat{V}_{t+1}(\hat{\psi}^{\hat{\gamma}_t}_t(\hat{\pi}_t, Z_{t+1})) \,\big|\, \hat{\Pi}_t = \hat{\pi}_t\big], \quad \forall \hat{\pi}_t \in \hat{\Pi}_t.$$

To quantify the performance of any such information embedding, we formally define an $(\epsilon, \delta)$-information state embedding as follows.

**Definition ($(\epsilon, \delta)$-information state embedding).** We call $\hat{S}^{1:N}_t$ an $(\epsilon, \delta)$-information state embedding if it satisfies the following two conditions:

(a) Approximately sufficient to predict future rewards: for any $t \in \mathcal{T}$ and any realization of sufficient private information $s^{1:N}_t$, common information $c_t$, and actions $a^{1:N}_t$:

$$\big|\mathbb{E}[R_t(X_t, a^{1:N}_t) \mid \pi_t, s^{1:N}_t, a^{1:N}_t] - \mathbb{E}[R_t(X_t, a^{1:N}_t) \mid \hat{\pi}_t, \hat{s}^{1:N}_t, a^{1:N}_t]\big| \le \epsilon;$$

(b) Approximately sufficient to predict future beliefs: for any Borel subset $B \subseteq \hat{\Pi}_{t+1}$, define

$$\mu_t(B; a^{1:N}_t) = \mathbb{P}(\hat{\Pi}_{t+1} \in B \mid \pi_t, s^{1:N}_t, a^{1:N}_t), \qquad \nu_t(B; a^{1:N}_t) = \mathbb{P}(\hat{\Pi}_{t+1} \in B \mid \hat{\pi}_t, \hat{s}^{1:N}_t, a^{1:N}_t).$$

Then

$$\mathcal{K}\big(\mu_t(a^{1:N}_t), \nu_t(a^{1:N}_t)\big) \le \delta,$$

where $\mathcal{K}(\cdot, \cdot)$ denotes the Wasserstein or Kantorovich-Rubinstein distance between two distributions.

By Kantorovich-Rubinstein duality [edwards2011kantorovich], Definition 3(b) implies $\big|\int f \, d\mu_t - \int f \, d\nu_t\big| \le L_f \delta$ for any Lipschitz continuous function $f$ with Lipschitz constant $L_f$ (with respect to the Euclidean metric). To obtain an error bound on the value function, we make the following assumption:

**Assumption (Lipschitz continuity of value functions).** The value functions $\hat{V}_t$ are Lipschitz continuous for all $t \in \mathcal{T}$ with a common Lipschitz constant upper bound $L_V$, i.e., $\big|\hat{V}_t(\hat{\pi}) - \hat{V}_t(\hat{\pi}')\big| \le L_V \|\hat{\pi} - \hat{\pi}'\|$.

We note that Lipschitz continuity over the compressed common information space is a mild assumption. This is because, by centralizing the problem as a single-agent POMDP, the value function is piecewise linear and convex in the belief state [sondik1971optimal], and is hence Lipschitz continuous over the (non-compressed) common information space.

Combining Definition 3(b) and Assumption 3, we know that for any realization $\pi_t$, $\hat{\pi}_t$, $s^{1:N}_t$, $\hat{s}^{1:N}_t$, and $a^{1:N}_t$:

$$\big|\mathbb{E}[\hat{V}(\hat{\Pi}_{t+1}) \mid \pi_t, s^{1:N}_t, a^{1:N}_t] - \mathbb{E}[\hat{V}(\hat{\Pi}_{t+1}) \mid \hat{\pi}_t, \hat{s}^{1:N}_t, a^{1:N}_t]\big| \le L_V \delta.$$
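In one dimension, the Wasserstein distance used in Definition 3(b) has a simple closed form for equal-size empirical samples (the mean gap between sorted samples); the helper below is an illustrative way to estimate a compression error $\delta$ from data, not the paper's estimator.

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-D Wasserstein (Kantorovich-Rubinstein) distance between
    two equal-size sample sets: mean absolute gap between sorted samples."""
    xs, ys = sorted(xs), sorted(ys)
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Identical samples have zero distance; a uniform shift by c has distance c.
d0 = wasserstein_1d([0.1, 0.4, 0.7], [0.1, 0.4, 0.7])
d1 = wasserstein_1d([0.1, 0.4, 0.7], [0.2, 0.5, 0.8])
```

In practice, $\mu_t$ and $\nu_t$ live over a multi-dimensional belief space, where a library routine (e.g., an optimal-transport solver) would replace this one-dimensional shortcut.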

Next, we extend the approximation error analysis in [subramanian2019approximate] to the multi-agent setting. Our main result is that, by compressing the exponentially growing history to an $(\epsilon, \delta)$-information state embedding, the error of the value functions over the whole horizon is bounded, as stated in the theorem below.

**Theorem.** For any $t \in \mathcal{T}$ and any realization $\pi_t$ and $\hat{\pi}_t$, let $\gamma^{*,1:N}_t$ and $\hat{\gamma}^{*,1:N}_t$ denote the optimal prescriptions in the two dynamic programs (3) and (6), respectively. We have:

$$\big|Q_t(\pi_t, \gamma^{*,1:N}_t) - \hat{Q}_t(\hat{\pi}_t, \hat{\gamma}^{*,1:N}_t)\big| \le (T - t + 1)(\epsilon + L_V \delta), \qquad \big|V_t(\pi_t) - \hat{V}_t(\hat{\pi}_t)\big| \le (T - t + 1)(\epsilon + L_V \delta).$$
###### Proof.

We prove the result by backward induction. As the basis of induction, Theorem 3 holds at time $T + 1$ by construction. Suppose Theorem 3 holds at time $t + 1$; then for time $t$, we define an auxiliary set of prescriptions $\hat{\gamma}^{0,1:N}_t$ that produces exactly the same action distribution as $\gamma^{*,1:N}_t$. Specifically, for any $i \in \mathcal{N}$ and any realization $s^i_t$, this definition implies $\hat{\gamma}^{0,i}_t(\hat{\zeta}^i_t(s^i_t)) = \gamma^{*,i}_t(s^i_t)$. This is possible because our private information embedding process is assumed to be injective. The existence of such an oracle suggests that, given only the embedded information, it is always possible to recover the optimal action distributions produced by the complete information. Nevertheless, our embedded-information-based dynamic program ends up generating a different set of prescriptions $\hat{\gamma}^{*,1:N}_t$, and hence the oracle is used for analysis purposes only. Together with Definition 3(a), the following holds for every realization $\pi_t$, $\hat{\pi}_t$, $s^{1:N}_t$, and $\hat{s}^{1:N}_t$:

$$\big|\mathbb{E}[R_t(X_t, \gamma^{*,1:N}_t(s^{1:N}_t)) \mid \pi_t, s^{1:N}_t, \gamma^{*,1:N}_t] - \mathbb{E}[R_t(X_t, \hat{\gamma}^{0,1:N}_t(\hat{s}^{1:N}_t)) \mid \hat{\pi}_t, \hat{s}^{1:N}_t, \hat{\gamma}^{0,1:N}_t]\big| \le \epsilon, \tag{7}$$

where $\hat{\gamma}^{0,1:N}_t$ is a shorthand for the collection $(\hat{\gamma}^{0,1}_t, \ldots, \hat{\gamma}^{0,N}_t)$. Similarly, by combining the definition of $\hat{\gamma}^{0,1:N}_t$ with Remark 3, we also have:

$$\big|\mathbb{E}[\hat{V}(\hat{\Pi}_{t+1}) \mid \pi_t, s^{1:N}_t, \gamma^{*,1:N}_t] - \mathbb{E}[\hat{V}(\hat{\Pi}_{t+1}) \mid \hat{\pi}_t, \hat{s}^{1:N}_t, \hat{\gamma}^{0,1:N}_t]\big| \le L_V \delta. \tag{8}$$

To see this, notice that for any Borel subset $B$ of $\hat{\Pi}_{t+1}$, we have:

$$\mu_t(B; \gamma^{*,1:N}_t) \triangleq \mathbb{P}(\hat{\Pi}_{t+1} \in B \mid \pi_t, s^{1:N}_t, \gamma^{*,1:N}_t) = \sum_{a^{1:N}_t \in \mathcal{A}^{1:N}_t} \mathbb{P}(a^{1:N}_t \mid s^{1:N}_t, \gamma^{*,1:N}_t) \, \mathbb{P}(\hat{\Pi}_{t+1} \in B \mid \pi_t, s^{1:N}_t, a^{1:N}_t) = \sum_{a^{1:N}_t \in \mathcal{A}^{1:N}_t} \mathbb{P}(a^{1:N}_t \mid s^{1:N}_t, \gamma^{*,1:N}_t) \, \mu_t(B; a^{1:N}_t),$$

with the analogous decomposition holding for $\nu_t(B; \hat{\gamma}^{0,1:N}_t)$,

and also note that by Kantorovich-Rubinstein duality,

$$\begin{aligned} \mathcal{K}\big(\mu_t(\gamma^{*,1:N}_t), \nu_t(\hat{\gamma}^{0,1:N}_t)\big) &= \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Big|\int f \, d\mu_t(\gamma^{*,1:N}_t) - \int f \, d\nu_t(\hat{\gamma}^{0,1:N}_t)\Big| \\ &= \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Big|\sum_{a^{1:N}_t \in \mathcal{A}^{1:N}_t} \mathbb{P}(a^{1:N}_t \mid s^{1:N}_t, \gamma^{*,1:N}_t) \int f \, d\mu_t(a^{1:N}_t) - \sum_{a^{1:N}_t \in \mathcal{A}^{1:N}_t} \mathbb{P}(a^{1:N}_t \mid \hat{s}^{1:N}_t, \hat{\gamma}^{0,1:N}_t) \int f \, d\nu_t(a^{1:N}_t)\Big| \quad (9) \\ &\le \sup_{\|f\|_{\mathrm{Lip}} \le 1} \sum_{a^{1:N}_t \in \mathcal{A}^{1:N}_t} \mathbb{P}(a^{1:N}_t \mid s^{1:N}_t, \gamma^{*,1:N}_t) \Big|\int f \, d\mu_t(a^{1:N}_t) - \int f \, d\nu_t(a^{1:N}_t)\Big|. \quad (10) \end{aligned}$$

Equation (9) holds because the probability measures are finite and the coefficients are non-negative. Inequality (10) follows from the triangle inequality and the fact that $\mathbb{P}(a^{1:N}_t \mid s^{1:N}_t, \gamma^{*,1:N}_t) = \mathbb{P}(a^{1:N}_t \mid \hat{s}^{1:N}_t, \hat{\gamma}^{0,1:N}_t)$ by the definition of the oracle prescriptions. Since the supremum of a summation is no larger than the summation of suprema:

$$\begin{aligned} \mathcal{K}\big(\mu_t(\gamma^{*,1:N}_t), \nu_t(\hat{\gamma}^{0,1:N}_t)\big) &\le \sum_{a^{1:N}_t \in \mathcal{A}^{1:N}_t} \mathbb{P}(a^{1:N}_t \mid s^{1:N}_t, \gamma^{*,1:N}_t) \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Big|\int f \, d\mu_t(a^{1:N}_t) - \int f \, d\nu_t(a^{1:N}_t)\Big| \\ &\le \sum_{a^{1:N}_t \in \mathcal{A}^{1:N}_t} \mathbb{P}(a^{1:N}_t \mid s^{1:N}_t, \gamma^{*,1:N}_t) \, \mathcal{K}\big(\mu_t(a^{1:N}_t), \nu_t(a^{1:N}_t)\big) \quad (11) \\ &\le \delta. \quad (12) \end{aligned}$$

Inequality (11) holds because Kantorovich-Rubinstein duality implies that $\mathcal{K}(\mu_t(a^{1:N}_t), \nu_t(a^{1:N}_t))$ upper-bounds $\big|\int f \, d\mu_t(a^{1:N}_t) - \int f \, d\nu_t(a^{1:N}_t)\big|$ for any function $f$ with Lipschitz constant no larger than $1$, and hence in particular for the maximizing $f$. Finally, Inequality (12) is due to Definition 3(b) and the fact that $\sum_{a^{1:N}_t} \mathbb{P}(a^{1:N}_t \mid s^{1:N}_t, \gamma^{*,1:N}_t) = 1$.

Using this oracle $\hat{\gamma}^{0,1:N}_t$, for any realization of sufficient private information $s^{1:N}_t$ and common information $c_t$ at time $t$, let $\pi_t = \sigma_t(c_t)$ and $\hat{\pi}_t = \hat{\sigma}_t(c_t)$; then we have:

$$\begin{aligned} Q_t(\pi_t, \gamma^{*,1:N}_t) &= \mathbb{E}\big[R_t(X_t, \gamma^{*,1:N}_t(s^{1:N}_t)) + V_{t+1}(\psi^{\gamma^*_t}_t(\pi_t, Z_{t+1})) \,\big|\, \pi_t, s^{1:N}_t, \gamma^{*,1:N}_t\big] \quad (13) \\ &= \mathbb{E}\big[R_t(X_t, \gamma^{*,1:N}_t(s^{1:N}_t)) + V_{t+1}(\Pi_{t+1}) \,\big|\, \pi_t, s^{1:N}_t, \gamma^{*,1:N}_t\big] \quad (14) \\ &\le \mathbb{E}\big[R_t(X_t, \gamma^{*,1:N}_t(s^{1:N}_t)) + \hat{V}_{t+1}(\hat{\Pi}_{t+1}) \,\big|\, \pi_t, s^{1:N}_t, \gamma^{*,1:N}_t\big] + (T - t)(\epsilon + L_V \delta). \quad (15) \end{aligned}$$

Equation (13) is by the definition of $Q_t$ in the dynamic program. Equation (14) is by the definition of $\Pi_{t+1}$. Inequality (15) comes from our induction hypothesis. Using the results from Equations (7) and (8), we have:

$$\begin{aligned} Q_t(\pi_t, \gamma^{*,1:N}_t) &\le \big(\mathbb{E}[R_t(X_t, \hat{\gamma}^{0,1:N}_t(\hat{s}^{1:N}_t)) \mid \hat{\pi}_t, \hat{s}^{1:N}_t, \hat{\gamma}^{0,1:N}_t] + \epsilon\big) + \big(\mathbb{E}[\hat{V}_{t+1}(\hat{\Pi}_{t+1}) \mid \hat{\pi}_t, \hat{s}^{1:N}_t, \hat{\gamma}^{0,1:N}_t] + L_V \delta\big) + (T - t)(\epsilon + L_V \delta) \quad (16) \\ &= \hat{Q}_t(\hat{\pi}_t, \hat{\gamma}^{0,1:N}_t) + (T - t + 1)(\epsilon + L_V \delta) \quad (17) \\ &\le \hat{Q}_t(\hat{\pi}_t, \hat{\gamma}^{*,1:N}_t) + (T - t + 1)(\epsilon + L_V \delta). \quad (18) \end{aligned}$$

Equation (17) follows from the definition of $\hat{Q}_t$. Inequality (18) holds because $\hat{\gamma}^{*,1:N}_t$ is optimal for the embedded dynamic program, and hence its value is no smaller than that of $\hat{\gamma}^{0,1:N}_t$. ∎

Our result shows that the error incurred by carrying out dynamic programming with only the embedded information, relative to dynamic programming with the full information, grows at most linearly with the remaining horizon. To obtain a small upper bound on the value error, embedding schemes with small compression errors ($\epsilon$ and $\delta$) should be designed. In the following sections, we propose and empirically evaluate several such designs.

## 4 Learning Information State Embeddings

In this section, we introduce three empirical instances of information state embedding, and we demonstrate how to learn an embedding from data. Theoretical upper bounds on the compression errors of these embeddings are generally unavailable, and deriving them is not the focus of this paper. Instead, our intention is to use several examples to demonstrate the feasibility of the embed-then-learn framework, and we evaluate the performance of these instances empirically. Note that the first two embeddings are simply defined a priori and do not require any learning. The third embedding differs in that the embedding itself must be learned via a training procedure.

Finite memory embedding. The first instance, named finite memory embedding (FM-E), simply keeps a fixed memory, or window, of the local history as an information state. Specifically, each agent maintains a one-hot encoded vector of a fixed window of its most recent actions and observations, and its decision only depends on this fixed memory. The embedding is updated recursively by appending the newest action-observation pair and discarding the oldest, so that only the last $d$ pairs are retained, where $d$ is the length of the fixed window. (Although finite-memory decision making has been well studied in the single-agent setting [white1994finite], how the truncation of history influences performance in the multi-agent setting remains generally open.) This embedding can be regarded as a simplification of [banerjee2012sample], where the authors define each complete history sequence to be an information state and perform Q-learning [watkins1992q] on this information state space. Their method does not scale well to longer horizons due to the explosion of the new state space; in contrast, FM-E is more scalable as its embedding space has a fixed size, but this comes at the price of losing long-term memory.
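A minimal FM-E sketch, assuming illustrative action/observation alphabet sizes and window length (all constants below are placeholders, not the paper's settings):

```python
from collections import deque

N_ACTIONS, N_OBS, WINDOW = 2, 3, 2   # illustrative sizes

def one_hot(idx, size):
    v = [0] * size
    v[idx] = 1
    return v

class FiniteMemoryEmbedding:
    def __init__(self):
        # Pre-fill with zero vectors so the embedding has fixed size from t = 0.
        self.buf = deque(([0] * (N_ACTIONS + N_OBS) for _ in range(WINDOW)),
                         maxlen=WINDOW)

    def update(self, action, obs):
        # Recursive update: the deque drops the oldest pair, keeps the newest.
        self.buf.append(one_hot(action, N_ACTIONS) + one_hot(obs, N_OBS))

    def state(self):
        return [x for pair in self.buf for x in pair]

emb = FiniteMemoryEmbedding()
emb.update(1, 2)
s = emb.state()   # fixed length WINDOW * (N_ACTIONS + N_OBS) = 10
```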

Principal component analysis embedding. The second instance, principal component analysis embedding (PCA-E), uses PCA [pearson1901liii] to reduce the local history to a fixed-size feature vector. PCA is a simple and well-established algorithm for dimensionality reduction: given a specified dimensionality, it preserves the directions of largest variance during compression. Note that this objective differs from our intention of maintaining predictive capability as stated in Definition 3. We also note that PCA can be implemented recursively [li2000recursive] to handle sequential data.
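A PCA-E sketch in pure Python (in practice a library implementation such as scikit-learn's `PCA` would be used): the top principal component is found by power iteration on the covariance of flattened history vectors, and a history is embedded as its projection onto that component. The data below is illustrative.

```python
def top_component(rows, iters=200):
    """Return the top principal component and the column means of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Covariance matrix (d x d).
    cov = [[sum(row[i] * row[j] for row in centered) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):   # power iteration converges to the top eigenvector
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v, means

def embed(history_vec, v, means):
    # One-dimensional embedding: signed projection onto the top component.
    return sum((history_vec[j] - means[j]) * v[j] for j in range(len(v)))

rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]  # nearly collinear toys
v, means = top_component(rows)
proj = embed(rows[0], v, means)   # scalar embedding of one history vector
```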

Recurrent neural network embedding. The third implementation, named recurrent neural network embedding (RNN-E), uses an RNN to compress history. Each agent uses an LSTM network [hochreiter1997long] (a variant of RNN) that maps its local history to a fixed-size vector at each time step; more specifically, we treat the fixed-size hidden state of the LSTM network as our information state embedding. Recursive update of this embedding is inherent in the structure of the LSTM: the new hidden and cell states are computed from the previous hidden and cell states and the latest action-observation input, where the cell state keeps a selective memory of history and the network parameters are learned from data.
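The recursive LSTM update can be made explicit with a single self-contained cell; the weights here are random placeholders rather than trained parameters, and in practice a framework such as PyTorch supplies this cell.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMCell:
    """Plain-Python LSTM cell: (h_t, c_t) = LSTM(h_{t-1}, c_{t-1}, x_t; theta).
    The hidden state h_t plays the role of the information state embedding."""

    def __init__(self, n_in, n_hid, seed=0):
        rng = random.Random(seed)
        self.n_hid = n_hid
        # One weight row per gate unit: [input gate, forget, candidate, output].
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hid)]
                  for _ in range(4 * n_hid)]
        self.b = [0.0] * (4 * n_hid)

    def step(self, x, h, c):
        z = x + h                                    # concatenated input
        pre = [sum(w * v for w, v in zip(row, z)) + b
               for row, b in zip(self.W, self.b)]
        H = self.n_hid
        i = [sigmoid(p) for p in pre[0:H]]           # input gate
        f = [sigmoid(p) for p in pre[H:2 * H]]       # forget gate
        g = [math.tanh(p) for p in pre[2 * H:3 * H]] # candidate cell values
        o = [sigmoid(p) for p in pre[3 * H:4 * H]]   # output gate
        c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
        h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
        return h_new, c_new

cell = LSTMCell(n_in=3, n_hid=4)
h, c = [0.0] * 4, [0.0] * 4
for x in ([1, 0, 0], [0, 1, 0]):   # one-hot action/observation sequence
    h, c = cell.step(x, h, c)      # h is the fixed-size embedding at each step
```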

For all three instances, we use deep Q-networks (DQN) [mnih2015human] to learn a policy. Following the embed-then-learn procedure, as illustrated in Figure 1, we feed the information state embedding into the DQN to get the Q-value for each candidate action. In the single-agent setting, a similar network structure, named DRQN [hausknecht2015deep], concatenates an LSTM and a DQN, and adopts an end-to-end structure in which the LSTM directly outputs the Q-values. In contrast, we extract an embedding first, so that we can theoretically bound the value function given an upper bound on the embedding error.
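The embed-then-learn interface can be illustrated with the value learner swapped for something deliberately simple: a tabular Q-function keyed by hashable embeddings (the paper uses a DQN; `ALPHA` and `GAMMA` are illustrative constants).

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.95        # illustrative learning rate and discount
Q = defaultdict(float)

def q_update(emb, action, reward, next_emb, actions):
    """One Q-learning step, with the embedding playing the role of the state."""
    best_next = max(Q[(next_emb, a)] for a in actions)
    key = (emb, action)
    Q[key] += ALPHA * (reward + GAMMA * best_next - Q[key])

actions = (0, 1)
# An embedding (e.g., an FM-E window) is just the key into the learner.
q_update((0, 1, 0), 1, 1.0, (1, 0, 0), actions)
```

Any of the three embeddings can produce the `emb` key, which is the point of the pipeline: the downstream learner never sees the raw history.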

As with other MARL algorithms, the environment appears non-stationary from each agent's perspective, since the agents learn and update their policies concurrently. To address this issue, common training schemes in the literature include centralized training and execution, concurrent learning, and parameter sharing [gupta2017cooperative]. In our simulations, we adopt parameter sharing because it has demonstrated better performance in [gupta2017cooperative]. In parameter sharing, homogeneous agents share the same network parameter values, which leads to more efficient training and partly addresses the non-stationarity issue in concurrent learning. Heterogeneous policies are still possible because each agent feeds its unique agent ID and its own local observations into the network. However, as a standard training scheme in the literature, parameter sharing does slightly break the assumption of decentralization, as it requires either centralized learning (but still decentralized execution) or periodic sharing of gradients among agents (which is still a weaker assumption than real-time sharing of local observations).

## 5 Numerical Results

In this section, we evaluate our embedding schemes on several classic two-agent benchmark problems from the Dec-POMDP literature [masplan]: Grid3x3corners [amato2009incremental], Dectiger [nair2003taming], and Boxpushing [seuken2007improved]. For each of them, we compare our three instances against the state-of-the-art planning solution FB-HSVI [dibangoye2016optimally], which requires a complete model of the environment, and a learning solution, oSARSA [dibangoye2018learning]. We refer to the performance reported by their authors. The authors limited the running time of their algorithms by caps on the number of episodes and on wall-clock hours, but these stopping criteria are not the binding constraints for our solutions, as our algorithms take significantly less time and fewer episodes to converge. The benchmark problems as well as the implementations of our solutions can be found at https://github.com/xizeroplus/marl-embedding.

For RNN-E, we use a one-layer LSTM network as the embedding network. The inputs to the LSTM are the one-hot encoded actions and observations, together with the embedding from the previous step. Our DQN is a two-layer fully connected network that takes the embedding as input; the output size equals the size of the action space. All activations are Rectified Linear Unit (ReLU) [glorot2011deep] functions.

We adopt ε-greedy for policy exploration, with ε decreasing linearly over the training episodes by default. We use a replay buffer for experience replay, and for the DQN error estimation we draw a batch of samples from the buffer. We use the mean squared error loss and the Adam optimizer for both the embedding network and the DQN. We perform back-propagation after each episode, and periodically update the target network of the DQN. The sizes of the embeddings for FM-E and PCA-E are fixed. For more efficient training, we only train our networks with a single horizon length and then test on different horizons, which surprisingly achieves performance comparable to training and testing on each possible horizon length separately. We average the performance over many testing episodes in each run, and all results are averaged over multiple runs. The performance of the four algorithms over different horizon lengths is shown in Table 1.
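The linear ε-decay schedule can be sketched as follows; `eps_start` and `eps_end` are illustrative placeholders, not the values used in our experiments:

```python
def epsilon(episode, total_episodes, eps_start=1.0, eps_end=0.05):
    """Linearly anneal the exploration rate over training, then hold it flat.
    eps_start / eps_end are illustrative placeholders."""
    frac = min(episode / total_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# exploration decays from eps_start to eps_end, then stays at eps_end
schedule = [round(epsilon(e, 100), 3) for e in (0, 50, 100, 150)]
print(schedule)  # [1.0, 0.525, 0.05, 0.05]
```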

We can see that RNN-E and FM-E achieve high rewards over different horizons. FM-E performs better on the shorter-horizon problems Dectiger and Boxpushing, whereas RNN-E outperforms FM-E on Grid3x3corners, where horizons are longer and long-term memories are necessary. Although FM-E achieves good performance on the three examples, we note that it has a very limited scope of application, because it is easy to construct examples where short-term memories are not sufficient for decision making.

For example, consider the two-agent Dec-POMDP problem illustrated in Figure 2, which can be regarded as a modification of Grid3x3corners. Agent 1 starts from state 1 in the maze, and Agent 2 starts from state 23. The goal of the two agents is to meet at the destination state 12 as soon as possible (i.e., they receive a time-discounted unit reward when both of them are in state 12, and no reward otherwise). The candidate actions are moving one step in any of the four directions. The agents always receive the same observation no matter what states they are in and what actions they take. If an agent runs into a wall, it stays where it is. For each agent, it suffices to count how many times it has gone right, and the optimal strategy is to switch from going right to going down or up once the count reaches a certain threshold. Now suppose Agent 1 only has a finite memory of length 4. This agent performs poorly because it cannot distinguish states 5, 6, 7, and 8. If it decides to deterministically go right, it gets stuck in state 8 forever; if it goes down with some probability, it wastes time in states 5, 6, and 7. Therefore, finite-memory agents obtain low rewards in this example.
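The failure mode above can be demonstrated with a minimal sketch of finite-memory truncation; the histories and the `"right"`/`"same"` tokens are hypothetical stand-ins for the repeated action and the constant observation:

```python
def fm_embedding(history, k):
    """Finite-memory embedding: keep only the last k (action, observation)
    pairs, left-padded so the embedding has a fixed size."""
    window = list(history[-k:])
    pad = [("<pad>", "<pad>")] * (k - len(window))
    return tuple(pad + window)

# Two histories in the corridor of Figure 2: repeated "right" moves with
# identical observations.  With memory length 4 they are indistinguishable.
hist_short = [("right", "same")] * 4   # hypothetical: 4 right moves so far
hist_long  = [("right", "same")] * 7   # hypothetical: 7 right moves so far

same_under_fm = fm_embedding(hist_short, 4) == fm_embedding(hist_long, 4)
print(same_under_fm)  # True
```

A longer window (or a count, as the RNN can learn to keep) separates the two histories, which is exactly the information the length-4 agent has discarded.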

On the other hand, RNN-E is able to summarize the whole history rather than only keeping a short-term memory, but this comes at a price: RNNs are generally difficult to train, mostly due to the vanishing and exploding gradient problems.

PCA-E does not perform well on the three tasks. We believe the reason is that PCA is designed to preserve the largest possible variance of the data, which is generally not the same as the most predictive information of the history, as required by Definition 3.

The oSARSA algorithm performs comparably to the planning solution FB-HSVI, and generally outperforms our solutions. This is because oSARSA relies on centralized learning, which is a much stronger assumption than the parameter-sharing assumption that our solutions rely on. The centralized scheme of oSARSA also incurs heavy computation, as oSARSA requires solving a mixed-integer linear program at each step.
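For reference, the PCA-based compression (PCA-E) discussed above can be sketched with a plain SVD over flattened history vectors; the data and dimensions below are toy placeholders:

```python
import numpy as np

def pca_embed(H, d):
    """Project flattened, one-hot encoded histories H (one row per history)
    onto the top-d principal components; each output row is an embedding."""
    Hc = H - H.mean(axis=0)                           # center the data
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False) # components in Vt rows
    return Hc @ Vt[:d].T                              # shape (n_histories, d)

rng = np.random.default_rng(3)
n, flat_dim, d = 20, 12, 3                            # hypothetical sizes
H = rng.integers(0, 2, size=(n, flat_dim)).astype(float)  # toy binary histories
Z = pca_embed(H, d)
print(Z.shape)  # (20, 3)
```

The projection maximizes retained variance, which, as noted above, need not coincide with the information most predictive of future rewards and observations.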

## 6 Concluding Remarks and Future Directions

In this paper, we have introduced the concept of information state embedding for partially observable cooperative MARL. We have theoretically analyzed how the compression error of the embedding influences the value functions. We have also proposed three instances of embeddings, and empirically evaluated their performance on partially observable MARL benchmarks.

An interesting future direction is to theoretically analyze the compression errors of the common embedding strategies we have used, which would close the loop of our theoretical analysis. It would also be interesting to design other empirical embeddings that explicitly reduce this compression error.