In multi-agent cooperative tasks, agents learn from the experiences generated by continuously interacting with the environment to maximize the cumulative shared reward. Recently, multi-agent reinforcement learning (MARL) has been applied to real-world cooperative systems Bhalla et al. (2020); Xu et al. (2021). However, in many industrial applications, continuously interacting with the environment and collecting experiences during learning is costly, risky, and time-consuming. One way to address this is offline RL, where the agent can only access a fixed dataset of experiences and learns the policy without further interaction with the environment. However, in multi-agent environments, the dataset of each agent is often pre-collected individually by different behavior policies, which are not necessarily expert, and each dataset contains only the individual actions of that agent instead of the joint actions of all agents. Therefore, the dataset does not satisfy the paradigm of centralized training, and the agent has to learn a coordinated policy in an offline and fully decentralized way.
The main challenge of offline RL is the extrapolation error, an error in value estimation incurred by the mismatch between the experience distributions of the learned policy and the dataset Fujimoto et al. (2019), e.g., the distance of the learned action distribution from the behavior action distribution, and the bias of the transition dynamics estimated from the dataset relative to the true transition dynamics. Recently, almost all offline RL methods Fujimoto et al. (2019); Levine et al. (2020); Jaques et al. (2019) have focused on constraining the learned policy to be close to the behavior policy to avoid overestimating the values of out-of-distribution actions, but they ignore correcting the transition bias, since the deviation of the estimated transition dynamics is small when the single-agent environment is stationary.
However, in decentralized multi-agent environments, from the perspective of each agent, the other agents are a part of the environment, and the transition dynamics experienced by each agent depend on the policies of the other agents. Even in a stationary environment, the experienced transition dynamics of each agent change as other agents update their policies Foerster et al. (2017). Since the behavior policies of other agents can be inconsistent with their learned policies, which are unknowable in decentralized multi-agent environments, the transition dynamics estimated by each agent from its dataset would differ from the transition dynamics induced by the learned policies of other agents, causing large errors in value estimation. The extrapolation error would lead to suboptimal policies. Moreover, trained on different distributions of experiences collected by various behavior policies, the agents' estimated values of the same state may differ substantially, which prevents the learned policies from coordinating with each other.
To overcome the suboptimality and miscoordination caused by transition bias in decentralized learning, we introduce value deviation and transition normalization to deliberately modify the transition probabilities estimated from the dataset. During data collection, if one agent takes an optimal action while other agents take suboptimal actions at a state, the transition probabilities of low-value next states become large. Thus, the Q-value of the optimal action is underestimated, and the agents fall into a suboptimal solution. Since the other agents are also trained, their learned policies would become better than their behavior policies. For each agent, the transition probabilities of high-value next states induced by the learned policies would be larger than those estimated from the dataset. Therefore, we let each agent be optimistic toward other agents and multiply the transition probabilities by the deviation of the value of the next state from the expected value over all next states, to make the estimated transition probabilities close to the transition probabilities induced by the learned policies of other agents.
Value deviation can decrease the extrapolation error and help the agents escape from suboptimal solutions. However, in some cases, the behavior policies of other agents might be highly deterministic, which makes the distribution of experiences unbalanced. If the transition probabilities of high-value next states are extremely low, value deviation may not remedy the underestimation. Moreover, due to the diversity in the agents' experience distributions, the value of the same state might be overestimated by some agents while underestimated by others, which results in miscoordination of the learned policies. To address these two problems, we normalize the transition probabilities estimated from the dataset to be uniform. Transition normalization balances the extremely biased distribution of experiences and builds a consensus about value estimates. By combining value deviation and transition normalization, the agents can learn high-performing and coordinated policies in an offline and fully decentralized way.
Although value deviation and transition normalization make the transition dynamics non-stationary, we mathematically prove the convergence of Q-learning under such non-stationary transition dynamics. By importance sampling, value deviation and transition normalization take effect only as weights on the objective function, which makes our method easy to implement. The proposed method is instantiated on BCQ Fujimoto et al. (2019), termed MABCQ, to additionally avoid out-of-distribution actions. We build offline datasets and evaluate MABCQ in four multi-agent MuJoCo scenarios Todorov et al. (2012); Brockman et al. (2016); de Witt et al. (2020). Experimental results show that MABCQ greatly outperforms BCQ, and ablation studies demonstrate the effectiveness of value deviation and transition normalization. To the best of our knowledge, MABCQ is the first method for offline and fully decentralized multi-agent reinforcement learning.
2 Related Work
MARL. Many MARL methods have been proposed for learning to solve cooperative tasks in an online manner. Some methods Lowe et al. (2017); Foerster et al. (2018); Iqbal and Sha (2019) extend policy gradient to multi-agent cases. Value factorization methods Sunehag et al. (2018); Rashid et al. (2018); Son et al. (2019) decompose the joint value function into individual value functions. Communication methods Das et al. (2019); Ding et al. (2020) share information between agents for better cooperation. All these methods follow the paradigm of centralized training and decentralized execution (CTDE), where the agents can access information from other agents during centralized training. However, in our offline and decentralized setting, the datasets of the agents are different; each dataset contains individual actions instead of joint actions; and the agents cannot be trained in a centralized way.
For decentralized learning, the key challenge is the obsolete experiences in the replay buffer. Fingerprints Foerster et al. (2017) deals with the obsolete experience problem by conditioning the value function on a fingerprint that disambiguates the age of the sampled data. Lenient-DQN Palmer et al. (2018) extends the leniency concept and introduces optimism into the value function update by forgiving suboptimal actions. Concurrent experience replay Omidshafiei et al. (2017) induces correlations in local policy updates, making agents tend to converge to the same equilibrium. However, these methods require additional information, e.g., training iteration number, exploration rate, and timestamp, which is often not provided by an offline dataset.
Offline RL. Offline RL requires the agent to learn from a fixed batch of data, consisting of single-step transitions, without exploration. Unlike imitation learning, offline RL does not assume that the offline data is provided by a high-performing expert but has to handle data generated by suboptimal or multi-modal behavior policies. Most offline RL methods consider out-of-distribution actions Levine et al. (2020) as the fundamental challenge, which is the main cause of the extrapolation error Fujimoto et al. (2019) in value estimation in single-agent environments. To minimize the extrapolation error, recent methods first approximate the behavior policy $\mu$ by maximum likelihood over the dataset and then enforce the learned policy $\pi$ to be close to $\mu$. Some methods introduce a divergence regularization term into the objective function of $\pi$,
where the sample-based divergence function could be kernel MMD Kumar et al. (2019), Wasserstein distance, or KL divergence Jaques et al. (2019). BCQ Fujimoto et al. (2019) does not use a regularization term but generates actions by sampling from a perturbed behavior policy. BCQ contains a Q-network, a perturbation network, and a conditional VAE. The conditional VAE Kingma and Welling (2013); Sohn et al. (2015) is trained to model the behavior policy; given the state, it outputs actions that follow the action distribution in the dataset. The agent generates several actions from the VAE, adds small perturbations to the actions using the perturbation network to increase diversity, and then selects the action with the highest value under the Q-network. A finite-sample analysis for offline MARL Zhang et al. (2018) has been studied, but the agents are assumed to receive individual rewards instead of a shared reward and to be connected by communication networks, which is not the fully decentralized setting. None of these methods considers the extrapolation error introduced by the transition bias, which is a fatal problem in offline and decentralized MARL.
Considering $n$ agents in an environment, there is a multi-agent MDP Oliehoek and Amato (2016) with the state space $\mathcal{S}$ and the joint action space $\mathcal{A}$. At each timestep, each agent $i$ gets the state $s$ and performs an individual action $a_i$, and the environment transitions to the next state $s'$ by taking the joint action $a$ with the transition probability $P(s'|s,a)$; we assume the environment to be deterministic. The agents get a shared reward $r$, which is simplified to depend only on the state Schulman et al. (2015). The agents learn to maximize the expected return $\mathbb{E}\left[\sum_{t=0}^{T}\gamma^t r_t\right]$, where $\gamma$ is a discount factor and $T$ is the time horizon of the episode. However, in fully decentralized learning, the MDP is partially observable to each agent since the agent cannot observe the joint action $a$. During execution, from the perspective of each agent $i$, there is a viewed MDP with the individual action space $\mathcal{A}_i$ and the transition probability
$$P_i(s'|s,a_i) = \sum_{a_{-i}} \pi_{-i}(a_{-i}|s)\, P(s'|s,a_i,a_{-i}),$$
where $a_{-i}$ denotes the joint action of all agents except agent $i$, and $\pi_{-i}$ denotes their joint policy. As the transition probability depends on the policies of the other agents, if the other agents are also updating their policies, $P_i$ becomes non-stationary. Moreover, if the agent cannot interact with the environment, $P_i$ is unknown.
In the offline and decentralized setting, each agent $i$ can only access a fixed offline dataset $\mathcal{D}_i$, which is pre-collected by behavior policies and contains the tuples $(s, a_i, r, s')$. As defined in BCQ Fujimoto et al. (2019), the visible MDP is constructed on $\mathcal{D}_i$, with the transition probability¹
$$P_{\mathcal{D}_i}(s'|s,a_i) = \frac{N(s,a_i,s')}{\sum_{\tilde{s}} N(s,a_i,\tilde{s})},$$
¹ The transition probability in the following sections means the one calculated from $\mathcal{D}_i$ unless otherwise stated.
where $N(s,a_i,s')$ is the number of times the tuple $(s,a_i,s')$ is observed in $\mathcal{D}_i$. However, since the learned policies of other agents might differ greatly from their behavior policies, $P_{\mathcal{D}_i}$ would be biased from $P_i$, which creates large extrapolation errors and differences in value estimates between agents, and eventually leads to uncoordinated, suboptimal policies.
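As a concrete sketch, the count-based estimate above can be implemented as follows. This is an illustrative helper (the function name and dict layout are ours), with the dataset stored as `(s, a_i, r, s')` tuples as described in the text:

```python
from collections import defaultdict

def empirical_transitions(dataset):
    """Estimate P_D(s' | s, a_i) by counting tuples in an agent's offline dataset.

    `dataset` is an iterable of (s, a_i, r, s_next) tuples with hashable
    states and actions; rewards are ignored for the transition estimate.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, _, s_next in dataset:
        counts[(s, a)][s_next] += 1  # N(s, a_i, s')
    probs = {}
    for sa, nexts in counts.items():
        total = sum(nexts.values())  # sum over all observed next states
        probs[sa] = {s_next: n / total for s_next, n in nexts.items()}
    return probs
```

Each agent builds this estimate from its own dataset only, which is exactly why the estimate can deviate from the transitions induced by the other agents' learned policies.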
To intuitively illustrate the miscoordination caused by the transition bias, we devise offline datasets in a matrix game for two agents, with the payoff depicted in Table 1. The behavior policies of the two agents place most of their probability mass on suboptimal actions. Table 2 shows the transition probabilities and expected returns calculated by the independent agents from the datasets. Since the datasets are collected by poor behavior policies, when one agent chooses the optimal action, the other agent chooses a suboptimal action with high probability, which leads to low transition probabilities for high-value next states. Thus, the agents underestimate the optimal actions and converge to suboptimal policies rather than the optimal joint policy.
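The underestimation can be reproduced numerically. The payoff matrix and the partner's behavior policy below are hypothetical stand-ins (not the values of Table 1); the point is only that averaging over a poor partner policy makes the optimal action look worse than a safe suboptimal one:

```python
import numpy as np

# Hypothetical 2x2 payoff: joint action (0, 0) is optimal,
# but miscoordination is heavily penalized.
payoff = np.array([[8.0, -12.0],
                   [-12.0,  0.0]])

# Poor behavior policy of the other agent: it rarely plays its optimal action 0.
other_policy = np.array([0.1, 0.9])

# From agent 1's decentralized view, the expected return of each of its actions
# is averaged over the other agent's behavior policy (the "transition" it sees).
q_agent1 = payoff @ other_policy
# The optimal action 0 now looks worse than the suboptimal action 1,
# so independent Q-learning on this dataset converges to miscoordination.
```

With these numbers, agent 1 estimates action 0 at $0.1 \cdot 8 + 0.9 \cdot (-12) = -10$ and action 1 at $-1.2$, so it prefers the suboptimal action.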
3.2 Importance Weights
3.2.1 Value Deviation
If the behavior policies of some agents are low-performing during data collection, they often take suboptimal actions while other agents take optimal actions, which leads to high transition probabilities for low-value next states. When agent $i$ performs Q-learning on the dataset $\mathcal{D}_i$, the Bellman operator is approximated with the transition probability $P_{\mathcal{D}_i}$ to estimate the expectation over $s'$:
$$\mathcal{T}Q_i(s,a_i) = r + \gamma \sum_{s'} P_{\mathcal{D}_i}(s'|s,a_i) \max_{a_i'} Q_i(s',a_i').$$
If $P_{\mathcal{D}_i}(s'|s,a_i)$ for a high-value $s'$ is lower than $P_i(s'|s,a_i)$, the Q-value of this $(s,a_i)$ pair is underestimated, which causes a large extrapolation error and guides the agent toward convergence to a suboptimal policy.
As discussed before, during execution the transition probability viewed by agent $i$ is $P_i$, and since the environment is deterministic, $P_i$ depends only on the learned policies of the other agents, which are unavailable. However, since the policies of the other agents are also updated toward maximizing the Q-values, $P_i$ of high-value next states would grow higher than $P_{\mathcal{D}_i}$. Based on this intuition, we let each agent be optimistic toward other agents and modify $P_{\mathcal{D}_i}$ as
$$\hat{P}_i(s'|s,a_i) = \frac{1}{Z}\left(1 + V_i(s') - \mathbb{E}_{s''\sim P_{\mathcal{D}_i}}\!\left[V_i(s'')\right]\right) P_{\mathcal{D}_i}(s'|s,a_i),$$
where the state value is $V_i(s') = \max_{a_i'} Q_i(s',a_i')$, the term $V_i(s') - \mathbb{E}_{s''\sim P_{\mathcal{D}_i}}[V_i(s'')]$ is the deviation of the value of the next state from the expected value over all next states, which increases the transition probabilities of high-value next states and decreases those of low-value next states, and $Z$ is a normalization term that makes the transition probabilities sum to one. Value deviation makes the transition probability close to $P_i$ and hence decreases the extrapolation error. The optimism toward other agents helps the agents escape from local optima and discover potential optimal actions that are hidden by the poor behavior policies.
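A minimal sketch of value deviation, assuming the multiplicative form described above; the clipping of the deviation factor to $[1-b, 1+b]$ anticipates the practical stabilization described later, and all names are ours:

```python
import numpy as np

def value_deviation_weights(p_next, v_next, b=0.5):
    """Optimistically reweight estimated next-state probabilities.

    p_next: empirical P_D(s'|s,a_i) over the candidate next states.
    v_next: estimated state values V_i(s') for those next states.
    b:      optimism level; the deviation factor is clipped to [1-b, 1+b].
    """
    expected_v = np.dot(p_next, v_next)            # E_{s'~P_D}[V_i(s')]
    lam = np.clip(1.0 + (v_next - expected_v), 1.0 - b, 1.0 + b)
    p_mod = lam * p_next                           # boost high-value s'
    return p_mod / p_mod.sum()                     # normalization term Z
```

High-value next states gain probability mass at the expense of low-value ones, moving the estimate toward the transitions induced by improved partner policies.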
3.2.2 Transition Normalization
In real-world applications, the action distribution of a behavior policy might be unbalanced, which biases the transition probabilities, e.g., the transition probabilities of one agent (i.e., the action distribution of the other agent) in Table 1. If the transition probability of a high-value next state is extremely low, value deviation cannot correct the underestimation. Moreover, since each $\mathcal{D}_i$ is individually collected by a different behavior policy, the diversity in the agents' transition probabilities means that the value of the same state will be overestimated by some agents while underestimated by others. Since the agents are trained to reach high-value states, large divergences in state values cause miscoordination of the learned policies. To overcome these problems, we normalize the biased transition probability to be uniform over next states,
$$\bar{P}_i(s'|s,a_i) = \frac{1}{N(s,a_i)},$$
where the normalization term $N(s,a_i)$ is the number of distinct next states $s'$ given $(s,a_i)$ in $\mathcal{D}_i$. Transition normalization enforces that each agent has the same transition probability when it takes its learned action at the same state $s$, and we have the following proposition.
Proposition 1. In episodic environments, if each agent $i$ performs Q-learning on $\mathcal{D}_i$, all agents will converge to the same state value $V(s)$ if they have the same transition probability $P(s'|s,\pi_i(s))$ at any state $s$ where each agent $i$ takes its learned action $\pi_i(s)$.
The proof is provided in Appendix A. ∎
However, to make $P(s'|s,\pi_i(s))$ identical across agents for all $s$, the agents should have the same set of next states $s'$ at each $s$, which is a strong assumption. In practice, although the assumption is not strictly satisfied, transition normalization still normalizes the biased transition distribution, encouraging the agents' estimated state values to be close to each other.
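Transition normalization itself is a one-liner. The sketch below (an illustrative helper, names are ours) replaces the empirical next-state distribution with a uniform one over the next states observed in the dataset:

```python
def normalize_transitions(next_state_counts):
    """Replace empirical next-state probabilities with a uniform distribution.

    next_state_counts: dict mapping s' -> count of (s, a_i, s') in the dataset.
    Returns P(s'|s,a_i) = 1/N, where N is the number of distinct observed s'.
    """
    n = len(next_state_counts)
    return {s_next: 1.0 / n for s_next in next_state_counts}
```

Even a heavily skewed empirical distribution (e.g., 99 visits to one next state and 1 to another) becomes uniform, so agents with differently skewed datasets obtain the same normalized transition probabilities over shared next states.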
3.2.3 Optimization Objective
We combine value deviation, denoted as $\lambda_i = 1 + V_i(s') - \mathbb{E}_{s''\sim P_{\mathcal{D}_i}}[V_i(s'')]$, and transition normalization, denoted as $\nu_i = \frac{1}{N(s,a_i)\, P_{\mathcal{D}_i}(s'|s,a_i)}$, and modify $P_{\mathcal{D}_i}$ as
$$\hat{P}_i(s'|s,a_i) = \frac{1}{Z}\,\lambda_i\,\nu_i\, P_{\mathcal{D}_i}(s'|s,a_i),$$
where $Z$ is the normalization term. In a sense, $\hat{P}_i$ makes offline learning on $\mathcal{D}_i$ similar to online decentralized MARL. In the initial stage, $\lambda_i$ is close to $1$ since $Q_i$ has not yet been updated, and the transition probabilities are uniform, meaning other agents act as if randomly. During training, the transition probabilities of high-value states gradually grow under value deviation, which is an analogy to other agents improving their policies in online learning. Starting from the normalized transition probabilities and changing them following the same optimism principle, the agents increase the values of potential optimal actions optimistically and unanimously, and build a consensus about value estimates. Therefore, transition normalization and value deviation encourage the agents to learn high-performing policies and help the emergence of coordination. Moreover, although $\hat{P}_i$ is non-stationary (i.e., it changes over updates of the Q-value), we have the following theorem about the convergence of the Bellman operator under $\hat{P}_i$:
Theorem 1. Under the non-stationary transition probability $\hat{P}_i$, the Bellman operator is a contraction and converges to a unique fixed point when $\gamma < r_{\min}/r_{\max}$, if the reward is bounded by the positive region $[r_{\min}, r_{\max}]$.
The proof is provided in Appendix A. ∎
As any positive affine transformation of the reward function does not change the optimal policy Zhang et al. (2021), Theorem 1 holds in general. Moreover, we could rescale the reward to make $r_{\min}$ arbitrarily close to $r_{\max}$ so as to obtain a high upper bound on $\gamma$.
In deep reinforcement learning, directly modifying the transition probability is infeasible. However, we can modify the sampling probability to achieve the same effect. The optimization objective of decentralized deep Q-learning is calculated by sampling a batch from $\mathcal{D}_i$ according to the sampling probability $p_{\mathcal{D}_i}(s,a_i,r,s')$. By factorizing $p_{\mathcal{D}_i}(s,a_i,r,s')$, we have
$$p_{\mathcal{D}_i}(s,a_i,r,s') = P_{\mathcal{D}_i}(s'|s,a_i)\, p_{\mathcal{D}_i}(s,a_i).$$
Therefore, we can modify the transition probability as $\frac{1}{Z}\lambda_i\nu_i P_{\mathcal{D}_i}(s'|s,a_i)$ and scale $p_{\mathcal{D}_i}(s,a_i)$ with $Z$. Then, the sampling probability can be rewritten as
$$\hat{p}_i(s,a_i,r,s') = \lambda_i\,\nu_i\, p_{\mathcal{D}_i}(s,a_i,r,s').$$
Since $Z$ is independent of $s'$, it can be regarded as a scale factor on $p_{\mathcal{D}_i}(s,a_i)$. Scaling $p_{\mathcal{D}_i}(s,a_i)$ does not change the expected target value, so sampling batches for the update according to the modified sampling probability achieves the same effect as modifying the transition probability. Using importance sampling, the modified optimization objective is
$$\mathbb{E}_{(s,a_i,r,s')\sim\mathcal{D}_i}\!\left[\lambda_i\,\nu_i \left(Q_i(s,a_i) - y\right)^2\right],$$
where $\lambda_i$ and $\nu_i$ can be seen as the weights of the objective function.
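Since $\lambda_i$ and $\nu_i$ act only as per-sample weights, the modified objective reduces to a weighted TD loss. A minimal sketch (names are ours; a real implementation would compute this inside a deep-learning framework and backpropagate through the Q-network):

```python
import numpy as np

def weighted_td_loss(q_pred, q_target, lam, nu):
    """Importance-weighted TD objective: lam (value deviation) and
    nu (transition normalization) weight each sample's squared Bellman error.
    All arguments are arrays over a sampled batch of transitions."""
    w = lam * nu
    return np.mean(w * (q_pred - q_target) ** 2)
```

The same batch sampled uniformly from the dataset can thus emulate sampling from the modified distribution, which is why the method is easy to bolt onto an existing Q-learning pipeline.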
To implement MABCQ in high-dimensional continuous spaces, for each agent $i$ we train a Q-network $Q_i$, a perturbation network $\xi_i$, and a conditional VAE $G_i$. In execution, each agent generates candidate actions with $G_i$, adds small perturbations to the actions using $\xi_i$, and then selects the action with the highest value under $Q_i$. The policy can be written as
$$\pi_i(s) = \operatorname*{argmax}_{a_i^{(j)} + \xi_i(s,\, a_i^{(j)})} Q_i\!\left(s,\; a_i^{(j)} + \xi_i(s, a_i^{(j)})\right), \quad a_i^{(j)} \sim G_i(s).$$
$Q_i$ is updated by minimizing
$$\mathbb{E}_{(s,a_i,r,s')\sim\mathcal{D}_i}\!\left[\lambda_i\,\nu_i \left(Q_i(s,a_i) - y\right)^2\right].$$
The target value $y = r + \gamma Q_i'\!\left(s', \pi_i'(s')\right)$ is calculated by the target networks $Q_i'$ and $\xi_i'$, where $\pi_i'$ is correspondingly the policy induced by $G_i$ and $\xi_i'$.
$\xi_i$ is updated by maximizing
$$\mathbb{E}_{s\sim\mathcal{D}_i,\; a_i\sim G_i(s)}\!\left[Q_i\!\left(s,\, a_i + \xi_i(s, a_i)\right)\right].$$
To estimate $\lambda_i$, we need $V_i(s')$ and $\mathbb{E}_{s''\sim P_{\mathcal{D}_i}}[V_i(s'')]$; the latter can be estimated from the sampled $s'$ without actually going through all next states. We estimate $V_i$ using the target networks to stabilize $\lambda_i$ along with the updates of $Q_i$ and $\xi_i$. To avoid extreme values, we clip $\lambda_i$ to the region $[1-b, 1+b]$, where $b$ is the optimism level.
To estimate $\nu_i$, we train a VAE to model the conditional density of $s'$ given $(s,a_i)$, with latent prior $p(z)$ the density of the unit Gaussian distribution. The conditional density is $p(s'|s,a_i) = \int p(s'|z,s,a_i)\,p(z)\,dz$, and the transition probability is $P_{\mathcal{D}_i}(s'|s,a_i) \approx p(s'|s,a_i)\,\Delta$ when the integral interval $\Delta$ is a small constant. Approximately, we have
$$\nu_i \approx \frac{1}{N(s,a_i)\; p(s'|s,a_i)\; \Delta},$$
and the constant $\Delta$ is absorbed into the normalization term $Z$. In practice, we find that $\nu_i$ falls into a bounded region for almost all samples. For completeness, we summarize the training of MABCQ in Algorithm 1.
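For illustration, the BCQ-style action selection performed by each MABCQ agent can be sketched as follows, with the trained networks passed in as plain callables. This is a simplification with our own names; in the actual implementation `vae_sample`, `perturb`, and `q_value` would be the trained $G_i$, $\xi_i$, and $Q_i$:

```python
import numpy as np

def select_action(state, vae_sample, perturb, q_value, n_candidates=10):
    """Generate candidates from the VAE, perturb them, pick the best by Q."""
    # Sample candidate actions that stay close to the behavior distribution.
    candidates = [vae_sample(state) for _ in range(n_candidates)]
    # Add a small learned perturbation to each candidate for diversity.
    perturbed = [a + perturb(state, a) for a in candidates]
    # Select the candidate with the highest estimated Q-value.
    values = [q_value(state, a) for a in perturbed]
    return perturbed[int(np.argmax(values))]
```

Restricting the candidates to VAE samples is what keeps the selected action in-distribution, while the Q-network only ranks among those safe candidates.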
4.1 Environments and Datasets
To evaluate the effectiveness of MABCQ in high-dimensional complex environments, we adopt multi-agent MuJoCo de Witt et al. (2020), which splits the original action space of the MuJoCo tasks Todorov et al. (2012); Brockman et al. (2016) into several sub-spaces (MIT License). We consider four tasks: HalfCheetah, Walker, Hopper, and Ant. As illustrated in Figure 1, different colors indicate different agents. Each agent independently controls one or several joints of the robot and can observe the state and reward of the robot, which are defined as in the original tasks.
For each environment, we collect datasets for the agents. Each dataset contains 1 million transitions. For data collection, we train a joint policy, which controls all joints, using the SAC algorithm Haarnoja et al. (2018) provided by OpenAI Spinning Up Achiam (2018) (MIT License), and store an intermediate policy and an expert policy during training. The offline dataset $\mathcal{D}_i$ is a mixture of four parts: transitions split from the experiences generated by the SAC agent early in training; transitions generated when agent $i$ acts the intermediate policy while the other agents act the expert policy; transitions generated when agent $i$ performs the expert policy while the other agents act the intermediate policy; and transitions generated when all agents perform the expert policy. For the last three parts, we add small noise to the policies to increase the diversity of the dataset.
We compare MABCQ against the following methods:
MABCQ w/o $\lambda$. Removing value deviation $\lambda$ from MABCQ.
MABCQ w/o $\nu$. Removing transition normalization $\nu$ from MABCQ.
BCQ. Removing both $\lambda$ and $\nu$ from MABCQ.
DDPG Lillicrap et al. (2016). Each agent is trained using independent DDPG on the offline dataset $\mathcal{D}_i$, without action constraints or transition probability modification.
Behavior. Each agent takes the action generated by the VAE $G_i$.
The baselines have the same neural network architectures and hyperparameters as MABCQ. All the models are trained for five runs with different random seeds. All the learning curves are plotted with mean and standard deviation. More details about the experimental settings and hyperparameters are available in Appendix D.
4.2 Performance and Ablation
Figure 2 shows the learning curves of all the methods in the four tasks. Without action constraints or transition probability modification, DDPG severely suffers from the large extrapolation error and can hardly improve its performance throughout training. BCQ outperforms the behavior policies but only reaches mediocre performance. In Figure 2(a) and Figure 2(c), the performance of BCQ even descends in the later stage of learning. During the collection of $\mathcal{D}_i$, when agent $i$ takes a “good” action, the other agents usually take “bad” actions, making BCQ underestimate the “good” actions, especially in the later stage. The learning curves of MABCQ w/o $\lambda$ are similar to those of BCQ in the first three tasks. That is because, using only transition normalization, the other agents' policies are assumed to be random, which is far from their learned policies and leads to large extrapolation errors. But in Ant, MABCQ w/o $\lambda$ outperforms BCQ in the later stage, which is attributed to the value consensus built by the normalized transition probabilities. By optimistically increasing the transition probabilities of high-value next states, MABCQ w/o $\nu$ encourages the agents to learn potential optimal actions and clearly boosts the performance. MABCQ combines the advantages of both value deviation and transition normalization and outperforms all the other baselines.
To interpret the effectiveness of transition normalization, we select a subset of states and calculate the difference in the agents' value estimates on this subset. The results are illustrated in Figure 3. The value difference of MABCQ is lower than that of MABCQ w/o $\nu$, which verifies that transition normalization decreases the difference in value estimates among agents. If there is a consensus among agents about which states are high-value, the agents will select the actions that most likely lead to the common high-value states. This promotes the coordination of policies and helps MABCQ outperform MABCQ w/o $\nu$.
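One way to measure such disagreement (an illustrative stand-in, not necessarily the paper's exact metric) is the per-state gap between the highest and lowest value estimates across agents, averaged over the selected subset:

```python
import numpy as np

def value_disagreement(values_per_agent):
    """Mean over states of the gap between the highest and lowest value
    estimates across agents.

    values_per_agent: array-like of shape (n_agents, n_states), where
    entry [i, s] is agent i's estimated value of state s.
    """
    v = np.asarray(values_per_agent, dtype=float)
    return float(np.mean(v.max(axis=0) - v.min(axis=0)))
```

A lower score indicates a stronger consensus among the agents about which states are valuable, which is the quantity transition normalization is designed to reduce.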
4.3 Hyperparameter Tuning
The optimism level $b$ controls the strength of value deviation. If $b$ is too small, value deviation has a weak effect on the objective function. But if $b$ is too large, the agent becomes over-optimistic about other agents' learned policies, which can result in large extrapolation errors and uncoordinated policies. Figure 4 shows the learning curves of MABCQ with different $b$. It is commonly observed that increasing $b$ elevates the performance, especially in HalfCheetah. However, in Walker and Ant, with a large $b$ the performance slightly drops due to over-optimism. Even so, with any positive $b$, MABCQ does not underperform the version without value deviation.
5 Conclusion and Discussion
In this paper, we proposed MABCQ for offline and fully decentralized multi-agent reinforcement learning. MABCQ modifies the transition probability by value deviation that increases the transition probabilities of high-value next states, and by transition normalization that normalizes the biased transition probabilities. Mathematically, we show that under the non-stationary transition probability after modification, offline decentralized Q-learning converges to a unique fixed point. Empirically, we show that MABCQ could help the agents escape from the suboptimum, learn coordinated policies, and greatly outperform the baselines in a variety of multi-agent offline datasets.
Although we consider the setting where each agent can get the state of the environment, MABCQ could also be potentially applied to partially observable environments where transitions are defined on partial observations. However, if the partial observability is too limited, Q-value cannot be accurately estimated from the observation. Many MARL methods adopt recurrent neural networks to utilize the history information, which however is impractical if the timestamp is not included in the dataset. Moreover, if a state is estimated as a high-value state by an agent but not included in the datasets of other agents, the other agents cannot learn corresponding optimal actions to cooperate with that agent at that state, and thus miscoordination may occur. In this case, each agent should conservatively estimate the values of states that are absent from the datasets of other agents to avoid miscoordination. We leave the observation limitation and the state absence to future work.
References

- Achiam (2018) Spinning Up in Deep Reinforcement Learning.
- Bhalla et al. (2020) Deep multi-agent reinforcement learning for autonomous driving. In Canadian Conference on Artificial Intelligence (Canadian AI).
- Brockman et al. (2016) OpenAI Gym.
- Das et al. (2019) TarMAC: Targeted multi-agent communication. In International Conference on Machine Learning (ICML).
- de Witt et al. (2020) Deep multi-agent reinforcement learning for decentralized continuous cooperative control. arXiv preprint arXiv:2003.06709.
- Ding et al. (2020) Learning individually inferred communication for multi-agent cooperation. In Advances in Neural Information Processing Systems (NeurIPS).
- Foerster et al. (2018) Counterfactual multi-agent policy gradients. In AAAI Conference on Artificial Intelligence (AAAI).
- Foerster et al. (2017) Stabilising experience replay for deep multi-agent reinforcement learning. In International Conference on Machine Learning (ICML).
- Fujimoto et al. (2019) Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning (ICML).
- Haarnoja et al. (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML).
- Iqbal and Sha (2019) Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning (ICML).
- Jaques et al. (2019) Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456.
- Kingma and Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Kumar et al. (2019) Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems (NeurIPS).
- Levine et al. (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- Lillicrap et al. (2016) Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR).
- Lowe et al. (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (NeurIPS).
- Oliehoek and Amato (2016) A Concise Introduction to Decentralized POMDPs. Springer.
- Omidshafiei et al. (2017) Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In International Conference on Machine Learning (ICML).
- Palmer et al. (2018) Lenient multi-agent deep reinforcement learning. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS).
- Rashid et al. (2018) QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning (ICML).
- Schulman et al. (2015) Trust region policy optimization. In International Conference on Machine Learning (ICML).
- Sohn et al. (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems (NeurIPS).
- Son et al. (2019) QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning (ICML).
- Sunehag et al. (2018) Value-decomposition networks for cooperative multi-agent learning based on team reward. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
- Todorov et al. (2012) MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
- Xu et al. (2021) Hierarchically and cooperatively learning traffic signal control. In AAAI Conference on Artificial Intelligence (AAAI).
- Zhang et al. (2021) BRAC+: Going deeper with behavior regularized offline reinforcement learning.
- Zhang et al. (2018) Finite-sample analysis for decentralized batch multi-agent reinforcement learning with networked agents. arXiv preprint arXiv:1812.02783.
Appendix A Proofs
Proposition 1. In episodic environments, if each agent $i$ performs Q-learning on $\mathcal{D}_i$, all agents will converge to the same state value $V(s)$ if they have the same transition probability $P(s'|s,\pi_i(s))$ at any state $s$ where each agent $i$ takes its learned action $\pi_i(s)$.
Considering the two-agent case, we define $\Delta V(s) = V_1(s) - V_2(s)$ as the difference in the state values of the two agents.
For the terminal state $s_T$, we have $\Delta V(s_T) = 0$. If $P(s'|s,\pi_1(s)) = P(s'|s,\pi_2(s))$ for all $s$, recursively expanding the term backward from the terminal state, we arrive at $\Delta V(s) = 0$ for all $s$. We can easily show that this also holds in the $n$-agent case. ∎
Theorem 1. Under the non-stationary transition probability $\hat{P}_i$, the Bellman operator is a contraction and converges to a unique fixed point when $\gamma < r_{\min}/r_{\max}$, if the reward is bounded by the positive region $[r_{\min}, r_{\max}]$.
We initialize the Q-value to be bounded. Since the reward is bounded by the positive region $[r_{\min}, r_{\max}]$, the Q-value under the operator is bounded to $\left[\frac{r_{\min}}{1-\gamma}, \frac{r_{\max}}{1-\gamma}\right]$. Based on the definition of $\lambda_i$, it can be written as $1 + V_i(s') - \mathbb{E}_{s''\sim P_{\mathcal{D}_i}}[V_i(s'')]$. Then, we have the following,
The third term of the penultimate line is because: if ,