1 Introduction
Single-agent deep reinforcement learning has achieved impressive performance in many domains, including playing Go [Silver16, Silver17] and Atari games [dqn1, dqn2]. However, many real-world problems, such as traffic congestion reduction [Bazzan08, Sunehag18], antenna tilt control [Dandanov17], and dynamic resource allocation [Nguyen18], are more naturally modeled as multiagent systems. Unfortunately, directly deploying single-agent reinforcement learning to each agent in a multiagent system does not result in satisfactory performance [Tang93, maddpg]. In particular, in multiagent reinforcement learning [Nguyen18, maddpg, Foerster17, Foerster18, Iqbal19, Jiang18, Das19, Foerster16, Kim19, Shu19, Han19], estimating the value function is challenging because the environment is nonstationary from the perspective of an individual agent [maddpg, Foerster17]. To alleviate the issue, multiagent deep deterministic policy gradient (MADDPG) [maddpg] recently proposed a centralized critic whose input is the concatenation of all agents’ observations and actions. Similar to MADDPG, Foerster17, Foerster18, Kim19, Jiang18, Das19, Iqbal19, and Yang18 also deploy centralized critics to handle a nonstationary environment.
However, concatenating all agents’ observations and actions assigns an implicit order, i.e., the placement of an agent’s observations and actions will make a difference in the predicted outcome. Consider the case of two homogeneous agents and let us denote the action and observation of the two agents with ‘A’ and ‘B.’ There exist two equally valid permutations, (AB) and (BA), which represent the environment. Using a permuted input in classical deep nets will result in a different output. Consequently, referring to the same underlying state of the environment with two different vector representations makes learning of the critic sample-inefficient: the deep net needs to learn that both representations are identical. Because the number of possible permutations increases with the number of agents, this representational inconsistency worsens as the number of agents grows.
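The ordering issue can be illustrated with a small numerical sketch. The snippet below is not the authors' code; it assumes a tiny, randomly initialized two-layer MLP (the names `mlp_critic`, `W1`, and `W2` are hypothetical) and feeds it the two orderings (AB) and (BA) of the same two agents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two homogeneous agents; each contributes a 3-dim feature vector
# (its observation and action stacked together).
agent_a = rng.normal(size=3)
agent_b = rng.normal(size=3)

# A tiny, randomly initialized MLP critic on the concatenated input
# (hypothetical weights, for illustration only).
W1 = rng.normal(size=(6, 8))
W2 = rng.normal(size=(8, 1))

def mlp_critic(x):
    """Two-layer MLP with a ReLU; returns a scalar value estimate."""
    return (np.maximum(x @ W1, 0.0) @ W2).item()

q_ab = mlp_critic(np.concatenate([agent_a, agent_b]))  # ordering (AB)
q_ba = mlp_critic(np.concatenate([agent_b, agent_a]))  # ordering (BA)

# Same environment state, two generally different critic outputs.
print(q_ab, q_ba)
```

With generic weights the two printed values differ, so the network would have to learn from data that both inputs describe the same state.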
To address this concern, we propose the ‘permutation invariant critic’ (PIC). Due to the permutation invariance property of PICs, the same environment state will result in the same critic output, irrespective of the agent ordering (as shown in fig:idea_fig). In addition, to tackle environments with homogeneous and heterogeneous agents (i.e., agents which have different action spaces, observation spaces, or play different roles in a task), we augment PICs with attributes. This enables the proposed PIC to model the relation between heterogeneous agents.
For rigorous results we follow the strict evaluation protocol proposed by Henderson17 and Colas18 when performing experiments in multiagent particle environments (MPEs) [maddpg, Mordatch17]. We found that permutation invariant critics result in higher average test episode rewards than the MADDPG baseline [maddpg]. Furthermore, we scaled the MPE to 200 agents. Our permutation invariant critic successfully learns the desired policy in environments with a large number of agents, while the baseline MADDPG [maddpg] fails to develop any useful strategies.
In summary, our main contributions are as follows: a) We develop a permutation invariant critic (PIC) for multiagent reinforcement learning algorithms. Compared with classic MLP critics, the PIC achieves better sample efficiency and scalability. b) To deal with heterogeneous agents we study adding attributes. c) We speed up the multiagent particle environment (MPE) [maddpg, Mordatch17] by a factor of 30. This permits scaling the number of agents to 200, roughly 30 times as many as in the original MPE environment (6 agents). Code is available at https://github.com/IouJenLiu/PIC.
2 Related Work
We briefly review graph neural nets, which are the building block of PICs, and multiagent deep reinforcement learning algorithms with centralized critics.
Graph Neural Networks. Graph neural networks are deep nets which operate on graph structured data [scarselli2009graph]. The input to the network is hence a set of node vectors and connectivity information about the nodes. Notably, these graph networks are permutation equivariant, i.e., the ordering of the nodes in a vector representation does not change the underlying graph [zaheer2017deep]. Many variants of graph networks exist, for example, Graph Convolutional Nets (GCN) [kipf2017semi], the Message Passing Network [gilmer2017neural], and others [zaheer2017deep, qi2017pointnet]. The relations and differences between these approaches are reviewed in [battaglia2018relational]. The effectiveness of graph nets has been shown on tasks such as link prediction [schlichtkrull2018modeling, zhang2018link], node classification [hamilton2017inductive, kipf2017semi], language and vision [NarasimhanNIPS2018, SchwartzCVPR2019a], graph classification [yeh2019diverse, duvenaud2015convolutional, zhang2018end, ying2018hierarchical], etc. Graph nets also excel on point sets, e.g., [zaheer2017deep, qi2017pointnet]. Most relevant to our multiagent reinforcement learning setting, graph networks have been shown to be effective in modeling and reasoning about physical systems [battaglia2016interaction] and multiagent sports dynamics [hoshen2017vain, kipf2018neural, yeh2019diverse]. Different from these works, here, we study the effectiveness of graph nets for multiagent reinforcement learning.
Multiagent Reinforcement Learning. To deal with nonstationary environments from the perspective of a single agent, MADDPG [maddpg] uses a centralized critic that operates on all agents’ observations and actions. Similar to MADDPG, Foerster18 use a centralized critic. In addition, to handle the credit assignment problem [Nguyen18, Panait05, Chang03], a counterfactual baseline has been proposed to marginalize one agent’s action and keep the other agents’ actions fixed. In “Monotonic Value Function Factorisation” (QMIX) [qmix] each agent maintains its own value function which conditions only on the agent’s local observation. The overall value function is estimated via a nonlinear combination of the individual agents’ value functions. Iqbal19 propose an attention mechanism which enables the centralized critic to select relevant information for each agent. However, as discussed in sec:intro, the output of centralized critics parameterized by classic deep nets differs if the same environment state is encoded with a permuted vector. This makes learning inefficient.
“Graph convolutional RL” (DGN) [Jiang19] is concurrent work on arXiv which uses a nearest-neighbor graph net as the Q-function of a deep Q-network (DQN) [dqn1, dqn2]. However, the nearest-neighbor graph net only has access to local information of an environment. Consequently, the Q-function in DGN is not a fully centralized critic. Therefore, it suffers from the nonstationary environment issue [maddpg, Foerster17]. In addition, due to the DQN formulation, DGN can only be used in environments with discrete action spaces. Note that DGN considers homogeneous cooperative agents and leaves environments with heterogeneous cooperative agents to future work. In contrast, our permutation invariant critic is fully centralized and can be scaled to a large number of agents. Thanks to different node attributes, our approach can handle environments with heterogeneous cooperative agents. In addition, our approach is designed for continuous state and action spaces.
3 Preliminaries
3.1 Deep Deterministic Policy Gradient
In classic single-agent reinforcement learning, an agent interacts with the environment and collects rewards over time. Formally, at each timestep $t \in \{1, \dots, H\}$, with $H$ the horizon, the agent finds itself in state $s^t \in \mathcal{S}$ and selects an action $a^t \in \mathcal{A}$ according to a deterministic policy $\mu$. Hereby, $\mathcal{S}$ is the state space, and $\mathcal{A}$ is the action space. Upon executing an action, the agent transitions to the next state $s^{t+1}$ and obtains a scalar reward $r^t$. Given a trajectory $\{(s^t, a^t, r^t)\}_{t=1}^{H}$ of length $H$ collected by following a policy, we obtain the discounted return $R = \sum_{t=1}^{H} \gamma^t r^t$, where $\gamma$ is the discount factor. The goal of reinforcement learning is to find a policy which maximizes the return $R$.
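As a small worked example of the discounted return, the sketch below assumes the return is the sum of rewards weighted by powers of the discount factor, with timesteps counted from one. The function name `discounted_return` and the reward values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r^t over a reward sequence r^1, ..., r^H."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))

# Three-step trajectory with made-up rewards.
rewards = [1.0, 0.0, 2.0]
ret = discounted_return(rewards, gamma=0.5)
# 0.5 * 1.0 + 0.25 * 0.0 + 0.125 * 2.0 = 0.75
print(ret)
```

A smaller discount factor shrinks the contribution of late rewards, which is why the last reward of 2.0 adds only 0.25 here.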
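Deep deterministic policy gradient (DDPG) [ddpg] is a widely used deep reinforcement learning algorithm for continuous control. In DDPG, a deterministic policy $\mu_\theta$, which is parameterized by $\theta$, maps a state $s$ to an action $a$. A critic $Q_\phi$, which is parameterized by $\phi$, is deployed to estimate the return of taking action $a$ at state $s$. The parameters $\theta$ of the actor policy are updated iteratively so that the action $a = \mu_\theta(s)$ maximizes the critic $Q_\phi$, i.e.,
$$\max_{\theta} J(\theta) := \mathbb{E}_{s \sim \mathcal{D}}\left[ Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \right]. \tag{1}$$
Here $s$ is drawn from the replay buffer $\mathcal{D}$ which stores experience tuples $(s, a, r, s')$, i.e., $\mathcal{D} = \{(s, a, r, s')\}$. Using the chain rule, the gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \right].$$
To optimize $\phi$, similar to deep Q-learning [dqn1, dqn2], we minimize the loss
$$\mathcal{L}(\phi) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[ \left( Q_\phi(s, a) - y \right)^2 \right]. \tag{2}$$
The experience tuple $(s, a, r, s')$ is drawn from the replay buffer $\mathcal{D}$ and the target value $y$ is defined as
$$y = r + \gamma \, Q_{\phi^-}(s', a')\big|_{a' = \mu_\theta(s')}, \tag{3}$$
where $Q_{\phi^-}$ is a recent Q-network parameterized by a past $\phi^-$.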
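The critic update can be sketched numerically as follows. This is a minimal illustration rather than the DDPG implementation: it assumes a linear critic and a tanh-squashed linear actor, uses random data in place of a replay buffer, and picks a hypothetical discount factor. The target is the reward plus the discounted value of a past critic at the next state, and the loss is the squared difference between the current critic and that target.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, batch = 4, 2, 5
gamma = 0.95  # hypothetical discount factor

# Hypothetical linear actor and critic parameters.
theta = rng.normal(size=(state_dim, action_dim))   # actor parameters
phi = rng.normal(size=state_dim + action_dim)      # critic parameters
phi_past = phi.copy()                              # past critic parameters

def actor(s):
    """Deterministic policy: maps states to bounded actions."""
    return np.tanh(s @ theta)

def critic(s, a, params):
    """Linear value estimate for each (state, action) pair."""
    return np.concatenate([s, a], axis=1) @ params

# A mini-batch of experience tuples (s, a, r, s') standing in for
# samples from the replay buffer.
s = rng.normal(size=(batch, state_dim))
a = rng.normal(size=(batch, action_dim))
r = rng.normal(size=batch)
s_next = rng.normal(size=(batch, state_dim))

# Target value: reward plus discounted past-critic value at the next
# state, with the next action chosen by the current policy.
y = r + gamma * critic(s_next, actor(s_next), phi_past)

# Critic loss: mean squared error between Q(s, a) and the target.
loss = np.mean((critic(s, a, phi) - y) ** 2)
print(loss)
```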
3.2 Multiagent Markov Decision Process
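To extend single-agent reinforcement learning to the multiagent setting, we first define the multiagent Markov decision process (MDP). We consider partially observable multiagent MDPs [Littman94]. An $N$-agent partially observable multiagent MDP is defined by a transition function $\mathcal{T}$, a set of reward functions $\{R_1, \dots, R_N\}$, a state space $\mathcal{S}$, a set of observation spaces $\{\mathcal{O}_1, \dots, \mathcal{O}_N\}$, and a set of action spaces $\{\mathcal{A}_1, \dots, \mathcal{A}_N\}$. Action space $\mathcal{A}_i$, observation space $\mathcal{O}_i$, and reward function $R_i$ correspond to agent $i \in \{1, \dots, N\}$. The transition function $\mathcal{T}$ maps the current state and the actions taken by all the agents to a next state, i.e., $\mathcal{T}: \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \rightarrow \mathcal{S}$. Each agent receives a reward $R_i: \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \rightarrow \mathbb{R}$ and an observation $o_i$ that is related to the state, i.e., $o_i: \mathcal{S} \rightarrow \mathcal{O}_i$.
The goal of agent $i$ is to maximize the expected return $\sum_{t=1}^{H} \gamma^t r_i^t$. Note that the goal of cooperative agents is to maximize the collective expected return $\sum_{t=1}^{H} \gamma^t \sum_{i=1}^{N} r_i^t$.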
3.3 Multiagent Deep Deterministic Policy Gradient
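In this paper, we study multiagent reinforcement learning using the decentralized execution and centralized training framework [Foerster16, maddpg]. Multiagent deep deterministic policy gradient (MADDPG) [maddpg] is a well-established algorithm for this framework. Consider $N$ agents with policies $\{\mu_{\theta_1}, \dots, \mu_{\theta_N}\}$, which are parameterized by $\{\theta_1, \dots, \theta_N\}$. The centralized critics $\{Q_{\phi_1}, \dots, Q_{\phi_N}\}$ associated with the $N$ agents are parameterized by $\{\phi_1, \dots, \phi_N\}$. Following DDPG, the parameters $\theta_i$ for policy $\mu_{\theta_i}$ are updated iteratively so that the associated critic $Q_{\phi_i}$ is optimized via
$$\max_{\theta_i} J(\theta_i) := \mathbb{E}_{x, a \sim \mathcal{D}}\left[ Q_{\phi_i}(x, a)\big|_{a_i = \mu_{\theta_i}(o_i)} \right], \tag{4}$$
where $x$ and $a$ are the concatenation of all agents’ observations and actions at timestep $t$, i.e., $x = (o_1, \dots, o_N)$ and $a = (a_1, \dots, a_N)$. Note that $o_i$ is the observation received by agent $i$ at timestep $t$. Using the chain rule, the gradient is derived as follows:
$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{x, a \sim \mathcal{D}}\left[ \nabla_{\theta_i} \mu_{\theta_i}(o_i) \, \nabla_{a_i} Q_{\phi_i}(x, a)\big|_{a_i = \mu_{\theta_i}(o_i)} \right]. \tag{5}$$
Following DDPG, the centralized critic parameters $\phi_i$ are optimized by minimizing the loss
$$\mathcal{L}(\phi_i) = \mathbb{E}_{(x, a, r, x') \sim \mathcal{D}}\left[ \left( Q_{\phi_i}(x, a) - y_i \right)^2 \right], \tag{6}$$
where $r$ is the concatenation of rewards received by all agents at timestep $t$, i.e., $r = (r_1, \dots, r_N)$. The target value $y_i$ is defined as follows:
$$y_i = r_i + \gamma \, Q_{\phi_i^-}(x', a')\big|_{a'_j = \mu_{\theta_j}(o'_j)}, \tag{7}$$
where $Q_{\phi_i^-}$ is a Q-network parameterized by a past $\phi_i^-$.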
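To make the centralized critic input concrete, the following sketch builds the concatenated next observation and next action vectors and evaluates one target value per agent. It is an illustration under stated assumptions, not MADDPG's code: policies and critics are hypothetical linear stand-ins, the data is random, and the discount factor is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, act_dim = 3, 4, 2
gamma = 0.95  # hypothetical discount factor

# Hypothetical per-agent linear policies, next observations, and rewards.
policies = [rng.normal(size=(obs_dim, act_dim)) for _ in range(n_agents)]
next_obs = [rng.normal(size=obs_dim) for _ in range(n_agents)]
rewards = rng.normal(size=n_agents)

# Decentralized execution: each agent acts on its own observation only.
next_actions = [np.tanh(o @ w) for o, w in zip(next_obs, policies)]

# Centralized training: the critic sees the concatenation of all
# agents' observations and actions.
x_next = np.concatenate(next_obs)
a_next = np.concatenate(next_actions)
critic_input = np.concatenate([x_next, a_next])

# One (past) centralized critic per agent, here a linear stand-in.
past_critics = [rng.normal(size=critic_input.shape[0]) for _ in range(n_agents)]

# Target for agent i: its own reward plus the discounted critic value.
targets = np.array([rewards[i] + gamma * past_critics[i] @ critic_input
                    for i in range(n_agents)])
print(critic_input.shape, targets)
```

Note that the critic input has `n_agents * (obs_dim + act_dim)` entries, so the first layer of an MLP critic grows with the number of agents.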
4 Permutation Invariant Critic (PIC) and Environment Improvements
We first describe the proposed permutation invariant critic (PIC), then show improvements for MPE.
4.1 Permutation Invariant Critic (PIC)
Consider training $N$ homogeneous cooperative agents using a centralized critic [Nguyen18, maddpg, Foerster17, Foerster18, Iqbal19, Jiang18, Das19, Foerster16, Kim19, Shu19, Han19]. As discussed in sec:back, the input to the centralized critic is the concatenation of all agents’ observations and actions. Let $x \in \mathbb{R}^{N \times K_o}$ denote the concatenation of all agents’ observations at timestep $t$, where $K_o$ is the dimension of each observation $o_i$. Similarly, we represent the concatenation of all agents’ actions at timestep $t$ via $a \in \mathbb{R}^{N \times K_a}$, where $K_a$ is the dimension of each agent’s action $a_i$. Note that concatenating observations and actions implicitly imposes an agent ordering, and any agent ordering is equally plausible. Importantly, shuffling the agents’ observations and actions does not change the state of the environment. One would therefore expect the centralized critic to return the same output if the input is concatenated in a different order. Formally, this property is called permutation invariance, i.e., we strive for a critic $Q$ such that
$$Q(M_1 x, M_1 a) = Q(M_2 x, M_2 a) \quad \forall M_1, M_2 \in \mathcal{M},$$
where $M_1$ and $M_2$ are two permutation matrices from the set of all possible permutation matrices $\mathcal{M}$.
To achieve permutation invariance, we propose to use a graph convolutional neural net (GCN) as the centralized critic. In the remainder of the section, we describe the GCN model in detail and discuss how to deploy the permutation invariant critic to environments with homogeneous and heterogeneous agents.
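Permutation Invariant Critic. We model an $N$-agent environment as a graph. Each node represents an agent, and the edges capture the relations between agents. To compute the scalar output of the critic we use $L$ graph convolution layers $f^{(1)}_{\text{GCN}}, \dots, f^{(L)}_{\text{GCN}}$. A graph convolution layer takes node representations and the graph’s adjacency matrix as input and computes a new representation for each node. More specifically, $f^{(l)}_{\text{GCN}}$ maps the input $h^{(l-1)} \in \mathbb{R}^{N \times K^{(l-1)}}$ to the output $h^{(l)} \in \mathbb{R}^{N \times K^{(l)}}$, where $K^{(l-1)}$ and $K^{(l)}$ are the input and output node representation’s dimension for layer $l$. Formally,
$$h^{(l)} = f^{(l)}_{\text{GCN}}(h^{(l-1)}) := \sigma\left( \frac{1}{N} A_{\text{adj}} \, h^{(l-1)} W^{(l)}_{\text{other}} + I_N \, h^{(l-1)} W^{(l)}_{\text{self}} \right), \tag{8}$$
where $A_{\text{adj}}$ is the graph’s adjacency matrix, $W^{(l)}_{\text{self}}, W^{(l)}_{\text{other}} \in \mathbb{R}^{K^{(l-1)} \times K^{(l)}}$ are the layer’s trainable weight matrices, $\sigma$ is an elementwise nonlinear activation function, and $I_N$ is an identity matrix of size $N \times N$.
Note that in Equation 8, each agent’s representation, i.e., each row of $h^{(l-1)}$, is multiplied with the same set of weights $W^{(l)}_{\text{self}}$ and $W^{(l)}_{\text{other}}$. Due to this weight sharing scheme, and for an adjacency matrix that is itself unchanged by relabeling the agents (such as that of a complete graph), a permutation matrix applied at the input is equivalent to applying it at the output, i.e.,
$$f^{(l)}_{\text{GCN}}(M h^{(l-1)}) = M f^{(l)}_{\text{GCN}}(h^{(l-1)}) \quad \forall M \in \mathcal{M}. \tag{9}$$
Another advantage of the weight sharing scheme is that the number of trainable parameters of PIC does not increase with the number of agents.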
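The equivariance property can be checked numerically. The sketch below is an illustration, not the released implementation: it assumes the layer applies one shared weight matrix to each node's own features and another to the neighbor features aggregated by the adjacency matrix and scaled by the number of agents, followed by a ReLU. The complete-graph adjacency and all variable names are assumptions made for the example.

```python
import numpy as np

def gcn_layer(h, adj, w_self, w_other):
    """ReLU((1/N) * adj @ h @ w_other + h @ w_self), weights shared by all nodes."""
    n = h.shape[0]
    return np.maximum(adj @ h @ w_other / n + h @ w_self, 0.0)

rng = np.random.default_rng(0)
n_agents, k_in, k_out = 4, 5, 3

h = rng.normal(size=(n_agents, k_in))         # one row per agent
w_self = rng.normal(size=(k_in, k_out))
w_other = rng.normal(size=(k_in, k_out))

# Complete graph: all ones with zeros on the diagonal.
adj = np.ones((n_agents, n_agents)) - np.eye(n_agents)

# Random permutation matrix that reorders the agents.
perm = np.eye(n_agents)[rng.permutation(n_agents)]

permute_after = perm @ gcn_layer(h, adj, w_self, w_other)
permute_before = gcn_layer(perm @ h, adj, w_self, w_other)

# Permuting the input rows permutes the output rows in the same way.
equivariant = np.allclose(permute_after, permute_before)
print(equivariant)
```

The weight matrices have shapes that depend only on the feature dimensions, which is why the parameter count is independent of `n_agents`.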
Subsequently, a pooling layer is applied to the $L$-th graph convolutional layer’s representation $h^{(L)}$. Pooling is performed over the agents’ representation, i.e., over the rows of $h^{(L)}$. We refer to the output of the pooling layer as $f_{\text{pool}}(h^{(L)}) \in \mathbb{R}^{K^{(L)}}$. Either average pooling or max pooling is suitable. Average pooling, subsequently denoted $f_{\text{avg}}$, averages the node representations, i.e., $f_{\text{avg}}(h^{(L)}) = \frac{1}{N} \sum_{i=1}^{N} h^{(L)}_i$. Max pooling, subsequently referred to as $f_{\text{max}}$, takes the maximum value across the rows for each of the $K^{(L)}$ columns, i.e., $[f_{\text{max}}(h^{(L)})]_j = \max_i h^{(L)}_{i,j}$. Both max pooling and average pooling satisfy the permutation invariance property as summation and elementwise maximization are commutative operations. Therefore, an $L$-layer graph convolutional net is obtained via $f_{\text{GCN}} := f_{\text{pool}} \circ f^{(L)}_{\text{GCN}} \circ \dots \circ f^{(1)}_{\text{GCN}}$.
Homogeneous setting. If agents in an environment are homogeneous, we first concatenate the observations and actions into a matrix, i.e., $[x, a] \in \mathbb{R}^{N \times (K_o + K_a)}$. Setting $h^{(0)} = [x, a]$ and $K^{(0)} = K_o + K_a$, we construct a permutation invariant critic as follows:
$$Q_{\text{PIC}}(x, a) := f_v\left( f_{\text{GCN}}([x, a]) \right). \tag{10}$$
Hereby $f_v$ maps the output of the graph nets to a real number, which is the estimated scalar critic value for the environment observation $x$ and action $a$. We model $f_v$ with a standard fully connected layer, which maintains permutation invariance because it operates on the already pooled representation. To ensure that the permutation invariant critic is fully centralized, i.e., to ensure that we consider all agents’ actions and observations, we use an adjacency matrix $A_{\text{adj}}$ corresponding to a complete graph, i.e., $A_{\text{adj}}$ is a matrix of all ones with zeros on the diagonal. As mentioned later in sec:exp, we also study other settings but found a complete graph to yield the best results.
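Putting the pieces together, a permutation invariant critic can be sketched as stacked graph convolution layers, a pooling step over agents, and a final fully connected layer. The code below is a minimal NumPy illustration under assumptions (random weights, ReLU activations, a complete graph, and hypothetical names such as `pic_critic`); it is not the authors' PyTorch implementation.

```python
import numpy as np

def gcn_layer(h, adj, w_self, w_other):
    """One graph convolution layer with shared weights and a ReLU."""
    n = h.shape[0]
    return np.maximum(adj @ h @ w_other / n + h @ w_self, 0.0)

def pic_critic(obs, act, layers, w_out, pool="max"):
    """Scalar critic value from per-agent observations and actions."""
    h = np.concatenate([obs, act], axis=1)   # one row per agent
    n = h.shape[0]
    adj = np.ones((n, n)) - np.eye(n)        # complete graph
    for w_self, w_other in layers:
        h = gcn_layer(h, adj, w_self, w_other)
    # Pool over agents (rows); both choices ignore the row order.
    pooled = h.max(axis=0) if pool == "max" else h.mean(axis=0)
    return float(pooled @ w_out)             # fully connected output layer

rng = np.random.default_rng(0)
obs_dim, act_dim, hidden = 4, 2, 8
dims = [obs_dim + act_dim, hidden, hidden]   # two graph convolution layers
layers = [(rng.normal(size=(i, o)), rng.normal(size=(i, o)))
          for i, o in zip(dims[:-1], dims[1:])]
w_out = rng.normal(size=hidden)

obs = rng.normal(size=(5, obs_dim))
act = rng.normal(size=(5, act_dim))
order = rng.permutation(5)

q_original = pic_critic(obs, act, layers, w_out)
q_shuffled = pic_critic(obs[order], act[order], layers, w_out)

# The same weights also accept a different number of agents.
q_more_agents = pic_critic(rng.normal(size=(9, obs_dim)),
                           rng.normal(size=(9, act_dim)), layers, w_out)
print(q_original, q_shuffled, q_more_agents)
```

The first two printed values agree up to floating point error, while an MLP on the flattened input would generally produce different values for the two orderings.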
Heterogeneous setting. Consider cooperative agents, which are divided into multiple groups. In this heterogeneous setting, agents in different groups have different characteristics, e.g., size and speed, or play different roles in the task. In a heterogeneous setting, a permutation invariant critic is not directly applicable, because the relation between two heterogeneous agents differs from the relation between two homogeneous agents. For instance, the interaction between two fast-moving agents differs from the interaction between a fast-moving and a slow-moving agent. However, in the aforementioned critic of Equation 10, relations between all agents are modeled equivalently.
To address this concern, for the heterogeneous setting, we propose to add node attributes to the PIC. With node attributes, the PIC can distinguish agents from different groups. Specifically, for group $j$, we construct a group attribute $g_j \in \mathbb{R}^{K_g}$, where $K_g$ denotes the dimension of the group attribute. Let $h^{(0)}_i$ denote the input representation of agent $i$, i.e., the $i$-th row of $h^{(0)}$. We obtain the augmented representation via $\tilde{h}^{(0)}_i = [h^{(0)}_i, g_{j(i)}]$, where $j(i)$ denotes the group index of agent $i$. We perform the augmentation for each agent to obtain the augmented representation $\tilde{h}^{(0)}$. Using Equation 10 and setting $h^{(0)} = \tilde{h}^{(0)}$ and $K^{(0)} = K_o + K_a + K_g$ results in a PIC that can handle environments with heterogeneous agents.
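The attribute augmentation amounts to appending each agent's group vector to its input row. The following sketch assumes two groups with two-dimensional attributes drawn from a normal distribution; the array names are hypothetical, and in practice the attributes would be trainable parameters rather than fixed random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, feat_dim, attr_dim = 4, 6, 2

# Input representation: one row of observation and action features per agent.
h0 = rng.normal(size=(n_agents, feat_dim))

# Group index of each agent, e.g. 0 = small and fast, 1 = big and slow.
group_index = np.array([0, 0, 1, 1])

# One attribute vector per group (learned jointly with the critic in practice).
group_attr = rng.normal(size=(2, attr_dim))

# Append the attribute of each agent's group to its input row.
h0_aug = np.concatenate([h0, group_attr[group_index]], axis=1)
print(h0_aug.shape)  # (4, 8)
```

Because the attribute travels with the agent's row, reordering the agents reorders the augmented rows consistently, so the pooled critic output remains independent of the ordering.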
4.2 Improved MPE Environment
The multiagent particle environment (MPE) [maddpg, Mordatch17] is a multiagent environment which contains a variety of tasks and provides a challenging open source platform for our community to evaluate new algorithms. However, the MPE targets only settings with a small number of agents. Specifically, we found it challenging to train more than 30 agents as it takes more than 100 hours. To scale the MPE to more agents, we improve the implementation of the MPE. More specifically, we develop vectorized versions for many of the computations, such as computing the force between agents, computing the collision between agents, etc. Moreover, for tasks with global rewards, instead of computing rewards for each agent, we only compute the reward once and send it to all agents. With this improved MPE, we can train up to 200 agents within one day.
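The kind of vectorization described above can be illustrated with pairwise collision detection. The sketch below is not taken from the MPE code base: the function names are hypothetical, all agents are assumed to share one radius, and two agents are assumed to collide when their centers are closer than twice that radius. It contrasts a nested Python loop with an equivalent NumPy broadcasting version.

```python
import numpy as np

def collisions_loop(pos, radius):
    """Pairwise collision flags computed with nested Python loops."""
    n = len(pos)
    hit = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j:
                dist = np.sqrt(((pos[i] - pos[j]) ** 2).sum())
                hit[i, j] = dist < 2 * radius
    return hit

def collisions_vectorized(pos, radius):
    """Same result using broadcasting over all agent pairs at once."""
    diff = pos[:, None, :] - pos[None, :, :]   # (n, n, 2) displacements
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # (n, n) distances
    hit = dist < 2 * radius
    np.fill_diagonal(hit, False)               # an agent does not hit itself
    return hit

rng = np.random.default_rng(0)
positions = rng.uniform(-1.0, 1.0, size=(50, 2))
same = np.array_equal(collisions_loop(positions, 0.05),
                      collisions_vectorized(positions, 0.05))
print(same)
```

The loop executes a number of Python-level iterations that grows quadratically with the number of agents, whereas the vectorized version performs the same arithmetic in a handful of array operations.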
5 Experiments
In this section, we first introduce tasks in the multiagent particle environment (MPE) [maddpg, Mordatch17]. We then present the details of the experimental setup, evaluation protocol, and our results.
MLP Critic  MLP Critic + Data Augmentation  Permutation Invariant Critic (Ours)  
# of agents  final  absolute  final  absolute  final  absolute  
Cooperative Navigation 
N=3  362.73  362.71  361.76  361.55  355.99  355.74 
N=6  3943.2  3933.3  4025.4  4016.2  3383.2  3381.8  
N=15  6489.7  6394.3  6280.0  6222.4  1999.1  1977.6  
N=30  20722  20583  21205  20840  11363  11294  
N=100  128100  128086  128024  128013  71495  71074  
N=200  502349  502348  509963  507457  436215  433846  
Prey & Predator 
N=3  38.52  40.91  43.78  44.99  65.16  67.69 
N=6  26.15  30.34  24.84  17.49  176.70  184.22  
N=15  3982.80  4416.23  4198.41  4401.64  10139  10239  
N=30  377.19  386.49  93.86  75.59  6662.1  6745.6  
N=100  19894  21114  28741  29347  99391  100812  
Cooperative Push 
N=3  171.26  170.20  171.96  171.74  155.17  155.10 
N=6  561.73  542.29  672.80  672.27  401.79  395.70  
N=15  2538.3  2536.0  2645.2  2610.1  2231.1  2225.7  
N=30  3499.7  3465.9  3761.5  3688.3  3117.1  3094.3  
Heterogeneous Navigation 
N=4  100.74  100.35  98.96  98.67  83.84  83.54 
N=8  398.31  397.66  683.24  684.39  398.31  397.66  
N=16  3410.0  3405.9  3479.0  3470.8  1825.8  1820.4  
N=30  12944  12934  12812  12809  6293  6270  
N=100  121563  121472  130028  129436  94324  93996 
Environment. We evaluate the proposed approach on an improved version of MPE [maddpg, Mordatch17]. We consider the following four tasks:


Cooperative navigation: agents move cooperatively to cover landmarks in the environment. The reward encourages the agents to get close to landmarks.

Prey and predator: slower predators work together to chase fast-moving prey. The predators get a positive reward when colliding with prey. Prey are controlled by the environment.

Cooperative push: agents work together to push a large ball to a landmark. The agents get rewarded when the large ball approaches the landmark.

Heterogeneous navigation: small and fast agents and big and slow agents work cooperatively to cover landmarks. The reward encourages the agents to get close to landmarks. If a small agent collides with a big agent, a large negative reward is received.
Note that cooperative navigation, prey and predator, and cooperative push are environments with homogeneous agents. In each task, the reward is global for cooperative agents, i.e., cooperative agents always receive the same reward in each timestep.
Experimental Setup. We use MADDPG [maddpg] with a classic MLP critic as the baseline. We implement MADDPG and the proposed approach in PyTorch [pytorch]. To ensure the correctness of our implementation, we compare it with the official MADDPG code [maddpg_code] on MPE. Our implementation reproduces the results of the official code. Please see tb:base in the supplementary for the results. Following MADDPG [maddpg], the actor policy is parameterized by a two-layer MLP with 128 hidden units per layer and ReLU activation functions. The MLP critic of the MADDPG baseline has the same architecture as the actor. Our permutation invariant critic (PIC) is a two-layer graph convolution net (GCN) with 128 hidden units per layer and max pooling at the top. The activation function for the GCN is also a ReLU. Following MADDPG, the Adam optimizer is used. The learning rates for the actor, MLP critic, and our permutation invariant critic are . The learning rate is linearly decreased to zero at the end of training. Agents are trained for episodes in all tasks (episode length is either or steps). The size of the replay buffer is one million, the batch size is , and the discount factor . In an eight-agent environment, the MLP critic and the PIC have around 44k and 40k trainable parameters respectively. In a 100-agent environment, the MLP critic has 413k trainable parameters while the PIC has only 46k parameters.
Evaluation Protocol. To ensure a fair and rigorous evaluation, we follow the strict evaluation protocols suggested by Colas18 and Henderson17. For each experiment, we report final metrics and absolute metrics. The final metric [Colas18] is the average reward over the last evaluation episodes, i.e., episodes for each of the last ten policies during training. The absolute metric [Colas18] is the best policy’s average reward over evaluation episodes. To analyze the significance of the reported improvement over the baseline, we perform a two-sample t-test and bootstrapped estimation of the confidence interval for the mean reward difference obtained by the baseline and our approach. We use the SciPy [scipy] implementation for the t-test, and the Facebook Bootstrapped implementation with bootstrap samples for confidence interval estimation. All experiments are repeated for five runs with different random seeds.
Results. We compare the proposed permutation invariant critic (PIC) with an MLP critic and an MLP critic with data augmentation. Data augmentation shuffles the order of agents’ observations and actions when training the MLP critic, which is considered to be a simple way to alleviate the ordering issue. The results are summarized in tb:main, where denotes the number of agents, and ‘final’ and ‘absolute’ are the final metric and absolute metric respectively. We observe that data augmentation does not boost the performance of an MLP critic much. In some cases, it even deteriorates the results. In contrast, as shown in tb:main, a permutation invariant critic outperforms the MLP critic and the MLP critic with data augmentation by about in all tasks.
The training curves are given in fig:r_plot and fig:q_plot. As shown in fig:q_plot, the loss (Equation 6) of the permutation invariant critic is lower than that of MLP critics. Please see the supplementary material for more training curves. fig:q_loss_and_time (top) shows the ratio of the MLP critic’s average loss to the permutation invariant critic’s loss on cooperative navigation environments. We observed that the ratio grows when the number of agents increases. This implies that a permutation invariant approach gives much more accurate value estimation than the MLP critic particularly when the number of agents is large.
To confirm that the performance gain of our permutation invariant critic is significant, we report the two-sample t-test results and the bootstrapped confidence interval on the mean difference of our approach and the baseline MADDPG with MLP critic. For the two-sample t-test, the t-statistic and the p-value are reported. The difference is considered significant when the p-value is smaller than . tb:t_test summarizes the analysis on cooperative navigation environments. As shown in tb:t_test, the p-value is much smaller than which suggests the improvement is significant. In addition, positive confidence intervals and means suggest that our approach achieves higher rewards than the baseline. Please see the supplementary material for additional analysis.
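The two statistics can be sketched with NumPy alone. The snippet below is an assumption-laden illustration and not the analysis code used for the reported tables: it computes a Welch-style two-sample t-statistic by hand and a percentile bootstrap interval for the mean difference, with made-up rewards and a hypothetical number of bootstrap samples and interval level. A p-value would additionally require the t-distribution, e.g. via `scipy.stats.ttest_ind`.

```python
import numpy as np

def welch_t_statistic(a, b):
    """Two-sample t-statistic without assuming equal variances."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

def bootstrap_mean_diff_ci(a, b, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Resample each group with replacement, n_boot times.
    idx_a = rng.integers(0, len(a), size=(n_boot, len(a)))
    idx_b = rng.integers(0, len(b), size=(n_boot, len(b)))
    diffs = a[idx_a].mean(axis=1) - b[idx_b].mean(axis=1)
    low, high = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), (low, high)

# Made-up rewards from five runs per method (not results from the paper).
ours = [10.0, 11.0, 12.0, 11.5, 10.5]
baseline = [5.0, 6.0, 5.5, 6.5, 5.0]

t_stat = welch_t_statistic(ours, baseline)
mean_diff, (ci_low, ci_high) = bootstrap_mean_diff_ci(ours, baseline)
print(t_stat, mean_diff, ci_low, ci_high)
```

An interval that lies entirely above zero indicates that the first method's mean reward is higher for every plausible resampling of the runs.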
In addition to fully connected graphs we also tested nearest neighbor graphs, i.e., each node is connected only to its nearest neighbors. The distance between two nodes is the physical distance between the corresponding agents in the environment. We found that using fully connected graphs achieves better results than using nearest neighbor graphs. We report absolute rewards for (, , fully connected graph): the results on cooperative navigation (), prey and predator (), cooperative push (), and heterogeneous navigation () are , , , and respectively.
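A nearest neighbor adjacency matrix of the kind described above can be built from agent positions as sketched below. The function name `knn_adjacency` and the parameter `k` are hypothetical, and the code only illustrates the graph construction, not the experiments.

```python
import numpy as np

def knn_adjacency(positions, k):
    """Adjacency matrix connecting each agent to its k nearest agents."""
    n = positions.shape[0]
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)              # exclude self-connections
    nearest = np.argsort(dist, axis=1)[:, :k]   # k closest agents per row
    adj = np.zeros((n, n))
    adj[np.arange(n)[:, None], nearest] = 1.0
    return adj

# Four agents on a line at coordinates 0, 1, 3, and 7.
positions = np.array([[0.0], [1.0], [3.0], [7.0]])
adj_1nn = knn_adjacency(positions, k=1)
adj_full = knn_adjacency(positions, k=3)
print(adj_1nn)
```

With `k` equal to the number of other agents, the result coincides with the complete graph (all ones with zeros on the diagonal). For smaller `k` the matrix is generally not symmetric, and each agent aggregates only local information.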
fig:q_loss_and_time (bottom) compares training time of MADDPG on the original MPE and our improved MPE. MADDPG is trained for episodes with a PIC. As shown in fig:q_loss_and_time (bottom), training agents in the original MPE environment takes more than 100 hours. In contrast, with the improved MPE environment, we can train 30 agents within five hours, and scale to 200 agents within a day of training.
6 Conclusion
We propose and study permutation invariant critics to estimate a consistent value for all permutations of the agents’ order in multiagent reinforcement learning. Empirically, we demonstrate that a permutation invariant critic outperforms classical deep nets on a variety of multiagent environments, both homogeneous and heterogeneous. Permutation invariant critics lead to better sample complexity and permit scaling learning to environments with a large number of agents. We think that the permutation invariance property is important for deep multiagent reinforcement learning. Going forward we will study other designs to scale to an even larger number of agents.
Acknowledgements: This work is supported in part by NSF under Grant 1718221 and MRI 1725729, UIUC, Samsung, 3M, Cisco Systems Inc. (Gift Award CG 1377144), Adobe, and a Google PhD Fellowship to RY. We thank NVIDIA for providing GPUs used for this work and Cisco for access to the Arcetri cluster.
References
Appendix A MADDPG Baseline
We implement the MADDPG baseline in PyTorch. To ensure we implement MADDPG correctly, we compare the performance of our implementation with MADDPG’s official code on the MPE. As shown in tb:base, our implementation reproduces the results of MADDPG in all environments.
Cooperative Navigation  Prey and Predator  Push  
good  adversary  good  adversary  good  
MADDPG official code  379.57  2.91  5.48  7.05  1.67 
Our implementation  367.75  5.97  14.79  6.54  1.81 
Appendix B Environment Details and Group Embedding
In this section, we first provide details of observation and action space in each environment we considered in our experiments. Subsequently, we discuss a PIC’s group embedding.
In all four environments, the action dimension is five. One dimension is a no-op. The other four dimensions represent the left, right, forward, and backward force applied to the particle agent. An agent’s observation always contains its location and velocity. Depending on the environment, the observation may contain the relative location and velocity of neighboring agents and landmarks. In our experiments, the number of visible neighbors in an agent’s observation is equal to or less than ten because we empirically found a larger number of visible neighbors to not boost the performance of MADDPG but rather slow the training speed. Note that the number of visible neighbors in an agent’s observation is different from the nearest neighbor graph discussed in sec:exp. The neighbor count of the nearest neighbor graph refers to the number of agent observations and actions which are used as input to the centralized critic, while the number of visible neighbors in an agent’s observation is a characteristic of an environment. The details of the observation representation for each environment are as follows:


Cooperative navigation: An agent observes its location and velocity, and the relative location of its nearest landmarks and agents. The number of nearest landmarks and agents observed is 2, 5, 5, 5, 5, 5 for an environment with 3, 6, 15, 30, 100, 200 agents, respectively. As a result, the observation dimension is 14, 26, 26, 26, 26, 26.

Prey and predator: A predator observes its location and velocity, the relative location of its nearest landmarks and fellow predators, and the relative location and velocity of its nearest prey. The corresponding neighbor counts are (2, 1), (3, 2), (5, 5), (5, 5), (5, 5) for an environment with 3, 6, 15, 30, 100 agents, respectively. As a result, the observation dimension is 16, 28, 34, 34, 34.

Cooperative push: An agent observes its location and velocity, the relative location of the target landmark and the large ball, and the relative location of its nearest agents. The number of nearest agents observed is 2, 5, 10, 10 for an environment with 3, 6, 15, 30 agents, respectively. As a result, the observation dimension is 12, 18, 28, 28.

Heterogeneous navigation: An agent observes its location and velocity, and the relative location of its nearest landmarks and agents. The number of nearest landmarks and agents observed is 2, 5, 5, 5, 5 for an environment with 3, 6, 15, 30, 100 agents, respectively. Consequently, the observation dimension is 14, 26, 26, 26, 26.
In the heterogeneous environment Heterogeneous navigation, the number of groups is two, i.e., there are two groups of agents that have different characteristics. For the PIC, the group embedding is a two-dimensional vector for each group in the environment. We train the embedding along with the network parameters. The group embedding is randomly initialized from a normal distribution.
Appendix C T-test and Training Curves
The two-sample t-test and the confidence interval of the mean difference of the MLP critic and the PIC are summarized in tb:t_test_spreada, tb:t_test_taga, tb:t_test_pusha, and tb:t_test_heteroa. p-values smaller than and positive confidence intervals indicate that the PIC’s improvement over the MLP critic is significant. The training curves for MADDPG with MLP critic and our PIC are shown in fig:r_plot_all1 and fig:r_plot_all2. The PIC outperforms the MLP critic in all the environment settings.
Cooperative navigation
# of agents  metric  t-stat. (p-value)  bootstrap mean (C.I.)
N=3  abs.  1.42 (1.9e-1)  5.81 (1.47, 13.09)
N=3  final  1.41 (1.9e-1)  5.76 (1.65, 13.18)
N=6  abs.  9.46 (3.2e-5)  634.4 (487.9, 757.1)
N=6  final  9.34 (4.1e-5)  642.2 (496.1, 762.3)
N=15  abs.  21.04 (2.5e-5)  4244 (3819, 4575)
N=15  final  20.6 (2.8e-5)  4280 (3854, 4616)
N=30  abs.  3.17 (2.2e-2)  9546 (6719, 12639)
N=30  final  3.10 (2.1e-2)  9841 (6692, 13201)
N=100  abs.  17.8 (2.9e-3)  56939 (51842, 62764)
N=100  final  17.3 (3.0e-3)  56529 (51153, 62343)
N=200  abs.  2.72 (4.9e-2)  73611 (25930, 116198)
N=200  final  2.66 (4.8e-2)  73747 (26557, 116457)

Prey and predator
# of agents  metric  t-stat. (p-value)  bootstrap mean (C.I.)
N=3  abs.  6.61 (1.6e-4)  22.7 (15.2, 29.7)
N=3  final  5.38 (8.0e-4)  21.3 (13.6, 29.6)
N=6  abs.  3.14 (3.4e-2)  201 (88.2, 316)
N=6  final  3.11 (3.5e-2)  201 (86.4, 318)
N=15  abs.  20.34 (1.7e-6)  5838 (5468, 6248)
N=15  final  20.29 (5.7e-7)  5941 (5609, 6303)
N=30  abs.  5.91 (4.0e-3)  6821 (5176, 9006)
N=30  final  5.82 (4.3e-3)  6755 (5106, 8925)
N=100  abs.  6.92 (7.3e-2)  92753 (82974, 102532)
N=100  final  7.44 (7.1e-2)  89980 (80796, 99163)

Cooperative push
# of agents  metric  t-stat. (p-value)  bootstrap mean (C.I.)
N=3  abs.  4.03 (1.4e-2)  16.6 (10.3, 24.4)
N=3  final  4.04 (1.4e-2)  16.7 (10.4, 24.6)
N=6  abs.  66.4 (1.0e-10)  276.5 (269, 287)
N=6  final  67.2 (1.3e-11)  271.0 (263, 281)
N=15  abs.  4.90 (6.3e-3)  384 (250, 507)
N=15  final  5.12 (5.2e-3)  414 (280, 546)
N=30  abs.  4.41 (7.1e-3)  593 (400, 778.316)
N=30  final  4.64 (5.4e-3)  644 (419, 844)

Heterogeneous navigation
# of agents  metric  t-stat. (p-value)  bootstrap mean (C.I.)
N=3  abs.  7.07 (1.7e-3)  15.1 (11.8, 18.4)
N=3  final  6.76 (2.2e-3)  15.1 (11.6, 18.6)
N=6  abs.  19.7 (2.1e-5)  285 (259, 317)
N=6  final  19.6 (2.2e-5)  286 (260, 318)
N=15  abs.  62.2 (3.0e-9)  1650 (1619, 1678)
N=15  final  61.9 (6.5e-9)  1653 (1619, 1684)
N=30  abs.  74.1 (3.7e-9)  6538 (6380, 6725)
N=30  final  76.0 (2.6e-9)  6518 (6362, 6695)
N=100  abs.  5.82 (1.9e-2)  36327 (27272, 50292)
N=100  final  5.65 (1.8e-2)  36618 (26339, 51579)