I Introduction
Multiagent deep reinforcement learning (RL) algorithms have shown superhuman performance in gaming environments such as the Dota 2 video game [1] and StarCraft II [2]. For industrial applications, millions of dollars can be saved by making small improvements in productivity. The unprecedented performance of multiagent deep RL in learning sophisticated policies in collaborative and competitive games makes one wonder why industries are reluctant to use multiagent deep RL algorithms to achieve more efficient policies. Recently, Dulac-Arnold, Mankowitz, and Hester listed nine main challenges in applying deep RL in real-life applications [3]. Dulac-Arnold et al. formulated these challenges more formally using a Markov Decision Process framework and presented some existing work in the literature that addresses them [4]. In previous work, we discussed additional challenges in using deep RL in an industrial environment [5]. Among them, multiagent systems with a variable number of agents have received very limited attention from the research community [6]. However, agent failures are very common in industrial applications. An agent's failure during operation, or its return to operation after maintenance or repair, can change the number of operating agents abruptly. Moreover, a change in the production target can increase or decrease the number of operating agents in an industrial setting from one shift to the next. Therefore, it is critical to address systems with a varying number of agents to make the application of multiagent deep RL in industrial settings a viable solution. In this paper, we propose a new multiagent deep RL algorithm which can address the challenge of multiagent systems with a variable number of agents. Moreover, our solution has the capacity to learn an optimal policy when there is action dependency among neighboring agents. This can be very helpful for industrial settings with collaborating agents, because it is often impossible for an agent to make optimal decisions independently of the agents with which it has a high level of interaction.
Section II reviews deep RL solutions for multiagent systems; we present the pros and cons of common multiagent deep RL algorithms and justify the need for our proposed solution. Section III presents K-nearest multiagent deep RL, a new algorithm that addresses some of the challenges which have prevented industrial applications from fully benefiting from deep RL algorithms. Section IV presents the application of K-nearest multiagent RL in a fleet traffic management system. The fleet traffic control problem is a real industrial challenge at a Hitachi production site, which motivated us to develop the K-nearest multiagent RL algorithm. Section V concludes the paper.
II Multiagent Deep RL Literature Review
TABLE I: Advantages and disadvantages of multiagent RL methods with respect to four challenges.

RL Methods                                    | Variable number of agents | Nonstationary training | Action dependency | Model management
Centralized                                   | ✗ | ✓ | ✓ | ✓
Centralized training/decentralized execution  | ✗ | ✓ | ✗ | ✓
Independent Q-learning (IDQ)                  | ✓ | ✗ | ✗ | ✗
IDQ + weight-sharing                          | ✓ | ✓ | ✗ | ✓
K-nearest RL + weight-sharing                 | ✓ | ✓ | ✓ | ✓

✓: the method addresses the challenge.
✗: the method cannot address the challenge.
Many realworld problems are multiagent problems. When the number of agents is small, it is possible to model multiagent problems using a centralized approach wherein we train a centralized policy over the agents’ joint observations and produce a joint set of actions. In the centralized approaches, the policy network generates all of the agents’ actions using the global observations. Therefore, these methods can achieve optimal policies even when the agents’ actions are interdependent.
However, one can imagine that the centralized approaches do not scale well for many applications and we will quickly have state and action spaces with very large dimensions. Moreover, they are not practical solutions to address industrial problems, in which the number of agents is not constant. In many realworld applications, the number of agents can change based on the product’s demand, the equipment’s scheduled maintenance and repairs, unexpected breakdowns, operator availability, etc.
The centralized RL algorithms consider the global feature space as the union of the agents’ features and generate actions for all of the agents simultaneously. Therefore, these methods assume that the number of agents is fixed. Removing an agent leaves a hole in the network. Moreover, we cannot add additional agents during an episode, which can be the case in many industries. A trivial solution to this problem is to learn different policy networks for each different number of agents, so that we can enact the appropriate policy when the number of agents changes. However, the range of agents could vary widely in an industrial application. To have a model trained and updated for each combination is expensive or even infeasible, especially for a large system with hundreds of agents.
A more realistic approach is to use an autonomous learner for each agent, such as independent deep Q-network (IDQN), which distinguishes agents by their identities [7]. Even though independent learners address the challenges of a large action space and a variable number of agents to some extent, they suffer from a convergence point of view because the environment becomes nonstationary. In fact, these algorithms model the other agents as part of the environment and, therefore, the policy networks must chase moving targets as the agents' behaviors change during training [8].
Moreover, when we use the IDQN framework, we force the agents to take actions independently. Therefore, unlike the centralized RL algorithms, IDQN cannot achieve an optimal policy when an optimal action for an agent depends on the other agents' actions. This limits the performance of IDQN in many collaborative industrial applications wherein each agent interacts with several other agents and must consider their behavior in its decision-making to maximize the overall performance. Finally, from a practical point of view, it can be difficult to train and maintain a separate policy network for each agent, especially in large-scale industrial systems in which there can be hundreds of agents.
To address the convergence problem, centralized learning with decentralized execution approaches have been proposed in recent years. In these methods, a centralized learning approach is combined with a decentralized execution mechanism to have the best of both worlds. Lowe et al. proposed multiagent deep deterministic policy gradient (MADDPG), which includes a centralized critic network and decentralized actor networks for the agents [9]. Sunehag et al. proposed a linear additive value decomposition approach wherein the total Q value is modeled as a sum of the individual agents' Q values [10]. Rashid et al. proposed the QMIX network, which allows for a richer mixing of the agents' Q values compared to linear additive value decomposition [11].
Even though the centralized learning with decentralized execution approaches have shown promising results in many applications, they have the same limitations as the centralized RL methods when it comes to the industrial problems in which the number of agents is not constant. Moreover, because the execution is decentralized, they cannot solve multiagent problems in which the optimal action for an agent depends upon the other agents’ actions.
Foerster, Assael, Freitas, and Whiteson proposed a single network with shared parameters to reduce the number of learned parameters and speed up the learning [12]. In a previous work, we applied a similar approach to address a dynamic dispatching problem in an industrial multiagent environment [13]. Having a shared policy among agents solves the agent failure challenge and addresses the nonstationary problem to some extent. To further reduce the nonstationary problem, Foerster, Assael, Freitas, and Whiteson disabled experience replay [12]. A single network with shared parameters also solves the model management challenge because it only requires training and maintaining one model for all of the agents. However, like the autonomous learner algorithms and the centralized learning with decentralized execution approaches, this approach cannot achieve optimal policy when optimal action for an agent depends upon the other agents’ actions. Table I presents a summary of advantages and disadvantages of different multiagent RL solutions.
III K-nearest Multiagent RL
In this section, we present K-nearest deep RL, which is designed to address two main challenges in multiagent industrial problems: 1) a varying number of agents and 2) action dependency among the agents. A varying number of agents is one of the main challenges that industries face when dealing with multiagent problems. This problem often arises because of a change in the production plan or equipment failures. Action dependency is another major issue in many industrial applications. This occurs when the optimal action of an agent depends upon the action another agent takes. On one side of the spectrum, independent learner methods can address the variable agent challenge, but they fail to perform well when there is action dependency among the agents. On the other side of the spectrum, centralized RL algorithms can fully address action dependency; however, they are not suitable for situations in which there is a variable number of agents. K-nearest RL proposes a tradeoff solution which combines the best of both worlds.
We observed that in many industrial applications with collaborative agents, the optimal actions an agent takes are highly dependent upon the nearby agents' actions, but they are not dependent upon the actions taken by other agents with which the agent is not interacting. For example, in the traffic control problem, the optimal speed for a given vehicle can dramatically change if the speed of a nearby vehicle changes, but the speeds of two vehicles on different sides of the production site hardly have any impact on one another. K-nearest RL aims to use this locality to address action dependency when it matters while avoiding the challenges that come with centralized solutions.
III-A K-nearest agents
Our proposed algorithm finds the k nearest agents to each agent and generates their actions using a common policy network. The metric with which to measure the distance between two agents depends upon the problem. For example, in a traffic control problem where each vehicle is an agent, the distance between two vehicles can be defined based upon the traveling distance between them, but in a production line in which each operating machine is an agent, the distance between two agents may be defined as the time it takes for a product to travel between the machines. Consider a system with N agents. After grouping each agent with its k nearest agents, we will have M unique groups, wherein the number of agents in each group is equal to k+1. Figure 1 shows an example wherein, at this time step, the agents form three groups.
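As a concrete sketch, the grouping step can be implemented as follows (a minimal illustration assuming Euclidean positions; the function name and distance metric are ours, and any problem-specific metric such as traveling distance can be substituted):

```python
import numpy as np

def knearest_groups(positions, k):
    """Group each agent with its k nearest agents and return the
    unique groups as sorted tuples of agent indices."""
    positions = np.asarray(positions, dtype=float)
    n = len(positions)
    # Pairwise Euclidean distances between all agents.
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    groups = set()
    for i in range(n):
        # The agent itself is at distance 0, so it sorts first;
        # the first k+1 indices are the agent plus its k nearest agents.
        members = tuple(sorted(np.argsort(dists[i])[: k + 1]))
        groups.add(members)  # identical groups collapse automatically
    return sorted(groups)
```

For instance, four agents forming two well-separated pairs with k = 1 produce exactly two groups, one per pair.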
The number of groups, M, varies based upon the agents' formation. The following theorem provides the upper and lower bounds for M.
Theorem 1
At any given time, the number of groups, M, in the k-nearest RL algorithm satisfies:

⌈N/(k+1)⌉ ≤ M ≤ N    (1)

where ⌈x⌉ represents the ceiling function of x, which is the smallest integer that is greater than or equal to x.
Proof: see Appendix A.
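The bounds in equation (1) can be spot-checked numerically (our illustration, assuming a Euclidean metric for the grouping; the helper name is ours):

```python
import math
import numpy as np

def count_groups(positions, k):
    """Count the unique k-nearest groups under a Euclidean metric
    (an illustrative helper; the paper's metric is problem-specific)."""
    positions = np.asarray(positions, dtype=float)
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    groups = {tuple(sorted(np.argsort(row)[: k + 1])) for row in dists}
    return len(groups)

# Theorem 1 states ceil(N/(k+1)) <= M <= N; check on random layouts.
rng = np.random.default_rng(0)
for _ in range(200):
    n, k = int(rng.integers(6, 30)), int(rng.integers(1, 5))
    m = count_groups(rng.random((n, 2)), k)
    assert math.ceil(n / (k + 1)) <= m <= n
```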
III-B Weight sharing
Since all the groups have the same number of agents, equal to k+1, we can have a policy network with k+1 outputs to assign the actions of the agents in each group. Similar to [12], different groups share the same policy network. The weight sharing makes our model easy to train and maintain. Moreover, our solution can address dependent actions because one network selects optimal actions for each agent and its nearby agents. Note that independent learner solutions are special cases of K-nearest RL with k = 0, and centralized RL algorithms are the special case of K-nearest RL with k + 1 = N.
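To make the weight sharing concrete, the sketch below (our illustration, not the paper's code) uses one stand-in linear policy object for every group; in a real implementation this would be a trained SAC actor:

```python
import numpy as np

class SharedGroupPolicy:
    """Minimal sketch of a policy shared across all groups. Assumed
    shapes: each group observation is a flat vector; the output is one
    continuous action per group member (k+1 actions in total)."""

    def __init__(self, obs_dim, k, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((obs_dim, k + 1)) * 0.1  # shared weights
        self.b = np.zeros(k + 1)

    def act(self, group_obs):
        # tanh keeps actions bounded, as is conventional for SAC actors.
        return np.tanh(np.asarray(group_obs) @ self.W + self.b)

# The SAME policy object serves every group:
policy = SharedGroupPolicy(obs_dim=12, k=2)
actions_A = policy.act(np.zeros(12))  # group A's k+1 = 3 actions
actions_B = policy.act(np.ones(12))   # group B's k+1 = 3 actions
```

Because a single set of weights serves all groups, adding or removing agents only changes how many groups are formed per step, not the model that has to be trained and maintained.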
III-C Post processing
In our algorithm, it is possible that an agent belongs to more than one group. In this case, the agent may receive different commands from the policy networks of the different groups to which it belongs. We can address this problem using different aggregation strategies based on the application. In the traffic control case study, we use the weighted average of the different actions. The weight of each group is inversely proportional to the total distance of the agent from the other agents in the group. Therefore, the agent gives a higher weight to the group to which it is closer. Figure 1 shows two examples of an agent belonging to more than one group. Consider agent x. At this time step, x belongs to group A and group B. Using our weighted-sum aggregation method, agent x gives higher weight to the action proposed by the policy network of group B because it is located closer to the agents of this group than to the agents of group A. The focus of this work is on continuous action spaces. However, it is possible to develop an aggregation method for discrete action spaces as well. For example, we can simply choose the action from the group to which the agent is closest.
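A minimal sketch of the inverse-distance aggregation described above (function and argument names are ours):

```python
import numpy as np

def aggregate_actions(proposals):
    """Inverse-distance weighted average of the actions an agent receives
    from the groups it belongs to (continuous action space).

    proposals: list of (action, total_distance) pairs, where total_distance
    is the agent's summed distance to the other agents in that group.
    """
    actions = np.array([a for a, _ in proposals], dtype=float)
    dists = np.array([d for _, d in proposals], dtype=float)
    eps = 1e-8                     # guard against zero distances
    weights = 1.0 / (dists + eps)  # closer group -> larger weight
    weights /= weights.sum()
    return float(weights @ actions)

# Agent x belongs to groups A and B; it is closer to B's agents,
# so B's proposed action dominates the weighted average.
speed = aggregate_actions([(10.0, 40.0), (20.0, 10.0)])
```

Here the weights are 0.2 and 0.8 after normalization, so the aggregated action lands much closer to group B's proposal.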
III-D Off-policy training
The K-nearest RL algorithm learns a policy for each group of k+1 agents. Using an off-policy RL algorithm to train the network, we save the observations of the k+1 agents, their actions, their next observations, and their total reward in the replay buffer. We consider the reward of each group equal to the total reward of its k+1 agents to encourage them to collaborate with the other local agents in the group in order to maximize the common profit. We keep track of each group's observation and reward at the next step and save the results in the replay buffer, even though the distances between agents may change after the agents take their actions, so that we may have a different number of groups at the next step. All of the group policy networks share the samples and the weights during off-policy training. The idea is to learn a policy that generates an optimal solution for any set of nearby agents.
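The replay-buffer bookkeeping can be sketched as follows (a simplified stand-in, assuming flat observation and action containers; class and method names are ours):

```python
from collections import deque
import random

class GroupReplayBuffer:
    """Sketch of the shared replay buffer: one transition per group per
    step. The stored reward is the TOTAL reward of the group's k+1
    agents, so the shared policy is trained toward the common profit.
    Group membership may change between steps; each stored transition
    is self-contained, so that does not matter for sampling."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, group_obs, group_actions, agent_rewards, next_group_obs):
        total_reward = sum(agent_rewards)  # encourage local collaboration
        self.buffer.append((group_obs, group_actions, total_reward, next_group_obs))

    def sample(self, batch_size):
        # All group policy networks share samples during training.
        return random.sample(self.buffer, batch_size)
```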
We use the Soft Actor Critic (SAC) algorithm [14] which is an offpolicy deep RL algorithm that uses maximum entropy to improve robustness as our RL algorithm to learn the network policy for each group. The SAC algorithm achieves stateoftheart performance on various continuous action space benchmarks. Moreover, it is very stable, and its performance does not change significantly across different training episodes. Nonetheless, our solution is not dependent upon the SAC algorithm and it can be used with any offpolicy RL algorithm.
III-E The overall algorithm
Here are the steps to train the K-nearest RL algorithm:

1. Identify the k nearest agents to each agent. For a system with N agents, this will generate N groups, wherein each group has k+1 agents. The distance can be defined based on agent interactions for any given problem. For example, in traffic control, the traveling distance between two vehicles can represent the distance between them.

2. Remove redundant groups. Two groups are redundant if they have the same agents. This will reduce the number of groups to M (see equation (1)).

3. Train an SAC policy network with k+1 actions to control each group. We use weight sharing across the groups. The reward for each group is the total reward of its agents. This encourages collaboration among the agents.

4. Apply the actions generated by each policy network to the agents; if an agent belongs to more than one group, apply a weighted action. The weights are based on how close the agent is to each group's agents.

5. Go back to the first step.
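The steps above can be sketched as one control pass on a toy one-dimensional fleet (all helper logic here is an illustrative stand-in, assuming Euclidean distances and a generic shared policy callable; none of the names are the paper's API):

```python
import numpy as np

def one_control_step(positions, k, policy):
    """One pass of steps 1-4 on a toy 1-D fleet. `policy` maps a group
    observation to k+1 actions; training the policy (step 3) is omitted."""
    positions = np.asarray(positions, dtype=float)
    n = len(positions)
    d = np.abs(positions[:, None] - positions[None, :])
    # Steps 1-2: form k-nearest groups and drop redundant duplicates.
    groups = sorted({tuple(sorted(np.argsort(d[i])[: k + 1])) for i in range(n)})
    proposals = {i: [] for i in range(n)}
    for g in groups:
        obs = positions[list(g)]   # toy group observation
        actions = policy(obs)      # shared policy: one action per member
        for i, a in zip(g, actions):
            # record the proposal with the agent's total distance to the group
            proposals[i].append((a, d[i, list(g)].sum()))
    # Step 4: inverse-distance weighted aggregation per agent.
    joint = np.empty(n)
    for i, props in proposals.items():
        w = np.array([1.0 / (dist + 1e-8) for _, dist in props])
        acts = np.array([a for a, _ in props])
        joint[i] = (w / w.sum()) @ acts
    return groups, joint
```

With two well-separated pairs of agents and k = 1, the pass yields two groups and one aggregated action per agent.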
K-nearest RL can be applied to any multiagent problem in which we can define the nearby agents as a subset of highly interacting agents. An example of such a system is a fleet traffic control problem, where nearby vehicles have a much higher impact on each other. In the next section, we demonstrate this application.
IV Case Study
We developed the K-nearest multiagent deep RL algorithm to address traffic control in a Hitachi fleet management system. In this setting, different vehicles travel in a production site to transfer personnel and materials. The destination of each vehicle is set by a separate dispatching algorithm. Our goal is to set the speed of each vehicle in order to minimize traffic congestion and, therefore, maximize the total number of cycles in each shift. Reducing traffic congestion can lead to millions of dollars in cost savings and significantly reduce carbon emissions.
Considering the actuator delays (the time between when a vehicle receives a speed command and the time the vehicle reaches the desired speed), the different roads and intersections in different parts of the facility, and the multiagent nature of the problem, where each vehicle represents an agent, it is not feasible to solve this problem analytically. Multiagent deep RL seems like a natural solution. Therefore, we developed the K-nearest multiagent deep RL algorithm to utilize the potential of multiagent deep RL while addressing the following challenges in the system:

1. The number of available vehicles can change on any day, and it is very expensive to keep a different model for each combination of vehicles, as each model has to be maintained over time.

2. The action of each vehicle (its speed) cannot be set independently of the nearby vehicles, because an independent decision-making process can lead to a suboptimal solution for the entire system.
The current default algorithm sets the speed of each vehicle equal to the maximum speed limit of each road. Therefore, we call the default algorithm the maximum speed algorithm. Coming up with an algorithm that achieves a higher number of cycles per shift than the maximum speed algorithm is challenging because the vehicles are not allowed to travel faster than the speed limit. In fact, our algorithm must increase the overall number of cycles by reducing some vehicles' speeds, which may sound counterintuitive. However, this is possible because higher speeds may lead to queuing and congestion further down the road.
We use a simulator developed by Hitachi, a digital twin of the actual production facility, to train and test the K-nearest deep RL algorithm. The simulator is designed to mirror the actual vehicles' speed profiles and the facility map. We set k = 4, which means each group includes five vehicles. This hyperparameter is set based upon our understanding of the level of vehicle interactions in the facility and the fact that the facility always has at least five traveling vehicles. For other problems, based upon the number of interacting agents and the minimum number of operating agents, we may need a higher or a lower k.
Each episode during training represents a 12-hour shift in the factory. We use the SAC deep RL algorithm to learn the policy network for the groups. The observation for each group includes the location of each vehicle in the group, the distance between each pair of vehicles in the group, the current speed of each vehicle in the group, the time elapsed since each vehicle in the group left its previous stop (which could be a loading site or a discharge site), the distance it has traveled since the last stop, and the distance it has left to travel to reach its current goal determined by the dispatching algorithm.
We renew the speed targets every T seconds. We keep track of the vehicles in each group until the next step to save the next observation. The reward for each group is the total distance traveled by the vehicles in the group from the time we set the speeds until the next time step:
r_g = Σ_{v=1}^{k+1} d_{g,v}    (2)

where r_g is the reward for group g and d_{g,v} represents the distance traveled by vehicle v in group g over the last T seconds. We set the reward equal to the total distance traveled by the group's vehicles since the previous time step to encourage collaboration among the vehicles.
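A minimal sketch of this reward computation (names are ours; the `traveled` distances would come from the simulator):

```python
def group_rewards(traveled, groups):
    """Equation (2): the reward of group g is the sum of the distances
    traveled by its vehicles since the previous control step.
    `traveled` maps each vehicle id to the distance it covered over
    the last T seconds; `groups` is a list of vehicle-id tuples."""
    return {g: sum(traveled[v] for v in g) for g in groups}

# Two overlapping groups sharing vehicle 2: vehicle 2's distance counts
# toward both groups' rewards, tying each vehicle to all its neighbors.
r = group_rewards({1: 100.0, 2: 80.0, 3: 120.0}, [(1, 2), (2, 3)])
# r == {(1, 2): 180.0, (2, 3): 200.0}
```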
Fig. 2 shows the number of cycles achieved in one shift using the default maximum speed algorithm and our proposed K-nearest deep RL algorithm. The black line shows one standard deviation of each method over 10 runs. The gray line shows the minimum and maximum number of cycles over the 10 runs. We run each algorithm multiple times because there are several stochastic variables in the simulator, such as the loading and discharge times of personnel and materials, which can affect the number of cycles. Fig. 2 shows that the K-nearest RL algorithm clearly outperforms the maximum speed algorithm. This is significant because maximum speed is a very strong baseline, which most drivers choose intuitively.
Fig. 3 shows the histogram of vehicle speeds under the maximum speed algorithm and the K-nearest RL policy for two consecutive 12-hour shifts. When we use the K-nearest RL policy, there are fewer moments at which the vehicles move at the maximum speed limit (60 mph). However, there are also far fewer instances in which vehicles are stopped or moving at less than 5 mph due to congestion. This figure shows how K-nearest RL improves the overall performance by occasionally lowering vehicles' speeds below the maximum speed limit.
The simulator operated with a fixed number of vehicles during training. To demonstrate the robustness of K-nearest RL, we ran another experiment in which we tested the trained policy with fewer vehicles. Fig. 4 shows that the K-nearest policy outperforms the maximum speed policy even when the fleet has five fewer vehicles than during training. This demonstrates the robustness of our algorithm and the fact that it can be applied in environments with a variable number of agents.
V Conclusions
In this work, we presented K-nearest deep RL to address the challenges of a variable number of agents, action dependency, model management, and nonstationary training, in order to make deep multiagent RL more applicable in industrial settings. First, K-nearest deep RL can perform in an environment with a variable number of agents as long as the total number of agents does not fall below the number of agents in each group, k+1. The experimental results showed that our K-nearest policy is robust enough to outperform the baseline method even when we have five fewer agents in the test than in the training.
Second, K-nearest deep RL can address action dependency among highly interactive agents. This makes achieving an optimal policy feasible in many industrial applications. Third, by using weight-sharing, our proposed K-nearest deep RL algorithm makes model management very simple, as we only have to train and maintain one policy network for all of the groups. Unfortunately, simplicity of model training and maintenance is not a high priority for many deep RL researchers; however, it is crucial in any industrial application. Finally, our experimental results showed that our solution converges even though we used experience replay. This has high practical value because experience replay can improve sample efficiency and reduce the time and computational costs of the learning process.
References
 [1] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al., "Dota 2 with large scale deep reinforcement learning," arXiv preprint arXiv:1912.06680, 2019.
 [2] O. Vinyals, I. Babuschkin, J. Chung, M. Mathieu, M. Jaderberg, W. M. Czarnecki, A. Dudzik, A. Huang, P. Georgiev, R. Powell, et al., "AlphaStar: Mastering the real-time strategy game StarCraft II," DeepMind blog, p. 2, 2019.
 [3] G. Dulac-Arnold, D. Mankowitz, and T. Hester, "Challenges of real-world reinforcement learning," arXiv preprint arXiv:1904.12901, 2019.
 [4] G. Dulac-Arnold, N. Levine, D. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester, "An empirical investigation of the challenges of real-world reinforcement learning," arXiv preprint arXiv:2003.11881, 2020.
 [5] H. Khorasgani, H. Wang, and C. Gupta, "Challenges of applying deep reinforcement learning in dynamic dispatching," arXiv preprint arXiv:2011.05570, 2020.
 [6] G. Wang and J. Shi, "Actor-critic for multiagent system with variable quantity of agents," in International Conference on Internet of Things as a Service, pp. 48–56, Springer, 2018.
 [7] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [8] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.
 [9] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
 [10] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al., "Value-decomposition networks for cooperative multi-agent learning based on team reward," in AAMAS, pp. 2085–2087, 2018.
 [11] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, "QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning," arXiv preprint arXiv:1803.11485, 2018.
 [12] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, "Learning to communicate with deep multi-agent reinforcement learning," in Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
 [13] C. Zhang, P. Odonkor, S. Zheng, H. Khorasgani, S. Serita, C. Gupta, and H. Wang, "Dynamic dispatching for large-scale heterogeneous fleet via multi-agent deep reinforcement learning," in 2020 IEEE International Conference on Big Data (Big Data), pp. 1436–1441, IEEE, 2020.
 [14] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning, pp. 1861–1870, PMLR, 2018.
Appendix A
Let s_i, the score of agent i, be the number of groups agent i belongs to. Each agent belongs to at least one group; therefore:

Σ_{i=1}^{N} s_i ≥ N    (3)

Moreover, we know that each group has k+1 agents; therefore:

Σ_{i=1}^{N} s_i = M(k+1)    (4)

Combining equations (3) and (4), we have:

M ≥ N/(k+1)    (5)

Since M is an integer, we have:

M ≥ ⌈N/(k+1)⌉    (6)

Consider the case in which the groups do not share any agents. Since we have k+1 agents in each group, we have M = N/(k+1). This shows that we cannot derive a lower bound higher than the one presented in equation (1).
For the upper bound, it is trivial that we cannot have more than N groups, as the algorithm starts by assigning one group to each agent. Therefore, we have:

M ≤ N    (7)

Consider the case in which N is greater than k+1 and all the agents are positioned on a circle, equally distanced from each other (see Figure 5). In this case, each agent generates a distinct group with its nearest neighbors on its right and its left. This scenario shows a rare example in which we actually end up with N groups, and it shows that we cannot derive an upper bound lower than the one presented in equation (1).
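This upper-bound construction can be checked numerically (our code, assuming a Euclidean metric; the grouping logic is re-implemented here for self-containment). N agents equally spaced on a circle with k = 2 indeed produce N distinct groups:

```python
import numpy as np

def num_groups_on_circle(n, k=2):
    """Count the unique k-nearest groups for n agents equally spaced on
    the unit circle. With k = 2, each agent forms a group with its left
    and right neighbors, so every group is distinct."""
    theta = 2 * np.pi * np.arange(n) / n
    pts = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    groups = {tuple(sorted(np.argsort(d[i])[: k + 1])) for i in range(n)}
    return len(groups)
```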