Cooperative tasks achieved by a group of agents form one of the attractive topics in the study of autonomous systems. Lee et al. (2018); Hu et al. (2020) Since each agent is likely to have individual biases in its actuator or sensor performance, an important ability for the autonomy is to analyse these inherent biases and to revise the control plan appropriately so that the group mission can continue. Such biases vary dynamically during the mission as time degradation proceeds, sometimes growing into a failure of some functionality of an agent. To compose proper updates of the plan, the origin of the bias should be identified.
Suppose a command base controlling a group of agents, and suppose the base has detected a bias in the position of one agent (e.g., no change along one direction is observed at all). There are two possibilities for the observed bias: one due to an actuator failure (the agent is not capable of moving), and another due to a sensor failure (it is capable of moving, but the move is not captured). Depending on the hypotheses (the failure occurs in the actuators, $h_A$, or in the sensors, $h_S$), the updated plan calibrated for the failure would be quite different. However, it is generally difficult to identify which factor causes the bias only through communication between the base and the agent, because the agent itself is unreliable. A quick idea for the identification is a group action assisted by other agents, making a collision with the failed agent. The collision surely displaces the agent, and the displacement should be detected unless the sensor has failed. In this way, we can identify which hypothesis is correct by planning a group motion. Rather than being planned by humans, such group motions are preferably designed autonomously by the system, as a form of 'the strategy to acquire environmental information'. Friston (2010)
If we can define a proper value function (say, $F$) to measure the ability to capture the distinction between the hypotheses, a proper group motion could be generated so as to maximize $F$. Note, however, that among all possible combinations of group actions, those giving finite $F$ form quite a tiny subset; most of the possible actions give $F = 0$. Namely, the subspaces with finite $F$ exist only sparsely in the whole state space (sparse rewards). In such cases, gradient-based optimizations do not work well to propose proper action plans. For such sparse-reward optimizations, reinforcement learning can be used as an effective alternative. The learning has intensively been investigated in applications to autonomous systems. Huang et al. (2005); Xia and El Kamel (2016); Zhu et al. (2018); Hu et al. (2020)
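The failure of gradient-based search on such a sparse landscape can be illustrated with a minimal Python toy (the one-dimensional function and all numbers below are illustrative assumptions, not the system studied here):

```python
def value(x):
    # Toy value function: nonzero only inside a tiny pocket of the state
    # space, mimicking the sparse-reward landscape described in the text.
    return 1.0 if abs(x - 0.5) < 1e-3 else 0.0

def finite_diff_grad(f, x, eps=1e-2):
    # Central finite-difference estimate of df/dx.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Sampled on a grid over [0, 1], the estimated gradient is exactly zero
# almost everywhere, so gradient ascent has no direction to follow.
signal = sum(abs(finite_diff_grad(value, i / 100)) > 0 for i in range(101))
print(signal)  # only the few grid points adjacent to the pocket give a signal
```

Almost every starting point sees a flat landscape, which is precisely why the text turns to reinforcement learning instead.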
Reinforcement learning Nachum et al. (2018); Sutton and Barto (2018); Barto (2002) forms an established field, applied to robotics and system control. Peng et al. (2018); Finn and Levine (2017) Methodological improvements have been intensively studied, especially through verifications on gaming platforms. Mnih et al. (2015); Silver et al. (2017); Vinyals et al. (2019) The topic treated in this research belongs to the subfield entitled multi-agent reinforcement learning (MARL). Busoniu et al. (2006); Gupta et al. (2017); Straub et al. (2020); Bihl et al. (2020a); Gronauer and Diepold (2021) As specific examples of multi-agent missions, unmanned aerial vehicles (UAV), Bihl et al. (2020a); Straub et al. (2020) sensor resource management (SRM), Malhotra et al. (2017, 1997); Hero and Cochran (2011); Bihl et al. (2020a) etc. are being considered. The problem addressed in this study can also be regarded as a form of the problems dealing with non-stationary environments in multi-agent reinforcement learning. Nguyen et al. (2020); Foerster et al. (2017) As time evolves, agents would lose homogeneity through disabilities occurring in some of them, becoming heterogeneous. Coping with such heterogeneity in multi-agent reinforcement learning has also been discussed. Busoniu et al. (2006); Calvo and Dusparic (2018); Bihl et al. (2020a); Straub et al. (2020); Gronauer and Diepold (2021) The problem of sparse rewards has been recognized and discussed as one of the current challenges in reinforcement learning. Wang and Taylor (2017); Bihl et al. (2020a)
As a prototype of such a problem, we considered a system composed of three agents moving on an $(x, y)$-plane, administrated by a command base to perform a cooperative task. As the task, each agent is asked to convey an item to a goal post individually. The third agent (#3) is assumed to have an actuator failure, being incapable of moving along one direction. By a quick check with tiny displacements given to each agent, the command base can detect the problem existing in #3, but cannot identify whether it is due to the actuators or the sensors. Consequently, the base sets up the two hypotheses, $h_A$ (actuator failure) and $h_S$ (sensor failure), and starts planning the best cooperative motion to distinguish the hypotheses by using reinforcement learning. We observed that the learning actually concluded that a collision with #3 is the best way to recognize the failure situation. By the collision, the base could identify that #3 has the problem in its actuators, not its sensors. The base then starts planning the group motions to achieve the conveying task, taking into account the disability of #3. We observed the learning conclude such a cooperation that the other agents behave as if they assist in compensating the disability of #3 by pushing it toward the goal.
Let the state space for the agents be $\mathcal{S}$. For instance, for three agents ($i = 1, 2, 3$) on an $(x, y)$-plane with positions $\mathbf{r}_i = (x_i, y_i)$, their state can be specified as $s = (\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3)$, a point in the six-dimensional space. The state is driven by a command $a$ according to the operation plan generated in the command base. When a command $a$ is issued for a given $s$, the state is updated depending on which hypothesis is taken, each of which restricts the update by an individual constraint:
$$ s' = U_h(s, a), \quad h \in \{h_A, h_S\}, $$
where $h_A$ ($h_S$) denotes the hypothesis of an actuator (sensor) failure and $U_h$ is the update rule constrained by that hypothesis.
The difference between the two predicted states,
$$ F_\alpha(s, a) = \left| U_{h_A}(s, a) - U_{h_S}(s, a) \right|, $$
can then be the measure to evaluate the performance in distinguishing the hypotheses. The best operation plan for the distinction should therefore be determined as the one maximizing $F_\alpha$ along the generated trajectory.
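As a concrete sketch of this construction for a single agent with a broken $x$-channel, the Python fragment below implements the two hypothesis-constrained observation updates and takes the distance between their predictions as the distinguishing measure; the function names and the simplified one-step dynamics are illustrative assumptions, not the authors' implementation:

```python
def observed_update_actuator_failure(state, command, push=(0.0, 0.0)):
    # Hypothesis h_A: the actuator is broken along x, the sensor works.
    # Commanded x-moves are not executed, but a displacement forced by a
    # collision (push) is real and is correctly reported by the sensor.
    x, y = state
    _dx, dy = command
    px, py = push
    return (x + px, y + dy + py)

def observed_update_sensor_failure(state, command, push=(0.0, 0.0)):
    # Hypothesis h_S: the actuator works, the x-sensor is broken.
    # The agent really moves, but no x-displacement is ever reported,
    # whether it comes from its own actuator or from a collision.
    x, y = state
    _dx, dy = command
    _px, py = push
    return (x, y + dy + py)

def f_alpha(state, command, push=(0.0, 0.0)):
    # Distinguishing measure: distance between the two predicted observations.
    ax, ay = observed_update_actuator_failure(state, command, push)
    sx, sy = observed_update_sensor_failure(state, command, push)
    return ((ax - sx) ** 2 + (ay - sy) ** 2) ** 0.5

# A plain move command cannot separate the hypotheses,
# while a collision pushing the agent along x can:
print(f_alpha((0.0, 0.0), (1.0, 0.0)))                   # -> 0.0
print(f_alpha((0.0, 0.0), (0.0, 0.0), push=(1.0, 0.0)))  # -> 1.0
```

This makes the sparsity explicit: only the rare actions that produce a collision give a finite value of the measure.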
The naive idea of performing the optimization using gradient-based methods does not work well because of the sparseness explained in the Introduction: $F_\alpha = 0$ for most of the $(s, a)$ pairs, and then the gradient is zero almost everywhere, being incapable of choosing the next update. We therefore take reinforcement learning as the alternative for the optimization.
Reinforcement learning assumes a value function $Q(s, a)$ which measures the gain obtained by taking the operation $a$ at a state $s$. The learning optimizes the decision so that it maximizes not the temporal but the long-standing benefit,
$$ Q(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \right], $$
which is the estimate of the accumulated gain toward the future, with $r(s, a)$ the immediate reward. The benefit is evaluated in a self-consistent manner (Bellman equation) as Sutton and Barto (2018)
$$ Q(s, a) = r(s, a) + \gamma \sum_{s', a'} P(s', a' \mid s, a)\, Q(s', a'), $$
where the second term sums over all possible states ($s'$) and actions ($a'$) subsequent to the present choice $(s, a)$. The accumulated benefit is composed as a linear combination over the powers of the discount factor $\gamma$ ($0 < \gamma < 1$), representing how the contributions get reduced over time toward the future.
$Q(s, a)$ is regarded as a table ($Q$-table) with $s$ and $a$ as rows and columns. Starting from random numbers as the initial values on the table, the self-consistent iterations update the table as follows. For the random initial values, the temporary decision for the initial state $s_0$ is made formally by
$$ a_0 = \mathop{\mathrm{argmax}}_a\, Q(s_0, a). $$
By this $a_0$, a 'point' at $(s_0, a_0)$ on the $Q$-table is updated from the previous random value as
$$ Q(s_0, a_0) \leftarrow r(s_0, a_0) + \gamma \sum_{s', a'} P(s', a' \mid s_0, a_0)\, Q(s', a'), $$
where $Q(s', a')$ referred to in the second term is still filled with the random numbers. The operation then promotes the state as $s_0 \to s_1$. For the updated $s_1$, the similar procedures,
$$ a_1 = \mathop{\mathrm{argmax}}_a\, Q(s_1, a), \qquad Q(s_1, a_1) \leftarrow r(s_1, a_1) + \gamma \sum_{s', a'} P(s', a' \mid s_1, a_1)\, Q(s', a'), $$
are repeated. As such, the $Q$-table is updated in a patchwork manner, with sensible values replacing the initial random numbers. Assisted by neural-network interpolation, the values over the whole range of the table are filled and then converged by the self-consistent iteration to give the final table. In the implementation, a user specifies the forms of the state, the operations, and the reward, providing them to the package. In this study, we used the OpenAI Gym package. Brockman et al. (2016)
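The patchwork update described above can be sketched in a few lines of Python; the toy chain environment, the epsilon-greedy exploration, and all parameter values below are illustrative assumptions standing in for the actual agent system:

```python
import random

random.seed(0)

# Toy chain MDP: states 0..4, actions {0: left, 1: right}; reaching state 4
# yields reward 1 (sparse), standing in for the agents' state space.
N_STATES, ACTIONS, GOAL = 5, (0, 1), 4

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

# Q-table initialized with random numbers, as in the text.
Q = {(s, a): random.random() for s in range(N_STATES) for a in ACTIONS}

gamma, lr, eps = 0.9, 0.5, 0.2
for _ in range(2000):
    s = random.randrange(N_STATES - 1)
    for _ in range(20):
        # epsilon-greedy version of a = argmax_a Q(s, a)
        a = random.choice(ACTIONS) if random.random() < eps \
            else max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r = step(s, a)
        # self-consistent (Bellman) patchwork update of one table entry
        target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += lr * (target - Q[(s, a)])
        s = s2
        if s == GOAL:
            break

# After convergence, the greedy policy from any state heads for the goal.
policy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N_STATES)]
print(policy)  # states 0..3 should all choose "right" (action 1)
```

The sensible values propagate backward from the sparse reward exactly as in the patchwork picture: entries near the goal are fixed first, then entries further away through the second term of the update.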
Denoting the converged table as $Q^*(s, a)$, we can generate the series of operations to update the state as
$$ a_t = \mathop{\mathrm{argmax}}_a\, Q^*(s_t, a), \qquad s_{t+1} = U(s_t, a_t). $$
The workflow to achieve the mission for the agents described in the last paragraph of the Introduction is as follows:
[Step 1] To examine whether errors are found in any agents, the base issues the commands to move each agent by tiny displacements (and consequently #3 is found to have the error).
[Step 2] Corresponding to each possible hypothesis ($h_A$, actuator failure, and $h_S$, sensor failure), virtual spaces are prepared, each with its constraint imposed.
[Step 3] Reinforcement learning (learning-$\alpha$) is performed at the command base using the virtual spaces, yielding the operation plan $P_\alpha$ designed to distinguish the hypotheses.
[Step 4] The plan $P_\alpha$ is performed by the agents. The command base compares the observed trajectory with those obtained in the virtual spaces at Step 3. By the comparison, the hypothesis giving the trajectory closer to the observed one is identified as what actually happened ($h_A$ in this case).
[Step 5] Taking the virtual space of the identified hypothesis, another learning (learning-$\beta$) is performed to get the optimal plan $P_\beta$ for the original mission (conveying the items to the goal posts in this case).
[Step 6] The agents are operated according to the plan $P_\beta$.
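The identification logic of Step 4 amounts to a nearest-trajectory test. A minimal sketch, with made-up trajectories standing in for the virtual-space predictions (the distance metric and all coordinates are illustrative assumptions):

```python
def trajectory_distance(traj_a, traj_b):
    # Sum of Euclidean distances between corresponding trajectory points.
    return sum(((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5
               for (xa, ya), (xb, yb) in zip(traj_a, traj_b))

def identify(observed, virtual_by_hypothesis):
    # Pick the hypothesis whose virtual trajectory lies closest to the
    # actually observed one (Step 4 of the workflow).
    return min(virtual_by_hypothesis,
               key=lambda h: trajectory_distance(observed, virtual_by_hypothesis[h]))

# Made-up example trajectories for agent #3 under the two hypotheses:
virtual = {
    "h_A (actuator failure)": [(0, 0), (1, 0), (1, 0)],  # pushed once, sensor reports it
    "h_S (sensor failure)":   [(0, 0), (0, 0), (0, 0)],  # displacement never reported
}
observed = [(0, 0), (1, 0), (1, 1)]
print(identify(observed, virtual))  # -> h_A (actuator failure)
```

Any noise in the real trajectory is tolerated as long as it stays smaller than the separation that the collision induces between the two virtual predictions.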
All the learnings and operations (as simulations, not by real machines) are performed on a Linux server. The learning part is the most time-consuming, taking around half a day using a single processor without any parallelization in this study. For the learning, we used the PPO2 (proximal policy optimization, version 2) algorithm Schulman et al. (2015) with OpenAI Gym Brockman et al. (2016) on an LSTM (long short-term memory) network structure. We did not make any specific tuning of hyperparameters, keeping the default settings, though it has been pointed out that hyperparameter optimization (HPO) can significantly improve reinforcement-learning performance. Henderson et al. (2018); Straub et al. (2020); Bihl et al. (2020b); Snoek et al. (2012); Domhan et al. (2015); Bihl et al. (2020a); Young et al. (2020)
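In OpenAI Gym's convention, the user supplies the environment as a class with `reset` and `step` methods returning the state, the reward, and a termination flag. A schematic, self-contained skeleton of how one virtual space could be wrapped is shown below; the class and variable names are our illustrative assumptions, and Gym-specific details (subclassing `gym.Env`, declaring `observation_space`/`action_space` for PPO2) are omitted to keep the sketch dependency-free:

```python
class VirtualSpaceEnv:
    """Gym-style wrapper of one virtual space (one hypothesis).

    Schematic only: a real OpenAI Gym environment would subclass gym.Env
    and declare observation_space / action_space for the RL package.
    """

    def __init__(self, update_fn, value_fn, initial_state, horizon=50):
        self.update_fn = update_fn        # (s, a) -> s' under the hypothesis
        self.value_fn = value_fn          # reward, e.g. F_alpha or F_beta
        self.initial_state = initial_state
        self.horizon = horizon

    def reset(self):
        self.state = self.initial_state
        self.t = 0
        return self.state

    def step(self, action):
        self.state = self.update_fn(self.state, action)
        self.t += 1
        reward = self.value_fn(self.state, action)
        done = self.t >= self.horizon
        return self.state, reward, done, {}

# Tiny usage example with dummy one-dimensional dynamics:
env = VirtualSpaceEnv(update_fn=lambda s, a: s + a,
                      value_fn=lambda s, a: 1.0 if s == 3 else 0.0,
                      initial_state=0, horizon=3)
s = env.reset()
for a in (1, 1, 1):
    s, r, done, _ = env.step(a)
print(s, r, done)  # -> 3 1.0 True
```

Under this convention the algorithm (PPO2 in the text) only ever interacts with the environment through `reset` and `step`, so swapping the virtual space of one hypothesis for another requires no change on the learning side.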
The learning at Step 3 is performed using two virtual spaces corresponding to the hypotheses $h_A$ (actuator failure) and $h_S$ (sensor failure). Each virtual state can evolve only within the constraint of its hypothesis (e.g., the position along the disabled direction cannot be updated due to the actuator error). For an operation $a$, the state on each virtual space is updated as
$$ s'_h = U_h(s, a), \quad h \in \{h_A, h_S\}. $$
Taking the value function
$$ F_\alpha(s, a) = \left| U_{h_A}(s, a) - U_{h_S}(s, a) \right|, $$
the two-fold $Q$-table is updated self-consistently as
$$ Q(s, a) \leftarrow F_\alpha(s, a) + \gamma \sum_{s', a'} P(s', a' \mid s, a)\, Q(s', a'), $$
where the transition is evaluated on both virtual spaces. Denoting the converged table as $Q^*_\alpha$, the sequence of operations $P_\alpha = \{a_0, a_1, \ldots\}$ is obtained by the greedy selection described above.
The operation sequence generates the two-fold sequence of (virtual) state evolutions,
$$ s_0 \to s_1^{(h)} \to s_2^{(h)} \to \cdots, \quad h \in \{h_A, h_S\}, $$
as shown in Fig. 1(a).
At Step 4, the agents are operated according to the plan $P_\alpha$, updating the (real) states along the actually observed trajectory.
At Step 5, the learning is performed for another value function, $F_\beta$, which earns a higher score as the agents get closer to their goal posts (scaling as the inverse of the distance), as well as a big bonus when they arrive at the goals. The operation sequence $P_\beta$ is then obtained in the same manner,
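The mission value function of Step 5 can be sketched as follows; the inverse-distance scaling and the arrival bonus are stated in the text, while the bonus size, the arrival radius, and the agent-to-goal assignment are our illustrative assumptions:

```python
def f_beta(agent_positions, goal_positions, arrival_radius=0.1, bonus=100.0):
    # Higher score as each agent approaches its own goal post (inverse of
    # the distance), plus a large bonus once it has arrived.
    score = 0.0
    for (x, y), (gx, gy) in zip(agent_positions, goal_positions):
        d = ((x - gx) ** 2 + (y - gy) ** 2) ** 0.5
        score += bonus if d < arrival_radius else 1.0 / d
    return score

goals = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
far   = [(5.0, 5.0), (5.0, 5.0), (5.0, 5.0)]
near  = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
print(f_beta(far, goals) < f_beta(near, goals))  # -> True
```

Because the score is summed over all agents, pushing the disabled #3 toward its goal raises the team's benefit, which is what makes the assisting behavior described below profitable for the learner.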
by which the states of the agents are updated as shown in Fig. 1(b).
Fig. 1(a) shows the two-fold trajectories corresponding to the hypotheses $h_A$ and $h_S$. While no branching appears for agent #1, branching occurs for #2 during the operations. The branching surely earns the score of the value function $F_\alpha$, implying that the learning has properly been performed so as to become capable of capturing the difference between $h_A$ and $h_S$. The red dotted circle shown in (a) actually marks the collision between #2 and #3, which induces the difference between $h_A$ and $h_S$ (note that the trajectories show only the center positions of the agents; since each agent has a finite radius as its size, the trajectories themselves do not intersect even when the collision occurs). We note that the collision strategy is never given in a rule-based manner. The agents take this strategy autonomously, as deduced from the reinforcement learning.
The three closed square symbols located at the corners of a triangle in Fig. 1 are the goal posts for the conveying mission. Fig. 1(b) shows the real trajectories for the mission, where the initial locations of the agents are the final locations in panel (a). From these initial locations, the agents #1 and #2 immediately arrived at their goals and departed from them after the completion of the task. It is observed that #1 then collides with #3 and keeps on pushing it so that #3 can get closer to its goal. Though this behavior is just the consequence of earning more from the value function $F_\beta$, it is interesting that it looks as if agent #1 wants to assist the disabled #3 cooperatively. With the constraint on the agents identified by learning-$\alpha$, the consequent learning-$\beta$ is confirmed to generate the optimal operation plans so that the team can earn the best benefit by their cooperative behavior, as if an autonomous decision by the team had been made.
Agents performing a group mission generally face the possibility of suffering from errors during the mission. For the cause of an observed error, multiple hypotheses are possible. Some cooperative behaviors, such as a collision between agents, would be capable of identifying the cause of the error. We considered autonomous planning of such group behaviors by using a machine-learning technique. Different hypotheses for the cause lead to different states expected as updates from the same initial state by the same operation. The larger the difference, the better the operation plan is at distinguishing the hypotheses for the cause. Namely, the magnitude of the difference can serve as the value function to optimize the desired operation plan. Gradient-based optimizations do not work well because only a tiny fraction among the vast possible operations (e.g., collisions) can capture the difference, leading to a sparse distribution of finite values of the value function. As we demonstrated, reinforcement learning is the appropriate choice for such problems. The optimal plan concluded by the reinforcement learning actually became an operation that caused the agents to collide with each other. Having identified the cause by this plan, a revised mission plan incorporating the failure was developed by another learning, in which the agents without failures appear to help the disabled agent.
The computations in this work have been performed using the facilities of the Research Center for Advanced Computing Infrastructure at JAIST. R.M. is grateful for financial support from MEXT-KAKENHI (19H04692 and 16KK0097) and from the Air Force Office of Scientific Research (AFOSR-AOARD/FA2386-17-1-4049; FA2386-19-1-4015).
- Lee et al. (2018) H. Lee, H. Kim, and H. J. Kim, IEEE Transactions on Automation Science and Engineering 15, 189 (2018).
- Hu et al. (2020) J. Hu, H. Niu, J. Carrasco, B. Lennox, and F. Arvin, IEEE Transactions on Vehicular Technology 69, 14413 (2020).
- Friston (2010) K. Friston, Nature Reviews Neuroscience 11, 127 (2010).
- Huang et al. (2005) B.-Q. Huang, G.-Y. Cao, and M. Guo, in 2005 International Conference on Machine Learning and Cybernetics, Vol. 1 (2005) pp. 85–89.
- Xia and El Kamel (2016) C. Xia and A. El Kamel, Robotics and Autonomous Systems 84, 1 (2016).
- Zhu et al. (2018) D. Zhu, T. Li, D. Ho, C. Wang, and M. Q.-H. Meng, in 2018 IEEE International Conference on Robotics and Automation (ICRA) (2018) pp. 7548–7555.
- Nachum et al. (2018) O. Nachum, S. Gu, H. Lee, and S. Levine, in Proceedings of the 32nd International Conference on Neural Information Processing Systems (2018) pp. 3307–3317.
- Sutton and Barto (2018) R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. (The MIT Press, 2018).
- Barto (2002) A. G. Barto, in The Handbook of Brain Theory and Neural Networks, Second Edition, edited by M. A. Arbib (The MIT Press, Cambridge, MA, 2002) pp. 963–972.
- Peng et al. (2018) X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, in 2018 IEEE International Conference on Robotics and Automation (ICRA) (2018) pp. 3803–3810.
- Finn and Levine (2017) C. Finn and S. Levine, in 2017 IEEE International Conference on Robotics and Automation (ICRA) (2017) pp. 2786–2793.
- Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, Nature 518, 529 (2015).
- Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Nature 550, 354 (2017).
- Vinyals et al. (2019) O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver, Nature 575, 350 (2019).
- Busoniu et al. (2006) L. Busoniu, R. Babuska, and B. De Schutter, in 2006 9th International Conference on Control, Automation, Robotics and Vision (2006) pp. 1–6.
- Gupta et al. (2017) J. K. Gupta, M. Egorov, and M. Kochenderfer, in Autonomous Agents and Multiagent Systems, edited by G. Sukthankar and J. A. Rodriguez-Aguilar (Springer International Publishing, Cham, 2017) pp. 66–83.
- Straub et al. (2020) K. M. Straub, B. Bontempo, F. Jones, A. M. Jones, P. Farr, and T. Bihl, Sensor Resource Management using Multi-Agent Reinforcement Learning with Hyperparameter Optimization, Tech. Rep. (2020), white paper.
- Bihl et al. (2020a) T. Bihl, P. Farr, K. M. Straub, B. Bontempo, and F. Jones, in Hawaii International Conference on Systems Sciences 2022 (2020) submitted.
- Gronauer and Diepold (2021) S. Gronauer and K. Diepold, Artificial Intelligence Review (2021), 10.1007/s10462-021-09996-w.
- Malhotra et al. (2017) R. P. Malhotra, M. J. Pribilski, P. A. Toole, and C. Agate, in Micro- and Nanotechnology Sensors, Systems, and Applications IX, Vol. 10194, edited by T. George, A. K. Dutta, and M. S. Islam, International Society for Optics and Photonics (SPIE, 2017) pp. 403 – 414.
- Malhotra et al. (1997) R. Malhotra, E. Blasch, and J. Johnson, in Proceedings of the IEEE 1997 National Aerospace and Electronics Conference. NAECON 1997, Vol. 2 (1997) pp. 769–776 vol.2.
- Hero and Cochran (2011) A. O. Hero and D. Cochran, IEEE Sensors Journal 11, 3064 (2011).
- Nguyen et al. (2020) T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, IEEE Transactions on Cybernetics 50, 3826 (2020).
- Foerster et al. (2017) J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, and S. Whiteson, in Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, edited by D. Precup and Y. W. Teh (PMLR, 2017) pp. 1146–1155.
- Calvo and Dusparic (2018) J. Calvo and I. Dusparic, in Proc. 26th Irish Conf. Artif. Intell. Cogn. Sci. (2018) pp. 1–12.
- Wang and Taylor (2017) Z. Wang and M. E. Taylor, in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17 (2017) pp. 3027–3033.
- Brockman et al. (2016) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” (2016), arXiv:1606.01540 .
- Schulman et al. (2015) J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, in Proceedings of the 32nd International Conference on Machine Learning - Volume 37 (2015) pp. 1889–1897.
- Henderson et al. (2018) P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, in AAAI (2018).
- Bihl et al. (2020b) T. J. Bihl, J. Schoenbeck, D. Steeneck, and J. Jordan, in 53rd Hawaii International Conference on System Sciences, HICSS 2020, Maui, Hawaii, USA, January 7-10, 2020 (ScholarSpace, 2020) pp. 1–10.
- Snoek et al. (2012) J. Snoek, H. Larochelle, and R. P. Adams, in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2 (2012) pp. 2951–2959.
- Domhan et al. (2015) T. Domhan, J. T. Springenberg, and F. Hutter, in Proceedings of the 24th International Conference on Artificial Intelligence (2015) pp. 3460–3468.
- Young et al. (2020) M. T. Young, J. D. Hinkle, R. Kannan, and A. Ramanathan, Journal of Parallel and Distributed Computing 139, 43 (2020).