Reinforcement learning autonomously identifying the source of errors for agents in a group mission

07/20/2021 ∙ by Keishu Utimula, et al. ∙ Apple, Inc. 0

When a swarm of agents carries out a mission, the command base often observes a sudden failure of some of the agents. It is generally difficult to tell whether the failure is caused by actuators (hypothesis $h_a$) or by sensors (hypothesis $h_s$) solely from the communication between the command base and the agent concerned. By having another agent collide with the failed agent, we can distinguish which hypothesis is more likely: for $h_a$ we expect to detect the corresponding displacement, while for $h_s$ we do not. Such swarm strategies for grasping the situation should preferably be generated autonomously by artificial intelligence (AI). Preferable actions for the distinction (e.g., the collision) are those that maximize the difference between the behaviors expected under each hypothesis, which serves as a value function. Such actions, however, exist only very sparsely among all possibilities, so a conventional search based on gradient methods is ineffective. Instead, we have successfully applied reinforcement learning to maximize such a sparse value function. The learning autonomously arrived at the colliding action to distinguish the hypotheses. Once the agent with the actuator error has been identified by this action, the agents behave as if the others want to assist the malfunctioning one in achieving the given mission.




I Introduction

Cooperative tasks performed by a group of agents are an attractive topic in the context of autonomous systems. Lee et al. (2018); Hu et al. (2020) Since each agent is likely to have individual biases in its actuator or sensor performance, an autonomous system needs the ability to analyse these inherent biases and to revise the control plan appropriately so that the group mission can continue. Such biases vary dynamically during the mission through degradation over time, sometimes growing into a failure of some functionality of an agent. To compose proper updates of the plan, the origin of the bias should be identified.

Suppose that a command base controls a group of agents and has detected a bias in the position of an agent (e.g., no change observed at all along one coordinate direction). There are two possible causes for the observed bias: actuator failure (the agent cannot move) or sensor failure (the agent can move, but the move is not captured). Depending on the hypothesis [the failure occurs in the actuators ($h_a$) or in the sensors ($h_s$)], the updated plan calibrated for the failure would be quite different. However, it is generally difficult to identify which factor causes the bias only through the communication between the base and the agent, because the agent itself is unreliable. A quick idea for the identification is to arrange a group action assisted by other agents, in which another agent collides with the failed agent. The collision certainly displaces the agent, and the displacement should be detected unless the sensor has failed. In this way, we can identify which hypothesis is correct by planning a group motion. Rather than being planned by humans, such group motions should preferably be designed autonomously by the system, as a form of ‘strategy to acquire environmental information’. Friston (2010)

If we can define a proper value function (say, $\rho$) that measures the ability to capture the distinction between the hypotheses, the proper group motion can be generated so as to maximize $\rho$. Note, however, that among all possible combinations of group actions, those with finite $\rho$ form quite tiny subgroups; most of the possible actions give $\rho = 0$. Namely, the subspaces with finite $\rho$ are distributed sparsely in the whole state space (sparse rewards). In such cases, gradient-based optimizations do not work well for proposing proper action plans. For such sparse-reward optimizations, reinforcement learning can be used as an effective alternative. Reinforcement learning has been investigated intensively for applications to autonomous systems. Huang et al. (2005); Xia and El Kamel (2016); Zhu et al. (2018); Hu et al. (2020)

Reinforcement learning Nachum et al. (2018); Sutton and Barto (2018); Barto (2002) is an established field that has been applied to robotics and system control. Peng et al. (2018); Finn and Levine (2017) Methodological improvements have been studied intensively, especially through verification on gaming platforms. Mnih et al. (2015); Silver et al. (2017); Vinyals et al. (2019) The topic dealt with in this research belongs to the emerging subfield called multi-agent reinforcement learning (MARL). Busoniu et al. (2006); Gupta et al. (2017); Straub et al. (2020); Bihl et al. (2020a); Gronauer and Diepold (2021) As specific examples of multi-agent missions, unmanned aerial vehicles (UAV), Bihl et al. (2020a); Straub et al. (2020) sensor resource management (SRM), Malhotra et al. (2017, 1997); Hero and Cochran (2011); Bihl et al. (2020a) etc. are being considered. The problem addressed in this study can also be regarded as a form of the problems dealing with non-stationary environments in multi-agent reinforcement learning. Nguyen et al. (2020); Foerster et al. (2017) As time evolves, agents lose their homogeneity through disabilities occurring in some of them and become heterogeneous. Coping with such heterogeneity in multi-agent reinforcement learning has also been discussed. Busoniu et al. (2006); Calvo and Dusparic (2018); Bihl et al. (2020a); Straub et al. (2020); Gronauer and Diepold (2021) The problem of sparse rewards has been recognized and discussed as one of the current challenges in reinforcement learning. Wang and Taylor (2017); Bihl et al. (2020a)

As a prototype of such a problem, we considered a system of three agents moving on an $(x, y)$-plane, administered by a command base, to perform a cooperative task. As the task, each agent is asked to convey an item to a goal post individually. The third agent (#3) is assumed to have an actuator failure that makes it incapable of moving along one coordinate direction. By a quick check with tiny displacements given to each agent, the command base can detect that a problem exists in #3, but cannot identify whether it is due to the actuators or the sensors. Consequently, the base sets up the hypotheses $h_a$ and $h_s$ and starts planning the best cooperative motions to distinguish them by using reinforcement learning. We observed that the learning indeed arrived at a collision with #3 as the best action for recognizing the failure situation. Through the collision, the base could identify that #3 has a problem in its actuators, not its sensors. The base then starts planning the group motions to achieve the conveying task, taking into account the disability of #3. We observed that the learning arrived at a cooperation in which the other agents behave as if they were assisting #3, compensating for its disability by pushing it toward the goal.

II Notations

Let the state space for the agents be $\mathcal{R}$. For instance, for three agents ($j = 1, 2, 3$) on an $xy$-plane, each at position $(x_j, y_j)$, their state can be specified as $\vec{R} = (x_1, y_1, x_2, y_2, x_3, y_3)$, a point in the six-dimensional space. The state is driven by a command $g$ according to the operation plan generated in the command base. When $g$ is issued for a given $\vec{R}$, the state is updated depending on which hypothesis $h_l$ ($l = a, s$) is taken, each of which restricts the update by its own constraint:

$$ g : \vec{R} \to \tilde{\vec{R}}^{(h_l)}(g, \vec{R}) . $$

The difference,

$$ D(g, \vec{R}) = \left\| \tilde{\vec{R}}^{(h_a)}(g, \vec{R}) - \tilde{\vec{R}}^{(h_s)}(g, \vec{R}) \right\| , $$

can then be the measure to evaluate the performance in distinguishing the hypotheses. The best operation plan for the distinction should therefore be determined as

$$ g^{*} = \arg\max_{g} D(g, \vec{R}) . $$

The naive idea of performing the optimization with gradient-based methods does not work well because of the sparseness explained in Sec. Introduction: for most $g$, $D(g, \vec{R}) = 0$, so the gradient vanishes for most $g$ and gives no guidance for choosing the next update. We therefore adopt reinforcement learning for the optimization instead.
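As a minimal sketch of why such a difference measure is sparse, the following Python toy places a healthy pusher and a failed target on a line. Everything in it is an assumption made for illustration and not taken from the study: the one-dimensional geometry, the contact distance `CONTACT`, the function names, and the rule that the sensor reading simply stays frozen under $h_s$. Under $h_a$ the working sensor reports the push, so the two hypotheses disagree only for the few moves that actually reach the target.

```python
CONTACT = 1.0  # assumed centre-to-centre distance at which two agents touch


def observed_target(hypothesis, pusher_x, target_x, move):
    """Target position reported to the base after the pusher moves by `move`.

    The pusher is assumed to start to the left of the target.
    """
    # True position: the target is shoved forward only if the pusher reaches it.
    true_x = max(target_x, pusher_x + move + CONTACT)
    if hypothesis == "h_a":
        # Actuator failure: the sensor works, so the push is reported.
        return true_x
    if hypothesis == "h_s":
        # Sensor failure: the reading stays frozen at the old value.
        return target_x
    raise ValueError(f"unknown hypothesis: {hypothesis}")


def discrepancy(pusher_x, target_x, move):
    """Difference between the observations expected under h_a and h_s."""
    seen_a = observed_target("h_a", pusher_x, target_x, move)
    seen_s = observed_target("h_s", pusher_x, target_x, move)
    return abs(seen_a - seen_s)


# Scan all candidate moves of the pusher (start at 0, target at 8).
scores = {move: discrepancy(0.0, 8.0, move) for move in range(-10, 11)}
informative = [move for move, value in scores.items() if value > 0]
print(informative)
```

In this toy only 3 of the 21 candidate moves give a nonzero difference, so a gradient evaluated at a typical move carries no information about where the informative moves lie.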

Reinforcement learning assumes a value function $\rho(\vec{R}, g)$ that measures the gain obtained by taking the operation $g$ for a state $\vec{R}$. The learning optimizes the decision so as to maximize not the immediate gain $\rho(\vec{R}, g)$ but the long-term benefit $Q(\vec{R}, g)$, which is the estimate of the gain accumulated over the future. The benefit is evaluated in a self-consistent manner (Bellman equation) as Sutton and Barto (2018)

$$ Q(\vec{R}, g) = \rho(\vec{R}, g) + F\left( \left\{ Q(\vec{R}', g') \right\} \right) , $$

where the second term sums over all possible states ($\vec{R}'$) and actions ($g'$) subsequent to the present choice $(\vec{R}, g)$. The function $F$ is composed as a linear combination over $Q(\vec{R}', g')$, representing how the contributions get reduced over time toward the future.
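For orientation, one standard special case of this self-consistent relation is the Bellman expectation equation of Sutton and Barto (2018). It is written here with symbols that are assumptions of this sketch rather than the notation of the study: $Q$ for the long-term benefit, $\rho$ for the immediate value function, a discount factor $0 \le \gamma < 1$, a transition probability $P$, and a decision rule $\pi$:

$$ Q(\vec{R}, g) = \rho(\vec{R}, g) + \gamma \sum_{\vec{R}'} P(\vec{R}' \mid \vec{R}, g) \sum_{g'} \pi(g' \mid \vec{R}') \, Q(\vec{R}', g') . $$

The second term is a linear combination of the later $Q$ values, and iterating the relation weights a gain obtained $k$ steps ahead by $\gamma^{k}$, which is one concrete way in which contributions are reduced toward the future.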

$Q(\vec{R}, g)$ is regarded as a table ($Q$-table) with $\vec{R}$ and $g$ as its rows and columns. Starting from random numbers as the initial values on the table, the self-consistent iterations update the table as follows. For the random initial values, the tentative decision for the initial state $\vec{R}_0$ is made formally by $g_0 = \arg\max_{g} Q(\vec{R}_0, g)$. With this $g_0$, the ‘point’ at $(\vec{R}_0, g_0)$ on the $Q$-table is updated from the previous random value as

$$ Q(\vec{R}_0, g_0) = \rho(\vec{R}_0, g_0) + F\left( \left\{ Q(\vec{R}', g') \right\} \right) , $$

where $Q(\vec{R}', g')$ referred to in the second term is still filled with random numbers. The operation $g_0$ then advances the state as $g_0 : \vec{R}_0 \to \vec{R}_1$. For the updated $\vec{R}_1$, similar procedures, $g_1 = \arg\max_{g} Q(\vec{R}_1, g)$ followed by the update of $Q(\vec{R}_1, g_1)$, are repeated. In this way, the $Q$-table is updated in a patchwork manner, with sensible values replacing the initial random numbers. Assisted by neural-network interpolation, the values over the whole range of the table are filled in and then converged by the self-consistent iteration to give the final $Q$-table. In the implementation, a user specifies the forms of the state, the operation, and the value function, and provides them to the package. In this study, we used the OpenAI Gym Brockman et al. (2016) package.

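Denoting the converged table as $\bar{Q}(\vec{R}, g)$, we can generate the series of operations that update the state as

$$ g_k = \arg\max_{g} \bar{Q}(\vec{R}_k, g), \qquad g_k : \vec{R}_k \to \vec{R}_{k+1} \qquad (k = 0, 1, 2, \dots) . \tag{6} $$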
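The self-consistent filling of the table and the greedy read-out of operations can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the setup of the study: it uses a six-site chain with a single rewarded goal site, a discount factor `GAMMA`, zero initial values, and full sweeps over a small exact table in place of the trajectory-wise updates with neural-network interpolation described above. All names are hypothetical.

```python
GAMMA = 0.9          # assumed discount factor
N_STATES = 6         # sites 0..5 on a chain
GOAL = N_STATES - 1  # the only rewarded (and terminal) site
ACTIONS = (-1, +1)   # step left or right


def step(state, action):
    """Deterministic toy dynamics with a sparse reward at the goal."""
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward


def solve_q_table(n_sweeps=50):
    """Iterate Q = reward + GAMMA * max Q(next) until self-consistent."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(n_sweeps):
        for s in range(GOAL):  # the goal site is terminal, so it is skipped
            for a in ACTIONS:
                nxt, reward = step(s, a)
                future = 0.0 if nxt == GOAL else max(q[(nxt, b)] for b in ACTIONS)
                q[(s, a)] = reward + GAMMA * future
    return q


def greedy_action(q, state):
    """Operation with the largest table entry for the given state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


def rollout(q, state, max_steps=20):
    """Read the operation sequence off the converged table."""
    plan = []
    while state != GOAL and len(plan) < max_steps:
        action = greedy_action(q, state)
        plan.append(action)
        state, _ = step(state, action)
    return plan


q_table = solve_q_table()
plan = rollout(q_table, 0)
print(plan)
```

Although the reward is nonzero at a single site only, the iteration propagates it backwards through the table, so the greedy read-out from site 0 walks straight to the goal.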
III Experiments

The workflow for achieving the agents' mission described in the last paragraph of Sec. Introduction is as follows:

  • To examine whether any agent has an error, the base issues commands to move each agent by tiny displacements (and consequently #3 is found to have the error).

  • Corresponding to each possible hypothesis ($h_a$ and $h_s$), virtual spaces are prepared by imposing each constraint.

  • [1] Reinforcement learning ($\alpha$) is performed at the command base using the virtual spaces, yielding the operation plan $\{g^{(\alpha)}_k\}$ designed to distinguish the hypotheses.

  • [2] The plan is executed by the agents. The command base compares the observed trajectory with those obtained in the virtual spaces at step [1]. Through the comparison, the hypothesis giving the trajectory closest to the observed one is identified as what has actually happened ($h_a$ in this case).

  • [3] Taking the virtual space of the identified hypothesis, another learning ($\beta$) is performed to get the optimal plan for the original mission (conveying items to goal posts in this case).

  • [4] The agents are operated according to the plan $\{g^{(\beta)}_k\}$.
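The first step of this workflow can be sketched as a comparison between commanded and reported displacements. This is only an illustration under assumed names and values: the function `find_suspect_agents`, the tolerance, the displacement size, and the choice of $x$ as the affected direction are all hypothetical.

```python
def find_suspect_agents(commanded, reported, tol=1e-6):
    """Return the numbers (1-based) of agents whose reported displacement
    deviates from the commanded one in any component."""
    suspects = []
    for number, (cmd, rep) in enumerate(zip(commanded, reported), start=1):
        if any(abs(c - r) > tol for c, r in zip(cmd, rep)):
            suspects.append(number)
    return suspects


# Every agent is asked to make the same tiny move in (x, y).
commanded = [(0.01, 0.01)] * 3
# Agent #3 reports no change in x (assumed direction of the failure).
reported = [(0.01, 0.01), (0.01, 0.01), (0.0, 0.01)]

suspects = find_suspect_agents(commanded, reported)
print(suspects)
```

The check flags agent #3, but the reported data would look the same whether the actuator did not move or the sensor did not register the move, which is why the subsequent learning step is needed to separate the two hypotheses.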

All learning and operations (performed as simulations, not with real machines) were run on a Linux server. The learning part is the most time-consuming, taking around half a day on a single processor without any parallelization in this study. For the learning, we used OpenAI Gym Brockman et al. (2016) and the PPO2 (proximal policy optimization, version 2) algorithm Schulman et al. (2015) on an LSTM (long short-term memory) network structure. We used the default hyperparameter settings without any specific tuning, though it has been pointed out that hyperparameter optimization (HPO) can significantly improve reinforcement learning performance. Henderson et al. (2018); Straub et al. (2020); Bihl et al. (2020b); Snoek et al. (2012); Domhan et al. (2015); Bihl et al. (2020a); Young et al. (2020)

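The learning $\alpha$ at step [1] is performed using two virtual spaces, $\mathcal{R}^{(h_a)}$ and $\mathcal{R}^{(h_s)}$, corresponding to the hypotheses,

$$ \vec{R}^{(h_a)} \in \mathcal{R}^{(h_a)}, \qquad \vec{R}^{(h_s)} \in \mathcal{R}^{(h_s)} . \tag{7} $$

Each $\vec{R}^{(h_l)}$ ($l = a, s$) can take only the values allowed under the constraint of its hypothesis (e.g., the blocked coordinate of agent #3 cannot be updated due to the actuator error). For an operation $g$, the state on each virtual space is updated as

$$ g : \vec{R}^{(h_l)} \to \tilde{\vec{R}}^{(h_l)} \qquad (l = a, s) . \tag{8} $$

Taking the value function,

$$ \rho^{(\alpha)}\left( \vec{R}^{(h_a)}, \vec{R}^{(h_s)}, g \right) = \left\| \tilde{\vec{R}}^{(h_a)} - \tilde{\vec{R}}^{(h_s)} \right\| , \tag{9} $$

the two-fold $Q$-table is updated self-consistently as

$$ Q^{(\alpha)}\left( \vec{R}^{(h_a)}, \vec{R}^{(h_s)}, g \right) = \rho^{(\alpha)}\left( \vec{R}^{(h_a)}, \vec{R}^{(h_s)}, g \right) + F\left( \left\{ Q^{(\alpha)}\left( \vec{R}'^{(h_a)}, \vec{R}'^{(h_s)}, g' \right) \right\} \right) . \tag{10} $$

Denoting the converged table as $\bar{Q}^{(\alpha)}$, the sequence of operations is obtained in the way given in Eq. (6) as

$$ \left\{ g^{(\alpha)}_0, g^{(\alpha)}_1, g^{(\alpha)}_2, \dots \right\} . \tag{11} $$

The operation sequence generates the two-fold sequence of (virtual) state evolutions,

$$ \vec{R}^{(h_l)}_0 \to \vec{R}^{(h_l)}_1 \to \vec{R}^{(h_l)}_2 \to \cdots \qquad (l = a, s) , \tag{12} $$

as shown in Fig. 1(a).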

At step [2], the agents are operated according to the plan, Eq. (11), to update the (real) states as

$$ \vec{R}_0 \to \vec{R}_1 \to \vec{R}_2 \to \cdots , \tag{13} $$

which are observed by the command base. The base compares Eq. (13) with Eq. (12) to identify which of $h_a$ and $h_s$ has actually happened ($h_a$ in this case).
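The comparison between the observed and the virtual trajectories can be sketched as a nearest-trajectory test. The text only states that the hypothesis with the closer trajectory is selected, so the particular distance used here (a sum of step-wise Euclidean distances) is an assumption, as are the function names and the sample coordinates.

```python
import math


def trajectory_distance(traj_a, traj_b):
    """Sum of Euclidean distances between states at matching time steps."""
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must have the same number of steps")
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b))


def identify_hypothesis(observed, virtual):
    """Name of the hypothesis whose virtual trajectory is closest to the data."""
    return min(virtual, key=lambda name: trajectory_distance(observed, virtual[name]))


# Hypothetical reported positions of the failed agent over three steps.
virtual = {
    "h_a": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],  # push is reported
    "h_s": [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],  # reading stays frozen
}
observed = [(0.0, 0.0), (0.9, 0.1), (2.1, 0.0)]

best = identify_hypothesis(observed, virtual)
print(best)
```

Because `math.dist` accepts points of any equal dimension, the same functions would apply unchanged to full six-component states.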

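At step [3], the learning $\beta$ is performed for another value function $\rho^{(\beta)}$, which earns a higher score as the agents get closer to the goal posts (scaling as the inverse of the distance) and a big bonus when they arrive at the goals. The operation sequence is then obtained as

$$ \left\{ g^{(\beta)}_0, g^{(\beta)}_1, g^{(\beta)}_2, \dots \right\} , \tag{14} $$

by which the states of the agents are updated as

$$ \vec{R}^{(\beta)}_0 \to \vec{R}^{(\beta)}_1 \to \vec{R}^{(\beta)}_2 \to \cdots , \tag{15} $$

as shown in Fig. 1(b).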
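A value function of this kind, with an inverse-distance term and an arrival bonus, might look as follows. The bonus size, the arrival radius, the capping of the inverse distance inside that radius, and the summation over agents are assumptions of this sketch; the study does not state these details.

```python
import math

ARRIVAL_BONUS = 100.0  # assumed size of the bonus for reaching the goal
ARRIVAL_RADIUS = 0.5   # assumed distance below which an agent has arrived


def agent_reward(position, goal):
    """Inverse-distance score, plus a bonus once the agent is at its goal."""
    distance = math.dist(position, goal)
    # Cap the inverse distance inside the arrival radius to avoid dividing by zero.
    score = 1.0 / max(distance, ARRIVAL_RADIUS)
    if distance <= ARRIVAL_RADIUS:
        score += ARRIVAL_BONUS
    return score


def team_reward(positions, goals):
    """Total score of the team: one term per agent and its own goal post."""
    return sum(agent_reward(p, g) for p, g in zip(positions, goals))


far = agent_reward((4.0, 0.0), (0.0, 0.0))
near = agent_reward((2.0, 0.0), (0.0, 0.0))
arrived = agent_reward((0.0, 0.0), (0.0, 0.0))
print(far, near, arrived)
```

With a team score of this form, a healthy agent that has already collected its own bonus can still raise the total by reducing another agent's distance to its goal, which is consistent with the pushing behavior reported in the Discussions.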

Figure 1: Trajectories of the agents driven by the operation plans generated by the reinforcement learning, first $\alpha$ [panel (a)] and then $\beta$ [panel (b)]. The trajectories in (a) are the virtual states of Eq. (12) (two-fold), branching typically for agent #2 according to the hypothesis. Those in (b) are the real trajectories given in Eq. (15). The labels (1) to (3) indicate the agents, which move along the directions shown by the red arrows. Dotted circles indicate collisions between agents.

IV Discussions

Fig. 1(a) shows the two-fold trajectories, Eq. (12), corresponding to the hypotheses $h_a$ and $h_s$. While the two trajectories coincide for agent #1, the branching occurs for #2 during the operations. The branching certainly earns a score from the value function $\rho^{(\alpha)}$ in Eq. (9), implying that the learning has been performed properly so as to capture the difference between $h_a$ and $h_s$. The red dotted circle in (a) is indeed the collision between #2 and #3, which induces the difference between the states under $h_a$ and $h_s$ (note that the trajectories show only the center positions of the agents, while each agent has a finite radius, so the trajectories themselves do not intersect even when a collision occurs). We note that the collision strategy is never given in a rule-based manner. The agents take this strategy autonomously, as deduced by the reinforcement learning.
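The remark on finite radii can be made concrete with a small collision test for disc-shaped agents. The radius value and the function name are assumptions of this sketch; the study does not give the agent size.

```python
import math

AGENT_RADIUS = 0.5  # assumed common radius of the disc-shaped agents


def agents_collide(center_a, center_b, radius=AGENT_RADIUS):
    """Two discs touch or overlap when their centres are within two radii."""
    return math.dist(center_a, center_b) <= 2.0 * radius


touching = agents_collide((0.0, 0.0), (1.0, 0.0))
apart = agents_collide((0.0, 0.0), (1.5, 0.0))
print(touching, apart)
```

A collision is thus registered while the centres are still a full diameter apart, which is why the plotted centre trajectories need not cross at the dotted circles.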

The three closed square symbols located at the vertices of a triangle in Fig. 1 are the goal posts for the conveying mission. Fig. 1(b) shows the real trajectories for the mission, where the initial locations of the agents are their final locations in panel (a). From the initial locations, agents #1 and #2 immediately arrived at their goals and departed from them after completing the mission. Agent #1 is then observed to collide with #3 and keep pushing it so that #3 can get closer to its goal. Though this behavior is just the consequence of earning more from the value function $\rho^{(\beta)}$, it is interesting that agent #1 appears as if it wanted to assist the disabled #3 cooperatively. With the constraint on the agents identified by the learning $\alpha$, the subsequent learning $\beta$ is confirmed to generate optimal operation plans with which the team earns the best benefit through cooperative behavior, as if the team had made an autonomous decision.

V Conclusion

Agents performing a group mission may suffer from errors during the mission. Multiple hypotheses are possible for the cause of an observed error. Some cooperative behaviors, such as a collision between agents, are capable of identifying the cause of the error. We considered the autonomous planning of such group behaviors by using machine learning. Different hypotheses for the cause lead to different expected states after the same operation is applied to the same initial state. The larger the difference, the better the operation plan is at distinguishing the hypotheses for the cause. Namely, the magnitude of the difference can serve as the value function for optimizing the desired operation plan. Gradient-based optimizations do not work well because only a tiny fraction of the vast number of possible operations (e.g., collisions) can capture the difference, so that finite values of the value function are sparsely distributed. As we have shown, reinforcement learning is an appropriate choice for such problems. The optimal plan arrived at by the reinforcement learning was indeed an operation that caused the agents to collide with each other. With the cause identified by this plan, a revised mission plan incorporating the failure was developed by another learning, in which the other agents without failure appear to help the disabled agent.

VI Acknowledgments

The computations in this work were performed using the facilities of the Research Center for Advanced Computing Infrastructure at JAIST. R.M. is grateful for financial support from MEXT-KAKENHI (19H04692 and 16KK0097) and from the Air Force Office of Scientific Research (AFOSR-AOARD/FA2386-17-1-4049; FA2386-19-1-4015).