AI algorithms increasingly impact many aspects of our lives. The use of AI to assist in or execute decisions spans many topics and disciplines: algorithmic trading, police dispatching, ride-sharing services, online dating, and more. As AI systems become more ubiquitous, they also advance in complexity and capability. With humans encountering AI systems at increasing frequencies, there is an extensive need for AI algorithms that take the user into account in order to perform well. Research in Human-Centric AI (HCAI) is therefore proving useful and essential for effective systems endowed with comprehension or awareness of the human. HCAI and related topics demonstrate methods for training and utilizing AI systems with the human in mind, and methods for improving the collective performance of the hybrid human-AI system.
Despite the importance of, and desire for, augmenting human reasoning and skills with AI-based methods, there can be pitfalls. In fact, the consequences of mistakes by an AI can be quite severe. For instance, chat bots can behave unpleasantly or offensively [11, 19], self-driving cars can have fatal crashes [4, 5], and algorithms can be fooled into making erroneous decisions [1, 9]. AI systems are therefore not infallible, and awareness of these shortcomings is still required. With HCAI focusing on different mechanisms supporting the combination of human and artificial intelligence, or the augmentation of human reasoning, it is important to consider the potential errors of either party, AI or human or both. To that end, we investigate detecting sub-optimal behavior from observations in order to delegate control to the party, either human or AI, which is predicted to perform best at any given point in time.
In this paper, we consider the scenario of humans and AI systems operating as a team to accomplish a task. For our scenario, the team is a potentially heterogeneous mixture of actors operating under the supervision of a managing agent tasked with delegation. The manager learns, by observing how behavior changes the environment, which actor would be ideal to perform the next action in pursuit of the team's goal. We utilize a combination of Q-Learning Reinforcement Learning (RL) agents and Instance-Based Learning (IBL) agents using cognitively inspired mechanisms. The use of IBL for agents is meant to allow for a human-like process in both the representation and understanding of behavior. We augment the team behavior with injected errors to simulate behavior policies which make mistakes. The method and results demonstrated in this paper are intended to illustrate a cognitively inspired method of understanding and representing behavior to optimize team dynamics. We show that the performance of the error-prone agents is improved when they are combined as a team under the managing agent. Moreover, a manager agent trained by observing the behavior of human and AI agents outperforms a manager agent choosing randomly on key performance indices. The results demonstrate how our method enables strong team performance under a manager which can use cognitively inspired mechanisms to extract a desirable pattern of team behavior.
II-A Related Work
HCAI systems are gaining momentum in the recent literature. They are proposed in several areas, and for different applications. Hereafter, we provide a summary of some significant efforts.
II-A1 Estimating Mental States and Predicting Behavior
It is important to define systems which can estimate the goals, beliefs, and likely future behavior of others. This predictive power allows agents to act based on an estimated understanding of others, which lets agents utilize their estimates to improve the collective performance of a team [7, 16]. This understanding can be accomplished via Theory of Mind (ToM), which enables estimation of the mental state of another. Accounting for these mental state estimates, decisions can be made. Similarly, behavior can be generated with an integrated behavior prediction model. In this case, the behavior of others can be modeled based on estimates of past behavior or other assumptions. For instance, cognitive models can be used to learn a model representing the observed decision-making process [12, 13].
II-A2 AI-assisted Behavior and Decisions
Commonly, humans and AI interact in the case of AI-assisted decisions and related scenarios. In the case of more sophisticated simultaneous control, such as in Overcooked [2, 18], agents in the environment need to understand and/or predict human behavior to optimize their performance. In another context, AI can operate as a backup or alternate in the execution of a task. In these cases, control is delegated between the two based on thresholds or the human's choice. This allows the human to offload control to the AI system when they deem it necessary or safe to do so.
The approach presented in this paper falls in this class. With respect to existing literature, we combine the concept of behavior comprehension and AI-assisted decisions to provide improved and learned control delegation based on observed participant performance.
II-B Reinforcement Learning: Q-Learning and IBL
RL is a method by which situations or observations are mapped to actions so as to maximize a reward signal. The maximization is performed by an agent through exploration of an environment and the available actions. The states and actions determine the feedback (reward) an agent observes, which motivates finding the best actions. At its base, RL comprises several components: agent, environment, policy, reward signal, value function, and an optional model of the environment. These elements are utilized in the learning methods to build a representation of optimal behavior.
In Q-Learning, agents use observations of states $s$, actions $a$, and rewards $r$ to generate an estimate of action values $Q(s, a)$ in states and learn a behavior policy. The value estimates are stored and modified using an update such as:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $s'$ represents the new state resulting from taking $a$ in $s$. Additionally, $\alpha$ is the learning rate used to discount new observations and $\gamma$ is the discount parameter for future state values. The agent selects an action which maximizes $Q(s, a)$. The typical downside of Q-Learning is that convergence may be slow, due to the need to explore the space extensively.
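As a minimal illustration, the tabular update above can be sketched as follows (the table layout and the $\alpha$, $\gamma$ values are illustrative assumptions, not those used in the paper):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-Learning update:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

# Toy usage: a single update starting from an all-zero table.
Q = defaultdict(float)
actions = ["up", "down", "left", "right"]
new_value = q_update(Q, s=(0, 0), a="right", r=1.0, s_next=(0, 1), actions=actions)
# With Q initially zero: new_value = 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```

The `defaultdict` stands in for the Q-table so unseen state-action pairs default to zero.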
II-B2 Instance-Based Learning
In IBL, agents calculate an estimated utility, given the current state, based on the blending mechanism of ACT-R. This serves to support a cognitively inspired approach to mimic human learning, behavior, and decision-making. In our case, we utilize the reduced version of the IBL equations seen in [12, 13]. Agents observe and store instances in memory with a time marker $t$ to represent past experiences. These instances represent the state (or context) $s$, the action taken $a$, and an outcome $o$ defining the reward or other feedback provided based on $(s, a)$.
To support a model of behavior, the agent uses the past instances matching the current state/context to measure an activation level

$$A_i = \ln \sum_{t' \in T_i} (t - t')^{-d}$$

for each action/reward pair corresponding to the given state, where $T_i$ is the set of times at which instance $i$ was observed and $d$ is a decay parameter. The strength of the activation is based on the observation times, which signify the strength of each memory corresponding to the matching instances. The activation strength of a memory depends on both the recency and the frequency of the memories (i.e., strength is reduced as more time elapses). The activation values define a retrieval probability for each instance using a Boltzmann distribution with temperature parameter $\tau$:

$$P_i = \frac{e^{A_i / \tau}}{\sum_j e^{A_j / \tau}}$$

The retrieval probability denotes how strong a memory is and subsequently how much it should impact the measure of utility, which allows the agent to calculate an estimate of action values

$$V(s, a) = \sum_i P_i \, o_i$$

where $o_i$ is the outcome of instance $i$ (i.e., a reward/value). Given estimated action values, the agent selects the most suitable action $a^* = \arg\max_a V(s, a)$. In our case, we use an additional efficiency scheme regarding the storage of instances. Following [14], we approximate the activation equation to reduce storage complexity while maintaining accuracy.
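A compact sketch of the blending computation above, assuming a small in-memory instance store keyed by action (the $d$ and $\tau$ values here are illustrative, not those of the paper):

```python
import math

def blended_values(instances, t_now, decay=0.5, tau=0.25):
    """Compute IBL blended action values from stored instances.

    instances: dict mapping action -> list of (outcome, [observation times]).
    Activation:  A_i = ln( sum_{t' in T_i} (t_now - t')**(-decay) )
    Retrieval:   P_i = exp(A_i / tau) / sum_j exp(A_j / tau)
    Blended:     V(a) = sum_i P_i * outcome_i over instances for that action.
    """
    values = {}
    for action, entries in instances.items():
        acts = [math.log(sum((t_now - t) ** (-decay) for t in times))
                for (_, times) in entries]
        weights = [math.exp(a / tau) for a in acts]
        total = sum(weights)
        probs = [w / total for w in weights]
        values[action] = sum(p * outcome for p, (outcome, _) in zip(probs, entries))
    return values

# Two instances for one action: a recent high reward and an old low reward.
inst = {"right": [(1.0, [9]), (0.0, [1])]}
v = blended_values(inst, t_now=10)
# The recent instance dominates, so v["right"] is close to 1.
```

Note the recency effect: the memory observed at time 9 retains far more activation at time 10 than the one from time 1, so it dominates the blend.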
III Human-Centric AI Approach to Decision-Making
As anticipated in Section I, we assume the existence of a pool of artificial and human agents, and a manager that decides, per state, which agent in the pool should make the decision at any given point to achieve the team's goal. In the following, we refer to agents in the pool as navigating agents. The pool of navigating agents is trained first to ensure they have valid policies prior to manager training and testing. We use the Q-Learning algorithm for navigating agents representing the artificial systems. These agents observe the state and then receive feedback as a reward based on the state-action pair. For the representation of human-like behavior, as well as for the manager, we define IBL-based agents. We use IBL for the manager because this agent needs to interpret human behavior as closely as possible (for which IBL is better suited than Q-Learning). In the case of the manager, observations represent states and agents selected rather than the movement actions performed by the navigating agents, so the manager does not observe the actions of the agent it chooses. This ensures the manager can only make its decision based on team performance, not the individual actions of navigating agents.
III-A Q-Learning Navigating Agents
For Q-Learning agents we directly exploit the approach of Equation 1. Therefore, agents make observations while taking actions, and the policy is built through updates of the value function. The observations are carried out for a certain number of moves and for a specified number of games. In the case of navigating agents (including Q-Learning and IBL agents), the action space is defined by the possible actions allowed by the specific game that agents play (here, "game" generally denotes a given task that agents have to accomplish by performing successive decisions, or "moves").
III-B IBL Navigating Agents
Similar to Q-Learning, an IBL agent observes states, actions, and rewards. In this case, as described above, these are stored as instances with a time marker $t$ for the IBL model. Time markers for new observations of an instance are stored in addition to the first marker, which denotes the first time an instance tuple was observed. Based on [12, 13], the IBL agents do not observe an immediate feedback signal; instead, all instances in a trajectory observe the same reward based on the game outcome. Consequently, the instances observed from a particular game utilize a single reward based on the final outcome (the specific form of the reward function is problem-dependent). Further, an IBL agent performs moves according to the policy available at the current time. It saves all observed instances with their corresponding times until the conclusion of the game. Hence, action values are not updated until the end of a game.
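The deferred-reward bookkeeping described above might be sketched as follows (class and method names are hypothetical):

```python
class IBLNavigator:
    """Sketch of an IBL navigating agent that defers rewards to the game
    outcome: instances accumulate time markers during play, and every
    instance in a trajectory receives the same final-outcome reward."""

    def __init__(self):
        self.memory = {}       # (state, action, outcome) -> list of time markers
        self.trajectory = []   # (state, action, time) for the current game
        self.clock = 0

    def record_step(self, state, action):
        self.clock += 1
        self.trajectory.append((state, action, self.clock))

    def end_game(self, outcome):
        # Assign the single game outcome to every instance in the trajectory.
        for state, action, t in self.trajectory:
            self.memory.setdefault((state, action, outcome), []).append(t)
        self.trajectory = []

agent = IBLNavigator()
agent.record_step((0, 0), "right")
agent.record_step((0, 1), "down")
agent.end_game(outcome=1.0)
# Both instances now share the game's outcome and keep their time markers.
```

Repeated observations of the same (state, action, outcome) tuple would append further time markers to the same memory entry, feeding the activation equation.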
III-C Manager Agent
The manager selects which agent will choose the action in the current state. Similar to the IBL navigating agents, the manager learns by observing instances and times. In this case, the action space is $\{1, \dots, N\}$, where $N$ is the number of navigating agents. As with the IBL navigating agents, the reward observed in an instance is problem-dependent, but in general the observed value of all actions in a trajectory must be the game result from the entire trajectory. Overall, the algorithm for training the manager agent follows the one used to train the IBL navigating agents. The primary difference is that the manager selects which agent should decide the action, and then the identified agent selects the action.
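A sketch of one delegation episode under these rules, with hypothetical interfaces and a uniform-random manager standing in for the learned one:

```python
import random

class RandomManager:
    """Baseline manager: selects a navigating agent uniformly at random."""
    def __init__(self, n_agents, seed=0):
        self.n = n_agents
        self.rng = random.Random(seed)

    def choose_agent(self, state):
        return self.rng.randrange(self.n)

def play_game(manager, agent_policies, step_fn, start, goal, max_moves=50):
    """One delegation episode: the manager picks who moves at each step, and
    records only (state, agent_index) instances, never the movement action."""
    state, history = start, []
    for _ in range(max_moves):
        idx = manager.choose_agent(state)
        history.append((state, idx))
        state = step_fn(state, agent_policies[idx](state))
        if state == goal:
            break
    return state, history

# Toy 1-D corridor: both agents always move right; the goal is at x = 3.
step = lambda s, a: s + a
policies = [lambda s: 1, lambda s: 1]
final, hist = play_game(RandomManager(2), policies, step, start=0, goal=3)
# final == 3 and the manager logged one (state, agent) pair per move.
```

At the end of a game, the `history` list would be stamped with the trajectory-level reward and stored as instances, exactly as for the IBL navigating agents.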
IV Experimental Settings
In this section we describe a concrete experimental environment where we have tested the general framework presented in Section III. As a specific case, we have considered Gridworld, a game where a player needs to navigate in a grid, from a starting cell to a goal cell. Gridworld environments serve as a simple and powerful tool for testing and analyzing policy learning methods. The following sections outline the training and testing environments utilized as well as the agent configurations.
IV-A Gridworld Environments
In our experiments, we generated Gridworld environments for the agents to navigate, parameterized by the following: grid dimensions, relative position of start and goal states, ratio of open cells to walls, and number of error states of each type (if any). This generates Gridworld environments with the following characteristics. First, in addition to the start state $s_{start}$ and goal state $s_{goal}$, we include error states $E_i$ (where $E_i$ indicates an error state for agent $i$). Additionally, we include joint error states in which multiple agents could make a mistake (e.g., $E_{i,j}$). In this case, all agents indicated for that state (i.e., agents $i$ and $j$) may choose an error action. In error states, agents may perform an error by ignoring their policies and instead selecting an action which takes them off an optimal path to the goal. When an agent is selected in its error state, it follows its policy or selects an error based on its probability of error $p_{err}$. For each agent, $p_{err}$ determines how likely it is to make an error, which provides stochasticity. In an error state assigned to another agent, the state is treated as a normal empty grid cell. Finally, we utilize a single start state and corresponding goal state. Some states may be unreachable, but there must be a path from $s_{start}$ to $s_{goal}$ using the available actions; see Figure 1.
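The error-injection rule above can be sketched as follows (function and state names are hypothetical; the "error action" is just a placeholder for a move off the optimal path):

```python
import random

def select_action(agent_idx, state, policy, error_states, p_err, rng):
    """In its own error state, an agent follows its policy with probability
    1 - p_err, otherwise takes an off-path 'error' action. Another agent's
    error state is treated as a normal empty cell."""
    if agent_idx in error_states.get(state, ()) and rng.random() < p_err[agent_idx]:
        return "error_action"   # placeholder for an off-optimal-path move
    return policy(state)

rng = random.Random(0)
error_states = {(2, 2): {0}, (3, 1): {0, 1}}   # (3, 1) is a joint error state
always_right = lambda s: "right"

# (1, 1) is no one's error state, so agent 0 simply follows its policy.
a = select_action(0, (1, 1), always_right, error_states, {0: 0.5, 1: 0.5}, rng)
# In its own error state with p_err = 1, agent 0 always errs.
b = select_action(0, (2, 2), always_right, error_states, {0: 1.0, 1: 0.0}, rng)
```

The per-agent `p_err` dictionary is what provides the stochasticity described above.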
In our experiments, the goal for the navigating agents is to find $s_{goal}$ as a team. The manager is required to select, at each step, the agent that will choose the action for the current state. Thus, a state denotes that the team is in a certain cell, and the manager action is which navigating agent should decide the next move. As such, the manager is attempting to navigate the team through the Gridworld optimally via the selection of appropriate acting agents. For feedback, the IBL agents (manager and navigating) are only provided with a simple reward based on the outcome of the game:
where $n$ denotes the number of steps needed to reach the goal state. This reward motivates the agents to make good selections while minimizing the amount of direct feedback. Further, this ensures the manager is only able to make assessments based on the navigating agents selected; the goal is to force the manager to rely on outcomes of the entire trajectory rather than receiving an immediate reward for a specific action. A key factor in the manager's success then comes down to its ability to determine the sequence of agents resulting in the best trajectory. Consequently, the manager learns to delegate control effectively to account for error-prone agents and impactful error states, as choosing error-prone agents in their error states would be suboptimal.
To train the navigating agents, we utilize the Gridworld action space to navigate the grid cells and transition between states. For Q-Learning rewards, we use the following:
The step penalty serves to enforce a priority for shorter paths, and the wall penalty promotes wall avoidance.
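A hedged sketch of such a per-step reward function (the penalty magnitudes here are assumptions for illustration, not the paper's values):

```python
def q_reward(next_cell, goal, walls,
             step_penalty=-0.01, wall_penalty=-0.1, goal_reward=1.0):
    """Per-step Q-Learning reward: a small step penalty favors shorter paths,
    a larger wall penalty discourages collisions, and reaching the goal pays."""
    if next_cell == goal:
        return goal_reward
    if next_cell in walls:
        return wall_penalty
    return step_penalty

r_wall = q_reward((1, 2), goal=(4, 4), walls={(1, 2)})
r_goal = q_reward((4, 4), goal=(4, 4), walls=set())
r_step = q_reward((2, 2), goal=(4, 4), walls=set())
```

Any magnitudes with `|step_penalty| < |wall_penalty| << goal_reward` would preserve the intended ordering of incentives.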
An additional aspect of our experiments is the mixture of agents utilized. We allow for both homogeneous and heterogeneous mixtures of policy types for the navigating agents (Q-Learning or IBL). Further, the error probabilities for error states are defined per agent. This allows us to test the manager with differing team compositions.
Two additional scenarios were utilized to test the efficacy of our method. First, navigating agents were tested operating in the environment without teammates or a manager agent. This demonstrates the effect of errors on solo agent performance. Second, we replace the IBL manager with a random manager. In this case, the manager's policy selects navigating agents uniformly at random in all states. This demonstrates the difference in impact between random teaming and a learned model of teaming.
We have replicated simulations over multiple grids at increasing levels of complexity for each grid (approximately 13 levels per grid), and in the following we show average results plotted with error regions signifying the variance of the results.
Our scenario was tested in the Gridworld setting at varying levels of complexity. We created error-free Gridworld environments, which were used to train the navigating agents. With trained navigating agents, we then trained and measured the performance of the managing agents in the same Gridworld environments, but with increasing numbers of error states. In each case, error states were created incrementally by randomly placing them in currently open cells. The results were then averaged across the different grids at each error state frequency level.
Our Gridworlds were created with fixed numbers of rows and columns, excluding the boundary walls. The starting state and goal state were placed in the same cells for each grid, but the walls were randomly placed. This ensured consistent positioning of the start and goal while requiring varying paths for successful navigation. To vary the difficulty, error states are introduced into the grid and placed randomly. For our experiments, error state frequency was equal for all types, including joint error states (Figure 1). At the end of each training and testing phase, an additional error state of each type was added at random to an open cell. The ratio of error states to open cells increased across the test cases, which ensured the existence of error states while also allowing for the potential for paths with some error-free cells.
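The incremental placement procedure might look like the following sketch (names and the error-type encoding, a set of affected agent indices per cell, are assumptions):

```python
import random

def add_error_state(open_cells, error_states, agent_sets, rng):
    """Pick a currently open cell at random and assign it one error type;
    the cell is no longer open once it becomes an error state."""
    cell = rng.choice(sorted(open_cells))   # sorted for deterministic order
    open_cells.remove(cell)
    error_states[cell] = agent_sets[rng.randrange(len(agent_sets))]
    return cell

rng = random.Random(1)
open_cells = {(1, 1), (1, 2), (2, 1)}
error_states = {}
types = [{0}, {1}, {0, 1}]   # per-agent error types and a joint type
placed = add_error_state(open_cells, error_states, types, rng)
# One open cell became an error state and left the open set.
```

Calling this once per error type after each training/testing phase would reproduce the incremental-difficulty schedule described above.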
Following the IBL literature and inspired by [12, 13], the IBL-based agents and the Q-Learning agents each used a fixed set of parameters. The training of the navigating agents was performed on each Gridworld for a fixed number of episodes with a maximum game length. Manager agents utilized the same parameters as the IBL-based navigating agents, but managers were trained for a set number of games.
V-C Team Configuration and Results
For our tests, we utilized the previously defined Gridworld configuration with a fixed wall ratio (similar results were seen for other ratios, but are omitted for space). To track performance, we measured mean game length, determined by the length of the path used to reach the goal state, as well as the frequency of agent selection by the manager agents (random vs. IBL). These measurements demonstrate the effect of grid complexity on team performance. The main factor in grid complexity in this case is the frequency of error states, with error state counts of each type increasing incrementally as states are placed in the grid. Further, the measurements also capture the learned preferences of the manager regarding agent selection.
In the first set of results, we tested a team of two navigating agents (one trained with Q-Learning and the second with IBL, representing an AI agent and a human, respectively) with equal error frequencies in error states for each agent. We hereafter show the results for a low error probability for each agent, while we omit results obtained with higher error frequencies for space reasons. The observed behavior is qualitatively equivalent, with the performance of solo agents decreasing proportionally as the error frequencies increase. We show detailed results for relatively low error frequencies because, in this case, we expected a diminished effect of errors on game lengths and team performance with respect to the case of higher frequencies. This allows us to show the improvements brought by our scheme even in cases where solo agents already perform well. (a) shows the game lengths at different error frequencies for the solo agents (nav1 and nav2, respectively), and for the cases with the random manager and the IBL manager. While the error frequencies are the same, the IBL solo agent performs worse than the Q-Learning solo agent because the IBL agent in our case is less likely to perform exploration steps in comparison with the Q-Learning agent. The IBL agent is therefore more likely to develop an earlier bias in its policy and to have a larger number of under-explored areas, which leads to decreased performance when forced off its optimal path. However, the policy with an IBL manager significantly outperforms a random manager, and is able to improve performance with respect to either solo agent, despite the relatively low error frequencies.
In another aspect of performance, the proportion of errors demonstrated by the navigating agents and the IBL manager's preference in selecting them show a likely correlation. (a) shows the pattern of selection probability. Specifically, the curve EAx-Ey: mgr shows the percentage of times the IBL manager selects agent x in an error state of type y (y = J for joint error states).
As seen in (a), the manager learns a strong preference for selecting the agents which are error-free for a given error state. On the other hand, the manager does not demonstrate a complete bias toward the error-free agents. This can likely be attributed to two factors. First, the reward signal is based solely on trajectory lengths, so most outcomes where a low error probability agent is selected will be identical to those with only error-free agent selections. Second, the manager has no prior information on error probabilities, so its understanding of agent desirability is based entirely on observation and imperfect memory; hence the manager learns a preference toward lower error likelihoods without developing a complete bias. The frequency of agent selection maintains this pattern as the frequency of error states of each type increases through the test cases, which shows awareness of error likelihood by the IBL manager. Notice the difference with respect to the random manager, which picks navigating agents with the same probability of 50%.
In the next set of results ((b), (b)), we analyze the behavior with an imbalance in error frequencies between the solo agents, setting a higher error probability for the second agent. For this configuration, the impact of agent errors on game lengths is significantly higher for the second agent, so we would expect a much higher impact resulting from intelligent selection of agents. The higher error likelihood is also expected to affect the successful navigation of the error-prone agent operating in the solo case. The results confirm both of these expectations. We can clearly see that the agent with high error likelihood suffers a significant increase in game lengths when operating independently. Additionally, the random manager suffers more when it selects the second agent, resulting in diminished performance. Regarding the IBL manager's preferences in agent selection, we again see that the preference aligns strongly with error likelihoods. The manager develops a very strong bias toward the better-performing agent in joint error states, almost exclusively selecting the low error probability agent. This demonstrates a much stronger belief, from the manager's perspective, that the second agent will encounter errors.
In the final case ((c), (c)), we demonstrate the impact on performance in the case of a highly divergent team. The error probabilities demonstrate a worst-case scenario where one agent cannot avoid errors when an error state is encountered. In such a case, we expect the manager to learn a strong bias toward the error-free agent in the error-prone agent's states and in joint error states. On the other hand, in the error-free agent's own error states the manager can treat the two as identical, as neither agent can err there. In fact, this pattern is generally demonstrated in (c). We see the manager showing little preference between the two agents in those states. Regarding the solo agent cases, as expected, we also only see an impact on game length for the error-prone agent. Still, we again see the improvement made possible by the inclusion of a learning manager. Further, the learning manager shows significantly stronger performance in comparison with the random manager. Additional results are demonstrated in [6].
In this paper, we considered the case of humans and AI systems operating as a team to accomplish a task. For our scenario, we tested the combination of a manager agent with a team of agents navigating a Gridworld environment together. We tested a team in which we represented both human and AI behavior through models utilizing standard RL and cognitively inspired IBL models, with errors injected into their behavior. This mixture served to demonstrate a case in which fallible humans and AI systems might operate together to accomplish a shared task. The manager was able to learn a preference for the more successful agents and provide a significant improvement in team performance over both individual performance and random agent selection. These results show that a cognitively inspired model can learn from patterns of behavior to effectively coordinate a team when it shows signs of producing errors.
References

[1] Detecting adversarial example attacks to deep neural networks. In Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing, pp. 1–7.
[2] (2019) On the utility of learning about humans for human-AI coordination. Advances in Neural Information Processing Systems 32, pp. 5174–5185.
[3] (2021) Future trends for human-AI collaboration: a comprehensive taxonomy of AI/AGI using multiple intelligences and learning styles. Computational Intelligence and Neuroscience 2021.
[4] (2019) The wild, wild west: a case study of self-driving vehicle testing in Arizona. Ariz. L. Rev. 61, pp. 983.
[5] (2016) Self-driving cars: whose fault is it? Geo. L. Tech. Rev. 1, pp. 182.
[6] (2022) Demonstrating optimized delegation between AI and human agents.
[7] (2021) Theory of mind for deep reinforcement learning in Hanabi. CoRR abs/2101.09328.
[8] (2019) Attributing awareness to others: the attention schema theory and its relationship to behavioural prediction. Journal of Consciousness Studies 26 (3-4), pp. 17–37.
[9] (2020) Learn2Perturb: an end-to-end feature perturbation learning to improve adversarial robustness. pp. 1241–1250.
[10] (2020) Cognitive modeling of automation adaptation in a time critical task. Frontiers in Psychology 11.
[11] (2016) Automation, algorithms, and politics | Talking to bots: symbiotic agency and the case of Tay. International Journal of Communication 10, pp. 17.
[12] (2020) Cognitive machine theory of mind.
[13] (2020) Effects of decision complexity in goal-seeking gridworlds: a comparison of instance-based learning and reinforcement learning agents.
[14] (2006) Computationally efficient approximation of the base-level learning equation in ACT-R. In Proceedings of the Seventh International Conference on Cognitive Modeling, pp. 391–392.
[15] (2019) Machine behaviour. Nature 568 (7753), pp. 477–486.
[16] (2019) Theory of minds: understanding behavior in groups through inverse planning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6163–6170.
[17] (2018) Reinforcement learning: an introduction.
[18] (2020) Too many cooks: coordinating multi-agent collaboration through inverse planning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2032–2034.
[19] (2017) Why we should have seen that coming: comments on Microsoft's Tay "experiment," and wider implications. The ORBIT Journal 1 (2), pp. 1–12.