
Overoptimization Failures and Specification Gaming in Multi-agent Systems

by David Manheim, et al.

Overoptimization failures in machine learning and AI can involve specification gaming, reward hacking, fragility to distributional shifts, and Goodhart's or Campbell's law. These failure modes are an important challenge in building safe AI systems, but multi-agent systems have additional related failure modes. These failure modes are more complex, more problematic, and less well understood in the multi-agent setting, at least partially because they are not yet observed in practice. This paper explains why this is the case, then lays out some of the classes of such failure, such as accidental steering, coordination failures, adversarial misalignment, input spoofing or filtering, and goal co-option or direct hacking.



1 Introduction

In this paper, we show that even if artificial intelligence (AI) or machine learning (ML) systems are individually well-aligned with a goal, specific classes of over-optimization failures can create dynamics in multiparty systems that lead to new failure modes. Even specification of non-competitive or cooperative goals does not necessarily provide any guarantee for the behavior of systems. By outlining how and why these multi-agent failures can occur, the paper hopes to spur system designers to explicitly consider these failure modes in designing systems, and to find approaches for mitigating them.

When complex systems are optimized by a single agent, the representation of the system and of the goal used for optimization often leads to failures that can be surprising to the agent’s designers. These various failure modes have been referred to as Goodhart’s law [1, 2], Campbell’s law [3], faulty reward functions [4], distributional shift [4], reward hacking [5], Proxyeconomics [6], and presumably many other terms. Such failure modes are the focus of a significant body of work in AI safety, and progress has been made. The implications of the additional classes of failure that can exist for multi-agent systems are more complex, less well understood, and less often discussed.

As multiple corporations, governments, and other actors develop various forms of autonomous machine learning systems, the systems become more likely to interact not just with humans and other systems they are ultimately intended for, but also directly and indirectly with one another. While the arguments put forward by Yudkowsky [7] and Bostrom [5] for why a singleton AI is more likely than the alternatives are plausible, the issues with multi-agent dynamics are critical in the present, when superhuman narrow AI already exists, as well as in a non-singleton superhuman AI scenario, and during any race leading to either a singleton or non-singleton superhuman AI, where multiple groups deploy increasingly powerful systems that interact in unanticipated and problematic ways.

1.1 Overoptimization failure modes

Systems which are optimized using an imperfect system model have several important failure modes. First, imperfect correlates of the goal will diverge in the tail [8]. Heavily optimized systems will end up in those regions, and even well-designed metrics do not account for every possible source of variance. Second, there are several context failures [9], where the optimization is well behaved in the training set (“ancestral environment”) but fails in various ways as optimization pressure is applied. For example, it may drift towards an “edge instantiation” where the system may optimize all of the variables that relate to the true goal, but further gain on the metric is found by unexpected means. Alternatively, the optimizer may properly obey constraints in the initial stage, but find some “nearest unblocked strategy” [9] allowing it to circumvent designed limits when given more optimization power. These can all occur in single-agent scenarios.

1.2 Single-Agent Regulator Failures

The types of failure in multi-agent systems are closely related to the exhaustive classification of single-agent metric optimization failures in the earlier work by Manheim and Garrabrant [2]. The four single-agent over-optimization failure modes outlined there are:

  • Tails Fall Apart, or Regressional inaccuracy, where the relationship between the modeled goal and the true goal is inexact due to noise (for example, measurement error), so that the bias grows as the system is optimized.

  • Extremal Model Insufficiency, where the approximate model omits factors which dominate the system’s behavior after optimization.

  • Extremal Regime Change, where the model does not include a regime change that occurs under certain (unobserved) conditions that optimization creates.

  • Causal Model Failure, where the agent’s actions are based on a model which incorrectly represents causal relationships, and the optimization involves interventions that break the causal structure the model implicitly relies on.
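The first of these modes can be illustrated numerically. The following is a minimal sketch rather than anything taken from the paper: it assumes a Gaussian true goal and an independent Gaussian measurement error, and the function name `selection_gap` and all parameter values are hypothetical. It selects the top candidates by the noisy metric and compares the average metric with the average true goal among those selected.

```python
import random


def selection_gap(n_candidates=10_000, top_k=100, noise_sd=1.0, seed=0):
    """Select the top_k candidates by a noisy metric and compare averages.

    Assumption (illustrative only): goal ~ N(0, 1) and
    metric = goal + N(0, noise_sd). Returns (mean metric, mean true goal)
    among the selected candidates.
    """
    rng = random.Random(seed)
    goals = [rng.gauss(0.0, 1.0) for _ in range(n_candidates)]
    metrics = [g + rng.gauss(0.0, noise_sd) for g in goals]
    # Rank candidates by the metric, which is all the optimizer can observe.
    chosen = sorted(range(n_candidates), key=metrics.__getitem__, reverse=True)[:top_k]
    mean_metric = sum(metrics[i] for i in chosen) / top_k
    mean_goal = sum(goals[i] for i in chosen) / top_k
    return mean_metric, mean_goal


if __name__ == "__main__":
    # Smaller top_k means stronger selection pressure on the metric.
    for k in (5_000, 1_000, 100, 10):
        m, g = selection_gap(top_k=k)
        print(f"top {k:>5}: metric {m:.2f}, goal {g:.2f}, gap {m - g:.2f}")
```

Under these assumptions the gap between the metric and the true goal widens as selection becomes more extreme, which is the sense in which the bias grows with optimization.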

Despite the completeness of the above categorization, the way in which these failures occur can differ greatly even when only a single agent is present. In a multi-agent scenario, agents can stumble into or intentionally exploit model overoptimization failures in even more complex ways. Despite this complexity, the different multi-agent failure modes can be understood based on understanding the way in which the implicit or explicit system models used by agents fail.

2 Results

Several relatively straightforward failure modes were referred to in Manheim and Garrabrant as Adversarial Goodhart [2]. These occur where one AI system opportunistically alters or optimizes the system and uses the expected optimization of a different victim agent to hijack the overall system. For example, many electrical grids use systems that optimize using known criteria. If power lines or power plants have strategically planned maintenance schedules, the owner can manipulate prices to its own advantage, as occurred (legally) in the case of Enron[10]. This is possible because the manipulator can plan in the presence of a known optimization regime. This class of simple system manipulation by a manipulative agent is an important case, but more complex dynamics can also exist that are worth discussing. An example of a well-understood multi-agent system, the game of poker, allows clarification of why this is the case.

2.1 Texas Hold’em and the Complexity of Multi-Agent Dynamics

In many-agent systems, even relatively simple systems can become complex adaptive systems due to agent behavior. The game of poker illustrates this point. Solutions to simplified models of two-player poker predate game theory as a field [11], and for simplified variants, two-player draw poker has a fairly simple optimal strategy [12]. These early, manually computed solutions were made possible both by limiting the complexity of the cards, and more importantly by limiting interaction to a single bet size, with no raising or interaction between the players. In the more general case of heads-up limit Texas Hold’em, significantly more work was needed, given the multiplicity of card combinations, the existence of hidden information, and player interaction. This multi-stage interactive game is “now essentially weakly solved” [13], but it involves only two players. In the no-limit version of the game, Brown and Sandholm recently unveiled super-human AI [14], which still falls far short of a full solution to the game, and is still restricted to heads-up poker, which involves only two players per game.

The complex-adaptive nature of multi-agent systems means that each agent needs to model not only the system itself, but also the actions of the other player(s). The multiplicity of potential outcomes and betting strategies rapidly becomes infeasible to represent other than heuristically. In limit Texas Hold’em poker, for example, the number of card combinations is immense, but the branching possibilities for betting are the more difficult challenge. In a no-betting game of Hold’em with P players, there are […] possible situations. This is […] hands in the 2-player case, […] in the 3-player case, and grows by a similar factor when expanded to the 4-, 5-, or 6-player case. The probability of winning is the probability that the 5 cards on the table plus two unknown other cards from the deck are a better hand than any that another player holds. In Texas Hold’em, there are four betting stages, one after each stage of cards is revealed. Billings et al. use a reduced-complexity game (limiting betting to 3 rounds per stage) and find a complexity of […] in the two-hand case [15]. That means the 2-player, 3-round game complexity is comparable in size to a no-betting 4-player game, with […] card combinations possible.
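The growth in card combinations with the number of players can be checked directly. The sketch below uses one possible counting convention, which is an assumption here and not necessarily the one used in the paper: seats are distinguishable, each player receives 2 unordered hole cards, and the 5 board cards are counted as an unordered set. The function name `holdem_deals` is hypothetical.

```python
from math import comb


def holdem_deals(players):
    """Count distinct deals in a no-betting Hold'em game.

    Assumed convention: distinguishable seats, 2 unordered hole cards per
    player, and 5 unordered community cards dealt from what remains.
    """
    remaining = 52
    total = 1
    for _ in range(players):
        total *= comb(remaining, 2)  # hole cards for the next seat
        remaining -= 2
    return total * comb(remaining, 5)  # community cards


if __name__ == "__main__":
    for p in range(2, 7):
        print(f"{p} players: {holdem_deals(p):.3e} deals")
```

Under this convention the count is roughly 2.8 × 10^12 for 2 players, 2.5 × 10^15 for 3 players, and 2 × 10^18 for 4 players, so each added player multiplies the count by a factor on the order of a thousand.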

Unlike a no-betting game, however, the number of things to be considered includes much more than the simple probability that the hand held is better than those held by other players. That calculation itself is not affected by the additional branching due to player choices. The somewhat more difficult issue is that the additional branching requires Bayesian updates to estimate the probable distribution of hand strengths held by other players based on their decisions, which significantly increases the complexity of solving the game. The most critical challenge, however, is that each player bets based not only on the hidden information provided by their own cards, but also on the betting behavior of other players. Opponents make betting decisions on the basis of non-public information (in Texas Hold’em, hole cards), and strategy for betting requires a meta-update taking advantage of the information the other player reveals by betting. The players must also update on the basis of potential strategic betting by other players, which occurs when a player bets in a way calculated to deceive. To deal with this, poker players need to model the strategic decisions of other players. This complex model of strategic decisions must be re-run for all the possible combinations at each decision point in order to arrive at a conclusion. After all of this, the computer must itself decide whether to engage in strategic play.
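The Bayesian update described here can be sketched for a single observed bet. This is a deliberately small illustration: the three hand-strength categories and every probability below are hypothetical numbers chosen for the example, and `posterior_over_hands` is an assumed helper name. It also shows why the opponent model matters: the same bet is much weaker evidence of a strong hand when the opponent is believed to bluff.

```python
def posterior_over_hands(prior, bet_likelihood):
    """Bayes' rule: P(hand | bet) is proportional to P(bet | hand) * P(hand)."""
    unnormalized = {h: prior[h] * bet_likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: w / evidence for h, w in unnormalized.items()}


# Hypothetical prior over the opponent's hand strength.
prior = {"weak": 0.6, "medium": 0.3, "strong": 0.1}

# Hypothetical opponent models: probability of betting given each hand.
straightforward = {"weak": 0.1, "medium": 0.5, "strong": 0.9}
bluffer = {"weak": 0.5, "medium": 0.5, "strong": 0.9}

if __name__ == "__main__":
    print("vs straightforward:", posterior_over_hands(prior, straightforward))
    print("vs bluffer:        ", posterior_over_hands(prior, bluffer))
```

In a full game this update would have to be repeated at every decision point and for every candidate opponent strategy, which is the source of the complexity discussed above.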

Behaviors like bluffing and slow-play are based on these dynamics, which become much more complex as the number of rounds of betting and the number of players increases. For example, slow-play, which involves underbidding compared to the strength of your hand, requires that the players will later be able to raise the stakes. This is why the complexity of this agent modeling grows as a function of the number of choices and stages at which each agent makes a decision. This type of complexity is common in multi-agent systems, and the problem in general is much less limited in scope than what can be illustrated by a rigidly structured game like poker.

2.2 Limited Complexity Models versus the Real World

In machine learning systems, the underlying system is approximated by implicitly or explicitly learning a multidimensional transformation between inputs and outputs. This transformation approximates a combination of the relationships between inputs and the underlying system, and between the system state and the outputs. The complexity of the model learned is limited by the computational complexity of the underlying structure, and while the number of possible states for the input is large, it is typically dwarfed by the number of possible states of the system.

The critical feature of machine learning that allows such systems to be successful is that most relationships can be approximated without inspecting every available state. (All models simplify the systems they represent.) The implicit simplification done by machine learning is often quite impressive, picking up on clues present in the input that humans might not notice, but it comes at the cost of having difficult to understand and difficult to interpret implicit models of the system.

Any intelligence, whether machine learning-based, human, or AI, requires similar implicit simplification, since the branching complexity of even a relatively simple game like Go dwarfs the number of atoms in the universe. Because even moderately complex systems cannot be fully represented, the types of optimization failures discussed above are inevitable. The contrapositive to Conant and Ashby’s theorem is that if a system is more complex than the model, any attempt to control the system will be imperfect.

Learning, whether human or machine, builds approximate models based on observations, or input data. This implies that the behavior of the approximation in regions far from those covered by the training data is more likely to markedly differ from reality. The more systems change over time, the more difficult prediction becomes - and the more optimization is performed on a system, the more it will change.

2.3 Failure modes

Because an essential part of multi-agent dynamic system modeling is opponent modeling, the opponent models are a central part of any machine learning model. These opponent models may be implicit in the overall model, or they may be explicitly represented, but they are still models that are approximate. In many cases, opponent behavior is ignored - by implicitly simplifying other agent behavior to noise, or by assuming no adversarial agents exist. Because these models are imperfect, they will be vulnerable to over-optimization failures discussed above.

First, the examples given in this list are incomplete, as they primarily discuss failures that occur between two parties, such as a malicious actor and a victim, or failures induced by multiple individually benign agents. This excludes strategies where agents manipulate others indirectly, or those where coordinated interaction between agents is used to manipulate the system.

Second, this list does not include the myriad ways in which other factors can compound metric failures. These are critical, but may involve overoptimization, or multiple-agent interaction, only indirectly. For example, O’Neil discusses a class of failure involving the interaction between the system, the inputs, and validation of outputs [16]. These failures occur when a system’s metrics are validated in part based on outputs it contributes towards. For example, a system predicting greater crime rates in areas with high minority concentrations leads to more police presence, which in turn leads to a higher rate of crime found. This higher rate of crime in those areas is used to train the model, which leads it to reinforce the earlier unjustified assumption. Such cases are both likely to occur, and especially hard to recognize, when the interaction between multiple systems is complex, and it is unclear whether the system’s effects are due in part to its own actions. (This class of failure seems particularly likely in systems that are trained via “self-play,” where failures in the model of the system get reinforced by incorrect feedback on the basis of the models, which is also a case of model insufficiency failure.)
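The self-reinforcing loop in the policing example can be shown with a very small deterministic toy. The sketch below is an illustration under strong assumptions that are not from the paper: two districts have the same true incident rate, a single patrol is always sent to the district with the most recorded incidents, and incidents are recorded only where the patrol is present. The function name `run_feedback_loop` is hypothetical.

```python
def run_feedback_loop(initial_records, true_rate=10, rounds=20):
    """Send the patrol where records are highest; record only what it finds.

    Assumptions (illustrative): every district has the same true_rate of
    incidents per round, and only the patrolled district's incidents are
    added to the records used for the next allocation.
    """
    records = list(initial_records)
    for _ in range(rounds):
        # The "model" simply predicts more crime where more was recorded.
        target = max(range(len(records)), key=records.__getitem__)
        records[target] += true_rate
    return records


if __name__ == "__main__":
    # A one-incident difference in the starting records is never corrected.
    print(run_feedback_loop([11, 10]))
```

Even though the districts are identical by construction, the records diverge without bound, because the system is validated only against data that its own decisions generated.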

The list also excludes failures that do not directly involve metric overoptimizations, such as systems learning unacceptable behavior implicitly due to biased training data, or equivalently failing to attempt to optimize for social preferences like fairness. These are again important, but they are more basic failures of system design.

With those caveats, we propose the following classes of multi-agent overoptimization failures. For each, a general definition is provided, followed by one or more toy models that demonstrate the failure mode. These models are deliberately simplified, but where possible, real-world examples of the failures exhibited in the model are suggested. The specifics of the strategies that can be constructed and the structure of the system can be arbitrarily complex, but as explored below, the ways in which these models fail can still be understood generally.

In the toy models, each agent has a metric and a goal, and the metric is an imperfect proxy for the goal. Where relevant, there is a victim agent and an opponent agent that attempts to exploit it.

Failure Mode 1.

Accidental Steering is when agents alter the systems in ways not anticipated by another agent, creating one of the above-mentioned single-party over-optimization failures.


This failure mode manifests similarly to the single-agent case and differs only in that agents do not anticipate the actions of other agents. When agents have closely related goals, even if those goals are aligned, it can exacerbate the types of failures that occur in single-agent cases.

Because each agent alone does not (or cannot) trigger the failure, this differs from the single-agent case. The distributional shift can occur due to a combination of actors’ otherwise potentially positive influences by either putting the system in an extremal state where the previously learned relationship decays, or triggering a regime change where previously beneficial actions are harmful.

Model (1.1).

A set of agents each have goals which affect the system in related ways, and the metric relationship changes in the extremal region where x > a. Each agent is able to influence the system by an amount that is individually smaller than a, but the agents’ combined influence exceeds a.
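A minimal numerical sketch of this kind of threshold crossing follows. The functional form is an assumption made for illustration (the metric equals the state x everywhere, while the true goal tracks x only up to the threshold a and declines beyond it), and the names `metric_and_goal` and `combined_state` are hypothetical.

```python
def metric_and_goal(x, a):
    """Toy relationship between metric and goal around threshold a.

    Assumed form: the metric always equals x; the goal equals x while
    x <= a, and falls by the amount of the overshoot once x > a.
    """
    metric = x
    goal = x if x <= a else a - (x - a)
    return metric, goal


def combined_state(influences):
    """System state when every agent applies its influence at once."""
    return sum(influences)


if __name__ == "__main__":
    a = 10
    influences = [4, 4, 4]  # each agent alone stays well below a
    for d in influences:
        print("alone:   ", metric_and_goal(d, a))
    print("together:", metric_and_goal(combined_state(influences), a))
```

Each agent acting alone sees its metric and goal agree, so none of them has evidence of a problem; only the combined state enters the region where the metric keeps rising while the goal falls.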


In the presence of multiple agents without coordination, manipulation of factors not already being manipulated by other agents is likely to be easier and more rewarding, potentially leading to inadvertent steering due to model inadequacy.

Model (1.2).

Each agent manipulates their own variable, unaware of the overall impact. Because they cannot see other agents’ variables, there is no obvious way to limit the combined impact on the system to stay below the catastrophic threshold. Because each agent is exploring a different variable, they each are potentially optimizing different parts of the system.


This type of coordination failure can occur in situations like overfishing across multiple regions, where each group catches local fish, which they can see, but at a given threshold across regions the fish population collapses, and recovery is very slow.


Smaldino and McElreath [17] show this failure mode specifically occurring with statistical methodology in academia, where academics find novel ways to degrade statistical rigor. The more general “Mutable Practices” model presented by Braganza [6], based in part on Smaldino and McElreath, has each agent attempting both to outperform the other agents on a metric and to fulfill a shared societal goal. It allows agents to evolve and find new strategies that combine to subvert the societal goal.

Failure Mode 2.

Coordination Failure occurs when multiple agents clash despite having potentially compatible goals.


Coordination is an inherently difficult task, and can in general be considered impossible[18]. In practice, coordination is especially difficult when the goals of other agents are incompletely known or not fully understood. Coordination failures such as Yudkowsky’s Inadequate equilibria[19] are stable, and coordination to escape from such an equilibrium can be problematic even when agents share goals.


Model (2.1).

A fixed resource is split between uses by different agents. Each agent has funds, and is given a share of the resource in proportion to the cost it pays. The agents then spend their remaining funds to exploit the resources for each use. Preferences and gains from different uses are heterogeneous, and even if all are non-negative, funds will be wasted on resource contention.


In this case, we see that conflicting instrumental goals that neither side anticipates will cause wasted funds due to contention. Above-nominal spending on resources in order to capture them from aligned competitor-agents will reduce funds available for exploitation of those resources, even though less resource contention would benefit all agents.


Different forms of scientific research benefit different goals differently. Even if spending in every area benefits everyone, a fixed pool of resources implies that with different preferences, contention between projects with different positive impacts will occur. To the extent that effort must be directed towards grant-seeking instead of scientific work, the resources available for the projects themselves are reduced.


Coordination limiting overuse of public goods is a major area of research in economics. Ostrom explains how such coordination is only possible when conflicts are anticipated or noticed and where a reliable mechanism can be devised [20].


Model (2.2).

As above, but each agent has an identical reward function. Even though all goals are shared, a lack of coordination in the above case leads to overspending. Coordinated gains would be possible if agents coordinate to minimize overall spending on resource acquisition.
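The overspending dynamic can be made concrete with a small payoff calculation. The payoff form below is an assumption introduced for illustration, since it is not given in the text: each agent's share of the resource is proportional to its bid, and its payoff is that share multiplied by the funds it has left for exploitation. The function name `payoffs` and the numbers are hypothetical.

```python
def payoffs(bids, funds=10.0, resource=1.0):
    """Payoff for each agent under an assumed proportional-share rule.

    share_i  = bid_i / sum(bids)
    payoff_i = resource * share_i * (funds - bid_i)
    All agents have the same funds and the same payoff function.
    """
    total = sum(bids)
    return [resource * (b / total) * (funds - b) for b in bids]


if __name__ == "__main__":
    print("both bid low:   ", payoffs([1.0, 1.0]))
    print("one escalates:  ", payoffs([3.0, 1.0]))
    print("both escalate:  ", payoffs([3.0, 3.0]))
```

With these numbers, unilaterally raising a bid pays off for the agent that does it, but when both agents escalate they end with the same shares as before and less left to spend, which is the wasted contention the model describes.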


Coordination mechanisms themselves can be exploited by agents. The field of algorithmic game theory has a number of results for why this is only sometimes possible, and how building mechanisms to avoid such exploitation is possible [21].

Failure Mode 3.

Adversarial misalignment occurs when a victim agent has an incomplete model of how an opponent can influence the system. The opponent’s model of the victim allows it to intentionally select for cases where the victim’s model performs poorly and/or promotes the opponent’s goal [2].




The opponent can select for cases where y is large and X is small, so that the victim chooses maximal values of X, to the marginal benefit of the opponent.


A victim’s model can be learned by “stealing” models using techniques such as those explored by Tramèr et al. [22]. In such a case, the information gained can be used for model evasion and other attacks mentioned there.


Chess and other game engines may adaptively learn and choose openings or strategies for which the victim is weakest.


Sophisticated financial actors can dupe victims into buying or selling an asset (“Momentum Ignition”) in order to exploit the resulting price changes [23].


The probability of exploitable reward functions increases with the complexity of the system it manipulates[4], and the simplicity of the agent and their reward function. The potential for exploitation by other agents seems to follow the same pattern, where simple agents will be manipulated by agents with more accurate opponent models.


Model (3.2).

An attacker can discover exploitable quirks in the goal function to make the victim agent optimize for a new goal, as in Manheim and Garrabrant’s Campbell’s law example [2], slightly adapted here.


[…] selects after seeing […]’s choice of metric. Here, the opponent selects for values with high […], and the victim’s later selection then creates a relationship between […] and their goal, especially at the extremes. The opponent does this by selecting for a metric such that even weak selection on […] hijacks the victim’s selection on […] to achieve their goal. The agent’s choice of metric need not be a useful proxy for their goal absent the regulator’s action. In the example given, if […], the correlation between […] and […] is zero over the full set of states, but becomes positive on the subspace selected by the victim.

Failure Mode 4.

Input spoofing and filtering - Filtered evidence can be provided or false evidence can be manufactured and put into the training data stream of a victim agent.


Model (4.1).

The victim agent receives public data about the present world-state, and builds a model to choose actions which return rewards. The opponent can generate events to poison the victim’s learned model.


See the classes of data poisoning attacks explored by Wang and Chaudhuri [24] against online learning, and by Chen et al. [25] for creating backdoors in deep-learning verification systems.


Financial market participants can (illegally) spoof by posting orders that will quickly be cancelled in a “momentum ignition” strategy to lure others into buying or selling, as has been alleged to occur in high-frequency trading [23].


Rating systems can be attacked by inputting false reviews into a system, or by discouraging reviews by those likely to be the least or most satisfied reviewers.
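Both attacks on a rating system can be sketched with a few lines of arithmetic. The ratings below are hypothetical numbers, and the helper names `spoofed` and `filtered` are assumptions made for this example. The first helper injects false reviews, and the second removes the reviews of the least satisfied reviewers.

```python
from statistics import mean


def spoofed(ratings, n_fake, fake_value=5):
    """Input spoofing: append n_fake fabricated reviews."""
    return ratings + [fake_value] * n_fake


def filtered(ratings, n_suppressed):
    """Input filtering: drop the n_suppressed lowest genuine reviews."""
    return sorted(ratings)[n_suppressed:]


# Hypothetical genuine ratings on a 1 to 5 scale.
genuine = [1, 2, 2, 3, 3, 4, 4, 5]

if __name__ == "__main__":
    print("genuine mean: ", mean(genuine))
    print("spoofed mean: ", mean(spoofed(genuine, 4)))
    print("filtered mean:", mean(filtered(genuine, 2)))
```

In the filtering case every remaining review is true, yet the victim's estimate is still biased, because the distribution of evidence it sees has been altered.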


Model (4.2).

As in (4.1), but instead of generating false evidence, true evidence is hidden to systematically alter the distribution of events seen.


Financial actors can filter the evidence available to other agents by performing transactions they don’t want seen as private transactions or dark pool transactions.



Model (4.3).

As in (4.1), but the victim agent employs active learning. In this case, the opponent can potentially fool the system into collecting data that seems very useful to the victim from crafted poisoned sources.


Honeypots can be placed or Sybil attacks mounted by opponents in order to fool victims into learning from examples that systematically differ from the true distribution.

Failure Mode 5.

Goal co-option is when an opponent controls the system the Victim runs on, or relies on, and can therefore make changes to affect the victim’s actions.


Model (5.1).

The opponent directly modifies the victim’s reward function to achieve a different objective than the one originally specified.

Model (5.2).

The opponent intercepts and modifies the victim’s output.

Model (5.3).

The opponent modifies externally stored scoring rules (labels) or data inputs provided to the victim.


Xiao, Xiao, and Eckert explore a “label flipping” attack against support vector machines [26], where modifying a limited number of labels used in the training set can cause performance to deteriorate severely.
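The effect of flipped training labels can be illustrated on a toy problem. Note the substitution: the sketch below uses a one-dimensional nearest-centroid classifier instead of a support vector machine, purely to stay self-contained, and it does not reproduce the attack algorithm of the cited work. The data, the choice of which labels to flip, and all function names are hypothetical.

```python
def fit_centroids(xs, labels):
    """Nearest-centroid 'training': the mean x value of each class."""
    centroids = {}
    for cls in set(labels):
        members = [x for x, y in zip(xs, labels) if y == cls]
        centroids[cls] = sum(members) / len(members)
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda cls: abs(x - centroids[cls]))


def accuracy(centroids, xs, labels):
    hits = sum(predict(centroids, x) == y for x, y in zip(xs, labels))
    return hits / len(xs)


xs = list(range(20))
clean = [0] * 10 + [1] * 10
# The adversary flips 6 of 20 labels: the three lowest and three highest points.
poisoned = [1] * 3 + [0] * 7 + [1] * 7 + [0] * 3

if __name__ == "__main__":
    # Both models are evaluated against the clean (true) labels.
    print("trained on clean labels:   ", accuracy(fit_centroids(xs, clean), xs, clean))
    print("trained on poisoned labels:", accuracy(fit_centroids(xs, poisoned), xs, clean))
```

In this contrived case, flipping the labels of the most extreme points swaps the order of the two class centroids, so the classifier trained on poisoned labels misclassifies every point even though most of its training labels were correct.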


Direct access to the victim may allow manipulation, but this may be discovered. Similar access can also allow less obviously detectable observations, which can allow an opponent to engage in various other exploits explored earlier.

To conclude the list of failure modes, it is useful to note a critical area where the failures are induced or amplified. This is when agents explicitly incentivize certain behaviors on the part of other agents, perhaps by providing payments. These public interactions and incentive payments are not fundamentally different from other failure modes, but can create or magnify any of the other modes. A second, related case arises when an agent creates incentives but fails to anticipate the ways in which the other agents can achieve the incentivized target. These so-called “Cobra effects” can lead both to the simpler failures of the single-agent cases explored in Manheim and Garrabrant, and to the failures above.

3 Discussion

Many of the failure modes can already be seen in human systems and in certain applications of machine learning, as the examples show. They are also more general, and should be expected to occur in any system containing agents with an incomplete or inexact model of other agents. The various failure modes represented in this paper are not exhaustive, and while mitigations for each exist, the fundamental dynamics driving them seem unavoidable.

Failures that occur due to incomplete models of other agents are likely to be in some sense surprising. They cannot occur until multiple machine learning agents are deployed, and certain of the attacks require a degree of sophistication that is unlikely to be present in early agents. Specification gaming of these types will, for now, seem to be a mitigable problem, as the regulatory discussion in finance shows[23], but the current trajectory of these systems means that the problems will inevitably worsen as they become more complex and more such systems are deployed.

In certain domains, particularly finance, the failure modes have been seen widely despite the limited complexity of agents deployed. In other domains, such as bots engaging in social network manipulation or various forms of more direct interstate competition, it is plausible that deployed systems are already suffering from these failure modes in ways not yet apparent to the public. Despite this, mitigating the failure modes discussed here seems to receive little attention from those interested in building better ML systems.

In the realm of direct work for AI safety, these failure modes seem to have particularly important implications for the work of Christiano, Shlegeris and Amodei [27]. In that work, the coordinating agents are predictive instead of agentic, so the failure modes are more restricted. The methods suggested can also be extended to agentic systems, where they may prove more worrisome. Work on safe amplification using coordinated multi-agent systems in that context has begun, and solving the challenges potentially involves mitigating several failure modes outlined here.

4 Conclusions

The failure modes outlined (accidental steering, coordination failures, adversarial misalignment, input spoofing or filtering, and goal co-option or direct hacking) are all due to models that do not account for other agent behavior. Because all models must simplify the systems they represent, the prerequisites for these failures are necessarily present in complex-enough systems where multiple non-coordinated agents interact. It is possible that governmental actors, policymakers, and commercial entities will recognize the tremendous complexities of multi-party coordination among autonomous agents and address these failure modes, or slow deployment and work towards addressing these problems even before they become apparent. Alternatively, it is possible these challenges will quickly become apparent, and will so greatly hamper these systems that mitigating these failure modes will be prioritized. This depends on how critical the failures are, and whether the public will demand they be addressed. At present, it seems hard to be hopeful on any of these fronts.

Work addressing the failure modes outlined in the paper is potentially very valuable, in part because these failure modes are mitigable or avoidable if anticipated. AI and ML system designers and users should expect that many currently successful but naive agents will be exploited in the future. Because of this, the failure modes are likely to become more difficult to address if deferred, and it is therefore particularly critical to understand and address them preemptively. This may take the form of systemic changes like redesigned financial market structures, or may involve ensuring that agents have built-in failsafes, or that they fail gracefully when exploited.

Even if AI amplification remains wholly infeasible, humanity is already deploying autonomous systems with limited ability for humans to provide feedback or control in real-time. The depth of complexity is significant but limited in current systems, and the strategic interactions of autonomous systems are therefore even more limited. But just as AI for poker eventually became capable enough to understand multi-player interaction and engage in strategic play, AI in other systems should expect to be confronted with these challenges. We don’t know when the card sharks will show up, or the extent to which they will make the game unplayable for others, but we should admit now that we are as-yet unprepared for them.


This research was funded in large part by a grant from the Berkeley Existential Risk Initiative.