(When) Is Truth-telling Favored in AI Debate?

by Vojtěch Kovařík, et al.

For some problems, humans may not be able to accurately judge the goodness of AI-proposed solutions. Irving et al. (2018) propose that in such cases, we may use a debate between two AI systems to amplify the problem-solving capabilities of a human judge. We introduce a mathematical framework that can model debates of this type and propose that the quality of debate designs should be measured by the accuracy of the most persuasive answer. We describe a simple instance of the debate framework called feature debate and analyze the degree to which such debates track the truth. We argue that despite being very simple, feature debates nonetheless capture many aspects of practical debates such as the incentives to confuse the judge or stall to prevent losing. We then outline how these models should be generalized to analyze a wider range of debate phenomena.




1 Introduction

In recent years, AI systems have succeeded at many complex tasks, such as mastering the game of Go (Silver et al., 2017). However, such successes are typically limited to tasks with an unambiguous reward function. To circumvent this limitation, we can define success in vague tasks in terms of a human’s approval of the proposed solution. For example:

  • The goodness of a simulated robot backflip is hard to formalize, but an AI system can be trained to maximize the extent to which a human observer approves of its form (Christiano et al., 2017).

  • The goodness of a film-recommendation is subjective, but an AI system can be trained to maximize the extent to which a human approves of the recommendation.

Unfortunately, once tasks and solutions get too complicated to be fully understood by human users, direct human approval cannot be used to formalize the reward function. For example, we could not have directly trained the AlphaGo algorithm by maximizing the approval of each move, because some of its moves looked strange or incorrect to human experts.

Irving, Christiano, and Amodei (2018) suggest addressing this issue by using AI debate. In their proposal, two AI systems are tasked with producing answers to a vague and complex question, and subsequently debating the merits of their answers in front of a human judge. After considering the arguments brought forward, the human approves one of the answers and allocates reward to the AI system that generated it. We can apply such AI debate to a wide range of questions: (1) what is the solution of a system of algebraic equations, (2) which restaurant should I visit today for dinner, or (3) which of two immigration policies is more socially beneficial. Moreover, (4) we can see an entire game of Go (or a similar game) as a debate. Indeed, every move is an argument claiming “my strategy is the better one, since this move will eventually lead to my victory”, and the player who scores more points becomes the winner of the debate.

Debates (1) and (4) are examples of a setting where it is straightforward to determine the debate’s winner objectively, making the most convincing answer coincide with the most accurate one. In other debates, such as (2) and (3), misleading arguments may allow a compelling lie to defeat the correct answer. The central question is thus not whether an AI debate tracks truth, but when it does so. To benefit from AI debate, we will therefore need to identify settings that favor more accurate answers over less accurate ones, and use this knowledge to design debates in which the victorious answer is the correct one. While researchers have started exploring these questions empirically, the theoretical investigation of AI debate has, at the time of writing, mostly been neglected. We aim to fill this gap by providing a theoretical framework for reasoning about AI debate, analyzing its basic properties, and identifying further questions that need to be addressed. To ease the interpretation of our work, we describe each phenomenon in the simplest possible model. However, we also outline the extensions necessary to make the analysis more realistic.

The paper is structured as follows. Section 2 introduces the model of AI debate and formalizes the problem of designing debates that promote true answers. In Section 3, we describe an instance of the general model where the debaters are only allowed to make statements about “elementary features” of the world. Section 4 then investigates which of these “feature debates” are truth promoting. Importantly, all these results can be viewed as abstractions of behaviour that we expect to encounter in more realistic debates. Section 5 continues in this direction by analyzing two important subclasses of general debates (those with “independent evidence” and those where the judge’s information bandwidth is limited) while using the specific example of feature debate for illustration. Section 6 lists the most important limitations of the feature debate toy model and gives suggestions for future work. We conclude by reviewing the most relevant pieces of existing literature (Section 7). The full proofs are presented in Appendix A.

2 General Model of Debate

We begin by introducing a general framework for modelling debate as a zero-sum game played between two AI systems.

2.1 Formal Definition of Debate

As we mentioned earlier, an AI debate starts with a human asking a question about the world and eliciting answers from two AI debaters, after which the debaters each argue that their answer is the better one. Taking this dialogue into account, the human then decides which answer seems more promising and accordingly divides some total reward between the debaters. [Footnote 1: Technically, the emphasis on human judges is for illustration only; we hope to eventually be able to automate many debates.] In this section, we formalize this setting and illustrate its different parts in several examples. In Section 3, we then give a fully formal example of the whole setup. We start by defining the “environment”, over which the designers of the debate generally have no control.

Definition 1 (Debate environment).

A debate environment is a tuple which consists of:

  • An arbitrary set of worlds and a prior distribution over it, from which the current world is sampled.

  • A set of questions, where each question is a text string.

  • An arbitrary set of answers.

  • A mapping which measures the deviation of an answer from the truth about the world.

  • A set of experiments the judge can perform to learn that the current world belongs to a particular subset of the worlds.

A useful informal interpretation of this definition is the “obvious” one: the worlds are all the hypothetical worlds, the questions are those we might ask, such as “Where would I prefer to eat dinner today?”, and the answers are the textual responses the debaters might produce. The truth mapping represents the ground truth about the question, while experiments constitute a cheaper (and possibly less reliable) way of obtaining information. In our example, the truth mapping could measure “how (dis)satisfied will I be if I actually go to the proposed restaurant?”, while an experiment could indicate my preference between two restaurants by comparing their menus.

An equally valid alternative view of the set of worlds is that it represents a set from which problems are sampled; in the case of Go, it could be “the set of all legal board positions”.

In many simple cases, we may set the deviation to 0 for correct answers and to 1 for incorrect or invalid ones.

When convenient, the experiment formalism can also encode the judge’s prior knowledge about the world.

The design elements of debate.

The upcoming definition of debate contains some “parameters” which instantiate the rules of any specific debate. While the formalizations of some of these rules only become justified in the context of Definition 2, we give their list now to ease the intuitive understanding of the main definition that comes afterwards. The parameters the designer of the debate can affect are:

  • the choice of question and legal answers,

  • communication language (an arbitrary set),

  • argumentation protocol,

  • termination condition,

  • experiment-selection policy, and

  • debating incentives.

In practice, some of these rules will be hard-coded and thus fully in the designer’s control. If we so choose, we can, for example, restrict the answers to a fixed set, make it physically impossible for the debating agents to communicate in anything other than binary code, or disallow repetition of arguments (via the argumentation protocol). On the other hand, the implementation of some rules might rely on the human judge, thus only giving the designer partial control. Some examples include: (i) The termination rule “stop the debate once the debaters no longer say anything relevant” (as opposed to “stop after a fixed number of rounds”). (ii) The experiment-selection policy “identify some factual disagreement and read about it on Wikipedia” (as opposed to a fully specified selection rule). (iii) The incentive “give a penalty to arguments that seem confusing” (as opposed to the well-specified rules of Go).

With this notation, debate can be formalized as follows:

Definition 2 (Debate and debate game).

A debate is characterized by a tuple consisting of a debate environment, a question, and a debate game. Formally, each debate game is a two-player zero-sum extensive-form game (EFG) that proceeds as in steps 1 to 6 below. [Footnote 2: A two-player zero-sum game is one where the utilities satisfy $u_1 + u_2 = 0$. As a result, it suffices to consider the first player’s utility $u_1$, and assume that player 2 is trying to minimize this number.] [Footnote 3: EFGs are a standard model of sequential decision-making, described, for example, in (Osborne and Rubinstein, 1994).]

  1. The world is sampled from the prior and shown to debaters 1 and 2 together with the question.

  2. The debaters simultaneously pick their answers.

  3. The debaters alternate making arguments (one debater makes the odd-numbered arguments while the other makes the even-numbered ones), stopping once the dialogue so far is a terminated dialogue.

  4. A single experiment is selected and its result is used as a context for the next step.

  5. The debaters receive their utilities, as determined by the debating incentives.

  6. The answer of the debater with higher utility becomes the outcome of the debate (with ties broken randomly).

When the debate includes a human, we refer to them as the judge, and equate them with “the entity that asked the question”. More generally, we sometimes anthropomorphize the above steps 1, 4, 5, and 6 by referring to them as being “performed by the judge”. However, for the purpose of our analysis, we view “the judge” (even a human one) not as an active participant in the debate game, but rather as an actor following a predetermined strategy. Nonetheless, it is essential to track the debate’s usefulness to “the judge”; we do so using the “judge error”, defined as the deviation between the outcome and the truth.

The model makes several simplifications, but the corresponding generalizations seem both canonical and straightforward:

Remark 3 (Natural extensions of the model).

We could consider the following generalizations of the model.

  • Non-simultaneous answers: The roles in the answering phase might be asymmetric, in that one debater might see their opponent’s answer before selecting theirs. [Footnote 5: As in, e.g., the Devil’s advocate AI (Armstrong, 2017a).]

  • Debaters with imperfect information: Instead of having perfect information about the world, the debaters might only receive some imperfect observation of it. This limitation is particularly relevant for scenarios involving human preferences (such as the restaurant example).

  • Judge interaction: To allow the judge to ask questions and make other interjections, we could add the judge as a “chance” player with a fixed strategy.

  • Dangerous experiments: In sufficiently general debates, some experiments might have dangerous side-effects. We could model this by considering experiments which sometimes just give information, but sometimes bypass the debate and assign the utilities directly.

  • Generalized outcome-selection policies: We could replace the policy of always adopting the winning answer by a general mapping from finished debates to outcomes. In particular, this would add the option to ignore the outcomes of suspicious or uninformative debates.

2.2 Desirable Debate Properties

Debate phases and relation to game theory

For the purpose of modelling the debaters’ actions, we distinguish between the answering phase (step 2 of Definition 2) and the argumentation phase (step 3). Once the world, the question, and the debaters’ answers are fixed at the start of the argumentation phase, the utilities of the debaters only depend on the subsequent arguments raised by the agents. Since the agents have full information about each argument, the argumentation phase becomes a two-player zero-sum sequential game with perfect information. We refer to this game as the argumentation game. To analyze the answering phase, we first recall an important property of two-player zero-sum games: In every such game, all Nash-equilibrium strategies result in the same expected utility (Shoham and Leyton-Brown, 2008, Thm 3.4.4). This number is called the value of the game. Assuming optimal argumentation strategies, each debater thus knows that playing the argumentation game results in a known utility (the value of that game), which allows them to abstract away the argumentation phase. By randomizing the order of argumentation and treating both debaters equally, we can further ensure that swapping the two debaters’ answers merely flips the sign of this value. As a result, each answering phase becomes a symmetric two-player zero-sum matrix game with the answers as actions and the values of the corresponding argumentation games as payoffs (to player 1).

These observations have an important implication: Fully general EFGs might contain complications that make finding their solutions difficult. However, both the answering game and the argumentation game belong to highly-specific and well-understood subclasses of EFGs and are thus amenable to simpler solution techniques.
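The reduction of the answering phase to a symmetric matrix game can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `optimal_pure_answers` and `make_value`, the finite answer grid, and the particular payoff (how much closer one answer is to a fixed, hypothetical judge belief than the other) are not taken from the model above. The sketch relies on the standard fact that a symmetric zero-sum game has value 0, so a pure answer is optimal exactly when it guarantees a non-negative payoff against every reply.

```python
from fractions import Fraction


def optimal_pure_answers(answers, value):
    """Pure answers that guarantee payoff >= 0 against every reply.

    In a symmetric zero-sum matrix game the value is 0, so these are
    exactly the pure strategies that are optimal for player 1.
    """
    return [a for a in answers if all(value(a, b) >= 0 for b in answers)]


def make_value(belief):
    """Hypothetical argumentation-game value: player 1 is rewarded for
    being closer to the judge's final belief than player 2."""
    def value(a, b):
        return abs(b - belief) - abs(a - belief)
    return value


# Assumed toy setting: answers on a grid, judge's final belief fixed at 3/10.
answers = [Fraction(i, 10) for i in range(11)]
value = make_value(Fraction(3, 10))
best = optimal_pure_answers(answers, value)
print(best)
```

With this payoff, the only answer that cannot be beaten is the one that coincides with the judge's final belief, which is why the later analysis focuses on what that belief can be.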

Measuring the usefulness of debate

We measure the suitability of a debate design by the degree to which optimal play by the debaters results in low judge errors. By default, we focus on the worst case where both the world and the debate outcome are selected adversarially. (By $\mathrm{supp}(\mu)$ we denote the support of a probability measure $\mu$, i.e. the set of all events compatible with $\mu$, defined as the smallest closed set $S$ for which $\mu(S) = 1$.)

Definition 4 (Truth promotion).

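In the following, we fix a debate and a number $\epsilon \geq 0$, and always consider some world in the support of the prior, a Nash-equilibrium strategy in the debate game, and a terminal dialogue compatible with this strategy in this world. The debate is said to be:

  • $\epsilon$-truth promoting (in the worst-case) in a given world if, for each such strategy and dialogue, the judge error is at most $\epsilon$,

  • $\epsilon$-truth promoting if it is $\epsilon$-truth promoting in every world in the support of the prior,

  • and $\epsilon$-truth promoting in expectation if, for each such strategy, the expected judge error is at most $\epsilon$.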

When a debate is $0$-truth promoting, we refer to it simply as “truth-promoting”. Finally, we formalize the idealized version of the design problem as follows:

Problem 5 (When is debate truth promoting?).

For a given debate environment and question, characterize those debate games for which any optimal strategy only gives answers with zero deviation from the truth.

3 Feature Debate

When predicting the properties of realistic debates, it is useful to have a toy version of the general debate framework. In this section, we thus describe a simple kind of debate inspired by Irving, Christiano, and Amodei (2018), in which the debaters take turns pointing out relevant “elementary features” of the world. (In their example, the world corresponds to an image from the MNIST database.) Crucially, rather than faithfully capturing all important aspects of debate, the purpose of the model is to provide a simple setting for investigating some of them. The limitations of the model are further discussed in Section 6.

3.1 Feature Debate as a Toy Model

Questions about functions

Many questions are naturally expressed as enquiries about some function $f$ of the world,

$q_f$ := “What is the value of [the name of $f$]?” (1)

and come accompanied by a numerical answer space. Examples of such questions include measurement estimations (“How far is the Moon?”, “How much do I weigh?”) and classification (“Which digit is on this picture?”, “Will person A beat person B in a poker match?”). For simplicity, we focus on questions about functions with values in $[0,1]$ and on the truth-mapping that measures the distance $|f(w) - a|$ between the true value of $f$ in the world $w$ and the answer $a$.

This assumption is, in fact, not very restrictive: Firstly, examples of such $f$ include binary questions of the type “Is $X$ true?” (where $f$ only takes the values 0 and 1), and their generalizations “How likely is $X$ to be true?” and “To what degree is $X$ true?”. Moreover, any debate about an “$n$-dimensional question” can be reduced into $n$ “1-dimensional” debates (and each bounded one can be re-scaled to take values in $[0,1]$).

Feature Debate and Its Basic Properties

In feature debate, we assume that worlds are fully described by their elementary features, i.e. we suppose that each world $w$ is a sequence of feature values $w_i \in [0,1]$. Moreover, we assume that each round consists of each debater making one argument of the form “the value of the $i$-th feature is $x$”. We consider a judge who can experimentally verify any elementary feature (but no more than one per debate):

Definition 6 (Feature debate environment).

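A feature debate environment is one where the worlds are sequences of features, the prior is a (Borel) probability measure on the set of worlds, the questions are of the form (1), the answers and the truth-mapping are as described above, and each experiment reveals the value of a single feature.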

In full generality, we should assume that the debaters can lie about the features. However, we could give each debater the option to announce “my opponent just lied”, causing the judge to immediately terminate the debate, inspect the given feature, and make whoever actually lied lose the debate. With this addition, it becomes a necessary part of any optimal strategy to never lie about anything the judge can verify. In feature debate, we thus only consider debaters who make truthful claims of the form “the $i$-th feature has value $w_i$”. With a slight abuse of notation, this allows us to identify the communication language with the feature-indexing set. Any argument sequence thus effectively reveals the corresponding features.

We suppose the judge has access to the world-distribution and can update it correctly on new information, but only has the “patience” to process a fixed number of pieces of evidence. Finally, the debaters are penalized for any deviation between their answer and the judge’s final belief and, to make the debate zero-sum, rewarded for any deviation of their opponent. Adopting the shorthand introduced earlier, the formal definition is as follows:

Definition 7 (Feature debate).

A feature debate is a debate with the following specific rules:

  • A randomly selected player makes the first argument.

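  • Each argument names a feature, i.e. the communication language is the feature-indexing set.

  • After the allotted number of arguments have been made, the judge sets the utility of debater 1 to $u_1 := |\hat{f} - a_2| - |\hat{f} - a_1|$, where $a_1$ and $a_2$ are the debaters’ answers and $\hat{f}$ is the judge’s final belief about the value of the function, i.e. its expected value under the prior conditioned on the revealed features.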

The zero-sum assumption implies that any shift in the judge’s belief will be endorsed by one debater and opposed by the other (or both will be indifferent). We will work with the two extreme values that the judge’s final belief can take under optimal argumentation: the value reached when the debater who makes the first argument aims for high values of the belief and the second-to-argue debater for low ones, and the value reached when these roles are reversed.

By a standard strategy-stealing argument, we can see that the second-to-argue debater always has an edge: the first of these two values never exceeds the second. Lemma 8 shows that, if the order of argumentation is randomized as in Definition 7, the optimal answers lie precisely within these bounds. This result immediately yields a general error bound which will serve as an essential tool for further analysis of feature debate.

Lemma 8 (Optimal play in feature debate).

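(i) The optimal answering strategies in a feature debate are precisely all those that select answers from the interval between the two extreme values of the judge’s final belief.

(ii) In particular, in any given world, the judge error under optimal play can be as large as the distance between the true value of the function and the farther endpoint of this interval, but no larger.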
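For small Boolean examples, the two extreme values of the judge's final belief can be computed by brute-force minimax. The sketch below is illustrative and rests on several assumptions that are not spelled out above: features are Boolean and finite in number, the judge's belief is the posterior mean of the function given the revealed features, `n_args` counts the total number of arguments (the debaters alternate), and re-revealing a known feature is allowed and acts as a "pass". The function names are hypothetical.

```python
from itertools import product
from math import isclose, prod


def posterior_mean(f, prior, world, revealed):
    """Judge's belief: E[f | the revealed features match the true world]."""
    num = den = 0.0
    for w in product((0, 1), repeat=len(world)):
        if all(w[i] == world[i] for i in revealed):
            num += prior(w) * f(w)
            den += prior(w)
    return num / den


def extreme_belief(f, prior, world, n_args, first_aims_high):
    """Final belief under optimal play when one debater pushes the belief
    up and the other pushes it down (n_args arguments in total)."""
    def play(revealed, turn):
        if turn == n_args:
            return posterior_mean(f, prior, world, revealed)
        first_to_move = turn % 2 == 0
        maximize = first_to_move == first_aims_high
        # Re-revealing an already known feature acts as a "pass".
        outcomes = [play(revealed | {i}, turn + 1) for i in range(len(world))]
        return max(outcomes) if maximize else min(outcomes)
    return play(frozenset(), 0)


def xor(w):
    return sum(w) % 2


def skewed_prior(w, delta=0.1):
    # Independent features, each equal to 1 with probability delta.
    return prod(delta if bit else 1 - delta for bit in w)


world = (1, 1)  # true value: xor(world) == 0
low = extreme_belief(xor, skewed_prior, world, n_args=2, first_aims_high=True)
high = extreme_belief(xor, skewed_prior, world, n_args=2, first_aims_high=False)
print(low, high)
```

In this toy instance the interval of optimal answers is wide even though the true value is 0, because whoever argues last can either reveal the second feature or withhold it.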

4 When Do Feature Debates Track Truth?

In this section, we assess whether feature debates track truth under a range of assumptions.

4.1 Truth-Promotion and Critical Debate-Length

Some general debates might be so “biased” that no matter how many arguments an honest debater uses, they will not be able to convince the judge of their truth. Proposition 9 ensures that this is not the case in a typical feature debate:

Proposition 9 (Sufficient debate length).

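A feature debate is truth-promoting for functions that depend on no more features than there are rounds in the debate.

Proof. In a debate that has at least as many rounds as there are features that the question depends on, either of the players can unilaterally decide to reveal all relevant information, ensuring that the judge’s final belief equals the true value of the function. This implies that both extreme values of the judge’s final belief coincide with the true value. The result then follows from Lemma 8. ∎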

However, Proposition 9 is optimal in the sense that if the number of rounds is smaller than the number of critical features, the resulting judge error might be very high.

Proposition 10 (Necessary debate length).

When the function depends on more features than there are rounds in the debate, the judge error can be 1 (i.e., maximally bad) in the worst-case world and equal to $\frac{1}{2}$ in expectation (even for continuous functions).

The “counterexample questions” which this result relies on are presented in the following Section 4.2 (and the formal proof is given in the appendix).

4.2 Three Kinds of Very Difficult Questions

We now construct three classes of questions which cause debate to perform especially poorly, in ways that are analogous to failures of realistic debates. [Footnote 7: While we focus on results in worst-case worlds, the illustrated behaviour might become the norm with a biased judge (Sec. 6.2).]

I) A question may be difficult to debate when arguing for one side requires more complex arguments. In feature debate, this property is present with the “conjunctive” question $f = W_1 \wedge \dots \wedge W_K$, under a uniform distribution over Boolean-featured worlds. In worlds with $w_1 = \dots = w_K = 1$, an honest debater has to reveal $K$ features to prove that $f(w) = 1$. On the other hand, a debater arguing for the false answer merely needs to avoid helping their opponent by revealing the features $W_1, \dots, W_K$. In particular, this setup shows that a debate might indeed require as many rounds as there are relevant features (i.e., the worst-case part of Prop. 10).
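The arithmetic behind the conjunctive example is short enough to check directly. The sketch below assumes a uniform prior over Boolean features, a world in which all relevant features equal 1, and a dishonest debater who never reveals a relevant feature; the function names are hypothetical. Under these assumptions, every still-hidden relevant feature halves the judge's belief that the conjunction holds.

```python
from fractions import Fraction


def conjunction_belief(num_relevant, num_revealed_ones):
    """Judge's belief that all relevant features equal 1, after seeing
    that num_revealed_ones of them do (uniform prior, Boolean features)."""
    hidden = num_relevant - num_revealed_ones
    return Fraction(1, 2 ** hidden)


def belief_gap_all_ones(num_relevant, honest_arguments):
    """Distance between the judge's final belief and the true value 1
    when only the honest debater reveals relevant features."""
    revealed = min(num_relevant, honest_arguments)
    return 1 - conjunction_belief(num_relevant, revealed)


for rounds in range(6):
    print(rounds, belief_gap_all_ones(5, rounds))
```

The gap only vanishes once the honest debater has had enough arguments to cover every relevant feature; one argument short still leaves the judge at a belief of 1/2.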

II) Even if a question does not bias the debate against the true answer as in (I), the debate outcome might still be uncertain until the very end. One way this could happen is if the judge always feels that more information is required to get the answer right. Alternatively, every new argument might come as a surprise to the judge, and be so persuasive that the judge ends up always taking the side of whichever debater spoke more recently. To see how this behavior can arise in our model, consider the function $f = \mathrm{xor}(W_1, \dots, W_K)$ defined on worlds with Boolean features, and a world in which these features all have value 1. [Footnote 8: Recall that xor has value 0 or 1, depending on whether the number of features with $w_i = 1$ is even or odd.] If the prior is uniform over Boolean-featured worlds, the judge will reason that no matter what the debaters say, the last unrevealed feature from the set $\{W_1, \dots, W_K\}$ always has a 50% chance of flipping the value of $f$ and a 50% chance of keeping it the same. As a result, the only optimal way of playing is to give the wrong answer $\frac{1}{2}$, unless a single debater can unilaterally reveal all $K$ features (i.e., unless the number of rounds is at least $K$). In particular, this proves the “in expectation” part of Proposition 10. To achieve the “always surprised and oscillating” pattern, we consider a prior under which each feature is sampled independently from $\{0,1\}$, but in a way that is skewed towards 0 (e.g., each feature has value 1 with probability $\delta$ for some small $\delta > 0$). The result of this bias is that no matter which features get revealed, the judge will always think that “no more features with value 1 are coming”. In other words, the judge will be very confident in their belief while, at the same time, shifting this belief from 0 to 1 and back each round.
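Both the "stuck at one half" and the "oscillating" behaviour of the parity question can be reproduced with a closed-form posterior. The sketch assumes independent Boolean features that each equal 1 with probability `delta`, and that every revealed feature turned out to be 1; the function name is hypothetical. It uses the standard identity that the sum of `r` independent Bernoulli(`delta`) variables is odd with probability (1 - (1 - 2·delta)^r) / 2.

```python
def parity_belief(total, revealed_ones, delta):
    """Judge's belief that the xor of `total` features equals 1, given
    that `revealed_ones` revealed features all have value 1."""
    remaining = total - revealed_ones
    # Probability that an odd number of the hidden features equal 1.
    p_odd_hidden = (1 - (1 - 2 * delta) ** remaining) / 2
    # The overall parity is odd iff the revealed and hidden parities differ.
    return p_odd_hidden if revealed_ones % 2 == 0 else 1 - p_odd_hidden


uniform = [parity_belief(4, m, 0.5) for m in range(5)]
skewed = [parity_belief(10, m, 0.01) for m in range(6)]
print(uniform)
print([round(b, 3) for b in skewed])
```

Under the uniform prior the belief stays at one half until the very last feature is revealed, while under the skewed prior it jumps between values close to 0 and values close to 1 with every newly revealed feature.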

III) Seemingly straightforward questions might contain many caveats and qualifications that one of the debaters needs to make. These complications can enable a dishonest debater to stall the debate and make confusing arguments, until the judge “runs out of patience” and goes with their possibly-wrong default opinion about the topic. To illustrate the idea, consider the uniform prior over and that only depends on the first features. Suppose (merely for convenience of notation) that the debaters give answers , , and that the sampled world is s.t. and . We define the auxiliary “stalling” function as if either or and as otherwise. With the uniform prior, this corresponds to an “unlikely problem” and an “equally unlikely fix” . (Where our adversarial choice of equates “unlikely” with “probability zero, but not impossible”.) By replacing by , where , we give the dishonest player an opportunity to “stall” for one round. Indeed, the presence of doesn’t initially affect the expected value of the function in any way. However, if player reveals that , the expectation immediately drops to , forcing player to “waste one round” by revealing that . To make matters worse yet, we could consider , or use the function that forces the honest player to waste rounds to “explain away” a single argument of the opponent.

4.3 Detecting Debate Failures

When a debater is certain that their opponent will not get a chance to react, they can get away with making much bolder (or even false) claims. [Footnote 9: Conversely, some realistic debates might provide first-mover advantage due to anchoring and framing effects.] The resulting “unfairness” is not a direct source for concern because the order of play can easily be randomized or made simultaneous. However, we may wish to measure the last-mover advantage in order to detect whether a debate is tracking truth as intended. The proof of Lemma 8 (namely, equation (6)) yields the following result:

Corollary 11 (Last-mover advantage).

If optimal debaters in a feature debate give answers $a_1$ and $a_2$, the one who argues second will get $|a_1 - a_2|$ expected utility.

Recall that, by Lemma 8, all answers from the interval between the two extreme values of the judge’s final belief are optimal. Corollary 11 thus implies that even if the agents debate optimally, some portion of their utility (up to the length of this interval) depends on the randomized choice of argumentation order. [Footnote 10: For illustration, a (literally) extreme case of last-mover advantage occurs in the “oscillatory” debate from Section 4.2, where the interval spans the whole answer space $[0,1]$.] Incidentally, Lemma 8 implies that the smallest possible judge error is half the length of this interval (which occurs when the true value lies at its midpoint). In other words, the lower bound on the judge error is (proportional to) the upper bound on the utility gained by arguing second. This relationship justifies a simple, common-sense heuristic: If the utility difference caused by reversing the argumentation order is significant, our debate is probably far from truth-promoting.
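The heuristic can be sketched as a small diagnostic. The utility below follows the feature-debate incentive described earlier (a debater is rewarded when the opponent's answer is farther from the judge's final belief than their own); the function names and the example numbers, including the two final beliefs, are hypothetical inputs rather than results from the text.

```python
from fractions import Fraction


def utility(own, other, belief):
    """Zero-sum reward: how much farther the opponent's answer is from
    the judge's final belief than one's own answer."""
    return abs(other - belief) - abs(own - belief)


def order_effect(own, other, belief_if_arguing_second, belief_if_arguing_first):
    """Utility gained by a debater purely from arguing second."""
    return (utility(own, other, belief_if_arguing_second)
            - utility(own, other, belief_if_arguing_first))


# Assumed example: debater 1 answers 1/5, debater 2 answers 7/10; the
# judge ends at belief 0 if debater 1 argues last and at 9/10 otherwise.
a1, a2 = Fraction(1, 5), Fraction(7, 10)
gain = order_effect(a1, a2, Fraction(0), Fraction(9, 10))
print(gain)
```

A gain close to zero indicates that the order of play hardly matters; a large gain, as in this example, is a warning sign that the debate outcome is driven by who speaks last.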

5 Two Important Special Cases of Debate

As a general principle, a narrower class of debates might allow for more detailed (and possibly stronger) guarantees. We describe two such sub-classes of general debates and illustrate their properties on variants of feature debate.

5.1 Debate with Independent Evidence

When evaluating solution proposals in practice, we sometimes end up arguing about their “pros and cons”. In a way, we are viewing these arguments as (statistically) independent evidence related to the problem at hand. This is often a reasonable approximation of reality (e.g., when deciding which car to buy), or even an especially good one (e.g., when interpreting questionnaire results from different but independent respondents). We now show how to model these scenarios as feature debates with statistically independent features, resulting in exceptionally favourable properties.

Feature debate with independent evidence

As a mathematical model of such a setting, we consider worlds extended by one additional binary coordinate, and denote by $W_i$ and $Y$ the coordinate projections on this extended space. Informally, we view the last coordinate as an unknown feature of the world, and the debate we construct will be asking “What is the value of this unknown feature?”. To enable inference about $Y$, we consider some probability distribution $P$ on the extended space. (For convenience, we assume $P$ is discrete.) Finally, to be able to treat arguments of the form “$W_i = w_i$” as independent evidence related to $Y$, we assume that the features $W_i$ are mutually independent conditional on $Y$. [Footnote 11: In other words, the conditional probability of any combination of feature values given $Y$ is equal to the product of the conditional probabilities of the individual feature values given $Y$.] To fit this setting into the feature-debate formalism, we define the prior as the marginalization of $P$ onto the features and consider the question “How likely is $Y = 1$ in our world?”, i.e. “what is the value of $f$, where $f(w) := P(Y = 1 \mid W = w)$”. We refer to the resulting debate as the “independent evidence” feature debate.

Judge’s belief and its updates

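Firstly, recall that any probability can be represented using its odds form, which is further equivalent to the corresponding log-odds form:

$$p \in [0,1] \;\longleftrightarrow\; \frac{p}{1-p} \in [0,\infty] \;\longleftrightarrow\; \log\frac{p}{1-p} \in [-\infty,\infty].$$

Moreover, when expressed in the log-odds form, Bayes’ rule states that updating one’s belief in a hypothesis $H$ in light of evidence $E$ is equivalent to shifting the log-odds of the prior belief by $\log \frac{P(E \mid H)}{P(E \mid \neg H)}$.

At any point of the debate, the judge’s belief is, by the definition of the question, equal to the conditional probability of the hypothesis given the features revealed so far. To see how the belief develops over the course of the debate, denote by $\ell$ the corresponding log-odds form. Initially, the belief is equal to the prior probability of the hypothesis, which corresponds to the prior log-odds $\ell_0$. Denoting

$$\mathrm{ev}_i := \log \frac{P(W_i = w_i \mid H)}{P(W_i = w_i \mid \neg H)},$$

the above form of Bayes’ rule implies that upon hearing an argument about the $i$-th feature, the judge will update their belief according to the formula $\ell \leftarrow \ell + \mathrm{ev}_i$. In other words, the arguments $i_1, \dots, i_n$ combine additively:

$$\ell = \ell_0 + \mathrm{ev}_{i_1} + \dots + \mathrm{ev}_{i_n}. \quad (3)$$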

Note that the update formula (3) implies that the effectiveness of an argument is not affected by any other arguments put forward in the debate.
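The additive form of the update is easy to verify numerically. The sketch below is a self-contained check with made-up likelihoods, not code from the paper: `update` shifts the prior log-odds by the log-likelihood ratio of each piece of evidence, and `direct_bayes` computes the same posterior by multiplying likelihoods under the conditional-independence assumption. All function names are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def evidence_strength(p_given_h, p_given_not_h):
    """Log-likelihood ratio of one piece of evidence."""
    return math.log(p_given_h / p_given_not_h)


def update(prior, likelihood_pairs):
    """Posterior via additive log-odds updates."""
    shift = sum(evidence_strength(a, b) for a, b in likelihood_pairs)
    return sigmoid(logit(prior) + shift)


def direct_bayes(prior, likelihood_pairs):
    """Posterior via Bayes' rule with conditionally independent evidence."""
    num, den = prior, 1 - prior
    for p_given_h, p_given_not_h in likelihood_pairs:
        num *= p_given_h
        den *= p_given_not_h
    return num / (num + den)


# Assumed example: pairs (P(evidence | H), P(evidence | not H)).
evidence = [(0.9, 0.3), (0.4, 0.8)]
print(update(0.2, evidence), direct_bayes(0.2, evidence))
```

Because the shifts simply add up, the order in which the arguments are presented does not change the final belief.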

Optimal strategies and evidence strength

Recall that positive (negative) log-odds correspond to probabilities closer to 1 (resp. 0). Equation (3) thus suggests that in any given world, the arguments can be split into three “piles” from which the debaters select arguments: the pile of arguments that raise the judge’s belief, the pile of arguments that lower it, and the pile of irrelevant arguments. As long as the debaters give different answers, one of them will only use arguments from the first pile, while the other will only use the second (both potentially falling back to the irrelevant pile if their own pile runs out). [Footnote 12: Formally, these argumentation incentives follow from the first paragraph in the proof of Lemma 8.] Moreover, a rational debater will always use the strongest arguments from their pile, i.e. those with the highest evidence strength. Correspondingly, we define the “total evidence strength” that a player can muster in a given number of rounds as the sum of the strengths of that many strongest arguments from the player’s pile.

(To make the discussion meaningful, we assume the evidence sequence is bounded and the maxima above are well-defined.) Equation (3) implies that, among optimal debaters, one always selects the strongest arguments in favor while the other selects the strongest arguments against. Since this holds independently of the argumentation order, the two extreme values of the judge’s final belief coincide. Together with Lemma 8, this observation yields the following result:

Corollary 12 (Unique optimal answer).

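In the answering phase of any independent-evidence feature debate with bounded evidence, the only Nash equilibrium is to select the answer whose log-odds form equals the prior log-odds, plus the total evidence strength that can be mustered in favor of the hypothesis, minus the total evidence strength that can be mustered against it.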
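The unique optimal answer can be computed by letting each side spend its strongest arguments. The sketch below assumes that evidence strengths are given as signed log-likelihood ratios (positive values raise the judge's belief, negative values lower it, zero is irrelevant) and that each debater makes one argument per round; the function names are hypothetical.

```python
import math


def debate_log_odds(prior_log_odds, strengths, rounds):
    """Judge's final log-odds when each side reveals its `rounds`
    strongest arguments (fewer if its pile runs out)."""
    in_favor = sorted((s for s in strengths if s > 0), reverse=True)
    against = sorted(s for s in strengths if s < 0)  # most negative first
    return prior_log_odds + sum(in_favor[:rounds]) + sum(against[:rounds])


def optimal_answer(prior_log_odds, strengths, rounds):
    """Probability corresponding to the judge's final log-odds."""
    return 1 / (1 + math.exp(-debate_log_odds(prior_log_odds, strengths, rounds)))


# Assumed example: five features with made-up evidence strengths.
strengths = [2.0, 0.5, -1.0, -0.25, 0.0]
for rounds in (1, 2, 3):
    print(rounds, debate_log_odds(0.0, strengths, rounds))
```

Once both piles are exhausted, additional rounds no longer change the answer, which matches the intuition that a sufficiently long debate reveals all relevant evidence.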

Judge error

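To compute the judge error in an independent-evidence feature debate, consider the strength of the evidence that remains in each debater’s “evidence pile” after the debate, i.e. the total strength of the pile minus the strength of the arguments already used.

Furthermore, assume that in addition to being bounded, the total evidence is infinite for at most one of the two piles. [Footnote 13: If a debater eventually has to reveal evidence against their own case, these numbers will get negative, so the “remaining evidence” intuition only applies while a debater’s own pile has not run out.] Since the true answer corresponds to the prior log-odds shifted by all of the evidence, the difference between the (log-odds forms of) the judge’s final belief and the optimal answer is the remaining evidence in favor minus the remaining evidence against.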

Early stopping and online estimation of the judge error

If we further assume that the debaters reveal the strongest pieces of evidence first, we can predict a debate’s outcome before all rounds have passed. If the most recent argument of a player has strength $s$, we know that each further round of debate cannot reveal more than $s$ additional evidence in favor of that player’s answer. This implies that as soon as the currently-losing player is no longer able to shift the judge’s belief beyond the midpoint between the initial answers, we can stop the debate without affecting its outcome. If we further know that the question at hand depends on a bounded number of features, we can also bound the difference between the judge’s final belief and the optimal answer. Indeed, in the worst-case scenario, all remaining arguments are in favor of the same player; even in this case, the (log-odds form of the) judge’s final belief can be no further from (the log-odds form of) the optimal answer than the total strength of the unrevealed arguments, each of which is at most as strong as the most recent argument of the player it favors.

5.2 Debate with Information-Limited Arguments

Sometimes, a single argument cannot convey all relevant information about a given feature of the world. For example, we might learn that a person lives in a city , but not their full address, or – in the language of feature debate – learn that lies in the interval , rather than understanding right away that . In such cases, it becomes crucial to model the judge’s information bandwidth as limited.

Feature Debate Representation

In feature debate, we can represent each in its binary form (e.g., ), and correspondingly assume that each argument reveals one bit of some . More specifically, we assume that (a) the debaters make arguments of the form “the -th bit of -th feature has value ”, (b) they have to reveal the -th bit of before its -th bit, and – using the same argument as in feature debate – (c) their claims are always truthful. Informally, each argument in this “information-limited” feature debate thus corresponds to selecting a dimension and doing a “ zoom” on along this dimension (Figure 1). (Footnote 14: Since is isomorphic to , is formally equivalent to some feature debate , justifying its name.)

Figure 1: Each argument in an information-limited debate reduces the set of feasible worlds by “zooming-in” on the sampled world along one dimension of . Here, the first two arguments provide information about (the -axis) and the third one about .
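The “zooming” effect of one-bit arguments can be made concrete with a short Python sketch. It assumes that every feature takes values in the interval [0, 1) and that each argument names only a feature index, with the revealed bit being the next unrevealed bit of that feature's binary expansion. The function name `debate_zoom` is hypothetical.

```python
def debate_zoom(world, arguments):
    """Track the set of feasible worlds as one-bit arguments are made.

    world:     list of feature values, each assumed to lie in [0, 1)
    arguments: list of feature indices; each argument truthfully reveals
               the next bit of that feature, halving its feasible interval
    Returns one (low, high) interval per feature.
    """
    box = [(0.0, 1.0) for _ in world]
    for i in arguments:
        low, high = box[i]
        mid = (low + high) / 2
        # The revealed bit says whether the feature lies in the upper half.
        if world[i] >= mid:
            box[i] = (mid, high)
        else:
            box[i] = (low, mid)
    return box
```

Mirroring Figure 1, `debate_zoom([0.625, 0.25], [0, 0, 1])` spends two arguments on the first feature and one on the second, leaving the feasible box `[(0.5, 0.75), (0.0, 0.5)]`.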

Performance bounds

By offering intermediate steps between features being completely unknown and fully revealed, allows for more nuanced guarantees than those from Section 4.1. (Informally stated, the assumptions of Proposition 13 can be read as “the values of differ by at most across different worlds, with feature being responsible for up to a i-fraction of the variance”.) (Footnote 15: While similar results hold for general “weight ratios” between features, we chose weights for their notational convenience.)

Proposition 13.

Suppose that is -Lipschitz continuous w.r.t. the metric = on . Then is -truth promoting. (Footnote 16: A function is Lipschitz continuous with constant (w.r.t. a metric ) if it satisfies .)

In contrast to Proposition 9, the -Lipschitz assumption thus allows us to get approximately correct debate outcomes long before having full knowledge of all features.

6 Limitations and Future Work

The language introduced so far does not fully capture all aspects of realistic AI debates; due to space limitations, it is simply not possible to cover all design variants and emergent phenomena in this initial work. In this section, we outline some notable avenues for making the debate model more accurate and useful, either by using an alternative instantiation of the general framework from Section 2 or by extending the toy model from Section 3. We start by discussing the modifications which are likely to improve the debate’s performance or applicability, and follow up with those which could introduce new challenges. For future-work suggestions that are not specific to modelling AI debate, we refer the reader to (Irving, Christiano, and Amodei, 2018).

6.1 Plausible Improvements to Debate

Commitments and high-level claims

As Irving, Christiano, and Amodei (2018) suggest, an important reason why debate might work is that the debaters can make abstract or high-level claims that can potentially be falsified in the course of the debate. For example, debaters might start out disagreeing whether a given image – of which the judge can only inspect a single pixel – depicts a dog or a cat. One debater might then claim that “here in the middle, there is a brown patch that is the dog’s ear”, to which their opponent counters “the brown patch is a collar on an otherwise white cat”. Such exchanges would continue until one debater makes a claim that is (i) inconsistent with their answer, (ii) inconsistent with “commitments” created by previous arguments, or (iii) specific enough to be verified by the judge. In our example, (iii) might arise with an exchange “this pixel is white, which could not happen if it belonged to a brown dog’s ear”, “actually, the pixel is brown”, which allows the judge to determine the winner by inspecting the pixel.

This ability to make high-level claims and create new commitments will often make the debate more time-efficient and incentivize consistency. Since consistency should typically advantage debaters that describe the true state of the world, commitments and high-level claims seem critical for the success of debate. We thus need a communication language that is rich enough to enable more abstract arguments and a set of effect rules (Prakken, 2006) which specify how new arguments affect the debaters’ commitments. To reason about such debates, we further need a model which relates the different commitments to arguments, initial answers, and each other.

One way to get such a model is to view as the set of assignments for a Bayesian network. In such a setting, each question would ask about the value of some node in , arguments would correspond to claims about node values, and their connections would be represented through the structure of the network. Such a model seems highly structured, amenable to theoretical analysis, and, in the authors’ opinion, intuitive. It is, however, not necessarily useful for practical implementations of debate, since Bayes networks are computationally expensive and difficult to obtain.

Detecting misbehaviour

One possible failure of debate is the occurrence of stalling, manipulation, collusion, or selective provision of evidence. To remedy these issues, we can introduce specific countermeasures for these strategies. One option for quantifying the contribution of discourse to a human’s understanding is to measure the changes in their ability to pass “exams” (Armstrong, 2017b). Another countermeasure would be to instantiate a meta-debate on the question of whether a debater is arguing fairly. However, such a meta-debate may, in some cases, be even more challenging to judge correctly than the original question.

Alternative utility functions

So far, we have only considered the utility that is “linear in each debater’s deviation from the judge’s belief” — in feature debate, this corresponded to , where . For completeness, we might also wish to consider other ways of converting into utilities (the formula “divide the total reward in proportion to ” being particularly promising).

Real-world correspondence

To learn which real-world debates are useful on the one hand, and which theoretical issues to address on the other, we need a better understanding of the correspondence between abstract debate models and real-world debates. For example, Proposition 13 ensures that “feature debates about Lipschitz questions behave nicely” — we would like to obtain a better understanding of whether/which real-world debates correspond to such questions, at least on the level of informal analogies.

6.2 Obstacles in Realistic Debates

Sub-optimal judges

Some debates might have a canonical idealized way of being judged, which the actual judge deviates from at one or more steps. A fruitful avenue for future research is to investigate the extent to which debate fails gracefully as the judge gets further from this ideal. One specific experiment regarding imperfect judges is to take some standard game (e.g., Go) and measure how much (and in which ways) we can modify the utility function without changing the game’s winner. Another approach would be to consider the idealized way of judging the feature debate, summarized as “set the prior to the true world-distribution, update the prior on each revealed piece of evidence, and, at the end of the debate, calculate the corresponding expectation over answers”. To model a judge who performs the first step imperfectly, we could consider a biased prior (distinct from the true world distribution ) and calculate the utilities using the corresponding biased belief.
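As a rough illustration of the idealized judging procedure and its biased-prior variant, the following Python sketch computes the judge's belief over a small, finite set of worlds. The discrete world set, the dictionary representation of priors, and the name `judge_belief` are assumptions made only for this example.

```python
def judge_belief(prior, f, revealed):
    """Expected answer under a prior, conditioned on the revealed features.

    prior:    dict mapping each world (a tuple of feature values) to its
              prior probability; may be the true or a biased distribution
    f:        function mapping a world to the correct answer
    revealed: dict mapping feature index -> truthfully revealed value
    """
    # Keep only the worlds consistent with every revealed feature.
    consistent = {
        w: p for w, p in prior.items()
        if all(w[i] == v for i, v in revealed.items())
    }
    total = sum(consistent.values())
    if total == 0:
        raise ValueError("revealed evidence has zero prior probability")
    return sum(p * f(w) for w, p in consistent.items()) / total


# Hypothetical example: two binary features, answer = 1 iff both are 1.
def both_ones(w):
    return float(w[0] == 1 and w[1] == 1)

true_prior = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
biased_prior = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.6}
```

After the first feature is revealed to be 1, the two judges disagree: the belief under `true_prior` is 0.5, whereas under `biased_prior` it is 0.6 / 0.8 = 0.75. Utilities computed from the biased belief therefore reward different arguments than those computed from the idealized one.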


So far, we have described failures that come from “the judge being imperfect in predictable ways”. However, real-world debates might also give rise to undesirable argumentation strategies inconceivable in the corresponding simplified model. For example, a debater might learn to exploit a bug in the implementation of the debate or, analogously, find a “bug” in the human judge. Worse yet, debaters might attempt to manipulate the judge using bribery or coercion. Note that for such tactics to work, the debaters need not be able to carry out their promises and threats — it is merely required that the judge believes they can.


Without the assumption of zero-sum rewards, the debaters gain incentives to collaborate, possibly at the expense of the accuracy of the debate. Such incentives might arise, for example, if we decided to give both agents a negative reward when the debate gets inconclusive. In such a scenario, the debaters might be willing to collaboratively manipulate the human into thinking the debate was conclusive (when it should not have been), even at the cost of having to toss a coin over who wins the debate afterwards.

Sub-optimal debaters

Once we consider debaters who argue sub-optimally, we should expect to see stronger debaters winning even in situations which give advantage to their weaker opponent. We might also encounter cases where the losing player complicates the debate game on purpose to increase variance in the outcome. Both of these phenomena can be observed between humans in games like Go, and we should expect analogous phenomena in general AI debate. One way of modelling such scenarios is to assume both debaters run the same debating algorithm with different computational budgets (e.g., Monte Carlo tree search with a different number of rollouts).

7 Related Work

AI safety via debate

The kind of debate we sought to model was introduced in (Irving, Christiano, and Amodei, 2018), wherein it was proposed as a scheme for safely receiving advice from highly capable AI systems. In the same work, Irving et al. carried out debate experiments on the MNIST dataset and proposed a promising analogy between AI debates and the complexity class PSPACE. We believe this analogy can be made compatible with the framework introduced in our Section 2, and deserves further attention. Kovařík (2019) then demonstrated how to use debate to train an image classifier and described the design elements of debate in more detail. AI debate is closely related to two other proposals: (1) “factored cognition” (Ought, 2018), in which a human decomposes a difficult task into sub-tasks, each of which they can solve in a manageable time (similarly to how debate eventually zooms in on an easily-verifiable claim), and (2) “iterated distillation and amplification” (Christiano, Shlegeris, and Amodei, 2018), in which a “baseline human judgement” is automated and amplified, similarly to how we hope to automate AI debates.

Previous works on argumentation

Persuasion and argumentation have been extensively studied in areas such as logic, computer science, and law. (Prakken, 2006) is an introduction to the area that also describes a language particularly suitable for our purposes. On the other hand, the extensive literature on argumentation frameworks (Dung, 1995) seems less relevant. The main reasons are (i) its focus on non-monotonic reasoning (where it is possible to retract claims) and (ii) that it assumes the debate language and argument structure as given (whereas we wish to study the connection between arguments and an underlying world model).

Zero-sum games

As noted in the introduction, we can view two-player zero-sum games as a debate that aims to identify the game’s winner (or an optimal strategy). Such games thus serve as a prime example of a problem for which the state-of-the-art approach is (interpretable as) debate (Silver et al., 2017). Admittedly, only a small number of problems are formulated as two-player zero-sum games by default. However, some problems can be reformulated as such games. While it is currently unclear how widely applicable such “problem gamification” is, it has been used for combinatorial problems (Xu and Lieberherr, 2019), theorem proving (Japaridze, 2009), and verification of topological properties (Telgársky, 1987). Together with (Silver et al., 2017), these examples give some evidence that AI debate might be competitive (with other problem-solving approaches) for a wider range of tasks.

8 Conclusion

We have introduced a general framework for modelling AI debates that aim to amplify the capabilities of their judge and formalized the problem of designing debates that promote accurate answers. We described and investigated “feature debate”, an instance of the general framework where the debaters can only make statements about “elementary features” of the world. In particular, we showed that if the debaters have enough time to make all relevant arguments, feature debates promote truth. We gave examples of two sub-classes of debate: those where the arguments provide statistically independent evidence about the answers and those where the importance of different arguments is bounded (in a known manner). We have shown that feature debates belonging to these sub-classes are approximately truth-promoting long before having had time to converge fully. However, we also identified some feature-debate questions that incentivize undesirable behaviour (such as stalling, confusing the judge, or exploiting the judge’s biases), resulting in debates that are unfair, unstable, and generally insufficiently truth-promoting. Despite its simplicity, feature debate thus allows for modelling phenomena that are highly relevant to issues we expect to encounter in realistic debates. Moreover, its simplicity makes feature debate well-suited for the initial exploration of problems with debate and testing of the corresponding solution proposals. Finally, we outlined multiple ways in which our model could be made more realistic — among these, allowing debaters to make high-level claims seems like an especially promising avenue.


References

  • Armstrong (2017a) Armstrong, S. 2017a. “Like this world, but…”. https://www.lesswrong.com/posts/5bd75cc58225bf0670375454/like-this-world-but. Accessed: 2019-09-25.
  • Armstrong (2017b) Armstrong, S. 2017b. True understanding comes from passing exams. https://www.alignmentforum.org/posts/5bd75cc58225bf067037534a/true-understanding-comes-from-passing-exams. Accessed: 2019-09-25.
  • Christiano et al. (2017) Christiano, P. F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 4299–4307.
  • Christiano, Shlegeris, and Amodei (2018) Christiano, P.; Shlegeris, B.; and Amodei, D. 2018. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575.
  • Dung (1995) Dung, P. M. 1995. On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artificial intelligence 77(2):321–357.
  • Irving, Christiano, and Amodei (2018) Irving, G.; Christiano, P.; and Amodei, D. 2018. AI safety via debate. arXiv preprint arXiv:1805.00899.
  • Japaridze (2009) Japaridze, G. 2009. In the beginning was game semantics? In Games: Unifying logic, language, and philosophy. Springer. 249–350.
  • Kovařík (2019) Kovařík, V. 2019. AI safety debate and its applications. https://www.lesswrong.com/posts/5Kv2qNfRyXXihNrx2/ai-safety-debate-and-its-applications. Accessed: 2019-09-25.
  • Osborne and Rubinstein (1994) Osborne, M. J., and Rubinstein, A. 1994. A course in game theory. MIT press.
  • Ought (2018) Ought. 2018. Factored cognition. https://ought.org/research/factored-cognition. Accessed: 2019-09-25.
  • Prakken (2006) Prakken, H. 2006. Formal systems for persuasion dialogue. The knowledge engineering review.

  • Shoham and Leyton-Brown (2008) Shoham, Y., and Leyton-Brown, K. 2008. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press.
  • Silver et al. (2017) Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of go without human knowledge. Nature 550(7676):354.
  • Telgársky (1987) Telgársky, R. 1987. Topological games: on the 50th anniversary of the banach-mazur game. The Rocky Mountain Journal of Mathematics 17(2):227–276.
  • Xu and Lieberherr (2019) Xu, R., and Lieberherr, K. 2019. Learning self-game-play agents for combinatorial optimization problems. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2276–2278. International Foundation for Autonomous Agents and Multiagent Systems.

Appendix A Proofs

We now give the full proofs of the results from the main text.

Proof of Lemma 8.

Fix , , , and , and denote . By definition of , the utility is non-decreasing in when (and in turn, is non-increasing in ). It follows that an optimal argumentation strategy is for player 1 to maximize and for player 2 to minimize it (and vice versa when ). When calculating utilities, we can, therefore, assume that the final belief is either or , depending on whether the player who argues second gave the higher or lower answer.

Suppose the players gave answers , and the one arguing for goes second. In this scenario, denote by the utility this player receives if both argue optimally. By applying the above observation (about optimal argumentation strategies) in all possible relative positions of , , and , we deduce that


Denote by the expected utility of player 1 if answers and are given (by players 1 and 2) and the players argue optimally (the expectation is w.r.t. the randomized argumentation order). Using the formula for , we get

It follows that, independently of what strategy the opponent uses, it is always beneficial to give answers from within (and that within this interval, all answers are equally good in expectation). (Footnote 17: Incidentally, a more interesting behavior arises if the order of argumentation is fixed; in such a case, player 2 “flips a coin” between and and player 1 picks any of the strategies with and .)

Proof of Proposition 10.

For the worst-case part of the negative result, suppose that is a uniform distribution over and . Define as if and otherwise. In , we then have if and otherwise. In particular, we have , since the minimizing player can always select from . Since , the result follows from Lemma 8.

For the “in expectation” variant of the negative result, suppose that is a uniform distribution over (where each is extended by an infinite sequence of zeros) and is defined as . Recall that the value of will be 1 if the total number of -s with is odd and when the number is even. Until all of the features of any are revealed, there will be a prior probability of an odd number of them having value and prior probability of an even number of them having value . As a result, we have unless . By not revealing any of these features in , either of the players can thus achieve . It follows that , the only optimal strategy is to give the answer , and — since takes on only values and – the expected judge error is .
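The key step of this argument, that the expected parity stays at one half until every feature is revealed, can be checked by brute force. The sketch below is a finite illustration under stated assumptions: it truncates the world to `n` binary features with a uniform prior, and the function name `expected_parity` is hypothetical.

```python
from itertools import product

def expected_parity(n, revealed):
    """Expected parity of n uniform binary features given partial evidence.

    n:        number of binary features (a finite truncation, assumed here)
    revealed: dict mapping feature index -> revealed value (0 or 1)
    Returns the probability that the number of 1s is odd.
    """
    # Enumerate all worlds consistent with the revealed features.
    worlds = [
        w for w in product((0, 1), repeat=n)
        if all(w[i] == v for i, v in revealed.items())
    ]
    return sum(sum(w) % 2 for w in worlds) / len(worlds)
```

With `n = 3`, the belief is 0.5 with nothing revealed and still 0.5 after two features are revealed; it only moves (to 0 or 1) once the last feature is known. A debater who withholds a single feature can therefore keep the judge at maximal uncertainty.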

Our counterexample function was discontinuous. However, the same result could be achieved by using together with the uniform prior over — this yields a function that is continuous over its (discrete) domain. Alternatively, we could use the uniform prior over together with the continuous approximation (for some large ) of . Similarly, we could find a continuous approximation of . This observation concludes the proof of the last part of the proposition. ∎

Proof of Proposition 13.

Let . For , denote by the set of worlds that would reveal the same bits as under the argument sequence . To show that the debate is -truth promoting in , it suffices, by Lemma 8, to show that

To get the second inequality, recall that is defined as the expected value of under the judge’s belief after rounds of debate, under the assumption that player 1 attempts to drive the expectation as low as possible and player 2 as high as possible. Suppose that the player who is the first to argue chooses his arguments , , , …in a way that minimizes the diameter of , i.e. by following the sequence . A simple calculation shows that after rounds, this necessarily gets the diameter of to or less (depending on the opponent’s actions ). Since is -Lipschitz, we get that on . Since the optimal strategy of minimizing can perform no worse than this, we get .

We derive the first inequality analogously. To finish the proof, we need to rewrite the inequality in terms of . We have . Since we have for each , we get , which concludes the proof. ∎