
Forecasting Argumentation Frameworks

We introduce Forecasting Argumentation Frameworks (FAFs), a novel argumentation-based methodology for forecasting informed by recent judgmental forecasting research. FAFs comprise update frameworks which empower (human or artificial) agents to argue over time about the probability of outcomes, e.g. the winner of a political election or a fluctuation in inflation rates, whilst flagging perceived irrationality in the agents' behaviour with a view to improving their forecasting accuracy. FAFs include five argument types, amounting to standard pro/con arguments, as in bipolar argumentation, as well as novel proposal arguments and increase/decrease amendment arguments. We adapt an existing gradual semantics for bipolar argumentation to determine the aggregated dialectical strength of proposal arguments and define irrational behaviour. We then give a simple aggregation function which produces a final group forecast from rational agents' individual forecasts. We identify and study properties of FAFs and conduct an empirical evaluation which signals FAFs' potential to increase the forecasting accuracy of participants.





1 Introduction

Historically, humans have performed inconsistently in judgemental forecasting [Makridakis2010, TetlockExp2017], which incorporates subjective opinion and probability estimates into predictions [Lawrence2006]. Yet, human judgement remains essential in cases where pure statistical methods are not applicable, e.g. where historical data alone is insufficient or for one-off, more ‘unknowable’ events [Petropoulos2016, Arvan2019, deBaets2020]. Judgemental forecasting is widely relied upon for decision-making [Nikolopoulos2021], in myriad fields from epidemiology to national security [Nikolopoulos2015, Litsiou2019]. Effective tools to help humans improve their predictive capabilities thus have enormous potential for impact. Two recent global events – the COVID-19 pandemic and the US withdrawal from Afghanistan – underscore this by highlighting the human and financial cost of predictive deficiency. A multi-purpose system which could improve our ability to predict the incidence and impact of events by as little as 5% could save millions of lives and be worth trillions of dollars per year [TetlockGard2016].

Research on judgemental forecasting (see [Lawrence2006, Zellner2021] for overviews), including the recent, groundbreaking ‘Superforecasting Experiment’ [TetlockGard2016], is instructive in establishing the desired properties for systems for supporting forecasting. In addition to reaffirming the importance of fine-grained probabilistic reasoning [Mellers2015], this literature points to the benefits of some group techniques versus solo forecasting [Landeta2011, Tetlock2014art], of synthesising qualitative and quantitative information [Lawrence2006], of combating agents’ irrationality [Chang2016] and of high agent engagement with the forecasting challenge, e.g. robust debating [Landeta2011] and frequent prediction updates [Mellers2015].

Meanwhile, computational argumentation (see [AImagazine17, handbook] for recent overviews) is a field of AI which involves reasoning with uncertainty and resolving conflicting information, e.g. in natural language debates. As such, it is an ideal candidate for aggregating the broad, polymorphous set of information involved in judgemental group forecasting. An extensive and growing literature is based on various argumentation frameworks – rule-based systems for aggregating, representing and evaluating sets of arguments – such as those applied in the contexts of scheduling [Cyras_19], fact checking [Kotonya_20] or various instances of explainable AI [Cyras_21]. Subsets of the requirements for forecasting systems are addressed by individual formalisms, e.g. probabilistic argumentation [Dung2010, Thimm2012, Hunter2013, Fazzinga2018] may effectively represent and analyse uncertain arguments about the future. However, we posit that a purpose-built argumentation framework for forecasting is essential to effectively utilise argumentation’s reasoning capabilities in this context.

Figure 1: The step-by-step process of a FAF over its lifetime.

In this paper, we attempt to cross-fertilise these two as yet unconnected academic areas. We draw from the forecasting literature to inform the design of a new computational argumentation approach: Forecasting Argumentation Frameworks (FAFs). FAFs empower (human and artificial) agents to structure debates in real time and to deliver argumentation-based forecasting. They offer an approach in the spirit of deliberative democracy [Bessette1980] to respond to a forecasting problem over time. The steps which underpin FAFs are depicted in Figure 1 (referenced throughout) and can be described in simple terms as follows: a FAF is initialised with a time limit (for the overall forecasting process and for each iteration therein) and a pre-agreed ‘base-rate’ forecast (Stage 1), e.g. based on historical data. Then, the forecast is revised by one or more (non-concurrent) debates, in the form of ‘update frameworks’ (Stage 2), opened and resolved by participating agents (until the specified time limit is reached). Each update framework begins with a proposed revision to the current forecast (Stage 2a), and proceeds with a cycle of argumentation (Stage 2b) about the proposed forecast, voting on said argumentation and forecasting. Forecasts deemed ‘irrational’ in light of agents’ argumentation and voting are blocked. Finally, the rational forecasts are aggregated and the result replaces the current group forecast (Stage 2c). This process may be repeated over time in an indefinite number of update frameworks (thus continually revising the group forecast) until the (overall) time limit is reached. The composite nature of this process enables the appraisal of new information relevant to the forecasting question as and when it arrives. Rather than confronting an unbounded forecasting question with a diffuse set of possible debates open at once, all agents concentrate their argumentation on a single topic (a proposal) at any given time.

After giving the necessary background on forecasting and argumentation (§2), we formalise our update frameworks (Step 2a) (§3). We then give our notion of rationality (Step 2b), along with our new method for aggregating rational forecasts (Step 2c) from a group of agents, and define FAFs overall (§4). We explore the underlying properties of FAFs (§5), before describing an experiment with a prototype implementing our approach (§6). Finally, we conclude and suggest potentially fruitful avenues for future work (§7).

2 Background

2.1 Forecasting

Studies on the efficacy of judgemental forecasting have shown mixed results [Makridakis2010, TetlockExp2017, Goodwin2019]. Limitations of the judgemental approach are a result of well-documented cognitive biases [Kahneman2012], irrationalities in human probabilistic reasoning which lead to distortion of forecasts. Manifold methodologies have been explored to improve judgemental forecasting accuracy, with varying success [Lawrence2006]. These methodologies include, but are not limited to, prediction intervals [Lawrence1989], decomposition [MacGregorDonaldG1994Jdwd], structured analogies [Green2007, Nikolopoulos2015] and unaided judgement [Litsiou2019]. Various group forecasting techniques have also been explored [Linstone1975, Delbecq1986, Landeta2011], although the risks of groupthink [McNees1987] and the importance of maintaining the independence of each group member’s individual forecast are well established [Armstrong2001]. Recent advances in the field have been led by Tetlock and Mellers’ superforecasting experiment [TetlockGard2016], which leveraged geopolitical forecasting tournaments and a base of 5000 volunteer forecasters to identify individuals with consistently exceptional accuracy (top 2%). The experiment’s findings underline the effectiveness of group forecasting orientated around debating [Tetlock2014art], and demonstrate a specific cognitive-intellectual approach conducive to forecasting [Mellers20151, Mellers2015], but stop short of suggesting a concrete universal methodology for higher accuracy. Instead, Tetlock draws on his own work and previous literature to crystallise a broad set of methodological principles by which superforecasters abide [TetlockGard2016, pg.144]:

  • Pragmatic: not wedded to any idea or agenda;

  • Analytical: capable of stepping back from the tip-of-your-nose perspective and considering other views;

  • Dragonfly-eyed: value diverse views and synthesise them into their own;

  • Probabilistic: judge using many grades of maybe;

  • Thoughtful updaters: when facts change, they change their minds;

  • Good intuitive psychologists: aware of the value of checking thinking for cognitive and emotional biases.

Research subsequent to the superforecasting experiment has explored optimal forecasting tournament preparation [penn_global_2021, Katsagounos2021] and extended Tetlock and Mellers’ approach to answer broader, more time-distant questions [georgetown]. It should be noted that there have been no recent advances on computational toolkits for the field similar to the one proposed in this paper.

2.2 Computational Argumentation

We posit that existing argumentation formalisms are not well suited for the aforementioned future-based arguments, which are necessarily semantically and structurally different from arguments about present or past concerns. Specifically, forecasting arguments are inherently probabilistic and must deal with the passage of time and its implications for the outcomes at hand. Further, several other important characteristics can be drawn from the forecasting literature which render current argumentation formalisms unsuitable, e.g. the paramountcy of dealing with bias (cognitive and in data), forming granular conclusions, fostering group debate and the co-occurrence of qualitative and quantitative arguing.

Nonetheless, several of these characteristics have been previously explored in argumentation and our formalisation draws from several existing approaches. First and foremost, it draws in spirit from abstract argumentation frameworks (AAFs) [Dung1995], in that the arguments’ inner contents are ignored and the focus is on the relationships between arguments. However, we consider arguments of different types and an additional relation of support (pro), rather than attack (con) alone as in [Dung1995]. Past work has also introduced probabilistic constraints into argumentation frameworks. Probabilistic AAFs (prAAFs) propose two divergent ways for modelling uncertainty in abstract argumentation using probabilities - the constellation approach [Dung2010, Li2012] and the epistemic approach [Hunter2013, Hunter2014, Hunter2020]. These formalisations use probability as a means to assess uncertainty over the validity of arguments (epistemic) or graph topology (constellation), but do not enable reasoning with or about probability, which is fundamental in forecasting. In exploring temporality, [Cobo2010] augment AAFs by providing each argument with a limited lifetime. Temporal constraints have been extended in [Cobo2012] and [Baron2014]. Elsewhere, [Rago2017] have used argumentation to model irrationality or bias in agents. Finally, a wide range of gradual evaluation methods have gone beyond traditional qualitative semantics by measuring arguments’ acceptability on a scale (normally [0,1]) [Leite2011, Evripidou2012, Amgoud2017, Amgoud2018, Amgoud2016]. Many of these approaches have been unified as Quantitative Bipolar Argumentation Frameworks (QBAFs) in [Baroni2018].

Amongst existing approaches, of special relevance in this paper are Quantitative Argumentation Debate (QuAD) frameworks [Baroni2015], i.e. 5-tuples ⟨A, C, P, R, BS⟩ where A is a finite set of answer arguments (to implicit issues); C is a finite set of con arguments; P is a finite set of pro arguments; A, C and P are pairwise disjoint; R ⊆ (C ∪ P) × (A ∪ C ∪ P) is an acyclic binary relation; BS : A ∪ C ∪ P → [0, 1] is a total function: BS(a) is the base score of a. Here, attackers and supporters of arguments are determined by the con and pro arguments they are in relation with. Formally, for any a ∈ A ∪ C ∪ P, the set of con (pro) arguments of a is R⁻(a) = {b ∈ C | (b, a) ∈ R} (R⁺(a) = {b ∈ P | (b, a) ∈ R}, resp.). Arguments in QuAD frameworks are scored by the Discontinuity-Free QuAD (DF-QuAD) algorithm [Rago2016], using the argument’s intrinsic base score and the strengths of its pro/con arguments. Given that DF-QuAD is used to define our method (see Def. 4), for completeness we define it formally here. DF-QuAD’s strength aggregation function Σ : [0, 1]* → [0, 1] is defined, for S = (v₁, …, vₙ) ∈ [0, 1]*, as follows: if n = 0, Σ(S) = 0; if n = 1, Σ(S) = v₁; if n = 2, Σ(S) = f(v₁, v₂); if n > 2, Σ(S) = f(Σ(v₁, …, vₙ₋₁), vₙ); with the base function f defined, for v₁, v₂ ∈ [0, 1], as f(v₁, v₂) = v₁ + v₂ − v₁ · v₂. After separate aggregation of the argument’s pro/con descendants, the combination function c combines the aggregated strength of the con arguments (vₐ) and the aggregated strength of the pro arguments (vₛ) with the argument’s base score (v₀): c(v₀, vₐ, vₛ) = v₀ − v₀ · |vₛ − vₐ| if vₐ ≥ vₛ, and c(v₀, vₐ, vₛ) = v₀ + (1 − v₀) · |vₛ − vₐ| if vₐ < vₛ. The inputs for the combination function are provided by the score function SF, which gives the argument’s strength, as follows: for any a ∈ A ∪ C ∪ P, SF(a) = c(BS(a), Σ(SF(R⁻(a))), Σ(SF(R⁺(a)))), where, if (a₁, …, aₙ) is an arbitrary permutation of the n con arguments in R⁻(a), SF(R⁻(a)) = (SF(a₁), …, SF(aₙ)) (similarly for pro arguments). Note that the DF-QuAD notion of score can be applied to any argumentation framework where arguments are equipped with base scores and pro/con arguments. We will do so later, for our novel formalism.
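To make the DF-QuAD computation concrete, the following is a minimal Python sketch of the strength aggregation, combination and score functions described above; the argument names and the small example graph are illustrative, not taken from the paper.

```python
def aggregate(scores):
    """DF-QuAD strength aggregation: fold the base function
    f(v1, v2) = v1 + v2 - v1*v2 over a sequence of scores in [0, 1]."""
    result = 0.0
    for v in scores:
        result = result + v - result * v
    return result

def combine(base, con, pro):
    """DF-QuAD combination: move the base score towards 0 or 1
    depending on which side (con or pro) is aggregately stronger."""
    if con >= pro:
        return base - base * (con - pro)
    return base + (1 - base) * (pro - con)

def score(arg, base, attackers, supporters):
    """Recursive DF-QuAD score over an acyclic graph. `base` maps arguments
    to base scores; `attackers`/`supporters` map each argument to the
    con/pro arguments targeting it."""
    con = aggregate(score(x, base, attackers, supporters)
                    for x in attackers.get(arg, []))
    pro = aggregate(score(x, base, attackers, supporters)
                    for x in supporters.get(arg, []))
    return combine(base[arg], con, pro)

# Illustrative example: an amendment argument "dec" (neutral base score 0.5)
# attacked by a con argument with base score 0.8 and supported by a pro
# argument with base score 0.6.
base = {"dec": 0.5, "c1": 0.8, "p1": 0.6}
attackers = {"dec": ["c1"]}
supporters = {"dec": ["p1"]}
print(score("dec", base, attackers, supporters))  # 0.5 - 0.5*(0.8-0.6) ≈ 0.4
```

Note how the fold in `aggregate` reproduces the case analysis in the definition: an empty sequence yields 0, a singleton yields its own value, and longer sequences apply `f` pairwise.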

3 Update Frameworks

We begin by defining the individual components of our frameworks, starting with the fundamental notion of a forecast. This is a probability estimate for the positive outcome of a given (binary) question.

Definition 1.

A forecast Pr(Q) ∈ [0, 1] is the probability of the positive outcome for a given forecasting question Q.

Example 1.

Consider the forecasting question Q: ‘Will the Tokyo 2020 Summer Olympics be cancelled/postponed to another year?’. Here, the positive outcome amounts to the Olympics being cancelled/postponed (and the negative outcome to it taking place in 2020 as planned). Then, a forecast may be Pr(Q) = 0.15, which amounts to a 15% probability of the Olympics being cancelled/postponed. Note that Pr(Q) may have been introduced as part of an update framework (described herein), or as an initial base rate at the outset of a FAF (Stage 1 in Figure 1).

In the remainder, we will often drop Q, implicitly assuming it is given, and use Pr to stand for Pr(Q).

In order to empower agents to reason about probabilities and thus support forecasting, we need, in addition to pro/con arguments as in QuAD frameworks, two new argument types:

  • proposal arguments, each about some forecast (and its underlying forecasting question); each proposal argument has a forecast and, optionally, some supporting evidence; and

  • amendment arguments, which suggest a modification to some forecast’s probability by increasing or decreasing it, and are accordingly separated into disjoint classes of increase and decrease (amendment) arguments. Note that we decline to include a third type of amendment argument for arguing that Pr is just right. This choice rests on the assumption that additional information always necessitates a change to Pr, however granular that change may be. This does not restrict individual agents arguing about Pr from casting Pr as their own final forecast. However, rather than cohering their argumentation around Pr, which we hypothesise would lead to a high risk of groupthink [McNees1987], agents are compelled to consider the impact of their amendment arguments on this more granular level.

Note that amendment arguments are introduced specifically for arguing about a proposal argument, given that traditional QuAD pro/con arguments are of limited use when the goal is to judge the acceptability of a probability, and that in forecasting agents must not only decide if they agree/disagree but also how they agree/disagree (i.e. whether they believe the forecast is too low or too high considering, if available, the evidence). Amendment arguments, with their increase and decrease classes, provide for this.

Example 2.

A proposal argument in the Tokyo Olympics setting may comprise the forecast: ‘There is a 75% chance that the Olympics will be cancelled/postponed to another year’. It may also include evidence: ‘A new poll today shows that 80% of the Japanese public want the Olympics to be cancelled. The Japanese government is likely to buckle under this pressure.’ This argument may aim to prompt updating the earlier forecast in Example 1. A decrease amendment argument may be: ‘The International Olympic Committee and the Japanese government will ignore the views of the Japanese public’. An increase amendment argument may be: ‘Japan’s increasingly popular opposition parties will leverage this to make an even stronger case for cancellation’.

Intuitively, a proposal argument is the focal point of the argumentation. It typically suggests a new forecast to replace prior forecasts, argued on the basis of some new evidence (as in the earlier example). We will see that proposal arguments remain immutable through each debate (update framework), which takes place via amendment arguments and standard pro/con arguments. Note that, wrt QuAD argument types, proposal arguments replace issues and amendment arguments replace answers, in that the former drive the debates and the latter are the options up for debate. Note also that amendment arguments merely state a direction wrt Pr and do not contain any more information, such as how much to alter Pr by. We will see that the alteration can be determined by scoring amendment arguments.

Proposal and amendment arguments, alongside pro/con arguments, form part of our novel update frameworks (Stage 2 of Figure 1), defined as follows:

Definition 2.

An update framework is a nonad u = ⟨p, A, C, P, Rp, Ra, G, V, F⟩ such that:

p is a single proposal argument with forecast Pr and, optionally, evidence for this forecast;

A is a finite set of amendment arguments composed of disjoint subsets A↑ of increase arguments and A↓ of decrease arguments;

C is a finite set of con arguments;

P is a finite set of pro arguments;

the sets {p}, A, C and P are pairwise disjoint;

Rp ⊆ A × {p} is a directed acyclic binary relation between amendment arguments and the proposal argument (we may refer to this relation informally as ‘probabilistic’);

Ra ⊆ (C ∪ P) × (A ∪ C ∪ P) is a directed acyclic binary relation from pro/con arguments to amendment/pro/con arguments (we may refer to this relation informally as ‘argumentative’);

G is a finite set of agents;

V : G × (C ∪ P) → [0, 1] is a total function such that V(g, x) is the vote of agent g ∈ G on argument x ∈ C ∪ P; with an abuse of notation, we let Vg : (C ∪ P) → [0, 1] represent the votes of a single agent g, e.g. Vg(x) = V(g, x);

F : G → [0, 1] is a total function such that F(g) is the forecast of agent g ∈ G.

Note that pro (con) arguments can be seen as supporting (attacking, resp.) other arguments via the ‘argumentative’ relation, as in the case of conventional QuAD frameworks [Baroni2015].
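As a rough illustration of the nine components above, an update framework can be collected in a simple container; all field names are illustrative, not the paper’s notation, and the amendment arguments are stored as their two disjoint subclasses.

```python
from dataclasses import dataclass, field

@dataclass
class UpdateFramework:
    """Illustrative container for the nine components of an update framework:
    the amendment arguments are one component, split into their two classes."""
    proposal: str                                  # the single proposal argument
    increase_args: frozenset                       # increase amendment arguments
    decrease_args: frozenset                       # decrease amendment arguments
    con_args: frozenset                            # con arguments
    pro_args: frozenset                            # pro arguments
    prob_relation: frozenset                       # amendment -> proposal edges
    arg_relation: frozenset                        # pro/con -> amendment/pro/con edges
    agents: frozenset                              # participating agents
    votes: dict = field(default_factory=dict)      # (agent, pro/con arg) -> [0, 1]
    forecasts: dict = field(default_factory=dict)  # agent -> forecast in [0, 1]
```

Only the `votes` and `forecasts` components change as a debate proceeds; the argument sets and relations grow when agents add new arguments.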

Example 3.

A possible update framework in our running setting may include the proposal argument from Example 2 as well as the amendment, pro and con arguments listed in Table 1, with the relations shown in Figure 2, which gives a graphical representation of these arguments and relations. The votes of a participating agent then assign a value in [0, 1] to each pro/con argument, and so on.

‘A new poll today shows that 80% of the Japanese public want the Olympics to be cancelled owing to COVID-19, and the Japanese government is likely to buckle under this pressure. Thus, there is a 75% chance that the Olympics will be cancelled/postponed to another year.’
‘The International Olympic Committee and the Japanese government will ignore the views of the Japanese public’.
‘This poll comes from an unreliable source.’
‘Japan’s increasingly popular opposition parties will leverage this to make an even stronger case for cancellation.’
‘The IOC is bluffing - people are dying, Japan is experiencing a strike. They will not go ahead with the games if there is a risk of mass death.’
‘The Japanese government may renege on its commitment to the IOC, and use legislative or immigration levers to block the event.’
‘Japan’s government has sustained a high-approval rating in the last year and is strong enough to ward off opposition attacks.’
‘This pollster has a track record of failure on Japanese domestic issues.’
‘Rising anti-government sentiment on Japanese Twitter indicates that voters may be receptive to such arguments.’
Table 1: Arguments in the update framework in Example 3.
Figure 2: A graphical representation of arguments and relations in the update framework from Example 3. Nodes represent the proposal, increase, decrease, pro and con arguments, while dashed/solid edges indicate, resp., the probabilistic/argumentative relations.

Several considerations about update frameworks are in order. Firstly, they represent ‘stratified’ debates: graphically, they can be represented as trees with the proposal argument as root, amendment arguments as children of the root, and pro/con arguments forming the lower layers, as shown in Figure 2. This tree structure serves to focus argumentation towards the proposal (i.e. the probability and, if available, evidence) it puts forward. Second, we have chosen to impose a ‘structure’ on proposal arguments, whereby their forecast is distinct from their (optional) evidence. Here the forecast has special primacy over the evidence, because forecasts are the vital reference point and the drivers of debates in FAFs. They are, accordingly, both mandatory and required to ‘stand out’ to participating agents. In the spirit of abstract argumentation [Dung1995], we nonetheless treat all arguments, including proposal arguments, as ‘abstract’, and focus on relations between them rather than between their components. In practice, therefore, amendment arguments may relate to a proposal argument’s forecast but also, if present, to its evidence. We opt for this abstract view on the assumption that the flexibility of this approach better suits judgemental forecasting, which has a diversity of use cases (e.g. including politics, economics and sport) where different argumentative approaches may be deployed (e.g. quantitative or qualitative, directly attacking amendment nodes or raising alternative points of view) and wherein forecasters may lack even a basic knowledge of argumentation. We leave the study of structured variants of our framework (e.g. see the overview in [structArg]) to future work: these may consider finer-grained representations of all arguments in terms of different components, and finer-grained notions of relations between components, rather than full arguments. Third, in update frameworks, voting is restricted to pro/con arguments.
Preventing agents from voting directly on amendment arguments mitigates the risk of arbitrary judgements: agents cannot make off-the-cuff estimations but can only express their beliefs via (pro/con) argumentation, thus ensuring a more rigorous process of appraisal for the proposal and amendment arguments. Note that, rather than facilitating voting on arguments from a two-valued perspective (i.e. positive/negative) or a three-valued perspective (i.e. positive/negative/neutral), the vote function allows agents to cast more granular judgements of (pro/con) argument acceptability, the need for which has been highlighted in the literature [Mellers2015]. Finally, although we envisage that arguments of all types are put forward by agents during debates, we do not capture this mapping in update frameworks. Thus, we do not capture who put forward which arguments, but instead only use votes to encode and understand agents’ views. This enables more nuanced reasoning and full engagement on the part of agents with alternative viewpoints (i.e. an agent may freely argue both for and against a point before taking an explicit view with their voting). Such conditions are essential in a healthy forecasting debate [Landeta2011, Mellers2015].

In the remainder of this paper, with an abuse of notation, we often use Pr to denote, specifically, the probability advocated in the proposal argument (e.g. 0.75 in Example 2).

4 Aggregating Rational Forecasts

In this section we formally introduce (in §4.1) our notion of rationality and discuss how it may be used to identify, and subsequently ‘block’, undesirable behaviour in forecasters. We then define (in §4.2) a method for calculating a revised forecast (Stage 2c of Figure 1), which aggregates the views of all agents in the update framework, whilst optimising their overall forecasting accuracy.

4.1 Rationality

Characterising an agent’s view as irrational offers opportunities to refine the accuracy of their forecast (and thus the overall aggregated group forecast). Our definition of rationality is inspired by, but goes beyond, that of QuAD-V [Rago2017], which was introduced for the e-polling context. Whilst update frameworks eventually produce a single aggregated forecast on the basis of group deliberation, each agent is first evaluated for their rationality on an individual basis. Thus, as in QuAD-V, in order to define rationality for individual agents, we first reduce frameworks to delegate frameworks for each agent, which are the restriction of update frameworks to a single agent.

Definition 3.

A delegate framework for an agent g is the restriction of an update framework to g: it retains all arguments and both relations, but only g’s votes and g’s forecast.

Note that all arguments in an update framework are included in each agent’s delegate framework, but only the agent’s votes and forecast are carried over.

Recognising the irrationality of an agent requires comparing the agent’s forecast against (an aggregation of) their opinions on the amendment arguments and, by extension, on the proposal argument. To this end, we evaluate the different parts of the update framework as follows. We use the DF-QuAD algorithm [Rago2016] to score each amendment argument for the agent, in the context of the pro/con arguments ‘linked’ to the amendment argument via the argumentative relation, within the agent’s delegate framework. We refer to the DF-QuAD score function as SF. This requires a choice of base scores for amendment arguments as well as pro/con arguments. We assume the same base score of 0.5 for all amendment arguments; in contrast, the base score of a pro/con argument is a result of the votes it received from the agent, in the spirit of QuAD-V [Rago2017]. The intuition behind assigning a neutral (0.5) base score to amendment arguments is that an agent’s estimation of their strength from the outset would be susceptible to bias and inaccuracy. Once each amendment argument has been scored (using SF) for the agent, we aggregate the scores of all amendment arguments (for the same agent) to calculate the agent’s confidence score in the proposal argument (which underpins our rationality constraints), by weighing the mean average strength of the increase amendment arguments against that of the decrease amendment arguments:

Definition 4.

Given a delegate framework for an agent g, let inc be the mean average of the scores SF of g’s increase arguments and dec the mean average of the scores SF of g’s decrease arguments (each taken to be 0 if the corresponding class of amendment arguments is empty). Then, g’s confidence score is C(g) = inc − dec.

Note that C(g) ∈ [−1, 1], which denotes the overall views of the agent on the forecast Pr (i.e. as to whether it should be increased or decreased, and how far). A negative (positive) C(g) indicates that the agent believes that Pr should be amended down (up, resp.). The size of |C(g)| reflects the degree of the agent’s certainty in either direction. In turn, we can constrain an agent’s forecast if it contradicts this belief as follows.
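Under one plausible reading of the confidence score (mean strength of the increase arguments minus mean strength of the decrease arguments, with an empty class contributing 0), the computation can be sketched as follows; the inputs are the DF-QuAD scores of the agent’s amendment arguments.

```python
def confidence(increase_scores, decrease_scores):
    """Confidence score in [-1, 1]: mean strength of increase amendment
    arguments minus mean strength of decrease ones (empty classes count as 0)."""
    inc = sum(increase_scores) / len(increase_scores) if increase_scores else 0.0
    dec = sum(decrease_scores) / len(decrease_scores) if decrease_scores else 0.0
    return inc - dec

# An agent whose lone increase argument scores 0.2 and whose decrease
# arguments score 0.6 and 0.8 leans towards decreasing Pr:
print(confidence([0.2], [0.6, 0.8]))  # 0.2 - 0.7 ≈ -0.5
```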

Definition 5.

Given a delegate framework for an agent g with confidence score C(g), g’s forecast F(g) is strictly rational (wrt the delegate framework) iff:

  • if C(g) < 0 then F(g) < Pr;

  • if C(g) > 0 then F(g) > Pr;

  • |F(g) − Pr| / Pr ≤ |C(g)|.

Hereafter, we refer to forecasts which violate the first two constraints as, resp., irrational increase and irrational decrease forecasts, and to forecasts which violate the final constraint as irrational scale forecasts.

This definition of rationality preserves the integrity of the group forecast in two ways. First, it prevents agents from forecasting against their beliefs: an agent cannot increase Pr if C(g) < 0 and an agent cannot decrease Pr if C(g) > 0. Further, it ensures that agents cannot make forecasts disproportionate to their confidence score: how far an agent deviates from the proposed forecast is restricted by |C(g)|, i.e. an agent must have a |C(g)| greater than or equal to the relative change to Pr denoted in their forecast F(g). Note that the irrational scale constraint deals with just one direction of proportionality (i.e. providing only a maximum threshold for F(g)’s deviation from Pr, but no minimum threshold). Here, we avoid bidirectional proportionality on the grounds that such a constraint would impose an arbitrary notion of arguments’ ‘impact’ on agents. An agent may have a very high C(g), indicating their belief that Pr is too low, but may, we suggest, rationally choose to increase Pr by only a small amount (e.g. if, despite their general agreement with the arguments, they believe the overall issue at stake to be minor or of low impact on the overall forecasting question). Our definition of rationality, which relies on notions of argument strength derived from DF-QuAD, thus informs but does not wholly dictate agents’ forecasting, affording them considerable freedom. We leave alternative, stricter definitions of rationality, which may derive from probabilistic conceptions of argument strength, to future work.

Example 4.

Consider our running Tokyo Olympics example, with the same arguments and relations from Example 3 and an agent g with a confidence score C(g) = −0.5. From this we know that g believes that the forecast Pr suggested in the proposal argument should be decreased. Then, under our definition of rationality, g’s forecast is ‘rational’ if it decreases Pr by up to 50%.
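Reading the rationality constraints as above (the direction of the agent’s change must match the sign of their confidence score, and the relative change to the proposed probability must not exceed its magnitude), a rationality check might look like this sketch; the function name and the tolerance parameter are our own, not the paper’s.

```python
def is_strictly_rational(agent_forecast, proposal_forecast, conf, tol=1e-9):
    """Check an agent's forecast against three constraints: no increasing
    against a negative confidence score, no decreasing against a positive
    one, and a relative change to the proposal bounded by |conf|."""
    if conf < 0 and agent_forecast >= proposal_forecast:
        return False  # irrational increase
    if conf > 0 and agent_forecast <= proposal_forecast:
        return False  # irrational decrease
    relative_change = abs(agent_forecast - proposal_forecast) / proposal_forecast
    return relative_change <= abs(conf) + tol  # else irrational scale

# Example 4: conf = -0.5, proposal Pr = 0.75 -> any decrease of up to 50% is rational.
print(is_strictly_rational(0.5, 0.75, -0.5))   # True  (a 33% decrease)
print(is_strictly_rational(0.3, 0.75, -0.5))   # False (a 60% decrease: too large)
print(is_strictly_rational(0.8, 0.75, -0.5))   # False (increase against belief)
```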

If an agent’s forecast violates these rationality constraints then it is ‘blocked’ and the agent is prompted to return to the argumentation graph. From here, they may carry out one or more of the following actions to render their forecast rational:

a. Revise their forecast;

b. Revise their votes on arguments;

c. Add new arguments (and vote on them).

Whilst a) and b) occur on an agent-by-agent basis, confined to each delegate framework, c) affects the shared update framework and requires special consideration. Each time new arguments are added to the shared graph, every agent must vote on them, even if they have already made a rational forecast. In certain cases, after an agent has voted on a new argument, it is possible that their rational forecast is made irrational. In this instance, the agent must resolve their irrationality via the steps above. In this way, the update framework can be refined on an iterative basis until the graph is no longer being modified and all agents’ forecasts are rational. At this stage, the update framework has reached a stable state and the agents are collectively rational:

Definition 6.

Given an update framework u, its set of agents G is collectively rational (wrt u) iff every agent g ∈ G is individually rational (wrt the delegate framework for g).

When the set of agents is collectively rational, the update framework becomes immutable and the aggregation (defined next) produces a group forecast, which becomes the new Pr.

4.2 Aggregating Forecasts

After all the agents have made a rational forecast, an aggregation function is applied to produce one collective forecast. One advantage of forecasting debates vis-à-vis many other forms of debate is that a ground truth always exists – an event either happens or does not. This means that, over time and after enough FAF instantiations, data on the forecasting success of different agents can be amassed. In turn, the relative historical performance of forecasting agents can inform the aggregation of group forecasts. In update frameworks, a weighted aggregation function based on Brier Scoring [Brier1950] is used, such that more accurate forecasting agents have greater influence over the final forecast. Brier Scores are a widely used criterion to measure the accuracy of probabilistic predictions, effectively gauging the distance between a forecaster’s predictions and an outcome after it has(n’t) happened, as follows.

Definition 7.

Given an agent , a non-empty series of forecasts with corresponding actual outcomes (where is the number of forecasts has made in a non-empty sequence of as many update frameworks), ’s Brier Score is as follows:

where if , and 0 otherwise.

A Brier Score is effectively the mean squared error used to gauge forecasting accuracy: a low score indicates high accuracy and a high score indicates low accuracy. This can be used in the update framework's aggregation function via the weighted arithmetic mean as follows. Each Brier Score is inverted to ensure that more (less, resp.) accurate forecasters have higher (lower, resp.) weighted influence on the group forecast:
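Concretely, for binary outcomes the Brier Score of Definition 7 reduces to a mean squared error between probabilistic forecasts and 0/1 outcomes, which can be computed as:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts for an event
    and the actual binary outcomes (1 if it happened, 0 if it did not).
    Requires a non-empty series of forecasts, as in Definition 7."""
    pairs = list(zip(forecasts, outcomes))
    assert pairs, "at least one forecast is required"
    return sum((f - o) ** 2 for f, o in pairs) / len(pairs)
```

For instance, `brier_score([0.8, 0.3], [1, 0])` averages the squared errors 0.04 and 0.09, giving roughly 0.065; a perfect forecaster scores 0.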

Definition 8.

Given a set of agents , their corresponding set of Brier Scores and their forecasts , and letting, for , , the group forecast is as follows:

This group forecast could be ‘activated’ after a fixed number of debates (with the mean average used prior), when sufficient data has been collected on the accuracy of participating agents, or after a single debate, in the context of our general Forecasting Argumentation Frameworks:
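The weighted mean of Definition 8 can be sketched as follows. The exact inversion of Brier Scores is not reproduced in this excerpt, so the sketch assumes the natural choice of weight w = 1 − BS (an assumption, flagged in the comments), which gives totally inaccurate agents (BS = 1) zero influence, consistent with the dictatorship and pure-oligarchy behaviour discussed in §5.

```python
def aggregate(forecasts, brier_scores):
    """Weighted arithmetic mean of agents' forecasts, weighting each agent
    by an inverted Brier Score. ASSUMPTION: inversion is taken as 1 - BS,
    so a perfectly inaccurate agent (BS = 1) receives zero weight."""
    weights = {a: 1 - brier_scores[a] for a in forecasts}
    total = sum(weights.values())
    if total == 0:  # all agents totally inaccurate: fall back to plain mean
        return sum(forecasts.values()) / len(forecasts)
    return sum(weights[a] * forecasts[a] for a in forecasts) / total
```

Under this assumption, if bob and charlie both have a Brier Score of 1, alice's forecast alone determines the group forecast, mirroring the dictatorship property.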

Definition 9.

A Forecasting Argumentation Framework (FAF) is a triple ⟨⟩ such that:

is a forecast;

is a finite, non-empty sequence of update frameworks with  the forecast of the proposal argument in the first update framework in the sequence; the forecast of each subsequent update framework is the group forecast of the previous update framework’s agents’ forecasts;

is a preset time limit representing the lifetime of the FAF;

each agent’s forecast wrt the agent’s delegate framework drawn from each update framework is strictly rational.

Example 5.

Consider our running Tokyo Olympics example: the overall FAF may be composed of , update frameworks and time limit , where is the latest (and therefore the only open) update framework after, for example, four days.

The superforecasting literature explores a range of forecast aggregation algorithms with considerable success, including extremizing algorithms [Baron2014] and variations on logistic and Fourier regression [Cross2018]. These approaches aim to ensure that less certain or less accurate forecasts have less influence over the final aggregated forecast. We believe that FAFs apply a more intuitive algorithm, since much of the 'work' needed to bypass inaccurate and erroneous forecasting is expedited via argumentation.
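For contrast with FAFs' approach, a common form of extremizing transform from this line of work pushes an aggregate probability away from 0.5 by raising the odds to a power a > 1; the sketch below illustrates the general idea only, and is not a reconstruction of the specific algorithms in [Baron2014] or [Cross2018].

```python
def extremize(p, a=2.5):
    """Extremizing transform: maps probability p to p^a / (p^a + (1-p)^a).
    For a > 1, probabilities above 0.5 are pushed towards 1 and those
    below 0.5 towards 0; a = 1 leaves p unchanged. The exponent 2.5 is an
    illustrative choice, not a value from the cited papers."""
    return p ** a / (p ** a + (1 - p) ** a)
```

For example, an aggregate of 0.8 is extremized to roughly 0.97, while 0.5 is left unchanged.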

5 Properties

We now undertake a theoretical analysis of FAFs by considering mathematical properties they satisfy. Note that the properties of the DF-QuAD algorithm (see [Rago2016]) hold here (for the amendment and pro/con arguments). For brevity, we focus on novel properties unique to FAFs which relate to our new argument types. These properties concern aggregated group forecasts wrt a debate (update framework). They imply two broad and, we posit, desirable principles: balance and unequal representation. We assume for this section a generic update framework ⟨, , , , , , , , ⟩ with group forecast .


Balance.

The intuition for these properties is that differences between and correspond to imbalances between the increase and decrease amendment arguments.

The first result states that only differs from if is the dialectical target of amendment arguments.

Proposition 1.

If ( and ), then .


If and then , by Def. 4 and by Def. 5. Then, by Def. 8.

This simple proposition conveys an important property for forecasting: for an agent to put forward a different forecast, amendment arguments must have been introduced.

Example 6.

In the Olympics setting, the group of agents could only forecast higher or lower than the proposed forecast after the addition of at least one of the amendment arguments , or .

In the absence of increase (decrease) amendment arguments, if there are decrease (increase, resp.) amendment arguments, then is not higher (lower, resp.) than .

Proposition 2.

If and , then . If and , then .


If and then , by Def. 4 and then by Def. 5. Then, by Def. 8, . If and then , by Def. 4 and then by Def. 5. Then, by Def. 8, .

This proposition demonstrates that, if a decrease (increase) amendment argument has an effect on proposal arguments, it can only be as its name implies.

Example 7.

In the Olympics setting, the agents could not forecast higher than the proposed forecast if either of the decrease amendment arguments or had been added, but the increase argument had not. Likewise, the agents could not forecast lower than if had been added, but neither of or had.

If is lower (higher) than , there is at least one decrease (increase, resp.) argument.

Proposition 3.

If , then . If , then .


If then, by Def. 8, where , for which it holds from Def. 5 that . Then, irrespective of , . If then, by Def. 8, where , for which it holds from Def. 5 that . Then, irrespective of , .

We can see here that the only way an agent can decrease (increase) the forecast is by adding decrease (increase, resp.) arguments, ensuring the debate is structured as intended.

Example 8.

In the Olympics setting, the group of agents could only produce a group forecast lower than due to the presence of decrease amendment arguments or . Likewise, the group of agents could only produce a higher than due to the presence of .

Unequal representation.

FAFs exhibit instances of unequal representation in the final voting process. In formulating the following properties, we distinguish between two forms of unequal representation. First, dictatorship, where a single agent dictates the group forecast with no input from other agents. Second, pure oligarchy, where a group of agents dictates the group forecast with no input from agents outside the group. In the forecasting setting, these properties are desirable as they guarantee higher accuracy in the group forecast .

An agent with a forecasting record of some accuracy exercises dictatorship over the group forecast , if the rest of the participating agents have a record of total inaccuracy.

Proposition 4.

If has a Brier score and }, , then .


By Def. 8: if , then ; and if , then . Then, again by Def. 8, is weighted at 100% and is weighted at 0% so .

This proposition demonstrates how we will disregard agents with total inaccuracy, even in the extreme case where we allow one (more accurate) agent to dictate the forecast.

Example 9.

In the running example, if alice, bob and charlie have Brier scores of 0.5, 1 and 1, resp., bob’s and charlie’s forecasts have no impact on , whilst alice’s forecast becomes the group forecast .

A group of agents with a forecasting record of accuracy exercises pure oligarchy over if the rest of the agents all have a record of total inaccuracy.

Proposition 5.

Let where , and . Then, is weighted at and is weighted at 0%.


By Def. 8: if , then ; and if , then . Then, again by Def. 8, is weighted at and is weighted at .

This proposition extends the behaviour from Proposition 4 to the (more desirable) case where fewer agents have a record of total inaccuracy.

Example 10.

In the running example, if alice, bob and charlie have Brier scores of 1, 0.2 and 0.6, resp., alice’s forecast has no impact on , whilst bob and charlie’s aggregated forecast becomes the group forecast .

6 Evaluation

We conducted an experiment using a dataset obtained from the 'Superforecasting' project, Good Judgment Inc [GJInc], to simulate four past forecasting debates in FAFs. This dataset contained 1770 datapoints (698 'forecasts' and 1072 'comments') posted by 242 anonymised users with a range of expertise. The original debates had occurred on the publicly available group forecasting platform, the Good Judgment Open (GJO), providing a suitable baseline against which to compare FAFs' accuracy.

For the experiment, we used a prototype implementation of FAFs in the form of the publicly available web platform called Arg&Forecast (see [Irwin2022] for an introduction to the platform and an additional human experiment with FAFs). Python’s Gensim topic modelling library [rehurek2011gensim] was used to separate the datapoints for each debate into contextual-temporal groups that could form update frameworks. In each update framework the proposal forecast was set to the mean average of forecasts made in the update framework window and each argument appeared only once. Gensim was further used to simulate voting, matching users to specific arguments they (dis)approved of. Some 4,700 votes were then generated with a three-valued system (where votes were taken from {0,0.5,1}) to ensure consistency: if a user voiced approval for an argument in the debate time window, their vote for the corresponding argument(s) was set to 1; disapproval for an argument led to a vote of 0, and (in the most common case) if a user did not mention an argument at all, their vote for the corresponding argument(s) defaulted to 0.5.
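The three-valued vote generation described above can be sketched directly; the function and argument names below are illustrative, not from the implementation.

```python
def encode_votes(arguments, approvals, disapprovals):
    """Three-valued vote encoding for one user: 1 for an argument the user
    voiced approval of in the debate window, 0 for disapproval, and 0.5
    (the most common, neutral case) when the argument went unmentioned."""
    votes = {}
    for arg in arguments:
        if arg in approvals:
            votes[arg] = 1.0
        elif arg in disapprovals:
            votes[arg] = 0.0
        else:
            votes[arg] = 0.5
    return votes
```

For example, a user who approved of one argument and disapproved of another, while ignoring a third, yields votes of 1, 0 and 0.5 respectively.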

With the views of all participating users wrt the proposal argument encoded in each update framework’s votes, forecasts could then be simulated. If a forecast was irrational, violating any of the three constraints in Def. 5 (referred to in the following as increase, decrease and scale, resp.), it was blocked and, to mimic real life use, an automatic ‘follow up’ forecast was made. The ‘follow up’ forecast would be the closest possible prediction (to their original choice) a user could make whilst remaining ‘rational’.

Note that evaluation of the aggregation function described in §4.2 was outside this experiment, since the past forecasting accuracy of the dataset’s 242 anonymised users was unavailable. Instead, we used the mean average whilst adopting the GJO’s method for scoring the accuracy of a user and/or group over the lifetime of the question [roesch_2015]. This meant calculating a daily forecast and daily Brier score for each user, for every day of the question. After users made their first rational forecast, that forecast became their ‘daily forecast’ until it was updated with a new forecast. Average and range of daily Brier scores allowed reliable comparison between (individual and aggregated) performance of the GJO versus the FAF implementation.
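The daily-scoring procedure can be sketched as a forward-fill: a user's latest forecast stands as their 'daily forecast' until replaced, and each standing day is scored against the eventual outcome. This is a hedged sketch of GJO-style scoring as described above, not the scoring code used in the experiment.

```python
def daily_brier(forecast_events, n_days, outcome):
    """Forward-fill a user's forecasts into a daily series over the
    question's lifetime, then return the mean daily Brier score.

    forecast_events: list of (day, probability) pairs sorted by day;
                     the first pair marks the user's first rational forecast.
    n_days:          lifetime of the question in days.
    outcome:         1 if the event happened, 0 otherwise.
    Days before the user's first forecast are not scored."""
    daily_errors = []
    current = None
    events = iter(forecast_events)
    nxt = next(events, None)
    for day in range(n_days):
        while nxt is not None and nxt[0] <= day:
            current = nxt[1]      # forecast updated on this day
            nxt = next(events, None)
        if current is not None:
            daily_errors.append((current - outcome) ** 2)
    return sum(daily_errors) / len(daily_errors) if daily_errors else None
```

A user forecasting 0.6 on day 0 and updating to 0.9 on day 2 of a four-day question that resolves positively is scored on the series 0.6, 0.6, 0.9, 0.9.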

Q     Group              Min         Max
Q1    0.1013 (0.1187)    0.0214 (0)  0.4054 (1)
Q2    0.216 (0.1741)     0 (0)       0.3853 (1)
Q3    0.01206 (0.0227)   0.0003 (0)  0.0942 (0.8281)
Q4    0.5263 (0.5518)    0 (0)       0.71 (1)
All   0.2039 (0.217)     0 (0)       1 (1)
Table 2: The accuracy of the platform group versus control (control values in parentheses), where 'Group' is the aggregated (mean) Brier score, 'Min' is the lowest individual Brier score and 'Max' is the highest individual Brier score. Q1-Q4 indicate the four simulated debates.
Q     Conf.     Forecasts   Irrational Forecasts
                            Increase   Decrease   Scale
Q1    -0.0418   366         63         101        170
Q2    0.1827    84          11         15         34
Q3    -0.4393   164         53         0          86
Q4    0.3664    84          4          19         15
All   -0.0891   698         131        135        305
Table 3: Auxiliary results from the experiment, where 'Conf.' is the average confidence score, 'Forecasts' is the number of forecasts made in each question and 'Irrational Forecasts' is the number in each question which violated each constraint in §4.1.


As Table 2 demonstrates, simulating forecasting debates from GJO in Arg&Forecast led to predictive accuracy improvements in three of the four debates. This is reflected in these debates by a substantial reduction in Brier scores versus control. The greatest accuracy improvement in absolute terms was in Q4, which saw a Brier score decrease of 0.0255. In relative terms, Brier score decreases ranged from 5% (Q4) to 47% (Q3). The average Brier score decrease was 33%, representing a significant improvement in forecasting accuracy across the board. Table 3 demonstrates how our rationality constraints drove this improvement: 82% of forecasts made across the four debates were classified as irrational and subsequently moderated with a rational 'follow up' forecast. Notably, there were more irrational scale forecasts than irrational increase and irrational decrease forecasts combined. These results demonstrate how argumentation-based rationality constraints can play an active role in facilitating higher forecasting accuracy, signalling the early promise of FAFs.

7 Conclusions

We have introduced the Forecasting Argumentation Framework (FAF), a multi-agent argumentation framework which supports forecasting debates and probability estimates. FAFs are composite argumentation frameworks, comprised of multiple non-concurrent update frameworks which themselves depend on three new argument types and a novel definition of rationality for the forecasting context. Our theoretical and empirical evaluation demonstrates the potential of FAFs, namely in increasing forecasting accuracy, holding intuitive properties, identifying irrational behaviour and driving higher engagement with the forecasting question (more arguments and responses, and more forecasts in the user study). These strengths align with requirements set out by previous research in the field of judgmental forecasting.

There is a multitude of possible directions for future work. First, FAFs are equipped to deal only with two-valued outcomes but, given the prevalence of forecasting issues with multi-valued outcomes (e.g. 'Who will win the next UK election?'), expanding their capability would add value. Second, further work may focus on the rationality constraints, e.g. by introducing additional parameters to adjust their strictness, or by implementing alternative interpretations of rationality. Third, future work could explore constraining agents' argumentation. This could involve using past Brier scores to limit the quantity or strength of agents' arguments and also to give them greater leeway wrt the rationality constraints. Fourth, our method relies upon acyclic graphs: we believe that they are intuitive for users and note that all Good Judgment Open debates were acyclic; nonetheless, the inclusion of cyclic relations (e.g. to allow con arguments that attack each other) could expand the scope of the argumentative reasoning in FAFs. Finally, there is an immediate need for larger scale human experiments.


The authors would like to thank Prof. Anthony Hunter for his helpful contributions to discussions in the build up to this work. Special thanks, in addition, go to Prof. Philip E. Tetlock and the Good Judgment Project team for their warm cooperation and for providing datasets for the experiments. Finally, the authors would like to thank the anonymous reviewers and meta-reviewer for their suggestions, which led to a significantly improved paper.