Multi-Issue Bargaining With Deep Reinforcement Learning

02/18/2020 ∙ by Ho-Chun Herbert Chang, et al. ∙ 0

Negotiation is a process where agents aim to work through disputes and maximize their surplus. As the use of deep reinforcement learning in bargaining games is unexplored, this paper evaluates its ability to exploit, adapt, and cooperate to produce fair outcomes. Two actor-critic networks were trained for the bidding and acceptance strategy, against time-based agents, behavior-based agents, and through self-play. Gameplay against these agents reveals three key findings. 1) Neural agents learn to exploit time-based agents, achieving clear transitions in decision preference values. The Cauchy distribution emerges as suitable for sampling offers, due to its peaky center and heavy tails. The kurtosis and variance sensitivity of the probability distributions used for continuous control produce trade-offs in exploration and exploitation. 2) Neural agents demonstrate adaptive behavior against different combinations of concession, discount factors, and behavior-based strategies. 3) Most importantly, neural agents learn to cooperate with other behavior-based agents, in certain cases utilizing non-credible threats to force fairer results. This bears similarities with reputation-based strategies in the evolutionary dynamics, and departs from equilibria in classical game theory.






2.1 Overview of Negotiation

A negotiation setting contains a protocol, agents, and a scenario. The protocol determines the rules of how agents interact with each other. The scenario takes place in a negotiation domain, which determines an outcome space, denoted as $\Omega$. A negotiation domain can have a single issue or multiple issues. Issues refer to the resources under contention, such as the price of an object or level of service. Thus, an outcome can be described as a specific division of the issues. Agents have preference profiles, which determine the specific outcomes they prefer.

2.1.1 Protocols

We use single-issue bargaining as a preliminary illustration. Given a unit pie, two players $A$ and $B$ are asked to split it amongst themselves [fatima2013negotiation]. Suppose agents $A$ and $B$ negotiate for $n$ rounds to divide a unit pie, by alternately proposing outcomes called bids or offers, until a player accepts. We denote an offer $(x_A, x_B)$, such that $x_A + x_B = 1$.

This process of alternating offers is known as the Rubinstein Bargaining Protocol. A game with one round is known as an ultimatum game [rubinstein1982perfect]. In ultimatum games, $A$ makes the first and only proposal. $B$ can only accept or reject it, which means $A$ has all the power. Similarly, if there are two rounds, then Player $B$ has the advantage. In a game of repeated offers, it is necessary to introduce some form of discount factor $\delta$; otherwise, players would negotiate forever. The discount factor makes a portion of the pie go bad at every round. Thus, it is in the players' best interest to finish the game as soon as possible.

The Rubinstein Bargaining Protocol is widely used because it accurately simulates many real-world scenarios [rubinstein1982perfect]. Multi-issue bargaining is more complex, as multiple issues are under contention, and it requires further protocol restrictions that describe how each issue is resolved. Common ones are [kraus1997negotiation]:

  1. Package-deal Procedure: All issues addressed at once.

  2. Simultaneous Procedure: All issues are solved independently. It is equivalent to single-issue problems.

  3. Sequential Procedure: Negotiates one issue at a time, with a predetermined sequence. Cannot negotiate prior or future issues.

An alternative protocol is the monotonic concession protocol [rosenschein1994rules], where agents disclose information about how they value each issue, and their subsequent offers must have less utility than their prior ones. Other protocol considerations include [fatima2014principles]:

  1. Time Constraints: Beyond the discount factor $\delta$, there is often a deadline $T$. If negotiation does not end by $T$, players earn utility $0$ (known as the conflict deal).

  2. Divisibility: Issues may be atomic and discrete, or divisible and continuous.

  3. Lateral-ness: Whether negotiation is between two parties (bilateral) or with multiple parties (multilateral).

  4. Reserve Price: The minimum an agent is willing to accept.

2.1.2 The Scenario

The utility is defined as the cumulative utility: a combination of sub-utility functions. The most commonly used form is linear additivity. With the division $\mathbf{x}$ for Player A (PA) and $1 - \mathbf{x}$ for Player B (PB), the aggregate utility of PA is:

$$U_A(\mathbf{x}, t) = \delta^{t} \sum_{i=1}^{n} w_i^{A} x_i \tag{2.1}$$

where $w_i^{A}$ is the value (weight) PA ascribes to issue $i$, $\delta$ the discount rate, and $x_i$ the division for issue $i$. This can be viewed as the discounted dot product of weights $\mathbf{w}^{A}$ and issue division $\mathbf{x}$. In many cases, however, utilities are not linear in combination. For instance, in auctions of multiple items, combinations of items can yield greater rewards than the sum of the parts, due to synergistic effects. These are modeled with non-linear utility functions [ito2008multi].
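As a minimal sketch of the discounted dot product described above, the following Python function evaluates a linear additive utility. The multiplicative discount `discount ** t`, the function name, and the three-issue weights and offer are illustrative assumptions, not values taken from the experiments.

```python
def linear_additive_utility(weights, shares, discount, t):
    """Discounted dot product of issue weights and the shares an agent receives."""
    if len(weights) != len(shares):
        raise ValueError("weights and shares must cover the same issues")
    # Assumption: the discount is applied multiplicatively once per round.
    return (discount ** t) * sum(w * x for w, x in zip(weights, shares))


# Hypothetical three-issue example: PA receives `offer`, PB receives the remainder.
weights_a = [0.5, 0.3, 0.2]
weights_b = [0.2, 0.3, 0.5]
offer = [0.6, 0.4, 1.0]
remainder = [1.0 - x for x in offer]

u_a = linear_additive_utility(weights_a, offer, discount=0.9, t=2)
u_b = linear_additive_utility(weights_b, remainder, discount=0.9, t=2)
print(round(u_a, 4), round(u_b, 4))
```

Because the two players weight the issues differently, the same division can be worth more to one side than the other, which is what makes trade-offs across issues possible.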

The action space is defined by three possible actions: accept, reject, and offer. Offers are made after rejections, and should an agent choose to accept an offer, the negotiation ends. Each issue is often normalized such that $x_i \in [0, 1]$. For games with only one issue, the offer consists of the division of one pie. For multiple issues, offers are represented as vectors $\mathbf{x} = (x_1, \dots, x_n)$, subject to $x_i \in [0, 1]$ for every issue $i$. For this dissertation, the outcome space is assumed to be continuous, linear, and normalized.

2.1.3 Outcome Spaces

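Each player has a preference ordering, called the preference profile, on all possible outcomes. An outcome $\omega_1$ is weakly preferred to $\omega_2$ if $U(\omega_1) \geq U(\omega_2)$, which is denoted $\omega_1 \succeq \omega_2$. Similarly, $\omega_1$ is strictly preferred to $\omega_2$ (denoted $\omega_1 \succ \omega_2$) if $U(\omega_1) > U(\omega_2)$. For linear additive utilities, the preference profile can be inferred directly from the weights.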

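Now we present the metrics used to evaluate our three criteria. An outcome is called Pareto Optimal if there exists no outcome that a player would prefer without worsening their opponent's outcome. Formally, $\omega$ is Pareto Optimal if there is no outcome $\omega'$ in the outcome space such that

$$U_A(\omega') \geq U_A(\omega) \ \text{ and } \ U_B(\omega') \geq U_B(\omega),$$

with at least one inequality strict.

The Pareto Frontier describes all Pareto optimal solutions, which we denote as $\Omega_P$. When an offer is not Pareto Optimal, there is potential through negotiation to reach a better outcome without either player conceding anything.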

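There are two other useful metrics. Let $\Omega_P$ denote the set of outcomes that are Pareto optimal. The bid distribution denotes the mean distance of an agent's bids $B$ to the Pareto frontier, shown in Eq. 2.2. A high bid distribution indicates bids are on average far from the frontier.

$$D(B) = \frac{1}{|B|} \sum_{b \in B} \min_{\omega \in \Omega_P} d(b, \omega) \tag{2.2}$$

Here $d(b, \omega)$ is the distance between bid $b$ and outcome $\omega$ in the outcome space.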
Usually, simultaneous maximization of outcomes is not possible, as there is a region of disagreement between players. Another useful metric is the product of utilities ($U_A \cdot U_B$), known as the Nash Product. A fair outcome is often characterized using the Nash solution, the outcome that maximizes the product of utilities, shown in Eq. 2.3.

$$\omega^{*} = \arg\max_{\omega} U_A(\omega)\, U_B(\omega) \tag{2.3}$$
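The three metrics in this subsection (Pareto optimality, bid distribution, and the Nash solution) can be illustrated on a small finite set of outcomes. This is only a sketch: the outcome points are hypothetical utility pairs, and Euclidean distance in utility space is an assumed choice of distance.

```python
import math


def pareto_front(points):
    """Keep the (u_a, u_b) pairs that no other pair weakly dominates."""
    front = []
    for i, (ua, ub) in enumerate(points):
        dominated = any(
            va >= ua and vb >= ub and (va > ua or vb > ub)
            for j, (va, vb) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((ua, ub))
    return front


def bid_distribution(bids, front):
    """Mean distance from each bid to its nearest Pareto-optimal outcome."""
    nearest = [
        min(math.hypot(ua - fa, ub - fb) for fa, fb in front) for ua, ub in bids
    ]
    return sum(nearest) / len(nearest)


def nash_solution(points):
    """Outcome maximizing the product of the two players' utilities."""
    return max(points, key=lambda p: p[0] * p[1])


# Hypothetical outcomes expressed as (utility of A, utility of B).
outcomes = [(1.0, 0.0), (0.7, 0.7), (0.5, 0.8), (0.0, 1.0), (0.4, 0.4)]
front = pareto_front(outcomes)
spread = bid_distribution([(0.4, 0.4)], front)
fair = nash_solution(outcomes)
print(front, round(spread, 4), fair)
```

In the continuous setting used later, the frontier is a piece-wise linear curve rather than a finite list, so the minimum distance would be taken over line segments instead of points.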


2.2 Strategies

In cases of perfect information, it is possible to determine the optimal bidding strategy [fatima2014principles]. However, as previously mentioned, perfect information is unlikely in bargaining, as agents are unwilling to give away their preferences for fear of exploitation. This motivates the development of negotiation tactics under imperfect information. These negotiation tactics can broadly be classified as time-dependent or behavior-dependent tactics, based on a decision function that maps state to a target utility.

2.2.1 Baseline Strategies

Two baseline strategies are commonly used. The Hardliner always bids maximum utility for itself, which emulates the "take-it-or-leave-it" attitude. The Random Walker bids randomly, serving as a standard baseline.

2.2.2 Time-dependent Strategies

Time-dependent strategies denote functions that produce offers solely based on time. At every round, the agent calculates its decision utility, which determines whether it accepts an offer or not. For time-dependent agents, this is:

$$u(t) = P_{\min} + (P_{\max} - P_{\min})\,(1 - F(t)) \tag{2.4}$$

where $P_{\min}, P_{\max} \in [0, 1]$ and $P_{\min} \leq P_{\max}$, thus parametrizing the range of the offers. Frequently, $F(t)$ is parametrized as an exponential function:

$$F(t) = k + (1 - k) \left( \frac{t}{T} \right)^{1/e} \tag{2.5}$$

where $e$ is the concession factor. $k$ is often set to 0 for simplicity. Fig. 2.1 shows the decision utilities of different agents. If $e < 1$, then the agent concedes towards the end and is known as Boulware. Otherwise, if $e > 1$, the agent concedes quickly and offers its reservation value, thus it is known as a Conceder. $e = 1$ means the agent's decision utility decreases linearly.
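The Boulware, linear, and Conceder behaviors can be compared with a short sketch. It assumes the standard time-dependent form from the automated negotiation literature, $F(t) = k + (1-k)(t/T)^{1/e}$; the function and parameter names are hypothetical.

```python
def concession(t, deadline, e, k=0.0):
    """Fraction of the offer range conceded by round t (assumed standard form)."""
    return k + (1.0 - k) * (min(t, deadline) / deadline) ** (1.0 / e)


def decision_utility(t, deadline, e, p_min=0.0, p_max=1.0, k=0.0):
    """Target utility of a time-dependent agent at round t."""
    return p_min + (p_max - p_min) * (1.0 - concession(t, deadline, e, k))


# Halfway through a 20-round game, three agents with different concession factors.
halfway = {
    "boulware (e=0.2)": decision_utility(10, 20, e=0.2),
    "linear (e=1)": decision_utility(10, 20, e=1.0),
    "conceder (e=5)": decision_utility(10, 20, e=5.0),
}
for name, value in halfway.items():
    print(name, round(value, 4))
```

At the same point in the game, the Boulware agent still demands nearly everything while the Conceder has already given up most of its range.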

Figure 2.1: Decision utilities of time-based agents with different concession factors.

2.2.3 Behavior-based Strategies

Behavior-dependent and imitative bidding strategies observe the behavior of the opponent to decide what to offer and what to accept. The most well-known is tit-for-tat, which produces cooperation through reciprocity. Its three central mantras are 1) never defect first (play nice as long as the opponent plays nice), 2) retaliate if provoked, and 3) forgive after retaliation.

In negotiation, the relative tit-for-tat (TFT) strategy reciprocates by offering concessions proportional to the opponent's concessions from $r$ rounds prior:

$$x_j^{t+1} = \min\left( \max\left( \frac{y_j^{t-2r}}{y_j^{t-2r+2}}\, x_j^{t-1},\ \min\nolimits_j \right),\ \max\nolimits_j \right) \tag{2.6}$$

Here, $x_j^{t+1}$ is the offer for issue $j$. This value is determined by the ratio of the opponent's prior concessions $y_j$, which then scales the agent's own prior offer $x_j^{t-1}$. The min and max values ensure offer values are within range.
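A single-issue sketch of the relative tit-for-tat rule follows. It assumes every offer is expressed as the share allocated to the agent running the strategy, so that an opponent concession appears as a rising share; the function name and the guard for a zero offer are illustrative assumptions.

```python
def relative_tit_for_tat(own_prev, opp_earlier, opp_later, lo=0.0, hi=1.0):
    """Scale our previous demand by the ratio of two successive opponent offers.

    All values are the share allocated to this agent. If the opponent raised our
    share (opp_later > opp_earlier), the ratio is below 1 and we lower our demand.
    """
    if opp_later <= 0.0:
        # Assumption: with no usable ratio, repeat the previous demand.
        return min(max(own_prev, lo), hi)
    ratio = opp_earlier / opp_later
    return min(max(ratio * own_prev, lo), hi)


# Opponent raised our share from 0.20 to 0.25, so we concede proportionally.
conceding = relative_tit_for_tat(own_prev=0.9, opp_earlier=0.20, opp_later=0.25)
# Opponent halved our share, so we retaliate; the result is clipped to the range.
retaliating = relative_tit_for_tat(own_prev=0.9, opp_earlier=0.50, opp_later=0.25)
print(round(conceding, 4), retaliating)
```

For multiple issues the same rule would be applied issue by issue.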

2.3 State-of-the-Art in Negotiation

Machine learning methods in the domain of negotiation can be broadly separated into the following types: Bayesian learning, non-linear regression, kernel density estimation, and artificial neural networks. These methods have mostly been applied to model an opponent's (acceptance and bidding) strategy and then derive an analytic response. This is because if an agent knows the opponent's bidding strategy, then the agent can compute its optimal strategy [baarslag2016learning].

For estimating the opponent acceptance strategy, techniques can be divided by the individual variable being estimated. Zeng and Sycara provide a popular and intuitive Bayesian approach for estimating the reserve price, using historical data. The model generates a set of hypotheses on the opponent's reserve price, then attaches a likelihood to each using the history. The estimate is a weighted sum of the hypotheses based on their likelihoods [zeng1998bayesian]. This technique has been adapted to estimate the deadline for time-dependent tactics [sim2008blgan]. In general, acceptance strategy estimation uses some form of Bayesian learning [sycara1997benefits, yu2013adaptive, sim2007adaptive, gwak2010bayesian, ren2002learning], augmented with non-linear regression [agrawal2009learning, yu2013adaptive, sim2008blgan, haberland2012adaptive, hou2004predicting], kernel density estimates [farag2010towards, oshrat2009facing, coehoorn2004learning], polynomial interpolation, genetic algorithms [matwin1991genetic, jazayeriy2011learning], and more recently neural networks [fang2008opponent].

In contrast, neural methods have been applied much more aggressively to the bidding strategy [baarslag2016learning]. In simpler cases where the general bidding formula is known, regression is sufficient, as the problem reduces to parameter estimation. If no formula is known, then neural networks are employed to approximate the opponent's bid strategy, typically using a large database of bid history. Oprea [oprea2002adaptive] uses a time-series approach on single-issue negotiations, taking in only the opponent's current bid. By 2008, early efforts at opponent move prediction using neural networks had appeared [carbonneau2008predicting], focusing on predicting human bidding strategies. This was particularly relevant in e-commerce and supply chain management, as forecasting bids is useful in determining automated strategies [lee2009neural, carbonneau2011pairwise, moosmayer2013neural]. When the domain is general, researchers have found success using deep learning with multilayer perceptrons. Masvoula shows reliable predictions using single deep networks both with and without historical knowledge [masvoula2005design, masvoula2011predictive]. Papaioannou and Rau et al. have shown that the concession factor and weight of each issue can be predicted if the opponent is known to be time-dependent, using multilayer neural nets [papaioannou2008neural, rau2006learning] or single-layer radial basis function neural nets.


Reinforcement learning approaches to negotiation began as early as the late 20th century, often denoted as adaptive learning [rapoport1998reinforcement]. Today, DRL is more frequently used with natural language processing [georgila2011reinforcement, cuayahuitl2015strategic]. Lewis et al. implement an end-to-end DRL negotiation dialogue generator [lewis2017deal]. They curated a set of human-human dialogues with Mechanical Turk, then trained four gated recurrent units, a gated recurrent architecture closely related to long short-term memory networks [neubig2017neural]. However, this study focuses on emulating human language, with less concern for optimality. For instance, their DRL agent achieves 58.6% and 69.1% Pareto Optimality against simple autonomous agents and humans respectively, on a very limited, discrete action space (around 200 offers).

As illustrated by this brief survey, there is an immense number of agent designs for negotiation. The primary weakness of the best-performing models, such as Bayesian models for acceptance or bid prediction, is that they require specific domain assumptions and architectures. Another weakness is that, for these negotiators to perform well in populations of different strategies, an additional opponent classifier is needed, which introduces further uncertainty. Additionally, opponents can use more complex behavioral strategies and mixed strategies (pure strategies associated with a probability), which require higher levels of adaptability to play against.

All of this motivates an adaptive agent with a fixed architecture that can perform well against different opponents. An end-to-end negotiation agent is desirable, as the only required inputs are the offer, time step, and public knowledge, and it can adapt online during gameplay. Although deep learning often comes at the expense of explainability, a fixed architecture playing end-to-end means we do not need additional classifiers and assumptions about the opponent. The success of AlphaZero in chess is largely because it did not rely on hand-crafted heuristics and assumptions like other engines [silver2017mastering]; likewise, Libratus learned to exploit specific human opponent idiosyncrasies in poker [brown2018superhuman]. An end-to-end, adaptive neural agent is the analogous solution for negotiation. Conveniently, the negotiation domain also aligns with the current interest in continuous control within deep reinforcement learning.

3.1 Deep Reinforcement Learning

Multi-agent Reinforcement Learning is formally the study of n-agent stochastic games [shoham2003multi], described as a tuple $(N, S, \vec{A}, \vec{R}, T)$. $N$ is the number of agents. $S$ is the set of states and $\vec{A} = (A_1, \dots, A_N)$, with each $A_i$ the set of actions agent $i$ can take. In the most basic case, by treating the environment as static, the single-agent Q-learning algorithm developed by [watkins1992q] gives the optimal policy in an MDP with unknown reward and transition.

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma V(s') \right], \qquad V(s) = \max_{a} Q(s, a)$$

$Q(s, a)$ estimates the value of taking action $a$ when in state $s$, and $V(s)$ the value of the state by taking the best action; $\alpha$ is the learning rate, $\gamma$ the discount on future value, $r$ the reward, and $s'$ the next state. Extension of this paradigm to multiple agents is difficult. One approach is to treat the environment as passive, with each agent having its own reward and transition functions. However, this falsely assumes agent actions do not influence each other [sen1994learning]. Another approach is to define the value function over all agents' actions, but this introduces a dynamic programming challenge in updating $V$.
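For concreteness, a tabular version of the single-agent Q-learning update can be sketched as below. The state and action names are hypothetical, and a dictionary stands in for the Q-table; this illustrates only the update rule, not the method used in this work.

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step; unseen (state, action) pairs default to 0."""
    v_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = (1.0 - alpha) * old + alpha * (reward + gamma * v_next)
    return q[(state, action)]


actions = ["accept", "reject"]
q = {}
first = q_update(q, "round1", "accept", 1.0, "end", actions)
# Rejecting in round 0 leads to round 1, whose best action is now worth 0.1.
second = q_update(q, "round0", "reject", 0.0, "round1", actions)
print(first, second)
```

The second update shows value propagating backwards from a later state to an earlier one, which is the behavior that becomes hard to maintain when several learning agents change the environment at once.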

In recent years, reinforcement learning has been applied successfully in conjunction with deep learning, using deep neural networks to approximate value functions. A breakthrough comes from policy-gradient methods. Traditionally, RL algorithms are action-value methods: after learning values of the action, algorithms select actions based on the estimated action values. In contrast, policy-gradient methods learn a parametric policy without consulting the value function [sutton2018reinforcement]. By policy we mean an agent's strategy: what it does at a given state and time.

Additionally, in cases where the environment is dynamic, it may be optimal to acquire a stochastic policy— a probability distribution over possible actions. This distribution is updated to associate actions with higher expected rewards with higher probability values. Since probabilities can be over discrete or continuous action spaces, DRL is a useful control framework for negotiation, as the decision to accept or reject an offer is discrete, whereas bidding is on continuous space (on given issues).

3.1.1 Policy Gradients

Call the policy $\pi$ and let parameters $\theta$ define a probability distribution. The probability of action $a$ is denoted as $\pi(a \mid s, \theta) = \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\}$, that is, the probability of taking action $a$ at time $t$ given the state $s$ and parameters $\theta$. Similarly, a learned value function, such as a neural network used to approximate the value, can be represented as $\hat{v}(s, \mathbf{w})$, where $\mathbf{w}$ is its weights.

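As with action-value RL, policy parameters are optimized to maximize a scalar performance measure $J(\theta)$:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[ G_t \right], \qquad G_t = \sum_{k=t}^{T} r_k$$

which describes the expected future aggregate rewards (sum of rewards from $t$ until the end). The policy values are updated according to $J(\theta)$ through gradient ascent:

$$\theta_{t+1} = \theta_t + \alpha\, \widehat{\nabla J(\theta_t)}$$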
For discrete actions, actions are selected by estimating a numerical preference value or logit $h(s, a, \theta)$, based on the state, action, and parameter values (weights in a neural net). Actions are then selected using the softmax distribution:

$$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_{b} e^{h(s, b, \theta)}}$$

For instance, for the acceptance strategy, an agent can reject or stop. Associate these actions with logits $h(s, \text{reject}, \theta)$ and $h(s, \text{stop}, \theta)$ respectively, and a stochastic policy is defined.
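The two-action stochastic policy described above can be sketched with a softmax over a pair of logits. The logit values here are placeholders; in the trained agent they would come from the network.

```python
import math
import random


def softmax(logits):
    """Convert preference values into action probabilities."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(h - shift) for h in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(actions, logits, rng):
    """Draw one action according to the softmax probabilities."""
    return rng.choices(actions, weights=softmax(logits), k=1)[0]


actions = ["reject", "stop"]
logits = [0.5, -0.5]  # hypothetical preferences for the current state
probs = softmax(logits)
choice = sample_action(actions, logits, random.Random(0))
print([round(p, 4) for p in probs], choice)
```

Because the action is sampled rather than chosen greedily, the lower-preference action is still tried occasionally, which is the source of exploration in the acceptance strategy.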

However, updating the policy with respect to $J(\theta)$ requires the policy-gradient theorem, which provides guaranteed improvements when updating the policy parameters [sutton2018reinforcement]. The theorem states that the change in performance is proportional to the change in the policy, and a full statement is given in Appendix A.1. The theorem yields a canonical policy-gradient algorithm, REINFORCE [sutton2018reinforcement, willianms1988toward, sutton2000policy]. The parameter update is:

$$\theta_{t+1} = \theta_t + \alpha\, G_t\, \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} \tag{3.4}$$

where $G_t$ is the observed reward. Intuitively, the update is the reward multiplied by the gradient of the action probability divided by the action probability. If $G_t$ is high, this increases the chances of visiting that state in the future. Note, the policy gradient is often expressed as $\nabla \ln \pi(A_t \mid S_t, \theta_t)$, which yields the fraction through the chain rule.
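To make the REINFORCE update concrete, the sketch below applies one step to a softmax policy whose parameters are simply the two logits, a deliberate simplification of a neural network. For a softmax, the gradient of the log-probability of the chosen action with respect to logit $b$ is $\mathbb{1}[b = a] - \pi(b)$.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(h - shift) for h in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update: theta += lr * reward * grad log pi(action)."""
    probs = softmax(logits)
    grad_log_pi = [(1.0 if b == action else 0.0) - p for b, p in enumerate(probs)]
    return [h + lr * reward * g for h, g in zip(logits, grad_log_pi)]


logits = [0.0, 0.0]
before = softmax(logits)[0]
logits = reinforce_step(logits, action=0, reward=1.0)
after = softmax(logits)[0]
print(round(before, 4), round(after, 4))
```

A positive reward raises the probability of the action that earned it; a negative reward would lower it by the same mechanism.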

3.1.2 Deep Reinforcement Learning for continuous variables

Actor-critic models are also useful because they separate the policy space and action space, which means policy selection can occur on a continuous domain. For instance, in a univariate control problem, the choice of action can be sampled from a normal distribution. The policy approximation with a normal distribution is:

$$\pi(a \mid s, \theta) = \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} \exp\left( -\frac{(a - \mu(s, \theta))^2}{2\sigma(s, \theta)^2} \right)$$

During back-propagation, the parameter values are updated such that $\mu$ and $\sigma$ reflect a better reward, using Equation 3.4.
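A minimal sketch of a univariate Gaussian policy follows: it samples an action and evaluates the log-density that a policy-gradient update would use. The fixed `mu` and `sigma` values are placeholders for what a network would output.

```python
import math
import random


def gaussian_log_prob(a, mu, sigma):
    """Log-density of a normal distribution evaluated at action a."""
    return (
        -0.5 * ((a - mu) / sigma) ** 2
        - math.log(sigma)
        - 0.5 * math.log(2.0 * math.pi)
    )


def sample_gaussian_action(mu, sigma, rng):
    """Sample a continuous action and return it with its log-density."""
    a = rng.gauss(mu, sigma)
    return a, gaussian_log_prob(a, mu, sigma)


rng = random.Random(0)
action, log_prob = sample_gaussian_action(mu=0.5, sigma=0.1, rng=rng)
wide = gaussian_log_prob(0.5, 0.5, 1.0)     # log-density at the mean, sigma = 1
narrow = gaussian_log_prob(0.5, 0.5, 0.01)  # log-density at the mean, sigma = 0.01
print(round(wide, 4), round(narrow, 4))
```

Unlike a discrete probability, a density can exceed 1, so the log-density becomes positive once the standard deviation is small enough.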

3.2 Actor-Critic Implementation

We have arrived at our main method. Unlike REINFORCE, which only learns a policy, actor-critic models simultaneously learn a value function approximation and a policy. Intuitively, the value function critiques whether an action undertaken by the policy is good, rather than being an absolute measure. Thus, we make a modification to Equation 3.4, measuring the reward against the value estimate in Equation 3.5:

$$\theta_{t+1} = \theta_t + \alpha \left( G_t - \hat{v}(S_t, \mathbf{w}) \right) \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} \tag{3.5}$$


The process of negotiation thus requires two actor-critic nets— one for the acceptance strategy and another for the offer strategy. The algorithmic procedure is shown in Fig. 3.1, with pseudo-code provided by Algorithm 3 in the Appendix. We use univariate and single-issue interchangeably, as with multivariate and multi-issue. Next, we describe the architectures of the acceptance and bidding strategy.

Figure 3.1: Flowchart of training algorithm.

3.2.1 Acceptance Net Architecture

The first neural network approximates the acceptance strategy. For the univariate case, the input is a two-element vector consisting of the opponent's offer and the current time step. For the multivariate case, the input is four-dimensional.

Figure 3.2: Architecture of the Acceptance Strategy.

At every time step, Accept Net takes in the opponent's offer and encodes it into a 512-dimensional hidden state using two affine-Relu6 pairs. This base layer is shared between the actor and value network, which facilitates a shared representation [mnih2016asynchronous]. The actor takes in the embedded state and outputs two logit values, which are softmaxed to choose the appropriate action. Similarly, the value network outputs the expected reward estimate.

The Relu6 is a variant of the Relu function, $\max(0, x)$, but capped at 6, that is, $\min(\max(0, x), 6)$. Relu6 layers have been shown to train faster (due to the limit on byte representation) and to encourage the learning of sparse features earlier on [krizhevsky2010convolutional]. This is important since gameplay is path-dependent and, against a mixed set of opponents, states may be sparse, which we confirmed during preliminary testing. Hyper-parameters were also chosen through testing, reducing layers and the number of hidden states until training behavior changed. The full architecture is shown in Fig. 3.2.
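The forward pass of the acceptance network can be sketched in plain Python. Several details are assumptions made for illustration: the hidden width is 16 rather than 512 to keep the example fast, each head is a single affine layer, the weights are random rather than trained, and the four inputs are taken to be three issue shares plus a normalized time step.

```python
import math
import random


def relu6(x):
    """Relu capped at 6."""
    return min(max(0.0, x), 6.0)


def affine(layer, x):
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(h - shift) for h in logits]
    total = sum(exps)
    return [e / total for e in exps]


def accept_net_forward(params, state):
    """Shared base (two affine-Relu6 pairs), then actor and critic heads."""
    h = state
    for layer in params["base"]:
        h = [relu6(v) for v in affine(layer, h)]
    action_probs = softmax(affine(params["actor"], h))  # two logits, then softmax
    value = affine(params["critic"], h)[0]              # scalar reward estimate
    return action_probs, value


rng = random.Random(0)
hidden = 16  # assumption: reduced from the 512 units described in the text
params = {
    "base": [init_layer(4, hidden, rng), init_layer(hidden, hidden, rng)],
    "actor": init_layer(hidden, 2, rng),
    "critic": init_layer(hidden, 1, rng),
}
probs, value = accept_net_forward(params, [0.3, 0.5, 0.2, 0.25])
print([round(p, 4) for p in probs], round(value, 4))
```

Sharing the base between both heads means the critic's error signal also shapes the representation the actor reads from.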

After playout, the critic loss is calculated by taking the mean-squared error (MSE) of the temporal difference— the difference between the observed rewards and the value network’s forward pass. Learning parameters are given in Table 3.1.

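for every Reward, State and Action do
       TDLoss $\leftarrow$ Reward $-\ V(\text{State})$;
       $L_{critic} \leftarrow$ TDLoss$^2$;
       LogProbs $\leftarrow \log \pi(\text{Action} \mid \text{State})$;
       $L_{actor} \leftarrow -\,$LogProbs $\times$ TDLoss;
       Backprop on $L_{actor}$ and $L_{critic}$;
end for
Algorithm 1 Acceptance Net Actor-Critic Update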
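The loss computation in Algorithm 1 can be sketched without a deep learning framework. This assumes the actor loss is the negative log-probability weighted by the temporal difference and that losses are averaged over the play-out; in an autodiff framework the temporal difference would typically be treated as a constant when differentiating the actor loss.

```python
import math


def actor_critic_losses(rewards, values, log_probs):
    """Mean actor and critic losses over one play-out."""
    actor_terms, critic_terms = [], []
    for reward, value, log_prob in zip(rewards, values, log_probs):
        td = reward - value                  # temporal difference
        critic_terms.append(td ** 2)         # squared error for the critic
        actor_terms.append(-log_prob * td)   # policy-gradient surrogate loss
    n = len(critic_terms)
    return sum(actor_terms) / n, sum(critic_terms) / n


# Hypothetical play-out: two steps, each action taken with probability 0.5.
rewards = [1.0, 0.0]
values = [0.5, 0.5]
log_probs = [math.log(0.5), math.log(0.5)]
actor_loss, critic_loss = actor_critic_losses(rewards, values, log_probs)
print(actor_loss, critic_loss)
```

In this example the two steps have equal and opposite temporal differences, so the actor terms cancel while the critic still incurs error on both.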

3.2.2 Offer Net Architecture

Next, assuming the agent has rejected the offer, the agent now takes in the same input and decides a counter-offer. Since we have three issues and offers operate on continuous space, the Offer Net must output a vector $\mathbf{x} = (x_1, x_2, x_3)$. To do so, we implement DRL with continuous control, sampling from three types of distributions: 1) a multivariate Gaussian, 2) three beta distributions, and 3) three Cauchy distributions.

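The multivariate Gaussian is parametrized by a vector of means $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. However, a common assumption in deep learning is that the neural network will capture interdependencies between variables. Hence, an estimate of individual standard deviations along each dimension will suffice. The probability density and policies are given explicitly below.

$$f(\mathbf{x}) = \frac{\exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)}{\sqrt{(2\pi)^{3} |\Sigma|}}, \qquad \pi(\mathbf{x} \mid s, \theta) = \prod_{i=1}^{3} \frac{1}{\sigma_i(s, \theta)\sqrt{2\pi}} \exp\left( -\frac{(x_i - \mu_i(s, \theta))^2}{2\sigma_i(s, \theta)^2} \right)$$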
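The beta distribution is defined on the interval $[0, 1]$ by two positive shape parameters $\alpha$ and $\beta$. This is useful as offers are held to a finite span. The PDF is:

$$f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}$$

where $\Gamma$ denotes the gamma function (fractional factorial). Some useful properties of the beta distribution include its intuitive mean and relatively simple expression for variance:

$$\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}, \qquad \mathrm{Var}[X] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$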
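Lastly, the Cauchy distribution is parametrized similarly to the normal, with $x_0$ and $\gamma$ denoted more generally as location and scale. Its density function is given as:

$$f(x; x_0, \gamma) = \frac{1}{\pi\gamma \left[ 1 + \left( \frac{x - x_0}{\gamma} \right)^2 \right]}$$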
The neural architecture follows the same actor-critic model described in Section 3.2. A base layer inputs into the value network and six other neural blocks, which estimate the distribution parameters: three means (locations) and three sigmas (scales). For the beta distribution, these are estimates of $\alpha$ and $\beta$. These six variables are then used to sample the offer, which serves as the action used during loss calculation.
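Sampling a three-issue offer from six estimated parameters can be sketched with the standard library alone. The parameter values below are placeholders for network outputs, and clipping out-of-range Gaussian or Cauchy draws to $[0, 1]$ is an assumption made here; the text does not specify how such draws are handled.

```python
import math
import random


def clip01(x):
    return min(max(x, 0.0), 1.0)


def sample_cauchy(loc, scale, rng):
    """Inverse-CDF sampling for the Cauchy distribution."""
    return loc + scale * math.tan(math.pi * (rng.random() - 0.5))


def sample_offer(kind, first, second, rng):
    """Draw one share per issue from per-issue (first, second) parameters.

    gaussian/cauchy: first = locations, second = scales.
    beta: first = alpha values, second = beta values.
    """
    if kind == "gaussian":
        draws = [rng.gauss(m, s) for m, s in zip(first, second)]
    elif kind == "cauchy":
        draws = [sample_cauchy(m, s, rng) for m, s in zip(first, second)]
    elif kind == "beta":
        draws = [rng.betavariate(a, b) for a, b in zip(first, second)]
    else:
        raise ValueError("unknown distribution: " + kind)
    return [clip01(x) for x in draws]  # assumption: keep shares inside [0, 1]


rng = random.Random(0)
gaussian_offer = sample_offer("gaussian", [0.6, 0.5, 0.4], [0.1, 0.1, 0.1], rng)
cauchy_offer = sample_offer("cauchy", [0.6, 0.5, 0.4], [0.05, 0.05, 0.05], rng)
beta_offer = sample_offer("beta", [6.0, 5.0, 4.0], [4.0, 5.0, 6.0], rng)
print(gaussian_offer, cauchy_offer, beta_offer)
```

The Cauchy draws cluster tightly around the location but occasionally land far away, which is the peaky-center, heavy-tail behavior that supports both exploitation and exploration.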

Figure 3.3: Architecture of the Offer Strategy for multivariate normals and Cauchy distributions. For the beta distribution, the estimated variables were $\alpha$ and $\beta$, and the final forward pass layer is a ReLu layer, instead of a Sigmoid.

The value network consists of seven affine-Relu6 layers. Neural estimates of the mean were conducted with two affine-Relu6 layers, followed by an affine-sigmoid layer to constrain the output between 0 and 1. Sigma estimates used one affine-Relu6 layer and one affine-sigmoid layer. For the beta distribution, the network estimated three pairs of $\alpha$ and $\beta$. Since $\alpha, \beta \in (0, \infty)$, the final sigmoid layer was replaced with a Relu layer. Fig. 3.3 shows the architecture in full. Apart from a similar justification for the use of Relus for Accept Net, Relus have documented success for continuous control as well [lillicrap2015continuous]. Hyper-parameters were chosen in a similar way.

Training was undertaken using Adam, with learning parameters given jointly in Table 3.1. The exact computations for back-propagation are given in Algorithm 2. Note, because back-propagation occurs on a continuous domain, log-probabilities of continuous density functions can be positive when variance is small.

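for every Reward, State and Action do
       TDLoss $\leftarrow$ Reward $-\ V(\text{State})$;
       $L_{critic} \leftarrow$ TDLoss$^2$;
       LogProbsX $\leftarrow \log \pi(x \mid \text{State})$;
       $L_X \leftarrow -\,$(LogProbsX + Entropy) $\times$ TDLoss;
       Compute $L_Y$, $L_Z$;
       Backprop on $L_X$, $L_Y$, $L_Z$, $L_{critic}$
end for
Algorithm 2 Offer Net Actor-Critic Update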

Here, Entropy denotes the entropy of the sampling distribution. Adding entropy introduces noise to enable action exploration. Also, high variance means higher loss, so over time, the variance decreases to improve the precision of the evolved strategy.

Learning Rate Epochs Optimizer
Accept Net to 8000 Adam
Offer Net (Gaussian & Cauchy) to 4000 Adam
Offer Net (Beta) 5000 Adam
Self-Play 3000 Adam
Tit-for-tat 5000 Adam
Table 3.1: Deep Learning Training Parameters. The early stopping criterion was convergence in play-out time for 500 epochs. This varied by the concession factor and discount rate.

3.2.3 Reward Scheme

Accept Net and Offer Net share the same reward scheme. With deadline , value weights , and final offer , the reward given to the neural agent is:

This reward function encourages the agent to increase its offer but not so much it forces a conflict deal and receives low reward. Unless specified, .

3.2.4 Self-Play: Characterizing Behaviors with Game Theory

Within bargaining game theory, a focus has been how mechanisms induce norms of fairness, particularly from a branch of game theory called evolutionary game theory (EGT). EGT originates from biology, where it studies the dominance of species through evolutionary pressure, and has been extended to behavioral economics to understand the evolution of behavioral traits. Nowak et al. showed fairness could be induced if reputation was taken into consideration for the repeated ultimatum game [nowak2000fairness], where agents play many one-round negotiations with other agents.

Their most important finding was that, if agents learned to reject offers they deemed too low, a population of fair agents would emerge. Thus, reputation refers to the trait of intentionally rejecting low offers, and has been confirmed with computational and empirical results [rand2013evolution]. We implement a similar study to compare results, using the neural actor-critic model instead of evolutionary methods, and on multi-round negotiation rather than the ultimatum game. Details of implementation are given in Section 4.4.2, where we demonstrate a similar appearance of fairness.

3.2.5 Against Behavior-Based Agents

Lastly, we train our agent against two behavior-based agents. The first is the relative tit-for-tat described in Section 2.2.3, and the decision function in Eq. 2.6. Furthermore, we implement a Bayesian Tit-for-tat, by estimating the opponent’s value weights. The Bayesian tit-for-tat agent first measures the opponent’s concession using its own utility function. Then, it mirrors the amount of concession. Finally, this offer is made as attractive as possible using a Bayesian opponent model [baarslag2013tit].

To do this, we first take the ratio of the opponent’s offer at and to update the decision utility. If the opponent concedes, then we concede; if they increase their share, we increase ours. Then, we estimate the opponent’s utility weights as the mean value of their offers. For instance, if an opponent offers then , then the utility is estimated as


We then implement the Simplex algorithm [dantzig1955generalized] to maximize this value, fixed upon the decision utility we calculated prior. While this assumes the opponent makes concessions in a particular (preference-based) fashion, it remains a question whether the neural agent can uncover the correct concessions to make.

4.1 Theoretical Decision Utilities

4.1.1 Utility Moments

Before proceeding to results against time-based agents, we first derive the theoretical optimal strategies. Denote the decision utility of the time-based opponent as

This is in essence the same as Equation 2.5. In our case, is the reserve price and is normalized to . Then our utility is:


The maximal point must be one where the marginal utility is 0. Before we take the derivative of , we first take the derivative of .

We then solve for our marginal utility by the product rule.


Setting the reserve price to in our experiments, we can derive a much more elegant expression, as .


It is a simple matter to check the second derivative is negative, hence the expression for the condition for the maximal point is


Interestingly, this value does not depend on the total time and since

is a linear transformation of the utility function, this optimal time depends only on the concession factor and discount rate. The optimal stopping time can be expressed as:


The theoretical values are shown in Fig. 4.1. A strong phase transition occurs along the , demarcated by the clearly lighter region.
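The claim that the optimal stopping time depends only on the concession factor and discount rate can be checked numerically under one explicit set of assumptions: a single issue, a reserve price of 0, an opponent that concedes the share $(t/T)^{1/e}$ at time $t$, and a multiplicative discount $\delta^t$. Under these assumptions the utility is $\delta^t (t/T)^{1/e}$, and setting its derivative to zero gives $t^* = -1/(e \ln \delta)$, capped at the deadline. The expressions in this sketch follow from those assumptions and are not copied from the derivation above.

```python
import math


def discounted_utility(t, deadline, e, discount):
    """Assumed utility of accepting at time t against a time-based opponent."""
    return (discount ** t) * (t / deadline) ** (1.0 / e)


def closed_form_stop(e, discount, deadline):
    """Stationary point of the assumed utility, capped at the deadline."""
    if discount >= 1.0:
        return float(deadline)  # without discounting, waiting always pays more
    return min(-1.0 / (e * math.log(discount)), float(deadline))


def grid_search_stop(deadline, e, discount, steps=20000):
    """Brute-force maximizer over an evenly spaced time grid."""
    times = [deadline * i / steps for i in range(1, steps + 1)]
    return max(times, key=lambda t: discounted_utility(t, deadline, e, discount))


analytic = closed_form_stop(e=1.0, discount=0.9, deadline=20)
numeric = grid_search_stop(deadline=20, e=1.0, discount=0.9)
longer_game = grid_search_stop(deadline=40, e=1.0, discount=0.9)
print(round(analytic, 3), round(numeric, 3), round(longer_game, 3))
```

Doubling the deadline leaves the maximizer essentially unchanged, consistent with the observation that the total time drops out of the condition.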

Figure 4.1: Theoretical optimal stopping time over concession factor and discount rate.

Additionally, we compute the second derivative of the utility, as it accounts for the error analysis of the neural agent in Section 4.2, and the $n$-th moment for generality.

Figure 4.2: Outcome space of Negotiation. Fig. 4.2a) shows the outcome space at . The vertices of the outcome space polytope are mapped from the vertices of the action space, at . Fig. 4.2b) shows the evolution of the polytope over time, with a discount rate of . The decision to counter-offer is made by the subsequent time-step, rather than the current. The colors denote the Nash Product, with the Nash solution lying at .

4.1.2 Outcome Space

In the field of automated negotiation, preferences are typically visualized through an outcome space plot. The axes are utilities of Player and . Possible outcomes are mapped to . Fig. 4.2 shows this plot for our negotiation process. In a), the Pareto frontier is shown by the right-most edges of the polytope.

By theorems of fixed points and the simplex algorithm [wong2015bridging, dantzig1955generalized], the vertices in the outcome space must come from the vertices in the action space . The action space vertices that outline the frontier are found to be . Intuitively, these are points that offer the greatest marginal utilities to P1 and P2, based on their value weights and . The piece-wise equation for the Pareto Frontier is given as follows, in Equation 4.8:


This equation allows us to calculate the bid distribution and determine the efficiency of an agent’s bid strategy, provided in Section A.2 in the Appendix. Furthermore, the Nash Solution, which lies at given by the offer , provides a benchmark for fairness when playing against behavior-based agents.
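To make the mapping from action-space vertices to the Pareto frontier concrete, the sketch below enumerates the vertices of a three-issue action space and locates the maximizer of the Nash product. The value weights `W1` and `W2` are hypothetical (the source's weights are not legible here); they are chosen only so that the two players rank the issues in opposite order and the total utility is 6. The conflict point is assumed to be (0, 0).

```python
from itertools import product

# Hypothetical value weights (assumption, not taken from the source).
W1 = (1.0, 2.0, 3.0)  # P1's weights for the three issues
W2 = (3.0, 2.0, 1.0)  # P2's weights for the three issues

def utilities(offer):
    """offer[i] is the share of issue i given to P1; P2 keeps the rest."""
    u1 = sum(w * x for w, x in zip(W1, offer))
    u2 = sum(w * (1.0 - x) for w, x in zip(W2, offer))
    return u1, u2

def pareto_vertices():
    """Action-space vertices whose outcomes no other vertex dominates."""
    outcomes = {v: utilities(v) for v in product((0.0, 1.0), repeat=3)}
    frontier = []
    for v, (a, b) in outcomes.items():
        dominated = any(
            (c >= a and d >= b) and (c > a or d > b)
            for c, d in outcomes.values()
        )
        if not dominated:
            frontier.append(v)
    return sorted(frontier, key=lambda v: utilities(v)[0])

def nash_point(steps=1000):
    """Maximize u1 * u2 along the edges joining consecutive frontier vertices."""
    verts = pareto_vertices()
    best = (-1.0, None)
    for a, b in zip(verts, verts[1:]):
        for i in range(steps + 1):
            lam = i / steps
            offer = tuple((1 - lam) * x + lam * y for x, y in zip(a, b))
            u1, u2 = utilities(offer)
            best = max(best, (u1 * u2, offer))
    return best

if __name__ == "__main__":
    print(pareto_vertices())
    print(nash_point())
```

With these assumed weights the frontier is piece-wise linear through four vertices, and the Nash product is maximized in the interior of the middle edge, where one issue is split.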

4.2 Acceptance Strategy

4.2.1 Behavioral dynamics: Cliff-walking vs optimal play

The central question for an acceptance strategy is whether, given an offer, to accept it or wait for potentially better future offers. However, if the agent fails to accept before the deadline, the conflict deal is enacted and neither agent receives any reward. Given a discount rate and opponent concession factor, the goal is to find the best moment to accept an offer, inferring from the opponent's prior offers.

Thus, the acceptance strategy can be seen as an optimal stopping problem with an additional cliff-walking problem to solve. Fig. 4.3 shows the loss, rewards, and playing time as the network trains against a linear agent () with no discount (). Through stochastic sampling of new points, the agent notices greater reward by waiting, illustrated by gradual trends in playing time (green). However, once the agent reaches the deadline at 20 rounds, the conflict deal is enacted and a reward of is issued, producing a large loss. We present only the multivariate case, as results for the univariate case are the same but with lower complexity.

Figure 4.3: Loss, rewards and total time of the DRL agent training against a time-based agent with . After a few epochs of random search, the agent learns that increased playtime comes with greater reward. However, as this time increases toward the deadline, the reward drops sharply.

To analyze the stopping time, we consider the evolution of acceptance probabilities during gameplay against time-based opponents. Fig. 4.4 shows the logit values used in Eq. 3.3 and acceptance probabilities against Boulware, Linear, and Conceder agents.

Figure 4.4: Evolution of acceptance probabilities and logit values without discounting (). Row 1 shows the acceptance probabilities at each time step (blue), the cumulative probability of first success (orange), and max value at each time step. The pink point denotes the optimal value (Eq. 4.5). As the concession factor increases, stopping time decreases. Row 2 shows where the logits "change places," corresponding to where the cumulative probability is maximal.

The first row shows the acceptance probabilities at each time step. The cumulative probability (orange) denotes the likelihood the game ends at a certain time step, given as:


Since the discount rate is , the optimal value is waiting until the final point in time. The decrease in stoppage time shown by the right-shifting cumulative probability is sub-optimal, although this is not uncommon in conservative agents. In the value function (blue, second row), there is also a slight decrease after the logit values cross. This indicates that the expected reward at these times may be the same.
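As an illustration of the cumulative probability of first success plotted in Fig. 4.4, the sketch below converts per-step acceptance probabilities into the probability that the game ends at each step: the agent must have rejected at every earlier step and accept at the current one. The function name and the example probabilities are hypothetical.

```python
def first_acceptance_distribution(accept_probs):
    """Return (stop, survive): stop[t] is the probability that the first
    acceptance happens at step t; survive is the probability that no offer
    is accepted at all (i.e. the conflict deal is reached)."""
    stop = []
    survive = 1.0  # probability that every earlier step was a rejection
    for p in accept_probs:
        stop.append(survive * p)
        survive *= 1.0 - p
    return stop, survive

if __name__ == "__main__":
    # Hypothetical acceptance probabilities rising toward the deadline.
    stop, conflict = first_acceptance_distribution([0.05, 0.1, 0.3, 0.8])
    print(stop, conflict)
```

The stopping probabilities and the residual conflict probability always sum to one, so the peak of `stop` identifies the most likely stoppage time.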

Another way to see this is to consider the marginal utility over time. Since Boulware agents only concede towards the end, the neural agent is forced to wait to achieve comparable results, whereas it may be "satisfied" earlier against Conceders. The marginal utility against the Boulware agent is thus much greater towards the end, whereas marginal utility is high at the beginning against Conceders. Explicitly, for , the expression for marginal utility is:


This analysis is corroborated further once we introduce the discount rate. Fig. 4.5 shows the acceptance probabilities and logits once discounting is introduced.

Figure 4.5: Evolution of acceptance probabilities and logit values with discount. Row one shows the acceptance probabilities at each time step (blue), the cumulative probability of the first success (orange), and the pink point denotes the theoretical maximum prescribed by Equation 4.5. As the discount rate increases, the optimal maximum and stoppage time decrease.

Looking at the red curve, has positive marginal utility and for , the marginal utility is negative from time step 7 onwards. For , the utility function is relatively flat after time step 10, which means the marginal utility is close to 0. Here, we observe the greatest time deviation.

4.2.2 Marginal Analysis: Marginal Utility determines Error

Due to the stochastic nature of deep learning, it’s difficult to construct a precise mathematical proof of how changes in marginal utilities push against each other. However, we can test this empirically. The neural agent played against a set of different agents, with concession factors of 0.95, 1.5, 2, 3, 5, 10. In Fig. 4.6a), the curves show our max utility (Equation 4.1) and the red dot shows the optimal stopping time given by Equation 4.5. Note, as increases, the curves grow sharper and since , the magnitude of the second derivative strictly increases.

Fig. 4.6b) shows an inverse relationship between the time error and the reward error. The "peakier" the curve, the more likely the neural net selects the optimal time. However, deferral by even one time step leads to a large loss in utility, hence creating the larger reward error. In contrast, using the second derivative derived in Equation 4.6, we observe in Fig. 4.6c) that as the second derivative approaches 0, the time error increases.

Having shown what produces the reward and time errors, we can address our sub-problem about limitations. For future work, we may dynamically reduce the learning rate using the second derivative and distance to the deadline for better convergence. Numerical results are summarized in Table 4.1.

Figure 4.6: Optimals vs second derivative from 100 gameplays. Rew. error scaled by 3.
Time Error 3.95 11.62 7.81 5.86 4.99 3.0 1.0
Reward Error -0.0234 -0.712 -1.054 -1.278 -1.366 -1.302 -0.907
Second Deriv. () 0.0722 -0.094 -0.152 -0.223 -0.389 -0.777 -1.886
Table 4.1: Tabular results of 100 gameplays at different concessions.

4.2.3 Preference-based concessions produce fairer outcomes

Finally, we consider optimality. Since the final offers depend on the time-based agent, so do the optimality measures. Thus, the way the opponent agent algorithmically constructs its offers will appear differently in the outcome space. Fig. 4.7 shows the distribution of accepted offers after 400 gameplays, with and . The first strategy randomly samples from the plane that satisfies the following condition:


where is the decision utility at time and is the weighted utility for issue . The second uses a preference-based, monotonic concession strategy: it satisfies Equation 4.11, but concedes starting from the issue it values the least (). Multivariate Gaussian noise with a standard deviation of is added to prevent deterministic offers. When the time-based agent uses a preference-based, monotonic concession strategy, its offers are guaranteed to lie on the Pareto Frontier.
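The preference-based concession described above can be sketched as a greedy allocation: the time-based agent keeps as much as possible of the issues it values most until its target decision utility is met, so that whatever it concedes comes from its least-valued issues first. This is a minimal sketch, not the authors' implementation; the weights are hypothetical and the Gaussian noise is omitted.

```python
def preference_based_offer(target_utility, own_weights):
    """Return the share of each issue the time-based agent keeps.

    The agent fills its most valued issues first, so concessions
    (shares given away) start from the issue it values least.
    """
    keep = [0.0] * len(own_weights)
    remaining = target_utility
    order = sorted(range(len(own_weights)),
                   key=lambda i: own_weights[i], reverse=True)
    for i in order:
        keep[i] = max(0.0, min(1.0, remaining / own_weights[i]))
        remaining -= keep[i] * own_weights[i]
    return keep

def weighted_utility(keep, own_weights):
    """Weighted utility the agent obtains from the shares it keeps."""
    return sum(w * k for w, k in zip(own_weights, keep))

if __name__ == "__main__":
    weights = (3.0, 2.0, 1.0)  # hypothetical value weights of the time-based agent
    for target in (6.0, 5.0, 4.0, 2.0):
        print(target, preference_based_offer(target, weights))
```

As the target utility falls over time, the least-valued issue is conceded entirely before the next one is touched, which is what makes the concession monotonic per issue.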

Figure 4.7: Distribution of final, accepted offers (, ). The preference-based bidding (red) produces offers on the Pareto Frontier. Planar-sampling offers (magenta) occur in intervals as the time-based agent samples points based on .
() Av. Reward Av. Time
Planar Samp. 1.64 0.54 1.84 4.89
Preference-based Concession 0.46 0.00 2.61 5.7
Pure Random 1.92 1.24 N/A N/A
Table 4.2: Sampling Results

The preference-based method produces offers that lie on the Pareto Frontier. Because of this optimality, the neural agent plays longer on average when its opponent follows this strategy. Random planar sampling yields considerably better results than pure random sampling, with a bid distribution difference of (shown in Table 4.2). The magenta points in Fig. 4.7 arise because, for every time step, the decision utility is fixed for fixed . Table 4.2 summarizes the mean outcomes of the gameplays, with the preference-based concession performing the best.

4.3 Bidding Strategy

4.3.1 Precision in Single-Issue Negotiation

Figure 4.8: Cauchy distributions based on concession factor and decision utility. When is low (Boulware), Cauchy means are clustered tightly for low , spread out for high . When is high (Conceder), values are clustered for high and spread out for low .

Before evaluating performance on the multivariate case, we verify the univariate case. Fig. 4.8 shows the action policies given by Cauchy distributions for specific decision utilities. As the concession factor increases, the distribution of means transfers from right- to left-skew. This can be attributed to the magnitude of the marginal utility. Cauchy means are clustered tightly for the Boulware agent when is low, as the marginal utility is low early on. However, as time passes and the Boulware agent begins to concede greatly, the distance between means increases. Conversely, when the opponent is a Conceder, concession begins early, so marginal utility is large when is small, leading to right-skew. The Conceder case is not as pronounced as the Boulware case due to the cliff at the deadline. As expected, means are spaced out linearly against linear agents.

In sum, the change in decision utility affects the distribution of the means. For completeness, Fig. A.1 in the Appendix shows a heat map of how the neural agent’s utility changes with respect to the opponent’s.

4.3.2 Multivariate Training Dynamics

Figure 4.9: Training statistics of the Cauchy Offer Net. The neural agent learns to induce rejection from the time-based agent so negotiation ends near the deadline.

Next, we present the multivariate case. We trained on a grid of concession factors for fixed discount rates. We denote the three issues as issues , , and . Fig. 4.9 shows the first 1000 epochs, using a multivariate Cauchy distribution with and no discount. Unlike training Accept Net, cliff-walking is less present, as the final action (accept) lies with the opponent and time-based agents are very likely to accept. The agent quickly learns to wait longer, converging at a higher time step.

However, unlike Accept Net, this is not simply a binary action where rejecting an offer leads to the next round. The agent has to produce offers that induce rejection from the time-based agent. Fig. 4.10 shows how Cauchy means vary over time. The first row shows the individual Cauchy means for issues , and respectively. Also shown, in light blue, are the normalized utility (divided by the total possible utility of 6) and the equilibrium payout from 100 gameplays. The second row shows the opponent’s decision function (blue) and our maximum utility at each time step. The value estimate given by the value net is shown in green. At the bottom, the red line shows the mean stoppage time, with the distribution of times shown with a kernel density estimate.

Figure 4.10: Variation of Cauchy means over time with no discount rate. The expected reward hovers around 84 percent of the full utility. All stoppage times are high, but still decrease as the concession factor increases due to reduced marginal utility.

Since the cliff-walking aspect is not as prominent (the only case where the conflict deal is enacted is if the agent proposes the full amount at the end), all stoppage times are relatively high, although diminishing stoppage time is still observed when the marginal utility is lower. For instance, when , the marginal utility is constantly and the second derivative is , the stoppage time is . Comparably, when , the marginal utility is increasing and the mean stoppage time is .

Note issue varies the most, either through concession (Fig. 4.10a)) or increase of the offer value as shown in Fig. 4.10c). At a glance, this may be counter-intuitive, since a change in or would yield the most marginal gains for the agent. However, the opponent values issue the most, which means produces the largest amount of gradient for the least amount of loss during concession. As a result, the agent learns to negotiate close to or along the Pareto Frontier, which we show later in the distributional analysis. Secondly, there is a clear progression between Boulware and Conceder strategies when comparing the linear agent to the Boulware agent.

Additionally, in the bottom row, the green value function remains fairly constant throughout, until the drop-off towards the end induced by the deadline. The value function remains flat since the expected value is constant: so long as the neural agent sticks to its strategy, the payout will not change. While performance is not optimal, it achieves more than of the optimal, which is typical of risk-averse agents whose behaviors are generally conservative on estimates [sandholm1999bargaining].

Figure 4.11: Variation of beta distribution means over time with discount rate. Each mean is calculated using Equation 3.8. Gameplay results fall consistently within of the expected optimal stoppage time. Values begin around the same region with high variance as the agent is unsure what opponent it is playing against, but the variance decreases with time as certainty about its opponent’s strategy grows. .

Next, we compare this to the case when discounting is introduced, using the beta distribution as an example. Fig. 4.11 shows the gameplays of a neural agent using the beta distribution. The first row shows the evolution of the multivariate distribution means (blue, orange and green for , , and respectively), the evolution of the normalized utility (black), and the reward under 100 gameplays. The second row shows the theoretical maximums (orange) and mean stoppage time (red). Immediately, we observe that the agent begins sampling around the same initial values, then alters its strategy as it learns more about the opponent. The way it alters its strategy varies depending on its own inherent discount rate, which demonstrates adaptability.

Finally, issue is again the issue with the most variation, with the same argument that its change produces the highest gradients during gameplay, due to the opponent valuing the most. Mean stoppage time is close to the optimal, with mean deviation. While this is not precisely optimal, it is quite good. To understand what causes this limitation, we analyze the probability distributions.

4.3.3 Offer Strategy requires Sensitivity to Variance

We compare the outcome space of agents using Gaussian, Cauchy, and beta distributions, after playing 3000 rounds against batches of mixed opponents. Fig. 4.12 shows the distribution of final offers given by each agent, with the addition of a random agent, when playing against a linear agent with no discount rate. Since time-based agents make monotonic concessions, lower y-values imply longer gameplay times.

Figure 4.12: Distribution of outcomes based on different sampling distributions. Neural agents were trained on 3000 games, then played against a linear agent. The random agent plays. Statistics are summarized in Table 4.3

As expected, the random agent produces offers distributed randomly in the outcome space. At a glance, the beta distribution outcomes bear the most resemblance to the random agent. This is due to the beta distribution’s initial high variance. This can be adjusted by increasing the constant added to the initial values of and . However, also note that the scattered points are on average greater than 3.

The normal distribution produces the most consistent results, with results clustered around the vertex. The Cauchy distribution performs similarly, but on average performs better, with a maximum value of and average . However, it also has a much greater variance when compared to the Normal distribution. We can conclude convergence to optimal play requires sufficient initial variance to prevent convergence to local optima. Additionally, the maximum value achieved by any of these distributions was achieved by the beta distribution. These values are presented in Table 4.3.

Av. Reward Av. Time Reward. Range
Rand. Samp. 1.901 1.246 NA NA NA
Beta 1.741 1.0018 3.587 10.11 5.378
Normal 1.585 0.0815 4.993 11.03 0.294
Cauchy 2.261 0.218 5.051 14.66 4.403
Table 4.3: Gameplay outcomes from 400 games against linear agent. Sampling through the normal distribution gives the fairest and most consistent results, whereas the Cauchy provides the highest expected reward.

The normal distribution produces consistent results, with the lowest bid distribution and reward range, and also is closest to the Nash Solution. In expectation, the Cauchy distribution produces better results but with a higher bid distribution and is farther from the Nash Solution. However, having a large distance from the Nash Point is not necessarily bad, as exploiting the opponent’s strategy leads to higher rewards. Without discount, the neural agent can improve its own outcomes by waiting.

The four panels in Fig. 4.12 reveal two opposing forces that make continuous DRL difficult in this domain. Convergence to optima requires high variance, yet low variance is needed to avoid sampling the conflict deal. As further evidence, comparing Fig. 4.10 and Fig. 4.11 shows that the time error decreases with discount rate. The discount rate shifts the optimal away from the cliff, thus sampling around the optimal produces less error due to slow change in marginal utility, and the smooth reward function allows more accurate function approximation [lillicrap2015continuous, sutton2018reinforcement].

The next step is to compare the variances of the three distributions, and their sensitivity to parameter change. The Normal’s variance is directly parametrized by the action network. In contrast, the variance of the beta distribution depends on both shape parameters and :

which means large, simultaneous increases in both and are required to lower variance, leading to slow convergence. In contrast, the Cauchy distribution famously has no defined mean, variance, or kurtosis, because the integrals that define these moments do not converge.
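To illustrate how slowly the beta distribution's variance shrinks, the sketch below uses the standard variance formula for a beta distribution with shape parameters written here as `a` and `b` (the source's own symbols are not legible in this text). Holding the mean at 0.5 by setting `a == b`, the variance reduces to 1 / (4 (2a + 1)), so it decays only in proportion to 1/a.

```python
def beta_variance(a, b):
    """Variance of Beta(a, b): a*b / ((a + b)^2 * (a + b + 1))."""
    return (a * b) / ((a + b) ** 2 * (a + b + 1.0))

if __name__ == "__main__":
    # Doubling both shape parameters roughly halves the variance,
    # whereas halving a normal's standard deviation quarters its variance.
    for a in (1.0, 2.0, 4.0, 8.0, 16.0):
        print(a, beta_variance(a, a))
```

This is one way to read the trade-off discussed here: a network that outputs the shape parameters must move both of them by large amounts to sharpen its policy, while a network that outputs a scale parameter directly can do so in one small step.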

This points to why the Cauchy distribution works better, all things held equal. It is parameterized as directly as the normal distribution, but is also heavier-tailed and "peakier" than the Gaussian. This slower decay in the tails means lower variance sensitivity, hence avoiding convergence to local optima. At the same time, the Cauchy distribution also has a much higher peak than the Gaussian, which means there is less cost in accuracy when sampling.
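The heavy-tail property can be checked numerically with the closed-form tail probabilities of the two distributions. The sketch below compares a standard normal with a standard Cauchy (location 0, scale 1); matching the two by their scale parameter is an assumption made only for illustration, and the function names are hypothetical.

```python
import math

def normal_tail(k):
    """P(|X| > k) for a standard normal variable."""
    return math.erfc(k / math.sqrt(2.0))

def cauchy_tail(k):
    """P(|X| > k) for a standard Cauchy variable (location 0, scale 1)."""
    return 1.0 - (2.0 / math.pi) * math.atan(k)

if __name__ == "__main__":
    for k in (1.0, 3.0, 10.0):
        print(k, normal_tail(k), cauchy_tail(k))
```

The Cauchy tail mass decays roughly like 1/k while the normal tail decays faster than exponentially, so a Cauchy policy keeps sampling far from its location parameter even after that parameter has settled.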

With regard to our study’s objectives, the results from Sections 4.2 and 4.3 identify the primary limitation: convergence is slow when variance is high, and sub-optimal when variance is low due to a lack of action exploration. This suggests the learning rate can be adjusted through marginal utilities, the distribution’s kurtosis (peakiness and tail behavior), and the variance sensitivity for faster convergence and efficient outcomes. A more aggressive learning rate can curtail distributions with lower variance sensitivity. Furthermore, variation in concession factor and discount rate yields different strategies from Offer Net, thus demonstrating adaptivity.

4.4 Self-Play: The Emergence of Fairness

So far, we have addressed sub-problems related to training barriers and demonstrated exploitative capabilities against time-based agents. While play against time-based agents provides clear benchmarks due to monotonic time-based concession, play against behavior-based agents is required to evaluate behavioral traits such as fairness. In this section, we first present a game-theoretic framework of our games, then the results for single- and multi-issue self-play, then against two variants of tit-for-tat agents.

4.4.1 Game-theoretic Framework

We introduce a few game-theoretic concepts required for in-depth behavioral analysis. An extensive game consists of a set of players and a set of sequences that denote possible game trajectories. A game tree describes this trajectory of states, round-by-round. A Nash Equilibrium (NE) denotes an outcome from which no player wants to deviate unilaterally. In extensive games, a strategy profile is an NE if

denotes the strategy set of player . Let denote the strategy profiles, where each bracket contains a player’s sequence of moves [osborne1994course].

Figure 4.14: Game tree of the centipede game. Every round, the "pie" grows by , ending with . Players can split it evenly at the end (cooperate every turn), or defect. By backwards induction, the SPNE is , with reward.
Figure 4.13: Centipede Game

In extensive games, the concept of sub-games describes a part of the game tree that functions as a game in itself [osborne1994course]. Fig. 4.14 shows the game tree of the centipede game, a canonical game in game theory, and Fig. 4.16 shows the game tree of a bargaining game. In this instance of the centipede, the total size of the pie increases by at every time step. The players can choose to wait or defect. Consider the rightmost node in Fig. 4.14 labeled , denoting P2’s decision to cooperate or defect. Since the pay-off of defecting yields a reward of over from cooperating, P2 will defect if they are rational. The sub-tree stemming from can then be reduced to .

Once P1 realizes P2 will defect, P1 will also defect as this yields a higher reward. This process continues until P1 defects in round 1. The process of iteratively reducing up the tree is known as backwards induction. The result at the end is a sub-game perfect Nash Equilibrium (SPNE), a type of NE that is also an equilibrium of every sub-game. The SPNE in this game is found to be . Ironically, if both players waited until the end, they would receive higher rewards. Hence, for the centipede game, we expect to see cooperative agents wait until the end, while rational agents defect at the beginning.
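The backwards-induction argument can be reproduced mechanically. The sketch below solves a hypothetical centipede game; the pot size, growth per round, and the defector's share are illustrative assumptions, not the values used in Fig. 4.14. Players alternate moves, a defector takes `take_share` of the current pot, and the final pot is split evenly if nobody defects.

```python
def solve_centipede(n_rounds=20, growth=0.1, take_share=0.8):
    """Backward induction on a hypothetical centipede game.

    Returns (plan, payoffs): plan[k] is the mover's rational action at
    node k ("take" or "pass"); payoffs is the (P1, P2) SPNE outcome.
    """
    def pot(k):
        return 1.0 + growth * k  # pie size when node k is reached

    # If every node is passed, the final pot is split evenly.
    continuation = (pot(n_rounds) / 2, pot(n_rounds) / 2)
    plan = []
    for k in reversed(range(n_rounds)):
        mover = k % 2  # P1 (index 0) moves at even nodes, P2 at odd nodes
        take = [0.0, 0.0]
        take[mover] = take_share * pot(k)
        take[1 - mover] = (1.0 - take_share) * pot(k)
        # The mover defects whenever it beats the reduced sub-game's value.
        if take[mover] > continuation[mover]:
            continuation = tuple(take)
            plan.append("take")
        else:
            plan.append("pass")
    plan.reverse()
    return plan, continuation

if __name__ == "__main__":
    plan, payoffs = solve_centipede()
    print(plan[0], payoffs)
```

Under these assumed payoffs the solver defects at every node, so the game ends in round 1 even though passing throughout would leave both players with 1.5 each, which mirrors the tension described above.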

Figure 4.16: Game tree of a bargaining game. With an initial size of , the size diminishes by each time step, ending with after the sixth and last round, when the conflict deal is enacted. The sum of the rewards is subject to .
Figure 4.15: Bargaining Game

4.4.2 Univariate self-play results

Before training with self-play in the multi-issue domain, we consider a simplified version. Instead of giving offers in , consider the case where bidding actions are constrained to a binary decision: offering either or . Thus, agents can either offer a low amount to their opponent (rational behavior), or a fair amount. Combined with the acceptance decision, this gives four strategies, which can be summarized as:

  1. Offer low, reject nothing. This is typically the SPNE, thus rational (G1).

  2. Offer high, reject nothing. This is altruistic (G2).

  3. Offer high, accept high. This agent is fair (G3).

  4. Offer low, accept high. This one is often disregarded, as it is a hardliner (G4).
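The four strategies above can be encoded as pairs of an offer and a minimum acceptable offer. In the sketch below, `LOW` and `HIGH` are hypothetical offer sizes (the source's values are not legible in this text), the pie has size 1, and "reject nothing" is interpreted as accepting either of the two possible offers.

```python
# Hypothetical offer sizes for a pie of size 1 (assumption, not from the source).
LOW, HIGH = 0.2, 0.5

# Each strategy is (offer made as proposer, minimum offer accepted as responder).
STRATEGIES = {
    "G1": (LOW, LOW),    # rational: offer low, reject nothing
    "G2": (HIGH, LOW),   # altruistic: offer high, reject nothing
    "G3": (HIGH, HIGH),  # fair: offer high, accept only high
    "G4": (LOW, HIGH),   # hardliner: offer low, accept only high
}

def play(proposer, responder):
    """One round of the mini-game; returns (proposer payoff, responder payoff)."""
    offer, _ = STRATEGIES[proposer]
    _, threshold = STRATEGIES[responder]
    if offer >= threshold:
        return 1.0 - offer, offer
    return 0.0, 0.0  # rejection: both players receive nothing

if __name__ == "__main__":
    for p in STRATEGIES:
        print(p, [play(p, r) for r in STRATEGIES])
```

The payoff table makes the tension explicit: a low offer earns the most against responders who accept anything, but earns nothing against G3 or G4, which is the leverage that rejection provides.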

Figure 4.17: Training results for self-play, for the centipede game and bargaining game. Depicted are P1’s results (blue), P2’s results (orange) and the total playing time (green).

Fig. 4.17 shows the training results for the bargaining and centipede game, with the discount factors set to and respectively. Note, by setting the discount rate to greater than , the bargaining game effectively becomes a more complex version of the centipede game. For the bargaining game in Fig. 4.17a), the reward for P2 is initially low, then increases as the playing time approaches one round. Conversely, the reward for P1 decreases. We infer that P2 learns to reject P1’s offer, and P1 learns to accept. This play is close to rationally optimal. If the agents were perfectly rational, the game would end immediately. However, P2 adopts a strategy that forces play to go on; we analyze the reason for this further in the multivariate case.

Figure 4.18: Comparison of offer and acceptance logits for the (a) bargaining and (b) centipede game. In both, Offer Net initially gives rational offers, then over time shifts to fair offers to increase acceptance probability. Accept Net accepts early in the bargaining game, and late for the centipede game, aligned with the optimal stopping times.

In the centipede game in Fig. 4.17b), the players learn to play close to 20 rounds, maximizing the "interest" accumulated. The total size of the pie is . By the final round, P1 holds a mixed strategy that yields 47% rational offers and 53% fair offers. As the time series shown are running averages, the big dips show brief spans where P1 adopts a fair strategy. These dynamics can be seen more clearly by observing how decision logits evolve during gameplay. Fig. 4.18 shows the probabilities of giving rational and fair offers at each time step, and the probabilities of accepting an offer (for one game trajectory).

In both cases, agents start out giving low offers to their opponents. However, as time moves forward, the probability of a fair offer increases, to increase the probability of acceptance. For the bargaining game (a), the stoppage time reaches a maximum close to the beginning. This suggests the network, through gameplay, learns outcomes similar to backward induction. Similarly in the centipede game, the network learns to wait, leveraging "interest" to accept near the deadline, which also indicates cooperative behavior. Together, we conclude the neural agent learns to accept optimally and, as time moves forward, shifts its behavior from G1 (rational) to G3 (fair).

4.4.3 Multivariate Self-play

Now, we extend analysis to the continuous, multi-issue case. Fig. 4.19 shows the training dynamics of the Offer Net and Accept Net, whose final rewards are plotted per epoch, with the discount rate set to . Initially, the Offer Net (blue) has higher reward: as long as some reward is given to Accept Net, the Accept Net will accept it. However, around epoch 1600 the Accept Net learns to invoke the conflict deal. This is demarcated by the large drop in reward for both blue and orange to . After this, the Offer Net must concede by offering a fairer amount.

Figure 4.19: Multivariate self-play for negotiation. Offer Net concedes value after Accept Net adopts a mixed strategy with the conflict deal threat.

In sum, by including some probability of the conflict deal, neural agents force a counter-offer that is fairer. This departs from classical game theory, as an example of a non-credible threat. A non-credible threat describes actions that perfectly rational agents will not carry out, as it would also leave themselves worse off [osborne1994course]. The adoption of non-credible threats is also observable in the univariate centipede game, with periods of dips in P1’s reward, following the conflict deal. For discounted bargaining, convergence to low playing time suggests that heavy discounting acts as a similar threat: if you do not offer a fair deal, I will drag on the negotiation. By keeping non-credible threats part of a mixed strategy, fair outcomes can evolve.

This is significant because it agrees with results from evolutionary game theory. As mentioned in Section 4.4.1, reputation produces fairness in the repeated ultimatum game. Nowak et al. showed this first through the exact same simplified mini-game (bids restricted to low and fair offers), then with the full bidding space using population-based experiments [nowak2000fairness]. Populations of G1, G2, and G3 played against each other, and "reproduced" based on their utility. After many rounds, results showed that the rational agent (G1) dominated. However, if the prior acceptance history was available, which showed agents rejecting below a certain threshold, then a population of fair agents (G3) who offered high and accepted high would emerge.

In other words, there is a strong similarity between non-credible threats in our neural agent and the rejection of low offers (using reputation) in evolutionary strategies. This is by far the most interesting result, and it’s important to note evolutionary methods and RL are often framed as competing choices for agent design [salimans2017evolution]. Similar results produced in these two fields may drive future research directions.

4.4.4 Against Tit-for-Tat Agents

Figure 4.20: Evolution of play against relative tit-for-tat agents.

Finally, we present results against relative TFT and the Bayesian TFT agent. Since this investigation studies whether the acceptance and bidding strategy can adapt and induce promising counter-bids, each game is designed as follows: the TFT agent makes a bid, then the neural agent makes an acceptance decision and counter-bids. Thus, the game can only end on the neural agent’s acceptance.

Against the relative TFT agent with no discounting (Fig. 4.20), the neural agent converges to the vertex. Yellow represents the ending epoch and we observe the neural agent’s utility is greater than 3. Notably, the TFT agent makes offers opposite of the Pareto Frontier. This arises as the relative TFT agent measures concession with respect to its own utility. As analyzed for the time-based opponents, DRL agents are prone to adjusting variables that yield the greatest rewards and lowest losses. To the TFT agent, this is reversed, prompting it to concede the issue it values most and propose away from the Pareto frontier. This demonstrates adaptivity.

Figure 4.21: Evolution of play against Bayesian tit-for-tat agents. The agent learns to accept early and around the Nash point.

In contrast, the neural agent cooperates with the Bayesian TFT agent. In Fig. 4.21a (epoch 0) the neural agent performs randomly and suboptimally. Note, the color bar represents time step. However, the bid direction drifts towards the vertex (Figs 4.21(b) and (c)). By epoch 1450, the neural agent learns to induce results near the Nash point, in only four moves due to the discount rate.

With results from time-based agents and self-play, our analysis shows that when concession is necessary, the bidding strategy gravitates toward . This ensures a near-Pareto-optimal payoff if its offer is accepted. Exploitation occurs against time-based and relative TFT agents, while fairer outcomes arise with more complex agents.

5.1 Summary of Discussion

Bilateral negotiation presents a unique domain that combines discrete and continuous control problems. Furthermore, the deadline produces a utility function analogous to cliff-walking. This paper is a fundamental evaluation of actor-critic models for negotiation, measuring their ability to exploit, adapt, and cooperate.

The neural agent shows clear exploitative behavior against time-based agents. For acceptance, the neural agent demonstrates precise logit switching behavior, in transitions between rejecting and accepting offers. The acceptance strategy resembles a conservative agent, accepting a little before the optimal time. For the bidding strategy, we evaluated the use of Normal, Cauchy, and beta distributions for continuous control. The Cauchy has the highest reward, but the Normal is more consistent. The neural agent learns ways to evaluate the opponent, such as maintaining high mean, high initial variance to ensure enough rejections, before lowering the variance to more deterministic outcomes. This also demonstrates adaptability to concession and discounting.

Time-based experiments reveal the barriers to optimal convergence. We discover the error in stoppage time can be explained by the change in marginal utility (second derivative) and cliff-walking: the agent waits for higher rewards, then is punished aggressively due to enacting the conflict deal. The primary factor that influences bidding optimality is the trade-off in variance (i.e., the beta distribution suffers from slow convergence due to low variance sensitivity). High variance is required to seek out optimal strategies, but low variance helps avoid the conflict deal. The peakiness of the Cauchy and its heavy tails make it a suitable candidate.

The neural agent was also shown to be cooperative and adaptive. When playing against time-based agents with preference-based concessions, offers are accepted along the Pareto Frontier and produce the highest expected reward. Self-play in the centipede game shows agents are willing to accrue interest, which demonstrates cooperation over rationality. Against simple Bayesian TFT agents, the neural agent learns to quickly arrive at the Nash Solution, resulting in win-win cooperation. Since all results arise from a single neural architecture, the neural agent shows significant adaptability. Most importantly, the neural agent forces fairer results by either 1) utilizing the conflict deal or 2) levying discounting to force fairer offers. There is a strong similarity between non-credible threats in our neural agent and the rejection of low offers (using reputation) in evolutionary strategies. It’s important to note evolutionary methods and RL are often framed as competing choices for agent design [salimans2017evolution]. Beyond theoretical interest in diverging from classical game theory, these results may guide the design of fairer negotiations, with EGT from a population perspective and DRL from the individual agent’s perspective.

Before discussing future work, I'll note what didn't work. Initially, the use of LSTMs seemed promising due to their success in natural language negotiation generation. However, in that setting the action domain is discrete and limited, unlike the continuous offers considered here.

5.1.1 Evaluation and Future Work

One weakness of this study is that it considers a single preference ordering. For the scenario, promising avenues include variations in the utility functions, as there are six possible preference orderings for three issues. More important is the inclusion of more complicated behavior-based agents. One barrier to this is that, unlike the iterated prisoner's dilemma (IPD), which has hundreds of established strategies collected in the Axelrod library [axelrodproject], negotiation lacks a comparable repository of strategies.

However, this is quickly changing. An annual negotiation competition that began in 2010 [baarslag2012first] collects strong bots into the Genius Environment [lin2014genius], maintained by Tim Baarslag. I anticipate running the neural agent against these bots, to understand how DRL performs in a tournament setting and against more complicated strategies.

Another weakness of this study is the limited experimentation with design choices (learning methodology), which was out of scope given that the focus of this dissertation was behavioral analysis. A separate study with a deep learning focus could scope out the impact of neural architecture (the type of non-linearity and number of layers) and hyper-parameters (learning rate, reward discounting, and the use of schedulers). Additionally, increasing the complexity of the algorithm may improve performance, such as expanding the input space to include prior moves against trajectory-based opponents, or using Monte-Carlo Tree Search and rollout [silver2016mastering].


Appendix A

A.1 Policy-Gradient Theorem

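The policy gradient theorem states that the gradient of the scalar performance measure $J(\theta)$ is proportional to a weighted sum of gradients of the policy with respect to its weights $\theta$. More specifically, this is given as:

$$\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$$

Here, $\mu(s)$ denotes the on-policy distribution of a state $s$ under policy $\pi$. We can think of this as the frequency with which a state occurs. $q_\pi(s, a)$ is the value of the state-action pair and $\nabla \pi(a \mid s, \theta)$ is the change in the policy distribution. What this says is that a positive change in $J$ can be produced by a proportional shift in the policy. A full proof can be found in Chapter 13 of [sutton2018reinforcement].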
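As a sanity check on the statement above, the following sketch verifies it numerically in the simplest possible case: a one-state problem with a softmax policy, where the on-policy distribution is trivially 1 and the action values do not depend on the policy weights. In that case the performance is the expected action value under the policy, and its gradient equals the sum over actions of the action value times the gradient of the action probability. The softmax parameterization, the action values, and all function names are assumptions chosen for illustration, not the networks used in this work.

```python
import math


def softmax(theta):
    """Softmax policy over actions, shifted by the max for numerical stability."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def performance(theta, q):
    """J(theta) = sum_a pi(a | theta) * q(a) for a one-state problem."""
    return sum(p * qa for p, qa in zip(softmax(theta), q))


def policy_gradient(theta, q):
    """sum_a q(a) * d pi(a) / d theta_k, using d pi_a / d theta_k = pi_a * (1[a == k] - pi_k)."""
    pi = softmax(theta)
    n = len(theta)
    return [
        sum(q[a] * pi[a] * ((1.0 if a == k else 0.0) - pi[k]) for a in range(n))
        for k in range(n)
    ]


def finite_difference(theta, q, eps=1e-6):
    """Central-difference estimate of the gradient of J, used only as a check."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((performance(up, q) - performance(down, q)) / (2 * eps))
    return grad


theta = [0.2, -0.1, 0.5]  # hypothetical policy weights
q = [1.0, 0.0, 2.0]       # hypothetical action values
analytic = policy_gradient(theta, q)
numeric = finite_difference(theta, q)
print(analytic)
print(numeric)
```

The two printed gradients should agree to within the finite-difference error. With several states, the same inner sum would be weighted by how often each state is visited.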

A.2 Point to Line Calculation

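The general form of the closest distance from a point $(x_0, y_0)$ to the line $ax + by + c = 0$ is

$$d = \frac{|a x_0 + b y_0 + c|}{\sqrt{a^2 + b^2}}$$

The Pareto Frontier is given by Eq. 4.8. Hence, the distance of a point to the Pareto Frontier is obtained by substituting the coefficients of Eq. 4.8 for $a$, $b$, and $c$.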
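The point-to-line distance can be computed directly from the coefficients of a line written as a*x + b*y + c = 0. The sketch below is a minimal illustration; the frontier used in the example (the two utilities summing to 1) is a hypothetical stand-in and is not the frontier of Eq. 4.8, and the function name is an assumption.

```python
import math


def point_to_line_distance(x0: float, y0: float, a: float, b: float, c: float) -> float:
    """Distance from the point (x0, y0) to the line a*x + b*y + c = 0."""
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("a and b cannot both be zero")
    return abs(a * x0 + b * y0 + c) / norm


# Hypothetical frontier u1 + u2 = 1, i.e. a = 1, b = 1, c = -1 (an assumption, not Eq. 4.8).
d = point_to_line_distance(0.4, 0.4, 1.0, 1.0, -1.0)
print(d)
```

A payoff pair lying exactly on the line gives a distance of zero, so smaller values indicate outcomes closer to the frontier.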

A.3 Actor-Critic Playout Implementation

for e in epochs do
       while not accepted and deadline not reached do
             P1 offers;
             P2 accepts or rejects;
             Collect States for both players;
             Collect Acceptance Actions and Rewards for both players;
             if P2 rejects then
                   Swap Places (P2 becomes the offerer; P1 receives, then counter-offers);
             end if
       end while
       Calculate Critic and Actor Loss for both Accept Net and Offer Net;
       Backprop on both networks;
end for
Algorithm 3 Implementation of negotiation playout and training pipeline
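The playout portion of the algorithm can be sketched in a few lines. This is a simplified illustration under several assumptions that are not taken from the implementation: a single-issue split of a unit pie rather than the multi-issue setting, a zero payoff for the conflict deal, and a hypothetical `RandomAgent` class standing in for the Offer Net and Accept Net. The loss calculation and backpropagation steps are omitted.

```python
import random


class RandomAgent:
    """Hypothetical stand-in for an agent's Offer Net and Accept Net."""

    def __init__(self, rng: random.Random, threshold: float):
        self.rng = rng
        self.threshold = threshold

    def offer(self, t: int) -> float:
        # Share of the pie offered to the opponent, in [0, 1).
        return self.rng.random()

    def accept(self, share: float, t: int) -> bool:
        # Accept any offer that meets a fixed threshold.
        return share >= self.threshold


def playout(p1, p2, deadline: int):
    """Alternate offers until one is accepted or the deadline passes.

    Returns (trajectory, rewards): trajectory holds one
    (time, proposer index, share, accepted) tuple per round, and
    rewards holds the payoff of each player by index.
    """
    agents = (p1, p2)
    proposer, responder = 0, 1
    trajectory = []
    for t in range(deadline):
        share = agents[proposer].offer(t)
        accepted = agents[responder].accept(share, t)
        trajectory.append((t, proposer, share, accepted))
        if accepted:
            rewards = [0.0, 0.0]
            rewards[responder] = share
            rewards[proposer] = 1.0 - share
            return trajectory, rewards
        # Swap places: the responder makes the next offer.
        proposer, responder = responder, proposer
    # Deadline reached: conflict deal, assumed here to pay zero to both players.
    return trajectory, [0.0, 0.0]


rng = random.Random(0)
trajectory, rewards = playout(RandomAgent(rng, 0.6), RandomAgent(rng, 0.6), deadline=10)
print(len(trajectory), rewards)
```

In a training loop, the collected trajectory and rewards would feed the critic and actor losses for both networks after each playout.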

A.4 Change in Univariate Mean Estimation

The y-axis shows the decision utility at a given time, which is inversely related to the concession factor. The higher the decision utility, the more Boulware the agent is. The concession value can be converted with .

Figure A.1: Cauchy means based on opponent decision utility and time. The decision utility serves as a proxy for concession, as the higher the decision utility, the more Boulware the opponent.
