Abstract
Negotiation is a process in which agents work through disputes and maximize their surplus. As the use of deep reinforcement learning in bargaining games is largely unexplored, this paper evaluates its ability to exploit, adapt, and cooperate to produce fair outcomes, in comparison to classical game-theoretic results.
Two actor-critic networks were trained for the bidding and acceptance strategies, against time-based agents, behavior-based agents, and through self-play. Gameplay against these agents reveals three key findings. 1) Neural agents learn to exploit time-based agents, achieving clear transitions in decision preference values. The Cauchy distribution emerges as suitable for sampling offers, due to its peaked center and heavy tails. The kurtosis and variance sensitivity of the probability distributions used for continuous control produce trade-offs between exploration and exploitation. 2) Neural agents demonstrate adaptive behavior against different combinations of concession factors, discount factors, and behavior-based strategies. 3) Most importantly, neural agents learn to cooperate with other behavior-based agents, in certain cases utilizing non-credible threats to force fairer results. This bears similarities to reputation-based strategies in evolutionary dynamics, and departs from equilibria in classical game theory.
2.1 Overview of Negotiation
A negotiation setting contains a protocol, agents, and a scenario. The protocol determines the rules of how agents interact with each other. The scenario takes place in a negotiation domain, which determines an outcome space, denoted \Omega. A negotiation domain can have a single issue or multiple issues. Issues refer to the resources under contention, such as the price of an object or a level of service. Thus, an outcome can be described as a specific division of the issues. Agents have preference profiles, which determine the specific outcomes they prefer.
2.1.1 Protocols
We use single-issue bargaining as a preliminary illustration. Given a unit pie, two players A and B are asked to split it amongst themselves [fatima2013negotiation]. Suppose Agents A and B negotiate for up to T rounds to divide a unit pie, by alternately proposing outcomes called bids or offers, until a player accepts. We denote an offer x, with x \in [0, 1], such that the proposer receives x and the opponent receives 1 - x.
This process of alternating offers is known as Rubinstein's bargaining protocol. Games with only one round are known as ultimatum games [rubinstein1982perfect]. In an ultimatum game, Player A makes the first and only proposal. Player B can only accept or reject it, which means A has all the power. Similarly, if there are two rounds, then Player B has the advantage. In a game of repeated offers, it is necessary to introduce some form of discount factor; otherwise, players would negotiate forever. The discount factor \delta makes a portion of the pie go bad at every round. Thus, it is in the best interest of both players to finish the game as soon as possible.
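As a toy illustration of why discounting drives early agreement, consider the following sketch (the function name and numbers are illustrative, not part of the protocol itself):

```python
def discounted_value(share, t, delta):
    """Value of receiving `share` of the pie when agreement is reached
    at round t, with per-round discount factor delta."""
    return share * delta ** t

# With delta = 0.9, accepting half the pie now beats holding out three
# more rounds for a 60% share.
now = discounted_value(0.5, 0, 0.9)
later = discounted_value(0.6, 3, 0.9)
```

With delta = 0.9, the delayed 60% share is worth only 0.6 * 0.729 = 0.4374, so delay is costly even when the nominal share grows.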
The Rubinstein bargaining protocol is widely used because it accurately simulates many real-world scenarios [rubinstein1982perfect]. Multi-issue bargaining is more complex: multiple issues are under contention, which requires further protocol restrictions describing how each issue is resolved. Common ones are [kraus1997negotiation]:

Package-deal Procedure: All issues are addressed at once.

Simultaneous Procedure: All issues are solved independently. It is equivalent to a set of single-issue problems.

Sequential Procedure: One issue is negotiated at a time, in a predetermined sequence. Prior and future issues cannot be negotiated.
An alternative protocol is the monotonic concession protocol [rosenschein1994rules], where agents disclose information about how they value each issue, and their subsequent offers must have less utility than their prior ones. Other protocol considerations include [fatima2014principles]:

Time Constraints: Beyond the discount factor \delta, there is often a deadline T. If negotiation does not end by T, players earn zero utility (known as the conflict deal).

Divisibility: Issues may be atomic and discrete, or divisible and continuous.

Lateralness: Whether negotiation is between two parties (bilateral) or with multiple parties (multilateral).

Reserve Price: The minimum utility an agent is willing to accept.
2.1.2 The Scenario
The utility is defined as a cumulative utility: a combination of sub-utility functions. Most commonly used is linear additivity. With division x = (x_1, \dots, x_n) for Player A (PA) and the remainder for Player B (PB), the aggregate utility of PA is:

(2.1)  U_A(x, t) = \delta^t \sum_{i=1}^{n} w_i x_i

where w_i is the value (weight) PA ascribes to issue i, \delta the discount rate, and x_i the division for issue i. This can be viewed as the discounted dot product of the weights w and issue divisions x. In many cases, however, utilities are not linear in combination. For instance, in auctions of multiple items, combinations of items yield greater rewards, to the effect of the sum being greater than the parts, due to synergistic effects. These are modeled with nonlinear utility functions [ito2008multi].
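For concreteness, the discounted linear-additive utility can be sketched in a few lines of Python (the weights and divisions are illustrative):

```python
def aggregate_utility(weights, division, delta, t):
    """Discounted linear-additive utility: delta**t times the dot
    product of issue weights and issue divisions (cf. Eq. 2.1)."""
    return delta ** t * sum(w * x for w, x in zip(weights, division))

# Three issues weighted 0.5 / 0.3 / 0.2, evaluated two rounds in.
u = aggregate_utility([0.5, 0.3, 0.2], [1.0, 0.5, 0.0], delta=0.95, t=2)
```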
The action space is defined by three possible actions: accept, reject, and offer. Offers are made after rejections, and should an agent choose to accept an offer, the negotiation ends. Each issue is often normalized such that x_i \in [0, 1]. For games with only one issue, the offer consists of the division of one pie. For multiple issues, offers are represented as vectors x, subject to x_i \in [0, 1] for every issue i. For this dissertation, the outcome space is assumed to be continuous, linear, and normalized.
2.1.3 Outcome Spaces
Each player has a preference ordering, called a preference profile, over all possible outcomes. An outcome \omega_1 is weakly preferred to \omega_2 if U(\omega_1) \geq U(\omega_2), which is denoted \omega_1 \succeq \omega_2. Similarly, \omega_1 is strictly preferred to \omega_2 (denoted \omega_1 \succ \omega_2) if U(\omega_1) > U(\omega_2). For linear additive utilities, the preference profile can be inferred directly from the weights.
Now we present the metrics used to evaluate our three criteria. An outcome is called Pareto optimal if there exists no other outcome that a player would prefer without worsening their opponent's outcome. Formally, \omega is Pareto optimal if there is no \omega' such that U_i(\omega') \geq U_i(\omega) for all players i, with strict inequality for at least one player.
The Pareto frontier describes all Pareto-optimal outcomes, which we denote as P. When an offer is not Pareto optimal, then through negotiation there is potential to reach a better outcome without either player conceding anything.
There are two other useful metrics. Let P denote the set of outcomes that are Pareto optimal. The bid distribution D denotes the mean distance of an agent's bids \omega_1, \dots, \omega_N to the Pareto frontier, shown in Eq. 2.2. A high bid distribution indicates bids are, on average, far from the frontier.

(2.2)  D = \frac{1}{N} \sum_{t=1}^{N} \min_{p \in P} \lVert \omega_t - p \rVert
Usually, simultaneous maximization of outcomes is not possible, as there is a region of disagreement between players. Another useful metric is the product of utilities U_A \cdot U_B, known as the Nash product. A fair outcome is often characterized using the Nash solution: the outcome that maximizes the product of utilities, shown in Eq. 2.3.

(2.3)  \omega^* = \arg\max_{\omega \in \Omega} U_A(\omega) \cdot U_B(\omega)
2.2 Strategies
In cases of perfect information, it is possible to determine the optimal bidding strategy [fatima2014principles]. However, as previously mentioned, perfect information is unlikely in bargaining, as agents are unwilling to give away their preferences for fear of exploitation. This motivates the development of negotiation tactics under imperfect information. These negotiation tactics can broadly be classified as time-dependent or behavior-dependent tactics, based on a decision function that maps state to a target utility.
2.2.1 Baseline Strategies
Two baselines are commonly used. The Hardliner always bids the maximum utility for itself, emulating a "take-it-or-leave-it" attitude. The Random Walker bids randomly, serving as a standard baseline.
2.2.2 Time-dependent Strategies
Time-dependent strategies produce offers based solely on time. At every round, the agent calculates its decision utility, which determines whether it accepts an offer or not. For time-dependent agents, this is:
(2.4)  u(t) = P_{\min} + (P_{\max} - P_{\min})(1 - F(t))

where P_{\min} and P_{\max} parametrize the range of the offers. Frequently, F(t) is parametrized as an exponential function:

(2.5)  F(t) = k + (1 - k)\left(\frac{t}{T}\right)^{1/e}

where e is the concession factor. k is often set to 0 for simplicity. Fig. 2.1 shows the decision utilities of different agents. If e < 1, the agent concedes towards the end and is known as Boulware. Otherwise, if e > 1, the agent concedes quickly, offering its reservation value early, and is known as a Conceder. With e = 1, the agent's decision utility decreases linearly.
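A minimal sketch of the time-dependent decision utility, assuming the standard parametrization u(t) = P_min + (P_max - P_min)(1 - F(t)) with F(t) = k + (1 - k)(t/T)^(1/e):

```python
def decision_utility(t, T, e, p_min=0.0, p_max=1.0, k=0.0):
    """Target utility of a time-dependent agent at round t."""
    f = k + (1.0 - k) * (t / T) ** (1.0 / e)
    return p_min + (p_max - p_min) * (1.0 - f)

T = 20
boulware = decision_utility(10, T, e=0.2)  # e < 1: concedes only near the deadline
conceder = decision_utility(10, T, e=5.0)  # e > 1: concedes almost immediately
```

Halfway through the game, the Boulware agent still demands nearly everything while the Conceder has already dropped close to its reservation value.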
2.2.3 Behavior-based Strategies
Behavior-dependent and imitative bidding strategies observe the behavior of the opponent to decide what to offer and what to accept. The most well-known is tit-for-tat, which produces cooperation through reciprocity. Its three central mantras are: 1) never defect first (play nice as long as the opponent plays nice), 2) retaliate if provoked, and 3) forgive after retaliation.
In negotiation, the relative tit-for-tat (TFT) strategy reciprocates by offering concessions proportional to the opponent's concessions from \delta rounds prior:

(2.6)  x^{t+1}[i] = \min\left(\max\left(\frac{x_{opp}^{t-2\delta}[i]}{x_{opp}^{t-2\delta+2}[i]} \, x^{t-1}[i], \; \min_i\right), \; \max_i\right)

Here, x^{t+1}[i] is the offer for issue i. This value is determined by the ratio of the opponent's prior concessions, which scales the agent's own prior offer x^{t-1}[i]. The min and max values ensure offer values stay within range.
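The reciprocation rule can be sketched as follows; for illustration we assume the opponent's offers are recorded as the share offered to us on each issue, with one step of memory:

```python
def relative_tft_offer(own_prev, opp_older, opp_newer, lo=0.0, hi=1.0):
    """Relative tit-for-tat sketch: scale our previous offer on each
    issue by the ratio of the opponent's two most recent offers, then
    clamp to the feasible range. If the opponent concedes (our share in
    their offers grows), the ratio < 1 and our own demand shrinks."""
    return [min(max((o_old / o_new) * own, lo), hi)
            for own, o_old, o_new in zip(own_prev, opp_older, opp_newer)]

# Opponent doubled our share on issue 0 (0.2 -> 0.4): we halve our demand.
offer = relative_tft_offer([0.9, 0.5], [0.2, 0.4], [0.4, 0.4])
```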
2.3 State-of-the-Art in Negotiation
Machine learning methods in the domain of negotiation can be broadly separated into the following types: Bayesian learning, nonlinear regression, kernel density estimation, and artificial neural networks. These methods have been applied mostly to model an opponent's (acceptance and bidding) strategy, then derive an analytic response. This is because if an agent knows the opponent's bidding strategy, then the agent can compute its optimal strategy [baarslag2016learning].
For estimating the opponent's acceptance strategy, techniques can be siloed by the individual variables they estimate. Zeng and Sycara provide a popular and intuitive Bayesian approach for estimating the reserve price using historical data. The model generates a set of hypotheses on the opponent's reserve price, then attaches a likelihood to each using the history. The estimate is a weighted sum of the hypotheses based on their likelihoods [zeng1998bayesian]. This technique has been adapted to estimate the deadline for time-dependent tactics [sim2008blgan]. In general, acceptance-strategy estimation uses some form of Bayesian learning [sycara1997benefits, yu2013adaptive, sim2007adaptive, gwak2010bayesian, ren2002learning], augmented with nonlinear regression [agrawal2009learning, yu2013adaptive, sim2008blgan, haberland2012adaptive, hou2004predicting], kernel density estimates [farag2010towards, oshrat2009facing, coehoorn2004learning], polynomial interpolation [saha2005modeling, matwin1991genetic, jazayeriy2011learning], and more recently neural networks [fang2008opponent].
In contrast, neural methods have been applied much more aggressively to the bidding strategy [baarslag2016learning]. In simpler cases where the general bidding formula is known, regression is sufficient, as the problem reduces to parameter estimation. If no formula is known, then neural networks are employed to approximate the opponent's bid strategy, typically using a large database of bid history. Oprea [oprea2002adaptive] uses a time-series approach on single-issue negotiations, taking in only the opponent's current bid. By 2008, there were early efforts at opponent move prediction using neural networks [carbonneau2008predicting], focused on predicting human bidding strategies. This was particularly relevant in e-commerce and supply chain management, as forecasting bids is useful in determining automated strategies [lee2009neural, carbonneau2011pairwise, moosmayer2013neural]. When the domain is general, researchers have found success using deep learning with multilayer perceptrons. Masvoula shows reliable predictions using single deep networks both with and without historical knowledge [masvoula2005design, masvoula2011predictive]. Papaioannou and Rau et al. have shown that the concession factor and weight of each issue can be predicted when the opponent is known to be time-dependent, using multilayer neural nets [papaioannou2008neural, rau2006learning] or single-layer, radial-basis-function neural nets [papaioannou2011multi].
Reinforcement learning approaches to negotiation began as early as the late 20th century, often denoted adaptive learning [rapoport1998reinforcement]. Today, DRL is more frequently used in conjunction with natural language processing [georgila2011reinforcement, cuayahuitl2015strategic]. Lewis et al. implement an end-to-end DRL negotiation dialogue generator [lewis2017deal]. They curated a set of human-human dialogues with Mechanical Turk, then trained on four gated recurrent units [cho2014properties], a type of recurrent architecture related to the long short-term memory net [neubig2017neural]. However, this study focuses on emulating human language, with less concern for optimality: for instance, their DRL agent attains 58.6% and 69.1% Pareto optimality against simple autonomous agents and humans respectively, on a very limited, discrete action space (around 200 offers).
As illustrated by this brief survey, there is an immense number of agent designs for negotiation. The primary weakness of the best-performing models, such as Bayesian models for acceptance or bid prediction, is that they require specific domain assumptions and architectures. Another weakness is that, for these negotiators to perform well in populations of different strategies, an additional opponent classifier is needed, which introduces further uncertainty. Additionally, opponents can use more complex behavioral strategies and mixed strategies (pure strategies associated with probabilities), which require higher levels of adaptability to play against.
All of this motivates an adaptive agent with a fixed architecture that can perform well against different opponents. An end-to-end negotiation agent is desirable, as the only required inputs are the offer, time step, and public knowledge, and it can adapt online during gameplay. Although deep learning often comes at the expense of explainability, a fixed architecture playing end-to-end means we do not need additional classifiers and assumptions about the opponent. The success of AlphaZero in chess is largely because it did not rely on handcrafted heuristics and assumptions like other engines [silver2017mastering]; likewise, Libratus learned to exploit specific human idiosyncrasies in poker [brown2018superhuman]. An end-to-end, adaptive neural agent is the analogous solution for negotiation. It is a convenient coincidence that the negotiation domain also aligns with continuous control, the current focus of interest in deep reinforcement learning.
3.1 Deep Reinforcement Learning
Multi-agent reinforcement learning is formally the study of n-agent stochastic games [shoham2003multi], described as a tuple (n, S, A_1, \dots, A_n, R, T). Here n is the number of agents, S is the set of states, and A_i is the set of actions agent i can take. In the most basic case, by treating the environment as static, the single-agent Q-learning algorithm developed by [watkins1992q] gives the optimal policy in an MDP with unknown reward and transition functions.

(3.1)  Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

Q(s, a) estimates the value of taking action a in state s, and \max_a Q(s, a) the value of the state when taking the best action. Extending this paradigm to multiple agents is difficult. One approach is to treat the environment as passive, each agent with its own reward and transition functions. However, this falsely assumes agent actions do not influence each other [sen1994learning]. Another approach is to define the value function over all agents' actions, but this introduces a dynamic programming challenge in updating Q.
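A tabular sketch of the Q-learning update (states, actions, and learning parameters illustrative):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

Q = defaultdict(lambda: defaultdict(float))
q_update(Q, "s0", "wait", 0.0, "s1")
q_update(Q, "s1", "accept", 1.0, "terminal")
q_update(Q, "s0", "wait", 0.0, "s1")   # now bootstraps from Q(s1, accept)
```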
In recent years, reinforcement learning has been applied successfully in conjunction with deep learning, using deep neural networks to approximate value functions. A breakthrough comes from policy-gradient methods. Traditionally, RL algorithms are action-value methods: after learning the values of actions, the algorithm selects actions based on the estimated action values. In contrast, policy-gradient methods learn a parametric policy without consulting the value function [sutton2018reinforcement]. By policy we mean an agent's strategy: what it does at a given state and time.
Additionally, in cases where the environment is dynamic, it may be optimal to acquire a stochastic policy— a probability distribution over possible actions. This distribution is updated to associate actions with higher expected rewards with higher probability values. Since probabilities can be over discrete or continuous action spaces, DRL is a useful control framework for negotiation, as the decision to accept or reject an offer is discrete, whereas bidding is on continuous space (on given issues).
3.1.1 Policy Gradients
Call the policy \pi and let parameters \theta define a probability distribution. The probability of action a is denoted \pi(a \mid s, \theta) = \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\}: the probability of taking action a at time t given state s and parameters \theta. Similarly, a learned value function, such as a neural network approximating the value, can be represented as \hat{v}(s, w), where w is its weights.
As with action-value RL, policy parameters are optimized to maximize a scalar performance measure J(\theta):

(3.2)  J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{k=t}^{T} R_k \right]

which describes the expected future aggregate reward (the sum of rewards from t until the end). The policy parameters are updated according to the gradient of J through gradient ascent: \theta_{t+1} = \theta_t + \alpha \nabla J(\theta_t).
For discrete actions, actions are selected by estimating a numerical preference value or logit h(s, a, \theta), based on the state, action, and parameter values (the weights in a neural net). Actions are then selected using the softmax distribution:

(3.3)  \pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}

For instance, for the acceptance strategy, an agent can accept or reject. Associating logits h(s, \text{accept}, \theta) and h(s, \text{reject}, \theta) with these actions defines a stochastic policy.
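The softmax policy of Eq. 3.3 in a few lines (the logit values are illustrative):

```python
import math

def softmax_policy(logits):
    """Turn action preferences h(s, a, theta) into probabilities."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(h - m) for h in logits]
    z = sum(exps)
    return [e / z for e in exps]

# e.g. logits for {accept, reject}
p_accept, p_reject = softmax_policy([2.0, 0.0])
```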
However, updating the policy with respect to J requires the policy-gradient theorem, which provides guaranteed improvements when updating the policy parameters [sutton2018reinforcement]. The theorem states that the change in performance is proportional to the change in the policy; a full statement is given in Appendix A.1. The theorem yields a canonical policy-gradient algorithm, REINFORCE [sutton2018reinforcement, willianms1988toward, sutton2000policy]. The parameter update is:

(3.4)  \theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}

where G_t is the observed return. Intuitively, the update is the return multiplied by the gradient of the action probability divided by the action probability. If G_t is high, this increases the probability of selecting that action in the future. Note, the policy gradient is often expressed as \nabla \ln \pi(A_t \mid S_t, \theta), which yields the fraction through the chain rule.
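For a softmax policy, the gradient of the log-probability with respect to the logits has a simple closed form, which makes the REINFORCE update easy to sketch (learning rate and values illustrative):

```python
def reinforce_logit_grad(probs, action):
    """Gradient of log pi(action) with respect to the softmax logits:
    d log pi(a) / d h_b = 1{a = b} - pi(b)."""
    return [(1.0 if b == action else 0.0) - p for b, p in enumerate(probs)]

def reinforce_update(logits, probs, action, reward, lr=0.01):
    """One REINFORCE step: theta += lr * G * grad log pi(a)."""
    g = reinforce_logit_grad(probs, action)
    return [h + lr * reward * gi for h, gi in zip(logits, g)]

# Rewarded action 0 becomes more probable, action 1 less so.
new_logits = reinforce_update([0.0, 0.0], [0.5, 0.5], action=0, reward=1.0)
```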
3.1.2 Deep Reinforcement Learning for Continuous Variables
Actor-critic models are also useful because they separate the policy space and action space, which means policy selection can occur on a continuous domain. For instance, in a univariate control problem, the action can be sampled from a normal distribution. The policy approximation with a normal distribution is:

\pi(a \mid s, \theta) = \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} \exp\left(-\frac{(a - \mu(s, \theta))^2}{2\sigma(s, \theta)^2}\right)

During backpropagation, the parameters are updated such that \mu and \sigma reflect a better reward, using Equation 3.4.
3.2 ActorCritic Implementation
We have arrived at our main method. Unlike REINFORCE, which only learns a policy, actor-critic models simultaneously learn a value-function approximation and a policy. Intuitively, the value function critiques whether an action undertaken by the policy is good relative to expectations, rather than being an absolute measure. Thus, we modify Equation 3.4, substituting the return with the temporal-difference error based on the value estimate, in Equation 3.5:

(3.5)  \theta_{t+1} = \theta_t + \alpha \left( R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \right) \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}

The process of negotiation thus requires two actor-critic nets: one for the acceptance strategy and another for the offer strategy. The algorithmic procedure is shown in Fig. 3.1, with pseudocode provided by Algorithm 3 in the Appendix. We use univariate and single-issue interchangeably, as with multivariate and multi-issue. Next, we describe the architectures of the acceptance and bidding strategies.
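Numerically, the substitution amounts to computing a one-step temporal-difference error and reusing it for both networks; a minimal sketch with illustrative values:

```python
def actor_critic_step(value, next_value, reward, gamma=0.95):
    """One-step TD error used in place of the raw return:
    delta = r + gamma * V(s') - V(s). The actor scales its log-prob
    gradient by delta; the critic regresses V(s) toward r + gamma * V(s')."""
    td_error = reward + gamma * next_value - value
    critic_loss = td_error ** 2          # MSE of the temporal difference
    return td_error, critic_loss

delta, loss = actor_critic_step(value=0.4, next_value=0.5, reward=0.1)
```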
3.2.1 Acceptance Net Architecture
The first neural network approximates the acceptance strategy. For the univariate case, the input is a two-element vector consisting of the opponent's offer and the current time step. For the multivariate case, the input is four-dimensional.
At every time step, Accept Net takes in the opponent's offer and encodes it to a 512-dimensional hidden state using two affine-ReLU6 pairs. This base layer is shared between the actor and value networks, which facilitates a shared representation [mnih2016asynchronous]. The actor takes in the embedded state and outputs two logit values, which are softmaxed to choose the appropriate action. Similarly, the value network outputs the expected-reward estimate.
The ReLU6 is a variant of the ReLU function (\max(0, x)), capped at 6: \min(\max(0, x), 6). ReLU6 layers have been shown to train faster (due to the limit on byte representation) and to encourage the learning of sparse features earlier on [krizhevsky2010convolutional]. This is important since gameplay is path-dependent and, against a mixed set of opponents, states may be sparse, which we confirmed during preliminary testing. Hyperparameters were also chosen through testing, reducing the number of layers and hidden states until training behavior changed. The full architecture is shown in Fig. 3.2.
After playout, the critic loss is calculated by taking the mean-squared error (MSE) of the temporal difference: the difference between the observed rewards and the value network's forward pass. Learning parameters are given in Table 3.1.
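The affine-ReLU6 building block can be sketched directly (the weights and biases below are illustrative; the real layers are trained and 512 units wide):

```python
def relu6(x):
    """ReLU capped at 6: min(max(x, 0), 6)."""
    return min(max(x, 0.0), 6.0)

def affine_relu6(inputs, weights, bias):
    """One affine-ReLU6 pair: a linear map followed by the capped ReLU."""
    return [relu6(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)]

# Toy two-unit layer on an (offer, time-step) input.
h = affine_relu6([0.7, 0.35], [[1.0, -2.0], [4.0, 4.0]], [0.5, 0.0])
```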
3.2.2 Offer Net Architecture
Next, assuming the agent has rejected the offer, it takes in the same input and decides on a counter-offer. Since we have three issues and offers operate on a continuous space, the Offer Net must output a vector of three issue values. To do so, we implement DRL with continuous control, sampling from three types of distributions: 1) a multivariate Gaussian, 2) three beta distributions, and 3) three Cauchy distributions.
The multivariate Gaussian is parametrized by a vector of means \mu and a covariance matrix \Sigma. However, a common assumption in deep learning is that the neural network will capture interdependencies between variables. Hence, an estimate of the individual standard deviations along each dimension suffices. The probability density and policies are given explicitly below.
(3.6)  \pi(a \mid s, \theta) = \prod_{i=1}^{3} \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\frac{(a_i - \mu_i)^2}{2\sigma_i^2}\right)
The beta distribution is defined on the interval [0, 1] and parametrized by two positive shape parameters \alpha and \beta. This is useful as offers are held to a finite span. The PDF is:

(3.7)  f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}

where \Gamma denotes the gamma function (the continuous extension of the factorial). Useful properties of the beta distribution include its intuitive mean and relatively simple expression for variance:

(3.8)  \mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}, \qquad \mathrm{Var}[X] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
Lastly, the Cauchy distribution is parametrized similarly to the normal, with x_0 and \gamma denoted more generally as location and scale. Its density function is:

(3.9)  f(x; x_0, \gamma) = \frac{1}{\pi\gamma \left[ 1 + \left( \frac{x - x_0}{\gamma} \right)^2 \right]}
The neural architecture follows the same actor-critic model described in Section 3.2. A base layer feeds into the value network and six other neural blocks, estimating the distribution parameters: three means (locations) and three sigmas (scales). For the beta distribution, these are estimates of \alpha and \beta. These six variables are then used to sample the offer, which serves as the action used during loss calculation.
(3.10)  a \sim \pi(\cdot \mid s, \theta)
The value network consists of seven affine-ReLU6 layers. Neural estimates of the means were conducted with two affine-ReLU6 layers, followed by an affine-sigmoid layer to constrain the output between 0 and 1. Sigma estimates used one affine-ReLU6 layer and one affine-sigmoid layer. For the beta distribution, the network estimated three pairs of \alpha and \beta. Since \alpha, \beta > 0, the final sigmoid layer was replaced with a ReLU layer. Fig. 3.3 shows the architecture in full. Apart from a justification similar to that for Accept Net, ReLUs also have documented success in continuous control [lillicrap2015continuous]. Hyperparameters were chosen in a similar way.
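The sampling and log-density computations for two of the policy heads can be sketched as follows (scale values are illustrative); note that the Gaussian log-density indeed turns positive for small sigma:

```python
import math
import random

def cauchy_sample(loc, scale, u=None):
    """Inverse-CDF sampling for the Cauchy distribution."""
    u = random.random() if u is None else u
    return loc + scale * math.tan(math.pi * (u - 0.5))

def cauchy_logpdf(x, loc, scale):
    return -math.log(math.pi * scale * (1.0 + ((x - loc) / scale) ** 2))

def gaussian_logpdf(x, mu, sigma):
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))

# One three-issue offer sampled around illustrative location estimates.
offer = [cauchy_sample(m, 0.05) for m in (0.5, 0.3, 0.2)]
```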
Training was undertaken using Adam, with learning parameters given jointly in Table 3.1. The exact computations for backpropagation are given in Algorithm 2. Note that because backpropagation occurs on a continuous domain, log-probabilities of continuous density functions can be positive when the variance is small.
Here, the entropy is defined as H(\pi) = -\mathbb{E}_\pi[\ln \pi]. Adding entropy introduces noise that enables action exploration. Also, high variance means higher loss, so over time the variance decreases to improve the precision of the evolved strategy.
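Assuming a Gaussian policy head, the differential entropy has the closed form 0.5 ln(2*pi*e*sigma^2), which makes the exploration effect of the bonus explicit: wider policies earn a larger bonus.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * ln(2 * pi * e * sigma^2).
    It grows with sigma, so an entropy bonus in the loss rewards wider,
    more exploratory policies."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

wide, narrow = gaussian_entropy(0.5), gaussian_entropy(0.05)
```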
Network                          Learning Rate    Epochs    Optimizer
Accept Net                       – to –           8000      Adam
Offer Net (Gaussian & Cauchy)    – to –           4000      Adam
Offer Net (Beta)                 –                5000      Adam
Self-Play                        –                3000      Adam
Tit-for-tat                      –                5000      Adam
3.2.3 Reward Scheme
Accept Net and Offer Net share the same reward scheme. With deadline T, value weights w, and final offer x, the reward given to the neural agent is the discounted utility of the final agreement, or the conflict-deal payoff if no agreement is reached by T.
This reward function encourages the agent to raise its demands, but not so much that it forces a conflict deal and receives a low reward. Unless otherwise specified, these parameters are held fixed across experiments.
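A sketch of such a reward scheme, with an illustrative conflict-deal penalty (the exact constants are not reproduced here):

```python
def reward(final_offer, weights, t, T, delta, conflict_reward=-1.0):
    """Sketch of the shared reward scheme: discounted utility of the
    agreed offer, or a fixed penalty if the deadline passes without
    agreement. The conflict_reward value is illustrative, not the
    thesis's exact constant."""
    if final_offer is None or t > T:      # no agreement: conflict deal
        return conflict_reward
    return delta ** t * sum(w * x for w, x in zip(weights, final_offer))

r = reward([0.8, 0.5, 0.3], [0.5, 0.3, 0.2], t=3, T=20, delta=0.95)
```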
3.2.4 Self-Play: Characterizing Behaviors with Game Theory
Within bargaining game theory, a focus has been on how mechanisms induce norms of fairness, particularly in a branch of game theory called evolutionary game theory (EGT). EGT originates from biology, where it studies the dominance of species under evolutionary pressure, and has been extended to behavioral economics to understand the evolution of behavioral traits. Nowak et al. showed fairness could be induced if reputation was taken into consideration in the repeated ultimatum game [nowak2000fairness], where agents play many one-round negotiations with other agents.
Their most important finding was that, if agents learned to reject offers they deemed too low, a population of fair agents would emerge. Thus, reputation refers to the trait of intentionally rejecting low offers, and its effect has been confirmed with computational and empirical results [rand2013evolution]. We implement a similar study to compare results, using the neural actor-critic model instead of evolutionary methods, and on multi-round negotiation rather than the ultimatum game. Details of the implementation are given in Section 4.4.2, where we demonstrate a similar emergence of fairness.
3.2.5 Against Behavior-Based Agents
Lastly, we train our agent against two behavior-based agents. The first is the relative tit-for-tat agent described in Section 2.2.3, with the decision function in Eq. 2.6. We additionally implement a Bayesian tit-for-tat by estimating the opponent's value weights. The Bayesian tit-for-tat agent first measures the opponent's concession using its own utility function. Then, it mirrors the amount of concession. Finally, this offer is made as attractive as possible using a Bayesian opponent model [baarslag2013tit].
To do this, we first take the ratio of the opponent's offers at successive steps to update the decision utility. If the opponent concedes, then we concede; if they increase their share, we increase ours. Then, we estimate the opponent's utility weights as the mean value of their offers. For instance, given two successive opponent offers, the weights are estimated as their mean:
(3.11)  \hat{w}_i = \frac{1}{N} \sum_{t=1}^{N} x_t[i]
We then use the Simplex algorithm [dantzig1955generalized] to maximize this estimated utility, fixed at the decision utility calculated prior. While this assumes the opponent makes concessions in a particular (preference-based) fashion, it remains a question whether the neural agent can uncover the correct concessions to make.
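The mean-value weight estimate can be sketched as follows (normalizing the means to sum to one is our assumption for illustration):

```python
def estimate_weights(opponent_offers):
    """Estimate the opponent's issue weights as the normalized mean of
    their offers so far: issues the opponent consistently demands more
    of are assumed to matter more to them."""
    n = len(opponent_offers)
    means = [sum(o[i] for o in opponent_offers) / n
             for i in range(len(opponent_offers[0]))]
    total = sum(means)
    return [m / total for m in means]

w = estimate_weights([[0.9, 0.3, 0.6], [0.7, 0.1, 0.4]])
```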
4.1 Theoretical Decision Utilities
4.1.1 Utility Moments
Before proceeding to results against time-based agents, we first derive the theoretically optimal strategies. Denote the decision utility of the time-based opponent as

u_o(t) = 1 - \left(\frac{t}{T}\right)^{1/e}

This is in essence the same as Equation 2.5, where in our case the reserve price is 0 and utility is normalized to [0, 1]. Then our utility upon acceptance at time t is:

(4.1)  U(t) = \delta^t \left( 1 - u_o(t) \right) = \delta^t \left(\frac{t}{T}\right)^{1/e}
The maximal point must be one where the marginal utility is 0. Before taking the derivative of U, we first differentiate (t/T)^{1/e}:

\frac{d}{dt}\left(\frac{t}{T}\right)^{1/e} = \frac{1}{et}\left(\frac{t}{T}\right)^{1/e}

We then solve for our marginal utility by the product rule:

(4.2)  U'(t) = \delta^t \ln\delta \left(\frac{t}{T}\right)^{1/e} + \delta^t \frac{1}{et}\left(\frac{t}{T}\right)^{1/e}

Setting the reserve price to 0 in our experiments, we can derive a much more elegant expression by factoring:

(4.3)  U'(t) = \delta^t \left(\frac{t}{T}\right)^{1/e} \left( \ln\delta + \frac{1}{et} \right)
It is a simple matter to check that the second derivative is negative; hence the condition for the maximal point is

(4.4)  \ln\delta + \frac{1}{et^*} = 0

Interestingly, this value does not depend on the total time T, and since the opponent's decision utility is a linear transformation of F(t), this optimal time depends only on the concession factor and the discount rate. The optimal stopping time can be expressed as:

(4.5)  t^* = -\frac{1}{e \ln\delta}
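Assuming our discounted utility takes the form U(t) = delta^t (t/T)^(1/e), the stationarity condition gives t* = -1 / (e ln delta), which is easy to evaluate numerically:

```python
import math

def optimal_stopping_time(e, delta, T):
    """Optimal acceptance time against a time-based opponent, under the
    assumption U(t) = delta**t * (t / T)**(1 / e). Setting U'(t) = 0
    gives t* = -1 / (e * ln(delta)), independent of T (capped at the
    deadline)."""
    return min(T, -1.0 / (e * math.log(delta)))

t_star = optimal_stopping_time(e=1.0, delta=0.9, T=20)
```

Faster-conceding opponents (larger e) and steeper discounting both pull the optimal acceptance time earlier.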
The theoretical values are shown in Fig. 4.1. A strong phase transition occurs along the optimal-stopping curve, demarcated by the clearly lighter region.
4.1.2 Outcome Space
In the field of automated negotiation, preferences are typically visualized through an outcome-space plot. The axes are the utilities of Players 1 and 2, and each possible outcome \omega is mapped to the point (U_1(\omega), U_2(\omega)). Fig. 4.2 shows this plot for our negotiation process. In a), the Pareto frontier is shown by the rightmost edges of the polytope.
By theorems of fixed points and the simplex algorithm [wong2015bridging, dantzig1955generalized], the vertices in the outcome space must come from the vertices in the action space. Intuitively, the action-space vertices that outline the frontier are the points offering the greatest marginal utilities to P1 and P2, based on their value weights. The piecewise equation for the Pareto frontier is given in Equation 4.8:
(4.8) 
This equation allows us to calculate the bid distribution and determine the efficiency of an agent's bid strategy, as provided in Section A.2 of the Appendix. Furthermore, the Nash solution, the offer maximizing the product of utilities, provides a benchmark for fairness when playing against behavior-based agents.
4.2 Acceptance Strategy
4.2.1 Behavioral dynamics: Cliff-walking vs. optimal play
The central question for an acceptance strategy is whether to accept a given offer or wait for potentially better future offers. However, if the agent fails to accept before the deadline, the conflict deal is enacted and neither agent receives any reward. Given a discount rate and an opponent concession factor, the goal is to find the best moment to accept an offer, inferring it from the opponent's prior offers.
Thus, the acceptance strategy can be seen as an optimal stopping problem with an additional cliff-walking problem to solve. Fig. 4.3 shows the loss, rewards, and playing time as the network trains against a linear agent with no discounting. Through stochastic sampling of new points, the agent notices greater reward from waiting, illustrated by gradual trends in playing time (green). However, once the agent reaches the deadline at 20 rounds, the conflict deal is enacted, producing a large loss. We present only the multivariate case, as results for the univariate case are the same but with lower complexity.
To analyze the stopping time, we consider the evolution of acceptance probabilities during gameplay against timebased opponents. Fig. 4.4 shows the logit values used in Eq. 3.3 and acceptance probabilities against Boulware, Linear, and Conceder agents.
The first row shows the acceptance probabilities at each time step. The cumulative probability (orange) denotes the likelihood that the game has ended by a certain time step, given as:

(4.9)  P(t_{\text{end}} \le t) = 1 - \prod_{s \le t} (1 - p_s)

where p_s is the acceptance probability at step s. Since the discount rate is 1, the optimal strategy is to wait until the final point in time. The decrease in stoppage time shown by the right-shifting cumulative probability is suboptimal, although this is not uncommon in conservative agents. In the value function (blue, second row), there is also a slight decrease after the logit values cross. This indicates that the expected reward at these times may be roughly the same.
Another way to see this is to consider the marginal utility over time. Since Boulware agents only concede towards the end, the neural agent is forced to wait to achieve comparable results, whereas it may be "satisfied" earlier against Conceders. The marginal utility against the Boulware agent is thus much greater towards the end, whereas the marginal utility is high at the beginning against Conceders. Explicitly, for \delta = 1, the expression for marginal utility is:

(4.10)  U'(t) = \frac{1}{et}\left(\frac{t}{T}\right)^{1/e}
This analysis is corroborated further once we introduce the discount rate. Fig. 4.5 shows the acceptance probabilities and logits once discounting is introduced.
The red curve retains positive marginal utility throughout; under a stronger discount, the marginal utility is negative from time step 7 onwards; and for another discount setting, the utility function is relatively flat after time step 10, which means the marginal utility is close to 0. Here, we observe the greatest time deviation.
4.2.2 Marginal Analysis: Marginal Utility determines Error
Due to the stochastic nature of deep learning, it is difficult to construct a precise mathematical proof of how changes in marginal utilities push against each other. However, we can test this empirically. The neural agent played against a set of agents with concession factors of 0.95, 1.5, 2, 3, 5, and 10. In Fig. 4.6a), the curves show our max utility (Equation 4.1) and the red dot shows the optimal stopping time given by Equation 4.5. Note that as the concession factor increases, the curves grow sharper and the magnitude of the second derivative strictly increases.
Fig. 4.6b) shows an inverse relationship between the time error and the reward error. The "peakier" the curve, the more likely the neural net selects the optimal time. However, deferral by even one time step then causes a large loss of utility, hence the larger reward error. In contrast, using the second derivative derived in Equation 4.6, we observe in Fig. 4.6c) that as the second derivative approaches 0, the time error increases.
Having shown what produces the reward and time errors, we can address our subproblem about limitations. For future work, we may dynamically reduce the learning rate using the second derivative and distance to the deadline for better convergence. Numerical results are summarized in Table 4.1.
Table 4.1: Time error, reward error, and second derivative of the utility curve (columns ordered by increasing concession factor).

Time Error       3.95    11.62   7.81    5.86    4.99    3.0     1.0
Reward Error     0.0234  0.712   1.054   1.278   1.366   1.302   0.907
Second Deriv.    0.0722  0.094   0.152   0.223   0.389   0.777   1.886
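The second-derivative values of the kind reported in Table 4.1 can be estimated by central finite differences; the curve below is an assumed stand-in, not Equation 4.1 itself.

```python
# Central finite-difference estimate of the second derivative -- the
# quantity reported in Table 4.1 (the curve is an assumed stand-in
# for the max-utility curve of Eq. 4.1).
def second_derivative(f, t, h=1e-4):
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

f = lambda t: (t / 20.0) ** 2            # an assumed utility curve
assert abs(second_derivative(f, 10.0) - 0.005) < 1e-5   # exact: 2 / 20**2
```

Sweeping this estimator over the opponent's concession curves reproduces the monotone growth of the bottom row of Table 4.1 as the curves sharpen.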
4.2.3 Preference-based concessions produce fairer outcomes
Finally, we consider optimality. Since the final offers depend on the time-based agent, so do the optimality measures. Thus, the way the opponent agent algorithmically constructs its offers will appear differently in the outcome space. Fig. 4.7 shows the distribution of accepted offers after 400 gameplays for two construction methods. The first randomly samples from the plane that satisfies the following condition:
$$\sum_i w_i\, u_i(o_t) = u_d(t) \qquad (4.11)$$
where $u_d(t)$ is the decision utility at time $t$ and $w_i u_i$ is the weighted utility for issue $i$. The second uses a preference-based, monotonic concession strategy: it satisfies Equation 4.11, but concedes starting from the issue it values the least. Multivariate Gaussian noise with a small standard deviation is added to prevent deterministic offers. When the time-based agent uses a preference-based, monotonic concession strategy, its offers are guaranteed to lie on the Pareto Frontier.
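The two offer-construction strategies can be sketched as follows; the issue weights and the utility target are illustrative assumptions, not the experiment's exact values.

```python
import random

# Sketch of the two offer-construction strategies around Eq. 4.11.
# The weights W and the utility target are illustrative assumptions.
W = [1.0, 2.0, 3.0]   # issue weights, ordered least- to most-valued

def planar_sample(target):
    """Randomly sample an offer on the plane sum_i W[i] * x[i] = target."""
    while True:
        x0, x1 = random.random(), random.random()
        rest = (target - W[0] * x0 - W[1] * x1) / W[2]
        if 0.0 <= rest <= 1.0:
            return [x0, x1, rest]

def monotonic_concession(target):
    """Meet the target while conceding the least-valued issue first."""
    offer = [0.0] * len(W)
    remaining = target
    for i in sorted(range(len(W)), key=lambda i: -W[i]):  # most-valued first
        offer[i] = min(1.0, max(0.0, remaining / W[i]))
        remaining -= W[i] * offer[i]
    return offer

offer = monotonic_concession(4.0)   # keeps the top issue, splits the next
assert abs(sum(w * x for w, x in zip(W, offer)) - 4.0) < 1e-9
```

As the target utility falls over time, the monotonic variant reduces its demand on the least-valued issue first, which is what pins its offers to the Pareto Frontier.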
Table 4.2: Mean outcomes per offer-construction strategy.

Strategy                       Bid Distrib.  Dist. to Pareto  Av. Reward  Av. Time
Planar Samp.                   1.64          0.54             1.84        4.89
Preference-based Concession    0.46          0.00             2.61        5.7
Pure Random                    1.92          1.24             N/A         N/A
The preference-based method produces offers that lie on the Pareto Frontier. Because of this optimality, the neural agent plays for a longer time on average when its opponent follows this strategy. Random planar sampling yields considerably better results than pure random sampling, with a smaller bid distribution (Table 4.2). The magenta points in Fig. 4.7 arise because, for a fixed concession factor, the decision utility at each time step is fixed. Table 4.2 summarizes the mean outcomes of the gameplays, with the preference-based concession performing best.
4.3 Bidding Strategy
4.3.1 Precision in Single-Issue Negotiation
Before evaluating performance on the multivariate case, we verify the univariate case. Fig. 4.8 shows the action policies given by Cauchy distributions for specific decision utilities. As the concession factor increases, the distribution of means shifts from right-skew to left-skew. This can be attributed to the magnitude of the marginal utility. Cauchy means are clustered tightly for the Boulware agent when $t$ is low, as the marginal utility is low early on. However, as time passes and the Boulware agent begins to concede greatly, the distance between means increases. Conversely, when the opponent is a Conceder, concession begins early, so marginal utility is large when $t$ is small, leading to right-skew. The Conceder case is not as pronounced as the Boulware case due to the cliff at the deadline. As expected, means are spaced out linearly against linear agents. In sum, the change in decision utility affects the distribution of the means. For completeness, Fig. A.1 in the Appendix shows a heat map of how the neural agent's utility changes with respect to the opponent's.
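Sampling a continuous offer from a Cauchy policy can be sketched with inverse-CDF sampling; the location and scale values here are illustrative, not the trained network's outputs.

```python
import math, random

# Sketch: sampling a continuous action from a Cauchy policy, as the
# offer network does (location and scale values are illustrative).
def sample_cauchy(loc, scale):
    # Inverse-CDF sampling: x = loc + scale * tan(pi * (u - 0.5))
    u = random.random()
    return loc + scale * math.tan(math.pi * (u - 0.5))

random.seed(0)
xs = [sample_cauchy(0.5, 0.05) for _ in range(10_000)]
near = sum(abs(x - 0.5) < 0.05 for x in xs) / len(xs)
far = sum(abs(x - 0.5) > 1.0 for x in xs)
assert near > 0.45   # half the mass sits within one scale of the center
assert far > 0       # yet the heavy tails still produce extreme samples
```

Exactly half of a Cauchy's mass lies within one scale parameter of its location, which is the "peaky center, heavy tails" property exploited by the bidding strategy.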
4.3.2 Multivariate Training Dynamics
Next, we present the multivariate case, with three negotiation issues. We trained on a grid of concession factors for fixed discount rates. Fig. 4.9 shows the first 1000 epochs, using a multivariate Cauchy distribution with no discount. Unlike training Accept Net, cliff-walking is less present, as the final action (accept) lies with the opponent, and time-based agents are very likely to accept. The agent quickly learns to wait longer, converging at a higher time step.
However, unlike Accept Net, this is not simply a binary action where rejecting an offer leads to the next round. The agent has to produce offers that induce rejection from the time-based agent. Fig. 4.10 shows how the Cauchy means vary over time. The first row shows the individual Cauchy means for the three issues. Also shown in light blue are the normalized utility (divided by the total possible utility of 6) and the equilibrium payout from 100 gameplays. The second row shows the opponent's decision function (blue) and our maximum utility at each time step. The value estimate given by the value net is shown in green. At the bottom, the red line shows the mean stoppage time, with the distribution of times shown with a kernel density estimate.
Since the cliff-walking aspect is not as prominent (the conflict deal is enacted only if the agent proposes the full amount at the end), all stoppage times are relatively high, although diminished stoppage time is still observed when the marginal utility is lower. For instance, against the linear agent the marginal utility is constant and the second derivative is zero, yielding a lower mean stoppage time than against the Boulware agent, whose marginal utility is increasing.
Note that one issue varies the most, either through concession (Fig. 4.10a)) or an increase of the offer value as shown in Fig. 4.10c). At a glance, this may be counterintuitive, since a change in either of the other issues would yield the most marginal gains for the agent. However, the opponent values this issue the most, which means it produces the largest gradient for the least loss during concession. As a result, the agent learns to negotiate close to or along the Pareto Frontier, which we show later in the distributional analysis. Secondly, there is a clear progression between Boulware and Conceder strategies when comparing the linear agent to the Boulware agent.
Additionally, in the bottom row, the green value function remains fairly constant throughout, until the drop-off towards the end induced by the deadline. The value function remains flat because the expected value is constant: so long as the neural agent sticks to its strategy, the payout will not change. While performance is not optimal, it achieves a large fraction of the optimum, which is typical of risk-averse agents whose behavior is generally conservative on estimates [sandholm1999bargaining].
Next, we compare this to the case when discounting is introduced, using the beta distribution as an example. Fig. 4.11 shows the gameplays of a neural agent using the beta distribution. The first row shows the evolution of the multivariate distribution means (blue, orange, and green for the three issues respectively), the evolution of the normalized utility (black), and the reward over 100 gameplays. The second row shows the theoretical maximums (orange) and mean stoppage time (red). Immediately, we observe that the agent begins sampling around the same initial values, then alters its strategy as it learns more about the opponent. The way it alters its strategy varies depending on its own inherent discount rate, which demonstrates adaptability.
Finally, the same issue again shows the most variation, by the same argument: changing it produces the highest gradients during gameplay, because the opponent values it most. The mean stoppage time is close to optimal, with a small mean deviation. To understand what causes this limitation, we analyze the probability distributions.
4.3.3 Offer Strategy requires Sensitivity to Variance
We compare the outcome space of agents using Gaussian, Cauchy, and beta distributions, after playing 3000 rounds against batches of mixed opponents. Fig. 4.12 shows the distribution of final offers given by each agent, with the addition of a random agent, when playing against a linear agent with no discounting. Since time-based agents make monotonic concessions, lower y-values imply longer gameplay times.
As expected, the random agent produces offers distributed randomly in the outcome space. At a glance, the beta distribution outcomes bear the most resemblance to the random agent. This is due to the beta distribution's initial high variance, which can be adjusted by increasing the constant added to the initial values of the shape parameters $\alpha$ and $\beta$. However, also note that the scattered points average greater than 3.
The normal distribution produces the most consistent results, clustered around the vertex. The Cauchy distribution performs similarly but better on average, achieving a higher average reward. However, it also has a much greater variance than the normal distribution. We conclude that convergence to optimal play requires sufficient initial variance to prevent convergence to local optima. Additionally, the single highest reward achieved by any of these distributions came from the beta distribution. These values are presented in Table 4.3.
Table 4.3: Outcomes by sampling distribution.

Distribution   Bid Distrib.  Dist. to Nash  Av. Reward  Av. Time  Reward Range
Rand. Samp.    1.901         1.246          NA          NA        NA
Beta           1.741         1.0018         3.587       10.11     5.378
Normal         1.585         0.0815         4.993       11.03     0.294
Cauchy         2.261         0.218          5.051       14.66     4.403
The normal distribution produces consistent results, with the lowest bid distribution and reward range, and is closest to the Nash Solution. In expectation, the Cauchy distribution produces better results, but with a higher bid distribution and a greater distance from the Nash Solution. However, a large distance from the Nash point is not necessarily bad, as exploiting the opponent's strategy leads to higher rewards. Without discounting, the neural agent can improve its own outcome simply by waiting.
The four panels in Fig. 4.12 reveal two opposing forces that make continuous DRL difficult in this domain: convergence to optima requires high variance, yet avoiding the conflict deal requires low variance. For further evidence, note that the time error decreases with the discount rate (compare Fig. 4.10 and Fig. 4.11). The discount rate shifts the optimum away from the cliff, so sampling around the optimum produces less error due to the slow change in marginal utility, and the smoother reward function allows more accurate function approximation [lillicrap2015continuous, sutton2018reinforcement].
The next step is to compare the variances of the three distributions and their sensitivity to parameter change. The normal's variance is directly parametrized by the action network. In contrast, the variance of the beta distribution depends on both shape parameters $\alpha$ and $\beta$:

$$\operatorname{Var}[X] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)},$$

which means large, simultaneous increases in both $\alpha$ and $\beta$ are required to lower the variance, leading to slow convergence. The Cauchy distribution, in contrast, famously has no defined mean, variance, or kurtosis, as the corresponding integrals do not converge.
This points to why the Cauchy distribution works better, all else held equal. It is parameterized as directly as the normal distribution, but is both heavier-tailed and peakier than the Gaussian. The slower decay in the tails means lower variance sensitivity, which helps avoid convergence to local optima. At the same time, the Cauchy distribution's much higher peak means there is less cost in accuracy when sampling.
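The beta distribution's sluggish variance response can be verified directly from its closed-form variance:

```python
# Variance of Beta(alpha, beta): lowering the variance requires large,
# simultaneous growth in BOTH shape parameters, hence slow convergence.
def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

assert abs(beta_var(1.0, 1.0) - 1.0 / 12.0) < 1e-12   # uniform case
# Doubling both parameters only roughly halves the variance:
assert 0.4 < beta_var(4.0, 4.0) / beta_var(2.0, 2.0) < 0.6
```

Because the network must grow both shape parameters together to tighten its policy, each gradient step buys relatively little variance reduction, unlike the normal or Cauchy, whose scale is a single directly learned output.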
In regards to our study's objectives: the results from Sections 4.2 and 4.3 show slow convergence when variance is high, and suboptimal convergence when variance is low due to a lack of action exploration, as the primary limitation. This suggests the learning rate can be adjusted through the marginal utilities, the distribution's kurtosis (peakiness and tail behavior), and the variance sensitivity, for faster convergence and more efficient outcomes. A more aggressive learning rate can offset distributions with lower variance sensitivity. Furthermore, variation in concession factor and discount rate yields different strategies from Offer Net, demonstrating adaptivity.
4.4 Self-Play: The Emergence of Fairness
So far, we have addressed subproblems related to training barriers and demonstrated exploitative capabilities against time-based agents. While play against time-based agents provides clear benchmarks due to monotonic time-based concession, play against behavior-based agents is required to evaluate behavioral traits such as fairness. In this section, we first present a game-theoretic framework for our games, then the results for single- and multi-issue self-play, then results against two variants of tit-for-tat agents.
4.4.1 Game-theoretic Framework
We introduce a few game-theoretic concepts required for in-depth behavioral analysis. An extensive game consists of a set of players $N$ and a set of sequences that denote possible game trajectories. A game tree describes this trajectory of states, round by round. A Nash Equilibrium (NE) denotes an outcome from which no player will willingly deviate. In extensive games, a strategy profile $s^*$ is an NE if, for every player $i$,

$$u_i(s_i^*, s_{-i}^*) \ge u_i(s_i, s_{-i}^*) \quad \text{for all } s_i \in S_i,$$

where $S_i$ denotes the strategy set of player $i$, and $(s_i, s_{-i})$ denotes a strategy profile in which each bracket contains a player's sequence of moves [osborne1994course].
In extensive games, the concept of a subgame describes a part of the game tree which functions as a game in itself [osborne1994course]. Fig. 4.14 shows the game tree of the centipede game, a canonical game in game theory, and Fig. 4.16 shows the game tree of a bargaining game. In this instance of the centipede, the total size of the pie increases at every time step, and the players can choose to wait or defect. Consider the rightmost node in Fig. 4.14, denoting P2's decision to cooperate or defect. Since defecting yields a higher payoff than cooperating, P2 will defect if rational. The subtree stemming from this node can then be reduced to its defection payoff.
Once P1 realizes P2 will defect, P1 will also defect, as this yields a higher reward. This process continues until P1 defects in round 1. The process of iteratively reducing up the tree is known as backwards induction, and the result is a subgame perfect Nash Equilibrium (SPNE), a type of NE that is also an equilibrium of every subgame. The SPNE in this game is thus immediate defection in round 1. Ironically, if both players waited until the end, they would receive higher rewards. Hence, for the centipede game, we expect cooperative agents to wait until the end, while rational agents defect at the beginning.
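Backwards induction on the centipede game can be sketched programmatically; the 0.75/0.25 payoff split for the defector and opponent is an illustrative assumption, not the figure's exact payoffs.

```python
# Backward induction on a linear centipede game. The pie grows by
# `step` each round; the payoff split (0.75 / 0.25 of the pie to the
# defector / opponent) is an illustrative assumption.
def centipede_spne(rounds=6, start=2.0, step=1.0):
    """Return the round at which backward induction predicts defection."""
    defect_round = rounds                 # the last mover surely defects
    for r in range(rounds - 1, 0, -1):
        pie = start + step * (r - 1)
        take_now = 0.75 * pie             # defect: take the larger share
        # If play continues, the opponent defects at the next (already
        # solved) node, leaving the current mover the smaller share.
        if_continue = 0.25 * (pie + step)
        if take_now >= if_continue:
            defect_round = r              # the equilibrium unravels
    return defect_round

# Rational play collapses to immediate defection, even though both
# players would earn more by waiting -- the tension discussed above.
assert centipede_spne() == 1
```

Each backward step replaces a subtree with its defection payoff, which is exactly the reduction described in the text.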
4.4.2 Univariate self-play results
Before training with self-play in the multi-issue domain, we consider a simplified version. Instead of offers over a continuous range, bidding actions are constrained to a binary decision: either a low or a fair offer. Thus, agents can either offer a low amount to their opponent (rational behavior) or a fair amount. Combined with the acceptance rule, the four choices can be summarized as:

- Offer low, reject nothing. This is typically the SPNE, thus rational (G1).
- Offer high, reject nothing. This is altruistic (G2).
- Offer high, accept only high. This agent is fair (G3).
- Offer low, accept only high. This one is often disregarded, as it is a hard-liner (G4).
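The four strategy types can be encoded as (offer, acceptance-threshold) pairs in a one-shot sketch of this mini-game; the low and fair split sizes below are illustrative assumptions.

```python
# The four binary strategies of this section, encoded as
# (offer made, minimum offer accepted). Split sizes are illustrative.
LOW, FAIR = 0.2, 0.5

STRATEGIES = {
    "G1": (LOW, 0.0),    # offer low, reject nothing (rational / SPNE)
    "G2": (FAIR, 0.0),   # offer high, reject nothing (altruistic)
    "G3": (FAIR, FAIR),  # offer high, accept only high (fair)
    "G4": (LOW, FAIR),   # offer low, accept only high (hard-liner)
}

def play(proposer, responder):
    """One ultimatum round; returns (proposer, responder) payoffs."""
    offer = STRATEGIES[proposer][0]
    threshold = STRATEGIES[responder][1]
    if offer >= threshold:
        return (1.0 - offer, offer)
    return (0.0, 0.0)        # rejection enacts the conflict deal

assert play("G2", "G3") == (0.5, 0.5)   # two fair agents split evenly
assert play("G1", "G3") == (0.0, 0.0)   # the fair agent's threat bites
```

The G1-versus-G3 matchup is the seed of the non-credible-threat dynamic analyzed later: the fair agent's rejection hurts itself, yet its presence pressures proposers toward fair offers.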
Fig. 4.17 shows the training results for the bargaining and centipede games. Note that by setting the discount rate greater than 1, the bargaining game effectively becomes a more complex version of the centipede game. For the bargaining game in Fig. 4.17a), the reward for P2 is initially low, then increases as playing time approaches a single round. Conversely, the reward for P1 decreases. We infer that P2 learns to reject P1's offer, and P1 learns to accept. This play is close to rationally optimal: if the agents were perfectly rational, the game would end immediately. However, P2 adopts a strategy that forces play to go on; we analyze the reason for this further in the multivariate case.
In the centipede game in Fig. 4.17b), the players learn to play close to 20 rounds, maximizing the "interest" accumulated as the pie grows. By the final round, P1 holds a mixed strategy yielding 47% rational offers and 53% fair offers. As the time series shown are running averages, the big dips mark brief spans where P1 adopts a fair strategy. These dynamics can be seen more clearly by observing how the decision logits evolve during gameplay. Fig. 4.18 shows the probabilities of giving rational and fair offers at each time step, and the probabilities of accepting an offer (for one game trajectory).
In both cases, agents start out giving low offers to their opponents. However, as time moves forward, the probability of a fair offer increases, raising the probability of acceptance. For the bargaining game (a), the stoppage time reaches its maximum close to the beginning. This suggests the network, through gameplay, learns outcomes similar to backward induction. Similarly, in the centipede game, the network learns to wait, leveraging "interest" and accepting near the deadline, which indicates cooperative behavior. Together, we conclude the neural agent learns to accept optimally and, as time moves forward, to shift its behavior from G1 (rational) to G3 (fair).
4.4.3 Multivariate Self-play
Now, we extend the analysis to the continuous, multi-issue case. Fig. 4.19 shows the training dynamics of the Offer Net and Accept Net, whose final rewards are plotted per epoch. Initially, the Offer Net (blue) has the higher reward: as long as some reward is given to the Accept Net, the Accept Net will accept it. However, around epoch 1600 the Accept Net learns to invoke the conflict deal, demarcated by the large drop in reward for both curves. After this, the Offer Net must concede by offering a fairer amount.
In sum, by including some probability of enacting the conflict deal, neural agents force counteroffers that are fairer. This departs from classical game theory as an example of a non-credible threat: an action that a perfectly rational agent would not carry out, as it would leave the agent itself worse off [osborne1994course]. The adoption of non-credible threats is also observable in the univariate centipede game, with periods of dips in P1's reward following the conflict deal. For discounted bargaining, convergence to low playing time suggests that heavy discounting acts as a similar threat: if you do not offer a fair deal, I will drag out the negotiation. By keeping non-credible threats part of a mixed strategy, fair outcomes can evolve.
This is significant because it agrees with results from evolutionary game theory. We previously mentioned in Section 4.4.1 that reputation produces fairness in the repeated ultimatum game. Nowak et al. showed this through the exact same simplified minigame (bids restricted to low and fair offers), then through the full bidding space, using population-based experiments [nowak2000fairness]. Populations of G1, G2, and G3 played against each other and "reproduced" based on their utility. After many rounds, the rational agent (G1) dominated. However, if prior acceptance history was available, showing agents rejecting offers below a certain threshold, then a population of fair agents (G3) who offered high and accepted high would emerge.
In other words, there is a strong similarity between non-credible threats in our neural agent and the rejection of low offers (using reputation) in evolutionary strategies. This is by far the most interesting result, and it is important to note that evolutionary methods and RL are often framed as competing choices for agent design [salimans2017evolution]. Similar results produced in these two fields may drive future research directions.
4.4.4 Against Tit-for-Tat Agents
Finally, we present results against the relative TFT and Bayesian TFT agents. Since this investigation studies whether the acceptance and bidding strategies can adapt and induce promising counterbids, each game is designed as follows: the TFT agent makes a bid, then the neural agent makes an acceptance decision and counterbids. Thus, the game can only end on the neural agent's acceptance.
Against the relative TFT agent with no discounting (Fig. 4.20), the neural agent converges to the vertex. Yellow represents the ending epoch, and we observe the neural agent's utility is greater than 3. Notably, the TFT agent makes offers that move away from the Pareto Frontier. This arises because the relative TFT agent measures concession with respect to its own utility. As analyzed for the time-based opponents, DRL agents are prone to adjusting the variables that yield the greatest rewards and lowest losses. For the TFT agent, this is reversed, prompting it to concede the issue it values most and propose away from the Pareto Frontier. This demonstrates adaptivity.
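One common formulation of relative tit-for-tat reciprocates the opponent's relative concession, measured in the agent's own utility; the sketch below is such a variant, not necessarily the exact rule used here.

```python
# Relative tit-for-tat sketch: scale our utility demand by the ratio of
# the utility (to us) of the opponent's last two bids. This mirrors the
# opponent's relative concession; the paper's exact variant may differ.
def tft_target(own_prev, opp_util_prev, opp_util_now):
    if opp_util_now == 0:
        return own_prev                  # no information: hold position
    ratio = opp_util_prev / opp_util_now
    return min(1.0, own_prev * ratio)   # concede when the opponent does

# The opponent's bids became twice as generous to us, so we halve our demand.
assert tft_target(0.8, 0.2, 0.4) == 0.4
```

Because the rule is anchored entirely in the TFT agent's own utility, a neural opponent can steer which issues the TFT agent concedes, which is the exploit observed in Fig. 4.20.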
In contrast, the neural agent cooperates with the Bayesian TFT agent. In Fig. 4.21a (epoch 0), the neural agent performs randomly and suboptimally (the color bar represents the time step). However, the bid direction drifts towards the vertex (Figs. 4.21b and c). By epoch 1450, the neural agent learns to induce results near the Nash point, in only four moves due to the discount rate.
Together with the results from time-based agents and self-play, our analysis shows that when concession is necessary, the bidding strategy gravitates toward the Pareto Frontier. This ensures a near Pareto-optimal payoff if its offer is accepted. Exploitation occurs against time-based and relative TFT agents, while fairer outcomes arise with more complex agents.
5.1 Summary of Discussion
Bilateral negotiation presents a unique domain that combines discrete and continuous control problems. Furthermore, the deadline produces a utility function analogous to cliff-walking. This paper is a foundational evaluation of actor-critic models for negotiation, measuring their ability to exploit, adapt, and cooperate.
The neural agent shows clear exploitative behavior against time-based agents. For acceptance, the neural agent demonstrates precise logit-switching behavior in its transitions between rejecting and accepting offers. The acceptance strategy resembles a conservative agent, accepting a little before the optimal time. For the bidding strategy, we evaluated the normal, Cauchy, and beta distributions for continuous control: the Cauchy yields the highest reward, but the normal is more consistent. The neural agent learns ways to probe the opponent, such as maintaining a high mean and high initial variance to ensure enough rejections, before lowering the variance towards more deterministic outcomes. This also demonstrates adaptability to concession and discounting.
Time-based experiments reveal the barriers to optimal convergence. The error in stoppage time can be explained by the change in marginal utility (the second derivative) and cliff-walking: the agent waits for higher rewards, then is punished aggressively when the conflict deal is enacted. The primary factors influencing bidding optimality are trade-offs in variance (e.g., the beta distribution suffers from slow convergence due to low variance sensitivity). High variance is required to seek out optimal strategies, but low variance helps avoid the conflict deal. The peakiness of the Cauchy and its heavy tails make it a suitable candidate.
The neural agent was also shown to be cooperative and adaptive. When playing against time-based agents with preference-based concessions, offers are accepted along the Pareto Frontier and produce the highest expected reward. Self-play in the centipede game shows agents are willing to accrue interest, demonstrating cooperation over rationality. Against simple Bayesian TFT agents, the neural agent quickly learns to arrive at the Nash Solution, resulting in win-win cooperation. Since all results arise from a single neural architecture, the neural agent shows significant adaptability. Most importantly, the neural agent forces fairer results by either 1) utilizing the conflict deal or 2) levying discounting to force fairer offers. There is a strong similarity between non-credible threats in our neural agent and the rejection of low offers (using reputation) in evolutionary strategies, and it is worth noting that evolutionary methods and RL are often framed as competing choices for agent design [salimans2017evolution]. Beyond theoretical interest in diverging from classical game theory, these results may guide the design of fairer negotiations, with EGT taking a population perspective and DRL the individual agent's perspective.
Before discussing future work, I will note what did not work. Initially, the use of LSTMs seemed promising due to their success in natural-language negotiation generation. However, in that setting the action domain is discrete and limited.
5.1.1 Evaluation and Future Work
One weakness of this study is that it examines a specific preference ordering. For the scenario, promising avenues include variations in the utility functions, as there are six possible preference orderings over three issues. More important is the inclusion of more complicated behavior-based agents. One barrier is that, unlike the iterated prisoner's dilemma, which has hundreds of established strategies collected in the Axelrod library [axelrodproject], negotiation lacks a comparable repository of strategies.
However, this is quickly changing. An annual negotiation competition, begun in 2010 [baarslag2012first], collects strong bots into the Genius environment [lin2014genius], maintained by Tim Baarslag. I anticipate running the neural agent against these bots to understand how DRL performs in a tournament setting and against more complicated strategies.
Another weakness of this study is the limited experimentation with design choices (learning methodology), although this was not possible given that the focus of this dissertation was behavioral analysis. A separate study with a deep learning focus could scope out the impact of neural architecture (the type of nonlinearity and number of layers) and hyperparameters (learning rate, reward discounting, and the use of schedulers). Additionally, increasing the complexity of the algorithm may improve performance, such as expanding the input space to include prior moves against trajectory-based opponents, or using Monte Carlo Tree Search and rollouts [silver2016mastering].
Bibliography
Appendix A Appendix
A.1 Policy-Gradient Theorem
The policy gradient theorem states that the change in the scalar performance measure $J(\theta)$ is proportional to a change in the policy weights $\theta$. More specifically, this is given as:

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta)$$

Here, $\mu(s)$ denotes the on-policy distribution of a state under policy $\pi$; we can think of this as the frequency with which a state occurs. $q_\pi(s, a)$ is the value of the state-action pair, and $\nabla_\theta \pi(a \mid s, \theta)$ is the change in the policy distribution. What this says is that a positive change in $J$ can be produced by a proportional shift in the policy. A full proof can be found in Chapter 13 of [sutton2018reinforcement].
A.2 Point-to-Line Calculation
The general form of the shortest distance from a point $(x_0, y_0)$ to a line $ax + by + c = 0$ is

$$d = \frac{|a x_0 + b y_0 + c|}{\sqrt{a^2 + b^2}} \qquad (A.1)$$

The Pareto Frontier is given by Eq. 4.8. Hence, the distance of an outcome to the Pareto Frontier follows by substituting the frontier's line coefficients into Eq. A.1.
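Eq. A.1 translates directly into code:

```python
import math

# Eq. A.1: distance from point (x0, y0) to the line a*x + b*y + c = 0.
def point_line_distance(x0, y0, a, b, c):
    return abs(a * x0 + b * y0 + c) / math.hypot(a, b)

# Distance from (3, 4) to the x-axis (y = 0) is simply 4.
assert point_line_distance(3.0, 4.0, 0.0, 1.0, 0.0) == 4.0
```

With the line coefficients of the Pareto Frontier from Eq. 4.8 substituted for `a`, `b`, and `c`, this gives the distance-to-frontier measure used in the distributional analysis.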
A.3 Actor-Critic Playout Implementation
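A minimal, hedged sketch of a generic actor-critic playout loop; the environment API and the policy/value callables are assumptions, not the exact implementation used in this work.

```python
# Generic actor-critic playout sketch (the environment API, policy, and
# value callables are assumptions, not this work's exact implementation).
def discounted_returns(rewards, gamma):
    """Discounted return G_t, computed backwards from the final reward."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return returns

def playout(policy, value, env, gamma=0.99):
    """Collect one trajectory and the per-step advantages."""
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    while not done:
        action = policy(state)              # e.g. a Cauchy-sampled offer
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
    returns = discounted_returns(rewards, gamma)
    advantages = [g - value(s) for g, s in zip(returns, states)]
    return states, actions, advantages      # fed to the gradient update

assert discounted_returns([0.0, 0.0, 1.0], 0.5) == [0.25, 0.5, 1.0]
```

The advantages weight the policy-gradient update of Section A.1, while the returns serve as regression targets for the value net.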
A.4 Change in Univariate Mean Estimation
The y-axis shows the decision utility at a given time, which is inversely related to the concession factor: the higher the decision utility, the more Boulware the agent. The concession value can be converted accordingly.