Learnable Strategies for Bilateral Agent Negotiation over Multiple Issues

09/17/2020 ∙ by Pallavi Bagga, et al. ∙ 10

We present a novel bilateral negotiation model that allows a self-interested agent to learn how to negotiate over multiple issues in the presence of user preference uncertainty. The model relies upon interpretable strategy templates representing the tactics the agent should employ during the negotiation and learns template parameters to maximize the average utility received over multiple negotiations, thus resulting in optimal bid acceptance and generation. Our model also uses deep reinforcement learning to evaluate threshold utility values, for those tactics that require them, thereby deriving optimal utilities for every environment state. To handle user preference uncertainty, the model relies on a stochastic search to find the user model that best agrees with a given partial preference profile. Multi-objective optimization and multi-criteria decision-making methods are applied at negotiation time to generate Pareto-optimal outcomes, thereby increasing the number of successful (win-win) negotiations. Rigorous experimental evaluations show that the agent employing our model outperforms the winning agents of the 10th Automated Negotiating Agents Competition (ANAC'19) in terms of individual as well as social-welfare utilities.





An important problem in automated negotiation is modelling a self-interested agent that learns to optimally adapt its strategy while bilaterally negotiating against an opponent over multiple issues. In many domains a model of this kind will need to consider the preferences of the user the agent represents in an application. Consider, for instance, bilateral negotiation in e-commerce where a buyer agent negotiates with a seller agent to buy a product on the user’s behalf. Here the buyer has to settle the price of a product specified by a user, together with a number of similar issues expressed as user preferences about delivery time, payment methods and delivery location Fatima et al. (2006).

A practical consideration in bilateral multi-issue negotiation of this kind is the lack of a mediator to coordinate the negotiation, so the participating agents must reach agreement using a decentralized protocol such as alternating offers  Rubinstein (1982). Here an agent initiates the negotiation with an offer and its opponent may accept or reject that offer. If the opponent accepts, negotiation ends with an agreement and each agent evaluates the agreement’s utility according to their preferences. Otherwise, the opponent makes a counter offer. This turn-taking continues until an agreement is reached. In practice, the negotiation time is limited for each agent, as there is often a shared deadline for its termination. This means that the agents are under pressure to reach an agreement and must consider the risk of having an offer rejected by the opponent as the deadline is approaching Fatima et al. (2002).
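The turn-taking described above can be illustrated with a minimal sketch. The `SimpleAgent` class, its linear concession rule, and the round-based deadline are hypothetical simplifications chosen for illustration; they are not the agents or the time model used in this work.

```python
from typing import Dict, List, Optional, Tuple

Bid = Tuple[str, ...]  # one chosen value per issue


class SimpleAgent:
    """Hypothetical agent: concedes linearly from its best to its worst bid."""

    def __init__(self, ranked_bids: List[Bid], utility: Dict[Bid, float]):
        self.ranked_bids = ranked_bids  # own bids, most preferred first
        self.utility = utility          # bid -> utility in [0, 1]

    def propose(self, t: float) -> Bid:
        # Walk down the ranked list as normalized time t goes from 0 to 1.
        idx = min(int(t * len(self.ranked_bids)), len(self.ranked_bids) - 1)
        return self.ranked_bids[idx]

    def accepts(self, bid: Bid, t: float) -> bool:
        # Accept if the offer is at least as good as our own proposal at time t.
        return self.utility[bid] >= self.utility[self.propose(t)]


def alternating_offers(a: SimpleAgent, b: SimpleAgent,
                       max_rounds: int) -> Tuple[Optional[Bid], int]:
    """Return (agreed bid, rounds used), or (None, max_rounds) at the deadline."""
    proposer, responder = a, b
    for rnd in range(max_rounds):
        t = rnd / max_rounds                       # normalized time in [0, 1)
        offer = proposer.propose(t)
        if responder.accepts(offer, t):
            return offer, rnd + 1                  # agreement ends the negotiation
        proposer, responder = responder, proposer  # responder makes a counter offer
    return None, max_rounds                        # shared deadline reached


bids = [("low",), ("mid",), ("high",)]
buyer = SimpleAgent(bids, {("low",): 1.0, ("mid",): 0.6, ("high",): 0.1})
seller = SimpleAgent(bids[::-1], {("low",): 0.1, ("mid",): 0.6, ("high",): 1.0})
deal, rounds = alternating_offers(buyer, seller, max_rounds=10)
```

In this toy run both agents hold out at first and only meet once each has conceded far enough, which mirrors the deadline pressure discussed above.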

Additional challenges must be considered prior to designing a negotiation strategy in the above setting: (a) The users need to express their preferences by ranking a few representative examples instead of providing a fully specified utility function Tsimpoukis et al. (2018); thus agents are uncertain about the preferences characterising the profile of the user. (b) The agents have no previous knowledge of the preferences and negotiating characteristics of their opponents Baarslag et al. (2016). (c) The utility of offers might decrease over time (in negotiation scenarios with a discount factor); thus, timely decisions on rejecting or accepting an offer and making acceptable offers are of high importance Fatima et al. (2002).

Three main approaches have been proposed to address the above challenges. The first approach is based on hand-crafted predefined heuristics that have been proposed in a number of settings with competitive results Costantini et al. (2013). Although interpretable (e.g. see Alrayes et al. (2018)), they are often characterised by ad-hoc parameter/weight settings that are difficult to adapt for different domains. The second approach relies upon meta-heuristic (or evolutionary) methods that work well across domains and improve iteratively using a fitness function (as a guide for quality) Lau et al. (2006); Lang and Fink (2015). However, in these approaches every agent decision has to be delivered by the meta-heuristic, which is not efficient and does not result in a human-interpretable and reusable negotiation strategy. Perhaps the most encouraging results are shown by the third approach, based on machine learning algorithms, which show the best results with respect to run-time adaptability Bagga et al. (2020); Razeghi et al. (2020). Their working hypotheses are often uninterpretable, however, a fact that may hinder their eventual adoption by users due to the lack of transparency in the decision making that they offer.

To avoid some of the issues above, the aim of this work is to develop an interpretable strategy template that guides the use of a series of tactics whose optimal use can be learned during negotiation. The structure of such templates depends upon a number of learnable choice parameters determining which acceptance and bidding tactic to employ at any particular time during negotiation. As these tactics represent hypotheses to be tested, defined by the agent developer, they can be explained to a user, and can in turn depend on learnable parameters. The outcome of our work is an agent model, called ANESIA (Adaptive NEgotiation model for a Self-Interested Autonomous agent), that formulates a strategy template for bid acceptance and generation so that an agent that uses it can make optimal decisions about the choice of tactics while negotiating in different domains.

Our specific contribution involves implementing ANESIA as an actor-critic architecture interpreted using Deep Reinforcement Learning (DRL) Lillicrap et al. (2016). This allows our agent to learn the optimal target utility value dynamically and thus to optimally adapt the strategy in different negotiation domains. Below this target threshold utility, our agent neither accepts nor proposes any bid from/to the opponent agent. To estimate the user preferences from a given partial information set, we use a meta-heuristic Yang and Deb (2009) that can handle potentially large search spaces optimally, triggered prior to the start of the negotiation to build a user model. Also, our bid generation is focused on Pareto-optimal bids. We obtain these bids through a combination of a multi-objective evolutionary algorithm Deb et al. (2002) and a multi-criteria decision making method Tzeng and Huang (2011) that use the estimated utilities found in the user and opponent models. Moreover, to evaluate our approach against the state of the art, we conduct simulation experiments in different negotiation domains and against the negotiating agents presented at the ANAC’19 competition (http://ii.tudelft.nl/nego/node/7), as the theme of this tournament has been bilateral multi-issue negotiations. Our results indicate that our strategy outperforms the winning strategies in terms of individual and joint (or social welfare) utilities.

Related Work

Existing approaches with reinforcement learning have focused on methods such as Tabular Q-learning for bidding Bakker et al. (2019) or DQN for bid acceptance Razeghi et al. (2020), which are not optimal for continuous action spaces. Such spaces, however, are the main focus in this work, in order to estimate the threshold target utility value below which no bid is accepted/proposed from/to the opponent agent. The recently proposed adaptive negotiation model in Bagga et al. (2020) uses DRL (i.e. DDPG) for continuous action spaces, but its motivation is significantly different from ours. In our work, the agent tries to learn the threshold utility, which is used by one of the tactics in the acceptance and bidding strategies, while Bagga et al. (2020) use DRL for a complete agent strategy while negotiating with multiple sellers concurrently in E-markets. Moreover, we decouple the decision component in our model: separate acceptance and bidding strategies are learned based on a strategy template containing different tactics to be employed at different times.

Many meta-heuristic optimization algorithms have been used in the negotiation literature, such as Particle Swarm Optimization for opponent selection in Silva et al. (2018), and Chaotic Owl Search El-Ashmawi et al. (2020) and Simulated Annealing Klein et al. (2003) for generating offers. All of these algorithms focus on different problem areas than ours, which is to solve a constraint satisfaction problem of estimating the user model which best agrees with the given partially ordered ranking of user preferences. For this we use Cuckoo Search Optimization Yang and Deb (2009), a nature-inspired meta-heuristic algorithm which has been widely used in many engineering problems, but not in the domain of bilateral negotiation.

The Genetic Algorithm NSGA-II for multi-objective optimization has been used previously to find multiple Pareto-optimal solutions Hashmi et al. (2013), but here we also hybridize it with TOPSIS to choose, during negotiation, the single best outcome among a set of ranked Pareto-optimal outcomes according to user and opponent preferences. The hybrid of NSGA-II and TOPSIS has been extensively employed in different design optimization problems, such as Wang et al. (2016), Méndez et al. (2009), Etghani et al. (2013), Li and Zhang (2012) and Zeelanbasha et al. (2020), but we are the first to introduce it in the domain of negotiation to generate Pareto-optimal outcomes between two negotiating agents.

The ANESIA Model

We assume that our negotiation environment consists of two agents negotiating with each other over some domain $D$. A domain $D$ consists of $n$ different issues, $D = (I_1, I_2, \ldots, I_n)$, where each issue can take a finite set of $k_i$ possible values: $I_i = (v^i_1, \ldots, v^i_{k_i})$. An agent’s bid $\omega$ is a mapping from each issue to a chosen value (denoted by $c_i$ for the $i$-th issue), i.e. $\omega = (v^1_{c_1}, \ldots, v^n_{c_n})$. The set of all possible bids or outcomes is called the outcome space and is denoted by $\Omega$, s.t. $\omega \in \Omega$. Before the agents can begin the negotiation and exchange bids, they must agree on a negotiation protocol $P$, which determines the valid moves agents can take at any state of the negotiation Fatima et al. (2005). Here, we consider the alternating offers protocol Rubinstein (1982), as discussed in the introduction, in which an agent can make an offer, accept, or reject.

Furthermore, we assume that each negotiating agent has its own private preference profile which describes how bids are preferred over other bids. This profile is given in terms of a utility function $U$, defined as a weighted sum of evaluation functions $e_i$ as shown in (1).

$$U(\omega) = U(v^1_{c_1}, \ldots, v^n_{c_n}) = \sum_{i=1}^{n} w_i \cdot e_i(v^i_{c_i}), \quad \text{where } \sum_{i=1}^{n} w_i = 1 \qquad (1)$$

In (1), each issue is evaluated separately, contributing linearly without depending on the value of other issues; hence $U$ is referred to as the Linear Additive Utility space. Here, $w_i$ are the normalized weights indicating the importance of each issue to the user and $e_i(v^i_{c_i})$ is an evaluation function that maps the value of the $i$-th issue to a utility. In our settings, we assume that $U$ is unknown and our agent is given incomplete information in terms of partial preferences, i.e. a randomly generated partially ordered ranking $\preceq$ over bids (w.r.t. $U$) s.t. $\omega_1 \preceq \omega_2 \rightarrow U(\omega_1) \le U(\omega_2)$. Hence, during the negotiation, one of the objectives of our agent is to derive an estimate $\hat{U}$ of the real utility function $U$ from the given partial preferences.
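The linear additive utility described above can be made concrete with a short worked example. The issue names, weights and evaluation tables below are invented purely for illustration.

```python
from typing import Dict


def linear_additive_utility(bid: Dict[str, str],
                            weights: Dict[str, float],
                            evaluations: Dict[str, Dict[str, float]]) -> float:
    """Weighted sum of per-issue evaluations; weights are assumed to sum to 1."""
    return sum(weights[issue] * evaluations[issue][value]
               for issue, value in bid.items())


# Hypothetical three-issue e-commerce domain.
weights = {"price": 0.5, "delivery": 0.3, "payment": 0.2}
evaluations = {
    "price": {"low": 1.0, "mid": 0.5, "high": 0.0},
    "delivery": {"fast": 1.0, "slow": 0.25},
    "payment": {"card": 1.0, "cash": 0.5},
}
bid = {"price": "mid", "delivery": "fast", "payment": "cash"}

# 0.5 * 0.5 + 0.3 * 1.0 + 0.2 * 0.5 = 0.65
u = linear_additive_utility(bid, weights, evaluations)
```

Because each issue contributes independently, changing the value of one issue shifts the utility by that issue's weight times the change in its evaluation, regardless of the other issues.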

ANESIA Components

Our proposed agent negotiation model (shown in Figure 1) supports learning during bilateral negotiations with unknown opponents under user preference uncertainty.

Physical Capabilities:

These are the sensors and actuators of the agent that enable it to access a negotiation environment . More specifically, they allow our agent to perceive the current (external) state of the environment and represent that state locally in the form of internal attributes as shown in Table 1. Some of these attributes (, , ) are stored locally in its Knowledge Base and some of them (, , , ) are derived from the sequence of previous bids offered by the opponent which is perceived by the agent using its sensors while interacting with an opponent agent during the negotiation. At any time , the internal agent representation of the environment is , which is used by the agent (among acceptance and bidding strategies) to decide what action to execute using its actuators. Action execution then changes the state of the environment to .

Attribute Description
Current negotiation time
Total number of possible bids
Total number of issues
Given number of bids in the partial-ordering due to user preference uncertainty
Utility of the best opponent bid so far
Average of utilities of all the bids received from the opponent agent
Standard deviation of utilities of all the bids received from the opponent agent
Table 1: Agent’s State Attributes

Learning Capabilities:

This component consists of the following sub-components: Negotiation Experience, Decide and Evaluate. The Decide component is further sub-divided into Acceptance strategy and Bidding strategy. These sub-components need information from two other components, called User modeling and Opponent modeling, which allow the agent to negotiate given incomplete information about user and opponent preferences by estimating the user and opponent models, respectively. The user model is estimated using the given partially ordered preferences of the user about the bids. It is estimated only once by the agent before the start of the negotiation in order to encourage autonomous behaviour of the agent and avoid user elicitation. The opponent model, on the other hand, is estimated using information from the bids received from the opponent. The set of opponent bids is collected only until the halfway point of the negotiation period, as the opponent agent is more likely to change its initial strategy afterwards in order to either reach an agreement or learn more about the other agent’s preferences. The decoupled structure of Decide in the form of acceptance and bidding strategies is inspired by a well known negotiation architecture known as BOA Baarslag et al. (2014).

Negotiation Experience stores historical information about previous negotiation experiences which involve the interactions of an agent with other agents. Experience elements are of the form $\langle s_t, a_t, r_t, s_{t+1} \rangle$, where $s_t$ is the internal state of the negotiation environment, $a_t$ is an action performed by the agent at $s_t$, $r_t$ is a scalar reward received from the environment and $s_{t+1}$ is the new internal state after executing $a_t$.

Decide refers to a negotiation strategy which helps an agent to choose an optimal action among the set of possible actions at a particular state, based on the negotiation protocol. In particular, it consists of two functions, one for the acceptance strategy and one for the bidding strategy. The acceptance function takes as inputs the agent’s state, a dynamic threshold utility (which we define next), and the sequence of past opponent bids, and returns a discrete action, either accept or reject. When the acceptance function decides reject, the bidding function is used to compute the next bid to be proposed to the opponent; see (2) and (3).


Evaluate refers to a critic which helps our agent learn the dynamic threshold utility and evolve the negotiation strategy for unknown negotiation scenarios. More specifically, it is a function of a random sample of past negotiation experiences fetched from the database. The process of learning is retrospective since it depends on the reward obtained from the negotiation environment by performing an action at a given state. The value of the reward depends on the (estimated) discounted utility of the last bid received from the opponent, or of the bid accepted by either party, and is defined as follows:


where is the discounted reward of defined as


where is a temporal discount factor included to encourage the agent to negotiate without delay.

We stress that our design of the reward function accelerates agent learning by allowing the agent to receive a reward after every action it performs in the environment instead of only at the end of the negotiation.
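A per-step reward of this kind, together with the stored experience tuples, might be sketched as follows. The exponential discounting form `utility * d ** t` is an assumption (it is the common convention for discounted domains) and is not taken from the equations of this work; the names `Experience`, `record` and `sample_batch` are likewise hypothetical.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple, Tuple


class Experience(NamedTuple):
    state: Tuple[float, ...]
    action: float              # e.g. the threshold utility chosen by the actor
    reward: float
    next_state: Tuple[float, ...]


def discounted_utility(utility: float, t: float, d: float) -> float:
    """ASSUMPTION: exponential discounting with normalized time t in [0, 1]."""
    return utility * d ** t


replay: Deque[Experience] = deque(maxlen=10_000)  # negotiation experience store


def record(state, action, est_utility: float, t: float, d: float,
           next_state) -> Experience:
    """Reward every step with the discounted estimated utility of the bid
    just received (or agreed on), then store the transition."""
    reward = discounted_utility(est_utility, t, d)
    exp = Experience(tuple(state), action, reward, tuple(next_state))
    replay.append(exp)
    return exp


def sample_batch(k: int, rng: random.Random) -> List[Experience]:
    """Random mini-batch of past experiences for the critic."""
    return rng.sample(list(replay), min(k, len(replay)))


e = record((0.5, 0.6), 0.7, est_utility=0.8, t=0.5, d=0.25, next_state=(0.6, 0.6))
batch = sample_batch(4, random.Random(0))
```

Rewarding each step this way gives the learner a signal long before an agreement is reached, which is the point made in the paragraph above.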

Figure 1: The Architecture of ANESIA
Strategy templates:

One common way to define the acceptance and bidding strategies is via a combination of hand-crafted tactics that, by empirical evidence or domain knowledge, are known to work effectively. However, a fixed set of tactics might not adapt well to multiple different negotiation domains. In our model, we do not assume pre-defined acceptance and bidding strategies; instead, our agent learns these strategies offline. To do so, we assume that our agent learns the strategy by negotiating with different opponents bilaterally and with the full knowledge of the true preferences of the user it represents, so that the strategies can be derived by optimizing the true utility over multiple negotiations.

To enable strategy learning, we introduce the notion of strategy templates, i.e., strategies consisting of a series of tactics, where each tactic is executed for a specific phase of the negotiation. The parameters describing the start and duration of each phase as well as the choice of the particular tactic for that phase are all learnable. Moreover, some tactics might expose learnable parameters too. We assume a library of acceptance tactics and a library of bidding tactics. Each acceptance tactic maps the agent state, threshold utility, opponent bid history, and a (possibly empty) vector of learnable parameters into a utility value, which represents the minimum utility required by the agent to accept the offer. Each bidding tactic returns a bid. Given a library of acceptance strategies, an acceptance strategy template is a parametric function defined by


where is the number of tactics used, is the number of options for the -th tactic, , , , , and , , and are the parameters to learn, for and . In other words, the parameters determine for how long the -th tactic is applied, and the are choice parameters determining which particular tactic from to use. We note that (6) is a predicate, i.e., it returns a Boolean, indicating whether the opponent bid is accepted. Similarly, given a library of bidding strategies , a bidding strategy template is defined by


where is the number of tactics, is the number of options for the -th tactic, , and , , and are as per above. The particular libraries of tactics used in this work are discussed in the next Section.
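To make the idea of a phase-based template concrete, here is a minimal sketch of an acceptance template. It simplifies the description above by selecting exactly one tactic per phase; the `Phase` structure, the tactic signatures and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# A tactic maps (time, threshold utility, utilities of past opponent bids)
# to the minimum utility the agent requires in order to accept.
AcceptTactic = Callable[[float, float, Sequence[float]], float]


@dataclass
class Phase:
    end_time: float        # learnable: phase covers [previous end, end_time)
    tactic: AcceptTactic   # learnable choice from the tactic library


def accept(template: List[Phase], t: float, u_bar: float,
           opp_utils: Sequence[float], offer_utility: float) -> bool:
    """Predicate: is the opponent's offer accepted at normalized time t?"""
    for phase in template:
        if t < phase.end_time:
            return offer_utility >= phase.tactic(t, u_bar, opp_utils)
    # At or past the last boundary, fall back to the final phase.
    return offer_utility >= template[-1].tactic(t, u_bar, opp_utils)


# Tiny hypothetical tactic library.
def drl_threshold(t, u_bar, opp_utils):
    return u_bar                      # learned dynamic threshold


def best_received(t, u_bar, opp_utils):
    return max(opp_utils)             # best opponent offer so far


template = [Phase(0.3, drl_threshold), Phase(0.8, best_received),
            Phase(1.0, drl_threshold)]

early = accept(template, 0.1, 0.7, [0.4, 0.6], offer_utility=0.65)
middle = accept(template, 0.5, 0.7, [0.4, 0.6], offer_utility=0.65)
```

A search procedure would then tune the phase boundaries and tactic choices, for example by scoring each candidate template by the average utility it achieves over many negotiations.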


In this section, we describe the methods used for user and opponent modelling, for learning the dynamic utility threshold, and for deriving optimal acceptance and bidding strategies out of our strategy templates.

User modeling:

To estimate the user model from the given partial bid order, our agent uses Cuckoo search optimization (CSO) Yang and Deb (2009), a meta-heuristic inspired by the brood parasitism of cuckoo birds. As a metaphor, a cuckoo is an agent in search of its best user model (or nest or solution). In brief, in CSO a set of candidate solutions (user models) is evolved, and at each iteration the worst-performing solutions are abandoned and replaced with new solutions generated by Lévy flight. In our case, the fitness of a candidate solution is defined as the Spearman’s rank correlation coefficient between the ranking of the given bids estimated under that candidate and the real, but partial, ranking of bids given in input to the agent. The coefficient is a measure of the similarity between two rankings, assigning a value of 1 to identical rankings and −1 to opposed rankings.
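The fitness function described above can be sketched in plain Python. Only the Spearman-based fitness is shown; the search loop itself (Cuckoo search with Lévy flights) is omitted. The helper names are hypothetical, and the code assumes the candidate does not assign the same utility to every bid (the correlation is undefined in that case).

```python
from typing import Callable, List, Sequence, TypeVar

B = TypeVar("B")


def ranks(values: Sequence[float]) -> List[float]:
    """Average ranks (1 = smallest); tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's coefficient, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def fitness(candidate_utility: Callable[[B], float],
            ranked_bids: Sequence[B]) -> float:
    """ranked_bids: the user's partial ranking, least preferred first."""
    estimated = [candidate_utility(b) for b in ranked_bids]
    given = list(range(len(ranked_bids)))   # position in the user's ranking
    return spearman(estimated, given)


agree = fitness(lambda b: b, [1, 2, 3, 4])      # candidate matches the ranking
oppose = fitness(lambda b: -b, [1, 2, 3, 4])    # candidate reverses the ranking
```

A candidate user model that reproduces the given ordering scores 1, and one that reverses it scores −1, so maximizing this fitness pushes the search toward models consistent with the user's partial preferences.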

Opponent modeling:

For the estimation of opponent preferences during the negotiation, we use the distribution-based frequency model proposed in Tunalı et al. (2017). In this model, the empirical frequency of the issue values in the opponent bidding history provides an educated guess of the issue values most preferred by the opponent. The issue weights, on the other hand, are estimated by analyzing disjoint windows of the opponent bidding history, which gives an idea of whether the opponent shifts from its previous negotiation strategy as time passes.
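The value-frequency part of such a model can be illustrated with a deliberately simplified sketch. This is not the model of Tunalı et al. (2017): the window-based estimation of issue weights is left out, and the scaling rule (most frequent value scores 1.0) is an assumption made for illustration.

```python
from collections import Counter
from typing import Dict, List


def estimate_value_preferences(
        opponent_bids: List[Dict[str, str]]) -> Dict[str, Dict[str, float]]:
    """For each issue, score every value by how often the opponent offered it,
    scaled so that the most frequently offered value gets 1.0."""
    counts: Dict[str, Counter] = {}
    for bid in opponent_bids:
        for issue, value in bid.items():
            counts.setdefault(issue, Counter())[value] += 1
    return {
        issue: {value: c / max(counter.values()) for value, c in counter.items()}
        for issue, counter in counts.items()
    }


history = [
    {"price": "high", "delivery": "slow"},
    {"price": "high", "delivery": "fast"},
    {"price": "mid", "delivery": "slow"},
]
prefs = estimate_value_preferences(history)
```

The intuition is that values an opponent keeps offering are likely the ones it prefers, so they receive higher estimated evaluations.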

Utility threshold learning:

We use an actor-critic architecture with model-free deep reinforcement learning (i.e. Deep Deterministic Policy Gradient (DDPG) Lillicrap et al. (2016)) to predict the target threshold utility. Thus, the threshold utility is expressed as a deep neural network function, whose input is the agent state (see Table 1 for the list of features).

Prior to reinforcement learning, our agent’s strategy is pre-trained with supervision from synthetic negotiation data. To collect supervision data, we use a simulation environment called GENIUS Lin et al. (2014) that supports multi-issue bilateral negotiation in different domains for varied user profiles. In particular, data was generated by running the winner of the ANAC’19 competition, known as AgentGG (http://web.tuat.ac.jp/~katfuji/ANAC2019/), against other strategies available in GENIUS (AgentGP, Gravity, HardDealer, Kagent, Kakesoba, SAGA, winkyagent, SACRA, FSEGA2019) in three different domains (Laptop, Holiday and Party, all readily available in GENIUS) for varied user profiles, assuming no user preference uncertainties. This initial supervised learning (SL) stage helps our agent decrease the exploration time required for DRL during the negotiation, an idea primarily influenced by the work of Bagga et al. (2020).

Strategy learning:

The parameters of the acceptance and bidding strategy templates (see (6) and (7)) are learned by running the CSO meta-heuristic (initializing the values of the template parameters based on an educated guess). We define the fitness of a particular choice of template parameters as the average true utility over multiple rounds of negotiations under the concrete strategy implied by those parameters, obtained by running our agent on the GENIUS platform against three different opponents (AgentGG, KakeSoba and SAGA) in three different negotiation domains.

We now describe the libraries of acceptance and bidding tactics we draw from in our templates. As for the acceptance tactics, we consider:

  • The estimated utility of the bid that our agent would propose at that time.

  • A time-dependent quantile of the distribution of (estimated) utility values of the bids received from the opponent, where the quantile level is a learnable function of the negotiation time. In other words, we consider the utility at a given rank among those received from the opponent, and this rank varies with time according to learnable parameters.

  • The dynamic DRL-based utility threshold.

  • A fixed utility threshold.
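As an illustration of the quantile-based acceptance tactic in the list above, the sketch below computes an acceptance threshold from the utilities of the opponent's past bids. The linear form `a * t + b` for the quantile level and the interpolation rule are assumptions made for this example, and the parameter values are invented.

```python
from typing import Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Empirical quantile with linear interpolation; q is clipped to [0, 1]."""
    q = min(max(q, 0.0), 1.0)
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def quantile_tactic(t: float, opp_utils: Sequence[float],
                    a: float, b: float) -> float:
    """ASSUMPTION: the quantile level is the linear function a * t + b of the
    normalized negotiation time t, with a and b as learnable parameters."""
    return quantile(opp_utils, a * t + b)


# Estimated utilities (for our agent) of the bids received so far.
received = [0.2, 0.4, 0.6, 0.8, 1.0]

# With a = -0.5 and b = 1.0 the agent starts by demanding the best utility
# seen so far and relaxes toward the median as the deadline approaches.
start = quantile_tactic(0.0, received, a=-0.5, b=1.0)
mid = quantile_tactic(0.5, received, a=-0.5, b=1.0)
end = quantile_tactic(1.0, received, a=-0.5, b=1.0)
```

An offer would then be accepted when its estimated utility is at least the value returned by the tactic active in the current phase.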

The bidding tactics in our library are:

  • A bid generated by a time-dependent Boulware strategy Fatima et al. (2001).

  • A bid extracted from the set of Pareto-optimal bids, which is derived (using the NSGA-II algorithm) under the estimated user and opponent utility models. In particular, this tactic selects the bid that assigns a time-dependent weight to the ego agent utility (and the remaining weight to the opponent’s), where learnable parameters tell how this weight scales with the negotiation time. The TOPSIS algorithm is used to derive such a bid, given the weighting as input.

  • A bid generated, in a greedy way, by randomly changing the value of the least relevant issue in the last received opponent bid.

  • A random bid above our DRL-based utility threshold, drawn from the uniform distribution over the subset of the outcome space whose bids have estimated utility above the threshold.
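The Pareto-based bidding tactic in the list above can be illustrated on a small candidate set. The sketch below swaps NSGA-II for a brute-force non-dominated filter, which is only practical for small outcome spaces, and then applies a standard two-criterion TOPSIS step. The candidate utility pairs and weights are invented for the example.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (own estimated utility, opponent estimated utility)


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Brute-force non-dominated filter (a stand-in for NSGA-II)."""
    def dominated(p: Point) -> bool:
        return any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
    return [p for p in points if not dominated(p)]


def topsis(points: Sequence[Point], weights: Tuple[float, float]) -> Point:
    """Pick the point closest to the ideal and farthest from the anti-ideal."""
    if len(points) == 1:
        return points[0]
    # 1) vector-normalize each criterion, 2) apply the criterion weights
    norms = [sqrt(sum(p[j] ** 2 for p in points)) or 1.0 for j in range(2)]
    v = [tuple(weights[j] * p[j] / norms[j] for j in range(2)) for p in points]
    ideal = tuple(max(x[j] for x in v) for j in range(2))
    anti = tuple(min(x[j] for x in v) for j in range(2))

    def closeness(x) -> float:
        d_pos = sqrt(sum((x[j] - ideal[j]) ** 2 for j in range(2)))
        d_neg = sqrt(sum((x[j] - anti[j]) ** 2 for j in range(2)))
        return 1.0 if d_pos + d_neg == 0 else d_neg / (d_pos + d_neg)

    best = max(range(len(points)), key=lambda i: closeness(v[i]))
    return points[best]


candidates = [(0.9, 0.2), (0.7, 0.6), (0.4, 0.9), (0.5, 0.5), (0.3, 0.3)]
front = pareto_front(candidates)
selfish = topsis(front, (1.0, 0.0))    # all weight on our own utility
generous = topsis(front, (0.0, 1.0))   # all weight on the opponent's utility
```

Shifting the weight from the agent's own utility toward the opponent's over time moves the selected bid along the frontier, which is how a time-dependent weighting can trade off individual gain against the chance of agreement.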

One concrete acceptance strategy learned in our experiments tends to favor the time-dependent quantile tactic during the middle of the negotiation, and the DRL utility threshold during the initial and final stages.

Experimental Results

All the experimental simulations are performed on the simulation environment GENIUS Lin et al. (2014). Our experiments are based on the following five hypotheses:

  • Hypothesis A: A stochastic search allows accurate user models to be derived under user preference uncertainty.

  • Hypothesis B: The set of estimated Pareto-optimal bids obtained using NSGA-II under uncertainty is close to the true Pareto-optimal solution.

  • Hypothesis C: Under non-optimized acceptance and bidding strategies, DRL of the utility threshold yields performance superior to other negotiation strategies in different domains.

  • Hypothesis D: Learning optimal acceptance and bidding tactics yields performance superior w.r.t. other negotiation strategies and non-optimized strategies.

  • Hypothesis E: ANESIA agents effectively adapt to unseen negotiation settings.

Performance metrics:

Inspired by the ANAC’19 competition, for our experiments we use the following widely adopted metrics:

  • Average individual utility rate: Sum of all the utilities of an agent averaged over the successful negotiations (Ideal value: High (1.0));

  • Average social welfare utility: Sum of all the utilities gained by both negotiating agents averaged over the successful negotiations (Ideal value: High (2.0));

  • Average number of negotiation rounds: Total number of negotiation rounds until agreement is reached, averaged over the successful negotiations (Ideal value: Low (1)).
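The three metrics listed above can be computed from a log of negotiation sessions as in the following sketch. The `Session` record and the sample values are hypothetical; failed negotiations are marked with `None` utilities and excluded from the averages, as the definitions require.

```python
from typing import Dict, List, NamedTuple, Optional


class Session(NamedTuple):
    own_utility: Optional[float]   # None when no agreement was reached
    opp_utility: Optional[float]
    rounds: int


def summarize(sessions: List[Session]) -> Optional[Dict[str, float]]:
    """Average the metrics over successful negotiations only."""
    ok = [s for s in sessions if s.own_utility is not None]
    if not ok:
        return None
    n = len(ok)
    return {
        "individual": sum(s.own_utility for s in ok) / n,
        "social": sum(s.own_utility + s.opp_utility for s in ok) / n,
        "rounds": sum(s.rounds for s in ok) / n,
    }


log = [Session(0.8, 0.6, 10), Session(None, None, 50), Session(0.6, 0.8, 30)]
stats = summarize(log)
```

Note that, because failed sessions are dropped, these averages say nothing about how often an agreement is reached; that rate has to be tracked separately.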

Figure 2: User Modeling Results for 3 different domains
Figure 3: Estimated Pareto-Frontier using NSGA-II based on true and estimated user models
Experimental settings:

We consider three domains (Laptop, Holiday and Party) already used in GENIUS for ANAC with three different domain sizes: low: Laptop Domain (), medium: Holiday domain () and high: Party Domain () with their default settings of reservation price and discount factor. During the negotiation, we assume the deadline for each negotiation to be 60 seconds normalized in [0,1]. For each setting, each agent plays both sides in the negotiation (i.e. 2 user profiles in each setting). A user profile is a role an agent plays during negotiation with its associated preferences. We assume only incomplete information about user preferences, given in the form of randomly-chosen partially-ordered bids. For CSO (in hypotheses A and D), we select a population size of and generations for both user model estimation and learning of strategy template parameters. For NSGA-II (in hypothesis B), we set the population size to 100, number of generations to 25 and mutation count = 0.1. The process of tuning the hyper-parameters of CSO and NSGA-II are critical as we don’t want our agent to exceed the timeout of 1000 seconds given during each turn while deciding an action.

Empirical Evaluation

Hypothesis A: User Modeling

The results in Figure 2 show the average Spearman Correlation Coefficient () values (on Y-axis) taken during 10 simulations for each user profile in every negotiation setting, plotted against the ratio (on X-axis) of given number of partial bids over the total number of possible bids. Dashed lines indicate the value w.r.t. the true (unknown) ranking of bids, solid lines w.r.t. the partial (given) ranking (i.e., the CSO fitness function). We observe that the true value grows with the ratio , attaining relatively high values (above ) even when, as in the party domain, only of the bids are made available to the agent. This demonstrates that our agent can uncover accurate user models also under high uncertainty.

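| Domain | Metric | ANESIA | AgentGG | KakeSoba | SAGA |
|---|---|---|---|---|---|
| Laptop | Avg. individual utility | (0.87, 0.83) | (0.75, 0.56) | (0.72, 0.61) | (0.72, 0.63) |
| Laptop | Avg. social welfare utility | (1.66, 1.67) | (1.39, 1.08) | (1.53, 1.11) | (1.51, 1.38) |
| Laptop | Avg. negotiation rounds | (207.56, 29.46) | (1651.60, 5450.23) | (1370.0, 5877.86) | (1045.0, 5004.46) |
| Holiday | Avg. individual utility | (0.86, 0.88) | (0.81, 0.85) | (0.84, 0.79) | (0.77, 0.73) |
| Holiday | Avg. social welfare utility | (1.65, 1.70) | (1.62, 1.60) | (1.65, 1.55) | (1.54, 1.50) |
| Holiday | Avg. negotiation rounds | (40.30, 43.16) | (923.85, 417.68) | (880.3, 296.69) | (421.55, 470.44) |
| Party | Avg. individual utility | (0.81, 0.94) | (0.72, 0.78) | (0.69, 0.71) | (0.55, 0.52) |
| Party | Avg. social welfare utility | (1.40, 1.48) | (1.36, 1.47) | (1.37, 1.42) | (1.28, 1.23) |
| Party | Avg. negotiation rounds | (109.71, 42.71) | (938.54, 1432.87) | (774.08, 407.41) | (319.69, 202.96) |

Table 2: Performance comparison of ANESIA VS AgentGG VS KakeSoba VS SAGA (without Strategy Template)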
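| Domain | Metric | ANESIA* | ANESIA | AgentGG | KakeSoba | SAGA |
|---|---|---|---|---|---|---|
| Laptop | Avg. individual utility | (0.87, 0.87) | (0.73, 0.68) | (0.64, 0.53) | (0.51, 0.73) | (0.73, 0.68) |
| Laptop | Avg. social welfare utility | (1.66, 1.60) | (1.52, 1.57) | (1.22, 1.05) | (1.58, 1.28) | (1.52, 1.57) |
| Laptop | Avg. negotiation rounds | (147.03, 173.23) | (279.80, 181.28) | (4251.46, 1999.16) | (2159.54, 1115.70) | (1865.01, 2794.10) |
| Holiday | Avg. individual utility | (0.86, 0.79) | (0.85, 0.87) | (0.88, 0.87) | (0.79, 0.76) | (0.78, 0.70) |
| Holiday | Avg. social welfare utility | (1.72, 1.59) | (1.68, 1.72) | (1.56, 1.53) | (1.66, 1.58) | (1.38, 1.28) |
| Holiday | Avg. negotiation rounds | (84.38, 74.21) | (168.79, 278.63) | (1486.72, 569.64) | (745.59, 405.78) | (725.81, 367.09) |
| Party | Avg. individual utility | (0.78, 0.91) | (0.77, 0.69) | (0.75, 0.76) | (0.67, 0.71) | (0.55, 0.51) |
| Party | Avg. social welfare utility | (1.38, 1.50) | (1.37, 1.35) | (1.30, 1.23) | (1.40, 1.46) | (1.39, 1.37) |
| Party | Avg. negotiation rounds | (4.83, 2.50) | (20.30, 92.69) | (1068.25, 1093.44) | (520.63, 1082.00) | (338.01, 296.05) |

Table 3: Performance comparison of ANESIA* VS ANESIA VS AgentGG VS KakeSoba VS SAGA (with Strategy Template)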
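| Domain | Metric | ANESIA* | WinkyAgent | AgentGP | FSEGA2019 |
|---|---|---|---|---|---|
| Laptop | Avg. individual utility | (0.92, 0.92) | (0.88, 0.86) | (0.77, 0.75) | (0.90, 1.00) |
| Laptop | Avg. social welfare utility | (0.94, 0.93) | (0.89, 0.85) | (0.74, 0.75) | (0.88, 0.76) |
| Holiday | Avg. individual utility | (0.86, 0.82) | (0.77, 0.78) | (0.79, 0.80) | (0.84, 0.84) |
| Holiday | Avg. social welfare utility | (1.68, 1.69) | (1.63, 1.63) | (1.55, 1.57) | (1.66, 1.61) |
| Party | Avg. individual utility | (0.72, 0.70) | (0.61, 0.67) | (0.60, 0.62) | (0.71, 0.63) |
| Party | Avg. social welfare utility | (1.47, 1.35) | (1.36, 1.38) | (1.31, 1.31) | (1.27, 1.30) |
| Smart Energy Grid | Avg. individual utility | (0.70, 0.75) | NA | (0.68, 0.70) | (0.71, 0.65) |
| Smart Energy Grid | Avg. social welfare utility | (1.42, 1.42) | NA | (1.38, 1.41) | (1.40, 1.39) |

Table 4: Performance comparison of ANESIA* VS WinkyAgent VS FSEGA2019 VS AgentGP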

Hypothesis B: Pareto-Optimality

Figure 3 shows three different plots using true and estimated preference profiles. We can clearly see that there is not much distance between the frontier obtained with the estimated user and opponent models and that with true models. This evidences the potential of NSGA-II for generating the Pareto-optimal bids as well as the closeness of estimated utility models to the true utility models. Due to space limitations, we show the results with only one domain (Party) under only a single negotiation setting.

Hypothesis C: Impact of Dynamic Threshold Utility

We tested an ANESIA agent in a GENIUS tournament setting against AgentGG, KakeSoba and SAGA for a total of 120 sessions in 3 different domains (Laptop, Holiday and Party) where each agent negotiates with every other agent. Table 2 compares their performance. We choose two different user profiles with two different preference uncertainties in each domain. According to our results, our agent employing the ANESIA model outperforms the other strategies on the metrics reported in Table 2, which validates the hypothesis. During experiments, we have also observed that our agent becomes picky and learns to focus on getting the maximum utility from the end agreement (by accepting or proposing a bid from/to the opponent only if a certain dynamic (or learned) threshold utility is met), and hence the successful negotiation rate is low. However, the proportion of successful negotiations can be accommodated in the reward function to bias our learning to optimize this metric.

Hypothesis D: Strategy Template

The results in Table 3 show that our agent ANESIA* learns to choose the optimal tactics at run time and outperforms both the non-optimized ANESIA and the teacher strategies on which it was trained using DDPG.

Hypothesis E: Adaptiveness of the proposed model

We deploy our agent in a negotiation domain called Smart Energy Grid (), which already exists in GENIUS, against the ANAC'19 agents that won the competition based on joint utility, i.e., WinkyAgent, FSEGA2019 and AgentGP. (We don't check the since our deadline is given in terms of rounds (60 rounds), as the readily available WinkyAgent code works only with discrete time, not continuous time.) These agents and the domain are different from those our agent was initially trained on. The results presented in Table 4 over 2 different user preference uncertainties () demonstrate the benefits of our agent strategy built upon the given template over the other existing strategies. (WinkyAgent throws a Timeout Exception when deciding an action during each turn in this domain, hence the NA entries in Table 4.) This confirms our hypothesis that our model ANESIA with optimized strategies can learn to adapt at run time to different negotiation settings against different unknown opponents.


ANESIA is a novel model encapsulating different types of learning to help an agent negotiate over multiple issues under user preference uncertainty. The model uses a stochastic search based on Cuckoo Search optimization for user modeling, and combines NSGA-II and TOPSIS to generate Pareto-optimal bids during negotiation according to the estimated user and opponent models. An ANESIA agent uses a strategy template to learn which tactic to employ when deciding whether to accept or bid at a particular time during negotiation. The model implements an actor-critic architecture-based DDPG to evaluate the target threshold utility value, below which the agent neither accepts bids from nor proposes bids to the opponent. We have empirically evaluated the performance of ANESIA against the winning agent strategies of the ANAC'19 tournament in different settings, showing that ANESIA outperforms them. Moreover, our template-based strategy exhibits adaptive behaviour, as it helps the agent transfer its knowledge to environments with unknown opponent agents that were unseen during training.
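As an illustration of the selection step, the following is a minimal sketch of textbook TOPSIS applied to a small, hypothetical Pareto front. It assumes that both criteria (user utility and opponent utility) are benefit criteria and are equally weighted; the front, the weights and the function name are assumptions made for the example, not values from our experiments.

```python
from math import sqrt

def topsis(rows, weights):
    """Score alternatives (rows of benefit criteria) by closeness to the ideal point.

    Assumes at least two distinct rows; identical rows would give a zero denominator.
    """
    n_criteria = len(weights)
    # Vector-normalize each column, then apply the criterion weights.
    norms = [sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n_criteria)]
    weighted = [[weights[j] * r[j] / norms[j] for j in range(n_criteria)] for r in rows]
    # Ideal and anti-ideal points: best and worst value per criterion.
    ideal = [max(col) for col in zip(*weighted)]
    anti = [min(col) for col in zip(*weighted)]
    scores = []
    for w in weighted:
        d_best = sqrt(sum((x - i) ** 2 for x, i in zip(w, ideal)))
        d_worst = sqrt(sum((x - a) ** 2 for x, a in zip(w, anti)))
        scores.append(d_worst / (d_best + d_worst))
    return scores

# Hypothetical Pareto front of (user utility, opponent utility) pairs.
front = [(0.9, 0.2), (0.7, 0.6), (0.3, 0.9)]
scores = topsis(front, weights=(0.5, 0.5))
best = front[scores.index(max(scores))]
print(best)
```

With equal weights the balanced bid is preferred over the two extremes; shifting weight towards the first criterion would favour bids that are better for the user.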


  • Alrayes et al. (2018) Bedour Alrayes, Özgür Kafalı, and Kostas Stathis. Concurrent bilateral negotiation for open e-markets: the CONAN strategy. Knowledge and Information Systems, 56(2):463–501, 2018.
  • Baarslag et al. (2014) Tim Baarslag, Koen Hindriks, Mark Hendrikx, Alexander Dirkzwager, and Catholijn Jonker. Decoupling negotiating agents to explore the space of negotiation strategies. In Novel Insights in Agent-based Complex Automated Negotiation, pages 61–83. Springer, 2014.
  • Baarslag et al. (2016) Tim Baarslag, Mark JC Hendrikx, Koen V Hindriks, and Catholijn M Jonker. Learning about the opponent in automated bilateral negotiation: a comprehensive survey of opponent modeling techniques. Autonomous Agents and Multi-Agent Systems, 30(5):849–898, 2016.
  • Bagga et al. (2020) Pallavi Bagga, Nicola Paoletti, Bedour Alrayes, and Kostas Stathis. A deep reinforcement learning approach to concurrent bilateral negotiation. In IJCAI, 2020.
  • Bakker et al. (2019) Jasper Bakker, Aron Hammond, Daan Bloembergen, and Tim Baarslag. RLBOA: A modular reinforcement learning framework for autonomous negotiating agents. In AAMAS, pages 260–268, 2019.
  • Costantini et al. (2013) Stefania Costantini, Giovanni De Gasperis, Alessandro Provetti, and Panagiota Tsintza. A heuristic approach to proposal-based negotiation: with applications in fashion supply chain management. Mathematical Problems in Engineering, 2013, 2013.
  • Deb et al. (2002) Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.
  • El-Ashmawi et al. (2020) Walaa H El-Ashmawi, Diaa Salama Abd Elminaam, Ayman M Nabil, and Esraa Eldesouky. A chaotic owl search algorithm based bilateral negotiation model. Ain Shams Engineering Journal, 2020.
  • Etghani et al. (2013) Mir Majid Etghani, Mohammad Hassan Shojaeefard, Abolfazl Khalkhali, and Mostafa Akbari. A hybrid method of modified NSGA-II and TOPSIS to optimize performance and emissions of a diesel engine using biodiesel. Applied Thermal Engineering, 59(1-2):309–315, 2013.
  • Fatima et al. (2001) S Shaheen Fatima, Michael Wooldridge, and Nicholas R Jennings. Optimal negotiation strategies for agents with incomplete information. In International Workshop on Agent Theories, Architectures, and Languages, pages 377–392. Springer, 2001.
  • Fatima et al. (2002) Shaheen S Fatima, Michael Wooldridge, and Nicholas R Jennings. Multi-issue negotiation under time constraints. In Proceedings of the first international joint conference on Autonomous agents and multiagent systems: part 1, pages 143–150, 2002.
  • Fatima et al. (2005) Shaheen S Fatima, Michael Wooldridge, and Nicholas R Jennings. A comparative study of game theoretic and evolutionary models of bargaining for software agents. Artificial Intelligence Review, 23(2):187–205, 2005.
  • Fatima et al. (2006) S Shaheen Fatima, Michael J Wooldridge, and Nicholas R Jennings. Multi-issue negotiation with deadlines. Journal of Artificial Intelligence Research, 27:381–417, 2006.
  • Hashmi et al. (2013) Khayyam Hashmi, Amal Alhosban, Erfan Najmi, Zaki Malik, et al. Automated web service quality component negotiation using NSGA-2. In 2013 ACS International Conference on Computer Systems and Applications (AICCSA), pages 1–6. IEEE, 2013.
  • Klein et al. (2003) Mark Klein, Peyman Faratin, Hiroki Sayama, and Yaneer Bar-Yam. Negotiating complex contracts. Group Decision and Negotiation, 12(2):111–125, 2003.
  • Lang and Fink (2015) Fabian Lang and Andreas Fink. Learning from the metaheuristics: Protocols for automated negotiations. Group Decision and Negotiation, 24(2):299–332, 2015.
  • Lau et al. (2006) Raymond YK Lau, Maolin Tang, On Wong, Stephen W Milliner, and Yi-Ping Phoebe Chen. An evolutionary learning approach for adaptive negotiation agents. International Journal of Intelligent Systems, 21(1):41–72, 2006.
  • Li and Zhang (2012) Kejing Li and Xiaobing Zhang. Using NSGA-II and TOPSIS methods for interior ballistic optimization based on one-dimensional two-phase flow model. Propellants, Explosives, Pyrotechnics, 37(4):468–475, 2012.
  • Lillicrap et al. (2016) Timothy Paul Lillicrap, Jonathan James Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), 2016.
  • Lin et al. (2014) Raz Lin, Sarit Kraus, Tim Baarslag, Dmytro Tykhonov, Koen Hindriks, and Catholijn M Jonker. Genius: An integrated environment for supporting the design of generic automated negotiators. Computational Intelligence, 30(1):48–70, 2014.
  • Méndez et al. (2009) Máximo Méndez, Blas Galván, Daniel Salazar, and David Greiner. Multiple-objective genetic algorithm using the multiple criteria decision making method TOPSIS. In Multiobjective Programming and Goal Programming, pages 145–154. Springer, 2009.
  • Razeghi et al. (2020) Yousef Razeghi, Celal Ozan Berk Yavaz, and Reyhan Aydoğan. Deep reinforcement learning for acceptance strategy in bilateral negotiations. Turkish Journal of Electrical Engineering & Computer Sciences, 28(4):1824–1840, 2020.
  • Rubinstein (1982) Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica: Journal of the Econometric Society, pages 97–109, 1982.
  • Silva et al. (2018) Francisco Silva, Ricardo Faia, Tiago Pinto, Isabel Praça, and Zita Vale. Optimizing opponents selection in bilateral contracts negotiation with particle swarm. In International Conference on Practical Applications of Agents and Multi-Agent Systems, pages 116–124. Springer, 2018.
  • Tsimpoukis et al. (2018) Dimitrios Tsimpoukis, Tim Baarslag, Michael Kaisers, and Nikolaos G Paterakis. Automated negotiations under user preference uncertainty: A linear programming approach. In International Conference on Agreement Technologies, pages 115–129. Springer, 2018.
  • Tunalı et al. (2017) Okan Tunalı, Reyhan Aydoğan, and Victor Sanchez-Anguix. Rethinking frequency opponent modeling in automated negotiation. In International Conference on Principles and Practice of Multi-Agent Systems, pages 263–279. Springer, 2017.
  • Tzeng and Huang (2011) Gwo-Hshiung Tzeng and Jih-Jeng Huang. Multiple attribute decision making: methods and applications. CRC press, 2011.
  • Wang et al. (2016) Dengfeng Wang, Rongchao Jiang, and Yinchong Wu. A hybrid method of modified NSGA-II and TOPSIS for lightweight design of parameterized passenger car sub-frame. Journal of Mechanical Science and Technology, 30(11):4909–4917, 2016.
  • Yang and Deb (2009) Xin-She Yang and Suash Deb. Cuckoo search via Lévy flights. In 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC), pages 210–214. IEEE, 2009.
  • Zeelanbasha et al. (2020) N Zeelanbasha, V Senthil, and G Mahesh. A hybrid approach of NSGA-II and TOPSIS for minimising vibration and surface roughness in machining process. International Journal of Operational Research, 38(2):221–254, 2020.