The Sharer's Dilemma in Collective Adaptive Systems of Self-Interested Agents

04/28/2018
by   Lenz Belzner, et al.

In collective adaptive systems (CAS), adaptation can be implemented by optimization wrt. utility. Agents in a CAS may be self-interested, while their utilities may depend on other agents' choices. Independent optimization of agent utilities may yield poor individual and global reward due to locally interfering individual preferences. Joint optimization may scale poorly, and is impossible if agents cannot expose their preferences due to privacy or security issues. In this paper, we study utility sharing for mitigating this issue. Sharing utility with others may incentivize individuals to consider choices that are locally suboptimal but increase global reward. We illustrate our approach with a utility sharing variant of distributed cross entropy optimization. Empirical results show that utility sharing increases expected individual and global payoff in comparison to optimization without utility sharing. We also investigate the effect of greedy defectors in a CAS of sharing, self-interested agents. We observe that defection increases the mean expected individual payoff at the expense of sharing individuals' payoff. We empirically show that the choice between defection and sharing yields a fundamental dilemma for self-interested agents in a CAS.



1 Introduction

In collective adaptive systems (CAS), adaptation can be implemented by optimization wrt. utility, e.g. using multi-agent reinforcement learning or distributed statistical planning [1, 2, 3, 4, 5]. Agents in a CAS may be self-interested, while their utilities may depend on other agents' choices. This kind of situation arises frequently when agents compete for scarce resources. Independent optimization of each agent's utility may yield poor individual and global payoff due to locally interfering individual preferences in the course of optimization [6, 7]. Joint optimization may scale poorly, and is impossible if agents do not want to expose their preferences due to privacy or security issues [8].

A minimal example of such a situation is the coin game [9] (cf. Figure 1). Here, a yellow and a blue agent compete for coins, which are themselves colored yellow or blue. Both agents can decide whether to pick up a coin or not. If both agents opt to pick it up, one of them receives it uniformly at random. If an agent picks up a coin of its own color, it receives a reward of 2. If it picks up a differently colored coin, it gets a reward of 1. Each agent wants to maximize its individual reward. If agents act purely self-interestedly, each agent tries to pick up every coin, resulting in suboptimal global reward. However, if rewards can be shared among agents, agents will only pick up coins of their own color: they receive a share that is high enough to compensate for not picking up differently colored coins. This increases individual and global reward alike.

There are many examples of this kind of situation. For instance, energy production in the smart grid can be modeled in terms of a CAS of self-interested agents. Each participant has to decide locally how much energy to produce. Each agent wants to maximize its individual payoff by selling energy to consumers in the grid. However, the price depends on global production, and global overproduction is penalized. Routing of vehicles poses similar problems: each vehicle wants to reach its destination in a minimal amount of time, but roads are a constrained resource, and for a globally optimal solution, only a fraction of vehicles should opt for the shortest route. In both scenarios, the ability of agents to share payoff may increase individual and global reward alike.

Figure 1:

Two agents competing for a coin: if agent 1 (yellow) on the left happens to get the coin, it receives a reward of +1, whereas agent 2 (blue) would receive a reward of +2 for it. If there is a fifty-fifty chance for an agent to get the coin when both agents try to collect it, the expected values are 0.5 for agent 1 and 1 for agent 2 when both agents independently optimize their utility. In contrast, if there is the possibility to share reward, agents could learn to do the following: agent 1 (yellow) refrains from collecting the coin, which increases the blue agent's probability of getting a reward to 1; the blue agent then transfers reward (e.g. 1) to the yellow agent. This leaves both agents with an expected value of 1 and therefore constitutes a strong Pareto improvement over the former outcome.
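To make the payoff arithmetic of Figure 1 concrete, here is a minimal Python sketch (function and variable names are ours, purely illustrative):

```python
# Expected payoffs for a single blue coin in the coin game of Figure 1.
# Rewards: 2 for picking up a coin of one's own color, 1 otherwise;
# ties are broken uniformly at random when both agents compete.

def expected_payoffs(both_compete: bool, transfer: float = 1.0):
    """Return expected payoffs (yellow, blue) for one blue coin."""
    if both_compete:
        # Each agent wins the coin with probability 0.5.
        return 0.5 * 1, 0.5 * 2
    # Yellow abstains; blue collects its own-color coin and transfers reward.
    return transfer, 2 - transfer

print(expected_payoffs(both_compete=True))   # (0.5, 1.0)
print(expected_payoffs(both_compete=False))  # (1.0, 1.0): a Pareto improvement
```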


In this paper, we study distributed optimization with utility sharing for mitigating the cost that conflicting individual goals impose on expected individual and global reward. To illustrate our ideas, we propose a utility sharing variant of distributed cross entropy optimization. Empirical results show that utility sharing increases expected individual and global payoff in comparison to optimization without utility sharing.

We then investigate the effect of defectors participating in a CAS of sharing, self-interested agents. We observe that defection increases the mean expected individual payoff at the expense of sharing individuals’ payoff. We empirically show that the choice between defection and sharing yields a fundamental dilemma for self-interested agents in a CAS.

The paper makes the following contributions.

  • We motivate utility sharing as a means to mitigate conflicts and increase expected individual and global reward in CAS of self-interested agents.

  • We propose distributed optimization with sharing (DOS) as an algorithm to realize utility sharing in self-interested CAS.

  • We evaluate DOS empirically, showing that it increases individual and global reward in expectation.

  • We investigate the effect of defecting, non-sharing individuals in a group of self-interested sharing agents. We show that the choice between defection and cooperation yields a fundamental dilemma for self-interested agents in collective adaptive systems.

The remainder of the paper is structured as follows. In Section 2 we discuss related work. We introduce DOS in Section 3. We discuss our empirical results and the Sharer’s Dilemma in Section 4. We conclude in Section 5.

2 Related Work

In general, we see our work in the context of collective adaptive systems (CAS) [2, 3] and multi-agent systems [10]. In particular, we are interested in CAS where agents are adaptive through optimization of actions or policies wrt. a given individual or global utility function. These settings can for example be modeled in terms of distributed constrained optimization problems [11], or as stochastic games [12].

Searching for optimal actions or learning policies can be done by open- or closed-loop planning, potentially enhanced with learned components such as search exploration policies or value functions [5, 13, 14, 15, 16]. Another approach for learning optimal policies in multi-agent domains such as CAS is multi-agent reinforcement learning (MARL) [1, 17] and its modern variants based on deep learning for scaling up to more complex domains [4, 18, 19]. A recent example of planning-based deep MARL combines open-loop search and learned value functions in fully cooperative multi-agent domains [5].

For the case of self-interested agents, the Coco-Q algorithm was proposed [20]. Coco-Q has been evaluated for discrete two-player matrix games, and requires explicit knowledge of other agents' utilities. In some sense, our study of sharing in CAS extends the Coco-Q approach to continuous optimization with more than two agents. Also, we model the amount of sharing as a free parameter to be learned in the course of optimization.

In the context of research on emergent social effects in MARL [6, 7, 21, 9], a recent report investigated the effects of inequity aversion and utility sharing in temporally extended dilemmas [22]. The authors state that "it remains to be seen whether emergent inequity-aversion can be obtained by evolving reinforcement learning agents" [22]. Our current work is a first step in this direction, and shows that the question of whether to share or not poses a dilemma in and of itself, at least in the case of stateless optimization (in contrast to learning policies).


3 Distributed Optimization with Sharing

We model decision making in a CAS as a stochastic game $(S, N, A, T, R)$ [12].

  • $S$ is a finite set of states.

  • $N$ is a finite set of agents.

  • $A = \prod_{i \in N} A_i$ is a set of joint actions. $A_i$ is a finite set of actions for agent $i \in N$.

  • $T(s' \mid s, a)$ is a distribution modeling the probability that executing joint action $a \in A$ in state $s \in S$ yields state $s' \in S$.

  • $R = \{r_i\}_{i \in N}$ is a set of reward functions, one for each agent.

In the following, we assume $S$ consists of a single state, and denote the number of agents by $n = |N|$. As the state is unique, we will not consider it in further notation.

We assume that $r_i$ is available to agent $i$ in terms of a generative model that may be queried for samples $r_i(a)$, e.g. a simulation of the application domain. Each agent only has access to its own reward function, but does not know the reward functions of other agents.

The task of a self-interested agent is to find an action that maximizes its payoff. However, its payoff in general depends on the choices of other agents. One way to deal with this dependency is to perform optimization jointly for all agents, that is, to search for $\arg\max_{a \in A} \sum_{i \in N} r_i(a)$. However, in a CAS with self-interested agents, each participant tries to maximize its individual reward. Also, in many situations participating agents would not want to expose their individual reward functions to others due to privacy or security issues [8]. In these situations, joint optimization wrt. global reward is not feasible. Note that optimization of self-interested individuals is non-stationary due to changes in others' choices as they optimize for themselves.

3.1 Reward Sharing

We define agents' utilities as functions $u_i : A \to \mathbb{R}$. We consider the two different cases we are interested in:

  1. Individual, purely self-interested optimization

  2. Self-interested optimization with the option to share individual rewards

3.1.1 Pure Self-Interest

When optimizing independently and purely self-interestedly, $u_i(a) = r_i(a)$.

3.1.2 Sharing

Sharing agents choose a share $c_i \in \mathbb{R}_{\geq 0}$ in addition to their action $a_i$. We denote the joint shares by $c = (c_i)_{i \in N}$. Given $n$ agents, a joint action $a$ and a joint share $c_i$ for all $i \in N$, we define individual agents' utility for distributed optimization with sharing as follows.¹

$u_i(a, c) = r_i(a) - c_i + \sum_{j \in N \setminus \{i\}} \frac{c_j}{n - 1}$   (1)

Shares are uniformly distributed among all other agents; there are no bilateral shares. Note that this sharing mechanism is an arbitrary choice.

¹We can account for the change of signature of $u_i$ by extending the action space of each agent accordingly: $A_i' = A_i \times \mathbb{R}_{\geq 0}$.

For example, with $n = 2$, sharing yields the following utilities for the two agents: $u_1(a, c) = r_1(a) - c_1 + c_2$ and $u_2(a, c) = r_2(a) - c_2 + c_1$.
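A minimal sketch of Equation 1 in Python, assuming rewards and shares are given as NumPy arrays (names are ours):

```python
import numpy as np

def sharing_utility(rewards: np.ndarray, shares: np.ndarray) -> np.ndarray:
    """u_i = r_i - c_i + sum_{j != i} c_j / (n - 1), cf. Equation 1."""
    n = len(rewards)
    received = (shares.sum() - shares) / (n - 1)  # equal split from all others
    return rewards - shares + received

# Two-agent case: each agent simply receives the other's share.
print(sharing_utility(np.array([2.0, 1.0]), np.array([0.5, 0.0])))  # [1.5 1.5]
```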

3.2 Distributed Optimization with Sharing

We now give a general formulation of distributed optimization with sharing (DOS), shown in Algorithm 1. Each agent $i$ maintains a policy $\pi_i$, i.e. a distribution over actions and shares. It is initialized with an arbitrary prior distribution. A rational agent wants to optimize its policy such that its expected utility is maximized: $\arg\max_{\pi_i} \mathbb{E}_{(a, c) \sim \pi}[u_i(a, c)]$, where $\pi = \prod_{i \in N} \pi_i$ denotes the joint policy. Note that optimization of an individual's policy depends on the policies of all other agents. Also note that policy optimization of self-interested individuals is non-stationary due to changes in others' policies as they optimize for themselves.

After initialization, DOS performs the following steps for a predefined number of iterations.

  1. Each agent samples a multiset of actions from its policy and communicates it to other agents.

  2. A list of joint actions is constructed from the communicated action lists of other agents.

  3. The utility of each joint action is determined according to Equation 1.

  4. The policy is updated in a way that increases the likelihood of sampling high-utility actions and shares.

After $k$ iterations, each agent samples an action and a share from its policy, executes the action, and shares reward accordingly. The resulting joint action yields the global result of DOS.

1: initialize $\pi_i$ for each agent $i \in N$
2: for $k$ iterations do
3:     for each agent $i \in N$ do
4:         sample a list of actions from $\pi_i$
5:         broadcast sampled actions
6:     for each agent $i \in N$ do
7:         build joint actions
8:         determine utility according to Eq. 1
9:         update $\pi_i$ to increase the likelihood of high-utility samples
10: for each agent $i \in N$ do
11:     execute $a_i$ and share $c_i$ sampled from $\pi_i$
Algorithm 1 Distributed Optimization with Sharing (DOS)
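The following Python skeleton mirrors Algorithm 1; the policy representation and update rule are deliberately left abstract, and the agent interface it assumes is our own (it is instantiated concretely by CE-DOS below):

```python
def dos(agents, k: int, n_samples: int):
    """Generic DOS loop (Algorithm 1). Each agent is assumed to provide
    sample(n) -> list of (action, share) pairs, utility(joint, i) -> list
    of floats according to Eq. 1, and update(own_samples, utilities)."""
    for _ in range(k):
        # Lines 3-5: each agent samples from its policy and broadcasts.
        samples = [agent.sample(n_samples) for agent in agents]
        # Line 7: the s-th sample of every agent forms one joint sample.
        joint = list(zip(*samples))
        for i, agent in enumerate(agents):
            # Lines 8-9: evaluate utilities and update the policy.
            utilities = agent.utility(joint, i)
            agent.update(samples[i], utilities)
    # Lines 10-11: execute and share one final sample per agent.
    return [agent.sample(1)[0] for agent in agents]
```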

3.3 Cross-Entropy DOS

In general, DOS is parametric wrt. the modeling and updating of policies $\pi_i$. As an example, we instantiate DOS with cross entropy optimization [23]. We label this instantiation CE-DOS.

For CE-DOS, we model a policy $\pi_i$ as an isotropic normal distribution $\mathcal{N}(\mu_i, \sigma_i^2)$. I.e., each parameter of an action is sampled from a normal distribution that is independent of the other action parameter distributions. Note that it is also possible to model policies in terms of a normal distribution with full covariance, but the simpler and computationally less expensive isotropic representation suffices for our illustrative concerns. As a prior, CE-DOS requires an initial mean $\mu_0$ and standard deviation $\sigma_0$ for each policy (cf. Algorithm 2, line 1). I.e., initial actions and shares before any optimization are sampled as follows.

$(a_i, c_i) \sim \mathcal{N}(\mu_0, \sigma_0^2)$   (2)

Updating a policy (cf. Algorithm 2, lines 12-15) is done by recalculating mean and variance of the normal distribution. We want the update to increase the expected sample utility. In each of $k$ iterations, we sample actions and shares from each agent's policy, and build the corresponding joint actions $a$ and shares $c$.

Each agent evaluates the sampled actions and shares according to its utility $u_i$. From the set of evaluated samples of each agent, we drop a fraction of samples wrt. their utilities, that is, we only keep high-utility samples (the elite samples). We then compute mean and variance of the action parameters in the reduced set, and use them to update the policy. A learning rate $\alpha$ determines the impact of the new mean and variance on the existing distribution parameters: E.g. let $\mu_t$ and $\sigma_t$ be the mean and standard deviation of a normal distribution modeling a policy at iteration $t$, then

$\mu_{t+1} = \alpha \mu_e + (1 - \alpha) \mu_t$   $\sigma_{t+1} = \alpha \sigma_e + (1 - \alpha) \sigma_t$

where $\mu_e$ and $\sigma_e$ are the mean and standard deviation of the elite samples. We require a lower bound $\sigma_{\min}$ on the standard deviation of policies in order to maintain a minimum amount of exploration.
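In code, this smoothed update might look as follows (a sketch; `elite` is an array of retained (action, share) samples, and the default values are illustrative, not taken from the paper):

```python
import numpy as np

def ce_update(mu, sigma, elite, alpha=0.7, sigma_min=0.05):
    """Smoothed cross-entropy update of an isotropic Gaussian policy."""
    mu_e, sigma_e = elite.mean(axis=0), elite.std(axis=0)
    mu = alpha * mu_e + (1 - alpha) * mu
    sigma = alpha * sigma_e + (1 - alpha) * sigma
    return mu, np.maximum(sigma, sigma_min)  # keep a minimum of exploration
```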

The hyperparameters of CE-DOS are thus as follows.

  • A stochastic game

  • Number of iterations $k$

  • Number of samples $n_s$ drawn from each policy at each iteration

  • Prior mean $\mu_0$ and standard deviation $\sigma_0$ for policies

  • Lower bound $\sigma_{\min}$ on the policy standard deviations

  • Fraction $\rho$ of elite samples to keep

  • Learning rate $\alpha$

1: initialize $\pi_i = \mathcal{N}(\mu_0, \sigma_0^2)$ for each agent $i \in N$
2: for $k$ iterations do
3:     for each agent $i \in N$ do
4:         sample $n_s$ actions $a_i$ and shares $c_i$ from $\pi_i$
5:         clip $c_i$ such that $c_i \geq 0$
6:         broadcast sampled actions and shares
7:     for each agent $i \in N$ do
8:         build joint actions $a$ and shares $c$
9:         determine utility $u_i$ according to Eq. 1
10:         keep the fraction $\rho$ of elite samples with highest utility
11:         compute $\mu_e$ and $\sigma_e$ from the elite samples
12:         $\mu_{t+1} \leftarrow \alpha \mu_e + (1 - \alpha) \mu_t$
13:         $\sigma_{t+1} \leftarrow \alpha \sigma_e + (1 - \alpha) \sigma_t$
14:         $\sigma_{t+1} \leftarrow \max(\sigma_{t+1}, \sigma_{\min})$
15:         $\pi_i \leftarrow \mathcal{N}(\mu_{t+1}, \sigma_{t+1}^2)$
16: for each agent $i \in N$ do
17:     sample $a_i, c_i \sim \pi_i$
18:     execute $a_i$ and share $c_i$
Algorithm 2 Cross Entropy DOS
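Putting the pieces together, here is a compact, self-contained CE-DOS sketch under our reading of Algorithm 2 (hyperparameter defaults and the toy reward in the usage note are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def ce_dos(reward_fns, k=100, n_samples=50, rho=0.2, alpha=0.7,
           mu_0=0.0, sigma_0=1.0, sigma_min=0.05):
    """CE-DOS sketch (Algorithm 2). Each policy is an isotropic Gaussian
    over (action, share); reward_fns[i](actions) returns agent i's reward."""
    n = len(reward_fns)
    mu = np.full((n, 2), mu_0)        # per-agent mean of (action, share)
    sigma = np.full((n, 2), sigma_0)  # per-agent standard deviation
    for _ in range(k):
        # Sample n_samples (action, share) pairs per agent; clip shares >= 0.
        samples = rng.normal(mu[:, None, :], sigma[:, None, :],
                             size=(n, n_samples, 2))
        samples[:, :, 1] = np.clip(samples[:, :, 1], 0.0, None)
        actions, shares = samples[:, :, 0], samples[:, :, 1]
        rewards = np.array([[reward_fns[i](actions[:, s])
                             for s in range(n_samples)] for i in range(n)])
        # Utility per Eq. 1: own reward minus own share, plus an equal
        # split of all other agents' shares.
        received = (shares.sum(axis=0) - shares) / (n - 1)
        utilities = rewards - shares + received
        # Per-agent elite selection and smoothed policy update.
        n_elite = max(1, int(rho * n_samples))
        for i in range(n):
            elite = samples[i][np.argsort(utilities[i])[-n_elite:]]
            mu[i] = alpha * elite.mean(axis=0) + (1 - alpha) * mu[i]
            sigma[i] = np.maximum(
                alpha * elite.std(axis=0) + (1 - alpha) * sigma[i], sigma_min)
    return rng.normal(mu, sigma)  # final (action, share) per agent
```

For instance, `ce_dos([lambda a, i=i: -abs(a[i] - 1.0) for i in range(3)])` converges towards actions near 1 and, since these toy rewards are independent, shares near 0.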

4 Experimental Results and the Sharer’s Dilemma

We experimentally analyzed the effects of sharing in collective adaptive systems of self-interested agents.

4.1 Domains

We evaluated the effect of sharing utilities with CE-DOS in two synthetic domains. In these domains, a CAS of self-interested agents has to balance individual and global resource consumption (or production, respectively).

For example, the energy market in the smart grid can be modeled as a CAS of self-interested agents. Each participant has to decide locally how much energy to produce. Each agent wants to maximize its individual payoff by selling energy to consumers in the grid. Therefore, each agent would like to maximize its individual energy production. However, the selling price per unit typically depends non-linearly on global production; for example, global overproduction is penalized.

There are a number of corresponding real-world problems, for example energy production and consumption in the smart grid, traffic routing, passenger distribution among ride hailing participants, cargo distribution in transport-as-a-service, routing of packets in networks, distribution of computational load across computers in a cluster, and many more.

We now define two market models (simple and logistic) as domains for evaluating the effects of sharing in CAS of self-interested agents.

4.1.1 Simple Market

We model individual and global production, and use their relation for calculating utilities in such a scenario. We set $A_i \subseteq \mathbb{R}_{\geq 0}$ as individual agents' action space; an action $a_i \in A_i$ models the production amount. The sum $g = \sum_{j \in N} a_j$ models the global production.

We define the reward of each agent as the relation of its own individual resource consumption to the global resource consumption, i.e. the reward correlates with an agent's market share. We introduce a slope parameter $m$ to control the utility slope of individual and global consumption.

$r_i(a) = m \cdot \frac{a_i}{\sum_{j \in N} a_j}$   (3)

In this setup, a rational agent would like to increase its own consumption until saturation. I.e., a monopoly is able to produce cheaper than two small producers, and therefore an unequal production amount unlocks more global reward. If all agents act rationally by maximizing their individual reward, in general the corresponding equilibrium does not coincide with the global optimum.
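Under the market-share reading of Equation 3 given above (our reconstruction), the reward could be implemented as:

```python
import numpy as np

def simple_market_reward(actions: np.ndarray, i: int, m: float = 1.0) -> float:
    """Agent i's reward as its slope-scaled share of global production.
    A reconstruction of Eq. 3, not a verbatim transcription of the paper."""
    g = actions.sum()
    return m * actions[i] / g if g > 0 else 0.0
```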

4.1.2 Logistic Market

We modeled another market scenario for investigating the effects of sharing in CAS of self-interested agents. As before, each agent has to choose the amount of energy to use for production of a particular good, i.e. $A_i \subseteq \mathbb{R}_{\geq 0}$, as in the simple market domain. Note that this is an arbitrary choice.

Each agent has a logistic production curve as a function of its invested energy; this models, for example, different production machine properties. The logistic curve is given as follows.

$p_i(a_i) = \frac{1}{1 + e^{-s_i (a_i - o_i)}}$   (4)

Here, $s_i$ defines the steepness of the logistic function, and $o_i$ determines the offset on the x-axis.

Global production is the sum of individual production, $g = \sum_{i \in N} p_i(a_i)$. A price function (i.e. an inverse logistic function) defines the price per produced unit, given global production $g$.

$\mathit{price}(g) = 1 - \frac{1}{1 + e^{-s_p (g - o_p)}}$   (5)

The reward for an agent is defined as the product of its produced units and the global price.

$r_i(a) = p_i(a_i) \cdot \mathit{price}(g)$   (6)
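A sketch of the logistic market following Equations 4-6 as reconstructed above (parameter names `s` and `o` are ours):

```python
import numpy as np

def production(a_i: float, s: float, o: float) -> float:
    """Logistic production curve of a single agent (Eq. 4)."""
    return 1.0 / (1.0 + np.exp(-s * (a_i - o)))

def price(g: float, s: float, o: float) -> float:
    """Inverse logistic price per unit at global production g (Eq. 5)."""
    return 1.0 - 1.0 / (1.0 + np.exp(-s * (g - o)))

def logistic_market_reward(actions, prod_params, i, price_params) -> float:
    """Agent i's reward: produced units times the global price (Eq. 6)."""
    units = [production(a, *p) for a, p in zip(actions, prod_params)]
    return units[i] * price(sum(units), *price_params)
```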

Figure 2 shows an example of production and price functions in the logistic market domain.

Figure 2: Example production functions (left) and global price function (right) in the logistic market domain.

4.2 Setup

For our experiments, we used the following setup of CE-DOS.²

²We plan to publish our code upon publication.

  • We consider a stochastic game with $n$ agents, that is $|N| = n$.

  • We set $n \in \{10, 50, 100\}$ in our experiments.

  • Individual action spaces $A_i$ were set as described in Section 4.1.

  • We define the individual reward functions as given by Equation 3 (simple market) and Equation 6 (logistic market), respectively.

  • We set the number of iterations $k$ for CE-DOS to 100.

  • We draw $n_s$ samples from the policy per iteration for each agent.

  • Prior mean $\mu_0$ and standard deviation $\sigma_0$ were set to 0 and 1, respectively.

  • We set the fraction of elite samples $\rho$.

  • We set the learning rate $\alpha$.

  • We set the minimal policy standard deviation $\sigma_{\min}$.

We sampled domain parameters uniformly from fixed intervals.

  • We sampled the slope parameter $m$ in the simple market domain.

  • We sampled the logistic steepness and offset parameters for all production and price functions in our experiments with the logistic market domain.

We varied the number of sharing agents to measure the effect of defecting (i.e. non-sharing) agents that participate in the stochastic game together with sharing individuals.

Note that for the results we report here, we clipped the sharing values such that agents are only able to share up to their current reward, i.e. $0 \leq c_i \leq r_i(a)$ for a given joint action $a$. In general, other setups with unbounded sharing are possible as well.


4.3 Effect of Sharing on Global Reward

Figure 3 shows the mean global utility gathered for varying numbers of sharing agents. We can observe that the fraction of sharing agents correlates with global utility. We also see that the effect of sharing increases with the number of participating agents.

Figure 4 shows the mean individual shared value for the corresponding experimental setups. We can see that the amount of shared value correlates with global reward: the more value is shared, the higher the global reward. We also see that the number of participating agents correlates with the effect of sharing.

Figure 3:

Global utility gathered for varying numbers of sharing agents in the simple market (left column) and logistic market (right column) domains. 10 agents (top row), 50 agents (center row) and 100 agents (bottom row) in total. Solid line shows empirical mean of 10 experimental runs, shaded areas show .95 confidence intervals. Best viewed on screen in color.

Figure 4: Mean individual shares for varying numbers of sharing agents in the simple market (left column) and logistic market (right column) domains. 10 agents (top row), 50 agents (center row) and 100 agents (bottom row) in total. Solid line shows empirical mean of 10 experimental runs, shaded areas show .95 confidence intervals. Best viewed on screen in color.

4.4 Sharer’s Dilemma

Figure 5 shows the Schelling diagrams for the corresponding experiments. A Schelling diagram compares the mean individual utility of sharers and defectors based on the global number of sharing agents [24]. We can see that agents that choose to defect gather more individual utility than the sharing ones.
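For reference, the data behind such a diagram can be computed with a small helper (our naming, not from the paper); plotting the two means against the number of sharers over all runs yields the curves of Figure 5:

```python
import numpy as np

def schelling_point(payoffs: np.ndarray, is_sharer: np.ndarray):
    """Mean individual payoff of sharers vs. defectors for one run.
    payoffs: per-agent payoffs; is_sharer: boolean mask of sharing agents."""
    sharers = payoffs[is_sharer].mean() if is_sharer.any() else np.nan
    defectors = payoffs[~is_sharer].mean() if (~is_sharer).any() else np.nan
    return int(is_sharer.sum()), sharers, defectors
```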

The shape of the Schelling diagrams in Figure 5 shows that sharing in collective adaptive systems with self-interested agents yields a dilemma in our experimental setups.

Should an individual agent share or defect?

There is no rational answer to this question for an individual self-interested agent. If the agent chooses to share, it may be exploited by other agents that defect. However, if the agent chooses to defect, it may reduce its individual return in comparison to having chosen to share.

Note that the amount of sharing is a free parameter to be optimized by DOS. This means that all behavior we observe in our experiments is emergent. The combination of available resources, interdependency of agents’ actions and the ability to share lets agents decide to share with others based on their intrinsic motivation.

Our results illustrate a potential reason for the emergence of cooperation and inequity aversion in CAS of purely self-interested agents. They also offer an explanation for the punishment of individuals that exploit societal cooperation at the expense of sharing individuals' and global reward.

Figure 5: Schelling diagrams showing mean individual utility for defectors and sharers, for varying numbers of sharing agents in the simple market (left column) and logistic market (right column) domains. Note the log scale on the y-axis. 10 agents (top row), 50 agents (center row) and 100 agents (bottom row) in total. 10 experimental runs. Best viewed on screen in color.

5 Conclusion

We summarize the ideas in this paper, discuss limitations and implications of our results, and outline avenues for further research.

5.1 Summary

In collective adaptive systems (CAS), adaptation can be implemented by optimization wrt. utility. Agents in a CAS may be self-interested, while their utilities may depend on other agents’ choices. Independent optimization of each agent’s utility may yield poor individual and global payoff due to locally interfering individual preferences in the course of optimization. Joint optimization may scale poorly, and is impossible if agents do not want to expose their preferences due to privacy or security issues.

In this paper, we studied distributed optimization with sharing for mitigating this issue. Sharing utility with others may incentivize individuals to consider choices that are locally suboptimal but increase global reward. To illustrate our ideas, we proposed a utility sharing variant of distributed cross entropy optimization. Empirical results show that utility sharing increases expected individual and global payoff in comparison to optimization without utility sharing.

We also investigated the effect of defectors participating in a CAS of sharing, self-interested agents. We observed that defection increases the mean expected individual payoff at the expense of sharing individuals’ payoff. We empirically showed that the choice between defection and sharing yields a fundamental dilemma for self-interested agents in a CAS.

5.2 Limitations

A central limitation of CE-DOS is its state- and memoryless optimization. In our formulation of utility sharing, self-interested agents optimize an individual action and share that maximize their utility. However, our formulation does not account for learning decision policies based on a current state and other learning agents. In this case, the utility of each agent would also depend on concrete states, transition dynamics, and potentially also on models agents learn about other participants [25, 26].

As there is no temporal component to the optimization problems that we studied in this paper, it is also not possible to study the effect of gathering wealth in our current setup. We think that the dynamics of sharing in temporally extended decision problems may differ from those in stateless optimization. For example, corresponding observations have been made for game theoretic dilemmas, where optimal strategies change when a game is repeated (in contrast to the optimal strategy when the game is only played once) [27]. Similar research has been conducted in the field of reinforcement learning, however without accounting for utility sharing so far [6].

We also want to point out that exposing shares eventually provides grounds for attacks by malicious agents [8]. Albeit indirectly, exposed shares carry information about individual utility landscapes, potentially allowing attackers to gather sensitive information about agents' internal motivations. Agents in critical application domains should consider this weakness when opting to share.

5.3 Future Work

In future work, we would like to transfer our approach to temporally extended domains and model sharing in CAS with multi-agent reinforcement learning. Hopefully, this would enable studying sharing and the Sharer’s Dilemma in more complex domains.

We also think that there are many interesting options for realizing sharing besides the equal distribution formulated in Eq. 1. For example, our formulation does not allow for bilateral shares or the formation of coalitions. We would also be interested in studying the effect of wealth on emergent cooperation and defection. Another interesting line would be to investigate the effects of punishment in CAS of self-interested agents.

As an application domain, it would be interesting to exploit the duality of planning and verification. For example, agents' utilities could model individual goal satisfaction probabilities. Sharing could then be used to increase individual and global goal satisfaction probability in CAS.

References