1 Introduction
In collective adaptive systems (CAS), adaptation can be implemented by optimization wrt. utility, e.g. using multiagent reinforcement learning or distributed statistical planning [1, 2, 3, 4, 5]. Agents in a CAS may be self-interested, while their utilities may depend on other agents’ choices. This kind of situation arises frequently when agents are competing for scarce resources. Independent optimization of each agent’s utility may yield poor individual and global payoff due to locally interfering individual preferences in the course of optimization [6, 7]. Joint optimization may scale poorly, and is impossible if agents do not want to expose their preferences due to privacy or security issues [8].
A minimal example of such a situation is the coin game [9] (cf. Figure 1). Here, a yellow and a blue agent compete for coins. The coins are also colored in yellow or blue. Both agents can decide whether to pick up the coin or not. If both agents opt to pick up the coin, one of them receives it uniformly at random. If an agent picks up a coin of its own color, it receives a reward of 2. If it picks up a differently colored coin, it receives a reward of 1. Each agent wants to maximize its individual reward. If agents act purely out of self-interest, then each agent tries to pick up each coin, resulting in suboptimal global reward. However, if rewards can be shared among agents, then agents will only pick up coins of their own color. They receive a share that is high enough to compensate for not picking up differently colored coins. This increases individual and global reward alike.
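The coin game payoffs can be checked by direct enumeration. The following is a minimal sketch, not the authors' code: it assumes that yellow and blue coins are equally likely (the text does not state the coin distribution), and the names `expected_rewards`, `greedy` and `own_colour` are hypothetical.

```python
from fractions import Fraction

COLORS = ("yellow", "blue")


def expected_rewards(policy):
    """Expected per-coin reward of each agent.

    policy(agent, coin) returns True if `agent` tries to pick up a coin
    of colour `coin`. Coin colours are assumed to be equally likely.
    """
    totals = {agent: Fraction(0) for agent in COLORS}
    for coin in COLORS:
        pickers = [agent for agent in COLORS if policy(agent, coin)]
        for agent in pickers:
            reward = 2 if agent == coin else 1
            # Probability 1/2 for the coin colour; if both agents pick,
            # one of them receives the coin uniformly at random.
            totals[agent] += Fraction(1, 2) * Fraction(1, len(pickers)) * reward
    return totals


# Every agent tries to pick up every coin.
greedy = expected_rewards(lambda agent, coin: True)
# Every agent only picks up coins of its own colour.
own_colour = expected_rewards(lambda agent, coin: agent == coin)

print(greedy, sum(greedy.values()))
print(own_colour, sum(own_colour.values()))
```

Under these assumptions the global expected reward per coin is 3/2 when both agents always pick, and 2 when each agent leaves differently colored coins to the other.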
There are many examples of this kind of situation. For example, energy production in the smart grid can be modeled in terms of a CAS of self-interested agents. Each participant has to decide locally how much energy to produce. Each agent wants to maximize its individual payoff by selling energy to consumers in the grid. However, the price depends on global production. Also, global overproduction is penalized. Routing of vehicles poses similar problems. Each vehicle wants to reach its destination in a minimal amount of time. However, roads are a constrained resource, and for a globally optimal solution, only a fraction of vehicles should opt for the shortest route. In both scenarios, the ability of agents to share payoff may increase individual and global reward alike.
In this paper, we study distributed optimization with utility sharing for mitigating the issue of contrasting individual goals at the cost of expected individual and global reward. To illustrate our ideas, we propose a utility sharing variant of distributed cross entropy optimization. Empirical results show that utility sharing increases expected individual and global payoff in comparison to optimization without utility sharing.
We then investigate the effect of defectors participating in a CAS of sharing, self-interested agents. We observe that defection increases the mean expected individual payoff at the expense of sharing individuals’ payoff. We empirically show that the choice between defection and sharing yields a fundamental dilemma for self-interested agents in a CAS.
The paper makes the following contributions.

We motivate utility sharing as a means to mitigate conflicts and increase expected individual and global reward in CAS of self-interested agents.

We propose distributed optimization with sharing (DOS) as an algorithm to realize utility sharing in self-interested CAS.

We evaluate DOS empirically, showing that it increases individual and global reward in expectation.

We investigate the effect of defecting, non-sharing individuals in a group of self-interested sharing agents. We show that the choice between defection and cooperation yields a fundamental dilemma for self-interested agents in collective adaptive systems.
2 Related Work
In general, we see our work in the context of collective adaptive systems (CAS) [2, 3] and multiagent systems [10]. In particular, we are interested in CAS where agents are adaptive through optimization of actions or policies wrt. a given individual or global utility function. These settings can for example be modeled in terms of distributed constraint optimization problems [11], or as stochastic games [12].
Searching for optimal actions or learning policies can be done by open- or closed-loop planning, potentially enhanced with learned components such as search exploration policies or value functions [5, 13, 14, 15, 16]. Another approach for learning optimal policies in multiagent domains such as CAS is multiagent reinforcement learning (MARL) [1, 17] and its modern variants based on deep learning for scaling up to more complex domains [4, 18, 19]. A recent example of planning-based deep MARL combines open-loop search and learned value functions in fully cooperative multiagent domains [5].
In the case of self-interested agents, the Coco-Q algorithm was proposed [20]. Coco-Q has been evaluated for discrete two-player matrix games, and requires explicit knowledge of other agents’ utilities. In some sense, our study of sharing in CAS extends the Coco-Q approach to continuous optimization with more than two agents. Also, we model the amount of sharing as a free parameter to be learned in the course of optimization.
In the context of research on emergent social effects in MARL [6, 7, 21, 9], a recent report investigated the effects of inequity aversion and utility sharing in temporally extended dilemmas [22]. The authors state that “it remains to be seen whether emergent inequity-aversion can be obtained by evolving reinforcement learning agents” [22]. Our current work is a first step in this direction, and shows that the question of whether to share or not poses a dilemma in and of itself, at least in the case of stateless optimization (in contrast to learning policies).
Other related topics include game-theoretic dilemmas (chicken, stag hunt, etc.), transferable utility, explicit coordination mechanisms (in contrast to emergent sharing and coordination), and evolutionary game theory (survival of the fittest vs. survival of the tribe).
3 Distributed Optimization with Sharing
We model decision making in a CAS as a stochastic game $(S, N, A, P, R)$ [12].

$S$ is a finite set of states.

$N = \{1, \dots, n\}$ is a finite set of agents.

$A = A_1 \times \dots \times A_n$ is a set of joint actions. $A_i$ is a finite set of actions for agent $i \in N$.

$P(s' \mid s, a)$ is a distribution modeling the probability that executing action $a \in A$ in state $s \in S$ yields state $s' \in S$.

$R = \{R_i\}_{i \in N}$ is a set of reward functions, one for each agent.
In the following, we assume $S$ consists of a single state $s$, so that $P(s \mid s, a) = 1$ for every joint action $a$. As $s$ is unique, we will not consider it in further notation.
We assume that $R_i$ is available to agent $i$ in terms of a generative model that may be queried for samples $R_i(a)$, e.g. a simulation of the application domain. Each agent only has access to its own reward function, but does not know the reward functions of other agents.
The task of a self-interested agent is to find an action that maximizes its payoff. However, its payoff in general depends on the choices of other agents. One way to deal with this dependency is to perform optimization jointly for all agents, that is $\max_{a \in A} \sum_{i \in N} R_i(a)$. However, in a CAS with self-interested agents, each participant tries to maximize its individual reward. Also, in many situations participating agents would not want to expose their individual reward functions to others due to privacy or security issues [8]. In these situations, joint optimization wrt. global reward is not feasible. Note that optimization of self-interested individuals is nonstationary due to changes in others’ choices as they optimize for themselves.
3.1 Reward Sharing
We define agents’ utilities as $U_i : A \to \mathbb{R}$. We consider the two different cases we are interested in:

Individual, purely self-interested optimization

Self-interested optimization with the option to share individual rewards
3.1.1 Pure Self-Interest
When optimizing independently and purely self-interested, $U_i(a) = R_i(a)$.
3.1.2 Sharing
Sharing agents choose a share $s_i \geq 0$ additionally to $a_i$. We denote the joint shares by $s = (s_1, \dots, s_n)$. Given $n$ agents, a joint action $a$ and a joint share $s$ with $s_i \geq 0$ for all $i$, we define individual agents’ utility $U_i$ for distributed optimization with sharing as follows. (We can account for the change of signature of $U_i$ by extending the action space of each agent accordingly: $A_i \times \mathbb{R}$.)
$$U_i(a, s) = R_i(a) - s_i + \sum_{j \neq i} \frac{s_j}{n - 1} \qquad (1)$$
Shares are uniformly distributed among all other agents. There are no bilateral shares. Note that this sharing mechanism is an arbitrary choice.
For example, sharing yields the following utilities for two agents.
$$U_1(a, s) = R_1(a) - s_1 + s_2, \qquad U_2(a, s) = R_2(a) - s_2 + s_1$$
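The sharing rule can be sketched in a few lines. This is an illustrative reading of the statement that shares are uniformly distributed among all other agents: each agent gives away its own share and receives an equal part, 1/(n-1), of every other agent's share. The function name `shared_utilities` is hypothetical.

```python
def shared_utilities(rewards, shares):
    """Utilities after sharing for one joint action.

    rewards[i] is the individual reward of agent i, shares[i] the amount
    agent i gives away. Every share is split evenly among the other agents.
    """
    n = len(rewards)
    if n < 2 or len(shares) != n:
        raise ValueError("need at least two agents and one share per agent")
    total = sum(shares)
    # Agent i pays shares[i] and receives 1/(n-1) of all other agents' shares.
    return [r - s + (total - s) / (n - 1) for r, s in zip(rewards, shares)]


# Two agents: agent 0 earns 4 and shares 1, agent 1 earns 1 and shares nothing.
print(shared_utilities([4.0, 1.0], [1.0, 0.0]))  # [3.0, 2.0]
```

Sharing only redistributes reward: the sum of utilities equals the sum of rewards for any choice of shares.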
3.2 Distributed Optimization with Sharing
We now give a general formulation of distributed optimization with sharing (DOS). DOS is shown in Algorithm 1. Each agent $i$ maintains a policy $\pi_i$, i.e. a distribution over actions and shares. It is initialized with an arbitrary prior distribution. A rational agent wants to optimize its policy such that the expectation of reward is maximized: $\max_{\pi_i} \mathbb{E}[U_i(a, s)]$, where $(a_j, s_j) \sim \pi_j$ for all agents $j$. Note that optimization of an individual’s policy depends on the policies of all other agents. Also note that policy optimization of self-interested individuals is nonstationary due to changes in others’ policies as they optimize for themselves.
After initialization, DOS performs the following steps for a predefined number of iterations.

Each agent samples a multiset of actions from its policy and communicates it to other agents.

A list of joint actions is constructed from the communicated action lists of other agents.

The utility of each joint action is determined according to Equation 1.

The policy is updated in a way that increases the likelihood of sampling high-utility actions and shares.
After the final iteration, each agent samples an action and a share from its policy, executes the action, and shares reward accordingly. The resulting joint action yields the global result of DOS.
3.3 CrossEntropy DOS
In general, DOS is parametric wrt. modeling and updating of policies $\pi_i$. As an example, we instantiate DOS with cross entropy optimization [23]. We label this instantiation CE-DOS.
For CE-DOS, we model a policy $\pi_i$ as isotropic normal distribution $\mathcal{N}(\mu_i, \sigma_i^2)$. I.e., each parameter of an action is sampled from a normal distribution that is independent from other action parameter distributions. Note that it is also possible to model policies in terms of a normal distribution with full covariance, but the simpler and computationally less expensive isotropic representation suffices for our illustrative concerns. As prior, CE-DOS requires an initial mean $\mu_0$ and standard deviation $\sigma_0$ for a policy (cf. Algorithm 2, line 1). I.e. initial actions and shares before any optimization are sampled as follows.
$$a_i, s_i \sim \mathcal{N}(\mu_0, \sigma_0^2) \qquad (2)$$
Updating a policy (cf. Algorithm 1, lines 12 to 15) is done by recalculating mean and variance of the normal distribution. We want the update to increase the expected sample utility. In each iteration, we sample actions and shares from each agent’s policy, and build the corresponding joint actions and shares.
Each agent $i$ evaluates sampled actions and shares according to its utility $U_i$. From the set of evaluated samples of each agent, we drop a fraction of samples from the set wrt. their utilities. That is, we only keep high utility samples (the elite samples) in the set. We then compute mean and variance of the action parameters in the reduced set, and use them to update the policy. A learning rate $\alpha$ determines the impact of the new mean and variance on the existing distribution parameters. E.g. let $\mu_t$ and $\sigma_t$ be the mean and standard deviation of a normal distribution modeling a policy at iteration $t$, then
$$\mu_{t+1} = (1 - \alpha)\,\mu_t + \alpha\,\mu_{\mathrm{elite}}, \qquad \sigma_{t+1} = (1 - \alpha)\,\sigma_t + \alpha\,\sigma_{\mathrm{elite}},$$
where $\mu_{\mathrm{elite}}$ and $\sigma_{\mathrm{elite}}$ are mean and standard deviation of the elite samples. We require a lower bound $\sigma_{\min}$ on the standard deviation of policies in order to maintain a minimum amount of exploration.
The hyperparameters of CE-DOS are thus as follows.

A stochastic game

Number of iterations

Number of samples from the policy at each iteration

Prior mean and standard deviation for policies

Lower bound on the policy standard deviations

Fraction of elite samples to keep

Learning rate
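The sampling, evaluation and update steps described above can be put together in a compact sketch. This is not the authors' implementation: the function name `ce_dos`, the argument `reward_fn` and all default values except the 100 iterations and the prior (0, 1) are assumptions, the smoothed update is one common way to apply a learning rate in cross entropy optimization, and shares are clipped to the interval from zero to the agent's current reward when utilities are evaluated.

```python
import random
import statistics


def ce_dos(reward_fn, n_agents, n_iter=100, n_samples=50, mu0=0.0, sigma0=1.0,
           min_sigma=0.05, elite_frac=0.2, lr=0.5, seed=0):
    """Cross entropy DOS sketch for scalar actions and scalar shares.

    reward_fn(actions) must return one reward per agent for a joint action.
    Returns per-agent [action, share] means and standard deviations.
    """
    rng = random.Random(seed)
    mu = [[mu0, mu0] for _ in range(n_agents)]
    sigma = [[sigma0, sigma0] for _ in range(n_agents)]
    n_elite = max(1, int(elite_frac * n_samples))

    for _ in range(n_iter):
        # 1. Every agent samples (action, share) pairs from its own policy.
        samples = [[(rng.gauss(mu[i][0], sigma[i][0]),
                     rng.gauss(mu[i][1], sigma[i][1]))
                    for _ in range(n_samples)] for i in range(n_agents)]
        # 2. + 3. The k-th samples of all agents form the k-th joint action,
        # which is evaluated with rewards and uniformly distributed shares.
        utilities = [[0.0] * n_samples for _ in range(n_agents)]
        for k in range(n_samples):
            actions = [samples[i][k][0] for i in range(n_agents)]
            rewards = reward_fn(actions)
            shares = [min(max(samples[i][k][1], 0.0), max(rewards[i], 0.0))
                      for i in range(n_agents)]
            total = sum(shares)
            for i in range(n_agents):
                received = (total - shares[i]) / (n_agents - 1)
                utilities[i][k] = rewards[i] - shares[i] + received
        # 4. Each agent refits its policy to its own elite samples.
        for i in range(n_agents):
            order = sorted(range(n_samples),
                           key=lambda k: utilities[i][k], reverse=True)
            elite = [samples[i][k] for k in order[:n_elite]]
            for d in range(2):
                values = [e[d] for e in elite]
                mu[i][d] = (1 - lr) * mu[i][d] + lr * statistics.mean(values)
                new_sigma = (1 - lr) * sigma[i][d] + lr * statistics.pstdev(values)
                sigma[i][d] = max(new_sigma, min_sigma)
    return mu, sigma


def toy_reward(actions):
    # Hypothetical stand-in domain: every agent is best off at action 1.
    return [-(a - 1.0) ** 2 for a in actions]


mu, sigma = ce_dos(toy_reward, n_agents=3)
```

The toy reward only checks that the loop converges; it has no interaction between agents, so it does not reproduce the market domains or the dilemma studied in the experiments.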
4 Experimental Results and the Sharer’s Dilemma
We experimentally analyzed the effects of sharing in collective adaptive systems of self-interested agents.
4.1 Domains
We evaluated the effect of sharing utilities with CE-DOS in two synthetic domains. In these domains, a CAS of self-interested agents has to balance individual and global resource consumption (or production, respectively).
For example, the energy market in the smart grid can be modeled as a CAS of self-interested agents. Each participant has to decide locally how much energy to produce. Each agent wants to maximize its individual payoff by selling energy to consumers in the grid. Therefore, each agent would like to maximize its individual energy production. However, the selling price per unit typically depends nonlinearly on global production. For example, global overproduction is penalized.
There are a number of corresponding real world problems, for example energy production and consumption in the smart grid, traffic routing, passenger distribution to individual ride hailing participants, cargo distribution on transport as a service, routing of packets in networks, distribution of computational load to computers in a cluster, and many more.
We now define two market models (simple and logistic) as domains for evaluating the effects of sharing in CAS of self-interested agents.
4.1.1 Simple Market
We model individual and global production, and use their relation for calculating utilities in such a scenario. Each agent’s action $a_i \in A_i$ models its production amount. The sum $\sum_{i \in N} a_i$ models the global production.
We define the reward of each agent as the relation of its own individual resource consumption to the global resource consumption. I.e. the reward corresponds to an agent’s market share. We introduce a slope parameter to control the utility slope of individual and global consumption.
(3) 
In this setup, a rational agent would like to increase its own consumption until saturation. I.e. a monopoly is able to produce more cheaply than two small producers, and therefore an unequal production amount unlocks more global reward. If all agents act rationally by maximizing their individual reward, in general the corresponding equilibrium is not equal to the global optimum.
4.1.2 Logistic Market
We modeled another market scenario for investigating the effects of sharing in CAS of self-interested agents. As before, each agent has to choose the amount of energy to use for production of a particular good. I.e. the action space is the same as in the simple market domain. Note that this is an arbitrary choice.
Each agent has a logistic production curve as a function of its invested energy. For example, this models different production machine properties. The logistic curve is given as follows.
$$p_i(a_i) = \frac{1}{1 + e^{-c_i (a_i - o_i)}} \qquad (4)$$
Here, $c_i$ defines the steepness of the logistic function, and $o_i$ determines the offset on the x-axis.
Global production is the sum of individual production . A price function (i.e. an inverse logistic function) defines the price per produced unit, given global production .
(5) 
The reward for an agent is defined as the product of its produced units and the global price.
(6) 
Figure 2 shows an example of production and price functions in the logistic market domain.
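The logistic market can be sketched as follows. This is a hedged illustration rather than the authors' model: the function names are hypothetical, the production curve is assumed to be a standard logistic function with unit amplitude, and the "inverse logistic" price is read here as one minus a logistic function of global production, so that the price falls as global production grows.

```python
import math


def logistic(x, steepness, offset):
    """Standard logistic curve with the given steepness and x-axis offset."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - offset)))


def logistic_market_rewards(energies, production_params, price_params):
    """Reward of each agent: its produced units times the global unit price.

    energies[i] is the energy invested by agent i, production_params[i] its
    (steepness, offset) pair, price_params the (steepness, offset) of the
    price curve.
    """
    produced = [logistic(e, c, o)
                for e, (c, o) in zip(energies, production_params)]
    global_production = sum(produced)
    # Assumed price model: decreasing in global production.
    price = 1.0 - logistic(global_production, *price_params)
    return [units * price for units in produced]


# Two identical agents, each producing 0.5 units; the price at a global
# production of 1.0 is 0.5, so each agent earns 0.25.
print(logistic_market_rewards([0.0, 0.0], [(1.0, 0.0), (1.0, 0.0)], (1.0, 1.0)))
```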
4.2 Setup
For our experiments, we used the following setup of CE-DOS. (We plan to publish our code upon publication.)

We consider a stochastic game with agents, that is .

We set and in our experiments.

Individual action spaces were set as .

We define the individual reward functions as given by Equation 3.

We set the number of iterations for CE-DOS to 100.

We draw samples from the policy per iteration for each agent.

Prior mean and standard deviation were set to 0 and 1, respectively.

We set the fraction of elite samples .

We set the learning rate .

We set the minimal policy standard deviation .
We sampled domain parameters uniformly from the following intervals.

We sampled the slope parameter from in the simple market domain.

We sampled logistic steepness and offset from for all production and cost functions in our experiments with the logistic market domain.
We varied the number of sharing agents to measure the effect of defecting (i.e. non-sharing) agents that participate in the stochastic game together with sharing individuals.
Note that for the results we report here, we clipped the sharing values such that agents are only able to share up to their current reward, i.e. $s_i \leq R_i(a)$ for a given joint action $a$. In general, other setups with unbounded sharing are possible as well.
4.3 Effect of Sharing on Global Reward
Figure 3 shows the mean global utility gathered for varying numbers of sharing agents. We can observe that the fraction of sharing agents correlates with global utility. We also see that the effect of sharing increases with the number of participating agents.
Figure 4 shows the mean individual shared value for the corresponding experimental setups. We can see that the amount of shared value correlates with global reward. I.e. the more value shared, the higher the global reward. We also see that the number of participating agents correlates with the effect of sharing.
Figure 3: Global utility gathered for varying numbers of sharing agents in the simple market (left column) and logistic market (right column) domains. 10 agents (top row), 50 agents (center row) and 100 agents (bottom row) in total. Solid line shows empirical mean of 10 experimental runs, shaded areas show 95% confidence intervals. Best viewed on screen in color.
4.4 Sharer’s Dilemma
Figure 5 shows the Schelling diagrams for the corresponding experiments. A Schelling diagram compares the mean individual utility of sharers and defectors based on the global number of sharing agents [24]. We can see that agents that choose to defect gather more individual utility than the sharing ones.
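The data behind a Schelling diagram can be aggregated as in the following sketch. The input format is an assumption made for illustration: each run is a list with one `(is_sharer, utility)` pair per agent, and the function name `schelling_curves` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def schelling_curves(runs):
    """Mean utility of sharers and defectors per number of sharing agents.

    Returns {n_sharers: (mean_sharer_utility, mean_defector_utility)};
    an entry is None if no agent of that kind occurred for that count.
    """
    sharer_utils = defaultdict(list)
    defector_utils = defaultdict(list)
    for run in runs:
        n_sharers = sum(1 for is_sharer, _ in run if is_sharer)
        for is_sharer, utility in run:
            target = sharer_utils if is_sharer else defector_utils
            target[n_sharers].append(utility)
    counts = sorted(set(sharer_utils) | set(defector_utils))
    return {
        c: (mean(sharer_utils[c]) if sharer_utils[c] else None,
            mean(defector_utils[c]) if defector_utils[c] else None)
        for c in counts
    }


# Made-up utilities for three two-agent runs, only to show the data layout.
runs = [
    [(True, 1.0), (False, 3.0)],
    [(True, 2.0), (True, 2.0)],
    [(True, 3.0), (False, 5.0)],
]
curves = schelling_curves(runs)
print(curves)
```

Plotting the two values against the number of sharers gives the sharer and defector curves of the diagram.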
The shape of the Schelling diagrams in Figure 5 shows that sharing in collective adaptive systems with self-interested agents yields a dilemma in our experimental setups.
Should an individual agent share or defect?
There is no rational answer to this question for an individual self-interested agent. If the agent chooses to share, it may be exploited by other agents that are defecting. However, if the agent chooses to defect, it may hurt its individual return by doing so in comparison to having chosen to share.
Note that the amount of sharing is a free parameter to be optimized by DOS. This means that all behavior we observe in our experiments is emergent. The combination of available resources, interdependency of agents’ actions and the ability to share lets agents decide to share with others based on their intrinsic motivation.
Our results illustrate a potential reason for emergence of cooperation and inequity aversion in CAS of only self-interested agents. They also give an explanation for the existence of punishment of individuals that exploit societal cooperation at the cost of sharing individuals’ and global reward.
5 Conclusion
We summarize the ideas in this paper, discuss limitations and implications of our results, and outline avenues for further research.
5.1 Summary
In collective adaptive systems (CAS), adaptation can be implemented by optimization wrt. utility. Agents in a CAS may be self-interested, while their utilities may depend on other agents’ choices. Independent optimization of each agent’s utility may yield poor individual and global payoff due to locally interfering individual preferences in the course of optimization. Joint optimization may scale poorly, and is impossible if agents do not want to expose their preferences due to privacy or security issues.
In this paper, we studied distributed optimization with sharing for mitigating this issue. Sharing utility with others may incentivize individuals to consider choices that are locally suboptimal but increase global reward. To illustrate our ideas, we proposed a utility sharing variant of distributed cross entropy optimization. Empirical results show that utility sharing increases expected individual and global payoff in comparison to optimization without utility sharing.
We also investigated the effect of defectors participating in a CAS of sharing, self-interested agents. We observed that defection increases the mean expected individual payoff at the expense of sharing individuals’ payoff. We empirically showed that the choice between defection and sharing yields a fundamental dilemma for self-interested agents in a CAS.
5.2 Limitations
A central limitation of CE-DOS is its stateless and memoryless optimization. In our formulation of utility sharing, self-interested agents optimize an individual action and share that maximizes their utility. However, our formulation does not account for learning decision policies based on a current state and other learning agents. In this case, the utility of each agent would also depend on concrete states, transition dynamics and potentially also on models agents learn about other participants [25, 26].
As there is no temporal component to the optimization problems that we studied in this paper, it is also not possible to study the effect of gathering wealth in our current setup. We think that the dynamics of sharing in temporally extended decision problems may differ from the ones in stateless optimization. For example, corresponding observations have been made for game theoretic dilemmas, where optimal strategies change when repeating a game (in contrast to the optimal strategy when the game is only played once) [27]. Similar research has been conducted in the field of reinforcement learning, however not accounting for utility sharing so far [6].
We also want to point out that exposing shares may provide grounds for attack by malicious agents [8]. Albeit indirectly, exposed shares carry information about individual utility landscapes, allowing attackers to potentially gather sensitive information about agents’ internal motivations. Agents in critical application domains should consider this weakness when opting to share.
5.3 Future Work
In future work, we would like to transfer our approach to temporally extended domains and model sharing in CAS with multiagent reinforcement learning. Hopefully, this would enable studying sharing and the Sharer’s Dilemma in more complex domains.
We also think that there are many interesting options for realizing sharing besides equal distribution as formulated in Eq. 1. For example, our formulation does not allow for bilateral shares or formation of coalitions. Also, we would be interested to study the effect of wealth on emergent cooperation and defection. Another interesting line would be to investigate the effects of punishment in CAS of selfinterested agents.
As an application domain, it would be interesting to exploit the duality of planning and verification. For example, agents’ utility could model individual goal satisfaction probability. Sharing could be used to increase individual and global goal satisfaction probability in CAS.
References

 [1] Tan, M.: Multiagent reinforcement learning: Independent vs. cooperative agents. In: Proceedings of the tenth international conference on machine learning. (1993) 330–337
 [2] Hillston, J., Pitt, J., Wirsing, M., Zambonelli, F.: Collective adaptive systems: qualitative and quantitative modelling and analysis (Dagstuhl seminar 14512). In: Dagstuhl Reports. Volume 4., Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2015)
 [3] Belzner, L., Hölzl, M., Koch, N., Wirsing, M.: Collective autonomic systems: Towards engineering principles and their foundations. In: Transactions on Foundations for Mastering Change I. Springer (2016) 180–200
 [4] Foerster, J., Nardelli, N., Farquhar, G., Torr, P., Kohli, P., Whiteson, S., et al.: Stabilising experience replay for deep multiagent reinforcement learning. arXiv preprint arXiv:1702.08887 (2017)
 [5] Phan, T., Belzner, L., Gabor, T., Schmid, K.: Leveraging statistical multiagent online planning with emergent value function approximation. In: Proceedings of the 17th Conference on Autonomous Agents and Multi Agent Systems, International Foundation for Autonomous Agents and Multiagent Systems (2018)
 [6] Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T.: Multiagent reinforcement learning in sequential social dilemmas. In: Proceedings of the 16th Conference on Autonomous Agents and Multi Agent Systems, International Foundation for Autonomous Agents and Multiagent Systems (2017) 464–473
 [7] Perolat, J., Leibo, J.Z., Zambaldi, V., Beattie, C., Tuyls, K., Graepel, T.: A multi-agent reinforcement learning model of common-pool resource appropriation. In: Advances in Neural Information Processing Systems. (2017) 3646–3655
 [8] Brundage, M., Avin, S., Clark, J., Toner, H., Eckersley, P., Garfinkel, B., Dafoe, A., Scharre, P., Zeitzoff, T., Filar, B., et al.: The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv preprint arXiv:1802.07228 (2018)
 [9] Lerer, A., Peysakhovich, A.: Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068 (2017)

 [10] Van der Hoek, W., Wooldridge, M.: Multiagent systems. Foundations of Artificial Intelligence 3 (2008) 887–928
 [11] Fioretto, F., Pontelli, E., Yeoh, W.: Distributed constraint optimization problems and applications: A survey. arXiv preprint arXiv:1602.06347 (2016)
 [12] Shapley, L.S.: Stochastic games. Proceedings of the national academy of sciences 39(10) (1953) 1095–1100

 [13] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587) (2016) 484–489
 [14] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al.: Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815 (2017)
 [15] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of go without human knowledge. Nature 550(7676) (2017) 354
 [16] Anthony, T., Tian, Z., Barber, D.: Thinking fast and slow with deep learning and tree search. In: Advances in Neural Information Processing Systems. (2017) 5366–5376
 [17] Littman, M.L.: Markov games as a framework for multiagent reinforcement learning. In: Machine Learning Proceedings 1994. Elsevier (1994) 157–163
 [18] Foerster, J., Assael, I.A., de Freitas, N., Whiteson, S.: Learning to communicate with deep multiagent reinforcement learning. In: Advances in Neural Information Processing Systems. (2016) 2137–2145
 [19] Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J., Vicente, R.: Multiagent cooperation and competition with deep reinforcement learning. PloS one 12(4) (2017) e0172395
 [20] Sodomka, E., Hilliard, E., Littman, M., Greenwald, A.: Cocoq: Learning in stochastic games with side payments. In: International Conference on Machine Learning. (2013) 1471–1479
 [21] Peysakhovich, A., Lerer, A.: Prosocial learning agents solve generalized stag hunts better than selfish ones. arXiv preprint arXiv:1709.02865 (2017)
 [22] Hughes, E., Leibo, J.Z., Philips, M.G., Tuyls, K., Duéñez-Guzmán, E.A., Castañeda, A.G., Dunning, I., Zhu, T., McKee, K.R., Koster, R., et al.: Inequity aversion resolves intertemporal social dilemmas. arXiv preprint arXiv:1803.08884 (2018)
 [23] Kroese, D.P., Rubinstein, R.Y., Cohen, I., Porotsky, S., Taimre, T.: Crossentropy method. In: Encyclopedia of Operations Research and Management Science. Springer (2013) 326–333
 [24] Schelling, T.C.: Hockey helmets, concealed weapons, and daylight saving: A study of binary choices with externalities. Journal of Conflict resolution 17(3) (1973) 381–428
 [25] Foerster, J.N., Chen, R.Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., Mordatch, I.: Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326 (2017)
 [26] Rabinowitz, N.C., Perbet, F., Song, H.F., Zhang, C., Eslami, S., Botvinick, M.: Machine theory of mind. arXiv preprint arXiv:1802.07740 (2018)
 [27] Sandholm, T.W., Crites, R.H.: Multiagent reinforcement learning in the iterated prisoner’s dilemma. Biosystems 37(1-2) (1996) 147–166