
Offsetting Unequal Competition through RL-assisted Incentive Schemes

This paper investigates the dynamics of competition among organizations with unequal expertise. We use multi-agent reinforcement learning to simulate and understand the impact of various incentive schemes designed to offset such inequality. We design Touch-Mark, a game based on the well-known multi-agent particle environment, in which two teams (weak and strong) with unequal but changing skill levels compete against each other. To train agents for this game, we propose a novel controller-assisted multi-agent reinforcement learning algorithm, C-MADDPG, which equips each agent with an ensemble of policies along with a supervised controller that, by selectively partitioning the sample space, triggers intelligent role division among teammates. Using C-MADDPG as the underlying framework, we propose an incentive scheme for the weak team intended to make the final rewards of both teams equal. We find that, in spite of the incentive, the final reward of the weak team falls short of that of the strong team. On inspection, we realize that an overall incentive scheme for the weak team does not incentivize the weaker agents within that team to learn and improve. To offset this, we specifically incentivize the weaker player to learn and, as a result, observe that beyond an initial phase the weak team performs on par with the stronger team. The final goal of the paper is to formulate a dynamic incentive scheme that continuously balances the rewards of the two teams. This is achieved by devising an incentive scheme enriched with an RL agent that takes minimal information from the environment.



I Introduction

Society has evolved many mechanisms to offset inequality in real life, where unequal individuals, teams, or organizations often have to compete against each other. Examples include affirmative action [foster1992economic, austen2006redistribution], special incentives [weisskopf2004impact, ali2015prevent, mukherjee2017conditional], tax breaks [suarez2000does, alexander2009measuring, mckinnon2012firms], compensation [perry2001pay, chan2014compensation], and subsidies [schwartz1999government, amegashie2006economics]. While there is plenty of evidence that these measures help the weaker team, controversy persists around the implementation details. For example, in the economics literature, it is well recognized that the use of subsidies by the government in a competitive market can improve welfare, help domestic industry compete against international counterparts, correct a market failure, and bring social and private costs into alignment, to name only a few [amegashie2006economics, danglun2007empirical, zhao2014review, juriaith2014economics, giupponi2018subsidizing], and in the process successfully eliminate the very premise that led to the introduction of the subsidy.

However, alongside this, there is a series of works on perverse subsidies [robin2003perverse, mackintosh2006perverse, si2006perverse, srinivasan2009subsidy, stephan2012perverse, chang2018lesson], which argue that when a subsidy is not directed towards the right person or event, it may inflict various adverse effects on the economy, such as higher taxes, inefficient transfer of fiscal resources, or supply-side distortions, among many other possibilities. Hence, it is safe to argue that the design of incentive schemes to neutralize the disadvantage suffered by a weak team is a non-trivial exercise. Moreover, given the nature of the problem, it is very difficult to continuously monitor agents' responses and design a dynamic incentive mechanism accordingly.

This paper looks into this problem by considering a simple multi-agent reinforcement learning (MARL) framework that allows us to monitor the response of agents to incentive schemes and adjust the schemes dynamically in real time. To the best of our knowledge, there is no prior study on continuous monitoring of agents' responses to incentives. The framework has two major components: (1) a multi-agent two-team game, called Touch-Mark, which simulates competition and cooperation among the agents and accommodates varying levels of agent skill; and (2) a controller-assisted multi-agent reinforcement learning algorithm, called C-MADDPG. C-MADDPG builds on MADDPG [lowe2017multi]; unlike MADDPG, however, it allows efficient learning of an ensemble of agent policies and provides a controller. The controller facilitates dynamic switching among the policies based on the situation, leading the two agents of a team to take up complementary roles to secure a win. We further postulate that experiences from different roles affect the future skill level of an individual agent differently.

Using the framework described above, we study various static and dynamic incentive schemes for two teams with unequal skill levels. We find that the incentive given to the weaker team is effective and sustainable only when we direct targeted incentives towards the weaker players within the weaker team. Based on this finding, we design a dynamic incentive scheme that starts with a high additional reward for the weaker team and gradually decreases it as the weaker team learns the winning policy and progressively becomes stronger. There are several design issues in developing such a dynamic incentive mechanism. Most importantly, in real-life situations, it may not be possible to measure certain performance-related parameters dynamically. We tackle this issue by designing an RL agent that helps to dynamically predict the non-measurable parameters and design an effective incentive scheme. Overall, this paper provides a simple setup featuring a handful of characteristics of real-world team competitions, allowing us to study the effects of various incentives in a setting that mimics real-world competition.

Contributions: The main contributions of this paper are: (1) We initiate the study of agent and team performance in the setting of unequal and changing skill levels, through a novel game, Touch-Mark. (2) We study mechanisms for offsetting unequal competition through individual and team rewards, which can also be learned using RL. (3) We propose C-MADDPG, which learns a dynamic role-based policy ensemble, for faster learning of agent policies and shorter overall simulation time.

II Related Work

Multi-agent reinforcement learning is a long-studied problem in various settings, namely learning joint strategies for cooperative tasks [guestrin2002coordinated, rangwala2019learning, wang2020cooperation, yang2020q], optimal play in competitive settings [littman1994markov], learning robust policies under model uncertainty [zhang2020robust], etc. A common and popular approach is the actor-critic framework with centralized training and decentralized execution [gupta2017cooperative]. MADDPG [lowe2017multi], a multi-agent extension of deep deterministic policy gradient [lillicrap2015continuous], is one such stable and popular algorithm. There are many follow-up works on actor-critic based MARL algorithms, namely multi-actor-attention-critic (MAAC) [iqbal2018actor] for introducing attention, [qu2020scalable] for improving scalability, [christianos2020shared] for sharing experience, [zhou2020learning] for the credit assignment problem, [mahajan2021tesseract] for tensorizing the critics, etc. In our game setting, we find that MAAC performs similarly to MADDPG while being much more computationally expensive, and we therefore continue our experiments with MADDPG only. However, our proposed framework can be adapted to other multi-agent reinforcement learning algorithms as well.

Recently, [majumdar2020evolutionary] present an extension of MADDPG that separately learns individual and global goals in a population-based training paradigm. [liu2021coach] tackle the problem of dynamic team composition in a coach-player paradigm. However, none of them explicitly addresses the setting of unequal competition with a focus on the effects of incentives in offsetting the inequality. There is also a series of works on role-oriented MARL for specialized domains such as robo-soccer [leottau2015study, urieli2011optimizing, ossmy2018variety] and the football environment [roy2019promoting], showing how complex policies can be learnt by decomposing them into simpler sub-policies. However, as our primary focus is to study and analyze the effect of various incentive schemes on unequal agents, we test our hypothesis on Touch-Mark, a simple team-competitive game, since it would be difficult to gain insights from complex multi-player robo-soccer. Although there is recent work on the stability of mixed-strategy learning algorithms [mertikopoulos2019learning] and on the conditions for convergence to Nash equilibria in continuous action spaces [kamra2019deepfp], we postpone such theoretical exploration to future work.

The practice of a third party applying intrinsic incentives in a multi-agent reinforcement learning framework, in order to steer the dynamics towards a desired outcome, is mostly found in various social dilemma games [mohamed2015variational, hughes2018inequity, jaques2019social, paquette2019no]. [iqbal2019coordinated] employ intrinsic rewards for coordinated exploration. [du2019liir] employ individual intrinsic rewards for stimulating diverse behavior among agents. A form of general utility, a non-linear function of the state-action occupancy measure, has recently been shown to be effective in practice via prioritizing exploration [mahajan2019maven, gupta2021uneven], risk-sensitivity [qiu2020rmix], and prior experience [le2017coordinated, lee2019improved]. [zhang2021marl] establish theoretical guarantees of consistency and sample complexity for such general utility functions. In this work, our proposed dynamic incentive closely resembles the intrinsic rewards proposed in [hughes2018inequity], though the setting and motivation are quite different. [jiang2019learning] explore learning fair and stable strategies in resource sharing settings. Close to our line of work, [zheng2021ai] present a machine-learning based economic simulation framework, in which the AI Economist, a two-level deep RL framework, is used to train agents along with a social planner to provide a tractable solution to the optimal taxation problem, unlocking a computational learning-based approach to understanding economic policy. However, the study of fair outcomes in competitive settings is still in its nascent stage.

The work presented here is in line with the design of fair incentive schemes, which is an important area in fair machine learning [calders2009building, zafar2015fairness, hardt2016equality, pleiss2017fairness]. More specifically, it adds to recent studies that have focused on the long-term effects on social groups of implementing fairness constraints. [hu2018short] devise a data-specific affirmative action strategy for the US labor market, which in turn ensures that the need for affirmative action diminishes as time progresses. [liu2018delayed] show the delayed impact of existing fair classifiers on disadvantaged groups. They demonstrate that even in a one-step feedback model, common fairness criteria may not, in general, promote improvement over time. Similar sentiments are echoed in [corbett2018measure], who show that classification parity can, perversely, harm the very groups it was designed to protect. [mouzannar2019fair] address the important issue of maintaining demographic parity and quality. [kannan2019downstream] discuss the relation between the constraint of equal opportunity in college admission and the biases this induces during hiring by companies. [jabbari2017fairness] build a reinforcement learning model which achieves near-optimality subject to (exact) fairness or approximate-choice fairness.

Recently, there has been a series of works at the intersection of incentive-based mechanism design and reinforcement learning [zheng2020ai, brero2020reinforcement, zhang2021incentive]. Among theoretical works, [brero2020reinforcement] investigate various theoretical aspects of the use of reinforcement learning for certain classes of indirect mechanisms, whereas [zhang2021incentive] propose incentive-aware PAC learning in the presence of strategic manipulation. Our work closely resembles [zheng2020ai], who build social planners for devising tax policies in dynamic economies that effectively balance economic equality and productivity, where the agents are trained through deep reinforcement learning. The present work adds to the domain at the intersection of mechanism design, reinforcement learning and fair machine learning by considering competition between unequal teams and building RL agents that devise dynamic incentives for fair outcomes.

III MARL under Team Competition

We propose Touch-Mark, an episodic board game built on the multi-agent particle environment (MPE) [lowe2017multi], which elicits both competitive and collaborative behavior among agents. In this game, we focus on three major aspects of social behaviour: (1) team competition, (2) the emergence of unequal roles, and (3) skill improvement. Touch-Mark is largely derived from Keep-away, a competitive game introduced in [lowe2017multi], where an agent and its adversary both try to reach the landmark while trying to push the opponent away from it. To incorporate team competition, we increase the number of members in each team to two and consequently increase the number of landmarks to two (encouraging diverse policies). An agent can adopt one of two implicit roles within a team as its gameplay policy: reaching the landmark or colliding with an opponent.

In this game, we also assign a skill level to each agent, which improves over time. The improvement varies depending on which role the agent plays, to simulate how assigning the more rewarding roles, with more scope for self-improvement, to the more skilled members leads to more inequality among team members. Note that while analyzing the subsidy schemes in the following sections, we assign different initial skill levels to different agents to model unequal competition. While Touch-Mark is simple and efficient, it incorporates the features of team competition-based social interactions that are commonly seen in society. Hence, it can be used as a simulation platform for our studies on incentive schemes. We exclude some of the more complex and popular team competition games, e.g. Google Football Environment [kurach2020google] and StarCraft 2 [samvelyan2019starcraft], because they are too heavy on computational resources and because it is more complicated to analyze and differentiate the effects of various incentives in them. Next, we briefly describe the rules of Touch-Mark.

III-A Touch-Mark: Team Competition between Unequal Agents

The game setting consists of two teams, each comprising two agents. Each agent has a current position and velocity. The game is played iteratively; in each episode, two landmarks (which introduce diversity) are placed at random on a square board, and each team tries to reach at least one of those landmarks earlier than any member of the other team. The episode ends when an agent reaches a landmark. The winning agent's team (i.e., both team members) receives a large reward; simultaneously, a penalty is incurred by the members of the opposing team. Additionally, at every time step each agent receives a small penalty that is proportional to its distance from the nearest target. This penalty encourages the agents to move towards the target, thus accelerating policy learning. To stop agents from leaving the board, a small penalty is given for touching the boundary. Moreover, an agent can collide with an agent of the opposing team to divert it from its path. This mechanism gives an agent the option to stop an opponent from reaching a landmark, thus helping its teammate reach a landmark first. This cooperative behavior results from the emergence of different roles within a team.
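The reward rules in this paragraph can be illustrated with a short sketch. It is not the authors' implementation: the function name `step_rewards`, the values `win_reward` and `dist_coef`, and the choice of a losing penalty equal in magnitude to the winning reward are all assumptions made for the example, and the boundary penalty is omitted for brevity.

```python
import math

def step_rewards(positions, teams, landmarks, winner_team=None,
                 win_reward=10.0, dist_coef=0.1):
    """Per-agent rewards for one time step of a Touch-Mark-like game.

    win_reward and dist_coef are hypothetical values, not taken from the paper.
    winner_team is None while the episode is still running.
    """
    rewards = []
    for pos, team in zip(positions, teams):
        nearest = min(math.dist(pos, lm) for lm in landmarks)
        r = -dist_coef * nearest  # small penalty proportional to distance
        if winner_team is not None:
            # Episode ended: the whole winning team is rewarded, and the
            # opposing team is penalised (equal magnitude is an assumption).
            r += win_reward if team == winner_team else -win_reward
        rewards.append(r)
    return rewards

# Four agents, two per team; the first agent sits on a landmark and wins.
example = step_rewards(
    positions=[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    teams=[0, 0, 1, 1],
    landmarks=[(0.0, 0.0), (5.0, 5.0)],
    winner_team=0,
)
```

Note that the teammate of the winning agent receives the large reward as well, which is what allows an assistive role to be worthwhile.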

In this game, each agent starts at a random position with zero velocity. The velocity of each agent can increase up to an upper bound, max_speed. This per-agent parameter represents the agent's skill level, since it limits the speed at which the agent can move. In Touch-Mark, max_speed is increased at the end of each episode if the agent touches any landmark in that episode, representing a skill upgrade following success. The rate of increase of max_speed is proportional to its difference from a global speed limit MAX_SPEED, i.e., the higher the skill, the slower the rise in skill level.
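The skill-upgrade rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `upgrade_max_speed` and the proportionality constant `rate` are introduced here, since the text only states that the increase is proportional to the gap between max_speed and MAX_SPEED.

```python
def upgrade_max_speed(max_speed, global_limit, touched_landmark, rate=0.1):
    """Return an agent's max_speed after one episode.

    `rate` is a hypothetical proportionality constant (not given in the text);
    `global_limit` plays the role of MAX_SPEED.
    """
    if not touched_landmark:
        return max_speed  # skill improves only after touching a landmark
    # The increase is proportional to the remaining gap to the global limit,
    # so a more skilled agent improves more slowly.
    return max_speed + rate * (global_limit - max_speed)

# A slow agent gains more from one success than a fast one.
gain_slow = upgrade_max_speed(1.0, 2.0, True) - 1.0
gain_fast = upgrade_max_speed(1.8, 2.0, True) - 1.8
```

With this form, max_speed approaches the global limit without ever exceeding it, provided `rate` lies between 0 and 1.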

Despite its apparent simplicity, this game captures a few key aspects of the dynamics of competition. In real life, such a game resembles a setting where multiple organizations compete for a common target, and employees within an organization resemble the members of a team. By the same analogy, touching the landmark mimics target fulfillment by an organization. The provision for collision in Touch-Mark is also comparable to the situation where organizations put in effort to outwit a competitor.

Most importantly, this game offers a skill improvement feature, i.e., an agent improves its skill (max_speed in our case) on achieving a target. This closely resembles the popular argument that government policy-makers put forward for supporting an infant organization: the theory of the learning-by-doing effect. The game also assumes that one role is more important for attaining the target; hence, although the reward is divided equally among all the members, the skill improvement accrues to one particular member. This mimics real life where, in a team, some may be doing a desk job while others are doing a field job, and learning is much steeper in the field. In each team, the number of agents has been fixed at two. The difference in skills and roles can both be captured through two members, and keeping the number small also ensures sufficient interpretability of the effects of the various incentive schemes introduced later. Increasing the number of players has two disadvantages: the simulation becomes more complicated and time-consuming, and it becomes harder to uncover the impact of incentives on the ecosystem as well as on individual players.

III-B Learning Ensemble Policies

We cast the problem of learning optimal policies for agents as a multi-agent reinforcement learning problem, and briefly define the formal setup. We consider a multi-agent extension of Markov decision processes called partially observable Markov games. A Markov game is characterized by the tuple $(\mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{N}, \mathcal{T}, \{r_i\}_{i=1}^{N})$, where $\mathcal{S}$ denotes the set of states and $\mathcal{A}_i$ denotes the set of actions for each of the $N$ agents. Hence the joint space of actions becomes $\mathcal{A} = \mathcal{A}_1 \times \dots \times \mathcal{A}_N$. The state transition function $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ maps every state and joint-action combination to a probability distribution over future states. The reward function specifies the reward scheme ($r_i: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$) for each agent $i$. The per-agent reward function allows modeling of team competitive games [lowe2017multi]. In the decentralized execution setting, each agent $i$ receives its own observation $o_i$, which is a function of the common state $s$, but from the agent's point of view. Each agent learns a policy $\mu_i$, which is a mapping from its own observation $o_i$ to a distribution over its action set $\mathcal{A}_i$.

Fig. 1: Schematic diagram of C-MADDPG

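MADDPG: In the above setup, we restrict ourselves to the well-known actor-critic framework in the deterministic policy setting, where Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [lowe2017multi] is a popular state-of-the-art method. In MADDPG, each agent $i$ maintains its policy $\mu_i$, parameterized by $\theta_i$, and an approximation of the action-value function $Q_i$, parameterized by $\phi_i$, for $i = 1, \dots, N$. In MADDPG the policy is updated in a decentralized manner, whereas the action-value function (critic) is trained in a centralized manner.

Given an experience replay buffer $\mathcal{D}$ of tuples $(\mathbf{o}, \mathbf{a}, \mathbf{r}, \mathbf{o}')$, where $\mathbf{o} = (o_1, \dots, o_N)$ are the current observations / states, $\mathbf{a} = (a_1, \dots, a_N)$ are the current actions, $\mathbf{r} = (r_1, \dots, r_N)$ are the rewards, and $\mathbf{o}'$ are the observations at the next time step, the gradient of the cumulative reward function $J$ w.r.t. the policy parameters $\theta_i$ can be computed as:

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{(\mathbf{o}, \mathbf{a}) \sim \mathcal{D}} \left[ \nabla_{\theta_i} \mu_i(o_i) \, \nabla_{a_i} Q_i(\mathbf{o}, a_1, \dots, a_N) \big|_{a_i = \mu_i(o_i)} \right]$$

The policy parameters are iteratively updated using the above gradients. The action-value function $Q_i$ is learned by minimising the following loss w.r.t. $\phi_i$:

$$\mathcal{L}(\phi_i) = \mathbb{E}_{(\mathbf{o}, \mathbf{a}, \mathbf{r}, \mathbf{o}') \sim \mathcal{D}} \left[ \left( Q_i(\mathbf{o}, a_1, \dots, a_N) - y_i \right)^2 \right], \qquad y_i = r_i + \gamma \, Q'_i(\mathbf{o}', a'_1, \dots, a'_N) \big|_{a'_j = \mu'_j(o'_j)} \tag{1}$$

where $\gamma$ is the discount factor, $\mu' = \{\mu'_1, \dots, \mu'_N\}$ is the set of delayed target policies parametrized by $\theta'_j$, and $Q'_i$ is the delayed action value function parameterized by $\phi'_i$.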
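As a concrete illustration of the critic update, the sketch below computes the mean squared temporal-difference error that a MADDPG-style centralized critic minimises over a sampled minibatch. It is a simplified stand-in, not the authors' code: the function name `critic_loss`, the plain-list inputs, and the default discount factor `gamma` are assumptions, and the critic and target critic are represented only by their already-evaluated outputs.

```python
def critic_loss(q_values, rewards, next_target_q, gamma=0.95):
    """Mean squared TD error for one agent's centralized critic.

    q_values:      critic outputs for the sampled (observations, joint action)
    rewards:       that agent's reward in each sampled transition
    next_target_q: delayed (target) critic outputs at the next observations,
                   with next actions chosen by the delayed target policies
    gamma:         discount factor (hypothetical default)
    """
    # TD targets are treated as constants when differentiating the loss.
    targets = [r + gamma * q_next for r, q_next in zip(rewards, next_target_q)]
    squared_errors = [(q - y) ** 2 for q, y in zip(q_values, targets)]
    return sum(squared_errors) / len(squared_errors)

loss = critic_loss(q_values=[1.0, 2.0], rewards=[0.0, 1.0],
                   next_target_q=[1.0, 1.0], gamma=0.5)
```

In a full implementation the same quantity would be computed on tensors and minimised by gradient descent on the critic parameters only.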

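Algorithm 1 C-MADDPG
Data: number of episodes, maximum episode length, number of agents, classifier update interval, a constant
1 Randomly initialize the policy classifier, the policy parameters, and the critic parameters;
2 for each episode do
3       for each time step of the episode do
4             Select the action from an exploration policy with some probability;
5             else select the action from the ensemble policy whose label is given by the classifier;
             /* Note that the label is the same for agents of the same team */
6             Execute the actions and observe the new state and rewards;
7             Calculate the target policy labels using equation 2;
8             Store the transition in the replay buffer;
             /* Update critic, policy */
9             Sample a minibatch from the replay buffer;
10             Update the critic using eqn 1;
11             Update the policies using eqn 3;
13       end for
14      Every fixed number of episodes: update the classifier using the gradient of the cross-entropy loss;
16 end for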

C-MADDPG: A major problem with the MADDPG algorithm is that each agent has only one set of policy parameters, and it is difficult to learn complex policies using a single set of parameters. Since Touch-Mark is a team game, the emergence of diverse behaviors within a team while trying to achieve a common goal may be efficient. To capture this diverse behavior, we develop C-MADDPG, a controller-assisted version of MADDPG, which learns an ensemble of policies per agent and also maintains a controller per agent. The controller is trained in a supervised manner to switch between policies, based on the relative maximum critic function values of the teams. To state our proposal formally, we propose to learn an ensemble of policies per agent. Here we consider an ensemble of two policies: one corresponds to the policy adopted in a configuration where the team is in an advantageous position (we call it the winning policy), and the other is the losing policy.

We propose a simple controller-assisted training paradigm where, following Algorithm 1, for each agent we update: (1) two policies, which guide the next step of the agent (line 11 of Algorithm 1); and (2) a classifier, which takes the current observations of all agents and assigns a policy label to the agents (line 14 of Algorithm 1). The target policy labels used in training the classifier are computed according to the following equation (line 7 of Algorithm 1).


The equation ensures that the classifier label is the same for all the members of a team: a team adopts the winning policy if any one member of the team possesses the highest critic value across all the agents, and the losing policy otherwise. Here, the underlying assumption is that the critic values of the agents are observable across the teams.
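The labelling rule described above can be sketched in a few lines. This is an illustration of the verbal description only: the function name `team_policy_labels`, the encoding of the winning policy as 1 and the losing policy as 0, and the tie-breaking behaviour (the first agent with the maximum value wins) are assumptions made for the example.

```python
def team_policy_labels(critic_values, teams):
    """Assign a policy label to every agent: 1 = winning, 0 = losing.

    critic_values: one critic value per agent (assumed observable across teams)
    teams:         team id of each agent, in the same order
    """
    # Index of the agent holding the highest critic value among all agents.
    best_agent = max(range(len(critic_values)), key=lambda i: critic_values[i])
    leading_team = teams[best_agent]
    # Every member of that agent's team shares the winning label.
    return [1 if team == leading_team else 0 for team in teams]

labels = team_policy_labels([0.2, 0.9, 0.5, 0.1], teams=[0, 0, 1, 1])
```

Because the label depends only on which team holds the maximum, both members of a team always receive the same target label.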

At each time step, each agent determines its policy using the classifier (the construction of the classifier ensures that both members of a team choose the same policy), then uses that policy to select its action and receive a reward; the resulting transitions are stored in the replay buffer. The policies of the agents are updated using the following equation at each time step,


Intuitively, the rationale behind role emergence in C-MADDPG is as follows: the sample space is split based on the relative superiority of one team over another, and the policies are trained over disjoint sample spaces. This modification gives C-MADDPG a two-fold benefit: it learns more focused, sub-goal oriented policies by systematically splitting the training sets, and it achieves the goal of a complex policy by dynamically switching between the policies. Note that we additionally require the opponent value functions as input, which are of the same size as the global states and actions. Hence, for a small number of roles, scalability is similar to that of CTDE methods like MADDPG. Please refer to Fig. 1 for a schematic representation of our algorithm.
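The idea of training the two policies on disjoint parts of the sample space can be sketched as below. The record layout (a dict with a `label` key, 1 for the winning policy and 0 for the losing policy) and the function name `partition_by_label` are assumptions made for this example; the source does not describe how transitions are stored.

```python
def partition_by_label(transitions):
    """Split stored transitions into one training set per ensemble policy.

    Each transition is a dict carrying the policy label under which it was
    generated (1 = winning policy, 0 = losing policy).
    """
    subsets = {0: [], 1: []}
    for transition in transitions:
        subsets[transition["label"]].append(transition)
    return subsets

buffer = [
    {"obs": "o1", "label": 1},
    {"obs": "o2", "label": 0},
    {"obs": "o3", "label": 1},
]
subsets = partition_by_label(buffer)
```

Each policy in the ensemble would then be updated only from its own subset, so no transition contributes to both policies.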

Fig. 2: (a) Team avg. rewards: temporal evolution of team-wise average rewards. Experiments have been performed for six different seeds; the shaded region denotes the confidence region. (b) Agent landmark rate: team-wise landmark-reaching rates for a set of initial configurations and its reverse.

Fig. 3: (a) Forward position. (b) Flip position. Role emergence of C-MADDPG agents when played against MADDPG agents. Each panel merges snapshots from various timestamps; agents' positions in earlier snapshots are in a lighter shade. We observe that both MADDPG agents are interested in moving towards the landmark (Fig. 3(a)), whereas C-MADDPG agents wisely split the roles of go-for-landmark and stop-the-opponent between the team members (Fig. 3(b)).

Fig. 4: Agents trained in the Touch-Mark game using C-MADDPG. Panels (a)-(d) show Reward, Landmark Count, Win Policy Usage and Speed for the Team-wise Incentive scheme; panels (e)-(h) show the same metrics for the Agent-wise Incentive scheme. Average performance over different seeds is reported (the shades signify the confidence region).

Experimental setup

The classifier module in C-MADDPG is a multi-layer perceptron consisting of two hidden layers followed by a single node at the output layer. The hidden layers use the ReLU activation function, whereas the output layer uses a sigmoid activation. For all the experiments here (and henceforth), a fixed landmark-touching reward is given at the end of an episode, the per-timestep penalty for each agent is based on its distance from the nearest landmark, the agents move in continuous space, the initial max_speed is set to a common value for all agents, and MAX_SPEED acts as the global limit. We train the RL agents for different seeds and report average results with confidence intervals.
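To make the classifier architecture concrete, the following dependency-free sketch runs a forward pass through a network with two ReLU hidden layers and a single sigmoid output node. The layer sizes (input 8, hidden 16 and 8), the weight initialisation, and all function names are hypothetical choices for illustration; they are not the values used in the paper.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def classifier_forward(x, layers):
    """Two ReLU hidden layers followed by a single sigmoid output node."""
    h = [max(0.0, v) for v in dense(x, layers[0])]
    h = [max(0.0, v) for v in dense(h, layers[1])]
    logit = dense(h, layers[2])[0]
    return 1.0 / (1.0 + math.exp(-logit))  # probability of the winning label

rng = random.Random(0)
# Hypothetical sizes: input 8, hidden layers of 16 and 8 units, output 1.
layers = [init_layer(8, 16, rng), init_layer(16, 8, rng), init_layer(8, 1, rng)]
prob_win = classifier_forward([0.1] * 8, layers)
```

The sigmoid output pairs naturally with the binary cross-entropy loss used to train the classifier against the target policy labels.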

Performance Comparison of C-MADDPG and MADDPG: The comparison between the two algorithms is reported in Figs. 2(a) and 2(b). During training, at regular intervals, agents from both algorithms are made to play against each other for a set of test episodes. To remove any bias, 500 initial configurations are generated, and two episodes are played with each configuration, with the initial positions of the teams exchanged between the two. It can be seen that as training progresses, C-MADDPG consistently achieves a higher reward than MADDPG (Fig. 2(a)). The landmark-reaching statistics also indicate the superiority of the C-MADDPG team over the MADDPG team on Touch-Mark (Fig. 2(b)).
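The paired evaluation protocol can be sketched as follows. This is a minimal illustration, assuming a square board centred at the origin with hypothetical half-width `board` and omitting landmark placement; the function name `paired_configurations` and the dictionary layout are introduced for the example only.

```python
import random

def paired_configurations(n_configs, board=1.0, seed=0):
    """Yield (config, swapped) pairs of initial positions for two teams.

    Each config maps a team name to its two agents' starting positions.
    The swapped copy exchanges the two teams' positions, so that neither
    algorithm benefits from a favourable starting layout.
    """
    rng = random.Random(seed)
    for _ in range(n_configs):
        points = [(rng.uniform(-board, board), rng.uniform(-board, board))
                  for _ in range(4)]
        config = {"C-MADDPG": points[:2], "MADDPG": points[2:]}
        swapped = {"C-MADDPG": points[2:], "MADDPG": points[:2]}
        yield config, swapped

# 500 configurations, each played twice (original and swapped).
pairs = list(paired_configurations(500))
```

Playing both members of each pair means any advantage due to starting positions is granted to each team exactly once.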

To understand the reason behind the performance difference, we look into the behavior of the agents time step by time step. Interestingly, we find that the two agents of C-MADDPG assume different roles to ensure that their team wins, which is not true in the case of MADDPG. We illustrate this in Fig. 3, which presents different snapshots of an episode (merged together) to show the dynamics of the agents. Figures 3(a) and 3(b) show two episodes between which the initial positions of the C-MADDPG and MADDPG agents are exchanged. In Fig. 3(b), one agent of the C-MADDPG team goes for a landmark while the other C-MADDPG agent stops an MADDPG agent from reaching another landmark; in Fig. 3(a), the MADDPG agents, in a similar situation, do not show any particular trend of role separation. To understand the importance of collision, we consider all those episodes where the C-MADDPG team wins in both a configuration and its reverse, and we find that in those cases the collision rate is significantly higher than the average rate.

In summary, C-MADDPG agents wisely split the emerged roles, go-for-landmark and stop-the-opponent, between themselves, resulting in the superior performance seen in Figs. 2(a) and 2(b). The ability of C-MADDPG to split roles while adopting the winning policy may be attributed to the design of the classifier, which leads to the division of the sample space between the competing policies. The subdivision allows more focused exploration, whereby the agent closer to a landmark tries to reach the target while the other agent, to the advantage of the team, tries to stop the opponent.

Role emergence is an important requirement for the efficient functioning of a team; however, in various cases, some roles may emerge as the main tasks while others get relegated to auxiliary services. This is true for Touch-Mark, where go-for-landmark, which results in touching the landmark, is more important because the speed level increases only after touching the landmark; stop-the-opponent thus plays an assistive role. The strength of an agent can therefore be attributed to the frequency with which it touches a landmark; we will use this knowledge while understanding and devising a fair incentive mechanism in the next section.

IV Fair Competition

In this section, we play Touch-Mark with two teams of unequal skill (represented here by speed) and try to devise an incentive scheme, catered towards the weak team, to match the final rewards of the two teams. In the stronger team, both members are given the same initial max_speed. In the weaker team (max_speed and speed are used interchangeably), one member is given a higher initial speed than the other, 'weaker', member. We choose this mixed setting for the weaker team because it reveals the more interesting case where inequality exists both within and between teams. For all of the following experiments, all the agents learn their policies using C-MADDPG. We perform experiments for different seeds and report the average performance over all seeds for each metric, along with the interval signifying confidence. We report the average episode reward for each team (Reward), the average number of times each agent has touched the landmark per episode (landmark-count), the fraction of times a team has used the winning policy (win-policy-usage), and the instantaneous max_speed (speed), tracked throughout the RL training episodes (these metrics are reported in Figs. 4 to 6). Table I lists the important notation used in defining the subsequent incentive schemes. We first propose an incentive scheme targeted towards the weak team.

Symbol Meaning
Fraction of additional reward if weak team touches
Fraction of additional reward if weak agent touches
Performance/speed of specific team (strong/weak)
Performance/speed of specific agent (strong/weak)
of team (strong/weak)
TABLE I: List of important notations

Team-wise Incentive: In the changed reward scheme with incentive, every time the strong team touches a target (an event also called a goal), it gets the usual reward, while for every goal the weak team gets the usual reward increased by an additional fraction of that reward (the team-wise incentive parameter of Table I). All other rewards and penalties remain the same. The parameter can be set manually through trial and error, by best balancing the cumulative rewards of the two teams over the most recent episodes.
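The team-wise incentive can be sketched as a simple scaling of the goal reward. This is an illustrative reading of the description, assuming the additional reward is the base reward multiplied by the incentive fraction; the function name `goal_reward`, the argument names, and the numeric values are hypothetical.

```python
def goal_reward(base_reward, scoring_team, weak_team, extra_fraction):
    """Reward paid to the scoring team under the team-wise incentive.

    extra_fraction is the fraction of additional reward granted when the
    weak team scores; the value used below is made up for the example.
    """
    if scoring_team == weak_team:
        # Usual reward plus an additional fraction of that reward.
        return base_reward * (1.0 + extra_fraction)
    return base_reward

weak_goal = goal_reward(10.0, scoring_team="weak", weak_team="weak",
                        extra_fraction=0.5)
strong_goal = goal_reward(10.0, scoring_team="strong", weak_team="weak",
                          extra_fraction=0.5)
```

Tuning the fraction by trial and error then amounts to re-running training with different values and comparing the two teams' cumulative rewards.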

In [fig. 4(a) - fig. 4(d)] we have summarized a representative observation at . Figure 4(a) shows that for the specific value of the cumulative rewards can be balanced only for a short duration; eventually the stronger team takes over and consistently outperforms the weaker team. Figure 4(b) reveals that the weakest player reaches the landmark very rarely compared to the strong players. Hence it hardly learns how to reach the landmark and win episodes. Consequently, the players of the weaker team progressively choose the winning policy less frequently (fig. 4(c)), and the skill level (speed) of the weaker player also increases at a far slower rate than that of the other players (fig. 4(d)). We see that the strong player of the weak team initially performs as well as the strong players of the strong team (it also quickly increases its speed at first (fig. 4(d))), because it gets more opportunity within its team. But without the help of the weak player, its performance progressively deteriorates against a team where both members are continuously improving. We conclude that in addition to the team incentive, a special incentive is needed for the weak agent to balance the long-term total reward.

To motivate such an incentive in real life, let us take the example of a sales team whose members have to perform various roles, such as maintaining paperwork and making the actual sales to customers; the framework marks the paperwork role as less important. A cash incentive is provided to the team when actual sales are made. The amount of the incentive depends on the expertise of the team and of the individual member: the less experienced member fetches more reward for making the actual sales. Of course, the reward is then distributed equally among all the members of the team. The mechanism nudges the team to let the less experienced member make the actual sales, as that would fetch a higher reward for the entire team if successful; in this process, she learns how to perform the job and becomes trained, thus helping the team and herself in the longer run. Without such differentiation, the weaker member would always be made to do the mundane back-office paperwork and never get a chance to learn field operations.

Fig. 5: Agents trained in the Touch-Mark game using C-MADDPG. Panels (a)-(e) show the Landmark-based Dynamic Incentive and panels (f)-(j) the Speed-based Dynamic Incentive; within each group the panels are Reward, Landmark Count, Win Policy Usage, Speed, and Incentive. Average performance over different seeds is reported (the shaded regions signify the confidence region).

Agent-wise Incentive: Here we consider a more explicit incentive scheme, namely agent-wise reward. In this scheme, the weaker team gets if its weaker member touches the landmark and if the stronger member does so; whereas each success of the stronger team is rewarded with . All other rewards and penalties remain the same. The intuition is that such a differential incentive mechanism may help the weak team to discover the huge benefit of the weak player touching the target, and consequently the policies would increasingly gear the weak player towards attempting to touch the target. This in turn would allow the weaker agent to learn and increase her capability (which was not happening in the previous case) and in the process, the difference between the two teams may disappear.
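The two static schemes can be summarized in one small reward function. This is a minimal sketch, not the paper's implementation: the parameter names `team_fraction` and `agent_fraction` are hypothetical stand-ins for the two "fraction of additional reward" quantities in Table I, and the additive form `base_reward * (1 + bonus)` is an assumption, since the exact expressions did not survive in the text above.

```python
def goal_reward(base_reward, team, is_weaker_member,
                team_fraction=0.0, agent_fraction=0.0):
    """Reward credited to a team when one of its members touches the landmark.

    Assumed form (not confirmed by the source): the strong team always gets
    the base reward; the weak team gets an extra `team_fraction` of it, plus
    an extra `agent_fraction` when its weaker member is the one who scores.
    """
    if team == "strong":
        return base_reward
    bonus = team_fraction
    if is_weaker_member:
        bonus += agent_fraction
    return base_reward * (1.0 + bonus)
```

Setting `agent_fraction=0.0` recovers the team-wise scheme under the same assumption, which makes the difference between the two schemes explicit: only the agent-wise variant makes a goal by the weaker member worth more than a goal by her teammate.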

In [fig. 4(e) - fig. 4(h)] we have summarized a representative observation at and . In fig. 4(e), we observe that the weak team initially earns less reward than the strong team, but eventually it gains expertise and outperforms the stronger team in terms of reward. Looking at the corresponding landmark plots, we can conclude that initially it was mostly the stronger player in the weak team who went for the landmark, but eventually, as the high reward for the weaker agent is discovered, the weaker agent begins to play a significant role (fig. 4(f)). Thus we find that, unlike in the previous case, the winning policy is pursued equally by both teams (fig. 4(g)). The speed of the weaker agent also increases at a much faster rate than in the previous case (fig. 4(h)). Thus, in the end, the initial incentive value becomes disproportionate, and this results in the continued superior performance of the weaker team. This experiment provides strong evidence that, to ensure sustainable existence and growth, not only the weak team but also the weak players need a targeted incentive.

From the study we can conclude that an agent incentive successfully improves the weaker team and eventually brings it to the level of the stronger team. If that incentive is not changed over time, however, the weaker team would outperform the stronger one, gaining an ‘unfair’ advantage. Consequently, the incentive needs to be dynamic, which would ensure that neither team gets an extra benefit at any point in time. We discuss various such dynamic incentive schemes in the next section.

Name Incentive scheme
DynamicLandmark ,
measures landmark count.
DynamicSpeed ,
measures speed.
measures speed.
measures speed.
TABLE II: List of incentive schemes
Fig. 6: Agents trained in the Touch-Mark game using C-MADDPG. Panels (a)-(e) show the Team-RL-Agent-Dynamic-Incentive scheme, where the team-wise reward is obtained from an RL scheme and the agent-wise reward is obtained from the difference in speed. Panels (f)-(j) show the Team-Dynamic-Agent-RL-Incentive scheme, where the agent-wise reward is obtained from an RL scheme and the team-wise reward is obtained from the difference in speed. Within each group the panels are Reward, Landmark Count, Win Policy Usage, Speed, and Incentive. Average performance over different seeds is reported (the shaded regions signify the confidence region).

V Dynamic Incentive Scheme

In the dynamic incentive scheme, the team-wise incentive () and the agent-wise incentive () are determined dynamically at each timestep, considering (1) Landmark-based Incentive - using the difference in performance (landmark touching count) of agents or (2) Speed-based Incentive - using the difference in speed (skill) of agents. Hence:


where , is either performance or speed (suitably normalized) of a team and an individual agent respectively. If any of the values of , turns out to be negative at any instance, that value is set to zero.
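The rule described here (an incentive driven by a normalized difference, with negative values set to zero) can be sketched as follows. The function names and, in particular, the choice to normalize the gap by the stronger side's value are assumptions made for illustration; the source only says the quantities are "suitably normalized".

```python
def dynamic_fraction(stronger_value, weaker_value):
    """Normalized gap between the stronger and weaker side, clipped at zero.

    The values can be landmark counts (performance) or speeds (skill).
    Dividing by the stronger side's value is an assumed normalization.
    """
    if stronger_value <= 0:
        return 0.0
    return max(0.0, (stronger_value - weaker_value) / stronger_value)


def dynamic_incentives(team_strong, team_weak, agent_strong, agent_weak):
    """Return the (team-wise, agent-wise) incentive fractions for this step."""
    return (dynamic_fraction(team_strong, team_weak),
            dynamic_fraction(agent_strong, agent_weak))
```

Because each fraction is recomputed at every timestep, it shrinks on its own as the weaker side catches up and vanishes once the gap closes or reverses.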

Results: Landmark-based Incentive [fig. 5(a) - fig. 5(e)] - We observe that the dynamic incentives help in successfully balancing the final rewards obtained by the teams. The incentive strengthens the weaker agent (see fig. 5(b)), as the landmark touching rate of the weaker player nears that of a strong player. For both teams, we also find that the fraction of times each agent pursues the winning policy becomes similar (fig. 5(c)). The speed of the weaker player also catches up (fig. 5(d)), and the incentive needed to match the two teams disappears (fig. 5(e)).
Results: Speed-based Incentive [fig. 5(f) - fig. 5(j)] - In this case, the rewards of the two teams (fig. 5(f)), unlike in fig. 5(a), do not become equal, even after 150000 episodes. We also see that the learning curve (indicated by the number of landmarks touched (fig. 5(g))) is low for the weak agent. The weaker team’s fraction of winning policy usage is also comparatively lower (fig. 5(h)). However, the speed of the weaker agent increases and reaches the maximum (fig. 5(i)), albeit more slowly than in fig. 5(d). This in turn pushes both and towards zero. Hence, although this incentive scheme increases the skill of the weaker team (player), the real capability, which is a complex combination of skill and policy learning, lags. Therefore, we conclude that skill in many cases may not reveal the true relative positions in terms of performance.

Thus, from these experiments, we find that dynamic feedback of the difference in landmark-count (performance) is the best way to balance the output of the two teams. However, performance is a complex quantity to measure; moreover, competitors may not always share this information immediately (if at all). Therefore, the challenge lies in achieving equivalent balancing without taking performance information as a real-time input; we employ an RL technique to estimate the performance.

Fig. 7: A transition diagram showing how the capabilities of the heterogeneous agents evolve as the training progresses. The incentives used to close the gap between agent capabilities diminish as the training progresses.
Fig. 8: Tournament among various incentive schemes.

V-A Dynamic Incentive Using Reinforcement Learning

Here we propose two RL-based incentive mechanisms which take the current speed (and not the performance) of all agents as input: (1) [Team-RL-Agent-Dynamic], where we decide the value of using where is the current speed configuration and is the policy learned by training an RL model; is computed as in eq. 4. (2) [Team-Dynamic-Agent-RL], where, similarly, we decide the value of using and compute using eq. 4.

Training the RL model: We train the RL agent in an off-policy mode using the Soft Actor-Critic algorithm [haarnoja2018soft], where it uses the full training episode of the Touch-Mark game setting with the C-MADDPG algorithm. We assume that the performance information is available at training time. The observation space of the RL agent consists of the current speed configuration of all the agents. As required by the incentive scheme, either or is determined by sampling from the RL module . The RL module obtains its reward by measuring the difference in performance of the agents after applying for a fixed number of episodes. The intuition is that this scheme will guide the RL agent to learn the desired mapping between speed and performance. We postpone joint training of and as it is extremely computationally expensive to obtain stable policies for both.
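The reward signal of the incentive-setting RL agent can be illustrated with a short sketch. The source states only that the reward is obtained from the difference in performance after the chosen incentive has been applied for a fixed number of episodes; the negative absolute gap used below, the use of landmark counts as the performance measure, and the name `incentive_agent_reward` are assumptions made for illustration.

```python
from statistics import mean


def incentive_agent_reward(strong_counts, weak_counts):
    """Reward for the incentive-setting agent over one evaluation window.

    `strong_counts` and `weak_counts` are per-episode landmark counts
    collected while the chosen incentive was in force. The reward is the
    negative absolute gap between their means (assumed form), so it is
    highest, at zero, when the two sides perform equally.
    """
    return -abs(mean(strong_counts) - mean(weak_counts))
```

With a reward of this shape, the agent is penalized both for under-compensating the weak side and for over-compensating it, which matches the stated goal that neither team should gain an extra benefit.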

Results: First we train the RL agent for 100000 episodes of the underlying C-MADDPG. We then train agents using C-MADDPG for episodes, obtain the value of or using the pre-trained RL model, and summarize the results in fig. 6.
Team-RL-Agent-Dynamic [fig. 6(a) - fig. 6(e)] - In fig. 6(a), we observe that the balance of reward between the two teams tilts towards the weaker team after some time. We also find that asymptotically the landmark touching rate of the weaker player converges towards that of its stronger peers (fig. 6(b)). Its speed increases (fig. 6(d)) and the agent-based incentive diminishes to almost zero (fig. 6(e)). However, the team-wise incentive (fig. 6(e)) does not vary much beyond a certain time, indicating that the RL agent could not regulate the incentive value beyond the initial phase, so the weaker team receives an undue advantage and surpasses the initially stronger team.
Team-Dynamic-Agent-RL [fig. 6(f) - fig. 6(j)] - We observe that the reward match between the two teams is better than in fig. 6(a); the match is almost as good as the result obtained with the Dynamic Landmark based incentive scheme (fig. 5(a)). The landmark touching rate of the weaker agent slowly increases and catches up with that of the stronger agents (fig. 6(g)). The winning policy usage of both teams becomes roughly similar (fig. 6(h)). Also, the speed of the weaker player increases steadily towards the maximum speed (fig. 6(i)). Both the incentives (, ) decrease; however, the agent incentive persists a bit. This is because the RL agent learns that mere matching of speed does not necessarily mean the attainment of capability (which was one of the learnings of the speed-based incentive scheme (fig. 5(f) - fig. 5(j))). However, there may be some minor error in estimating performance from speed, which is reflected in the smaller mismatches in score observed in (fig. (a)). From the results, we can thus conclude that the Team-Dynamic-Agent-RL-Incentive scheme can be a good replacement for the Landmark-Based-Dynamic-Incentive scheme. Fig. 7 demonstrates the transition of the teams of different capabilities as the training progresses, as well as the evolution of the incentives used to close the gap between agent capabilities.

VI Comparison across incentive schemes

In this section, we explicitly compare all the incentive schemes, first through direct tournaments and then by comparing the performance of the weak teams trained under different incentive schemes.

VI-A Tournament between incentive schemes

Here we compare various incentive schemes in a more direct way by playing them against each other in a tournament style. We train C-MADDPG agents under various incentive schemes for episodes for different seeds. Both the competing teams have one weak and one strong player; teams trained with different incentive schemes are played against each other for test episodes for different combinations of competing models. Figure 8 reports the landmark count per episode, averaged over all experiments, for the weakest member of each team, normalized by the team performance. The results confirm the following facts. (i) The static agent incentive scheme trains the weak member better than the static-team or dynamic schemes. (ii) The dynamic landmark scheme trains the weak agent better than the dynamic-speed scheme, the RL-based schemes or the static schemes. (iii) Among the RL-based dynamic incentive schemes, the Team-RL-Agent-Dynamic scheme trains the weak agent worse than both the dynamic landmark and dynamic speed schemes. (iv) However, the performance of the weak agent trained under Team-Dynamic-Agent-RL lies between that of the dynamic-landmark and dynamic-speed schemes, which confirms Team-Dynamic-Agent-RL as a good substitute for the dynamic-landmark scheme.
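The quantity plotted in the tournament can be sketched as below. The text says only that the weakest member's landmark count is normalized by the team performance; taking "team performance" to be the team's total landmark count, and the names `normalized_weak_count` and `tournament_score`, are assumptions made for this sketch.

```python
from statistics import mean


def normalized_weak_count(weak_member_count, team_count):
    """Share of the team's landmark touches made by its weakest member."""
    if team_count == 0:
        return 0.0
    return weak_member_count / team_count


def tournament_score(experiments):
    """Average the normalized count over all experiments.

    `experiments` is a list of (weak_member_count, team_count) pairs,
    one per tournament run (hypothetical structure).
    """
    return mean(normalized_weak_count(w, t) for w, t in experiments)
```

Normalizing by the team's own count separates how much the weak member contributes from how well the team as a whole happens to do against a given opponent.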

VI-B Comparing the variation of performance of the incentive schemes

We recall that our primary motivation in bringing fairness to unequal competition is to incentivize the weak team in such a way that teams with unequal expertise end up achieving equivalent performance. So, an incentive scheme would be optimal if it can ensure little or no variation in performance (touching the landmark) among all the agents. To find the optimal candidate among the proposed incentive schemes, we check the variation and present the result in table III. For each incentive scheme, we take the last episodes and measure the standard deviation [jain1984quantitative] of the landmark count of the four agents. Since all experiments have been performed on four seeds, the table reports the mean of the standard deviation with its confidence interval. Defining fair learning of unequal agents as the socially optimal goal, table III shows the landmark-based dynamic scheme (with Team-Dynamic-Agent-RL as the closest alternative) to be the best performer in this respect. This is in line with the results shown in figs. 6(g) and 5(b), depicting the landmark count behavior of the agents under these two best performing schemes. The incentivized learning process ensures that both teams improve over time; the incentive scheme simply lets the weaker team’s weaker member improve faster and catch up with her more accomplished peers.
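The variation measure behind table III can be sketched as follows. Several details are assumptions, since the text does not specify them: the population form of the standard deviation across the four agents, a normal-approximation interval with `z = 1.96`, and the function names.

```python
from math import sqrt
from statistics import mean, pstdev, stdev


def spread_across_agents(per_agent_counts):
    """Standard deviation of the landmark count across the agents of one run."""
    return pstdev(per_agent_counts)


def summarize_over_seeds(per_seed_counts, z=1.96):
    """Mean spread over seeds and the half-width of its confidence interval.

    `per_seed_counts` holds, for each seed, the landmark counts of the four
    agents averaged over the final episodes of training.
    """
    spreads = [spread_across_agents(counts) for counts in per_seed_counts]
    half_width = z * stdev(spreads) / sqrt(len(spreads))
    return mean(spreads), half_width
```

A smaller mean spread means the four agents touch the landmark at more similar rates, which is the sense in which a scheme is called fairer here.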

Methods                   Landmark Count
StaticTeam                0.446 ± 0.16
StaticAgent               0.394 ± 0.04
DynamicLandmark           0.248 ± 0.02
DynamicSpeed              0.302 ± 0.08
Team-RL-Agent-Dynamic     0.450 ± 0.11
Team-Dynamic-Agent-RL     0.235 ± 0.01
TABLE III: Standard deviation of landmark count among the agents across incentive schemes (mean over seeds ± confidence interval).

VII Conclusion

The paper studies competition among organizations with unequal expertise and argues that certain incentive mechanisms directed towards the weaker members need to be formulated to ensure fair outcomes in the long run. The entire study to devise various incentive schemes is carried out using a multi-agent reinforcement learning framework. However, the implementation is not straightforward; in fact, we have to devise and extensively test a controller-assisted, ensemble-based multi-agent reinforcement learning algorithm, C-MADDPG, which captures the importance of (winning, losing) policy selection based on the relative positions of the agents and facilitates the emergence of diverse roles among the agents. This controller allows us to formulate a setting where the behavior of a weaker player in a team is relegated towards a non-primary role, thus restricting its development and, in turn, the development of the entire team. We argue that this roadblock can be removed through the introduction of a targeted incentive. We undertake a rigorous study to dynamically balance rewards and show that there are three components - skill (here speed), performance (here landmark touching), and policy learning - which determine the outcome of the game, and that one needs to exploit their relationship to devise a practical algorithm.

Finally, we validate the utility and effects of the proposed incentive schemes on Touch-Mark, a simple game designed in MPE. As a final comment, although we feel the findings are intriguing and reaffirm the concept of targeted subsidy, building a model whose outputs can be used to alleviate inequality in a practical environment is a non-trivial task. Similarly, extending the C-MADDPG framework to accommodate a dynamic set of policies per agent, as well as allowing agents within a team to simultaneously select different policies, are immediate research directions to explore on the algorithmic side.


The authors would like to thank Intel Corporation for supporting this research.