We formalize the current practice of strategic mining in multi-cryptocurrency markets as a game, and prove that any better-response learning in such games converges to equilibrium. We then offer a reward design scheme that moves the system configuration from any initial equilibrium to a desired one for any better-response learning of the miners. Our work introduces the first multi-coin strategic attack for adaptive and learning miners, as well as the study of reward design in a multi-agent system of learning agents.
Cryptocurrencies are an arms race. Hundreds of digital coins have crept into the worldwide market in the last decade MarketState , including more than a dozen with over a billion dollar Market Cap, e.g., buterin2014next ; BitcoinCash ; Litecoin ; Cardano ; Neo . The vast majority of cryptocurrencies are based on the notion of proof of work (PoW) Bitcoin . As a result, the major strategic players in the context of cryptocurrencies are miners who devote their power to solving computational puzzles to find PoWs Bitcoin ; buterin2014next .
The miners for a particular coin usually gain rewards that are proportional to the power they invest in the coin out of the total invested power (in the coin) by all miners. Each coin, therefore, can be viewed as having some weight that reflects the reward it divides among its miners. In practice, a coin’s weight (or reward) depends on its transaction rate, transaction fees, and its fiat exchange rate.
While the above description is not complete, it does capture the fundamental decision faced by the miner: where should I mine? One indication for reward-based coin switching can be found online in websites like www.whattomine.com whattomine , where miners enter their mining parameters (technology, power, cost, et cetera) and get a list of coins they can mine for, ordered by their profitability. Another interesting example happened on November 12, 2017 bitinfocharts , when a dramatic change in the Bitcoin to Bitcoin Cash BitcoinCash (a spin-off from Bitcoin) exchange rate led to a major inrush of miners from Bitcoin to Bitcoin Cash (see Figure 1).
All in all, the structure of the cryptocurrency market suggests that we face here a game among miners, where each miner wishes to mine coins of heavy weights while avoiding competition with other miners. In this paper we introduce for the first time the study of the cryptocurrency market as a game, consisting of a set of strategic players (miners) with possibly different mining powers and a set of coins with possibly different rewards (weights). The miners are free to choose to mine for any coin from the set, and we consider general better-response learning of the miners. That is, whenever any miner may benefit from deviating (i.e., changing the coin it mines for), some miner will take a step that improves his payoff; we allow an arbitrary sequence of such individual improvement steps (sometimes called improving path MondererShapley96 ). In our first major result we prove that any such better response learning converges to a (pure) equilibrium regardless of miner powers and coin rewards! This result is obtained by showing an ordinal potential, which according to MondererShapley96 , implies that arbitrary better response learning converges to equilibrium.
Having at hand the above fundamental result, we move to a discussion of strategic manipulation Nisan:2007:AGT:1296179 . While many efforts have been invested in the study of crypto-related manipulations SelfishMining ; OptimalSelfishMining ; StubbornMining ; MinersDilemma , we introduce for the first time the manipulation of the miners’ learning and optimization process. Given that a shift in the weight of a coin may influence miner behavior whattomine , in the cryptocurrency setting, it is quite possible for an interested party to affect this weight, either by creating additional transactions with high fees (sometimes called whale transactions liao2017incentivizing ) or by manipulating the coin exchange rate gandal2018price ; priceMuniplautionm ; whalesMuniplaution ; BitcoinMuniplaution . This way, a miner (or another interested party) can attempt to change the system equilibrium to a better one for them. We show that under broad circumstances, for every equilibrium of such a game, there exists a miner and another equilibrium in which the miner’s payoff is higher. The question is therefore: can one design rewards (i.e., temporarily increase coin weights) in a way that will lead the system from a given equilibrium to a desired one, so that the system will remain in the desired equilibrium after reverting to the original weights? Note that such reward design allows the manipulator to pay a finite cost while gaining an advantage indefinitely.
The above reward design problem is challenging since miners might take any better response step, and may make their moves in any order. Given the (modified) weights, we can use our previous major result to claim that any better response learning will converge to an equilibrium. Notice that the latter may not be the desired one, but now we can modify the rewards again. In the second major result of this paper we show that such desired reward design for learning agents is feasible! Namely, we provide a (multi-step) algorithm for assigning rewards in equilibrium states that moves learning agents from any initial equilibrium to a desired one.
In summary, our contributions are as follows:
We formalize strategic mining in multi-cryptocurrency markets as a game (Section 2).
We prove that any better-response learning in such games, starting from an arbitrary configuration, converges to equilibrium (Section 3).
We show that, in many cases, for every equilibrium there is a miner and another equilibrium in which the miner’s payoff is higher (Section 4).
We offer a reward design scheme that moves the system configuration from any initial equilibrium to a desired one for any better-response learning of the miners (Section 5).
Due to space limitations, the proofs of some of the claims we state here are deferred to Appendices C-F.
Results on better response learning convergence to pure equilibrium are rare and are typically restricted to games with exact potential NIPS2017_7216 ; MondererShapley96 , which coincide with congestion games. We show that our game does not have an exact potential (Section 3), and in fact our game belongs to the larger class of ID congestion games, where the payoff of a player depends on the player and the identity of other players who choose a similar resource, rather than on their number only. While there exist extensions of congestion games in which better-response learning converges to equilibrium (e.g., a restricted form of player-specific congestion games Milchtaich96 , which does not include our game), such results are extremely rare in the context of ID congestion games.
Unlike works on learning in games that emphasize adapting specific machine learning algorithms to minimize regret NIPS2017_7216 ; PalaiopanosPP17 ; NIPS2015_5763 ; learninggamesbook , we assume minimal rationality on behalf of the players, i.e., that they follow an arbitrary better response step improving their individual payoffs.
Our work also expands literature on reward design NIPS2010_4146 ; NIPS2017_7253 ; SorgSL10 , and to the best of our knowledge, is the first to introduce reward design for learning agents in a multi-agent setting. While seminal works in reward design assign/modify state rewards in a reinforcement learning context DBLP:books/lib/SuttonB98 , we design rewards for equilibrium states for any better response learning.
Though several previous works presented game theoretical analyses for cryptocurrencies liao2017incentivizing ; SelfishMining ; OptimalSelfishMining ; StubbornMining ; carlsten2016instability ; MiningGames ; Johnson14 ; MinersDilemma ; Schrijvers16 , the vast majority of them deal (in one way or another) with miners’ incentives to follow the coins’ mining protocols. Our work is the first to extend the study to a multi-coin setting and establish fundamental game theoretical results therein.
A system in our model is a tuple , where is a finite set of miners (players) and is a finite set of coins (resources). A miner has mining power , which it can invest in one of the coins , i.e., the set of possible actions of is . We denote the set of configurations of a system as and denote by the action of player in configuration . When clear from the context, we omit the subscript indicating the system and simply write . Given and , we denote by the set of miners who mine for in , i.e., , and by their total mining power, i.e., . For we denote by the configuration that is identical to except that is replaced by .
A reward function maps coins to rewards. A game consists of a system and a reward function . Every coin in a game divides its reward among all the players that mine for it, and the miners’ payoffs are defined as follows: For , the revenue per unit (RPU) of coin in is . When clear from the context, we omit the parameter indicating the game. The payoff function of a miner is .
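To make the payoff definition concrete, the following is a minimal Python sketch. It assumes, following the description in the introduction, that a coin's RPU is its reward divided by the total power mining it, and that a miner's payoff is its own power times the RPU of its coin. The function names, the dictionary-based representation, and the convention that an unoccupied coin has infinite RPU are illustrative assumptions, not notation from the paper.

```python
from fractions import Fraction
import math

def total_power(config, powers, coin):
    """Total mining power of the miners that mine `coin` in `config`."""
    return sum(powers[p] for p, c in config.items() if c == coin)

def rpu(config, powers, rewards, coin):
    """Revenue per unit of `coin`: its reward divided by the power mining it.

    Assumption: a coin with no miners is given infinite RPU.
    """
    power = total_power(config, powers, coin)
    return math.inf if power == 0 else Fraction(rewards[coin]) / power

def payoff(config, powers, rewards, miner):
    """Payoff of `miner`: its own power times the RPU of the coin it mines."""
    return powers[miner] * rpu(config, powers, rewards, config[miner])

# Hypothetical instance: three miners, two coins.
powers = {"p1": 1, "p2": 2, "p3": 3}
rewards = {"c1": 6, "c2": 4}
config = {"p1": "c1", "p2": "c1", "p3": "c2"}

for miner in powers:
    print(miner, payoff(config, powers, rewards, miner))
```

In this instance both coins are occupied, so the payoffs add up to the total reward of 10, which matches the statement that each coin divides its whole reward among its miners.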
Given a game , a configuration , a miner , and a coin , we say that moves from to in if it changes its action from to . A move from to is a better response step for if . We say that a miner is stable in a configuration in game if has no better response steps in . A configuration is stable or a (pure) equilibrium if every miner is stable in . A better response learning from in is a sequence of configurations resulting from a sequence of better response steps starting from , which is either infinite or ends with a stable configuration. In case it is finite, we say that it converges to its final configuration.
A function is an ordinal potential for a game if for any two configurations s.t. some better response step of a miner leads from to , it holds that . If, in addition, , then is an exact potential. By MondererShapley96 , if a game has an ordinal potential, then every better response learning converges.
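The definitions of better response steps, stable configurations, and better response learning can be sketched in Python as follows. This is an illustration under assumptions: the names `payoff_if`, `better_responses`, and `better_response_learning` are hypothetical, the arbitrary choice of the next step is modeled by a seeded random choice, and the instance is made up.

```python
import random
from fractions import Fraction

def payoff_if(config, powers, rewards, miner, coin):
    """Payoff `miner` would get by mining `coin`, all other miners fixed."""
    others = sum(powers[q] for q, c in config.items()
                 if c == coin and q != miner)
    return powers[miner] * Fraction(rewards[coin]) / (others + powers[miner])

def better_responses(config, powers, rewards):
    """All (miner, coin) moves that strictly increase the mover's payoff."""
    return [(p, c) for p in config for c in rewards
            if payoff_if(config, powers, rewards, p, c)
            > payoff_if(config, powers, rewards, p, config[p])]

def better_response_learning(config, powers, rewards, rng):
    """Apply arbitrary (here: random) better response steps until stable."""
    config = dict(config)
    steps = 0
    while True:
        moves = better_responses(config, powers, rewards)
        if not moves:  # every miner is stable: a pure equilibrium
            return config, steps
        miner, coin = rng.choice(moves)
        config[miner] = coin
        steps += 1

# Hypothetical instance: everyone starts on the least rewarding coin.
powers = {"p1": 1, "p2": 2, "p3": 3, "p4": 5}
rewards = {"c1": 10, "c2": 7, "c3": 2}
start = {p: "c3" for p in powers}

final, steps = better_response_learning(start, powers, rewards, random.Random(0))
print(final, steps)
```

The loop has no step limit on purpose: its termination on every run is exactly what the convergence result claims for any order of moves.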
In this section we prove that although a game has no exact potential, every better-response learning of the miners in game converges to a stable configuration (pure equilibrium) regardless of the sets and and the reward function . To gain intuition, the reader is referred to Appendices A and B, where we show how to construct a particular equilibrium in a game for any and , and give a simple ordinal potential function for the symmetric case in which is a constant function i.e., , , respectively.
We start by showing that our game does not have an exact potential.
The game does not always have an exact potential.
Let be a game where , , , and . Assume by way of contradiction that has an exact potential function , and consider the following four configurations:
. Payoffs: , .
. Payoffs: , .
. Payoffs: , .
. Payoffs: , .
Note that . However, by definition of exact potential, we get that . A contradiction.
∎
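The contradiction can be checked numerically. With an exact potential, the deviators' payoff changes along any closed cycle of unilateral moves would equal the potential differences, which telescope to zero. The sketch below evaluates such a cycle of four configurations on a small instance chosen here for illustration (two miners with powers 1 and 2, two coins with equal rewards); these values are an assumption and not necessarily those used in the proof.

```python
from fractions import Fraction

# Hypothetical instance: unequal mining powers, equal coin rewards.
powers = {"p1": 1, "p2": 2}
rewards = {"c1": 1, "c2": 1}

def payoff(config, miner):
    """Miner's power times reward of its coin over total power on that coin."""
    coin = config[miner]
    power = sum(powers[q] for q, c in config.items() if c == coin)
    return powers[miner] * Fraction(rewards[coin]) / power

# Closed cycle: consecutive configurations differ in exactly one miner's action.
cycle = [
    {"p1": "c1", "p2": "c1"},
    {"p1": "c1", "p2": "c2"},  # p2 moved
    {"p1": "c2", "p2": "c2"},  # p1 moved
    {"p1": "c2", "p2": "c1"},  # p2 moved
    {"p1": "c1", "p2": "c1"},  # p1 moved, back to the start
]

def cycle_sum(cycle):
    """Sum of the mover's payoff change over every step of the cycle."""
    total = Fraction(0)
    for before, after in zip(cycle, cycle[1:]):
        (mover,) = [p for p in before if before[p] != after[p]]
        total += payoff(after, mover) - payoff(before, mover)
    return total

print(cycle_sum(cycle))  # -2/3
```

The sum is -2/3 rather than 0, so no exact potential can exist for this instance. With equal mining powers the same cycle sums to zero, which suggests that the obstruction comes from the miners' different powers.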
To show an ordinal potential, we use the following definitions:
For a configuration in a game , we define to be the sequence of pairs in ordered lexicographically from smallest to largest. Denote by the coin (second element of the pair) in the entry in . Consider the ordered set , where is the set of all possible lists in , and is the lexicographical order. The rank of a list , , is the rank of in from smallest to largest.
Note that since and are finite, we know that and are finite. The following two observations establish a connection between better response steps and the of the associated coins.
Consider a game , , , and s.t. . Then in every better response step of that changes to a coin , it holds that .
Consider a game . If some better response step from configuration to configuration of a miner changes to , then .
We are now ready to prove that any game has an ordinal potential function.
For any finite sets and of miners and coins and reward function , is an ordinal potential in the game .
Consider two configurations s.t. some better response step of a miner leads from configuration to configuration , and let and . We need to show that . Since only the RPUs of and are affected we get that
(1) |
By Observation 1, we get that , and thus . By Observation 2, we get that , and thus, together with the definition of and Equation 1, we get that . Therefore, none of them “move down” to a position before in and so
(2) |
That is, the first elements of are equal to the first elements of . Hence, it suffices to show that the element of is lexicographically larger than the element of . Let . From Equation 2, we know that , so there are two possible cases:
First, . The theorem follows from Observation 2.
Second, . In this case,
as needed.
∎
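The theorem can be illustrated by tracking the potential along a learning run. The sketch below represents the potential directly by the sorted list of (RPU, coin) pairs, relying on Python's lexicographic comparison of lists and tuples instead of computing an explicit rank. The function names, the instance, and the convention that an unoccupied coin has infinite RPU are assumptions made for illustration.

```python
import math
import random
from fractions import Fraction

def rpu(config, powers, rewards, coin):
    power = sum(powers[p] for p, c in config.items() if c == coin)
    # Assumption: an unoccupied coin has infinite RPU.
    return math.inf if power == 0 else Fraction(rewards[coin]) / power

def potential(config, powers, rewards):
    """Sorted (RPU, coin) pairs; Python compares such lists lexicographically."""
    return sorted((rpu(config, powers, rewards, c), c) for c in rewards)

def better_responses(config, powers, rewards):
    """Moves after which the mover's coin has a higher RPU than its old coin had."""
    moves = []
    for p, current in config.items():
        for c in rewards:
            if c == current:
                continue
            after = dict(config)
            after[p] = c
            if rpu(after, powers, rewards, c) > rpu(config, powers, rewards, current):
                moves.append((p, c))
    return moves

def check_potential_increases(config, powers, rewards, rng):
    """Run random better response steps, asserting the potential grows each time."""
    config = dict(config)
    steps = 0
    while True:
        moves = better_responses(config, powers, rewards)
        if not moves:
            return steps
        p, c = rng.choice(moves)
        before = potential(config, powers, rewards)
        config[p] = c
        assert potential(config, powers, rewards) > before
        steps += 1

# Hypothetical instance.
powers = {"p1": 1, "p2": 2, "p3": 3, "p4": 5}
rewards = {"c1": 10, "c2": 7, "c3": 2}
start = {p: "c3" for p in powers}

step_counts = [check_potential_increases(start, powers, rewards, random.Random(seed))
               for seed in range(20)]
print(step_counts)
```

Comparing RPUs is enough to detect a better response step here because the mover's power multiplies both sides of the payoff comparison.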
Before moving to our second major result in which we describe a manipulation through dynamic reward design that transitions the system between equilibria, in this section we show that under broad circumstances, in every stable configuration there is at least one miner who has higher payoff in another stable configuration. This means that such a miner will gain from moving the system there. Specifically, we prove this for games that satisfy the following assumptions (note that we use these assumptions only in this section):
For a configuration in a game , if there is a coin s.t. , then there is a miner s.t. changing to is a better response step for .
Although this assumption cannot hold when , it often holds in practice since the number of miners must be much larger than the number of coins for the cryptocurrency to be secure (truly decentralized).
For any two coins and two sets of players in a game , .
This assumption is common in game theory DBLP:journals/mss/HolzmanL03 , and it makes sense in our game since mining power in practice is measured in billions of operations per hour and coin rewards are coupled with coin fiat exchange rates, so exact equality is unlikely.
The following observation follows from Assumption 1 and the fact that coins that are chosen by at least one miner always divide their entire reward. It stipulates that in every stable configuration, the sum of the payoffs the miners get is equal to the sum of the coins’ rewards.
For every stable configuration in a game under Assumption 1, it holds that .
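The identity can be unpacked in one line. The symbols below are notation assumed here for illustration: $u_p(s)$ for the payoff of miner $p$ in configuration $s$, $m_p$ for its mining power, $r(c)$ for the reward of coin $c$, and $P_c(s)$, $M_c(s)$ for the set of miners on $c$ and their total power. The last equality reads Assumption 1 as ruling out unoccupied coins in a stable configuration, since an unoccupied coin would give some miner a better response step.

```latex
\sum_{p} u_p(s)
  = \sum_{c \,:\, P_c(s) \neq \emptyset} \; \sum_{p \in P_c(s)} m_p \cdot \frac{r(c)}{M_c(s)}
  = \sum_{c \,:\, P_c(s) \neq \emptyset} r(c) \cdot \frac{M_c(s)}{M_c(s)}
  = \sum_{c \,:\, P_c(s) \neq \emptyset} r(c)
  = \sum_{c} r(c)
```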
It remains to show that has more than one stable configuration. Consider s.t. , and , let . We first show that the game has two different configurations in which miners do not share a coin and at most one of them is unstable. Then, we inductively construct two configurations in , , based on the two configurations in , in which all miners in keep their locations and all miners except maybe the one that was unstable in are stable. The construction step is captured by Claim 5, where in the step.
Let be a reward function. Consider a system , and a configuration . Now consider another system s.t. , , and . Let and consider a configuration s.t. for all and . Then is stable in in game , and every player that is stable in in is also stable in in .
Finally, we show that the two configurations we construct in are stable: Let be the (possibly) unstable miner. By Assumption 1 (note that the assumption refers only to game ), cannot be alone in a coin (otherwise there must be another unstable miner), and thus it shares the coin with a smaller stable miner, which we show implies that is stable.
In this section we consider a system , where s.t. . For every reward function and every two stable configurations in game we describe a mechanism to move the system from to the desired configuration by temporarily increasing coin rewards. Note that once we lead the system to , we can return to the original rewards (i.e., stop manipulating coin weights) because is stable in . Therefore, a manipulator who gains from moving to a desired stable configuration can do it with a bounded cost.
We first define a reward design function that maps system configurations to reward functions.
Consider a system . A reward design function for system is a function mapping every configuration to a reward function, i.e., .
Consider a system and a reward function . A dynamic reward design mechanism for game is an algorithm that for any two stable configurations in moves the system from to by following the protocol in Algorithm 1.
To describe a dynamic reward design algorithm we need to specify the reward design function for every loop iteration in Algorithm 1. Intuitively, we observe that miners with less mining power are easily moved between coins, meaning that we can increase a coin reward so that a small miner with little mining power will benefit from moving there, but bigger miners with more mining power prefer to stay in their current locations. Therefore, the idea is to evolve the current configuration to in stages, where in stage , we move the miners with the smallest mining powers to the location (coin) of miner in the final configuration (i.e., ) while keeping the remaining miners in their (final) places. To this end, we define intermediate configurations. For , , we define as:
(3) |
That is, in , miners are in their final locations and miners are in the final location of miner . Note that . Figure 1(a) illustrates the stage transitions in the algorithm.
Notice that since we allow arbitrary better response learning (in every iteration), choosing a reward design function is a subtle task; miners can move according to any better response step, and we cannot control the order in which miners move. One may attempt to design a reward function so that in the resulting game there is exactly one unstable miner with exactly one better response step in the current configuration. However, even given such a function, after that miner takes its step, other miners might become unstable, which can in turn lead to a learning process that depends on the order in which miners move and on the choices they make (in case they have more than one better response step). Hence, the main challenge is to be able to restrict the set of the possible stable configurations reached by learning phase in each iteration.
In every loop iteration of stage we pick a miner that we want to move from to (as explained shortly) and choose the reward function carefully so that (1) ’s only better response step is , (2) all other miners are stable, and (3) in every stable configuration reached by better response learning after ’s step, is in , all miners are in either or , and all the other (bigger) miners remain in their (final) locations.
Moreover, our proof shows by induction that our reward design function of stage (defined below) guarantees that the set of possible configurations reached by learning in stage is
Notice that the stage starts at . We now explain how we choose the reward design function for stage . First, for every configuration , the index of the miner we want to move from to (called mover) is
Note that for every , . Moreover, and . Let . Intuitively, we use as an anchor in configuration ; we choose a reward function that increases the reward of coin as high as possible without making the anchor unstable. As a result, all the miners in (who are bigger than or equal to the anchor) remain stable, and miner has a unique better response step to move to . Figure 1(b) illustrates and for some configuration .
In order to make sure that miners not in also remain stable, and in order to guarantee that every better response learning after ’s step converges to a configuration in , we choose a reward function that evens out the RPUs of all coins other than . For , let . The reward design function for stage is:
(4) |
Note that the RPUs of all coins except in the game are equal to . In addition, note that if a miner bigger than or equal to moves to , then ’s RPU becomes no bigger than . However, since , has a unique better response step to move to . Therefore, we get that our reward design function allows us to control the first step of the learning process. In the next section we give more intuition on how it also restricts the stable configuration at the end of any learning process at stage to the set .
As for the first stage, note that we need to move all miners to coin , so intuitively we only need to increase its reward high enough. We therefore choose:
(5) |
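The intuition behind the first stage can be sketched in Python. The threshold used below is an assumption chosen for illustration and is not necessarily the reward given by Equation 5: the target coin's reward is raised above (total mining power × largest reward) / (smallest mining power). Under that bound, even the smallest miner earns more on the target coin when sharing it with everyone than it could earn on any other coin, so every miner outside the target has a better response step into it and no miner inside wants to leave. The helper names and the instance are hypothetical.

```python
import random
from fractions import Fraction

def payoff_if(config, powers, rewards, miner, coin):
    """Payoff `miner` would get by mining `coin`, all other miners fixed."""
    others = sum(powers[q] for q, c in config.items()
                 if c == coin and q != miner)
    return powers[miner] * Fraction(rewards[coin]) / (others + powers[miner])

def better_responses(config, powers, rewards):
    return [(p, c) for p in config for c in rewards
            if payoff_if(config, powers, rewards, p, c)
            > payoff_if(config, powers, rewards, p, config[p])]

def learn(config, powers, rewards, rng):
    """Arbitrary (random) better response learning until a stable configuration."""
    config = dict(config)
    while True:
        moves = better_responses(config, powers, rewards)
        if not moves:
            return config
        miner, coin = rng.choice(moves)
        config[miner] = coin

def stage_one_rewards(powers, rewards, target):
    """Raise only the target coin's reward above the assumed sufficient bound."""
    bound = (Fraction(sum(powers.values())) * max(rewards.values())
             / min(powers.values()))
    boosted = dict(rewards)
    boosted[target] = bound + 1
    return boosted

# Hypothetical instance and starting configuration.
powers = {"p1": 1, "p2": 2, "p3": 3, "p4": 5}
rewards = {"c1": 10, "c2": 7, "c3": 2}
start = {"p1": "c3", "p2": "c1", "p3": "c2", "p4": "c1"}

boosted = stage_one_rewards(powers, rewards, "c3")
finals = [learn(start, powers, boosted, random.Random(seed)) for seed in range(10)]
print(boosted["c3"], finals[0])
```

Whatever order the miners move in, every run ends with all miners on the target coin. The later stages are more delicate, since they must move some miners while keeping others in place.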
In Algorithm 2 we present our reward design algorithm, and in the next section we outline the proof that every stage eventually completes.
The proof for stage 1 is straightforward so we skip it. Consider stage . We prove in the appendix the following technical lemma about stable configurations in the stage:
Consider a configuration . Then every better response learning in the game that starts at converges to a configuration such that:
, .
.
As part of the proof, we show that within stage , all the reached configurations (both stable and unstable) are in . Let and . After moves to according to its only better response step, in the resulting configuration , the RPUs of all coins not in remain . Moreover, , and . Therefore, although , it is still not high enough to drive miners not in (by definition, bigger than ) to move to it. So the only miners that possibly have better response steps at are miners in who wish to move to . Moreover, the total mining power of the miners who actually move to is smaller than , otherwise, ’s RPU will go below . In the proof we use the above intuition to formulate an invariant that captures the lemma statement and prove it by induction on better response steps. The lemma then follows from Theorem 1 (every better response learning converges to a stable configuration).
We next use Lemma 1 to prove that every stage completes in a finite number of loop iterations. To this end, we associate with every configuration a binary vector indicating, for each , whether is in , where it needs to be at the end of the stage. Consider the ordered set , where is the set of all binary vectors of length , and is the lexicographical order. For a configuration , we define to be a vector in such that: and the function to be the rank of in .
Every stage of Algorithm 2 completes in a finite number of loop iterations.
By definitions of stage and set , the first configuration of stage is . By definition of , for all , . By inductively applying Lemma 1, we get that every loop iteration in stage ends in a configuration in . Therefore, for a loop iteration of stage that starts in configuration and ends in configuration , we get by Lemma 1 that and , . Therefore . Now since the set is finite, we get that after a finite number of iterations we reach configuration .
∎
Our work studies and challenges the cryptocurrency market from a novel angle: the strategic selections by adaptive miners among multiple coins. There are several central follow-ups one may consider. First, our reward design is effective for arbitrary better-response learning, but one may wonder about its speed of convergence under specific markets. In addition, we consider convergence to equilibrium, and one may consider also convergence to a bad (possibly unstable) configuration in which, for example, a particular miner will have a dominant position in a coin, killing (at least for a while) the basic guarantee of non-manipulation (security) for that coin and allowing the miner to get a bigger portion of the reward. One also may wonder about the asymmetric case where some coins can be mined only by a subset of the miners.
We show here how to find a stable configuration (pure equilibrium) in the game for any , and . We do this by induction, selecting coins for miners in descending order of mining power.
Consider a reward function , a system , and another system s.t. , , and . Then, if the game has a stable configuration, then the game has a stable configuration as well.
Let be a stable configuration in . We use it to build a configuration in in the following way: for all set , and set to . We now show that configuration is stable in .
First, consider . Since we pick to be , we get that , , so is stable in . Next, consider a miner s.t. . Since is stable, we know that , . Now since and for all , we get that . Meaning that is stable in .
Finally, consider s.t. . We need to show that for all , . Again, since we pick to be , we know that , , and thus , . Now by the claim assumption, . Therefore, , . Thus, , . Meaning that is stable in .
∎
For any set of miners , set of coins , and a reward function , the game has a stable configuration.
Order miners by mining power so that . First we show that the game has a stable configuration. Let , and define a configuration in as follows: . Since , we get that the configuration is stable in . The lemma follows by inductively applying Claim 6.
∎
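The inductive construction can be sketched as a greedy procedure: process the miners in descending order of mining power, and let each one join the coin that maximizes its payoff given the larger miners already placed. The function names and the instance below are illustrative assumptions; the stability check simply re-tests every miner against every coin.

```python
from fractions import Fraction

def payoff_if(config, powers, rewards, miner, coin):
    """Payoff `miner` would get on `coin`, given the other placed miners."""
    others = sum(powers[q] for q, c in config.items()
                 if c == coin and q != miner)
    return powers[miner] * Fraction(rewards[coin]) / (others + powers[miner])

def construct_equilibrium(powers, rewards):
    """Place miners from largest to smallest, each on its best coin so far."""
    config = {}
    for miner in sorted(powers, key=powers.get, reverse=True):
        config[miner] = max(
            rewards,
            key=lambda coin: payoff_if(config, powers, rewards, miner, coin))
    return config

def is_stable(config, powers, rewards):
    """True if no miner has a strictly better response step."""
    return all(
        payoff_if(config, powers, rewards, p, c)
        <= payoff_if(config, powers, rewards, p, config[p])
        for p in config for c in rewards)

# Hypothetical instance.
powers = {"p1": 1, "p2": 2, "p3": 3, "p4": 5}
rewards = {"c1": 10, "c2": 7, "c3": 2}

equilibrium = construct_equilibrium(powers, rewards)
print(equilibrium, is_stable(equilibrium, powers, rewards))
```

Processing the miners from largest to smallest matters: the claim above only covers adding a miner whose power is no larger than that of the miners already placed.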
We consider here the symmetric case where all coin rewards are equal, and show that any game in this case has a simple potential function.