There has been a long history of developments in digital cryptocurrencies. Early digital cryptocurrencies that relied on centralized authorities to settle transactions failed; success came only with the emergence of Bitcoin. To avoid single points of failure, Bitcoin is designed as a decentralized system free of the control of central authorities, which could be compromised by corruption and attacks. Since the birth of Bitcoin in 2008, it has become a widely accepted currency all over the world. In early 2018, the market price of Bitcoin went as high as 20,000 US dollars, reflecting robust demand and enthusiasm for Bitcoin among the public.
The security of Bitcoin is built on the foundational technology of blockchain. Blockchain contains several key technical components, including its chained data structure, peer-to-peer network protocol, and distributed consensus algorithm [21, 23, 16]. Bitcoin's blockchain is not controlled by a central authority: it is assembled by peers in the network independently in a distributed manner. For the blockchains maintained by different peers to be consistent, the peers must come to an agreement on a single universal truth about the transactions of Bitcoin through a consensus-building process.
Consensus in the Bitcoin network is achieved by the proof-of-work (PoW) consensus algorithm. The idea of PoW originated in  and was rediscovered and exploited in the implementation of Bitcoin. PoW provides a strong probabilistic consensus guarantee with resilience against malicious nodes holding up to 1/2 of the computing power [6, 15]. The successful operation of Bitcoin demonstrates the practicality of using PoW to achieve consensus. Subsequent to Bitcoin, many other cryptocurrencies, such as Litecoin and Ethereum, have also adopted the PoW consensus algorithm.
Peers running the PoW consensus algorithm are miners who compete to solve a difficult cryptographic hash puzzle, called the PoW problem. The miner who successfully solves the PoW problem obtains the right to extend the blockchain with a block consisting of valid transactions. In doing so, the miner receives a reward in the form of newly minted coins written into the added block. Solving the PoW problem for rewards is called mining, by analogy with mining for precious metals.
Miners invest computation resources to solve the PoW problem. Previously, it was believed that the most profitable mining strategy is honest mining, wherein a miner broadcasts a newly mined block as soon as it has solved the PoW problem. Let $\alpha$ be the ratio of a particular miner's computing power to the total computing power of all miners. This ratio is also the probability that the miner solves the PoW problem before the others in each round of block addition. Over the long term, the rewards to a miner that executes the honest mining strategy are therefore a fraction $\alpha$ of the total rewards issued by the Bitcoin network. This is reasonable, since miners share the pie in proportion to their investments. It was not known whether there are other mining strategies more profitable than honest mining.
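The proportional-share argument above can be checked with a small Monte Carlo sketch. It assumes that each round is won independently with probability equal to the miner's computing-power ratio; the names `honest_share`, `alpha`, and `rounds` are illustrative and do not come from the paper.

```python
import random


def honest_share(alpha: float, rounds: int, seed: int = 0) -> float:
    """Fraction of rounds won by a miner holding a share `alpha` of the computing power.

    Assumption: every round is an independent trial that the miner wins
    with probability `alpha` (all miners follow honest mining).
    """
    rng = random.Random(seed)
    wins = sum(1 for _ in range(rounds) if rng.random() < alpha)
    return wins / rounds


# With 30% of the computing power, the long-run reward share should be close to 0.3.
share = honest_share(alpha=0.3, rounds=100_000)
```

Under honest mining, every won round yields one accepted block, so the fraction of rounds won is also the fraction of rewards earned.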
Later, the authors of  developed a selfish mining strategy that can earn higher rewards than honest mining. A selfish miner does not broadcast its mined block immediately; it carries out a block-withholding attack by secretly linking its future mined blocks to the withheld mined block. If the selfish miner can mine two successive blocks before other miners do, it can broadcast its two blocks at the same time to override the block mined by others. Since Bitcoin has an inherent self-adjusting mechanism to ensure that on average only one block is added to the blockchain every 10 minutes, the selfish miner can increase its own profits by invalidating the blocks of others (hence, removing them from the blockchain). For example, with computing power ratio , the rewards obtained by selfish mining can be up to fraction of the total rewards . Based on this observation,  further proposed different selfish mining strategies with even higher rewards. Despite the many versions of selfish mining, the optimal, most profitable mining strategy remained elusive until .
The authors of  formulated the mining problem as a general Markov Decision Process (MDP) with a large state-action space. The objective of the mining MDP, however, is not a linear function of the rewards as in standard MDPs. Thus, the mining MDP cannot be solved using a standard MDP solver. To solve the problem,  first transformed the mining MDP with the non-linear objective to a family of MDPs with linear objectives, and then employed a standard MDP solver over the family of MDPs to find the optimal mining strategy.
The approach in  is a model-based approach in which various parameter values (e.g., the adversary's computing-power ratio and its communication capability) must be known before the MDP can be set up. In real blockchain networks, the exact parameter values are not easy to obtain and may change over time, hindering the practical adoption of the solution. In this paper, we propose a model-free approach that solves the mining MDP using machine learning tools. In particular, we solve the mining MDP using reinforcement learning (RL) without the need to know the parameter values in the mining MDP model.
RL is a machine-learning paradigm in which agents learn successful strategies that yield the largest long-term reward from trial-and-error interactions with their environment [20, 8]. Q-learning is one of the most popular RL techniques. It can learn a good policy by updating a state-action value function without an operating model of the environment. RL has been successfully applied in many challenging tasks, e.g., playing video games and Go, and controlling robotic movements.
The original RL algorithm cannot deal with the nonlinear objective function of our mining problem. In this paper, we put forth a new multi-dimensional RL algorithm to tackle the problem. Experimental results indicate that our multi-dimensional RL mining algorithm can successfully find the optimal strategy. Importantly, it demonstrates robustness and adaptability to a changing environment (i.e., parameter values changing dynamically over time).
II Blockchain Preliminaries
Blockchain is a decentralized append-only ledger for digital assets. The data of blockchain is replicated and shared among all participants. Its past recorded data are tamper-resistant, and participants can only append new data to the tail end of the chain of blocks. The state of blockchain is changed according to transactions, and transactions are grouped into blocks that are appended to blockchain. The header of the block encapsulates the hash of the preceding block, the hash of this block, the Merkle root of all transactions contained in this block (the Merkle root is the hash value of the Merkle tree whose leaves are the transactions), and a number called the nonce that is generated by PoW. Since each block must refer to its preceding block by placing the hash of its preceding block in its header, all the blocks form a chain of blocks arranged in chronological order. Fig. 1 illustrates the data structure of blockchain.
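The chained data structure can be sketched in a few lines. The sketch below is a simplification under stated assumptions: the header is encoded as JSON and hashed with a single SHA-256, whereas real systems such as Bitcoin use a fixed binary header layout. The names `block_hash`, `make_block`, and `chain_is_valid` are illustrative.

```python
import hashlib
import json


def block_hash(header: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the header (illustrative encoding)."""
    encoded = json.dumps(header, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def make_block(prev_hash: str, merkle_root: str, nonce: int) -> dict:
    """Build a block whose header points to its predecessor via `prev_hash`."""
    header = {"prev_hash": prev_hash, "merkle_root": merkle_root, "nonce": nonce}
    return {"header": header, "hash": block_hash(header)}


def chain_is_valid(chain: list) -> bool:
    """Check that every block references its predecessor and matches its own hash."""
    for prev, cur in zip(chain, chain[1:]):
        if cur["header"]["prev_hash"] != prev["hash"]:
            return False
        if cur["hash"] != block_hash(cur["header"]):
            return False
    return True


genesis = make_block("0" * 64, "root-0", 0)
b1 = make_block(genesis["hash"], "root-1", 7)
b2 = make_block(b1["hash"], "root-2", 3)
chain = [genesis, b1, b2]
```

Changing any field of an earlier header changes that block's hash, which breaks the link stored in every later block; this is what makes past data tamper-resistant.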
II-A Proof of Work and Mining
Blockchain adopts the PoW consensus protocol to validate new blocks in a decentralized manner. In each round, the PoW protocol selects a leader that is responsible for packing transactions into a block and appending this block to blockchain. To prevent adversaries from monopolizing the blockchain, the leader selection must be approximately random. Since blockchain is permissionless and anonymity is an inherent design goal, the protocol must account for the sybil attack, where an adversary simply creates many participants with different identities to increase its probability of being selected as the leader. To address these issues, the key idea behind PoW is that a participant is randomly selected as the leader of each round with a probability in proportion to its computing power.
In particular, blockchain implements PoW using computational hash puzzles. To create a new block, the nonce placed into the header of the block must be a solution to the hash puzzle expressed by the following inequality:

$$H(n, p, m) < D, \tag{1}$$

where the nonce $n$, the hash of the previous block $p$, and the Merkle root of all included transactions $m$ are taken as the input of a cryptographic hash function $H(\cdot)$, and the output of the hash function should fall below a target $D$ that is small with respect to the whole range of the hash function outputs. The hash function used (e.g., SHA-256 for Bitcoin) satisfies the property of puzzle friendliness: it is challenging to guess a nonce that fulfills (1) in a one-shot try. The only way to solve (1) is to try a large number of nonces one by one until one lucky nonce that fulfills (1) is found. Therefore, the probability of finding such a nonce is proportional to the computing power of the participant: the faster the hash function in (1) can be computed in each trial, the more nonces can be tried per unit time. Using the blockchain terminology, the process of computing hashes to find a nonce is called mining, and the participants involved are called miners.
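The nonce search described above can be illustrated with a toy miner. This is a minimal sketch, assuming a single SHA-256 over a simple string encoding of the three inputs and a deliberately easy target so that the search finishes quickly; the names `pow_hash`, `mine`, and `target` are illustrative, and the encoding is not Bitcoin's actual header format.

```python
import hashlib


def pow_hash(nonce: int, prev_hash: str, merkle_root: str) -> int:
    """Hash the three puzzle inputs and interpret the digest as a 256-bit integer."""
    data = f"{nonce}|{prev_hash}|{merkle_root}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def mine(prev_hash: str, merkle_root: str, target: int, max_tries: int = 1_000_000):
    """Try nonces one by one until the hash falls below `target`; None if none is found."""
    for nonce in range(max_tries):
        if pow_hash(nonce, prev_hash, merkle_root) < target:
            return nonce
    return None


# A target of 2**244 means roughly one hash in 2**12 succeeds (toy difficulty).
target = 1 << 244
nonce = mine("00" * 32, "example-merkle-root", target)
```

Lowering `target` makes the puzzle harder in direct proportion, which is how the difficulty of real networks is tuned.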
II-B Honest Mining Strategy
When a miner tries to append a new block to the latest legal block by placing the hash of the latest block in the header of the new block, we say that the miner mines on the latest block. Blockchain is maintained by miners in the following manner.
To encourage all miners to mine on, and maintain, the current blockchain, a reward is given as an incentive to the miner by placing a coin-mint transaction in its mined block that credits the miner with some new coins. If the block is verified and accepted by other peers in the blockchain network, the reward is effective and thus can be spent on the blockchain. When a miner has found an eligible nonce, it publishes its block to the whole blockchain network. Other miners will verify the nonce and transactions contained in that block. If the block passes verification, other miners will mine on the block (implicitly accepting the block); otherwise, other miners discard the block and continue to mine on the previous legal block.
If two miners publish two different legal blocks that refer to the same preceding block at the same time, the blockchain is forked into two branches. This is called forking of blockchain. A fork is an undesirable feature of blockchain because it is a manifestation that consensus among peers has not been reached. It can also compromise the integrity and security of the blockchain. To resolve a fork, PoW prescribes that only the rewards of blocks on the longest branch (called the main chain) are effective. Miners are thus incentivized to mine on the longest branch, i.e., miners always add new blocks after the last block on the longest main chain that is observed from their local perspectives. If the forked branches are of equal length, miners may mine subsequent blocks on either branch randomly. This is referred to as the rule of longest chain extension. Eventually, one branch will predominate and the other branches are discarded by peers in the blockchain network.
The mining strategy of adhering to the rule of longest chain extension and publishing a block immediately after the block is mined is referred to as honest mining [21, 23, 16]. The miners that comply with honest mining are called honest miners. It was widely believed that the most profitable mining strategy for miners is honest mining, and that when all miners adopt honest mining, each miner is rewarded in proportion to its computing power [21, 23, 16]. As a result, no rational miner would deviate from honest mining. This belief was later shown to be ill-founded: other mining strategies with higher profits are possible [5, 14, 17]. We will briefly discuss these mining strategies in the next section. For a more concrete exposition, we will first present the mining model.
III Blockchain Mining Model
In this section, we present the Markov Decision Process (MDP) model for blockchain mining. Ref.  first developed an MDP mining model and used the model to construct a selfish mining strategy with higher rewards than honest mining. Then,  proposed even more profitable selfish mining strategies. Recently,  extended the MDP mining models of [5, 14] to a more general form. In this work, we adopt the mining model of .
Without loss of generality, we assume the network is split into two mining pools: one is an adversary that controls a fraction $\alpha$ of the whole network's computing power; the other is the network of honest miners that controls a fraction $1-\alpha$ of the computing power of the whole network.
Even if the adversary and an honest miner release their newly mined blocks to the network simultaneously, the blocks will not be received by all miners simultaneously due to propagation delays and network connectivity. We model the communication capability of the adversary using the parameter $\gamma$, defined as the fraction of the honest miners that will first receive the block from the adversary when the adversary and one honest miner release their blocks at approximately the same time. More specifically, $\gamma(1-\alpha)$ is the computing power of the honest network that will mine on the block of the adversary when the adversary and an honest miner release their blocks simultaneously.
As in , we model blockchain mining as a single-player MDP $M = \langle S, A, P, R \rangle$, where $S$ is the state space, $A$ is the action space, $P$ is the transition probability matrix, and $R$ is the reward matrix. Each transition is triggered by the event of a miner mining a new block, whether the block is mined by the adversary or one of the honest miners. The action taken by the adversary based on the previous state, together with the event, determines the next state to which the system evolves.
The objective of the adversary is to earn a share of the rewards that is higher than its share of the computing power. To achieve this, the adversary will generally deviate from honest mining by building a private chain of blocks without releasing them the moment the blocks are mined; the adversary will release a number of blocks from its private chain at a time to undo the honest chain opportunistically.
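TABLE I: State transition and reward matrices of the mining MDP.
| Current State, Action | Next State | Transition Probability | Reward |
The action override is allowed when $a > h$; the action match is allowed when $a \ge h$.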
State: Each state in the state space $S$ is represented by a three-tuple of the form $(a, h, fork)$, where $a$ and $h$ are respectively the lengths of the adversary's chain and the honest network's chain after the latest fork (as illustrated in Fig. 2(a)). In general, $fork$ can take three possible values: irrelevant, relevant, and active. Their meanings will be explained later.
Action: The action space $A$ includes four actions that can be executed by the adversary.
Adopt: The adversary accepts the honest chain and mines on the last block of the honest chain. This action discards the blocks in the chain of the adversary and renews the attack from the new starting point without a fork. This action is allowed by the MDP model for all $a$ and $h$.
Override: The adversary publishes one block more than the honest chain (i.e., $h+1$ blocks) to the whole network. This action overrides the conflicting blocks of the honest chain. This action is allowed when $a > h$.
Match: The adversary publishes the same number of blocks as the honest chain (i.e., $h$ blocks) to the whole network. This action creates a fork deliberately and initiates an open mining competition between the two branches of the adversary and the honest network. This action is allowed when $a \ge h$ and $fork = \text{relevant}$.
Wait: The adversary does not publish blocks and it just keeps mining on its own chain. This action is always feasible.
One remark about the actions of the MDP mining model is that some actions that can generally be performed are deliberately removed from the state-action space because they are not gainful for the adversary. For example, when $a < h$, the adversary can still release a certain number of its blocks. However, since releasing fewer blocks than the number of blocks on the honest chain will not increase its probability of mining the next block compared with mining privately, these actions are excluded from the allowed actions.
We now explain the three values of the entry $fork$ in the three-tuple form of states.
$fork = \text{relevant}$: The value relevant means that the latest block is mined by the honest network. Now, if $fork = \text{relevant}$ and $a \ge h$, the action match is allowed. For example, if the previous state is $(a, h-1, \cdot)$ and now the honest network successfully mines one block, the state then changes to $(a, h, \text{relevant})$. If at this time $a \ge h$, match is allowed. We remark that match here may be gainful for the adversary because a fraction $\gamma$ of the honest network's computing power would be dedicated to mining on the adversary chain, owing to the near-simultaneous releases of the latest block of the adversary chain and the latest block of the honest chain. In this state, as far as the public is concerned, there is no fork yet, since the mined blocks of the adversary are private and hidden from the public. However, if the adversary executes a match from this state, then a fork will be made known to the public and an active competition between the two branches will follow.
$fork = \text{irrelevant}$: The value irrelevant means that the latest block is mined by the adversary and the blocks published by the honest network have already been received by (the majority of) the honest network. Now, even if $a \ge h$, the action match is not allowed. For example, if the previous state is $(a-1, h, \cdot)$ and now the adversary successfully mines a new block, the state changes to $(a, h, \text{irrelevant})$. We emphasize that match is disallowed here even if $a \ge h$, not because it cannot be performed in the blockchain, but because match is not gainful for the adversary here. If match were to be performed here, no computing power of the honest network would shift to mining on the adversary chain, because the miners in the honest network would have received the latest block of the honest chain first (well before the current transition triggered by the adversary mining a new block) and would already be dedicated to mining on the honest chain. Again, in this state, there is no fork as far as the public blockchain is concerned.
$fork = \text{active}$: The value active means that the adversary has executed the action match from the previous state, and the blockchain is now split into two branches. For example, suppose the previous state is $(a, h, \text{relevant})$ with $a \ge h$ and the adversary executed the action match. If the new transition is triggered by the honest network mining a new block, then the state transitions to . In short, active means a fork is made known to the public and that an active competition between the two branches of the fork is ongoing.
We must emphasize that among the blocks of the adversary, some may be private while others are public. Which blocks are private and which are public is implied by the state. For example, suppose that the previous state is $(a, h, \text{relevant})$ with $a \ge h$ (as illustrated in Fig. 2 (a)) and the action match is performed. If the adversary subsequently mines a new block on its own branch, then the state changes to $(a+1, h, \text{active})$, where there are $a+1-h$ private blocks and $h$ public blocks among the $a+1$ blocks owned by the adversary (as illustrated in Fig. 2 (b)). If the honest miners mine a new block on the adversary's branch, the state changes to $(a-h, 1, \text{relevant})$, where there are $a-h$ private blocks left for the adversary (as illustrated in Fig. 2 (c)). If the honest miners mine a new block on the honest network's branch, the state changes to $(a, h+1, \text{relevant})$, where there are $a-h$ private blocks and $h$ public blocks among the $a$ blocks owned by the adversary (as illustrated in Fig. 2 (d)).
Transition and Reward: The occurrence of each state transition is triggered by the creation of a new block (either by the adversary or by the honest network), and the corresponding transition probability is the probability of the block being created by the adversary ($\alpha$) or by the honest network ($1-\alpha$). The initial state is $(1, 0, \text{irrelevant})$ with probability $\alpha$ or $(0, 1, \text{relevant})$ with probability $1-\alpha$. The reward is given as a tuple $(r^{(1)}, r^{(2)})$, where $r^{(1)}$ denotes the number of blocks mined by the adversary and accepted by the whole network, and $r^{(2)}$ denotes the number of blocks mined by the honest network and accepted by the whole network. The state transition and reward matrices are given in TABLE I.
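The state, action, and transition rules described in this section can be collected into a small simulator step. The sketch below is an assumption-laden reading of the prose, not a transcription of TABLE I: states are written as tuples `(a, h, fork)`, the adversary's power and communication capability are named `alpha` and `gamma`, and the reward pairs (`(0, h)` for adopt, `(h + 1, 0)` for override, `(h, 0)` when honest miners extend the adversary's matched branch) follow the formulation commonly used in the selfish-mining MDP literature.

```python
import random

IRRELEVANT, RELEVANT, ACTIVE = "irrelevant", "relevant", "active"


def allowed_actions(state):
    """Actions permitted in `state` under the feasibility rules described in the text."""
    a, h, fork = state
    actions = ["adopt", "wait"]
    if a > h:
        actions.append("override")
    if a >= h and fork == RELEVANT:
        actions.append("match")
    return actions


def step(state, action, alpha, gamma, rng):
    """Sample (next_state, (adversary_reward, honest_reward)) for one mined block."""
    a, h, fork = state
    if action not in allowed_actions(state):
        raise ValueError(f"{action} is not allowed in state {state}")
    adversary_mines = rng.random() < alpha
    if action == "adopt":  # honest chain wins; the race restarts
        nxt = (1, 0, IRRELEVANT) if adversary_mines else (0, 1, RELEVANT)
        return nxt, (0, h)
    if action == "override":  # adversary publishes h + 1 blocks
        nxt = (a - h, 0, IRRELEVANT) if adversary_mines else (a - h - 1, 1, RELEVANT)
        return nxt, (h + 1, 0)
    forked = action == "match" or fork == ACTIVE  # a public fork is in progress
    if adversary_mines:
        return (a + 1, h, ACTIVE if forked else IRRELEVANT), (0, 0)
    if forked and rng.random() < gamma:  # honest block lands on the adversary's branch
        return (a - h, 1, RELEVANT), (h, 0)
    return (a, h + 1, RELEVANT), (0, 0)


rng = random.Random(0)
next_state, reward = step((2, 1, RELEVANT), "override", alpha=0.3, gamma=0.5, rng=rng)
```

A learning agent only needs `allowed_actions` and sampled calls to `step`; it never reads `alpha` or `gamma` directly, which is the sense in which the approach of this paper is model-free.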
Objective Function: The objective of the adversary is to find the optimal mining strategy that earns as much reward as possible. Blockchain keeps adjusting the mining difficulty (i.e., the mining target on the RHS of inequality (1)) so that on average one valid block is introduced to the overall blockchain per valid block interval (e.g., one block per 10 minutes for Bitcoin, and per 10-20 seconds for Ethereum). The mining objective of the adversary is therefore not to maximize its absolute cumulative reward, but to maximize the ratio of its cumulative rewards to the cumulative rewards of the whole network. To see this, note that the cumulative rewards of the whole network advance by one reward per block interval, i.e., (rewards of all miners)/time is fixed at 1 per block interval; maximizing (adversary rewards)/time is then equivalent to maximizing the ratio of (adversary rewards)/time to (rewards of all miners)/time, which equals (adversary rewards)/(rewards of all miners). We emphasize that blocks mined by the adversary and the honest network that are discarded after losing out in the competition are not considered as having been successfully introduced to the blockchain. Thus, the principle behind the strategy of the adversary is to maximize the number of blocks mined by the honest network that are later discarded while reducing its own discarded blocks.
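As in , we define the following relative mining gain (RMG) as the objective function for blockchain mining:

$$\mathrm{RMG} = \frac{\sum_{\tau=1}^{T} r^{(1)}_{\tau}}{\sum_{\tau=1}^{T} \left( r^{(1)}_{\tau} + r^{(2)}_{\tau} \right)}, \tag{2}$$

where $(r^{(1)}_{\tau}, r^{(2)}_{\tau})$ is the tuple of rewards issued to the adversary and the honest network in the block interval $\tau$, and $T$ is the size of the observing window. The objective of the adversary is to maximize this relative mining gain.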
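As a concrete reading of the relative mining gain, the helper below computes the adversary's share of accepted blocks over a window of reward pairs. It is a minimal sketch; the function name and the convention of returning 0.0 for a window with no accepted blocks are assumptions made here.

```python
def relative_mining_gain(rewards):
    """Adversary's share of accepted blocks over a window of (adversary, honest) reward pairs."""
    adversary = sum(r_adv for r_adv, _ in rewards)
    total = sum(r_adv + r_hon for r_adv, r_hon in rewards)
    return adversary / total if total else 0.0


# Four block intervals: the honest network gets 1 block, the adversary 2, nobody, honest 1.
window = [(0, 1), (2, 0), (0, 0), (0, 1)]
rmg = relative_mining_gain(window)  # 2 adversary blocks out of 4 accepted blocks
```

Because the denominator also depends on the adversary's actions (discarded honest blocks shrink it), the gain is a ratio of two sums rather than a sum of per-step rewards, which is the nonlinearity discussed in the text.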
Honest Mining: For honest mining, miners follow the rule of longest chain extension. Thus, they do not maintain a private chain: when they have a new block, they immediately publish it. The honest mining strategy can be written as

$$\text{Honest mining} := \begin{cases} \text{adopt}, & \text{if } h > a, \\ \text{override}, & \text{if } a > h, \end{cases} \tag{3}$$

where we note that $a$ and $h$ can only take a value of 0 or 1.
Selfish Mining: The main idea of selfish mining  is described as follows. If one block is found by the adversary, the adversary does not publish it immediately and keeps mining on its private chain. When the adversary already has one private block and the honest network then publishes one block (immediately after an honest miner mines a new block), the adversary chooses to publish its block to match the honest network. This causes part of the computing power of the honest network to mine on the adversary's chain. When the adversary already has some private blocks and the honest network catches up to only one block fewer than the adversary ($h = a - 1 \ge 1$), the adversary overrides the honest network's blocks by publishing all its blocks. The selfish mining strategy can be written as

$$\text{Selfish mining} := \begin{cases} \text{adopt}, & \text{if } h > a, \\ \text{match}, & \text{if } h = a = 1, \\ \text{override}, & \text{if } h = a - 1 \ge 1, \\ \text{wait}, & \text{otherwise}. \end{cases} \tag{4}$$
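Lead Stubborn Mining: Lead stubborn mining  is different from selfish mining in the following way. A lead stubborn miner always publishes one block from its private chain to match the honest network when the honest network mines a new block, provided that $a \ge h$. The adversary never executes the action override. Lead stubborn mining can be written as

$$\text{Lead stubborn mining} := \begin{cases} \text{adopt}, & \text{if } h > a, \\ \text{match}, & \text{if } a \ge h \text{ and } fork = \text{relevant}, \\ \text{wait}, & \text{otherwise}. \end{cases} \tag{5}$$

It is shown that lead stubborn mining can achieve higher profits than selfish mining .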
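The three strategies described above can be expressed as policy functions that map a state to an action. The encodings below are a sketch of the prose descriptions, assuming states are tuples `(a, h, fork)` with `a` and `h` the adversary and honest chain lengths; the exact case boundaries (for example, the fallback to `wait`) are assumptions made here.

```python
def honest_policy(state):
    """Publish immediately and follow the longest chain."""
    a, h, _ = state
    if h > a:
        return "adopt"
    if a > h:
        return "override"
    return "wait"


def selfish_policy(state):
    """Withhold blocks; match a 1-1 tie, override when leading by exactly one."""
    a, h, fork = state
    if h > a:
        return "adopt"
    if h == a == 1 and fork == "relevant":
        return "match"
    if h == a - 1 and h >= 1:
        return "override"
    return "wait"


def lead_stubborn_policy(state):
    """Match every new honest block while not behind; never override."""
    a, h, fork = state
    if h > a:
        return "adopt"
    if a >= h and h >= 1 and fork == "relevant":
        return "match"
    return "wait"


# With a lead of one after an honest block, the two withholding strategies differ.
state = (2, 1, "relevant")
actions = (selfish_policy(state), lead_stubborn_policy(state))
```

Writing the strategies as plain functions of the state makes it easy to plug any of them into a simulator and compare their relative mining gains.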
Optimal Mining: Although there are many possible mining strategies that can obtain profits higher than honest mining, the optimal mining strategy is not obvious. Since the state-action space of the MDP is huge, it is not straightforward to derive the optimal mining strategy. The relative mining gain objective (2) is a nonlinear function of the rewards, and thus the corresponding MDP cannot be solved using standard MDP solvers to give the optimal mining strategy. To solve this problem,  first transformed the MDP with the nonlinear objective to a family of MDPs with linear objectives, and then employed a standard MDP solver combined with a numerical search over the family of MDPs to find the optimal mining strategy. As shown in , its solution indeed can find the optimal mining strategy. However, the solution of  is a model-based approach: it must know the parameters that characterize the MDP model exactly (i.e., the computing power distribution $\alpha$ and the communication capability $\gamma$). In real blockchain networks, these parameters are not easy to obtain and may change over time, hindering the use of the solution proposed in . We propose a model-free approach that solves the MDP with the nonlinear objective using RL.
IV Mining Through RL
This section first provides preliminaries for RL and then presents a new RL algorithm that can derive the optimal mining strategy without knowing the parameters of the environment. We propose the new RL mining algorithm based on Q-learning, one popular algorithm from the RL family.
IV-A Preliminaries for Original Reinforcement Learning Algorithm
In RL, an agent interacts with an environment in a sequence of discrete time steps, as shown in Fig. 3. At time $t$, the agent observes the state of the environment, $s_t$; it then takes an action, $a_t$. As a result of the state-action pair $(s_t, a_t)$, the agent receives a scalar reward $r_{t+1}$, and the environment moves to a new state $s_{t+1}$ at time $t+1$. Based on $s_{t+1}$, the agent then decides the next action $a_{t+1}$. The goal of the agent is to effect a series of rewards through its actions to maximize some performance criterion. For example, for Q-learning, the performance criterion to be maximized at time $t$ is the discounted accumulated reward going forward, $R_t = \sum_{\tau=t}^{\infty} \lambda^{\tau-t} r_{\tau+1}$, where $\lambda \in (0, 1]$ is a discount factor for weighting future rewards. In general, the agent takes actions according to some decision policy $\pi$. RL methods specify how the agent changes its policy as a result of its experiences. With sufficient experiences, the agent can learn an optimal decision policy $\pi^{*}$ to maximize the long-term accumulated reward.
The desirability of a state-action pair $(s_t, a_t)$ under a decision policy $\pi$ is captured by a Q function, defined as $Q^{\pi}(s, a) = E[R_t \mid s_t = s, a_t = a, \pi]$, i.e., the expected discounted accumulated reward going forward given the current state-action pair $(s, a)$. The optimal decision policy is one that maximizes the Q function. In Q-learning, the goal of the agent is to learn the optimal policy through an online iterative process by observing the rewards while it takes actions in successive time steps. In particular, the agent maintains the Q function, $Q(s, a)$, for all state-action pairs $(s, a)$, in a tabular form.
Let $q(s, a)$ be the estimated action-value function during the iterative process. At time step $t$, given state $s_t$, the agent selects a greedy action $a_t = \arg\max_{a} q(s_t, a)$ based on its current Q function. This will cause the system to return a reward $r_{t+1}$ and move to state $s_{t+1}$. The experience at time step $t$ is captured by the quadruplet $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$. At the end of time step $t$, experience $e_t$ is used to update $q(s, a)$ for entry $(s_t, a_t)$ as follows:

$$q(s_t, a_t) \leftarrow q(s_t, a_t) + \beta \left[ r_{t+1} + \lambda \max_{a'} q(s_{t+1}, a') - q(s_t, a_t) \right], \tag{6}$$

where $\beta \in (0, 1]$ is a parameter that governs the learning rate. Q-learning learns from experiences gathered over time, $\{e_t\}$, through the iterative process in (6). Note that Q-learning is a model-free learning framework in that it tries to learn the optimal policy without having a model that describes the operating behavior of the environment beyond what can be observed through the experiences.
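The tabular update can be written in a few lines. This is a minimal sketch of the standard Q-learning rule, assuming the table is a dictionary keyed by `(state, action)` with unseen entries treated as zero; the parameter names `lr` and `discount` and their default values are illustrative.

```python
from collections import defaultdict


def q_update(q, s, a, r, s_next, actions_next, lr=0.1, discount=0.9):
    """Apply one Q-learning update to entry (s, a) and return the new value."""
    best_next = max(q[(s_next, a2)] for a2 in actions_next)
    q[(s, a)] += lr * (r + discount * best_next - q[(s, a)])
    return q[(s, a)]


q = defaultdict(float)  # unseen state-action pairs start at 0.0
first = q_update(q, "s0", "wait", 1.0, "s1", ["wait", "adopt"])
```

Each update moves the stored value a fraction `lr` of the way toward the bootstrapped target `r + discount * best_next`, using only the observed experience and no model of the environment.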
As a deviation from the above description, a caveat in Q-learning is that the so-called $\epsilon$-greedy algorithm is often adopted in action selection. For the $\epsilon$-greedy algorithm, the greedy action $a_t = \arg\max_{a} q(s_t, a)$ is only chosen with probability $1 - \epsilon$. With probability $\epsilon$, a random action is chosen uniformly from the set of possible actions. This prevents the algorithm from zooming in on a locally optimal policy and allows the agent to explore a wider spectrum of different actions in search of the optimal policy.
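The exploration rule just described can be sketched as follows, assuming the same dictionary-based Q table keyed by `(state, action)`; the function name and the default value of 0.0 for unseen entries are choices made for this illustration.

```python
import random


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))


q_table = {("s", "wait"): 1.0, ("s", "adopt"): 2.0}
rng = random.Random(0)
greedy = epsilon_greedy(q_table, "s", ["wait", "adopt"], epsilon=0.0, rng=rng)
```

Setting `epsilon` to 0 recovers pure greedy selection, while setting it to 1 yields purely random exploration.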
It has been shown that in a stationary environment that can be fully captured by an MDP, the Q-values will converge to optimality if the learning rate decays appropriately and each action in the state-action pair is executed an infinite number of times in the process .
IV-B New Reinforcement Learning Algorithm for Mining
The original RL algorithm as presented in Section IV-A cannot be directly applied to maximize the mining objective function expressed in (2); there is one fundamental obstacle that must be overcome. The obstacle is the nonlinear combination of the rewards in the objective function. The original RL algorithm can only maximize an objective that is a linear function of the scalar rewards, e.g., the weighted sum of scalar rewards. To address this issue, we put forth a new algorithm that aims to optimize the original mining objective: the multi-dimensional RL algorithm.
We formulate the multi-dimensional RL algorithm as follows. At mined block interval $t$ (see Footnote 2), the state $s_t \in S$ takes a value from the state space $S$ as defined in the MDP model of blockchain mining, and the action $a_t \in A$ is chosen from the action space $A$. The state transition occurs according to TABLE I. The reward is the pair $(r^{(1)}_{t+1}, r^{(2)}_{t+1})$ whose value is assigned according to TABLE I. The experience at the end of mined block interval $t$ is given by $e_t = (s_t, a_t, (r^{(1)}_{t+1}, r^{(2)}_{t+1}), s_{t+1})$. The objective of the multi-dimensional RL algorithm is to maximize the relative mining gain as expressed in (2).

Footnote 2: A mined block interval is different from a valid block interval. A valid block interval separates two valid blocks that are ultimately adopted by the blockchain. The average duration of a valid block interval is a constant in many blockchain systems (e.g., 10 min in Bitcoin). The average duration of the valid block interval is defined by the system designer, and its constancy is maintained by adjusting the mining target. A mined block interval separates two mined blocks (mined by either the adversary or the honest network), regardless of whether the blocks become valid later. In the MDP model, each transition is triggered by the mining of a new block. Thus the average duration of a mined block interval is the average time that separates two adjacent transitions. Due to the actions of the adversary, some of the mined blocks (of the adversary or the honest network) may be discarded later.
For a state-action pair $(s, a)$, instead of maintaining an action-value scalar $Q(s, a)$, the multi-dimensional RL algorithm maintains an action-value pair $(Q^{(1)}(s, a), Q^{(2)}(s, a))$ corresponding to the Q function values of the adversary and the honest network, respectively. The Q functions defined by Q-learning are the expected cumulative discounted rewards. Specifically, with discount factor $\lambda$, $Q^{(1)}(s, a)$ and $Q^{(2)}(s, a)$ are defined as

$$Q^{(i)}(s, a) = E\left[ \sum_{\tau=t}^{\infty} \lambda^{\tau-t} r^{(i)}_{\tau+1} \,\middle|\, s_t = s, a_t = a \right], \quad i = 1, 2. \tag{7}$$

Suppose that at mined block interval $t$, the Q functions in (7) are estimated to be $q^{(1)}(s, a)$ and $q^{(2)}(s, a)$. For action selection, we still adopt the $\epsilon$-greedy approach. To select the greedy action, we construct the following objective function:

$$f(s, a) = \frac{q^{(1)}(s, a)}{q^{(1)}(s, a) + q^{(2)}(s, a)}, \tag{8}$$

and the greedy action is $a_t = \arg\max_{a} f(s_t, a)$.
After taking action $a_t$, the state transitions to $s_{t+1}$ and the reward pair $(r^{(1)}_{t+1}, r^{(2)}_{t+1})$ is issued. With the experience $e_t$, the multi-dimensional RL algorithm updates the two Q functions with learning rate $\beta$ as follows:

$$q^{(i)}(s_t, a_t) \leftarrow q^{(i)}(s_t, a_t) + \beta \left[ r^{(i)}_{t+1} + \lambda\, q^{(i)}(s_{t+1}, a') - q^{(i)}(s_t, a_t) \right], \quad i = 1, 2, \tag{9}$$

where $a' = \arg\max_{a} f(s_{t+1}, a)$. Note that the update rule of (9) is very similar to the update rule of Q-learning, except that the greedy action is chosen by maximizing the constructed objective function in (8) rather than maximizing the Q function itself as in Q-learning. From the expressions in (7) and (8), we can verify that the adopted objective function in (8) is consistent with the relative mining gain objective function defined in (2), except for the discount terms used in the computation of the Q functions. The use of discount terms ensures that the Q functions converge to some bounded values; however, adding discount terms to the rewards changes the original mining objective. One simple way to ensure strict objective consistency is to set $\lambda = 1$. Although setting $\lambda = 1$ will result in unbounded values for the Q functions as the RL iteration progresses to infinite time steps, this is not a big problem as long as the Q function values do not overflow during the execution of the algorithm. In practice, we can also set $\lambda$ to be very close to one.
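A compact sketch of the two-table update described above is given below. It assumes two dictionary-based tables, `q1` for the adversary and `q2` for the honest network, keyed by `(state, action)`; the helper names, the default parameter values, and the convention that an unseen pair (both estimates zero) has ratio 0.0 are assumptions of this illustration rather than details taken from the paper.

```python
from collections import defaultdict


def ratio(q1, q2, s, a):
    """Constructed objective: estimated adversary share for the pair (s, a)."""
    total = q1[(s, a)] + q2[(s, a)]
    return q1[(s, a)] / total if total > 0 else 0.0


def greedy_action(q1, q2, s, actions):
    """Action that maximizes the ratio objective instead of a single Q value."""
    return max(actions, key=lambda a: ratio(q1, q2, s, a))


def md_update(q1, q2, s, a, reward_pair, s_next, actions_next, lr=0.1, discount=0.999):
    """Update both tables toward targets bootstrapped from the same greedy next action."""
    r1, r2 = reward_pair
    a_next = greedy_action(q1, q2, s_next, actions_next)
    q1[(s, a)] += lr * (r1 + discount * q1[(s_next, a_next)] - q1[(s, a)])
    q2[(s, a)] += lr * (r2 + discount * q2[(s_next, a_next)] - q2[(s, a)])


q1, q2 = defaultdict(float), defaultdict(float)
q1[("s1", "override")], q2[("s1", "override")] = 3.0, 1.0
q1[("s1", "adopt")], q2[("s1", "adopt")] = 1.0, 3.0
md_update(q1, q2, "s0", "wait", (0, 1), "s1", ["adopt", "override"], lr=0.5, discount=1.0)
```

The key design point is that both tables bootstrap from the same next action, chosen by the ratio, so the pair of estimates stays consistent with a single policy.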
The RL algorithm expressed by the Q function updates in (9) is our multi-dimensional RL algorithm. We introduce one additional technical element to the -greedy action selection, as explained in the next paragraph.
As described above, when we select the action, we adopt the -greedy strategy that allows us to select the current best action () with probability and to randomly select an action with probability . This random action selection is used to explore some unseen states and can avoid trapping at local optimal maximums. However, the tuning of parameter is not straightforward. A large reduces the possibility of trapping at local optimal maximums but it also decrease the average reward, since it wastes a fraction of the time to explore non-optimal states. In our algorithm, we adopt the following strategy for dynamically tuning the parameter . Denote the number of times state was visited by . Then, the parameter used at state for performing -greedy action selection is given by
where is a temperature parameter that governs how quickly the exploration parameter $\epsilon$ is gradually reduced.
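A minimal sketch of visit-count-based tuning follows. It assumes an exponential decay of the exploration rate in the visit count, with the temperature as the decay scale; this functional form and the names `exploration_rate`, `select_action`, `visit_counts`, and `score` are assumptions made for illustration and may differ from the schedule defined above.

```python
import math
import random
from collections import Counter


def exploration_rate(visits, temperature):
    """Assumed schedule: exp(-visits / temperature).

    Starts at 1 for an unvisited state and decays as the state is revisited;
    a larger temperature slows the decay, i.e. gives more exploration.
    """
    return math.exp(-visits / temperature)


def select_action(state, actions, score, visit_counts, temperature, rng=random):
    """Epsilon-greedy choice with a per-state exploration rate.

    `score(state, action)` is whatever quantity the greedy step maximizes.
    """
    epsilon = exploration_rate(visit_counts[state], temperature)
    visit_counts[state] += 1
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    return max(actions, key=lambda a: score(state, a))  # exploit


# Usage: a Counter gives every unseen state a visit count of zero.
counts = Counter()
action = select_action("s0", ["wait", "adopt"], lambda s, a: 0.0, counts, 100.0)
```

Keeping the count per state means that rarely visited states are still explored aggressively even after frequently visited states have become almost purely greedy.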
V Performance Evaluations
We conducted simulations to evaluate our proposed RL mining strategy. Following the simulation approach used in [5], we constructed a Bitcoin-like simulator that captures all the relevant PoW network details described in the previous sections, except that the crypto-puzzle solving process was replaced by a Monte Carlo simulator that simulates the time required for block discovery without actually computing a hash function. We simulated 1000 miners mining at identical rates (i.e., each miner performs one simulated hash test at each time step of the Monte Carlo simulation). A subset of the 1000 miners ( miners) forms an adversary pool running a malicious mining strategy that co-exists with the honest mining adopted by the other miners. The malicious mining strategy is one of the following: i) our RL mining strategy, ii) the optimal mining strategy derived in [17], or iii) the selfish mining strategy derived in [5]. Upon encountering two subchains of the same length, we divide the honest miners such that a fraction of them mine on the attacking pool’s branch while the rest mine on the other branch. The performance metric is the relative mining gain (RMG) computed over a window consisting of time steps: .
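The Monte Carlo replacement for puzzle solving can be sketched as follows for the baseline case in which the pool also mines honestly. The sketch assumes that RMG is the pool's share of the blocks credited within the window; the per-step block probability `p_block`, the function name, and the default sizes are hypothetical choices for illustration, not the settings used in our simulator.

```python
import random


def simulate_honest_rmg(n_miners=1000, n_adversary=300, steps=100_000,
                        p_block=0.05, seed=0):
    """Monte Carlo block discovery when every miner mines honestly.

    Each time step yields a block with probability p_block (a hypothetical
    rate). Because all miners hash at the same rate, the finder is drawn
    uniformly. Returns adversary_blocks / (adversary_blocks + honest_blocks).
    """
    rng = random.Random(seed)
    adv_blocks = hon_blocks = 0
    for _ in range(steps):
        if rng.random() >= p_block:
            continue  # nobody found a block in this step
        finder = rng.randrange(n_miners)
        if finder < n_adversary:  # miners 0..n_adversary-1 form the pool
            adv_blocks += 1
        else:
            hon_blocks += 1
    total = adv_blocks + hon_blocks
    return adv_blocks / total if total else 0.0
```

With honest mining there are no withheld blocks, so the returned value should be close to the pool's fraction of the miners; a strategic pool is of interest only when its measured gain exceeds this baseline.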
We first compare the performance of our RL mining, optimal-policy mining, and selfish mining. Figs. 4-6 plot the mining reward of the adversary versus for different values of . The relative mining gain of is treated as a bound for the mining problem, and it can be achieved only by optimal-policy mining for . To derive the optimal policy, we adopt the search algorithm proposed in [17] and set the search error to a very small value of . As in [17], we truncate the MDP at or for both optimal-policy mining and RL mining. The temperature parameter is set to as and is reset to after time steps, when convergence is attained. All RL mining results are reported after the algorithm has converged. The results show that RL mining converges to the performance of optimal-policy mining without knowing the details of the environment model.
We next consider the impact of the temperature parameter on the convergence of RL mining. Figs. 7-9 present the mining rewards obtained by RL mining with different over time for , respectively ( is fixed to 0.45). The results show that, in general, RL mining with larger explores more and converges more closely to the optimal performance; however, RL mining with larger also has a longer exploration phase, which slows down convergence. Fig. 10 presents the mining rewards of RL mining with different for different ( is fixed to 1). The mining rewards are reported after time steps and without resetting . We see that for larger , a larger is needed to ensure the convergence of RL mining, although it slows down the convergence process. In practice, we can dynamically reduce the value of once we find that the mining gain has converged.
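One way to detect that the mining gain has converged, and to lower the temperature in response, is sketched below. The windowed-mean test, the shrink factor, and the names `has_converged` and `next_temperature` are assumptions chosen for illustration; they are not the procedure used in the experiments reported here.

```python
def has_converged(gains, window=5, tolerance=0.005):
    """True if the means of the last two windows of gains nearly agree."""
    if len(gains) < 2 * window:
        return False  # not enough measurements yet
    recent = sum(gains[-window:]) / window
    previous = sum(gains[-2 * window:-window]) / window
    return abs(recent - previous) < tolerance


def next_temperature(temperature, gains, shrink=0.5, floor=1.0):
    """Shrink the temperature once the measured gain has levelled off."""
    if has_converged(gains):
        return max(floor, temperature * shrink)
    return temperature


# Usage: `gains` holds one relative-mining-gain measurement per window.
temperature = next_temperature(100.0, [0.30] * 10)
```

Comparing two adjacent windows, instead of two consecutive measurements, makes the test less sensitive to the noise of any single window.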
Last, we investigate the performance of the different mining strategies when the blockchain environment changes. The results are given in Fig. 11, where the values of change in the following order: (0.35, 1), (0.35, 0), and (0.15, 0). The temperature parameter of RL mining is fixed to . The optimal-policy mining strategy adopts the optimal policy for the MDP model with . The simulation results show that, once the environment has changed, the optimal-policy mining strategy derived from the previous model is no longer optimal, whereas RL mining adaptively learns the optimal policy for each environment. This demonstrates the advantage of RL mining over model-based mining strategies.
VI Conclusion
We employed RL algorithms to solve the mining MDP problem of Bitcoin-like blockchains. We showed that, without knowing parameters about the blockchain network model, our RL mining can achieve the mining reward of the optimal policy that requires knowledge of the parameters. Therefore, in a dynamic environment in which the parameter values can change over time, RL mining can be more robust.
Going forward, we will investigate two issues that need to be addressed before RL mining can be practical:
1. More complete MDP model, as proposed in [7] for blockchain networks: this model incorporates detailed blockchain features, such as the stale block rate, double-spending attacks, and eclipse attacks, that are excluded from the model in the current paper. The large action space of the complete model will make it more challenging for RL mining to learn an optimal strategy.
2. Cost of lagged time in convergence: since miners need to pay for their hardware and consume electricity to mine blocks, fast convergence of the mining algorithm is important from an economic standpoint. Deep RL [12], which incorporates deep neural networks into RL, can potentially speed up convergence. We will consider exploiting deep RL in our future work.
References
[1] (2014) Mastering bitcoin: unlocking digital cryptocurrencies. O’Reilly Media, Inc.
[2] (2018) Deconstructing the blockchain to approach physical limits. arXiv preprint arXiv:1810.08092.
[3] (2014) Ethereum: a next-generation smart contract and decentralized application platform. [Online]. Available: https://github.com/ethereum/wiki/wiki/White-Paper.
[4] (1992) Pricing via processing or combatting junk mail. In Annual International Cryptology Conference, pp. 139–147.
[5] (2018) Majority is not enough: bitcoin mining is vulnerable. Communications of the ACM 61 (7), pp. 95–102.
[6] (2015) The bitcoin backbone protocol: analysis and applications. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 281–310.
[7] (2016) On the security and performance of proof of work blockchains. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 3–16.
[8] (1996) Reinforcement learning: a survey. Journal of Artificial Intelligence Research 4, pp. 237–285.
[9] (2016) Difficulty control for blockchain-based consensus systems. Peer-to-Peer Networking and Applications 9 (2), pp. 397–413.
[10] Litecoin-open source p2p digital currency. [Online]. Available: https://litecoin.com/en/.
[11] (1987) A digital signature based on a conventional encryption function. In Conference on the Theory and Application of Cryptographic Techniques, pp. 369–378.
[12] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
[13] (2008) Bitcoin: a peer-to-peer electronic cash system. [Online]. Available: http://bitcoin.org.
[14] (2016) Stubborn mining: generalizing selfish mining and combining with an eclipse attack. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 305–320.
[15] (2017) Analysis of the blockchain protocol in asynchronous networks. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 643–673.
[16] (2018) Everything you wanted to know about the blockchain: its promise, components, processes, and problems. IEEE Consumer Electronics Magazine 7 (4), pp. 6–14.
[17] (2016) Optimal selfish mining strategies in bitcoin. In International Conference on Financial Cryptography and Data Security, pp. 515–532.
[18] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
[19] (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354.
[20] (2018) Reinforcement learning: an introduction. MIT Press.
[21] (2016) Bitcoin and beyond: a technical survey on decentralized digital currencies. IEEE Communications Surveys & Tutorials 18 (3), pp. 2084–2123.
[22] (2018) Research on the security criteria of hash functions in the blockchain. In Proceedings of the 2nd ACM Workshop on Blockchains, Cryptocurrencies, and Contracts, pp. 47–55.
[23] (2019) A survey on consensus mechanisms and mining strategy management in blockchain networks. IEEE Access 7, pp. 22328–22370.
[24] (1992) Q-learning. Machine Learning 8 (3-4), pp. 279–292.