Bitcoin has attracted tremendous attention as the first fully decentralized cryptocurrency since its advent in 2008. All historical transactions between Bitcoin clients are recorded in a global, public data structure known as the blockchain. The security of the blockchain rests on a chain of cryptographic Hash puzzles, solved by a large-scale network of pseudonymous participants called miners. Solving a Hash puzzle generates a Proof-of-Work (PoW), which is the means of reaching global consensus. The PoW of Bitcoin demands intensive computation and therefore consumes a large amount of energy. Each miner competes in this “game” and is rewarded with cryptocurrency (i.e. bitcoins) if he is the first acknowledged miner to find a valid block. When the population of miners is large, the aggregate Hash power is high enough that a malicious miner can hardly accumulate enough Hash power to perform Sybil attacks. The PoW consensus of Bitcoin has been employed in almost % of public blockchains, serving as the cornerstone of current cryptocurrencies.
The security of PoW is challenged by the trend toward centralization of Hash power. Mining a Bitcoin block is a random process, and a single latest-generation ASIC chip needs more than 10 years on average to find one. Blockchain miners therefore act strategically and form pools, which have a much larger chance of solving the puzzle in each round. By splitting the mining reward appropriately, pool members obtain a stable income rate. As a side effect, a small number of mining pools hold the vast majority of global Hash power, placing blockchain systems at risk of being overthrown by a gigantic pool or by colluding pools. Conventional wisdom holds that PoW is secure as long as no miner controls 51% of the total Hash power. However, a miner can adopt a selfish mining scheme instead of conforming to the standard Bitcoin protocol.
Eyal and Sirer pointed out that selfish mining is profitable (i.e. it yields more rewards than honest mining) if the Hash power of a miner is larger than 25%. Addressing the shortcomings of selfish mining, Nayak et al. proposed stubborn mining and showed that selfish mining is not always the best strategy across parameter settings. The stubborn attacks yield about 13.94% more revenue than the selfish mining attacks. A more intelligent selfish mining strategy based on the Markov Decision Process has also been proposed, which lowers the profitable threshold to 23.21%. Tao et al. described a semi-selfish mining attack that builds on selfish mining and a hidden Markov decision process; it not only preserves the benefit of the attack but also reduces the forking rate. Note that these studies assume the existence of a single selfish miner, while multiple (colluding) pools might be close to the profitable threshold.
At the same time, several papers on multiple attackers have been proposed. Ruan et al. proposed the publish-n strategy and simulated a system with two selfish mining attackers, finding that the profitable threshold decreases. Hou et al. proposed the SquirRL framework for multiple attackers, which derives optimal mining strategies by reinforcement learning. Zhang et al. simulated a system with multiple selfish mining attackers and obtained the profitable threshold for different numbers of attackers. However, the literature emphasizes either modeling a single selfish mining attacker or simulating multiple selfish mining attackers, while little attention has been paid to analytical modeling of multiple attackers.
In this paper, we study a fundamental question regarding blockchain security, which we break into four subquestions: How does the existence of two selfish mining attackers affect the system security? Does selfish mining become profitable more easily as the number of attackers increases? Is there a more profitable attack decision algorithm in the case of multiple attackers? How long must a selfish mining attacker wait before becoming profitable? The first subquestion aims to unravel whether each selfish miner needs a smaller Hashrate threshold to gain more rewards than by mining honestly. The second is intended to reveal whether the profitable threshold decreases as the number of attackers increases. The third is to design an algorithm that obtains the optimal mining strategy when the attacker cannot obtain complete global information. The last concerns the transient behavior of selfish mining, taking the mining difficulty adjustment into account. The transient analysis is crucial because a selfish miner may have to wait a long period before gaining more rewards, especially when the global Hashrate increases rapidly.
We establish the selfish mining model for an honest pool that represents all honest miners and a varying number of selfish mining pools that are not aware of each other’s misbehaving role. By dissecting all the possible events that trigger a change of the private and public chains, we formulate a set of Markov chains to capture all the state transitions. In contrast to the prior literature, our work presents a mathematical model for each case and yields closed-form expressions of the revenue. Based on the characteristics of selfish mining with multiple attackers, we design a more profitable attack algorithm. In the transient analysis, selfish mining is found to waste computing power and is therefore not worthwhile without the subsequent difficulty adjustment of puzzle solving.
The major contributions and observations are summarized below.
We establish a set of Markov chain models to characterize the state transition of public and private chains in selfish mining and compute the steady state distributions.
When both selfish miners are profitable, the minimum profitable Hashrate threshold is symmetric around 21.48%. Profitable selfish mining becomes harder, however, when one of the selfish miners increases his Hashrate, which intensifies the competition.
We model the case of an arbitrary number of attackers in the system and find that the profitable threshold decreases as the number of attackers increases. Each attacker with 10% Hashrate can obtain extra benefit when there are eight attackers in the system.
Selfish mining becomes profitable after 51 rounds of difficulty adjustment (i.e. 714 days in Bitcoin) if the Hashrates of the selfish miners are both 22% (slightly higher than the profitable threshold). This delay decreases to 5 rounds (i.e. 70 days in Bitcoin) as their Hashrates rise to 33%, which is still very long.
The revenue of a selfish mining attacker increases gradually and tends to converge as the period of the difficulty adjustment algorithm grows.
We design a more profitable attack algorithm based on the Partially Observed Markov Decision Process. Exploiting the particular structure of selfish mining revenue, we also design a more efficient solution algorithm, which improves the solution efficiency by more than 10%.
II System Model
In this section, we present the block-release procedure of blockchain mining in the presence of two adversarial pools. We further introduce the new features on tie-breaking and chain-reaction release.
II-A System Description
Consider a blockchain system with two misbehaving mining pools, Alice and Bob, as well as an honest mining pool, Henry. (Multiple honest miners can be reduced to a single miner because their Hashrates add linearly.) They compete to solve cryptographic puzzles to mine a valid block for the purpose of acquiring bitcoin-like tokens. The proof-of-work (PoW) consensus is adopted and the mining of blocks is stateless: the probability of discovering a block by a miner is proportional to his current Hashrate, but inversely proportional to the current aggregate Hashrate of the entire blockchain network. The blockchain system dynamically adjusts the difficulty of cryptographic puzzles such that new blocks are generated at a fixed average rate (e.g. one block per 10 minutes on average in Bitcoin). The miners maintain a globally-agreed ordered set of transactions by adopting and mining on the longest chain. The revenue of a miner is the fraction of blocks mined by him out of all the blocks in the longest chain.
The total Hashrate of the blockchain system is normalized as a unit. Then, the Hashrate of a mining pool is represented as a fraction of the total.
The block discovery time by a mining pool is exponentially distributed when his Hashrate is large.
The reward of each valid block is normalized as one cryptographic coin.
An honest pool that finds a valid block releases it immediately. Alice (resp. Bob) may release her (resp. his) blocks strategically to force Henry into wasting his computations. When Alice and Bob are both selfish miners, the interaction between the two private chains becomes more complicated because neither of them knows the other’s behaviour. In what follows, we capture all the different states that each miner may encounter.
Denote by , and the Hashrates of Alice, Bob and Henry respectively, i.e. . Denote by (resp. ) the probability that honest miners mine after Alice’s (resp. Bob’s) released chain in the tie-breaking between Alice (resp. Bob) and Henry. Denote by and the probabilities that honest miners choose to mine after Alice’s and Bob’s chains in the three-party tie-breaking, respectively. When the blockchain system creates a new block, it is mined by pool with the probability , , owing to the memorylessness of exponentially distributed mining intervals.
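The block-assignment rule above can be illustrated with a minimal sketch. The pool names follow the text, but the function names and the numerical Hashrates are assumptions chosen only for illustration; they are not values from this paper.

```python
import random

def winner_probabilities(hashrates):
    """Normalize raw Hashrates into per-block winning probabilities."""
    total = sum(hashrates.values())
    if total <= 0:
        raise ValueError("total Hashrate must be positive")
    return {pool: rate / total for pool, rate in hashrates.items()}

def sample_winner(hashrates, rng):
    """Draw the pool that mines the next block.

    Because mining intervals are exponential (memoryless), the winner of
    each block is independent of the past and depends only on Hashrates.
    """
    probs = winner_probabilities(hashrates)
    pools = list(probs)
    return rng.choices(pools, weights=[probs[p] for p in pools], k=1)[0]

# Hypothetical Hashrate split; illustrative values only.
hashrates = {"Alice": 0.25, "Bob": 0.25, "Henry": 0.50}
rng = random.Random(0)
draws = [sample_winner(hashrates, rng) for _ in range(20000)]
alice_share = draws.count("Alice") / len(draws)
```

Over many draws, the empirical share of each pool should approach its normalized Hashrate.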
II-B Selfish Mining Model
Alice maintains a private chain, and so does Bob, while Henry operates on the public chain. Alice and Bob are not aware of each other’s role. We suppose that all the miners work on the same public chain in the beginning, where the starting point is expressed as “0”. The length of each private chain is private information of Alice or Bob, while the length of the public chain is observed by all of them. We consider the selfish mining method proposed by , and our analytical approach can be generalized to a variety of other methods.
The mining procedure consists of two cases as follows.
(Public-chain mining case) Henry always mines after the public chain. Alice or Bob also mines on the public chain if it is longer than her or his private chain.
(Private-chain mining case) Alice (resp. Bob) continues to mine on her (resp. his) private chain if she (resp. he) discovers a new block and the private chain is now longer than the public chain.
The release procedure is more complicated than the mining procedure. Henry broadcasts his mined block as soon as it is discovered, while Alice and Bob will decide whether to release their mined blocks depending on the length of the public chain.
(Forfeit case) Alice (resp. Bob) abandons her (resp. his) private chain and conforms to mining after the public chain if the latter is longer. Henry also abandons his public chain if Alice or Bob publishes a longer chain.
(Risk-avoiding release case) Alice (resp. Bob) releases her (resp. his) privately mined blocks to the public for fear of losing them if a new block is mined by the others and the lead of her (resp. his) private chain is no more than two blocks.
(Chain reaction case) When Alice (resp. Bob) releases her (resp. his) blocks to the public chain and updates its length, the release of Bob’s (resp. Alice’s) private blocks is triggered immediately.
The chain reaction case combines the forfeit and the risk-avoiding cases, and its existence complicates the evolution of the public chain. Suppose that Alice publishes her private blocks to obsolete the current public chain. Once the new public chain is constructed, Bob may immediately release his private chain to obsolete it in turn.
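The release rules above can be sketched as a single decision function for one selfish pool. This is a minimal sketch under stated assumptions: the function name, argument names and action labels are hypothetical, and the last branch (a lead larger than two) follows the rule of the original selfish mining strategy of Eyal and Sirer rather than anything stated in the list above.

```python
def react_to_public_block(private_len, public_len):
    """Decide a selfish pool's move right after the public chain grows.

    private_len: blocks on the pool's private chain since the fork point.
    public_len:  blocks on the public chain since the fork point, counting
                 the block that was just published.
    """
    if private_len < public_len:
        return "forfeit"            # public chain is longer: adopt it
    if private_len == public_len:
        return "release_all_tie"    # lead was one block: publish and race
    if private_len == public_len + 1:
        return "release_all_win"    # lead was two blocks: publish and override
    # Assumption: with a larger lead, reveal one block to match the public
    # chain and keep the rest hidden (Eyal and Sirer's original rule).
    return "release_one"

# The same function models a chain reaction: when Alice's release lengthens
# the public chain, Bob evaluates the rule against the new public length.
actions = [react_to_public_block(p, 1) for p in (0, 1, 2, 4)]
```

Calling the function again for the other selfish pool after every release is what produces the cascading behaviour described in the chain reaction case.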
II-C Release procedure and tie-breaking Logics
The consensus on the public chain requires that it is the longest. A crucial question is how the public chain evolves when it has the same length as the private chain of Alice or Bob. In general, each miner works on his own chain, and the release behavior of Alice and Bob is triggered when Henry mines a new block. We hereby illustrate the evolution of private and public chains where , , and denote that the blocks belong to Alice, Bob and Henry respectively. The blocks of private chains are in grey and those of public chains are in white.
Risk-avoiding release case. We show the risk-avoiding release of Alice’s private chain in Figure 1. Alice is only one block ahead of Henry after the latter mines a new block for the public chain. Because Alice fears losing the competition, she publishes her private blocks, obsoleting Henry’s public chain, so that both Alice and Henry mine on the new longest chain afterwards.
Tie-breaking resolution. If Alice’s private chain is only one block ahead of Henry’s, Henry may catch up with her. When this happens, Alice publishes her private blocks immediately to compete with Henry. Thus, two public chains of the same length exist in Figure 2. Since only one public chain prevails, a tie-breaking rule needs to be taken into account. The first case is that the public chains of Alice and Henry have the same length, and Bob’s private chain is either 0 or very long. Hence, we only need to resolve the tie between Alice and Henry. All the miners may mine after block , while Bob and Henry may mine after . There are five possibilities of extending the longest public chain, and the shorter one will be obsoleted. We omit the tie-breaking between Bob and Henry because it can be analyzed in the same way.
When Alice and Bob each hide one private block, they publish their private chains instantly after Henry finds a new block. As shown in Figure 3, there exist three competing public chains. Alice will mine after and Bob will mine after for sure; Henry does not know which chains are maliciously forked, so he may mine on any of the public chains. There are also five possible situations. The risk-avoiding release, together with the two tie-breaking resolutions, constitutes all the dynamics of private and public chains.
Chain reaction release. We next introduce the chain reaction release, which complicates the evolution of the private and public chains. A chain reaction release consists of a sequence of risk-avoiding releases and tie-breaking resolutions. Figure 4 illustrates an example of how the chain reaction is triggered. At stage 1, Alice’s private chain contains four blocks while the lengths of Bob’s private chain and Henry’s public chain are 0. After a tie-breaking resolution at stage 2, the longer public chain contains two blocks and , and the shorter is orphaned. Bob constructs a new private chain starting from to , while Henry continues to mine one block after at stage 4. From Alice’s perspective, her private chain is merely one block ahead of the public chain. She releases her private blocks to avoid the risk of losing the race with Henry. The new public chain now starts from block . Next, stages 5 and 6 constitute a new round of tie-breaking resolution between Alice and Henry, extending the public chain to block . However, the release of triggers Bob to release all of his private blocks, from to . Looking back over all the mining stages, we observe that the winning branch switches back and forth, which makes the analysis of selfish mining extremely complicated. Note that the chain reaction case occurs only when the length of a private chain is greater than 3.
III Steady-state analysis
In this section, we formulate a Markov chain model of the system with multiple selfish mining attackers. We describe the block-publishing dynamics for different numbers of miners and different maximum lengths of the private chain. This allows us to derive the expected revenues of the selfish miners in steady state in closed form.
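The steady-state computation that underlies this section amounts to finding a distribution that is left unchanged by the transition matrix and sums to one. The sketch below shows this with power iteration on a toy three-state chain; the matrix is hypothetical and is not the state machine analyzed in this paper.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Return pi with pi = pi P for a row-stochastic matrix P.

    Plain power iteration; assumes the chain is irreducible and aperiodic.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")

# Toy chain (hypothetical): state 0 plays the role of the initial state
# that every attack round eventually returns to.
P = [
    [0.5, 0.3, 0.2],
    [0.6, 0.0, 0.4],
    [1.0, 0.0, 0.0],
]
pi = stationary_distribution(P)
```

Expected revenues are then obtained by weighting the reward of each transition with the steady-state probability of its source state and the transition probability.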
III-A Steady-state Analysis for two attackers
We hereby formulate a finite state machine to characterize the evolution of private and public chains. Figure 5 illustrates the state machine when the maximum length of private chain is two (i.e. ). We define the state as a three-tuple consisting of the chain lengths of Alice, Bob and Henry. The arrows indicate the corresponding state transitions and the associated values represent the transition probabilities. For instance, all the transitions to mean that the forked chains collapse into a single agreed public chain and a new round of selfish mining starts. If the length of a private chain is less than 2, its owner can keep hiding it; otherwise, all of its private blocks must be published. Denote by the steady state distribution of . Denote by (resp. , ) the revenue of Alice (resp. Bob, Henry) during the selfish mining attack, computed from the state probabilities and transition frequencies. Using the state transition equations, we obtain as follows.
When all attackers have the same Hashrate and the same ability to broadcast, i.e., , the revenue can be represented as:
III-B Model scaling to attackers
We extend the number of attackers to any value and formulate the corresponding state machine to characterize its properties. Figure 6 shows the state machine if there are attackers when the maximum length of private chain is two. We define the state as an tuple . Each state consists of the lengths of each attacker’s private chain when there are attackers. If the length of a private chain is less than 2, its owner always hides the valid block unless the honest miner mines a valid block. If the length of a private chain is 2, the miner releases all the private blocks.
In Figure 6, the first part indicates that only one attacker owns one private block and the second part indicates that there are two attackers who each hold one private block. The double circle indicates a return to the initial state with a certain probability. The number of states in part one is and that in part two is , i.e., of and of . Therefore, the total number of states in Figure 6 is:
Let be the initial state and be the steady state probability of . We denote as the unit vector with 1 in the coordinate. is the collection of attackers. The Hashrate of attacker is and the Hashrate of the honest miner is . The transition of each state is described as:
where represents the subset of with size . Define as the number of valid blocks that attacker obtains during the selfish mining attack. The cases in which attacker can benefit are shown in Figures 7 to 9, and the case in which the honest miner can benefit is shown in Figure 10. The dashed lines in these figures represent blocks that are not released and the solid lines represent blocks that are public. We detail the revenue for the attacker and the honest miner on each event below.
(a) The previous state had branches of length 1, and pool finds a block. The pool publishes its secret chain of length two, thus obtaining a revenue of two.
(b) The previous state had branches of length 1, and the honest miner finds a block. All pools publish their secret blocks. If attacker then finds a block, he obtains a revenue of two. If one of the others, including the honest miner and the other attackers with no blocks (i.e., in Figure 8), finds a block after attacker ’s head, attacker obtains a revenue of one.
(c) The previous state had branches of length 1 but attacker had none, and the honest miner finds a block. All pools publish their secret blocks. The attacker chooses one branch to mine on and obtains a revenue of one if he finds a block.
(d) The previous state had branches of length 1, and the honest miner finds a block. If the honest miner then finds a block after his own head, he obtains a revenue of two. If the honest miner finds a block after another miner’s head, he obtains a revenue of one. If one of the other miners, including all attackers with no secret blocks, finds a block after the honest miner’s head, the honest miner obtains a revenue of one.
(e) The previous state had no branches, and the honest miner finds a block. The honest miner obtains a revenue of one and all attackers adopt this block.
We assume that, when there are several forks, each fork is selected with the same probability. Then the revenue for each attacker is given by:
III-C Model scaling to
When is large (e.g., three or four), the “chain reaction” occurs, so the finite state machine becomes very complicated. The states should include the length of each private chain and how many blocks of each private chain have been published. At the same time, Bob (resp. Alice) will not mine on the private chain of Alice (resp. Bob) even though part of that private chain has been published. However, Henry will only select the longest public chain to mine on. Therefore, the states should also include the number of blocks mined by Henry in each public fork.
Figure 11 describes the state machine when there are two attackers if . The state includes eight parameters, where (resp. ) represents the number of unreleased blocks held by Alice (resp. Bob); represents the number of blocks mined by Henry that have been adopted neither by Alice nor by Bob; represent the length of the public chain in this attack round in Alice’s, Bob’s and Henry’s view, respectively; (resp. ) represents the number of blocks that Henry has mined in Alice’s (resp. Bob’s) fork. According to the transition probabilities, the revenue for each miner can be represented as:
The cases with can be analyzed in the same way. The experiments below show that the revenue tends to converge when . If , repeated chain-reaction cases occur, which aggravates the instability of the system and increases the probability that the attackers are detected.
IV Transient State Analysis
In this section, we first describe the difficulty adjustment algorithm (DAA) of the Bitcoin system, and then analyze why the difficulty adjustment mechanism is what allows selfish mining to obtain extra revenue. After that, we study how the difficulty adjustment period affects the absolute revenue of selfish mining.
IV-A Bitcoin Difficulty Adjustment
We now describe the Bitcoin difficulty adjustment mechanism in detail. The block structure is shown in Figure 12. The mining difficulty is determined by the difficulty target field in the block header. We denote as the last 24 bits of the difficulty target field and as its first 8 bits. The target hash of a block (as opposed to the compact difficulty target field) can be calculated as
Mining succeeds when the hash value produced with the chosen nonce is less than the of this block. Because the mining process is memoryless, each nonce tried succeeds with probability . Therefore, the expected number of hash calculations needed to mine a valid block is .
The difficulty of any block can be computed by the following formula
where is the hex target of the genesis block.
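As an illustration of the quantities above, the following sketch decodes the compact difficulty target field using the standard Bitcoin convention (the first 8 bits are an exponent, the last 24 bits a coefficient). The function names are hypothetical; the genesis value 0x1D00FFFF is the compact target of the Bitcoin genesis block, and the success probability is the usual approximation of target over 2^256.

```python
GENESIS_BITS = 0x1D00FFFF  # compact target of the Bitcoin genesis block

def bits_to_target(bits):
    """Expand the 32-bit compact field into the full 256-bit target."""
    exponent = bits >> 24            # first 8 bits
    coefficient = bits & 0x00FFFFFF  # last 24 bits
    if exponent >= 3:
        return coefficient << (8 * (exponent - 3))
    return coefficient >> (8 * (3 - exponent))

def success_probability(bits):
    """Approximate chance that a single nonce yields a valid hash."""
    return bits_to_target(bits) / 2**256

def expected_hashes(bits):
    """Expected number of hash evaluations per valid block."""
    return 2**256 / bits_to_target(bits)

def difficulty(bits):
    """Difficulty relative to the genesis block target."""
    return bits_to_target(GENESIS_BITS) / bits_to_target(bits)

genesis_target = bits_to_target(GENESIS_BITS)
genesis_difficulty = difficulty(GENESIS_BITS)
```

Lowering the exponent byte by one shrinks the target by a factor of 256 and raises the difficulty by the same factor.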
In the current Bitcoin system, the difficulty is adjusted every 2016 blocks according to the ratio between the actual generation time and the expected generation time (10 minutes per block). If the actual total time for finding 2016 blocks is times 20160 minutes, the difficulty in the next period is adjusted to times the current difficulty. To prevent violent fluctuation of the mining difficulty, the Bitcoin protocol does not allow the difficulty of the next period to exceed 4 times, or to fall below one quarter of, the current difficulty in a single adjustment.
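The adjustment rule can be sketched as follows. This is a simplified illustration: the function and constant names are assumptions, and the sketch works on floating-point difficulty values, whereas the real protocol operates on integer targets.

```python
BLOCKS_PER_PERIOD = 2016
TARGET_BLOCK_MINUTES = 10
EXPECTED_PERIOD_MINUTES = BLOCKS_PER_PERIOD * TARGET_BLOCK_MINUTES  # 20160

def next_difficulty(current_difficulty, actual_period_minutes):
    """Difficulty for the next period, given the time the last one took."""
    # Clamp the measured timespan so that a single adjustment changes the
    # difficulty by at most a factor of 4 in either direction.
    clamped = min(max(actual_period_minutes, EXPECTED_PERIOD_MINUTES / 4),
                  EXPECTED_PERIOD_MINUTES * 4)
    return current_difficulty * EXPECTED_PERIOD_MINUTES / clamped

# A period that took twice as long as expected halves the difficulty.
halved = next_difficulty(1.0, 2 * EXPECTED_PERIOD_MINUTES)
```

A period that runs slow, for example because selfish mining orphans part of the mined blocks, therefore lowers the difficulty of the following period.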
IV-B Transient absolute revenue
We begin with two definitions associated with the transient analysis.
(Relative Revenue) Define Alice’s, Bob’s and Henry’s relative revenues as the proportions of valid blocks each obtains to the total number of valid blocks during the attack, which are
where , and are the rewards obtained by Alice, Bob and Henry in attack round (from the initial state back to the initial state).
According to the steady state analysis, the relative revenue can also be represented as follows:
If there are attackers, the relative revenue for each attacker can be written as:
(Absolute Revenue) The absolute revenue is the revenue obtained per unit of time, that is:
where represents the time spent during the selfish mining attack.
According to the data from , the Hashrate of the Bitcoin system grows exponentially. At the same time, under a selfish mining attack the time it takes for a valid block to appear on the public chain deviates from 10 minutes. Although previous papers have considered the relative revenue, we can prove that the relative revenue is arbitrarily close to the absolute revenue but always greater than it. For these reasons, we need to model the absolute revenue. In the Bitcoin system, we take 10 minutes as the unit of time and take as the difficulty adjustment period.
Through the state machine, in the adjustment interval (i.e., difficulty adjustment period), blocks appear on the longest public chain and blocks are mined in total during one attack round (i.e., from state (0,0,0) back to state (0,0,0)). In addition, we let denote the total actual time spent in the adjustment interval and be the DAA period. To account for the change of Hash power, we let denote the total Hashrate in the system. Define and as the theoretical time and the actual time spent mining one block during the adjustment interval, respectively. Here, is given by:
The actual time spent during the first difficulty adjustment period can be written as follows:
After the first period, the system adjusts the difficulty so that one block is mined every ten minutes on average. We can obtain the new actual average generation time of blocks during the period.
We assume the difficulty adjustment period is large enough. That means can be expressed as:
Alice’s absolute revenue can be expressed as Eq. (37):
The average time for mining a block during the first difficulty adjustment period is not affected by the selfish mining attack. However, the first adjustment interval lasts longer than the expected adjustment period because some blocks are orphaned due to the attack.
In the second difficulty adjustment period, the difficulty decreases and the average time for mining a valid block becomes less than 10 minutes. The difficulty adjustment period, however, is measured by the number of blocks published on the longest chain, and the actual number of valid blocks mined is larger than the number of blocks appearing on the longest chain. The total time of the second difficulty adjustment period is therefore still 20160 minutes. Consequently, in the first difficulty adjustment period, the absolute revenue of selfish mining is less than that of honest mining, whereas in the subsequent difficulty adjustment periods it is greater than that of honest mining.
Therefore, the difficulty adjustment mechanism is the fundamental reason that selfish mining can obtain extra revenue. Also, the relative revenue is arbitrarily close to the absolute revenue, but it is always greater than the absolute revenue.
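The argument above can be made concrete with a small numerical sketch. It rests on simplifying assumptions that are not part of the model in this paper: the total Hashrate is constant, the difficulty adjusts fully after the first period (no clamping), honest mining earns the miner's Hashrate per unit of time, and the attacker's relative revenue and the fraction of mined blocks that reach the longest chain are given as inputs. All names and numbers are hypothetical.

```python
import math

def absolute_revenue_rates(relative_revenue, main_chain_fraction):
    """Attacker revenue per unit time (10 minutes) before and after the
    first difficulty adjustment."""
    # First period: one block is mined per unit time, but only a fraction
    # of the mined blocks ends up on the longest chain.
    first = relative_revenue * main_chain_fraction
    # Later periods: difficulty has dropped, so the longest chain again
    # grows by one block per unit time.
    later = relative_revenue
    return first, later

def periods_to_break_even(hashrate, relative_revenue, main_chain_fraction,
                          blocks_per_period=2016):
    """Number of adjusted periods, after the first one, needed before
    selfish mining catches up with honest mining. None if it never does."""
    first, later = absolute_revenue_rates(relative_revenue, main_chain_fraction)
    if later <= hashrate:
        return None
    first_period_time = blocks_per_period / main_chain_fraction
    deficit = (hashrate - first) * first_period_time
    gain_per_period = (later - hashrate) * blocks_per_period
    return max(0, math.ceil(deficit / gain_per_period))

# Hypothetical inputs: 30% Hashrate, 35% relative revenue, 80% of mined
# blocks reaching the longest chain.
first_rate, later_rate = absolute_revenue_rates(0.35, 0.8)
wait = periods_to_break_even(0.30, 0.35, 0.8)
```

With these inputs the attacker earns less than an honest miner in the first period and more in every later period, which is the qualitative behaviour described above; the closer the relative revenue is to the Hashrate, the longer the wait.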
IV-C The influence of difficulty adjustment period
The occurrence of valid blocks can be well approximated by a Poisson process. That is, the interval between two valid blocks is exponentially distributed with mean in the DAA period. If the difficulty adjustment period is too short, the actual time spent in one DAA period can be much longer or shorter than the expected time. How the difficulty adjustment period influences the revenue of selfish mining is therefore an interesting question.
Taking a difficulty adjustment period of block as an example, the time of mining a valid block can be obtained from Eq. (36). The inter-arrival times of blocks are exponentially distributed and memoryless, so the sum of the random variables during the adjustment interval follows an Erlang distribution.
where the expectation of is . We use simulation to explore the impact of the DAA period on the revenue of selfish mining in Section VI.
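To see why a short adjustment period makes the measured duration unreliable, the sketch below uses the standard mean and standard deviation of an Erlang distribution (a sum of independent exponential intervals) and a small simulation. The function names and the period lengths compared are assumptions chosen for illustration.

```python
import math
import random

def erlang_mean_std(num_blocks, mean_interval):
    """Mean and standard deviation of the sum of num_blocks i.i.d.
    exponential intervals with the given mean."""
    return num_blocks * mean_interval, math.sqrt(num_blocks) * mean_interval

def simulate_period(num_blocks, mean_interval, rng):
    """Sample the total duration of one adjustment period."""
    return sum(rng.expovariate(1.0 / mean_interval) for _ in range(num_blocks))

# Relative fluctuation shrinks as 1 / sqrt(num_blocks).
for n in (16, 2016):
    mean, std = erlang_mean_std(n, 10.0)
    print(n, mean, round(std / mean, 4))

rng = random.Random(1)
sampled_minutes = simulate_period(2016, 10.0, rng)
```

The relative spread of the period duration decays with the square root of the period length, so a short period produces noisier difficulty updates.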
V Optimal strategy under multiple attackers
In the previous sections, all the attackers follow the basic selfish mining strategy. Some mining strategies obtain more extra revenue than the basic selfish mining strategy. In this section, we explore how to find the optimal mining strategy when there are other basic selfish mining attackers in the Bitcoin system.
V-A The revenue upper bound
If a valid block is mined every unit of time, all miners can obtain full information about the system. In the real system, the block arrival process is a Poisson process. Bob cannot infer the length of Alice’s private chain from the blocks published so far, and vice versa. The revenue under complete information is an upper bound for that in the actual Bitcoin system. In this part, we propose the optimal mining strategy based on a Markov Decision Process (MDP) under steady block arrivals to compute the upper bound of the actual revenue.
V-A1 Main components
Our model captures selfish mining based on where corresponds to the state space, corresponds to the action space, corresponds to the transition matrix, and corresponds to the reward matrix.
State: The state space is defined by 10-tuples of the form , where is the location that Henry is mining on. If (resp. 2), Henry and Alice (resp. Bob) have the same understanding of the location where the current forks start. If , Henry’s understanding of that location differs from both Alice’s and Bob’s.
The field takes six values, dubbed , where represents that the attacker can release blocks to compete with the current public chain while represents that the attacker cannot. If multiple chains are competing, the state falls into one of four categories according to the ownership of the chains: Alice and Bob in competition, Bob and Henry in competition, Alice and Henry in competition, and all three miners in competition.
Define and as the numbers of blocks that Alice and Bob have not published. To maintain system stability and prevent the selfish mining attack from being detected, an attacker can hide a maximum of blocks. The lengths of the public chains that Alice, Bob and Henry are mining on are defined as , and . In the current Bitcoin system, a block is recognized as valid once six blocks have been mined after it. Therefore, we assume that an attacker must give up his private chain and adopt the current longest public chain when the maximum length of the public chain is .
Henry always chooses the current longest public chain as the valid chain and changes his mining location accordingly. Hence, there may be blocks belonging to Henry on the chains where Alice and Bob are located. Let , and denote the numbers of blocks mined by Henry on the chains where Alice, Bob and Henry are currently located.
Action: The action indicates the number of blocks that the attacker publishes. The size of the action space depends on the current number of blocks hidden by the attacker. In addition, attackers can also choose to give up their private chains and adopt the longest public chain, which is denoted as adopt. Taking Bob as an example, let denote his action space.
State Transition Function: The general expression of the state transition function is described as . The transition probability depends on who mined the next valid block.
Reward function: Our goal is to find the optimal mining strategy. The comparison of relative revenue and absolute revenue above gives sufficient reason to treat the two types of revenue as approximately equivalent. Assuming Bob takes the optimal mining strategy, we adopt the technique of Sapirshtein et al. to transform the problem into a family of MDPs, which we describe below.
Assume the value of the objective function is and define the transformation function of Alice’s reward and the others’ reward as follows:
where , and represent the rewards in step (from the last state to the current state) for Alice, Bob and Henry. This results in a finite state MDP model for each that has the same parameters except that the reward matrix is transformed using . The mean revenue of such an MDP under policy is then defined by
The mean revenue under the optimal policy is:
The solution method is based on the following proposition:
If for some , , then any policy that can obtain this value can also maximize the relative revenue, and the relative revenue is equal to .
is monotonically decreasing in .
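The two properties above suggest a bisection search over the objective value. The sketch below illustrates the idea on a toy problem in which the MDP solver is replaced by a maximum over a handful of fixed policies, each summarized by the attacker's and the others' long-run reward rates. The transformed reward (1 - rho) times own reward minus rho times the others' reward follows the technique of Sapirshtein et al.; the name rho, the policy list and its numbers are hypothetical.

```python
def best_value(rho, policies):
    """Optimal mean transformed revenue for a given rho.

    Stand-in for solving the MDP: each policy is a pair
    (own_reward_rate, others_reward_rate).
    """
    return max((1.0 - rho) * own - rho * others for own, others in policies)

def optimal_relative_revenue(policies, eps=1e-9):
    """Bisection on rho in [0, 1] for the root of best_value.

    Relies on best_value being monotonically decreasing in rho.
    """
    low, high = 0.0, 1.0
    while high - low > eps:
        rho = (low + high) / 2.0
        if best_value(rho, policies) > 0.0:
            low = rho    # some policy still beats rho: search higher
        else:
            high = rho   # no policy reaches rho: search lower
    return (low + high) / 2.0

# Hypothetical policies; illustrative numbers only.
policies = [(0.30, 0.70), (0.33, 0.62), (0.20, 0.30)]
rho_star = optimal_relative_revenue(policies)
```

In this toy setting the root coincides with the largest ratio of own reward to total reward among the policies, which mirrors the first statement of the proposition.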
It is necessary to set to a sufficiently large number . The truncation parameter can be defined as:
where is a precision value. Therefore, it can be approximated by the average revenue over steps under policy as . The transition matrix and reward matrix are succinctly described in Table I.