Recently, there has been a surge in the popularity of cryptocurrencies, which are digital currencies that enable transactions through a decentralized consensus mechanism. Most cryptocurrencies are based on a blockchain, which is an ever-growing list of transactions that are grouped in blocks. Individual blocks in the chain are linked together using a cryptographic hash of the previous block, which ensures resistance against modifications, and every transaction is digitally signed. A blockchain needs to be protected from the double-spending problem (i.e., an attacker spending the same digital money twice), and this is generally achieved by using a proof-of-work (PoW) system. This system requires that new blocks provide proof that a certain amount of processing power went into constructing them before they get accepted in the chain. For cryptocurrencies, this is typically achieved by appending random numbers to a block until its cryptographic hash meets a certain condition. The chain with the most cumulative PoW is accepted as the correct one, so that an attacker must control more than half of the active processing power to perform a double-spend attack. Processing nodes that help to compute the hashes of new blocks (called miners) are rewarded with a fraction of the cryptocurrency.
The first cryptocurrency, i.e., Bitcoin [1], was initially mined using desktop CPUs. Then, GPUs were used to significantly increase the hashing speed. Eventually, GPU mining was outpaced by FPGA miners, which were in turn surpassed by ASIC miners. Nowadays, the majority of the computing power on the Bitcoin network is found in large ASIC farms, each operated by a single entity, which makes the decentralized nature of Bitcoin debatable. To solve this issue, new PoW algorithms have been proposed that aim to be ASIC-resistant. ASIC resistance is achieved by using hashing algorithms that are highly serial, memory intensive, and parameterizable, so that a manufactured ASIC can easily be made obsolete by changing some of the parameters, which makes GPU mining much more cost-effective. However, since GPUs are generally much less energy-efficient than ASICs, massive adoption of ASIC-resistant cryptocurrencies would significantly increase the (already very high) energy consumption of cryptocurrency mining. FPGA-based miners, on the other hand, are flexible, energy efficient, and readily available to the general public at reasonable prices. Thus, they are an attractive platform for ASIC-resistant cryptocurrencies.
A prime example of an ASIC-resistant hashing algorithm is Lyra2REv2 (used by Vertcoin [2], MonaCoin [3], and other cryptocurrencies), whose chained structure is shown in Fig. 1. The BLAKE, Keccak, Skein, BMW, and CubeHash hashing algorithms are well-known and have been studied heavily, both from a theoretical and from a hardware implementation perspective, as they were all candidates in the SHA-3 competition. On the other hand, to the best of our knowledge, no hardware implementation of the version of Lyra2 [4, 5] that is used in the Lyra2REv2 algorithm has been reported in the literature.
In this paper, we present the first hardware implementation of the version of Lyra2 used in Lyra2REv2 as a stepping stone towards the implementation of an energy-efficient FPGA miner for Lyra2REv2 cryptocurrencies. Post-layout results for two Xilinx FPGAs show that our proposed Lyra2 hardware architecture consumes very few FPGA resources to achieve a hashing throughput between 2.6 MHash/s and 3.7 MHash/s with an energy efficiency between 432 nJ/Hash and 323 nJ/Hash.
II Background
In this section, we provide the necessary background on some components of the Keccak and BLAKE2 hashing algorithms, since they are also used in Lyra2.
II-A The Keccak Duplex
Keccak is a family of hashing algorithms based on a cryptographic sponge [6, 7]. A cryptographic sponge is a function that takes an arbitrary-length input to produce an arbitrary-length hashed output. Lyra2 uses a specific implementation of the sponge, called the duplex construction, which has a state that is preserved across different inputs. The duplex construction with naming conventions as adopted in Lyra2 can be found in [8, Fig. 2]. It consists of a permutation function $f$ that operates on a $w$-bit state vector, where $w = b + c$ and the parameters $b$ and $c$ are called the bitrate and the capacity of the sponge, respectively, as well as a padding rule pad. We note that the permutation $f$ is iterative and performs a pre-defined number of iterations, also called rounds.
A call to the duplex construction proceeds as follows. An input string $M$ is first fed into the duplex. Then, it is padded to length $b$ (the bitrate) and XORed into the lower $b$ bits of the state. The state is then fed through the permutation $f$. The output of $f$ is the new state of the duplex, while its lower $\ell$ bits are the output hash, where $\ell \le b$. If we consider the duplex construction as an object $H$, then the aforementioned procedure is referred to as a method $H.\mathrm{duplex}(M, \ell)$. The following two auxiliary methods are useful to simplify the notation: $H.\mathrm{absorb}(M)$ updates the state using the input $M$ but discards the output (equivalent to $H.\mathrm{duplex}(M, 0)$), while $H.\mathrm{squeeze}(\ell)$ reads $\ell$ output bits and then duplexes $\varnothing$, where $\varnothing$ denotes an empty input string.
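The duplex calls described above can be sketched in a few lines of Python. This is a toy model under stated assumptions rather than the Lyra2 reference code: the state is a list of 64-bit words instead of a bit string, the names `Duplex`, `rate_words`, `capacity_words`, and `rotate_words` are hypothetical, zero-filling stands in for the actual padding rule, and a trivial word rotation stands in for the permutation.

```python
from typing import Callable, List, Sequence

Permutation = Callable[[List[int]], List[int]]


class Duplex:
    """Toy duplex construction over 64-bit words (illustrative names only)."""

    def __init__(self, f: Permutation, rate_words: int, capacity_words: int):
        self.f = f                  # permutation applied to the whole state
        self.rate = rate_words      # bitrate, counted in words in this sketch
        self.state = [0] * (rate_words + capacity_words)  # width = rate + capacity

    def duplex(self, block: Sequence[int], out_words: int) -> List[int]:
        if len(block) > self.rate or out_words > self.rate:
            raise ValueError("input and output are limited to the bitrate")
        # Assumption: zero-fill replaces the real padding rule.
        padded = list(block) + [0] * (self.rate - len(block))
        for i, word in enumerate(padded):
            self.state[i] ^= word   # XOR into the lower part of the state
        self.state = self.f(self.state)
        return self.state[:out_words]

    def absorb(self, block: Sequence[int]) -> None:
        self.duplex(block, 0)       # update the state, discard the output

    def squeeze(self, out_words: int) -> List[int]:
        out = self.state[:out_words]  # read first ...
        self.duplex([], 0)            # ... then permute with an empty input
        return out


def rotate_words(state: List[int]) -> List[int]:
    """Stand-in permutation: rotate the state by one word."""
    return state[1:] + state[:1]


sponge = Duplex(rotate_words, rate_words=3, capacity_words=1)
sponge.absorb([1, 2, 3])
first = sponge.squeeze(2)
```

Because `squeeze` reads before it permutes, an output no longer than the bitrate is available directly from the current state.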
II-B The BLAKE2b Round Function
BLAKE2 [9] is a family of hash functions designed for fast software implementations. It is the successor of BLAKE as submitted to the SHA-3 competition [10]. The Lyra2 algorithm heavily draws from the round function of BLAKE2b, the 64-bit variant of BLAKE2. The round function consists of an arrangement of blocks that apply a so-called G-function to a 16-word state, where one G-function operates on 4 different state words. For BLAKE2b, a word has 64 bits, meaning that 16 state words amount to 1024 bits. A complete round transforms these 1024 bits using four G-blocks, rearranges the output, and then applies a second four G-block transformation. Algorithm 1 describes the modified BLAKE2b G-function as used in Lyra2.
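As a concrete illustration of the round structure, the following Python sketch assumes that the modified G-function is the BLAKE2b G-function with the message-word and constant additions removed, keeping the BLAKE2b rotation amounts 32, 24, 16, and 63. The function names are illustrative, and the rearrangement between the two groups of four G-blocks is expressed here through diagonal index tuples.

```python
MASK64 = (1 << 64) - 1


def rotr64(x: int, n: int) -> int:
    """Rotate a 64-bit word right by n bits."""
    return ((x >> n) | (x << (64 - n))) & MASK64


def g(a: int, b: int, c: int, d: int):
    """Assumed Lyra2-style G-function: BLAKE2b G without message words."""
    a = (a + b) & MASK64
    d = rotr64(d ^ a, 32)
    c = (c + d) & MASK64
    b = rotr64(b ^ c, 24)
    a = (a + b) & MASK64
    d = rotr64(d ^ a, 16)
    c = (c + d) & MASK64
    b = rotr64(b ^ c, 63)
    return a, b, c, d


# Four G-blocks on the columns, then four on the diagonals of the 4x4 word array.
COLUMNS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONALS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]


def blake2b_round(state):
    """Apply one round to a 16-word (1024-bit) state and return the new state."""
    v = list(state)
    for i, j, k, l in COLUMNS + DIAGONALS:
        v[i], v[j], v[k], v[l] = g(v[i], v[j], v[k], v[l])
    return v
```

The four G-blocks within each group touch disjoint words, so a hardware implementation can evaluate them in parallel even though the sketch applies them one after the other.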
III The Simplified Lyra2 Algorithm of Lyra2REv2
Lyra2 uses the duplex construction from Keccak, where the permutation function $f$ is the round function from BLAKE2b. In the remainder of the text, calls to a full-round (i.e., 12 iterations) duplex will be denoted as calls to $H$, while reduced-round duplexing will be denoted as calls to $H_\rho$, where $\rho$ denotes the reduced number of rounds. Because the G-functions are specified to operate on an array of 16 64-bit words, Lyra2 uses a duplex with a width of $w = 1024$ bits. Pseudocode for the simplified version of Lyra2 that is used specifically in Lyra2REv2 is given in Algorithm 2 and can be compared to the original Lyra2 pseudocode [4, 5]. In the following sections, we explain each phase of the simplified Lyra2 algorithm in more detail.
III-A Bootstrapping Phase
In the bootstrapping phase, the duplex is initialized with a state that depends on the password input $pwd$, a salt (which in Lyra2REv2 is set to be equal to $pwd$), and the parameters $T$ (time cost), $R$ (number of rows), and $C$ (number of columns) by using a full-round absorb. The duplex in Algorithm 2 internally uses a bitrate of $b = 768$ bits and a capacity of $c = 256$ bits. The call on line 5, however, considers only inputs of 512 bits instead of $b$ bits, so as to not overwrite the upper part of the initialization state, i.e., the 512-bit initialization value IV specified by BLAKE2b. This results in two full-round absorbs, where the first absorb processes the concatenated password and salt, and the second absorb processes the parameters followed by the padding.
III-B Setup Phase
During the setup phase, an $R \times C$ memory matrix $M$ is initialized using the single-round duplex $H_1$. During setup, rows are initialized from first to last, while columns within a row are initialized from last to first. From the second row onward, the previous row is re-read, making it impractical to store only parts of the memory matrix. Also, from the third row onward, in addition to the previous row, a specific pre-initialized row is revisited (i.e., read and updated) in a deterministic manner. Rows are re-read or revisited from the first to the last column. Revisited rows use a rotated version of the duplex output, where the rotation amount is one 64-bit word in Lyra2REv2.
III-C Wandering Phase
The wandering phase is generally the most time-consuming phase and it proceeds similarly to the setup phase. Specifically, it revisits two rows, where one row is chosen deterministically but the other is chosen in a pseudorandom fashion by using the least significant part of the duplex output. We note that the pseudorandom and deterministic rows can collide, causing the operations on lines 26 and 27 of Algorithm 2 to sequentially read from and then write to the same matrix cell.
III-D Wrap-up Phase
The wrap-up phase consists of a full-round absorb of a specific cell of $M$, followed by a squeeze of the hashed output. This specific cell is likewise pseudorandom, as it is selected as the first cell of the most recently revisited pseudorandom row. The requested squeeze length $k$ is lower than the bitrate $b$, which means that the final output is provided directly from the duplex state without a permutation $f$.
IV FPGA Implementation of Simplified Lyra2
In the current instance of Lyra2 as used in Lyra2REv2, the timecost parameter is $T = 1$, the number of rows in the memory matrix is $R = 4$, the number of columns in the memory matrix is $C = 4$, and the desired hashing output length is $k = 256$ bits. We note that our architecture is optimized for these parameter values, but it can be modified relatively easily to accommodate potential parameter changes if a hard fork is decided. Moreover, for the aforementioned parameters, the memory matrix is 1.5 kB in size, which is clearly not prohibitively large to be implemented either on an FPGA or on an ASIC. The claimed ASIC-resistance of the Lyra2REv2 algorithm comes from the fact that $T$, $R$, and $C$ can be increased easily if necessary.
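The 1.5 kB figure can be checked with a short calculation. The sketch below assumes a 4 x 4 matrix whose cells are each one bitrate wide (768 bits, matching the 768-bit registers mentioned for the pipelined architecture); the function name is illustrative.

```python
def matrix_size_bytes(rows: int, cols: int, cell_bits: int) -> int:
    """Size of a rows x cols memory matrix with cell_bits-wide cells, in bytes."""
    return rows * cols * cell_bits // 8


size = matrix_size_bytes(rows=4, cols=4, cell_bits=768)
print(size, size / 1024)  # 1536 1.5
```

Doubling either the row or the column count doubles this footprint, which is the lever the parameter-based ASIC-resistance argument relies on.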
The datapath of our proposed FPGA implementation of the simplified Lyra2 algorithm used in Lyra2REv2 is shown in Fig. 2, where the duplex construction with its state, round, and XOR input block can be clearly distinguished. The memory matrix is mapped to a block RAM (BRAM). To reduce the complexity of the multiplexer (MUX) at the input of the duplex, the BRAM also contains two constant vectors of $b$ bits used during the bootstrapping and setup phases: an all-zero vector and the vector holding the parameters and padding. We first describe a version of the hardware architecture where each round of the permutation $f$ is executed in a single clock cycle. We then describe how this basic architecture can be improved through pipelining.
IV-A Basic Iterative Architecture
Our basic iterative Lyra2 architecture requires 68 clock cycles per hash: 24 for the bootstrapping phase, 16 each for the setup and wandering phases, and 12 for the wrap-up phase.
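The cycle budget follows from one clock cycle per round. The sketch below is a back-of-the-envelope model with assumptions that are not spelled out in this paragraph: two full-round (12-round) absorbs in bootstrapping, one single-round duplex call per matrix cell in setup, one single-round call per cell and per time-cost iteration in wandering, and one full-round absorb in wrap-up. All names are illustrative.

```python
FULL_ROUNDS = 12     # rounds in a full-round duplex call
REDUCED_ROUNDS = 1   # assumed single-round duplex on the memory matrix


def cycles_per_hash(t: int = 1, rows: int = 4, cols: int = 4,
                    bootstrap_blocks: int = 2):
    """Return per-phase and total cycle counts at one round per clock cycle."""
    phases = {
        "bootstrapping": bootstrap_blocks * FULL_ROUNDS,
        "setup": rows * cols * REDUCED_ROUNDS,
        "wandering": t * rows * cols * REDUCED_ROUNDS,
        "wrap-up": FULL_ROUNDS,
    }
    return phases, sum(phases.values())


phases, total = cycles_per_hash()
print(phases, total)
```

Under this model, only the setup and wandering terms grow when the matrix dimensions or the time cost are increased.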
IV-A1 Bootstrapping Phase
During the bootstrapping phase, the duplex processes two 512-bit blocks of the padded input, each using a full-round absorb. In Lyra2REv2, both the password and the salt are set to the output of CubeHash, the previous algorithm in the chain. Thus, as shown in Fig. 2, this input vector is one of the inputs to the duplex's MUX. On the other hand, the constant vector holding the parameters and padding is fed into the sponge by loading it on one BRAM read port while simultaneously loading the all-zero vector on a second read port. Both constants are stored at known addresses in the BRAM, and are absorbed in a separate 12-round Bootstrap state. During bootstrapping, the duplex only receives an input vector in the first round. Hence, for subsequent rounds, both read ports output the all-zero vector, and their sum is passed to the duplex via its input MUX.
IV-A2 Setup Phase
We split the setup phase into three distinct phases for convenience, namely Setup0, Setup1, and Setup2, which correspond to Lines 6–8, Lines 9–11, and Lines 12–20 of Algorithm 2, respectively. Similarly to the bootstrapping phase, the setup phase uses the all-zero vector stored in the BRAM. In the Setup0 state, the squeezes input an empty message into the duplex and directly write the duplex output to the BRAM. To achieve that, the all-zero vector is output on all three read ports. Setup1 reads the all-zero vector on one read port, but a specific vector from the BRAM on another. Setup2 reads two vectors from the BRAM. Both the duplex output and the rotated duplex output are XORed with two other vectors from the BRAM, requiring the two parallel XOR blocks illustrated in Fig. 2. On the control path, counters keep track of the various rows and their corresponding columns to generate read and write addresses for the RAM.
IV-A3 Wandering Phase
The input to the duplex in the wandering phase is always the word-wise addition of two RAM cells. Both XOR blocks connected to the duplex output are used. As mentioned before, the pseudorandom and deterministic rows used in the wandering phase can collide. In hardware, this special case requires the output of one XOR block to be fed into the other, while the write port of the first XOR block needs to be disabled to prevent write collisions on the RAM.
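The collision case can be illustrated by comparing sequential read-modify-write semantics with a hardware-style update that reads both cells up front. This is a behavioural sketch under assumptions: cells are lists of twelve 64-bit words, the rotation is by one word in an arbitrarily chosen direction, the rotated output is applied to the pseudorandom row, and all function names are hypothetical.

```python
def rot_words(cell, n=1):
    """Rotate a cell by n words (the direction is an assumption of this sketch)."""
    return cell[-n:] + cell[:-n]


def xor_cells(x, y):
    return [a ^ b for a, b in zip(x, y)]


def update_sequential(matrix, row_det, row_rand, col, rand):
    """Reference semantics: two read-modify-write operations, one after the other."""
    matrix[row_det][col] = xor_cells(matrix[row_det][col], rand)
    matrix[row_rand][col] = xor_cells(matrix[row_rand][col], rot_words(rand))


def update_forwarded(matrix, row_det, row_rand, col, rand):
    """Hardware-style update: both cells are read before either is written."""
    cell_det = matrix[row_det][col]
    cell_rand = matrix[row_rand][col]
    out_det = xor_cells(cell_det, rand)
    if row_det == row_rand:
        # Collision: forward the first XOR output into the second XOR block
        # and perform a single write (the first write port stays disabled).
        matrix[row_rand][col] = xor_cells(out_det, rot_words(rand))
    else:
        matrix[row_det][col] = out_det
        matrix[row_rand][col] = xor_cells(cell_rand, rot_words(rand))


def make_matrix(rows=4, cols=4, words=12):
    return [[[(r * cols + c) * 100 + w for w in range(words)]
             for c in range(cols)] for r in range(rows)]
```

Without the forwarding path, the second XOR block would operate on a stale copy of the cell and the two writes would race, so the result would no longer match the sequential definition.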
IV-A4 Wrap-Up Phase
Wrap-up inputs one RAM cell into the sponge and then processes it using a full-round absorb. For the following squeeze, the requested hashed-output length $k$ is lower than the bitrate $b$, meaning that the duplex state at that point directly provides the output hash.
IV-B Memory Matrix
In the wandering phase, up to two RAM cells need to be written and three RAM cells need to be read per clock cycle. These operations cannot be spread over multiple clock cycles without affecting the overall throughput of the design. Therefore, we use standard two-port BRAM along with multipumping and replication techniques [11] in order to implement the required functionality. Replication provides extra read ports by physically replicating the BRAM while connecting the write ports to keep the two copies coherent. Multipumping operates the BRAM at double the clock frequency of the surrounding logic, which, together with replication, effectively provides four read ports and two write ports.
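The port arithmetic can be illustrated with a small behavioural model. The sketch assumes that each physical memory offers one read and one write per fast clock cycle, that two replicas are kept coherent by broadcasting every write, and that reads observe the contents from before the writes of the same system cycle. The class name and this read-before-write ordering are assumptions of the sketch, not properties taken from the design.

```python
class MultiPortRam:
    """Two replicated banks, each accessed twice per system clock cycle."""

    REPLICAS = 2   # replication: one extra physical copy
    PUMP = 2       # multipumping: memory clock is twice the system clock

    def __init__(self, depth: int):
        self.banks = [[0] * depth for _ in range(self.REPLICAS)]

    def cycle(self, reads, writes):
        """Serve one system cycle: up to 4 read addresses, up to 2 (addr, value) writes."""
        if len(reads) > self.REPLICAS * self.PUMP:
            raise ValueError("at most four reads per system cycle")
        if len(writes) > self.PUMP:
            raise ValueError("at most two writes per system cycle")
        # Reads are spread over the replicas; every replica holds the same data.
        data = [self.banks[i % self.REPLICAS][addr] for i, addr in enumerate(reads)]
        # Each write is broadcast to all replicas to keep them coherent.
        for addr, value in writes:
            for bank in self.banks:
                bank[addr] = value
        return data


ram = MultiPortRam(depth=8)
ram.cycle(reads=[], writes=[(0, 5), (1, 7)])
values = ram.cycle(reads=[0, 1, 0, 1], writes=[(0, 9)])
```

Replication alone adds read ports only, since every write has to reach every copy; the second write port comes from the doubled memory clock.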
IV-C Pipelined Architecture
Pipelining the BLAKE2b round function can greatly reduce the delay of the critical path, which extends from the RAM read ports to the RAM write ports in the basic iterative version described above. Eight pipeline stages in the round were found to give the best throughput per area. Each hash that is concurrently being processed by the core needs its own memory. However, extra RAM-based memory is readily available since the current Lyra2REv2 parameters result in a RAM depth much shallower than that of the FPGA BRAM. With adequate scheduling, concurrent hashes write to the same BRAM in distinct clock cycles. Some read ports feed the duplex, while others feed the XOR blocks that combine RAM cells with the duplex outputs. When pipelining the round function, the data feeding the XOR blocks therefore needs to be delayed by as many clock cycles as there are pipeline stages. The extra read port that is unused in the basic architecture allows delaying the control path for that port rather than using a delayed version of the read data, avoiding a long chain of 768-bit registers. Eight pipeline stages in the round increase the latency to 544 clock cycles per hash. On the other hand, eight hashes are processed concurrently and the achievable clock frequency more than doubles, so the overall hashing throughput is improved significantly.
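The latency and throughput relationship can be made explicit with a short calculation. The sketch assumes that the pipeline is kept full, so that eight hashes complete every 544 clock cycles. The clock value used in the example is derived from the reported Zynq throughput of 3.69 MHash/s under that assumption; it is not a frequency reported in the text, and the function names are illustrative.

```python
BASIC_CYCLES = 68   # cycles per hash in the basic iterative architecture
STAGES = 8          # pipeline stages in the round function


def pipelined_latency(basic_cycles: int, stages: int) -> int:
    """Latency of one hash when every round takes `stages` clock cycles."""
    return basic_cycles * stages


def throughput_mhash(clock_mhz: float, latency_cycles: int, in_flight: int) -> float:
    """Hashes per microsecond when `in_flight` hashes share the pipeline."""
    return clock_mhz * in_flight / latency_cycles


latency = pipelined_latency(BASIC_CYCLES, STAGES)
# Amortized cost stays at 68 cycles per hash, so any gain comes from the clock.
amortized = latency / STAGES
# Derived (not reported) clock that would yield 3.69 MHash/s with a full pipeline.
implied_clock_mhz = 3.69 * amortized
print(latency, amortized, round(implied_clock_mhz, 2))
```

In this model, pipelining does not change the number of cycles spent per hash on average; it only shortens the critical path so that each cycle can be shorter.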
V Hardware Implementation Results
To the best of our knowledge, there is no FPGA implementation of Lyra2 in the literature. For this reason, we unfortunately cannot provide comparative FPGA implementation results. Moreover, since in this work we only present a core for Lyra2 and not a full miner architecture, we also cannot compare our implementation to existing FPGA miners for cryptocurrencies based on other PoW algorithms.
Table I presents post-fitting results for the Xilinx Virtex 7 485T FPGA featured on the Xilinx VC707 Evaluation Kit as well as for the Xilinx Zynq UltraScale+ 7EV FPGA from the ZCU104 Evaluation Kit. The power-consumption estimate was obtained using Xilinx's Vivado Power Estimator tool, where the timing constraints are those required for the operating frequencies of Table I, the switching activity is taken from a simulation of the Lyra2 core processing random input vectors, and the post-fitted design provided to the tool meets all timing constraints. Table I reports the estimated dynamic power for the Lyra2 core. The functionality of the Lyra2 core was verified against test vectors that were generated using CPUminer [12].
From Table I, looking at the number of slices or CLBs required for the Virtex and Zynq FPGAs, respectively, it can be seen that the proposed Lyra2 core occupies no more than 4% of the available resources. The amount of RAM occupied is the same for both FPGAs; however, the usage share is greater for the Zynq as it has fewer RAM blocks than FPGAs from the Virtex series. Also, from Table I, it can be seen that the throughput is 2.58 MHash/s and 3.69 MHash/s for the Virtex and Zynq FPGAs, respectively. The estimated dynamic power consumption of the Lyra2 core is under 1.2 W for both FPGAs. As a result, the energy efficiency is estimated to be in the vicinity of 325 to 435 nJ/Hash.
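Table I: Post-fitting results for the proposed Lyra2 core
| Metric | Virtex 7 485T | Zynq UltraScale+ 7EV |
|---|---|---|
| Area (slices or CLBs) | 2 163 (2.85%) | 1 153 (4.00%) |
| LUTs | 6 047 (1.99%) | 6 010 (2.61%) |
| Registers | 8 296 (1.37%) | 8 296 (1.80%) |
| RAM (kbits) | 1 548 (4.27%) | 1 548 (5.39%) |
| Throughput (MHash/s) | 2.58 | 3.69 |
| Dyn. Power (W) | 1.12 | 1.19 |
| Energy Eff. (nJ/Hash) | 432 | 323 |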
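The energy-efficiency figures can be cross-checked by dividing the estimated dynamic power by the throughput. The sketch below uses the rounded power and throughput values quoted in the text, so it reproduces the tabulated figures only to within rounding; the function name is illustrative.

```python
def energy_per_hash_nj(dynamic_power_w: float, throughput_mhash_s: float) -> float:
    """Energy per hash in nanojoules: power (W) divided by hash rate (hash/s)."""
    return dynamic_power_w / (throughput_mhash_s * 1e6) * 1e9


virtex_nj = energy_per_hash_nj(1.12, 2.58)   # about 434 nJ/Hash
zynq_nj = energy_per_hash_nj(1.19, 3.69)     # about 322 nJ/Hash
print(round(virtex_nj), round(zynq_nj))
```

The Zynq device delivers a higher hash rate at nearly the same power, which is why its energy per hash is roughly a quarter lower.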
In this paper, we presented the first hardware implementation of the Lyra2 hashing algorithm, tailored to Lyra2REv2, an ASIC-resistant chained hashing algorithm employed by a few cryptocurrencies. The key to achieving good throughput and energy efficiency is to efficiently map the memory matrix to FPGA RAM blocks and to pipeline the BLAKE2b round function. Based on post-fitting results for two Xilinx FPGAs, we believe that the proposed Lyra2 implementation is a promising core for the purpose of FPGA-based Lyra2REv2 mining. For example, we showed that, for a Zynq UltraScale+ FPGA featured on an affordable evaluation kit, the achievable throughput is 3.7 MHash/s and the energy efficiency is 323 nJ/Hash, for a resource usage of 4%. Our VHDL code and relevant scripts are publicly available at https://github.com/Michielvb/lyra2-hw.
The authors thank Jean-François Têtu for useful feedback. The authors also gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan Xp GPU.
[1] S. Nakamoto, “Bitcoin: A peer-to-peer electronic cash system,” 2008.
[2] “Vertcoin.” [Online]. Available: http://vertcoin.org
[3] “MonaCoin.” [Online]. Available: https://monacoin.org
[4] M. A. Simplicio Jr, L. C. Almeida, E. R. Andrade, P. C. dos Santos, and P. S. Barreto, “Lyra2: Password hashing scheme with improved security against time-memory trade-offs,” Cryptology ePrint Archive, Report 2015/136, 2015. [Online]. Available: https://eprint.iacr.org/2015/136
[5] E. R. Andrade, M. A. Simplicio Jr, P. S. L. M. Barreto, and P. C. F. dos Santos, “Lyra2: Efficient password hashing with high security against time-memory trade-offs,” IEEE Trans. Comput., vol. 65, no. 10, pp. 3096–3108, Oct. 2016.
[6] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, “Cryptographic sponge functions,” Tech. Report v0.1, Jan. 2011.
[7] NIST, “SHA-3 standard: Permutation-based hash and extendable output functions,” FIPS Publication 202, Aug. 2015.
[8] M. A. Simplicio Jr, L. C. Almeida, E. R. Andrade, P. C. dos Santos, and P. S. Barreto, “The Lyra2 reference guide,” Tech. Report v2.3.2, 2014.
[9] J.-P. Aumasson, S. Neves, Z. Wilcox-O’Hearn, and C. Winnerlein, “BLAKE2: simpler, smaller, fast as MD5,” in Int. Conf. on Applied Crypto. and Netw. Security (ACNS). Springer, 2013, pp. 119–135.
[10] J.-P. Aumasson, L. Henzen, W. Meier, and R. C.-W. Phan, “SHA-3 proposal BLAKE,” Tech. Report v1.3, Dec. 2010.
[11] C. E. LaForest and J. G. Steffan, “Efficient multi-ported memories for FPGAs,” in Ann. ACM/SIGDA Int. Symp. on FPGAs, 2010, pp. 41–50.
[12] T. Pruvot, “CPUMiner-Multi,” GitHub repository, 2017. [Online]. Available: https://github.com/tpruvot/cpuminer-multi