Recently, there has been a surge in the popularity of cryptocurrencies, which are digital currencies that enable transactions through a decentralized consensus mechanism. Most cryptocurrencies are based on a blockchain, which is an ever-growing list of transactions that are grouped in blocks. Individual blocks in the chain are linked together using a cryptographic hash of the previous block, which ensures resistance against modifications, and every transaction is digitally signed, typically by using public-key cryptography. Various mechanisms are used to deter denial-of-service attacks and, in particular, double-spending attacks, where the same digital coin is used in multiple concurrent transactions. Many popular cryptocurrencies, including Bitcoin, use a proof-of-work (PoW) mechanism, which was first proposed to combat the problem of junk mail. The proof-of-stake (PoS) and proof-of-burn (PoB) mechanisms are other notable proposals.
The PoW system requires that new blocks provide proof that a function that requires a significant amount of a limited resource was used to construct them before they get accepted into the chain. For example, the employed function can be limited by the available processing power, the available memory, or the network bandwidth and latency. Cryptocurrencies typically use functions that are limited by the available processing power, the most common approach being that random numbers are appended to a block until its cryptographic hash meets a certain condition (e.g., some of its most-significant bits are equal to zero). The chain with the most cumulative PoW is accepted as the correct one, so that an attacker must control more than half of the active processing power on the network to perform a double-spend attack. This is unlikely to happen in practice if the processing power is large enough and is owned by non-colluding entities. Processing nodes that help to compute the hashes of new blocks are called miners, and are rewarded with a fraction of the cryptocurrency when a new block is accepted into the blockchain.
The first cryptocurrency, i.e., Bitcoin, was initially mined using desktop CPUs. Then, GPUs were used to significantly increase the hashing speed. Eventually, GPU mining was outpaced by FPGA miners, which were in turn surpassed by ASIC miners. Nowadays, the majority of the computing power on the Bitcoin network is found in large ASIC farms, each operated by a single entity, which makes the decentralized nature of Bitcoin debatable. To solve this issue, new PoW algorithms have been proposed that aim to be ASIC-resistant. ASIC resistance is achieved by using hashing algorithms that are highly serial, memory-intensive, and parameterizable so that a manufactured ASIC can easily be made obsolete by changing some of the parameters. Since the cost of manufacturing new ASICs whenever some parameters change is prohibitive, GPU mining of ASIC-resistant cryptocurrencies generally carries much lower risk and is more cost-effective. A prime example of an ASIC-resistant hashing algorithm is Lyra2REv2 (and its recently introduced Lyra2REv3 modification), which is used by MonaCoin, Verge, Vertcoin, and some smaller cryptocurrencies. The chained structures of Lyra2REv2 and Lyra2REv3 are shown in Fig. 1 and Fig. 2, respectively. The BLAKE, Keccak, Skein, Blue Midnight Wish (BMW), and CubeHash hashing algorithms are well known and have been studied heavily, both from theoretical and hardware-implementation perspectives (e.g., [12, 13, 14, 15]), as they were all candidates in the SHA-3 competition. On the other hand, to the best of our knowledge, apart from our own previous work, no hardware implementation of the simplified Lyra2 and Lyra2MOD versions of Lyra2 [16, 17], as used in the Lyra2REv2 and Lyra2REv3 algorithms, respectively, has been reported in the literature.
One potential issue with ASIC-resistant cryptocurrencies is that GPUs are generally much less energy efficient than ASICs, meaning that massive adoption of ASIC-resistant cryptocurrencies would significantly increase the (already very high) energy consumption of cryptocurrency mining. FPGA-based miners, on the other hand, are flexible, energy efficient, and readily available to the general public at reasonable prices. Thus, provided that public and user-friendly FPGA-based miners become available, we believe that FPGAs are in fact an attractive platform for ASIC-resistant cryptocurrencies that should not be shunned by the community.
In this work, we present the first FPGA implementation of the simplified Lyra2 hashing algorithm as used in Lyra2REv2. Moreover, we describe an FPGA-based hardware implementation of a complete Lyra2REv2 hashing chain on a Xilinx multi-processor system-on-chip (MPSoC). While we do not provide explicit implementation results for Lyra2MOD or for a Lyra2REv3 chain, currently only used by the (somewhat less popular) Vertcoin cryptocurrency, we explain in detail how our architecture can be modified correspondingly. We present post-layout results for a Xilinx MPSoC for the complete Lyra2REv2 mining chain as well as for the individual hashing cores. These results show that our proposed Lyra2REv2 hardware architecture can achieve a hashing throughput of 11.76 MHash/s at an energy efficiency of 804 nJ/Hash, which is approximately 2 to 4 times better than existing solutions, while requiring less than 70% of the programmable logic (PL) resources of the MPSoC.
The remainder of this paper is organized as follows. In Section II, we provide the necessary background for the PoW concept and for the Lyra2 algorithm. In Section III, we give an in-depth explanation of the simplifications that Lyra2REv2 and Lyra2REv3 make to the generic Lyra2 algorithm. The hardware implementation of the simplified Lyra2 and Lyra2MOD algorithms is described at length in Section IV. In Section V, we describe an MPSoC-based hardware architecture that implements the full Lyra2REv2 hashing chain and can be easily modified to also implement the Lyra2REv3 hashing chain. We provide implementation and comparison results for the full Lyra2REv2 hashing chain in Section VI. Section VI also includes results for the individual hashing cores, notably including our simplified Lyra2 core. Finally, Section VII concludes this paper.
In this section, we provide the necessary background on the PoW concept, as well as on some components of the Keccak and BLAKE2 hashing algorithms, which are used in Lyra2.
II-A Proof of Work
In order to explain the PoW concept in more detail, we use Bitcoin as an example, but we note that many other Bitcoin-derived cryptocurrencies, such as MonaCoin and Vertcoin, use the same header structure. Each block in the Bitcoin blockchain has an 80-byte (or 640-bit) header that contains information about the block, as shown in Table I. The version field dictates which version of the block validation rules needs to be followed. The previous block header hash field contains a hash of the header of the previous block, while the merkle root hash field is derived from the hashes of all transactions in the current block. Together, these two fields ensure that no previous transaction in the blockchain can be modified without also modifying the header of the current block. The time field contains the Unix epoch time at which each miner started performing the PoW, which must be strictly greater than the median time of the previous 11 blocks. The nBits and nonce fields are the most relevant to the PoW. Specifically, nBits defines a 256-bit numerical value using a compact encoding, while nonce can be chosen freely by the miner. The PoW that each miner performs amounts to finding a value for nonce so that a (chained) hash function of the header has a numerical value that is strictly smaller than the target threshold, i.e., the value defined by nBits. Since hash functions are generally not invertible, this can only be achieved by testing a very large number of nonce values until the target threshold is satisfied.
Table I: Bitcoin block header fields
|version|
|previous block header hash|
|merkle root hash|
|time|
|nBits|
|nonce|
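The nonce search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the chained hash is replaced by the double SHA-256 that Bitcoin itself uses (not by the Lyra2REv2 chain), the hash is read as a little-endian integer, and the nonce is assumed to occupy the last four bytes of the 80-byte header.

```python
import hashlib
import struct


def decode_nbits(nbits: int) -> int:
    """Expand the 32-bit compact nBits field into the full integer target."""
    exponent = nbits >> 24          # size of the target in bytes
    mantissa = nbits & 0x007FFFFF   # 23-bit mantissa (sign bit ignored here)
    if exponent <= 3:
        return mantissa >> (8 * (3 - exponent))
    return mantissa << (8 * (exponent - 3))


def pow_hash(header: bytes) -> int:
    """Stand-in chained hash: double SHA-256 read as a little-endian integer."""
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return int.from_bytes(digest, "little")


def find_nonce(header_prefix: bytes, target: int, max_tries: int = 1 << 20):
    """Test successive nonce values until the hash is strictly below target."""
    for nonce in range(max_tries):
        header = header_prefix + struct.pack("<I", nonce)
        if pow_hash(header) < target:
            return nonce
    return None  # search space exhausted without success


# Illustrative use with an artificially easy target (about 1 success in 256 tries).
easy_target = 1 << 248
found = find_nonce(bytes(76), easy_target)
```

Real targets are many orders of magnitude smaller than the easy target used here, which is why the search is delegated to dedicated hardware.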
II-B The Keccak Duplex
Keccak is a family of hashing algorithms based on a cryptographic sponge [19, 20]. A cryptographic sponge is a function that takes an arbitrary-length input to produce an arbitrary-length hashed output. Lyra2 uses a specific implementation of the sponge, called the duplex construction, which has a state that is preserved across different inputs. The duplex construction with naming conventions as adopted in Lyra2 can be found in [21, Fig. 2]. It consists of a permutation function that operates on a fixed-width state vector, whose width in bits is the sum of two parameters called the bitrate and the capacity of the sponge, as well as a padding rule pad. We note that the permutation is iterative and performs a pre-defined number of iterations, also called rounds.
A call to the duplex construction proceeds as follows. An input string is first fed into the duplex. Then, it is padded to the length of the bitrate and XOR’d into the lower bits of the state, i.e., the bitrate part. The state is then fed through the permutation. The output of the permutation is the new state of the duplex, while its lower bits, up to a requested length that cannot exceed the bitrate, are the output hash. If we consider the duplex construction as an object, then the aforementioned procedure is referred to as its duplexing method. The following two auxiliary methods are useful to simplify the notation: absorb updates the state using the input but discards the output (equivalent to duplexing with a requested output length of zero), while squeeze reads the requested output bits and then calls duplexing with an empty input string.
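The duplexing, absorb, and squeeze methods can be illustrated with a small Python model. This is a sketch under stated assumptions, not the Lyra2 sponge: the class name is hypothetical, the state is a plain integer, inputs are assumed to be already padded to at most the bitrate, and the permutation is a toy bijection (a rotation followed by a constant XOR) that merely stands in for the real round function.

```python
class ToyDuplex:
    """Toy duplex object; the permutation is a placeholder, not BLAKE2b."""

    def __init__(self, bitrate: int, capacity: int):
        self.bitrate = bitrate
        self.width = bitrate + capacity  # state width = bitrate + capacity
        self.state = 0

    def _permute(self, x: int) -> int:
        # Placeholder bijection: rotate left by 7 bits, then XOR a constant.
        mask = (1 << self.width) - 1
        x = ((x << 7) | (x >> (self.width - 7))) & mask
        return x ^ (0x9E3779B97F4A7C15 & mask)

    def duplexing(self, message: int, out_bits: int) -> int:
        if message >> self.bitrate:
            raise ValueError("message must fit in the bitrate part")
        if out_bits > self.bitrate:
            raise ValueError("output length cannot exceed the bitrate")
        self.state ^= message                    # XOR into the lower bits
        self.state = self._permute(self.state)   # apply the permutation
        return self.state & ((1 << out_bits) - 1)

    def absorb(self, message: int) -> None:
        # Update the state and discard the output.
        self.duplexing(message, 0)

    def squeeze(self, out_bits: int) -> int:
        # Read the output bits first, then permute with an empty input.
        out = self.state & ((1 << out_bits) - 1)
        self.duplexing(0, 0)
        return out


sponge = ToyDuplex(bitrate=8, capacity=8)
sponge.absorb(0xAB)
digest = sponge.squeeze(8)
```

Because the state persists across calls, the output of a squeeze depends on every input absorbed before it.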
II-C The BLAKE2b Round Function
BLAKE2 is a family of hash functions designed for fast software implementations. It is the successor of BLAKE as submitted to the SHA-3 competition. The Lyra2 algorithm draws heavily from the round function of BLAKE2b, the 64-bit variant of BLAKE2. The round function consists of an arrangement of blocks that apply a so-called G-function to a 16-word state, where one G-function operates on 4 different state words. For BLAKE2b, a word has 64 bits, meaning that 16 state words amount to 1024 bits. The total round transforms these 1024 bits using four G-blocks, rearranges the output, and then applies a second four-G-block transformation. Algorithm 1 describes the modified BLAKE2b G-function as used in Lyra2, where the rotation operator denotes a right rotation of a 64-bit word by the indicated number of bits.
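As an illustration of the round structure described above, the following Python sketch implements a message-free G-function and one round built from eight G-blocks. It assumes the standard BLAKE2b rotation amounts (32, 24, 16, and 63) and the standard column and diagonal word groupings; the function names are hypothetical and the message-word additions are left out, so it should be read as a sketch rather than as a reproduction of Algorithm 1.

```python
MASK64 = (1 << 64) - 1


def rotr64(x: int, n: int) -> int:
    """Rotate a 64-bit word right by n bits."""
    return ((x >> n) | (x << (64 - n))) & MASK64


def g(a: int, b: int, c: int, d: int):
    """G-function on four 64-bit words, without message-word additions."""
    a = (a + b) & MASK64
    d = rotr64(d ^ a, 32)
    c = (c + d) & MASK64
    b = rotr64(b ^ c, 24)
    a = (a + b) & MASK64
    d = rotr64(d ^ a, 16)
    c = (c + d) & MASK64
    b = rotr64(b ^ c, 63)
    return a, b, c, d


# First four G-blocks work on columns, the next four on diagonals.
COLUMNS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONALS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]


def blake2b_round(state):
    """Apply one round to a list of sixteen 64-bit words (1024 bits)."""
    v = list(state)
    for i, j, k, l in COLUMNS + DIAGONALS:
        v[i], v[j], v[k], v[l] = g(v[i], v[j], v[k], v[l])
    return v


new_state = blake2b_round(list(range(16)))
```

The four G-blocks within each half of the round touch disjoint words, which is what allows them to be evaluated in parallel in hardware.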
III The Simplified Lyra2 Algorithms Used in Lyra2REv2 and Lyra2REv3
Lyra2 was initially created as a password hashing scheme (PHS) for secure storage [16, 17]. Lyra2 uses the duplex construction from Keccak, where the permutation function is the round function from BLAKE2b. The reasoning for this choice is twofold and stems from the concept of favoring CPUs. On the one hand, the G-function of BLAKE2b is software-oriented (e.g., the rotations are chosen to specifically benefit from SIMD instructions). On the other hand, the permutation of BLAKE2b has been shown to be secure even with a reduced number of rounds, whereas a full permutation normally consists of 12 rounds. As explained in more detail in the sequel, after every permutation, the Lyra2 algorithm performs a memory access. A reduced number of rounds in a permutation allows more memory accesses for the same execution time, making low-memory attacks on parallel platforms more costly.
In the remainder of the text, we distinguish between calls to a full-round (i.e., 12-iteration) duplex and calls to a reduced-round duplex, where the notation for the latter indicates the reduced number of rounds. Because the G-functions are specified to operate on an array of sixteen 64-bit words, Lyra2 uses a duplex with a width of 1024 bits. Pseudocode for the simplified version of Lyra2 that is used specifically in Lyra2REv2 is given in Algorithm 2 and can be compared to the original Lyra2 pseudocode available in [16, Algorithm 2]. In the following sections, we first explain in detail each phase of the simplified Lyra2 algorithm used in Lyra2REv2 and how it differs from the reference implementation of Lyra2. Then, we explain the differences between Lyra2 as used in Lyra2REv2 and Lyra2MOD as used in the updated Lyra2REv3 algorithm.
III-A Bootstrapping Phase
In the bootstrapping phase, the duplex is initialized with a state that depends on the input, a salt (which in Lyra2REv2 is set to be equal to the input for simplicity), and the algorithm parameters by using a full-round absorb. The duplex in Algorithm 2 internally uses a bitrate of 768 bits and a capacity of 256 bits. The call on line 5, however, considers only inputs of 512 bits instead of 768 bits, so as to not overwrite the upper part of the initialization state, i.e., the 512-bit initialization value IV specified by BLAKE2b. This results in two full-round absorbs, one for each 512-bit input block.
III-B Setup Phase
During the setup phase of Lyra2, an memory matrix is initialized using the single-round duplex . In the simplified version of Lyra2 used in Lyra2REv2, we have . Rows are initialized from first to last, while columns within each row are initialized from last to first. From the second row onward, a previous row is re-read, making it impractical to only store parts of the memory matrix. Also, from the third row onward, in addition to the previous row, i.e., , a specific pre-initialized row, i.e., , is revisited (i.e., read and updated) in a deterministic manner. Rows are re-read or revisited from the first to the last column. Revisited rows use a rotated version of the duplex output, where the rotation number is chosen as in Lyra2REv2. We note that the general revisiting scheme for is significantly more complicated when , as rows to be revisited can be chosen from within a specific window.
III-C Wandering Phase
The wandering phase is configurable to be the most time-consuming of the four phases. This is done through a timecost parameter that sets the number of rows to be revisited. In Lyra2REv2, the timecost parameter is chosen such that there is only a single iteration over the memory matrix. Specifically, the wandering phase revisits two rows, one of which is chosen deterministically while the other is chosen in a pseudorandom fashion by using the least significant part of the duplex output. We note that the pseudorandom and deterministic rows can collide, causing the operations on lines 26 and 27 to sequentially read from and then write to the same matrix cell. We also note that the reference implementation of Lyra2 selects both rows pseudorandomly. Furthermore, whereas the simplified Lyra2 in Lyra2REv2 uses a deterministic column counter, the reference implementation features pseudorandom column counters. Lastly, in addition to the variable that tracks the previous value of one row index, the reference implementation introduces a second variable that tracks the previous value of the other row index. These extra variables appear, for example, on line 25, where the simplified Lyra2 has a two-operand wordwise addition, but the reference implementation would pass an expression involving the extra variables as input to the sponge.
III-D Wrap-up Phase
The wrap-up phase consists of a full-round absorb of a specific cell of the memory matrix, followed by a squeeze of the hashed output. This specific cell is likewise pseudorandom, as it is selected as the first cell of the last revisited pseudorandom row. The requested squeeze length is lower than the bitrate, which means that the final output is provided directly from the duplex state without a permutation.
III-E From Lyra2REv2 to Lyra2REv3
Recently, the developers of Lyra2REv2 proposed Lyra2REv3 with the goal of rendering obsolete the ASIC miners for Lyra2REv2 that had become available on the market. Vertcoin is currently the only Lyra2REv2-based cryptocurrency that has performed a hard fork to force the miners to use Lyra2REv3. Fig. 2 illustrates the new chained hashing algorithm. Compared to the Lyra2REv2 chain in Fig. 1, it can be seen that the Keccak-256 and Skein-256 hashing algorithms were removed from the chain, and a second instance of a Lyra2-based hashing algorithm was added. The developers justified the removal of both Keccak-256 and Skein-256 by mentioning the existence of significantly more efficient hardware implementations of these algorithms compared to their software counterparts. In addition to these changes, the simplified Lyra2 algorithm itself has been modified.
The updated Lyra2 algorithm as used in Lyra2REv3, called Lyra2MOD, is illustrated in Algorithm 3, where the changes from the simplified Lyra2 algorithm used in Lyra2REv2 are highlighted in blue (lines 4 and 24–26). While the changes appear to be minor, the Lyra2MOD modifications are non-conventional in the Lyra2 scheme. Lyra2MOD introduces a new variable that can take the value of the four least-significant bits of any word in the 1024-bit sponge state. This assignment is non-conventional, because it does not exclude the four words that make up the sponge capacity. Within its specifications, the sponge construction does not allow for such an operation that directly reads bits from the capacity part of the sponge. The new variable is then used to update the row selection, which can now similarly be assigned some least significant part of any state word. Both assignments require defining a new operation on the sponge that requests the current state without performing any rounds. This new operation is similar to the squeeze operation, with the difference that it is not restricted to requests on the bitrate part of the state. To omit the round functionality of the sponge, we call this operation on the sponge reduced to zero rounds. The intended effect of these changes is to further serialize the algorithm, making hardware implementation more challenging. We briefly discuss the impact of these changes on resource requirements and on performance in Section IV-D.
IV Programmable Logic Implementation of Simplified Lyra2
In this section, we discuss how the Lyra2 algorithm, which is the most complex algorithm of the Lyra2REv2 chain, can be efficiently mapped to a hardware implementation. The hardware implementation of the full Lyra2REv2 hashing chain is discussed in Section V, as well as the changes that would be required for a Lyra2REv3 chain. Similarly to the previous section, we first describe an implementation of Lyra2 for Lyra2REv2, and we then explain the necessary changes to implement Lyra2MOD for Lyra2REv3.
Recall that, in the current instance of Lyra2 as used in Lyra2REv2, the timecost parameter is , the number of rows in the memory matrix is , the number of columns in the memory matrix is , and the desired hashing output length is (we note that the same parameter values are also used for Lyra2MOD in Lyra2REv3). Our architecture is optimized for these parameter values, but can be modified relatively easily to accommodate potential changes in the aforementioned parameters. Moreover, for and , the memory matrix is 1.5 kB in size, which is clearly not prohibitively large to be implemented either in PL or on an ASIC. The claimed ASIC-resistance of the Lyra2REv2 algorithm comes from the fact that , , and can be increased easily if necessary and that the chain of hashing algorithms itself can be modified (as is the case with the newer Lyra2REv3 algorithm).
The high-level datapath of our proposed PL implementation of the simplified Lyra2 algorithm used in Lyra2REv2 is shown in Fig. 3, where the duplex construction with its state, round, and XOR input block can be clearly distinguished. The memory matrix is mapped to a block RAM (BRAM). To reduce the complexity of the multiplexer (mux) at the input of the duplex, the BRAM also contains the constant bit vectors used during the bootstrapping and setup phases, i.e., an all-zero vector and a second constant vector.
As mentioned in Sections II-C and III, the round function of the Lyra2 algorithm is an arrangement of BLAKE G-functions. Fig. 4 shows the hardware architecture of BLAKE’s G-function, where all signals are bits wide. Lyra2 uses the BLAKE2b variation, i.e., , , , , and (cf. Algorithm 1). Furthermore, the CM and CM inputs are not used, thus the corresponding adders are omitted in our implementation of the round function for Lyra2.
In the following, we first describe a version of the hardware architecture of our simplified Lyra2, where each round of the permutation function is executed in a single clock cycle (cc). We then describe how this basic architecture can be improved through pipelining.
IV-A Basic Iterative Architecture
Our basic iterative Lyra2 architecture requires 68 cc per hash: 24 for the bootstrapping phase, 16 each for the setup and wandering phases, and 12 for the wrap-up phase.
IV-A1 Bootstrapping Phase
During the bootstrapping phase, the duplex processes two 512-bit input blocks from using a full-round absorb. In Lyra2REv2, , with being the output of the first CubeHash instance, i.e., the previous algorithm in the chain. Thus, as shown in Fig. 3, the vector is one of the inputs to the mux of the duplex. On the other hand, the vector is fed into the sponge by loading it on while simultaneously loading the all-zero vector on . Both constants are stored at known addresses in the BRAM, and are absorbed in a separate 12-round Bootstrap state. During bootstrapping, the duplex only receives an input vector in the first round. Hence, for subsequent rounds, and output the all-zero vector, and their sum is passed to the duplex via its input mux.
IV-A2 Setup Phase
We split the setup phase into three distinct phases for convenience, namely Setup0, Setup1, and Setup2, which correspond to Lines 6–8, Lines 9–11, and Lines 12–20 of Algorithm 2, respectively. Similarly to the bootstrapping phase, the setup phase uses the all-zero vector stored in the BRAM. In the Setup0 state, the squeezes input an empty message into the duplex and directly write the duplex output to the BRAM. To achieve that, the all-zero vector is output on , , and . Setup1 reads the all-zero vector on , but a specific vector from the BRAM on . Setup2 reads two vectors from and . Both the duplex output and the rotated duplex output are XOR’d with two other vectors from the BRAM, requiring the two XOR blocks in parallel as illustrated in Fig. 3. On the control path, counters keep track of the various rows and their corresponding columns to generate read and write addresses for the RAM.
IV-A3 Wandering Phase
The input to the duplex in the wandering phase is always the word-wise addition of two RAM cells. Both XOR blocks connected to the duplex output are used. As mentioned in the algorithmic description of the wandering phase in Section III-C, the pseudorandom and deterministic rows used in this phase can collide. In hardware, this special case requires the output of one XOR block to be input to the other, while the write port of the first XOR block needs to be disabled to prevent write collisions on the RAM.
IV-A4 Wrap-Up Phase
During the wrap-up phase, one RAM cell is input into the sponge and then processed using a full-round absorb. For the following squeeze, the requested hashed-output length is lower than the bitrate , i.e., the duplex state directly provides the output hash.
IV-B Memory Matrix
In the wandering phase, up to two RAM cells need to be written and three RAM cells need to be read per cc. These operations cannot be spread over multiple cc without negatively affecting the overall throughput of the design. Therefore, we use standard true-dual-port BRAM along with multipumping and replication techniques in order to implement the required functionality. Replication provides extra read ports by physically replicating the BRAM while connecting the write ports to keep the two copies coherent. Multipumping operates the BRAM at double the clock frequency of the surrounding logic, which, together with replication, effectively provides four read ports and two write ports. A -bit wide BRAM with true-dual-port functionality can be implemented using and one PL BRAM primitives, which are and bits wide, respectively. Considering that, even with added FIFOs, our full chain is not limited by the availability of BRAM, the cost of replication in terms of hardware resources is reasonable. In total, our Lyra2 core then uses Kbits of BRAM.
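The port count quoted above can be reproduced with a simple budget model. The sketch below is an illustration under simplifying assumptions rather than a description of the actual memory controller: it assumes that every write must be applied to every replica to keep the copies coherent, that each remaining access slot serves one read, and the function name is hypothetical.

```python
def port_budget(copies: int, ports_per_bram: int, pump_factor: int,
                writes_per_cycle: int):
    """Return (read_ports, write_ports) available per logic clock cycle."""
    # Multipumping multiplies the accesses each physical port can serve.
    accesses_per_copy = ports_per_bram * pump_factor
    if writes_per_cycle > accesses_per_copy:
        raise ValueError("not enough access slots for the requested writes")
    # Writes are mirrored into every copy, so they cost slots in each copy.
    reads_per_copy = accesses_per_copy - writes_per_cycle
    # Reads can be served by any copy, so read capacity adds up.
    return copies * reads_per_copy, writes_per_cycle


# Two replicated true-dual-port BRAMs, clocked at twice the logic frequency.
reads, writes = port_budget(copies=2, ports_per_bram=2, pump_factor=2,
                            writes_per_cycle=2)
```

Under these assumptions, the budget covers the three reads and two writes per cc required by the wandering phase, with one read port to spare.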
IV-C Pipelined Architecture
Pipelining the BLAKE2b round function can greatly reduce the delay of the critical path. Recall that the round function consists of an arrangement of G-functions, whose architecture is illustrated in Fig. 4. In the basic iterative version described above, the critical path of the round function extends from the RAM read ports to the RAM write ports. Eight pipeline stages in the round were found to give the best throughput per area. Each hash that is concurrently being processed by the core needs its own memory. However, extra RAM-based memory is readily available since the current Lyra2REv2 parameters result in a RAM depth much shallower than that of the PL BRAM. With adequate scheduling, concurrent hashes write to the same BRAM in distinct cc. While read ports and feed the duplex, and feed the XORs with duplex outputs. When pipelining the round function, and therefore need to be delayed by as many cc as there are pipeline stages. The extra read port that is unused in the basic architecture allows delaying the control path for rather than using a delayed version of , avoiding a long chain of 768-bit registers. Eight pipeline stages in the round increase the latency to 544 cc per hash. On the other hand, eight hashes are processed concurrently. This results in one hash being output every 68 cc, and the achievable clock frequency more than doubles, so the overall hashing throughput is improved significantly.
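The latency and throughput figures of the pipelined core follow from simple arithmetic, which the Python sketch below works through. The function names are hypothetical, and the 100 MHz and 200 MHz clock frequencies are placeholder values chosen only to illustrate a doubling of the clock; they are not measured results.

```python
def pipelined_latency_cc(cycles_per_hash: int, stages: int) -> int:
    """Latency of one hash when every round takes `stages` clock cycles."""
    return cycles_per_hash * stages


def hashes_per_second(clock_hz: float, latency_cc: int,
                      hashes_in_flight: int) -> float:
    """Throughput when `hashes_in_flight` hashes share the pipeline."""
    return clock_hz * hashes_in_flight / latency_cc


BASIC_CC = 68   # clock cycles per hash in the basic iterative architecture
STAGES = 8      # pipeline stages in the round function

latency = pipelined_latency_cc(BASIC_CC, STAGES)       # 544 cc per hash
basic = hashes_per_second(100e6, BASIC_CC, 1)          # placeholder clock
pipelined = hashes_per_second(200e6, latency, STAGES)  # placeholder clock
```

Since one hash still completes every 68 cc, the throughput gain in this model comes entirely from the higher clock frequency that the shorter critical path permits.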
IV-D Programmable Logic Implementation of Lyra2MOD
A potential PL implementation of the Lyra2MOD algorithm would be based on the pipelined architecture of the simplified Lyra2 algorithm as described in Section IV-C, with appropriate changes to support the modified wandering phase explained in Section III-E. Fig. 5 shows the hardware implementation of the row selection during the wandering phase for both the simplified Lyra2 (Lyra2REv2) and Lyra2MOD (Lyra2REv3) algorithms. Specifically, Fig. 5(a) shows that in the simplified Lyra2 algorithm, the row is selected simply based on the least-significant bits of the state (cf. line 23 of Algorithm 2). On the other hand, the row selection in Lyra2MOD is much more involved (cf. lines 24–26 of Algorithm 3). Thus, as shown in Fig. 5(b), Lyra2MOD requires the addition of multiplexers and memory to store the new variable. The memory for this variable is initialized to all zeros during the bootstrap phase. It should be noted that the changes introduced by Lyra2MOD have negligible impact in terms of resources. Furthermore, the critical path is unaffected, as the new row-selection logic in Lyra2MOD translates to significantly fewer logic levels than the 64-bit adders on the datapath. We also note that, in the eight-stage pipelined architecture, the new variable and the row selection would need to be stored for every hash in the pipeline, such that they are implemented as small RAMs.
V MPSoC Implementation of a Lyra2REv2 Miner
In this section, we present an MPSoC-based architecture for the Lyra2REv2 chained hashing algorithm, and the changes that would be required to support the Lyra2REv3 chain. Specifically, we implement the computation-intensive part of the Lyra2REv2 (or Lyra2REv3) chained hashing algorithm on the PL and use the processing system (PS) capabilities of the MPSoC to run supporting software used to enable fast verification of the hashing chain.
In the following, we first briefly describe the communication mechanism between the PS and the PL sides of the MPSoC. Then, we discuss the verification software on the PS. Lastly, we discuss our hardware implementation of the mining algorithm on the PL side of the device, including the hashing algorithms other than the simplified Lyra2 and Lyra2MOD that we have already described above.
V-A Communications Between the Processing System and the Programmable Logic
Given the limited number of distinct data types (i.e., block headers and hashes) that transit between the PL and the PS, we use a FIFO interface. This allows the verification software to easily write -bit block headers to the hardware miner, and to read back -bit hashes. The parameterized macros that Xilinx provides include a FIFO device. Two instances of this device are used: one to write block headers to the miner, and one to read hashes back from the miner. The FIFO devices are wrapped in an adapter allowing access to the FIFOs through a memory-mapped AXI4-Lite bus, as shown in Fig. 6, with a -bit width clocked at MHz.
V-B Verification Software on the Processing System
The verification software consists of a Linux driver and a userspace application running inside a custom embedded GNU/Linux distribution. The driver exposes memory-mapped FIFO devices as a character device, while doing basic sanity checking to ensure that only whole block headers are written and whole hashes are read. Userspace applications can then use the character device to write block headers to the mining algorithm on the PL side, and read back hashes.
The userspace application is based on the existing cpuminer-multi open-source mining software, which we enhanced by adding a new type of algorithm, namely lyra2rev-hw. This new algorithm pipes the nonce search space, a batch of block headers at a time, to the mining hardware using the character device mentioned above. The hashes read from the mining hardware are then verified to meet the imposed threshold. As mentioned in Section II-A, for a PoW to be accepted by the network, the miner has to find a nonce that results in a hash with a value strictly smaller than the target threshold.
To ensure reliable and reproducible software builds, the Linux-based firmware and boot image are created using a customized Yocto board support package (BSP). This BSP includes a custom layer on top of Xilinx’s base Yocto BSP and a set of supporting scripts to build and flash a boot image onto an SD card. The custom layer contains the patches to the Linux kernel and the patches to cpuminer-multi described above.
V-C Mining Algorithm in Programmable Logic
As shown in Fig. 6, each hash function has its dedicated scheduler, and is bounded by FIFOs. The number of instances of each hash function varies, as it is chosen depending on the maximum clock frequency and the throughput in hashes per second of each core, with the goal of balancing the processing pipeline. More details are provided in Section VI, but the number of instances per hashing algorithm is selected in order to maximize the overall mining algorithm throughput. In this section, we provide implementation details about all components in the PL part of the Lyra2REv2 miner.
Reference implementations for the SHA-3 candidates that are optimized for various performance metrics are publicly available. In particular, a research team at George Mason University (GMU) described a methodology to compare the hardware performance of fourteen round-two candidates, including all of those utilized in Lyra2REv2, and they also provide the source code for their implementations. We used the GMU throughput-per-area-optimized designs as starting points for most of our own implementations of these hashing cores.
The Lyra2REv2 chain passes only fixed-width inputs between the algorithms in the chain, while all of the SHA-3 candidates were required to support arbitrary input lengths. Consequently, some of their functionality is never exercised in the chain, which allows for heavy optimizations. Also, the implementations from GMU include interfaces to communicate with software, accounting for such things as endianness and serialization at the output, which are not required for our custom mining chain. As such, we re-used the main computational block of the GMU implementations and customized the control path. We were thus able to greatly simplify the control flow for these algorithms and could often also introduce optimizations for the computational datapath. More details are provided for individual hashing cores in the following. Moreover, the performance of each hashing core, measured in cc per hash, is summarized in Table II.
The hashing cores have different nominal frequencies and throughputs. Firstly, FIFOs are used to normalize data transfers between hashing cores with different throughputs, by properly asserting the forward- and back-pressure signals. Secondly, since the hashing cores also have various operating frequencies, asynchronous FIFOs are used to safely transfer data from one clock domain to another. The forward- and back-pressure signals are individually set to match the internal pipelined architecture of each hashing core. Lastly, the input and output FIFOs of the full chain convert the AXI4-Lite memory-mapped protocol to a native FIFO protocol.
While the FIFOs are necessary to interface hashing algorithms operating at different frequencies, data schedulers (one per hashing step in the chain) are needed to balance throughput between cores with varying execution times. For example, an upstream hashing core that produces an output hash every 192 cc will inherently starve a downstream core that can accept new data every 68 cc. To address this limitation, in this example we would instantiate three copies of the upstream core and schedule the read/write operations of each copy so that, together, they produce a hash every 64 cycles.
The scheduler consists of a state machine that monitors the upstream and downstream FIFO back-pressure signals and that tracks the computation of each hashing core. Schedulers have knowledge of the execution time and pipeline depth of the hashing cores they are associated with. Given this information, the scheduler will assert the ready signal of the next available core, in a round-robin fashion, when the upstream FIFO has enough data to sustain the hashing core's internal pipeline and the downstream FIFO has enough space to receive new data. Subsequently, when a core finishes its computations, the resulting hash is written to the downstream FIFO.
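The replication factor in this example follows from a ceiling division. The short Python sketch below is only illustrative: the function names are hypothetical, and it assumes that both cores are compared in the same clock domain so that cycle counts can be divided directly.

```python
import math

def replication_factor(upstream_cc: int, downstream_cc: int) -> int:
    """Smallest number of upstream copies needed so that, on average, the
    upstream step delivers hashes at least as often as the downstream
    core can accept them (same clock assumed for both)."""
    return math.ceil(upstream_cc / downstream_cc)

def effective_interval(core_cc: int, copies: int) -> float:
    """Average number of cycles between two hashes when `copies`
    identical cores share the work."""
    return core_cc / copies

copies = replication_factor(192, 68)
print(copies, effective_interval(192, copies))  # 3 64.0
```

With three copies, the upstream step delivers a hash every 64 cc on average, which is slightly faster than the 68 cc the downstream core needs, so the downstream core is no longer starved.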
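To make the scheduling policy more concrete, the following Python sketch models one scheduler at the cycle level. It is a toy model under stated assumptions, not the authors' state machine: the upstream FIFO is assumed to always hold data, at most one core is started per cycle, downstream FIFO space is reserved when a core is started, and all names and default parameters are hypothetical.

```python
from collections import deque

def simulate_scheduler(num_cores, exec_cc, total_cycles,
                       fifo_capacity=4, consume_cc=1):
    """Toy cycle-level model of one round-robin data scheduler.

    Returns the number of hashes accepted by the downstream stage, which
    takes at most one hash every `consume_cc` cycles.
    """
    finish_at = [None] * num_cores  # completion cycle per core, None = idle
    downstream = deque()            # contents of the downstream FIFO
    next_core = 0                   # round-robin pointer
    delivered = 0

    for cycle in range(total_cycles):
        # 1) Cores that finish in this cycle write their hash downstream.
        for i in range(num_cores):
            if finish_at[i] == cycle:
                downstream.append(i)
                finish_at[i] = None

        # 2) The downstream stage accepts one hash every consume_cc cycles.
        if downstream and cycle % consume_cc == 0:
            downstream.popleft()
            delivered += 1

        # 3) Start the next core (round-robin) if it is idle and FIFO space
        #    for its future result can be reserved (back-pressure check).
        in_flight = sum(f is not None for f in finish_at)
        space_left = len(downstream) + in_flight < fifo_capacity
        if finish_at[next_core] is None and space_left:
            finish_at[next_core] = cycle + exec_cc
            next_core = (next_core + 1) % num_cores

    return delivered

# Three 192-cc cores feeding a fast consumer: about one hash per 64 cc.
print(simulate_scheduler(3, 192, 19200))
# The same cores throttled by a consumer that accepts one hash per 100 cc.
print(simulate_scheduler(3, 192, 19200, consume_cc=100))
```

In the second call, the reservation check in step 3 stops new computations from being started once the downstream FIFO fills up, which is the role played by the back-pressure signal in the text.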
Like the round function in the sponge of Lyra2, which is based on BLAKE2b, the round function of BLAKE is given by an arrangement of G-functions. The G-functions themselves differ from those of Algorithm 1: they use different rotation constants and include additional adders. We adapted our BLAKE2b round implementation for Lyra2 to implement the BLAKE algorithm.
Consider Fig. 4, which shows the hardware architecture of BLAKE’s G-function that updates 4 out of 16 state words, with all signals being bits wide. In the Lyra2REv2 and Lyra2REv3 algorithms, the BLAKE hash function is for bits, and uses the constants , , , and . The inputs CM and CM take the value of a round-dependent permutation of a message block and constant . Notably, these inputs are excluded when the G-function is implemented together with the sponge, because an interface to inject message blocks into the state is already present in the functions and .
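As a software reference for the datapath described above, the sketch below models one G-function in Python. It assumes the BLAKE-256 parameters from the BLAKE specification (32-bit words and right rotations by 16, 12, 8, and 7 bits); the two arguments `cm0` and `cm1` stand for the two CM inputs, i.e., the round-dependent message word XORed with a constant, and are assumed to be computed outside the function. The function and argument names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # BLAKE-256 operates on 32-bit words

def rotr32(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def blake256_g(a, b, c, d, cm0, cm1):
    """One BLAKE-256 G-function: updates 4 of the 16 state words.

    cm0 and cm1 are the precomputed (message XOR constant) inputs.
    """
    a = (a + b + cm0) & MASK32   # adder that also injects the first CM input
    d = rotr32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 12)
    a = (a + b + cm1) & MASK32   # adder that also injects the second CM input
    d = rotr32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 7)
    return a, b, c, d

# An all-zero state with all-zero inputs stays all-zero.
print(blake256_g(0, 0, 0, 0, 0, 0))  # (0, 0, 0, 0)
```

Dropping the `cm0` and `cm1` terms from the two additions gives the structure that remains when the message injection is handled elsewhere, as noted above for the sponge.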
BLAKE hashes a 512-bit message block in 14 rounds. Unlike the other cores in the Lyra2REv2 chain, which pass 256-bit values, the BLAKE core, at the head of the chain, takes 640-bit block headers as input. The block header is therefore split into two message blocks, and our BLAKE implementation can then output one hash every 28 cc.
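The 28 cc figure can be reproduced with a small calculation. The sketch below is an illustration under two assumptions that are not stated explicitly in the text: one round is computed per clock cycle, and the BLAKE-256 padding from the BLAKE specification adds at least 66 bits (a leading 1 bit, a trailing 1 bit, and a 64-bit length field). The names are hypothetical.

```python
import math

BLOCK_BITS = 512        # BLAKE-256 message block size
ROUNDS_PER_BLOCK = 14   # rounds applied per message block
MIN_PADDING_BITS = 66   # assumed minimum padding: 1 + 1 + 64 bits

def blake256_cycles(message_bits: int, cycles_per_round: int = 1) -> int:
    """Clock cycles needed by a round-serial core for one message."""
    blocks = math.ceil((message_bits + MIN_PADDING_BITS) / BLOCK_BITS)
    return blocks * ROUNDS_PER_BLOCK * cycles_per_round

print(blake256_cycles(640))  # 28: a 640-bit header needs two blocks
print(blake256_cycles(256))  # 14: a 256-bit value fits in one block
```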
Keccak, which introduced the concept of a sponge, can be implemented very efficiently in hardware, which is one of the main reasons it won the SHA-3 competition. While Lyra2 uses a sponge with the BLAKE2b round function, Keccak defines its own family of round functions called Keccak-f[b], with b being one of seven values for the sponge permutation width. Lyra2REv2 uses an instance of Keccak-f[b] with b = 1600, the permutation applied in 24 rounds, and 256 bits of output hash length. We use our own sponge implementation with its corresponding control logic, along with the Keccak-f round function from GMU. Executing one round per clock cycle, our Keccak implementation can then output one hash every 24 cc.
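The relation between the permutation width, the number of rounds, and the 24 cc per hash can be checked with the sketch below. It relies on facts from the Keccak specification rather than from this paper: the seven widths are 25·2^l for l = 0, ..., 6, the number of rounds is 12 + 2l, the capacity of a 256-bit Keccak instance is 512 bits, and the padding adds at least two bits. The function names are hypothetical.

```python
import math

# The seven Keccak-f permutation widths: 25, 50, 100, 200, 400, 800, 1600.
KECCAK_WIDTHS = [25 * 2**l for l in range(7)]

def keccak_f_rounds(b: int) -> int:
    """Number of rounds of Keccak-f[b], i.e., 12 + 2*l with b = 25 * 2**l."""
    if b not in KECCAK_WIDTHS:
        raise ValueError(f"unsupported permutation width: {b}")
    l = (b // 25).bit_length() - 1
    return 12 + 2 * l

def permutation_calls(input_bits: int, output_bits: int, b: int = 1600) -> int:
    """Permutation calls needed to absorb the input and squeeze the output."""
    rate = b - 2 * output_bits                        # 1088 bits for 256-bit output
    absorb = math.ceil((input_bits + 2) / rate)       # padding adds >= 2 bits
    extra_squeeze = max(0, math.ceil(output_bits / rate) - 1)
    return absorb + extra_squeeze

# A 256-bit input needs one permutation call of 24 rounds.
print(permutation_calls(256, 256) * keccak_f_rounds(1600))  # 24
```

Since a 256-bit input and a 256-bit output both fit within a single 1088-bit rate block, one permutation call suffices, which matches one hash every 24 cc at one round per clock cycle.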
Each CubeHash round is simple, but it is applied many times. CubeHash in Lyra2REv2 performs initialization rounds followed by finalization rounds. Each round takes a single cc, so that a total of 192 cc are required to compute one hash. The difference between initialization and finalization rounds amounts to flipping a single bit of the state and is trivial to implement in hardware. We re-use the CubeHash round function from GMU, and implement round-serial control logic to output one hash every 192 cc.
Skein is based on the Threefish tweakable block cipher and uses the UBI chaining mode for hashing, as illustrated in Fig. 7. For the Lyra2REv2 algorithm, all inputs of the first UBI block are constant, hence it can be pre-computed as an initialization value. In normal operation of Skein, for an arbitrary-length input message, there is an iterative implementation of the UBI block, where the last round is slightly different as it inputs the constant zero instead of a message. However, for Skein as used in Lyra2REv2, there is an equal number of hashing rounds (taking message inputs) and finalization rounds (taking zero inputs). It is therefore useful to unroll and pipeline the remaining two UBI blocks. With one of its inputs being constant, a significant portion of the logic in the UBI block that corresponds to the finalization round can be removed. The UBI blocks in our design are derived from the GMU implementation. Our Skein implementation can output one hash every 19 cc.
Our implementation of BMW is derived from that of GMU, where the control logic has been completely replaced. BMW in Lyra2REv2 takes a 256-bit input and only has a single hash round followed by a finalization round for this input length. The finalization round can be implemented more efficiently than the more general hashing round and, since there is an equal number of each, similarly to Skein, we opt for an unrolled implementation of BMW. The unrolled implementation contains separate hash and finalization cores, with a pipeline register in between them. This architecture achieves the same throughput as two parallel round-serial BMW cores, but requires significantly fewer resources due to the optimized finalization core. Our BMW implementation outputs one hash every cc.
V-D MPSoC Implementation of Lyra2REv3
A potential MPSoC implementation of the Lyra2REv3 hashing chain would be very similar to that of Lyra2REv2 described previously. The required modifications consist of the removal of the Keccak-256 and Skein-256 blocks, the replacement of the simplified Lyra2 block with the new Lyra2MOD block described in Section IV-D, and the re-arrangement of the hashing chain as shown in Fig. 2. On the PS side, the verification software would need to be modified to use Lyra2REv3, which is also supported by cpuminer-multi.
|Hashing core||BLAKE||Keccak||CubeHash||Lyra2||Skein||BMW|
|Exec. time (cc/Hash)||28||24||192||68||19||1|
VI Implementation Results
In this section, we provide implementation results for a full Lyra2REv2 miner, notably using our simplified Lyra2 core. (We note that our VHDL code and relevant scripts for the simplified Lyra2 core are publicly available at https://github.com/Michielvb/lyra2-hw.) To the best of our knowledge, there are no other FPGA-based implementations of simplified Lyra2 cores or of Lyra2REv2 miners in the open literature. For this reason, we unfortunately cannot provide detailed comparative FPGA/MPSoC implementation results.
|Hashing core||BLAKE||Keccak||CubeHash||Lyra2||Skein||BMW||Total|
|Area (CLBs)||978 (5 866)||453 (905)||241 (5 788)||1 342 (5 369)||938 (2 815)||1 305 (1 305)||23 054 (67%)|
|LUTs||5 042 (30 252)||2 677 (5 354)||1 553 (37 265)||6 001 (24 004)||6 425 (19 274)||7 903 (7 903)||147 641 (54%)|
|Registers||1 975 (11 848)||1 607 (3 214)||1 039 (24 936)||8 288 (33 152)||2 894 (8 682)||770 (770)||87 524 (16%)|
|RAM (kbits)||0 (0)||0 (0)||0 (0)||1 548 (6 192)||0 (0)||0 (0)||7 524 (23%)|
VI-A Lyra2REv2 Miner
The Lyra2REv2 miner was implemented on a Xilinx ZCU102 Evaluation Kit, which is based on the Xilinx Zynq UltraScale+ 9EG (ZU9EG) MPSoC. The PL of the ZU9EG MPSoC contains a total of 34 260 CLBs with 274 080 LUTs, 548 160 registers, and 32.1 Mbits of BRAM. The PS of the ZU9EG MPSoC contains four ARM Cortex-A53 cores clocked at 1.2 GHz. The functionality of the Lyra2REv2 chain was verified against test vectors generated using cpuminer-multi.
The power-consumption estimation was obtained using Xilinx’s Vivado Power Estimator tool, where the timing constraints are those required for the operating frequencies of Table III, the switching activity is obtained by way of simulation with the miner processing input vectors generated using cpuminer-multi, and the post-fitted design provided to the tool meets all timing constraints.
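The utilization percentages reported in the rightmost column of Table III can be cross-checked against these device totals. The sketch below is only a sanity check; it assumes that 32.1 Mbits corresponds to 32 100 kbits, and the dictionary names are hypothetical.

```python
# Total resources used by the complete design (rightmost column of Table III).
used = {"CLBs": 23_054, "LUTs": 147_641, "Registers": 87_524, "RAM (kbits)": 7_524}

# Resources available in the PL of the ZU9EG (32.1 Mbits taken as 32 100 kbits).
available = {"CLBs": 34_260, "LUTs": 274_080, "Registers": 548_160, "RAM (kbits)": 32_100}

utilization = {name: round(100 * used[name] / available[name]) for name in used}
print(utilization)
# {'CLBs': 67, 'LUTs': 54, 'Registers': 16, 'RAM (kbits)': 23}
```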
In Table II, we show the throughput metrics for the individual hashing cores. Due to the different hashing core architectures, we use a total of 5 clock domains, namely 55.2 MHz for the BLAKE cores, 76.2 MHz for Skein, 13.3 MHz for BMW, and 200 MHz for the remainder of the sequential logic, with the exception of the RAM blocks, which additionally require a 400 MHz clock for multipumping. Clock-domain crossings are handled by the asynchronous FIFOs. From Table II, we can observe that both the execution time and the resulting individual throughput vary significantly among the hashing cores, thus making it challenging to perfectly balance the Lyra2REv2 chain. In the bottom half of Table II, we provide the number of cores per hashing step that we use in the Lyra2REv2 chain, which results in a relatively balanced pipeline that is limited by the 11.76 MHash/s combined throughput of the Lyra2 cores. It should be noted that our implementation has a total of 24 instances of the CubeHash core, as there are two CubeHash steps in the chain (cf. Fig. 6).
In Table III, we show the post-fitting area results of our Lyra2REv2 chain. Specifically, we show the average individual area results for each hashing core and the total amount for all combined instances of a core in parentheses. The rightmost column is the total for the complete design, which includes the FIFOs and the individual hashing core schedulers. We observe that the 6 BLAKE instances require the most PL CLB resources, followed closely by the 24 CubeHash instances and the 4 Lyra2 instances. Interestingly, CubeHash requires the most LUTs, even though it does not require the most CLBs. Keccak and BMW, on the other hand, are much more hardware-efficient.
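The balance of the chain can be reproduced from the clock frequencies and execution times given above. The sketch below is an approximation under stated assumptions: Keccak, CubeHash, and Lyra2 are assumed to run at the 200 MHz clock; the 24 CubeHash instances are assumed to be split evenly over the two CubeHash steps; and the instance counts for Keccak, Skein, and BMW (2, 3, and 1) are inferred from the ratio of total to per-core area in Table III rather than stated in the text.

```python
# step name: (clock in MHz, execution time in cc/Hash, cores per step)
steps = {
    "BLAKE":    (55.2,  28,  6),
    "Keccak":   (200.0, 24,  2),   # core count inferred from Table III
    "CubeHash": (200.0, 192, 12),  # per step; two steps give 24 instances
    "Lyra2":    (200.0, 68,  4),
    "Skein":    (76.2,  19,  3),   # core count inferred from Table III
    "BMW":      (13.3,  1,   1),   # core count inferred from Table III
}

# Combined throughput of each step in MHash/s.
throughput = {name: f_mhz / cc * n for name, (f_mhz, cc, n) in steps.items()}

for name, mhash in throughput.items():
    print(f"{name:9s} {mhash:6.2f} MHash/s")

bottleneck = min(throughput, key=throughput.get)
print(bottleneck, round(throughput[bottleneck], 2))  # Lyra2 11.76
```

Under these assumptions, every step sustains between roughly 11.8 and 16.7 MHash/s, and the slowest step is the group of Lyra2 cores, in line with the 11.76 MHash/s quoted in the text.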
In Table IV, we show the post-fitting power consumption results of our Lyra2REv2 chain. The PL part of our Lyra2REv2 chain consumes 9.46 W, which leads to an energy efficiency of 0.80 µJ/Hash at a throughput of 11.76 MHash/s.
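The energy-efficiency figure is simply the power divided by the throughput; since watts divided by MHash/s gives microjoules per hash, it can be checked in a few lines. The function name below is hypothetical.

```python
def energy_per_hash_uj(power_w: float, throughput_mhash_s: float) -> float:
    """Energy per hash in microjoules: W / (MHash/s) = uJ/Hash."""
    return power_w / throughput_mhash_s

print(f"{energy_per_hash_uj(9.46, 11.76):.2f} uJ/Hash")  # 0.80 uJ/Hash
```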
|Implementation||NVIDIA Titan Xp||Hash Altcoin BlackMiner F1+||Xilinx Zynq UltraScale+ 9EG|
VI-B Comparison with a GPU and a Commercial FPGA Miner
Table IV also shows a performance comparison of our work against a Lyra2REv2 miner running on a (non-overclocked) NVIDIA Titan Xp GPU and on the Hash Altcoin BlackMiner F1+ commercial (multi-)FPGA miner. The power consumption of the BlackMiner F1+ has been measured and found to be 543 W when mining a Lyra2REv2-based cryptocurrency. For the GPU, we use version 390.48 of the NVIDIA drivers for Linux and version 2.3.1 of the ccminer software compiled from scratch with version 9.1.85 of the CUDA compilation tools. The ccminer intensity option was set to 22 (out of 25), which is the largest supported value before the GPU memory runs out. All remaining parameters of the NVIDIA drivers and of the ccminer tool have their default values. We set ccminer up to mine MonaCoin using Lyra2REv2 on the zergpool.com mining pool. (We note that all mining rewards obtained during our tests were directly sent as Vertcoin to the Tip Jar wallet of the Vertcoin Developers (VnfNKCy5Aq7vZq5W9UKgMwfDLT7NrPRWZK), who are also the developers of Lyra2REv2 and Lyra2REv3.) The power and hash rates reported in Table IV are average values that are provided directly by the ccminer software.
We observe that our FPGA-based Lyra2REv2 miner is estimated to be 4.24 times more energy efficient than the GPU-based miner. Moreover, our FPGA-based Lyra2REv2 miner is also estimated to be 2.09 times more energy efficient than the BlackMiner F1+. We note, however, that due to a lack of details on the implementation of the BlackMiner F1+, it is difficult to assess whether the improved energy efficiency is due to a better implementation of the various hashing cores or if it is simply due to a difference in the employed FPGAs. Moreover, we also note that the BlackMiner F1+ is a standalone device, while the power we report for the GPU and our FPGA-based Lyra2REv2 miner is only for the computation-intensive hashing chain which, however, typically requires the vast majority of the power of a miner.
In this paper, we presented the first FPGA-based implementation of the Lyra2REv2 chained hashing algorithm, which is an ASIC-resistant hashing algorithm employed by several cryptocurrencies. To this end, we also presented the first implementation in the open literature of the simplified Lyra2 hashing algorithm used by Lyra2REv2. The key to achieving a good throughput and energy efficiency for Lyra2 is to efficiently map the memory matrix to PL RAM blocks and to pipeline the BLAKE2b round function. Our Lyra2REv2 FPGA-based architecture has an estimated energy efficiency of 0.80 µJ/Hash at a throughput of 11.76 MHash/s, which is 4.24 and 2.09 times better than an NVIDIA Titan Xp GPU and a commercial FPGA-based miner, respectively. At the same time, our FPGA-based architecture is easily reconfigurable, so that it can be adapted to future versions of Lyra2RE that may be introduced to deter ASIC-based miners.
The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan Xp GPU, and of Xilinx for the donation of a Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit.
-  M. Van Beirendonck, L.-C. Trudeau, P. Giard, and A. Balatsoukas-Stimming, “A Lyra2 FPGA core for Lyra2REv2-based cryptocurrencies,” in IEEE Int. Symp. on Circuits and Syst. (ISCAS), May 2019, to appear.
-  S. Nakamoto, “Bitcoin: A peer-to-peer electronic cash system,” 2008.
-  C. Dwork and M. Naor, “Pricing via processing or combatting junk mail,” in Advances in Cryptology (CRYPTO). Springer Berlin Heidelberg, 1993, pp. 139–147.
-  “MonaCoin.” [Online]. Available: https://monacoin.org
-  “Verge.” [Online]. Available: https://vergecurrency.com
-  “Vertcoin.” [Online]. Available: http://vertcoin.org
-  J.-P. Aumasson, L. Henzen, W. Meier, and R. C.-W. Phan, “SHA-3 proposal BLAKE, submission to NIST,” 2008. [Online]. Available: http://131002.net/blake
-  G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, “The Keccak SHA-3 submission,” 2011. [Online]. Available: http://keccak.noekeon.org/Keccak-submission-3.pdf
-  N. Ferguson, S. Lucks, B. Schneier, D. Whiting, M. Bellare, T. Kohno, J. Callas, and J. Walker, “The Skein hash function family,” 2010. [Online]. Available: http://www.skein-hash.info/sites/default/files/skein1.3.pdf
-  D. Gligoroski, V. Klima, S. J. Knapskog, M. El-Hadedy, and J. Amundsen, “Cryptographic hash function Blue Midnight Wish,” in Int. Workshop on Security and Commun. Networks, May 2009, pp. 1–8.
-  D. J. Bernstein, “CubeHash specification,” 2009. [Online]. Available: http://cubehash.cr.yp.to/submission2/spec.pdf
-  S. Tillich, M. Feldhofer, W. Issovits, T. Kern, H. Kureck, M. Mühlberghuber, G. Neubauer, A. Reiter, A. Köfler, and M. Mayrhofer, “Compact hardware implementations of the SHA-3 candidates ARIRANG, BLAKE, Grøstl, and Skein,” Cryptology ePrint Archive, Report 2009/349, 2009. [Online]. Available: https://eprint.iacr.org/2009/349
-  B. Baldwin, A. Byrne, L. Lu, M. Hamilton, N. Hanley, M. O’Neill, and W. P. Marnane, “FPGA implementations of the round two SHA-3 candidates,” in Int. Conf. on Field Programmable Logic and Applications (FPL), Aug. 2010, pp. 400–407.
-  K. Gaj, E. Homsirikamol, and M. Rogawski, “Fair and comprehensive methodology for comparing hardware performance of fourteen round two SHA-3 candidates using FPGAs,” in Cryptographic Hardware and Embedded Systems (CHES), S. Mangard and F.-X. Standaert, Eds. Berlin, Heidelberg: Springer, 2010, pp. 264–278.
-  E. Homsirikamol, M. Rogawski, and K. Gaj, “Comparing hardware performance of fourteen round two SHA-3 candidates using FPGAs,” Cryptology ePrint Archive, Report 2010/445, Dec. 2010. [Online]. Available: https://eprint.iacr.org/2010/445
-  M. A. Simplício Jr, L. C. Almeida, E. R. Andrade, P. C. dos Santos, and P. S. Barreto, “Lyra2: Password hashing scheme with improved security against time-memory trade-offs,” Cryptology ePrint Archive, Report 2015/136, 2015. [Online]. Available: https://eprint.iacr.org/2015/136
-  E. R. Andrade, M. A. Simplicio, P. S. L. M. Barreto, and P. C. F. d. Santos, “Lyra2: Efficient password hashing with high security against time-memory trade-offs,” IEEE Trans. Comput., vol. 65, no. 10, pp. 3096–3108, Oct 2016.
-  “Bitcoin developer reference,” 2019. [Online]. Available: https://bitcoin.org/en/developer-reference
-  G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, “Cryptographic sponge functions,” Tech. Report v0.1, Jan. 2011.
-  NIST, “SHA-3 standard: Permutation-based hash and extendable output functions,” FIPS Publication 202, Aug. 2015.
-  M. A. Simplicio Jr, L. C. Almeida, E. R. Andrade, P. C. dos Santos, and P. S. Barreto, “The Lyra2 reference guide,” Tech. Report v2.3.2, 2014.
-  J.-P. Aumasson, S. Neves, Z. Wilcox-O’Hearn, and C. Winnerlein, “BLAKE2: simpler, smaller, fast as MD5,” in Int. Conf. on Applied Crypto. and Netw. Security (ACNS). Springer, 2013, pp. 119–135.
-  J.-P. Aumasson, L. Henzen, W. Meier, and R. C.-W. Phan, “SHA-3 proposal BLAKE,” Tech. Report v1.3, Dec. 2010.
-  L. Ji and X. Liangyu, “Attacks on round-reduced BLAKE,” Cryptology ePrint Archive, Report 2009/238, 2009. [Online]. Available: https://eprint.iacr.org/2009/238
-  Vertcoin Development Team Blog, “Vertcoin development update,” Jan. 2019. [Online]. Available: https://medium.com/vertcoin-blog/vertcoin-development-update-january-2019-8dc39f6df210
-  C. E. LaForest and J. G. Steffan, “Efficient multi-ported memories for FPGAs,” in Ann. ACM/SIGDA Int. Symp. on FPGAs (FPGA), 2010, pp. 41–50.
-  T. Pruvot, “cpuminer-multi,” GitHub repository, 2017. [Online]. Available: https://github.com/tpruvot/cpuminer-multi
-  “Yocto Project.” [Online]. Available: https://www.yoctoproject.org
-  George Mason University - Cryptographic Engineering Research Group, “Source code for the SHA-3 round 2 candidates & SHA-2 - Hash 2011 release.” [Online]. Available: https://cryptography.gmu.edu/athena/index.php?id=source_codes
-  Xilinx Inc., “AR# 53544: Vivado power analysis - How do I simulate for accurate power analysis (SAIF)?” [Online]. Available: https://www.xilinx.com/support/answers/53544.html
-  “Hash Altcoin BlackMiner F1+,” 2019. [Online]. Available: https://www.hashaltcoin.com/en/batches/11
-  “Blackminer F1+ Review–FPGA Miner,” 2019. [Online]. Available: https://1stminingrig.com/blackminer-f1-review-fpga-miner/#Verge_XVG_Lyra2rev2_Mining_Hashrate_Power_Draw
-  T. Pruvot, “ccminer,” GitHub repository, 2019. [Online]. Available: https://github.com/tpruvot/ccminer