Randen - fast backtracking-resistant random generator with AES+Feistel+Reverie

10/04/2018
by Jan Wassenberg, et al.

Algorithms that rely on a pseudorandom number generator often lose their performance guarantees when adversaries can predict the behavior of the generator. To protect non-cryptographic applications against such attacks, we propose 'strong' pseudorandom generators characterized by two properties: computationally indistinguishable from random and backtracking-resistant. Some existing cryptographically secure generators also meet these criteria, but they are too slow to be accepted for general-purpose use. We introduce a new open-sourced generator called 'Randen' and show that it is 'strong' in addition to outperforming Mersenne Twister, PCG, ChaCha8, ISAAC and Philox in real-world benchmarks. This is made possible by hardware acceleration. Randen is an instantiation of Reverie, a recently published robust sponge-like random generator, with a new permutation built from an improved generalized Feistel structure with 16 branches. We provide new bounds on active s-boxes for up to 24 rounds of this construction, made possible by a memory-efficient search algorithm. Replacing existing generators with Randen can protect randomized algorithms such as reservoir sampling from attack. The permutation may also be useful for wide-block ciphers and hashing functions.


Code Repositories

randen

Fast backtracking-resistant random generator: https://arxiv.org/abs/1810.02227



1 Introduction

Pseudorandom number generators are very widely used. For example, searching GitHub for C++ code containing mt19937 (Mersenne Twister) returns 220,000 hits. Some of these usages will be vulnerable to unexpected correlations [1] or exploitation by attackers [2]. To avoid having to audit each call site, we propose to replace most of them with our new fast and ‘strong’ generator.

1.1 Definition of strong

In this paper, we choose to characterize a strong deterministic random generator by two properties:

  1. Even relatively powerful adversaries able to generate and store up to random outputs cannot distinguish the output from random unless they know the current state. This property is useful even for non-cryptographic applications: it implies empirical randomness, which reduces the likelihood of flaws such as correlations that might affect simulations [1]. This property also ensures adversaries cannot predict future outputs, which makes it harder for them to trigger worst cases in randomized algorithms.

  2. Past outputs cannot be reconstructed even after the state is compromised. This is known as enhanced backward secrecy [3], forward security [4] and backtracking resistance [5]. We use the last of these names because it is the clearest. This may not be necessary for simulation applications, but it prevents adversaries from discovering past behavior, e.g. which inputs were sampled.

The notion of ‘robustness’ from the literature also requires generators to recover security after a state compromise [4]. This is typically achieved by periodically reseeding from an entropy source. However, our applications require at least the option of deterministic results for reproducibility and debugging. Our definition of ‘strong’ describes the achievable security in this model.

1.2 Existing generators

RC4

was a popular stream cipher designed in 1987, but attacks with practical complexity have recently been published [6].

ISAAC

was published in 1996 and has some similarities with RC4. It has many weak initial states, though the resulting biases can be avoided with a modified algorithm [7]. Despite the large 1024 byte state, ISAAC is relatively fast and widely used [8][p. 200].

ChaCha20

is an ARX-based hash/stream cipher used by OpenBSD arc4random and Linux 4.8 /dev/urandom. It is an order of magnitude slower than some general-purpose generators [9][p. 41]; a similar result is observed in our benchmark. Note that a ChaCha20 generator reportedly fails one part [8][p. 205] of the dieharder test of empirical randomness.

Tyche-i

is based on ChaCha and reaches 1.5 cycles per byte [10]. However, its authors discovered some short cycles and recommended a workaround that is seven times slower [11].

Mersenne Twister

is a popular general-purpose generator included in the C++ standard library. It is fast but not strong: the generated numbers are an easily inverted bijection (‘tempering function’) of a portion of the state, so adversaries learn the entire state after generating one full buffer.
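The state-recovery weakness described above can be made concrete. The following is a minimal sketch, not code from the paper: `Untemper`, `ClonedMt` and `CloneFromOutputs` are hypothetical names, and the constants are those of the standard MT19937 definition. It inverts the tempering function on 624 consecutive outputs and then predicts every later output of the observed generator.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>

// Inverts the MT19937 tempering function (hypothetical helper name).
inline uint32_t Untemper(uint32_t y) {
  y ^= y >> 18;                  // self-inverse because 2 * 18 >= 32
  y ^= (y << 15) & 0xEFC60000u;  // self-inverse for this particular mask
  uint32_t t = y;
  // Fixed-point iteration: each pass recovers seven more low bits.
  for (int i = 0; i < 5; ++i) t = y ^ ((t << 7) & 0x9D2C5680u);
  y = t;
  // Fixed-point iteration: each pass recovers eleven more high bits.
  for (int i = 0; i < 3; ++i) t = y ^ (t >> 11);
  return t;
}

// Minimal MT19937 core that continues from a recovered state.
struct ClonedMt {
  std::array<uint32_t, 624> x;
  std::size_t index = 624;  // forces a twist before the first output

  void Twist() {
    for (std::size_t i = 0; i < 624; ++i) {
      const uint32_t y =
          (x[i] & 0x80000000u) | (x[(i + 1) % 624] & 0x7FFFFFFFu);
      x[i] = x[(i + 397) % 624] ^ (y >> 1) ^ ((y & 1u) ? 0x9908B0DFu : 0u);
    }
  }

  uint32_t Next() {
    if (index == 624) {
      Twist();
      index = 0;
    }
    uint32_t y = x[index++];
    y ^= y >> 11;
    y ^= (y << 7) & 0x9D2C5680u;
    y ^= (y << 15) & 0xEFC60000u;
    y ^= y >> 18;
    return y;
  }
};

// Observes 624 consecutive outputs of `victim` and rebuilds its state.
inline ClonedMt CloneFromOutputs(std::mt19937& victim) {
  ClonedMt clone;
  for (uint32_t& word : clone.x) {
    word = Untemper(static_cast<uint32_t>(victim()));
  }
  return clone;
}
```

After `CloneFromOutputs(victim)` returns, `clone.Next()` is expected to match `victim()` for every subsequent call, without the attacker ever seeing the seed.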

xorshift128+ and xoroshiro128+

[12] fail a PractRand test due to lack of randomness in the lower bit [13]; the latter is also easily distinguishable from random [14].

Philox

is a noncryptographic counter-based generator with an iterated bijection based on a Feistel network using double-width multiplications [15]. It passes the TestU01 suite by construction [16]. GPU implementations achieve very high throughput [15], but our benchmarks indicate our hardware AES-based permutation is twice as fast on CPUs.

PCG

includes an extension that XORs the output of a 128-bit generator with one of 32 table entries [9]. Periodically scrambling the table using entropy from the state might be sufficient for backtracking resistance, which the previously mentioned generators lack. However, PCG makes no concrete security claims [17]. Although its statistical quality appears good, we are unaware of any existing proofs of indistinguishability and backtracking resistance, so PCG is not known to be strong.

AES-CTR

is a block-cipher mode that can be used as a generator by enciphering all-zero plaintexts [15]. Although indistinguishable from random, this lacks backtracking resistance. Once the attacker knows the current counter value and key, they can reconstruct all prior outputs.

AES-CTR-DRBG

is a strong generator specified in NIST 800-90A [5]. Similarly to Fortuna [18], it periodically re-keys based on the current output. However, this is about five times slower than AES-CTR, and too slow for general purpose use (see Section 7). Note that relaxing the re-keying requirements, e.g. only after every 100 blocks, greatly reduces the overhead and could yield a faster generator. However, applications may not be willing to accept exposing several thousand prior outputs when the state leaks.

Fast-key-erasure RNGs

are a more efficient alternative to CTR-DRBG without the potential for unsafe usage [19]. Bernstein reiterates the importance of backtracking resistance and proposes to generate a buffer of random bits using a stream cipher, immediately overwriting its key with part of the buffer, and returning the rest. However, there are two integration issues with this approach. If the stream cipher relies on AVX2 or AVX-512 SIMD for speed [20], there is a risk of slowing down the entire application. Frequency throttling has been identified as a cause [21], but this only applies to ChaCha20-Poly1305. Salsa/ChaCha are unaffected because they only require low-power operations, whereas the multiplications in Poly1305 trigger throttling. Instead, we are concerned about another AVX2 implementation detail: a warmup period of about 60,000 cycles triggered by the first AVX2 instruction within a 675 µs window. During this time, SIMD instructions are considerably slower; Haswell CPUs can even stall for 10 µs due to their internal voltage regulator. Thus, sporadic use of stream ciphers relying on AVX2/AVX-512 can slow down the entire application, or even unrelated jobs running on the same socket. The second integration issue is buffer size. Stream ciphers are considerably slower for small buffers [22], which are preferred by applications and library writers because generators may be short-lived or only used to produce a few numbers. Our proposed approach avoids both issues. First, 128-bit AES hardware runs at full frequency without warmup, and is performance-portable to other 128-bit SIMD architectures – see our measurements in Section 7. Second, our Feistel permutation does not require a buffer larger than its 256-byte size.
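The fast-key-erasure pattern described above can be sketched as follows. This is an illustration under stated assumptions, not Bernstein's code: `ToyKeystream` is a placeholder built from SplitMix64-style mixing and is NOT a cipher, the 256-bit key size is an assumption, and all names are hypothetical. A real implementation would substitute a stream cipher such as ChaCha20.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Key = std::array<uint64_t, 4>;  // 256-bit key (size is an assumption)

// PLACEHOLDER keystream, NOT a cipher: stands in for a real stream cipher.
inline std::vector<uint64_t> ToyKeystream(const Key& key, std::size_t words) {
  uint64_t counter = 0;
  for (const uint64_t k : key) counter = (counter ^ k) * 0x9E3779B97F4A7C15ull;
  std::vector<uint64_t> out(words);
  for (uint64_t& w : out) {
    counter += 0x9E3779B97F4A7C15ull;
    uint64_t z = counter;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
    w = z ^ (z >> 31);
  }
  return out;
}

class FastKeyErasureRng {
 public:
  explicit FastKeyErasureRng(const Key& key) : key_(key) {}

  // Returns `words` output words. The old key is overwritten before any
  // output leaves this function, so a later state leak exposes only the
  // new key.
  std::vector<uint64_t> Generate(std::size_t words) {
    assert(words > 0);
    std::vector<uint64_t> buffer = ToyKeystream(key_, key_.size() + words);
    // The first part of the buffer becomes the next key...
    std::copy(buffer.begin(), buffer.begin() + key_.size(), key_.begin());
    // ...and only the remainder is returned to the caller.
    buffer.erase(buffer.begin(), buffer.begin() + key_.size());
    return buffer;
  }

  const Key& key() const { return key_; }  // exposed for inspection only

 private:
  Key key_;
};
```

The buffer-size concern raised above is visible here: every call pays for `key_.size()` extra keystream words, so small requests spend most of their time re-keying.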

We are unaware of any existing generator that is both strong and fast in real-world applications.

1.3 Intended applications

We argue that the default choice of random generators should be ‘strong’. This makes it harder to attack randomized algorithms and trigger skewed samples or worst-case performance. Security-critical applications such as generating cryptographic keys should continue to use well-studied and trusted cryptographic generators such as Fortuna [18]. However, these are too slow to be accepted for general use. For example, we have tens of thousands of high-end CPU cores occupied by general-purpose random generators. Thus, the speed of our proposed generator is important. Because Mersenne Twister is commonly used in C++ applications (see GitHub usage above), we assume its level of performance is generally acceptable. The Randen generator is designed to reach similar performance.

Note that applications that require many random numbers without any concern for security (such as Monte-Carlo simulations) may still prefer a faster but weaker generator such as pcg32 [17]. For other applications, we suggest using Randen because it is strong and tends to outperform Mersenne Twister (see Section 7).

1.4 Contributions

This paper makes four contributions:

  • Introducing Randen (RANDen = RANDom number generator, or ‘beetroots’ in Swiss German), a new generator based on Reverie [23] instantiated with a generalized Feistel structure [24] (Section 2).

  • Arguing that Randen is ‘strong’, and explaining why this is important even for non-cryptographic applications (Section 6).

  • Showing that existing secure generators are too slow for general purpose use (Section 7). By contrast, Randen outperforms Mersenne Twister in some real-world use cases despite providing a higher level of security. To the best of our knowledge, Randen is the fastest ‘strong’ software generator.

  • Proposing an efficient algorithm for lower-bounding active s-boxes in 16-branch generalized Feistel networks with SPSP-type round functions (Appendix A). We provide results for up to 24 rounds, whereas prior work reaches 18 rounds [25].

Absence of backdoors

We, the designers of Randen, faithfully declare that we have not inserted any weaknesses in this algorithm/implementation, nor have we discovered any weakness not described in this paper.

Acknowledgment

Thanks to Jeffrey Lim, Titus Winters, Chandler Carruth and Daniel Lemire for suggestions and technical help on improving the benchmarks. We also appreciate the many clear and accessible posts on practical random number generation topics by Melissa O’Neill (author of PCG).

2 Specification

Randen is an instantiation of Reverie, a sponge-like construction that scrambles its internal state using a permutation [23]. To avoid the ambiguities of pseudocode, we describe its parts using standard C++11, plus explanatory text. The permutation operates on 128-bit pieces of the state called ‘branches’. This corresponds to the block size of AES. For convenience, we assume the availability of a platform-specific 128-bit SIMD vector type V with associated Load, Store and AES functions.

2.1 Initialization

Randen operates on a 2048-bit state, of which the first 128 bits are the inaccessible ‘inner’ portion corresponding to the ‘capacity’ of a sponge. The remaining ‘outer’ bits are the generated random bits. To simplify initialization of the state, we partition it into 32 64-bit integers, two per 128-bit branch. Zero-initializing the state yields a valid generator, but applications will typically set some of its outer bits to arbitrary user-specified ‘seed’ values. Providing more than 128 seed bits may help against multi-user attacks involving precomputation. We suggest a 256-bit seed, specified as four 64-bit seed integers. For more thorough diffusion, the seeds should be placed into ‘even-numbered’ (according to zero-based index) branches of the state, e.g. the third (with zero-based indices 4, 5 in the array of 64-bit integers) and fifth.

  uint64_t state[32];
  memset(state, 0, sizeof(state));
  state[4] = seed0;
  state[5] = seed1;
  state[8] = seed2;
  state[9] = seed3;
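The mapping from branches to 64-bit array indices can be written out explicitly. The sketch below wraps the listing above in a function; `FirstU64OfBranch` and `SeededState` are hypothetical names introduced here for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBranches = 16;     // 128-bit branches
constexpr std::size_t kU64PerBranch = 2;  // two 64-bit integers per branch
using State = std::array<uint64_t, kBranches * kU64PerBranch>;

// Index of the first 64-bit integer belonging to a zero-based branch.
constexpr std::size_t FirstU64OfBranch(std::size_t branch) {
  return branch * kU64PerBranch;
}

// Zero-initializes the state and places a 256-bit seed into the third and
// fifth branches (zero-based branch indices 2 and 4), as suggested above.
inline State SeededState(const std::array<uint64_t, 4>& seed) {
  State state{};  // all zero; branch 0 is the inner portion
  const std::size_t third = FirstU64OfBranch(2);
  const std::size_t fifth = FirstU64OfBranch(4);
  assert(third == 4 && fifth == 8);
  state[third] = seed[0];
  state[third + 1] = seed[1];
  state[fifth] = seed[2];
  state[fifth + 1] = seed[3];
  return state;
}
```

Both seeded branches are even-numbered and neither overlaps branch 0, so the inner 128 bits start at zero and the seed lands entirely in the outer portion.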

2.2 Permutation

Randen’s Permute is a generalized type-2 Feistel network [26] with 16 branches of 128 bits.

Figure 1: One round of a four-branch type-2 generalized Feistel network with a block shuffle. F is a 128-bit permutation consisting of two AES rounds, described below.

It consists of two layers (Figure 1). The first (denoted RoundFunctions) XORs odd-numbered branches with a function F of their even-numbered neighbors. F is the same as in Simpira v2 [27]: two rounds of AES. The first round’s constant, denoted key, is unique for every instance of F. This avoids any potential weaknesses due to weak or structured round constants, e.g. in Simpira v1 [27][p. 12]. We will discuss the size and purpose of the constants in the description of Permute below. The second constant is zero, which enables an optimization below.

  // Round function: two-round AES with a unique round constant.
  // `zero` denotes the all-zero 128-bit round constant.
  V F(const V even, const V key) {
    const V f1 = AES(even, key);
    return AES(f1, zero);
  }

For every adjacent pair of even and odd branches, RoundFunctions loads the two corresponding 128-bit pieces of the state and overwrites odd with F(even, key) XOR odd.

  const V* RoundFunctions(const V* keys) {
    for (int branch = 0; branch < 16; branch += 2) {
      const V even = Load(state, branch);
      const V odd = Load(state, branch + 1);
      const V new_odd = F(even, *keys++) ^ odd;
      Store(new_odd, state, branch + 1);
    }
    return keys;
  }

Note that the XOR can be computed for free because the last step of AES simply XORs with its round constant. We change the second AES round constant in F from zero (which has no effect) to odd. The key passed to each call to F comes from an array of eight AES keys.

The second layer of the Feistel network (denoted BlockShuffle) rearranges the 128-bit branches into the prescribed order [24][p. 21, no. 10]. We permute the state such that the previous branch 7 comes first, followed by 2, 13, 4 and so on (see shuffle below):

  void BlockShuffle() {
    uint64_t source[32];
    memcpy(source, state, sizeof(source));
    constexpr int shuffle[16] = {
      7, 2, 13, 4, 11, 8, 3, 6, 15, 0, 9, 10, 1, 14, 5, 12};
    for (int branch = 0; branch < 16; ++branch) {
      const V v = Load(source, shuffle[branch]);
      Store(v, state, branch);
    }
  }

Together, these two layers constitute one round of a generalized Feistel network. The final permutation Permute consists of 17 rounds. Each invocation of the RoundFunctions layer requires eight AES round constants, for a total of 2176 bytes.

  void Permute() {
    // Round keys for one AES per Feistel round and branch.
    const V* keys = Keys();
    for (int round = 0; round < 17; ++round) {
      keys = RoundFunctions(keys);
      BlockShuffle();
    }
  }
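The structure of the listings above can be exercised end to end with a toy model. This is only a sketch under loudly stated assumptions: branches are 64 bits instead of 128, and `ToyRound` is a placeholder mixing step that, like an AES round, ends by XORing its key; it has none of the security properties of AES. `InverseBlockShuffle` and `InversePermute` are hypothetical helpers added to show that the construction is invertible for any round function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Branch = uint64_t;  // toy width; Randen uses 128-bit branches
using ToyState = std::array<Branch, 16>;
constexpr int kRounds = 17;
constexpr int kKeysPerRound = 8;
using ToyKeys = std::array<Branch, kRounds * kKeysPerRound>;
// With real 16-byte AES round keys this is the 2176 bytes quoted above.
static_assert(kRounds * kKeysPerRound * 16 == 2176, "round key bytes");

constexpr std::array<int, 16> kShuffle = {
    {7, 2, 13, 4, 11, 8, 3, 6, 15, 0, 9, 10, 1, 14, 5, 12}};

// PLACEHOLDER for one AES round: fixed mixing, then XOR with the key last.
inline Branch ToyRound(Branch x, Branch key) {
  x ^= x >> 29;
  x *= 0xBF58476D1CE4E5B9ull;
  x ^= x >> 32;
  return x ^ key;
}

inline Branch F(Branch even, Branch key) {
  return ToyRound(ToyRound(even, key), 0);
}

inline void RoundFunctions(ToyState& s, const Branch* keys) {
  for (int b = 0; b < 16; b += 2) {
    // Equals F(even, key) ^ odd: the XOR is folded into the second round
    // by passing `odd` as its key instead of zero.
    s[b + 1] = ToyRound(ToyRound(s[b], *keys++), s[b + 1]);
  }
}

inline void BlockShuffle(ToyState& s) {
  const ToyState src = s;
  for (int b = 0; b < 16; ++b) s[b] = src[kShuffle[b]];
}

inline void InverseBlockShuffle(ToyState& s) {
  const ToyState src = s;
  for (int b = 0; b < 16; ++b) s[kShuffle[b]] = src[b];
}

inline void Permute(ToyState& s, const ToyKeys& keys) {
  for (int r = 0; r < kRounds; ++r) {
    RoundFunctions(s, keys.data() + r * kKeysPerRound);
    BlockShuffle(s);
  }
}

// RoundFunctions is its own inverse because even branches are unchanged.
inline void InversePermute(ToyState& s, const ToyKeys& keys) {
  for (int r = kRounds - 1; r >= 0; --r) {
    InverseBlockShuffle(s);
    RoundFunctions(s, keys.data() + r * kKeysPerRound);
  }
}
```

Applying `Permute` followed by `InversePermute` with the same keys should restore the original state, which is what makes the construction a permutation regardless of the choice of F.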

The keys can be a fixed array of nothing-up-my-sleeve numbers shared by all generators. However, our indistinguishability result (Section 6.1) assumes a keyed/secret permutation, otherwise attackers could distinguish the permutation from random by querying it and verifying the expected result. Applications running on secure servers may reasonably expect that attackers do not have access to the key. For additional safety, applications could instead generate the keys at startup using a stream cipher such as ChaCha20 keyed with 256 bits obtained from a trusted source such as the operating system. Note that the generator remains backtracking-resistant (Section 6.2) even if the keys are leaked.

2.3 Generator

Now that we have defined the permutation, Reverie’s Generate produces random outer bits by invoking Permute on the state and XORing the inner bits with the value they had before the permutation, which cannot be reversed by an attacker with knowledge of the current state [23]:

  void Generate() {
    const uint64_t prev_inner[2] = { state[0], state[1] };
    Permute();
    // Ensure backtracking resistance.
    state[0] ^= prev_inner[0];
    state[1] ^= prev_inner[1];
  }

As a result, the last 1920 bits of state are uniform random and available for use. In practice, the generator is packaged as a C++ ‘random engine’ that returns 32 or 64-bit bundles of random bits and calls Generate again once all remaining bits have been consumed.
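The packaging as a ‘random engine’ can be sketched as a small buffered wrapper. This is an illustration only: `SpongeEngine` is a hypothetical name, it is not the released Randen class, and `ToyMix` is a placeholder that stands in for Permute and offers no security. The sketch shows how the 30 outer 64-bit words are handed out while the two inner words are never returned.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

constexpr std::size_t kStateU64 = 32;  // 2048-bit state
constexpr std::size_t kInnerU64 = 2;   // 128-bit inner portion
constexpr std::size_t kOuterU64 = kStateU64 - kInnerU64;
static_assert(kOuterU64 * 64 == 1920, "outer bits available per Generate");

// PLACEHOLDER for Permute: simple mixing, NOT the Randen permutation.
inline void ToyMix(std::array<uint64_t, kStateU64>& s) {
  for (std::size_t i = 0; i < kStateU64; ++i) {
    uint64_t x = s[i] + 0x9E3779B97F4A7C15ull * (i + 1);
    x ^= x >> 30;
    x *= 0xBF58476D1CE4E5B9ull;
    x ^= x >> 27;
    s[i] = x ^ s[(i + 1) % kStateU64];
  }
}

class SpongeEngine {
 public:
  using result_type = uint64_t;
  static constexpr result_type min() { return 0; }
  static constexpr result_type max() {
    return std::numeric_limits<result_type>::max();
  }

  explicit SpongeEngine(uint64_t seed0 = 0, uint64_t seed1 = 0) {
    state_.fill(0);
    state_[4] = seed0;  // third branch, as in Section 2.1
    state_[5] = seed1;
  }

  result_type operator()() {
    if (next_ == kStateU64) {  // all outer words consumed
      Generate();
      next_ = kInnerU64;       // skip the inner words
    }
    return state_[next_++];
  }

  std::size_t generate_calls() const { return generate_calls_; }

 private:
  void Generate() {
    const uint64_t prev_inner[2] = {state_[0], state_[1]};
    ToyMix(state_);
    state_[0] ^= prev_inner[0];  // feed-forward of the inner portion
    state_[1] ^= prev_inner[1];
    ++generate_calls_;
  }

  std::array<uint64_t, kStateU64> state_;
  std::size_t next_ = kStateU64;  // empty buffer: first call generates
  std::size_t generate_calls_ = 0;
};
```

Because the class exposes `result_type`, `min()`, `max()` and `operator()`, it can be passed to standard distributions such as `std::uniform_int_distribution` in the same way as `std::mt19937`.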

3 Rationale

Here we briefly justify design decisions.

The AES block cipher

is well-understood and often hardware-accelerated. Intel’s AESNI instructions [28] are five to ten times faster than optimized software implementations [29, 30]. This implies a software-only Randen would be unacceptably slow (and likely vulnerable to side-channel attacks). On CPUs without hardware AES, it may be faster to replace AES with SIMD-friendly permutations such as ChaCha [22]. However, most modern CPUs have AES hardware, including POWER (VCIPHER [31]) and ARMv8 (AESE [32]).

Two AES rounds

are necessary for full-bit diffusion [27] and more efficient than a single round in terms of the ratio of active s-boxes (in a standard type-2 Feistel with four branches) [33].

Dense and independent AES round keys

ensure that an AES round breaks the symmetry of plaintext with all-equal columns. We use unique keys to rule out attacks similar to those on Haraka v1 [34] and Simpira v1 [27]. This requires a total of 2176 bytes, which is somewhat excessive, but the keys are typically hardcoded (but not necessarily public) nothing-up-my-sleeve numbers and there is little cost to loading unique keys because they easily fit in the L1 cache.

Type-2 generalized Feistel networks

are often used to construct large permutations from smaller blocks. These constructions are ‘sound’ in the sense that they are strong pseudorandom permutations after sufficient rounds of a pseudorandom function [24]. In contrast to the variants of Simpira v2 [27], they enable good performance without relying on multiple independent inputs to keep the CPU pipeline filled.

An improved block shuffle

for the generalized Feistel network reaches full sub-block diffusion (i.e. each block depends on every other input block) much sooner than traditional cyclic shifts [24]. It also reduces vulnerability to sliced-biclique [35] and integral attacks [36][p. 226].

16-branch generalized Feistel networks

are the largest for which the diffusion properties are known [35]. Larger branch counts have two related benefits without requiring multiple independent inputs like Simpira [27]. First, they enable parallel evaluation of the round functions, which hides the long latency of AESENC [37]. Second, they can benefit from increased hardware parallelism such as recently announced quadruple-AES hardware [38][p. 2-14].

A 2048-bit permutation

is a natural result of 16-branch Feistel with 128-bit AES blocks. Larger states cannot be accommodated within the 16 SSE4 registers.

17 Feistel rounds

improve the diffusion relative to the minimum of 16 rounds required for Feistel block diffusion (propagating input differences to each branch of the state) [24].

Reverie

is an efficient construction for backtracking-resistant generators. It avoids the heavy rekeying cost of CTR-DRBG and exposes fewer prior outputs than an only periodically re-keyed stream cipher.

Reseeding

the state from external entropy sources periodically is beyond the scope of this paper because our applications typically require reproducible sequences of random numbers.

4 Implementation details

We implement the algorithm in C++ using SIMD intrinsics that are available on current Intel, AMD and POWER CPUs. The final optimized code is quite short (only about 150 lines) and very similar to the straightforward listings above! If state is a restrict-qualified pointer, Clang understands that BlockShuffle simply renames memory locations. We have released this code [39] under an open-source license so our results can be reproduced.

In the rest of this section, we study how well the algorithm maps to the Haswell and Skylake microarchitectures. Despite the high-level implementation, the measured Permute throughput is within 5% of the lower bound (one AESENC per cycle). Intel’s IACA simulator [40] reports the code is bottlenecked by the ‘frontend’ in addition to the expected port 5 (AESENC), but still claims its throughput should exactly match the lower bound. Note that IACA does not model memory accesses, and the limited set of 16 SSE4 registers necessitates many spills to memory, so the 5% difference is probably due to loads. However, we also investigate the alleged frontend limitation using performance counters captured via the Linux perf utility.

Is decode throughput the bottleneck? This can be a problem because the 16 byte fetch window (unchanged since Pentium Pro) is too small for large SIMD instructions (nine bytes for AESENC with a 32-bit offset). Two such instructions do not fit in a fetch window, so only one can decode per cycle. However, Sandy Bridge and later Intel CPUs include a decoded instruction cache (DSB), which is very helpful because it avoids the 16-byte limitation. Indeed we find 99.9% of µops are delivered from the DSB. However, the effective DSB capacity is lower than the documented maximum of 1536 µops. Fully unrolling the Feistel rounds generates about 750 µops and causes a 10x increase in DSB misses. Unrolling by a factor of two generates good code.

Is microcode a factor? In 2012 there was speculation that AESENC uses the microcode sequencer (MSROM) [41]. We can confirm this is not the case (on Haswell) because IDQ.MS_UOPS (79_30) is zero. Given the low values of IDQ_UOPS_NOT_DELIVERED.CORE (9C_01), we can conclude the bottleneck does not involve the decoders.

What about other stalls? LD_BLOCKS_PARTIAL.ADDRESS_ALIAS (07_01) detects 4K aliasing between compiler-generated spills to the stack and loads of round keys. This is difficult to reliably avoid, but only affects 1% of all instructions. RESOURCE_STALLS (A2_FF) affect 18% of all instructions; 90% of these are waiting for the reservation station. We speculate that this is due to a lack of physical registers and/or waiting for loads. Either way, the problem should disappear on Skylake. With its 32 vector registers, we can devote 8 to the AES inputs and outputs (updated in-place) and 8+8 to hold the XOR inputs for the next two rounds, thus entirely avoiding spills. In summary, it appears difficult to further optimize the implementation. We emphasize that the compiler and out-of-order CPU extract good performance (within 5% of the lower bound) from our minimally annotated high-level language implementation.

5 Smoke test

Every random generator should avoid ‘recognizable patterns’, which can cause systematic errors in applications such as simulations [1]. In the next section, we argue Randen is computationally indistinguishable from random, which implies the non-existence of any patterns. However, general-purpose generators are unable to furnish such arguments, so they instead apply statistical tests to detect obvious flaws. Several batteries of tests are well-known and often used for verifying empirical randomness. For completeness, we also apply them to Randen. We begin with BigCrush from TestU01 version 1.2.3 [16]. Its interface requires a small wrapper around the raw generator [42]:

  randen::Randen<uint32_t> engine;
  uint32_t Rand32() { return engine(); }
  int main(int, char*[]) {
    // The TestU01 C API takes a non-const char* name.
    unif01_Gen* gen = unif01_CreateExternGenBits(const_cast<char*>("R"), Rand32);
    bbattery_BigCrush(gen);
    unif01_DeleteExternGenBits(gen);
    return 0;
  }

All 160 tests pass for PCG [9] and Randen with original and inverted bits. By contrast, BigCrush reports two failures when testing MT19937 and one near-failure for AES-CTR (p-value of 0.000092, but it did not recur in subsequent test(s)) [16].

We also test Randen with the current version 0.93 of PractRand [43]. To avoid file or pipe overhead, we integrate Randen into the DummyRNG class by having its raw32 function return Randen’s output. The test battery is invoked with default settings via ./RNG_test dummy -multithreaded. Running all tests up to the upper limit of 32 terabytes reports two ‘unusual’ p-values (0.9921 and 0.0013). Note that pcg64 also leads to an unusual p-value (0.0016) in a much smaller test, and failures have more extreme p-values, e.g.  for Mersenne Twister [13]. We conclude that Randen passes state of the art tests of empirical randomness about as well as pcg64 and better than Mersenne Twister.

6 Security

Some developers are unaware that randomized applications can be vulnerable to adversaries and we have observed reluctance to sacrifice speed for security. It is expensive to audit tens of thousands of random generator usages to determine the appropriate security/speed tradeoff. We therefore propose to provide a higher baseline level of security than existing general-purpose generators. To gain user acceptance, we ensure our generator remains within the performance envelope of Mersenne Twister. What security guarantees can we provide? In this paper, a ‘strong’ generator is characterized by two properties: computational indistinguishability from random, and backtracking resistance. In the following, we show that these hold for Randen.

6.1 Indistinguishability

‘Indistinguishable from random’ is a very strong property often used in cryptography. We emphasize that security-critical applications should continue to use trusted cryptographically secure generators. However, other applications also benefit from a strong generator. Indistinguishability implies the output is unpredictable, which prevents adversaries from triggering worst case execution time in randomized algorithms such as Quicksort (quadratic rather than linearithmic time), or influencing the samples drawn by randomized online sampling algorithms.
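To make the sampling example concrete, here is a minimal sketch of reservoir sampling (the classic ‘Algorithm R’) written against a generic random engine. It is not code from the paper; `ReservoirSample` is a hypothetical name. An adversary who can predict the engine's output knows in advance which stream positions will be kept, which is why the choice of engine matters.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Keeps a uniform random sample of up to k items from `stream`.
// `Urbg` is any standard-conforming uniform random bit generator.
template <typename T, typename Urbg>
std::vector<T> ReservoirSample(const std::vector<T>& stream, std::size_t k,
                               Urbg& rng) {
  std::vector<T> reservoir;
  reservoir.reserve(k);
  for (std::size_t i = 0; i < stream.size(); ++i) {
    if (i < k) {
      reservoir.push_back(stream[i]);  // fill phase
      continue;
    }
    // Item i replaces a random slot with probability k / (i + 1).
    std::uniform_int_distribution<std::size_t> pick(0, i);
    const std::size_t j = pick(rng);
    if (j < k) reservoir[j] = stream[i];
  }
  return reservoir;
}
```

The engine is a template parameter, so a call site that currently passes `std::mt19937_64` could pass any other engine with the same interface without changing the algorithm.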

We now apply a standard computational indistinguishability argument. Suppose a deterministic adversary is given query access to either a real or ideal (i.e. uniform random) generator and returns 0 or 1 to indicate which generator it is interacting with. We assume an adversary cannot issue more than permutation queries. Then, a real generator is computationally indistinguishable from random if the distinguishing advantage (absolute difference in probability of any such adversary returning 1 when given the ideal vs. real generator) is negligible.

Lemma 1.

In the ideal permutation model, if the Randen permutation is replaced with an ideal permutation, Randen is indistinguishable from random by adversaries limited to permutation queries.

Proof.

Randen is an instantiation of Reverie, which guarantees that the best possible attack must guess its inner bits [23][p. 12]. With 128 inner bits, that requires an average of $2^{127}$ evaluations of the Randen permutation, which is beyond the capabilities of our assumed adversary. ∎

There are two practical difficulties with the ideal permutation model. First, attackers can trivially distinguish a Randen permutation with known key by simply querying it. In this section, we need to assume the permutation is keyed. Second, a truly random permutation is impractical because its representation requires bits. We could instead argue that the generalized Feistel structure of the Randen permutation ensures it would be indistinguishable from random if its round functions were pseudorandom [24]. However, our round function consists of two rounds of AES, and up to three are efficiently distinguishable from random [44]. We could construct a round function that is believed to be indistinguishable from a random function by XORing two permutations [45] that are widely recognized to be secure, such as 10 rounds of AES. Unfortunately this would be about ten times slower. Instead, we will study known attacks on the actual Randen permutation.

  Rounds  Active functions  |  Rounds  Active functions
  ------  ----------------  |  ------  ----------------
       1                 0  |      13                27
       2                 1  |      14                30
       3                 2  |      15                32
       4                 3  |      16                35
       5                 4  |      17                36
       6                 6  |      18                39
       7                 8  |      19                41
       8                11  |      20                44
       9                14  |      21                45
      10                18  |      22                48
      11                22  |      23                50
      12                24  |      24                53
Table 1: Lower bound on active functions after a given number of rounds of a 16-branch type-2 Feistel network with improved block shuffle. Derived via exhaustive search in Appendix A.

The security of Substitution-Permutation (SP) networks such as AES is often established by showing sufficiently many s-boxes are active to resist differential and linear attacks [27]. Such results are also available for generalized Feistel networks, but they are specific to the number of branches and type of round function. We use 16 branches and SPSP-type functions (two rounds of AES). Existing results are available for either situation, but not both. 6 rounds of SPSP functions in a 4-branch type-2 network guarantee 6 differentially active functions [33]. 17 rounds of SP functions in a 16-branch network with improved diffusion guarantee 78 active s-boxes [25][p. 226]. Later in this section, we provide new results for 16-branch networks with SPSP functions.

Note that 16-branch Feistel networks have a maximum impossible differential characteristic of 14 rounds [24], and the sliced biclique technique only attacks 15 rounds [35]. A recent attempt to find integral distinguishers reports ‘difficulty’ for such large branch counts [36][p. 219]. We compute new lower bounds for active functions in 16-branch type-2 Feistel networks via exhaustive search. Details of the algorithm are deferred to Appendix A. The resulting lower bounds are given in Table 1. Note that we are able to compute bounds for up to 24 rounds, whereas prior results for 16-branch Feistel networks only extend to 18 rounds [25]. A meet-in-the-middle attack [46] splits a permutation into three parts. Hence, we consider the number of active functions after six rounds.

Theorem 1.

The probability of differential characteristics and the correlation of linear characteristics of six rounds of the Randen permutation are at most $2^{-180}$ and $2^{-90}$, respectively.

Proof.

Per Table 1, at least six functions are active after six rounds. Each active SPSP function provides at least $\mathcal{B}$ active s-boxes [33], where $\mathcal{B}$ is the branch number of the SP permutation layer, which is 5 for AES. Thus, at least $6 \cdot 5 = 30$ s-boxes are active. Each active AES s-box contributes a maximum differential probability of $2^{-6}$ and correlation amplitude of $2^{-3}$ [44]. Thus, the overall differential probability and linear correlation are at most $2^{-6 \cdot 30} = 2^{-180}$ and $2^{-3 \cdot 30} = 2^{-90}$. ∎

Note that Simpira’s security arguments only require 25 active s-boxes [27]. Also, Table 1 indicates there are at least $36 \cdot 5 = 180$ active s-boxes after our 17 rounds with SPSP functions. By contrast, the prior bound for SP functions only guarantees 78 active s-boxes after 17 rounds [25][p. 226].
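The arithmetic behind these counts can be checked mechanically. The sketch below encodes Table 1 and multiplies by the AES branch number of 5. The per-s-box exponents (maximum differential probability $2^{-6}$, maximum correlation amplitude $2^{-3}$) are the standard figures for the AES s-box and are an assumption of this sketch; all function names are hypothetical.

```cpp
#include <array>
#include <cassert>

// Table 1: lower bound on active functions after 1..24 rounds.
constexpr std::array<int, 24> kActiveFunctions = {
    {0,  1,  2,  3,  4,  6,  8,  11, 14, 18, 22, 24,
     27, 30, 32, 35, 36, 39, 41, 44, 45, 48, 50, 53}};

constexpr int kAesBranchNumber = 5;  // active s-boxes per active function

// Lower bound on active s-boxes after `rounds` Feistel rounds.
inline int MinActiveSboxes(int rounds) {
  assert(rounds >= 1 && rounds <= 24);
  return kActiveFunctions[rounds - 1] * kAesBranchNumber;
}

// log2 of the upper bound on differential characteristic probability,
// assuming at most 2^-6 per active s-box.
inline int Log2MaxDifferentialProbability(int rounds) {
  return -6 * MinActiveSboxes(rounds);
}

// log2 of the upper bound on linear characteristic correlation,
// assuming at most 2^-3 per active s-box.
inline int Log2MaxLinearCorrelation(int rounds) {
  return -3 * MinActiveSboxes(rounds);
}
```

For six rounds this gives 30 active s-boxes, and for the full 17 rounds it gives 180, compared with the 25 that Simpira's arguments require.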

Claim.

A keyed Randen permutation cannot be distinguished from random with complexity less than .

This bound is a conservative estimate based on our initial analysis. Per Theorem 1, differential/linear attack complexity is and . Symmetry attacks on AES are also unlikely to succeed because our round keys lack structure. Note that Randen involves 17 AES subrounds per 16 permuted bytes, versus only 10 for the AES-128 cipher. Any distinguishers would seem to imply new (or unknown to us) attacks on generalized Feistel with AES-like rounds.

For comparison, a recent successful attack on the full SHA-1 involved $2^{63.1}$ work at an estimated cost of 110,000 USD [47]. We assume this is a sufficient deterrent to predicting outputs.

Lemma 2.

If a computationally bounded adversary cannot distinguish the Randen permutation from random, then they cannot predict the Randen output with less than work based only on prior outputs.

Proof.

In this setting, adversaries do not know the AES round keys. The only way adversaries can access the permutation is by requesting random output. We can meet the requirements of Lemma 1 by instantiating Randen with an oracle implementing a randomly keyed Randen permutation. From the perspective of the adversary, this behaves in the same way as Randen instantiated with a real permutation. Then, the Randen output is indistinguishable from random, which implies unpredictability by contradiction (predicting a future output would also allow an adversary to distinguish the generator from random). ∎

Note that “based only on prior outputs” excludes cases where attackers gain access to the inner state. By contrast, NIST 800 90a requires prediction resistance even after the state is compromised [5]. This would entail periodic reseeding from external entropy sources, which we must avoid to ensure repeatability. Instead, we note that side-channels such as core dumps and paging [48] are less relevant in a server environment and can be mitigated with the help of the operating system (using madvise and mlock). Then, attackers can only guess the inner state at a cost of , which is beyond their assumed capability.

6.2 Backtracking resistance

The second property is backtracking resistance: adversaries have a negligible advantage at distinguishing prior outputs from random even if they gain access to the state [49]. This is important for portable devices and long-running applications without access to external entropy because it ensures adversaries cannot reconstruct prior outputs. If a generator is robust, it also provides backtracking resistance (also known as forward security) [4]. Reverie is robust in the ideal permutation model [23] and the previous section argues that instantiating Reverie with a random Randen permutation retains its security guarantees. However, robustness requires reseeding the generator from external entropy, which is not always possible in our applications. We instead show that backtracking resistance follows from the security of Reverie’s next function [23][p. 12], i.e. Randen’s Generate. Assume an adversary has gained access to the current state and AES keys. This allows them to invert the Randen permutation. Note that Reverie’s security model assumes a public permutation that attackers can already invert. We will illustrate the backtracking resistance in a scenario with two calls to Generate. Additional calls do not affect the argument.

For the following, let us define new notation: the state after the first () and second () call to Generate can be partitioned into inner/outer parts and . Let and denote the uniform random initial state. Per Lemma 1 of Reverie [23][p. 12], the return values of Generate are indistinguishable from random. However, the attacker knows all random outputs (i.e. outer states ) and learns the current state . Does this allow them to recover the remaining prior inner states ? Recall that the final Generate returns Permute() XOR (, zero). All terms except are known. However, the attacker cannot query the permutation in either direction without guessing the value at a cost of ; this is best possible attack on Reverie [23]. Hence, knowledge of is insufficient, and adversaries cannot expect to distinguish prior outputs from random with less than forward or backward queries to the permutation (e.g. by guessing the inner bits) [23][p. 12]. Therefore, Randen is backtracking-resistant.
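The feed-forward structure used in this argument (Generate returns the permuted state XORed with the prior inner part) can be illustrated with a toy model. This is only a sketch under stated assumptions: the 32-byte state, the 16-byte inner part, and the SHA-256-based Feistel permutation are hypothetical stand-ins, not Randen's AES-based permutation or its actual state layout. It shows that an attacker who holds the current state and can invert the permutation still needs the prior inner part to step backwards.

```python
import hashlib

INNER_BYTES = 16  # toy inner (hidden) part; an assumption of this sketch
HALF = 16         # each Feistel half; the toy state is 32 bytes

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def _round_function(round_index, half):
    # Stand-in round function, NOT the AES-based function used by Randen.
    return hashlib.sha256(bytes([round_index]) + half).digest()[:HALF]

def permute(state):
    # Toy invertible permutation: a 4-round balanced Feistel network.
    left, right = state[:HALF], state[HALF:]
    for r in range(4):
        left, right = right, xor(left, _round_function(r, right))
    return left + right

def inverse_permute(state):
    left, right = state[:HALF], state[HALF:]
    for r in reversed(range(4)):
        left, right = xor(right, _round_function(r, left)), left
    return left + right

def generate(state):
    # new_state = Permute(state) XOR (inner, zero); the output is the outer part.
    inner = state[:INNER_BYTES]
    permuted = permute(state)
    new_state = xor(permuted[:INNER_BYTES], inner) + permuted[INNER_BYTES:]
    return new_state, new_state[INNER_BYTES:]

def try_backtrack(current_state, guessed_inner):
    # Undo the feed-forward with a guess, invert, then check consistency.
    undone = xor(current_state[:INNER_BYTES], guessed_inner) + current_state[INNER_BYTES:]
    candidate = inverse_permute(undone)
    return candidate if candidate[:INNER_BYTES] == guessed_inner else None

initial = hashlib.sha256(b"toy seed").digest()  # 32-byte toy initial state
state1, output1 = generate(initial)
state2, output2 = generate(state1)

# Only the correct prior inner part recovers the previous state.
print(try_backtrack(state2, state1[:INNER_BYTES]) == state1)
print(try_backtrack(state2, bytes(INNER_BYTES)) is None)
```

In this toy model, each wrong guess is rejected by the consistency check, so the attacker is reduced to searching the inner part, which mirrors the guessing cost discussed in the text.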

7 Performance

7.1 Contenders

We emphasize that our comparison involves three groups of generators, in increasing order of security.

Insecure generators

To establish a performance baseline, we include the commonly used but insecure Mersenne Twister (‘MT’) as implemented by the C++11 standard library. Note that faster variants of MT exist [50, 51]. However, we advocate using more secure generators in most applications with the exception of Monte Carlo simulations.

Medium-strength

Several recent generators are at least nontrivial to distinguish from random, although indistinguishability and backtracking-resistance have not been formally shown. We include ‘Philox’ [15] and pcg64_c32 [9] (‘PCG’), both of which make no concrete security claims. We also place ISAAC into this category – although no bias has been shown, there are doubts about its security and similarity to RC4.

Strong

The third group consists of generators with security claims (see Section 6). In addition to Randen, we include ‘ChaCha20’ (provided by Linux 4.9 /dev/urandom [52]) and ‘CTR-DRBG’ from NIST SP 800-90A (provided by Windows 7 BCryptGenRandom). These have higher overhead, possibly due to calling into kernel mode. We reduce this somewhat by using a 256-byte buffer, the same size as Randen. To fully exclude the OS overhead, we also include a user-mode SSE2 implementation of ChaCha8 by Orson Peters that uses a single 64-byte block. Note that Bernstein recommends ChaCha20 instead due to its higher security margin [53].

7.2 Infrastructure

All generators except ‘CTR-DRBG’ are implemented in C++ and compiled using Clang r331746 with -O3 -std=gnu++11. The ‘x86’ benchmark is pinned to a single core of a lightly loaded dual-socket Xeon E5-2690 v3 clocked at 2.6 GHz running Linux 4.9 with Turbo Boost and throttling disabled. We also report performance on a POWER 8e clocked at 3.6 GHz (‘PPC’). The ‘CTR-DRBG’ measurements are obtained on an Intel i7 4790K CPU clocked at 4.0 GHz running Windows 7 x64 and using the Microsoft Visual Studio 2017 compiler. To increase the precision and accuracy of generator speed measurements, we use an improved version of the ‘nanobenchmark’ infrastructure [54] developed for HighwayHash. It prevents elision of the generator by passing its output as an input to an empty inline assembly block marked as modifying memory. To reduce variability between runs, it records high-resolution timestamps (in units of CPU cycles) from the invariant TSC, uses fences to ensure the measured code is not reordered by the compiler or the CPU, subtracts the overhead of the TSC reads, and uses the median (for small sample counts) or mode as a robust estimator of the central tendency. As a result, variability between measurements (defined as median absolute deviation from the median) is about 0.2%. To improve comparability between benchmarks of different sizes, we divide the elapsed times by the number of random bytes generated to yield cycles per byte. Note that the PPC elapsed times are relative to its 512 MHz timebase, so we multiply measurements by $3600 / 512 \approx 7.03$ to obtain CPU cycles.
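The post-processing of timings described above (timebase conversion, median absolute deviation, and normalization to cycles per byte) can be sketched as follows. This is not the nanobenchmark code; the function names and the sample values are hypothetical, and only the 3.6 GHz clock and 512 MHz timebase are taken from the text.

```python
import statistics

CPU_HZ = 3.6e9       # POWER 8e clock rate, from the text
TIMEBASE_HZ = 512e6  # PPC timebase frequency, from the text

def timebase_to_cycles(ticks):
    # Converts PPC timebase ticks to CPU cycles (factor 3600 / 512).
    return ticks * CPU_HZ / TIMEBASE_HZ

def median_absolute_deviation(samples):
    # Robust variability estimate: median of |sample - median|.
    center = statistics.median(samples)
    return statistics.median([abs(s - center) for s in samples])

def cycles_per_byte(elapsed_cycles, num_bytes):
    return elapsed_cycles / num_bytes

# Hypothetical tick counts; the outlier (250) barely affects median or MAD.
samples = [100, 102, 101, 250, 99]
print(statistics.median(samples), median_absolute_deviation(samples))
print(timebase_to_cycles(512))
```

The median and MAD are used here because a single outlier, such as an interrupt during one measurement, would distort a mean and standard deviation.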

7.3 Benchmarks

We go beyond conventional microbenchmarks by including three simple real-world applications of random numbers: a Fisher-Yates shuffle [55], reservoir sampling [56], and a Monte Carlo estimator for the value of $\pi$. Together, these exercise all consumers of random bits in the C++ standard library.

We emphasize that our measurements encompass the entire application, viz.: the algorithm consuming random numbers (e.g. shuffling) plus buffer-empty checks required by the C++ random generator interface plus the generator itself. Thus, the reported throughput will naturally be lower than best-case microbenchmarks of a stream cipher or merely generating large quantities of random bits.

7.3.1 Microbenchmark

C++11 only requires amortized constant-time complexity for its uniform random generators. This allows them to return numbers from a large buffer which is periodically refilled. To measure the amortized cost, we must ensure the elapsed time measurements include sufficient refills. Although the buffer sizes are known, C++11 does not provide a guaranteed means of flushing or querying the buffer. We therefore generate 800 KB of random bits such that the cost of ‘wasting’ part of the final buffer is negligible.

| Engine | x86 | MAD | Speedup | PPC | MAD | Speedup |
|---|---|---|---|---|---|---|
| Randen | 1.54 | 0.002 | | 2.94 | 0.007 | |
| PCG | 0.78 | 0.003 | 0.5 | 1.68 | 0.007 | 0.6 |
| MT | 1.79 | 0.001 | 1.2 | 3.99 | 0.014 | 1.4 |
| ChaCha8 | 3.02 | 0.003 | 2.0 | | | |
| ISAAC | 4.08 | 0.006 | 2.6 | 7.91 | 0.014 | 2.7 |
| Philox | 4.70 | 0.003 | 3.1 | 9.94 | 0.014 | 3.4 |
| ChaCha20 | 15.27 | 0.018 | 9.9 | 197.96 | 0.315 | 67.3 |
| CTR-DRBG | 16.80 | 0.009 | 11.2 | | | |
Table 2: Cycles per byte for a small loop, plus variability (MAD is the median absolute deviation) and speedup factor of Randen vs. other generators.

The x86 microbenchmark (Table 2) seems to indicate PCG is twice as fast as Randen, which is in turn 1.2 times as fast as MT. The trend is similar on PPC. Despite their high precision (median absolute deviation below 0.2%), these microbenchmark results are quite irrelevant in practice — which actual application repeatedly calls a random generator and ignores the results? Any that do should use the more efficient discard function instead. As we will see, these results are not representative of real-world performance. There are at least three reasons why microbenchmarks may mischaracterize actual performance. First, their small working set leads to unrealistically high cache and TLB hit rates. Second, tight loops benefit from special CPU decoding hardware [57][p. 123]. Third, simple microbenchmarks may use fewer CPU resources (e.g. registers and load-store buffers) than real-world applications.

7.3.2 Shuffle

For a more realistic use case, we measure a Fisher-Yates shuffle that swaps elements at a randomly chosen position. Although the C++ standard library provides an implementation (std::shuffle), its mapping of random bits to uniform integers is quite slow. Instead of costly divisions, we use a multiplication followed by a bit-shift [58]. The resulting shuffle is about three times as fast as std::shuffle. The array occupies 400 KB, which exceeds the 256 KiB L2 cache on x86 but fits into the 512 KiB PPC cache.
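The division-free mapping can be sketched as follows. This is a minimal illustration, not the benchmarked C++ code: it assumes a 32-bit generator output, uses Python's `random.Random` as a stand-in for Randen, and omits any bias-correcting rejection step, so the result is only approximately uniform when the bound does not divide 2^32.

```python
import random

def bounded(random_u32, n):
    # Maps a 32-bit random value to [0, n) with a multiply and a shift.
    return (random_u32 * n) >> 32

def shuffle(items, next_u32):
    # Fisher-Yates: swap each element with a randomly chosen earlier one.
    for i in range(len(items) - 1, 0, -1):
        j = bounded(next_u32(), i + 1)
        items[i], items[j] = items[j], items[i]

rng = random.Random(1234)  # stand-in generator, NOT Randen
data = list(range(100))
shuffle(data, lambda: rng.getrandbits(32))
print(sorted(data) == list(range(100)))
```

The product of a 32-bit value and `n` is below `n * 2**32`, so the upper bits after the shift always fall in the range 0 to `n - 1` without any division.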

We observe different performance characteristics (Table 3) than in the microbenchmark. As shown by the ‘Speedup’ columns (the ‘Randen factor’), Randen is 1.2 times as fast as PCG and at least as fast as MT on x86 (both round to 2.19 cycles per byte). By contrast, MT is the fastest on PPC. In all benchmarks, Randen is roughly twice as fast as ISAAC, which is still faster than Philox. Replacing ChaCha20/CTR-DRBG with Randen leads to an overall shuffle speedup of 7 to 8 on x86, and 36 on PPC. We see nearly identical results on x86 for 100 KB and 25 KB inputs, which fit into L2 and L1, respectively. This implies that caching and prefetching are effective. Indeed, VTune reports that only Philox and ChaCha20 have high levels of load/store stalls: 45% and 85%, versus less than 30% for the other generators.

| Engine | x86 | MAD | Speedup | PPC | MAD | Speedup |
|---|---|---|---|---|---|---|
| Randen | 2.19 | 0.004 | | 5.46 | 0.014 | |
| PCG | 2.65 | 0.005 | 1.2 | 6.65 | 0.014 | 1.2 |
| MT | 2.19 | 0.004 | 1.0 | 4.48 | 0.021 | 0.8 |
| ChaCha8 | 3.63 | 0.006 | 1.7 | | | |
| ISAAC | 4.15 | 0.007 | 1.9 | 8.19 | 0.021 | 1.5 |
| Philox | 4.87 | 0.008 | 2.2 | 10.57 | 0.021 | 1.9 |
| ChaCha20 | 15.87 | 0.027 | 7.2 | 198.24 | 0.917 | 36.3 |
| CTR-DRBG | 20.45 | 0.017 | 8.2 | | | |
Table 3: Cycles per byte for engines called from Fisher-Yates shuffle, plus variability (MAD is the median absolute deviation) and speedup factor of Randen vs. other generators.

7.3.3 Sample

Our third benchmark measures reservoir sampling, a randomized online algorithm for retaining an 80 KB subset of a 400 KB data stream. It probabilistically overwrites prior samples at random positions. As with shuffling, using a division-free mapping of random bits to integers is much faster than std::uniform_int_distribution. We see similar performance (Table 4), except that Randen is now 1.2 times as fast as PCG on x86, and 1.4 times on PPC. On both platforms, Randen outperforms MT. Also as before, speeds are comparable when reducing the input sizes to one quarter.
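A minimal sketch of reservoir sampling with the same division-free mapping is given below. It follows the classic single-pass scheme from [56] rather than the benchmarked C++ code; the function name, the 32-bit generator interface, and the use of `random.Random` as a stand-in for Randen are assumptions.

```python
import random

def reservoir_sample(stream, k, next_u32):
    # Keeps a uniform sample of k items from a stream of unknown length.
    reservoir = []
    for count, item in enumerate(stream, start=1):
        if count <= k:
            reservoir.append(item)  # fill phase
        else:
            # Division-free position in [0, count); keep item if it lands in the reservoir.
            j = (next_u32() * count) >> 32
            if j < k:
                reservoir[j] = item
    return reservoir

rng = random.Random(2018)  # stand-in generator, NOT Randen
sample = reservoir_sample(range(1000), 10, lambda: rng.getrandbits(32))
print(len(sample), all(0 <= x < 1000 for x in sample))
```

Each incoming item consumes one random number, which is why generator throughput dominates the cost once the reservoir is full.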

| Engine | x86 | MAD | Speedup | PPC | MAD | Speedup |
|---|---|---|---|---|---|---|
| Randen | 2.60 | 0.008 | | 4.97 | 0.007 | |
| PCG | 3.03 | 0.009 | 1.2 | 6.72 | 0.021 | 1.4 |
| MT | 2.82 | 0.009 | 1.1 | 5.32 | 0.014 | 1.1 |
| ChaCha8 | 3.75 | 0.008 | 1.4 | | | |
| ISAAC | 4.46 | 0.014 | 1.7 | 8.12 | 0.014 | 1.6 |
| Philox | 4.95 | 0.009 | 1.9 | 9.87 | 0.007 | 2.0 |
| ChaCha20 | 13.46 | 0.017 | 5.2 | 159.67 | 0.168 | 32.1 |
| CTR-DRBG | 16.41 | 0.015 | 6.4 | | | |
Table 4: Cycles per byte for engines called from reservoir sampling, plus variability (MAD is the median absolute deviation) and speedup factor of Randen vs. other generators.

7.3.4 Monte Carlo

The fourth benchmark is Monte Carlo estimation of the value of $\pi$ via the ratio of points that fall within a unit circle versus the unit square. This is similar to the microbenchmark in that it calls the generator 200,000 times in a fairly tight loop. Note that std::uniform_real_distribution is slow and not actually uniform [59], so we again implement a replacement. It constructs an IEEE-754 mantissa using the lower 53 bits of a generated uint64_t and chooses an exponent based on the base-2 logarithm of its upper bit. The results in Table 5 show that PCG is 1.2 times as fast as Randen on x86 but slower on PPC. Randen outperforms MT on both platforms.
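The estimator itself can be sketched in a few lines. Note the assumptions: the bit-to-float mapping below is a simpler, commonly used stand-in (the upper 53 bits scaled by 2^-53), not the mantissa-and-exponent construction described above, and `random.Random` stands in for Randen.

```python
import math
import random

def to_unit_float(u64):
    # Simplified mapping (an assumption): upper 53 bits scaled into [0, 1).
    return (u64 >> 11) * 2.0**-53

def estimate_pi(num_points, next_u64):
    # Fraction of random points in the unit square that fall inside the
    # quarter circle of radius 1, scaled by 4.
    inside = 0
    for _ in range(num_points):
        x = to_unit_float(next_u64())
        y = to_unit_float(next_u64())
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / num_points

rng = random.Random(42)  # stand-in generator, NOT Randen
estimate = estimate_pi(100_000, lambda: rng.getrandbits(64))
print(abs(estimate - math.pi) < 0.05)
```

With 100,000 points the loop makes 200,000 generator calls, matching the call count stated above.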

| Engine | x86 | MAD | Speedup | PPC | MAD | Speedup |
|---|---|---|---|---|---|---|
| Randen | 2.14 | 0.002 | | 3.43 | 0.007 | |
| PCG | 1.69 | 0.031 | 0.8 | 3.85 | 0.007 | 1.1 |
| MT | 2.55 | 0.015 | 1.2 | 4.90 | 0.007 | 1.4 |
| ChaCha8 | 4.58 | 0.002 | 2.1 | | | |
| ISAAC | 4.35 | 0.003 | 2.0 | 8.54 | 0.056 | 2.5 |
| Philox | 4.97 | 0.002 | 2.3 | 11.62 | 0.014 | 3.4 |
| ChaCha20 | 16.65 | 0.006 | 7.8 | 194.53 | 8.428 | 56.7 |
| CTR-DRBG | 17.37 | 0.031 | 9.3 | | | |
Table 5: Cycles per byte for engines called from a Monte Carlo simulation, plus variability (MAD is the median absolute deviation) and speedup factor of Randen vs. other generators.

7.4 Discussion

To summarize the four benchmarks, we compute the geometric means of the ‘Randen factors’ from the above tables, i.e. the cost (cycles per byte) of other generators divided by that of Randen (Table 6). Due to the large differences in the (lack of) security guarantees of the various generators, we discuss them separately.

| Engine | x86 | PPC |
|---|---|---|
| PCG | 0.9 | 1.0 |
| MT | 1.1 | 1.1 |
| ChaCha8 | 1.8 | |
| ISAAC | 2.0 | 2.0 |
| Philox | 2.3 | 2.6 |
| ChaCha20 | 7.3 | 45.9 |
| CTR-DRBG | 8.5 | |
Table 6: Geometric means of Randen speedup factors across the benchmarks. A value of 1.1 indicates the benchmarks run 1.1 times as fast after replacing MT with Randen.
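As a cross-check, the geometric means can be recomputed from the rounded speedup factors. The sketch below copies the x86 factors for three engines from Tables 2 to 5 (microbenchmark, shuffle, sample, Monte Carlo); the dictionary and function names are illustrative. Because the inputs are already rounded to one decimal, small deviations from Table 6 are possible for other engines.

```python
import math

# x86 speedup factors from Tables 2 to 5, in benchmark order.
X86_FACTORS = {
    "PCG": [0.5, 1.2, 1.2, 0.8],
    "MT": [1.2, 1.0, 1.1, 1.2],
    "ChaCha20": [9.9, 7.2, 5.2, 7.8],
}

def geometric_mean(values):
    # n-th root of the product, computed via logarithms.
    return math.exp(sum(math.log(v) for v in values) / len(values))

summary = {name: round(geometric_mean(factors), 1)
           for name, factors in X86_FACTORS.items()}
print(summary)
```

For these three engines the recomputed values agree with the x86 column of Table 6 (0.9, 1.1 and 7.3). The geometric mean is the natural choice for ratios because it treats a factor of 2 and a factor of 0.5 symmetrically.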

Insecure generators

One of our main results is that the Randen generator does not increase CPU cost relative to the commonly used but insecure Mersenne Twister generator. The geometric mean of speed ratios indicates Randen is slightly faster on both x86 and PPC.

Medium-strength

Randen is about twice as fast as ISAAC and Philox in all benchmarks. Our choice of geometric mean indicates PCG is the fastest on x86, and tied for first on PPC. However, this is mainly due to its result in the (unrealistic) microbenchmark. PCG is a good choice for Monte Carlo applications but Randen is 1.2 to 1.4 times as fast for shuffling and sampling. Note that ISAAC, Philox and PCG lack concrete security claims and have not been shown to be indistinguishable from random nor backtracking-resistant. To the best of our knowledge, Randen is the fastest software generator with these properties.

Strong

Is it feasible to use cryptographically secure generators as the default even in non-cryptographic applications? This depends on the scale of usage. We profiled production code running company-wide and found that traditional non-cryptographic random generators account for tens of thousands of CPU cores. From this and the above benchmarks, we conclude it would be too expensive to use an OS-provided ChaCha20 (/dev/urandom) or CTR-DRBG (BCryptGenRandom) as general-purpose generators. By contrast, Randen is 5 to 10 times as fast in real-world benchmarks. Switching from Mersenne Twister to Randen would actually reduce cost (according to the geometric mean of our benchmarks), while greatly increasing the baseline security of non-cryptographic randomized applications.

8 Conclusion

Recent random generators have desirable characteristics: SIMD-accelerated Mersenne Twister (MT) is efficient [51]. PCG has good statistical properties [9]. AES-CTR is unpredictable by attackers. AES-CTR-DRBG ensures backtracking resistance [5]. Thanks to recent hardware acceleration of AES, a single generator can now achieve all these goals!

This work proposes Randen, an instantiation of Reverie [23] with a permutation based on a generalized Feistel structure. We show that it is ‘strong’, i.e. computationally indistinguishable from random and backtracking resistant. This high level of security is useful even for general-purpose applications such as shuffling and sampling because it greatly increases the attacker cost of triggering worst-case behavior in randomized algorithms. Note that Randen is not intended for cryptographic applications such as key generation, but the permutation may also be useful for wide-block ciphers and hashing functions. Despite its statistical quality and resistance to attacks, Randen is actually faster than the commonly used MT generator, ChaCha8, ISAAC, Philox and a variant of PCG in some real-world benchmarks on Haswell and POWER 8.

We invite external analysis and verification of Randen’s properties and suggest it as a safer alternative to arguably obsolete [9][p. 6] algorithms such as small linear congruential generators, linear feedback shift registers, well equidistributed long-period linear (WELL) generators [60], unaugmented XorShift, and MT.

References

  • Yesil and Yalabik [2014] A. Yesil and M. Yalabik. A report of a significant error on a frequently used pseudo random number generator. CoRR, 2014. http://arxiv.org/abs/1408.1900.
  • MITRE [2017] MITRE. CWE-338: Use of cryptographically weak pseudo-random number generator (PRNG), May 2017. https://goo.gl/AaUXQL.
  • Killmann and Schindler [2011] W. Killmann and W. Schindler. A proposal for: Functionality classes for random number generators version 2.0, September 2011. https://goo.gl/FgFdVR.
  • Dodis et al. [2013] Y. Dodis, D. Pointcheval, S. Ruhault, D. Vergnaud, and D. Wichs. Security analysis of pseudo-random number generators with input: /dev/random is not robust. IACR, 2013. http://eprint.iacr.org/2013/338.pdf.
  • Barker and Kelsey [2015] E. Barker and J. Kelsey. Recommendation for Random Number Generation Using Deterministic Random Bit Generators. NIST, June 2015. https://goo.gl/68Fwmv.
  • Garman et al. [2015] C. Garman, K. Paterson, and T. van der Merwe. Attacks only get better: Password recovery attacks against RC4 in TLS. In J. Jung and T. Holz, editors, USENIX Security Symposium, pages 113–128, 2015. https://tinyurl.com/m6zdole.
  • Aumasson [2006] J. Aumasson. On the pseudo-random generator ISAAC. IACR, 2006. http://eprint.iacr.org/2006/438.pdf.
  • Kneusel [2018] R. Kneusel. Random Numbers and Computers. Springer, 2018. ISBN 978-3-319-77696-5.
  • O’Neill [2014] M. O’Neill. PCG: A family of simple fast space-efficient statistically good algorithms for random number generation. Technical report, Harvey Mudd College, September 2014. https://tinyurl.com/yddde6u2.
  • Neves and Araujo [2011] S. Neves and F. Araujo. Fast and small nonlinear pseudorandom number generators for computer simulation. In R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, editors, PPAM, volume 7203 of LNCS, pages 92–101. Springer, 2011. ISBN 978-3-642-31463-6.
  • Neves and Araujo [2013] S. Neves and F. Araujo. Engineering nonlinear pseudorandom number generators. In R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, editors, PPAM, volume 8384 of LNCS, pages 96–105. Springer, 2013. ISBN 978-3-642-55223-6. https://goo.gl/ASxc5B.
  • [12] S. Vigna. xoroshiro+ / xorshift* / xorshift+ generators and the PRNG shootout. http://vigna.di.unimi.it/xorshift/.
  • Cook [2017] J. Cook. Testing RNGs with PractRand, August 2017. https://goo.gl/crmfLe.
  • Lemire [2017] D. Lemire. Cracking random number generators (xoroshiro128+), August 2017. https://goo.gl/g1uHNs.
  • Salmon et al. [2011] J. Salmon, M. Moraes, R. Dror, and D. Shaw. Parallel random numbers: As easy as 1, 2, 3. In SC’11, Seattle, WA, USA, November 2011. ACM SIGARCH/IEEE Computer Society. https://goo.gl/cu0l9e.
  • L’Ecuyer and Simard [2007] P. L’Ecuyer and R. Simard. TestU01: A C library for empirical testing of random number generators. ACM Trans. Math. Softw, 33(4), 2007. https://www.iro.umontreal.ca/~lecuyer/myftp/papers/testu01.pdf.
  • O’Neill [2017] M. O’Neill. PCG random number generation for C++, April 2017. https://goo.gl/W17ctZ.
  • Ferguson et al. [2010] N. Ferguson, B. Schneier, and T. Kohno. Cryptography Engineering - Design Principles and Practical Applications. Wiley, 2010. https://www.schneier.com/academic/paperfiles/fortuna.pdf.
  • Bernstein [2017] D. Bernstein. Fast-key-erasure random-number generators, July 2017. https://blog.cr.yp.to/20170723-random.html.
  • Goll and Gueron [2013] M. Goll and S. Gueron. Vectorization of ChaCha stream cipher. IACR, page 759, 2013. http://eprint.iacr.org/2013/759.
  • Krasnov [2017] V. Krasnov. On the dangers of Intel’s frequency scaling, November 2017. https://tinyurl.com/yblczuwb.
  • VAMPIRE [2017] VAMPIRE. eBACS: ECRYPT benchmarking of cryptographic systems, July 2017. https://bench.cr.yp.to/results-stream.html.
  • Hutchinson [2016] D. Hutchinson. A robust and sponge-like PRNG with improved efficiency. IACR, 2016. http://eprint.iacr.org/2016/886.pdf.
  • Suzaki and Minematsu [2010] T. Suzaki and K. Minematsu. Improving the generalized Feistel. In S. Hong and T. Iwata, editors, FSE, volume 6147 of LNCS, pages 19–39. Springer, 2010. ISBN 978-3-642-13857-7. https://www.iacr.org/archive/fse2010/61470020/61470020.pdf.
  • Shibutani [2010] K. Shibutani. On the diffusion of generalized feistel structures regarding differential and linear cryptanalysis. In A. Biryukov, G. Gong, and D. Stinson, editors, Selected Areas in Cryptography, volume 6544 of LNCS, pages 211–228. Springer, 2010. ISBN 978-3-642-19573-0. https://goo.gl/MK4Koj.
  • Zheng et al. [1989] Y. Zheng, T. Matsumoto, and H. Imai. On the construction of block ciphers provably secure and not relying on any unproved hypotheses. In CRYPTO, 1989. https://goo.gl/rHLZWq.
  • Gueron and Mouha [2016] S. Gueron and N. Mouha. Simpira v2: A family of efficient permutations using the AES round function. IACR, 2016. http://eprint.iacr.org/2016/122.pdf.
  • Gueron [2012] S. Gueron. Intel Advanced Encryption Standard (AES) New Instructions Set. Intel, September 2012. https://goo.gl/DwqC6N.
  • Käsper and Schwabe [2009] E. Käsper and P. Schwabe. Faster and timing-attack resistant AES-GCM. IACR, 2009. http://eprint.iacr.org/2009/129.pdf.
  • Kivilinna [2013] J. Kivilinna. Block ciphers: Fast implementations on x86-64 architecture. Master’s thesis, University of Oulu, May 2013. https://goo.gl/1gZb2B.
  • Barbosa [2015] L. Barbosa. Power8 in-core cryptography, September 2015. https://goo.gl/R8RKGt.
  • ARM [2017] ARM Architecture Reference Manual. ARM Limited, 2017. https://goo.gl/SJ3StU.
  • Bogdanov and Shibutani [2011] A. Bogdanov and K. Shibutani. Double SP-functions: Enhanced generalized Feistel networks. In U. Parampalli and P. Hawkes, editors, ACISP, volume 6812 of LNCS, pages 106–119. Springer, 2011. ISBN 978-3-642-22496-6. https://goo.gl/hGC7Mu.
  • Kölbl et al. [2016] S. Kölbl, M. Lauridsen, F. Mendel, and C. Rechberger. Haraka v2 - efficient short-input hashing for post-quantum applications. IACR, 2016. https://eprint.iacr.org/2016/098.pdf.
  • Wang and Wu [2016] Y. Wang and W. Wu. New criterion for diffusion property and applications to improved GFS and EGFN. Des. Codes Cryptography, 81(3):393–412, 2016. https://goo.gl/6hbdbn.
  • Zhang and Wu [2015] H. Zhang and W. Wu. Structural evaluation for generalized Feistel structures and applications to LBlock and TWINE. In A. Biryukov and V. Goyal, editors, INDOCRYPT, volume 9462 of LNCS, pages 218–237. Springer, 2015. ISBN 978-3-319-26616-9.
  • Intel [2016] Intel. Intel 64 and IA-32 architectures optimization reference manual, January 2016. http://goo.gl/9IkxGj.
  • Intel [2017a] Intel. Intel architecture instruction set extensions, October 2017a. https://goo.gl/SFqMHT.
  • Wassenberg [2018] J. Wassenberg. Randen open-source, April 2018. https://github.com/google/randen.
  • Intel [2017b] Intel. Intel Architecture Code Analyzer. Intel, 2017b. https://goo.gl/nnDkPu.
  • Aoki et al. [2012] K. Aoki, T. Iwata, and K. Yasuda. How fast can a two-pass mode go? a parallel deterministic authenticated encryption mode for AES-NI, July 2012. https://hyperelliptic.org/DIAC/slides/aoki.pdf.
  • L’Ecuyer and Simard [2013] P. L’Ecuyer and R. Simard. TestU01 A Software Library in ANSI C for Empirical Testing of Random Number Generators, May 2013. http://simul.iro.umontreal.ca/testu01/guideshorttestu01.pdf.
  • Doty-Humphrey [2016] C. Doty-Humphrey. Practrand, September 2016. http://pracrand.sourceforge.net/.
  • Daemen and Rijmen [2002] J. Daemen and V. Rijmen. The Design of Rijndael: AES - The Advanced Encryption Standard. Information Security and Cryptography. Springer, 2002. ISBN 3-540-42580-2; 978-3-642-07646-6.
  • Mennink and Preneel [2015] B. Mennink and B. Preneel. On the XOR of multiple random permutations. In T. Malkin, V. Kolesnikov, A. Lewko, and M. Polychronakis, editors, ACNS, volume 9092 of LNCS, pages 619–634. Springer, 2015. ISBN 978-3-319-28165-0. https://goo.gl/4JCBeB.
  • Tolba and Youssef [2016] M. Tolba and A. Youssef. Generalized mitM attacks on full TWINE. Inf. Process. Lett, 116(2):128–135, 2016. https://goo.gl/wEXHD7.
  • Stevens et al. [2017] M. Stevens, E. Bursztein, P. Karpman, A. Albertini, and Y. Markov. The first collision for full SHA-1. IACR, 2017. http://eprint.iacr.org/2017/190.pdf.
  • Corrigan-Gibbs and Jana [2015] H. Corrigan-Gibbs and S. Jana. Recommendations for randomness in the operating system, or how to keep evil children out of your pool and other random facts. In G. Candea, editor, HotOS. USENIX Association, 2015. https://goo.gl/gYUpC9.
  • Fischer et al. [2012] M. Fischer, M. Paterson, and E. Syta. On backtracking resistance in pseudorandom bit generation. Technical Report TR-1466, Yale, October 2012. http://cs.yale.edu/publications/techreports/tr1466.pdf.
  • Saito [2017] M. Saito. Simd-oriented fast mersenne twister, 2017. http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/.
  • Barash et al. [2017] L. Barash, M. Guskova, and L. Shchur. Employing AVX vectorization to improve the performance of random number generators. Programming and Computer Software, 43(3):145–160, 2017. https://goo.gl/E1tW8Z.
  • Corbet [2016] J. Corbet. Replacing /dev/urandom, May 2016. https://lwn.net/Articles/686033/.
  • Bernstein [2018] D. Bernstein. Tweet, July 2018. https://twitter.com/hashbreaker/status/1023966188949463046.
  • Wassenberg [2017] J. Wassenberg. nanobenchmark.h, 2017. https://goo.gl/Bi1yuu.
  • Black [2015] P. Black. Fisher-Yates shuffle. Dictionary of Algorithms and Data Structures, March 2015. https://goo.gl/6eMQzA.
  • Vitter [1985] J. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw, 11(1):37–57, 1985. https://goo.gl/U35NaK.
  • Fog [2017] A. Fog. The microarchitecture of Intel, AMD and VIA CPUs, May 2017. http://www.agner.org/optimize/microarchitecture.pdf.
  • Lemire [2018] D. Lemire. Fast random integer generation in an interval, May 2018. https://arxiv.org/abs/1805.10941.
  • Grabowski [2015] A. Grabowski. uniform real distribution isn’t uniform, April 2015. https://bugs.llvm.org/show_bug.cgi?id=23168.
  • Panneton et al. [2006] F. Panneton, P. L’Ecuyer, and M. Matsumoto. Improved long-period generators based on linear recurrences modulo 2. ACMTMS, 32, 2006. http://www-labs.iro.umontreal.ca/~lecuyer/myftp/papers/wellrng.pdf.
  • Bogdanov and Shibutani [2013] A. Bogdanov and K. Shibutani. Generalized feistel networks revisited. Des. Codes Cryptography, 66(1-3):75–97, 2013. https://goo.gl/SG6BV8.

Appendix A Active Functions in 16-branch Feistel

Lemma 3.

A type-2 generalized Feistel network with 16 branches and an improved block shuffle [24] has at least as many differentially active functions as listed in Table 1.

To the best of our knowledge, these bounds are new. Note that the 6-round bound is the same as reported for a type-2 network with four branches [33]. We will establish our bounds via exhaustive enumeration. Type-2 Feistel networks update their odd branches by XORing them with the result of a function of the corresponding even branch: new_odd := F(even) XOR odd. There are two simple properties (numbered 3 and 4 [61][p. 83]) regarding the propagation of differences. First, if both even and odd are differentially inactive, then so is new_odd. Second, at least two of them are active if any of the three are active. Thus, given an input configuration (i.e. whether each branch is differentially active), the output is active if exactly one input is active. The inputs are booleans, so this corresponds to simply XORing them. Next, our simulator counts the number of active functions (i.e. the number of differentially active even input branches), shuffles the outputs and passes them as inputs to the next round. This process is repeated for every round up to the desired limit. We consider all input configurations except the trivial case of zero input differences. This logic is implemented by the following Python script.

# (Over)estimates a lower bound of differentially active
# functions in a 16-branch generalized Feistel network.
ROUNDS = range(6)
BRANCHES = 16
idx = range(BRANCHES // 2)   # indices within odd/even
bit_shifts = [2 * i for i in idx]
# Shuffle: ‘Improving the Generalized Feistel’ No.10
shuffle_for_new_odd  = (3,6,5,1,7,4,0,2) # = even i / 2
shuffle_for_new_even = (1,2,4,3,0,5,7,6) # = odd i / 2
min_active_funcs = 99999
def XorResult(even, xor):
  # Page 83 in ‘Generalized Feistel networks revisited’.
  # 3) if even (input to F) and the XOR input are both
  #    zero (inactive), so is the XOR result.
  if even == 0 and xor == 0: return 0
  # 4) otherwise, at least two of the inputs/output are
  #    active => an inactive input implies active output.
  if even == 0 and xor == 1: return 1
  if even == 1 and xor == 0: return 1
  # Assume inactive => overestimate the lower bound!
  return 0
# For every combination of differentially active
# branches except all-zero (no active functions):
for bits in range(1, 1 << BRANCHES):
  # Extract bits into integers, partition into even/odd.
  even = [((bits >> bit_shifts[i]) & 1) for i in idx]
  odd = [((bits >> (bit_shifts[i] + 1)) & 1) for i in idx]
  # Total differentially active functions.
  active_funcs = 0
  for round in ROUNDS:
    # Active functions (nonzero even[]) in this round.
    active_funcs += even.count(1)
    # Shuffle(even) will later replace the current odd.
    new_odd = [even[shuffle_for_new_odd[i]] for i in idx]
    # Shuffle(F(even, odd)) replaces the current even.
    f_out = [XorResult(even[i], odd[i]) for i in idx]
    even = [f_out[shuffle_for_new_even[i]] for i in idx]
    odd = new_odd
  # Remember and report the lowest.
  min_active_funcs = min(min_active_funcs, active_funcs)
print(min_active_funcs)

Note an important limitation of this algorithm: Property 4 does not provide any guidance when both inputs are active. The differences may cancel, or not. Thus, this algorithm does not guarantee a lower bound, but it does indicate such a bound is at most six. We now extend the search to cover all these possibilities and thus obtain a lower bound. The search can be paused and resumed from a ‘state’ consisting of the round number, odd/even status, and the number of active functions so far. When both inputs are active, we enqueue new states with every possible combination of the output. Although quick to compute for six rounds, additional rounds yield trillions of possible combinations. We retain a brute-force approach, but add some optimizations to make the search tractable. First, the odd and even differentially-active status can be represented as separate bit arrays, such that all calls to XorResult simplify to a single 8-bit XOR and the shuffle reduces to an 8-bit table lookup. Second, we can prune search areas where active_funcs already exceeds the minimum seen so far, because they will not influence the lower bound. Third, a fixed-size priority-queue with bitwise operations reduces the space and time overhead to constants. The C++ source code corresponding to this description will later be open-sourced alongside the Randen implementation [39]. It can trace about a trillion combinations arising during 18 rounds within a few minutes on a workstation with 24 cores.
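The first optimization can be sketched in Python as follows. This is not the C++ search: it only repacks the simple script above (which assumes that two active inputs always cancel), so it shares the same limitation and does not implement the branching over both outcomes, the pruning, or the priority queue. The table and function names are illustrative. The even and odd status are each packed into 8 bits, XorResult becomes a single XOR, and each shuffle becomes a 256-entry table lookup.

```python
BRANCH_PAIRS = 8
# Shuffle constants copied from the script above.
SHUFFLE_FOR_NEW_ODD = (3, 6, 5, 1, 7, 4, 0, 2)
SHUFFLE_FOR_NEW_EVEN = (1, 2, 4, 3, 0, 5, 7, 6)

def make_table(shuffle):
    # table[v] has bit i set iff bit shuffle[i] of v is set.
    table = []
    for value in range(1 << BRANCH_PAIRS):
        out = 0
        for i, src in enumerate(shuffle):
            out |= ((value >> src) & 1) << i
        table.append(out)
    return table

NEW_ODD_TABLE = make_table(SHUFFLE_FOR_NEW_ODD)
NEW_EVEN_TABLE = make_table(SHUFFLE_FOR_NEW_EVEN)
POPCOUNT = [bin(v).count("1") for v in range(1 << BRANCH_PAIRS)]

def min_active_functions(rounds):
    best = None
    for even0 in range(256):
        for odd0 in range(256):
            if even0 == 0 and odd0 == 0:
                continue  # skip the trivial all-inactive input
            even, odd, active = even0, odd0, 0
            for _ in range(rounds):
                active += POPCOUNT[even]      # active functions this round
                f_out = even ^ odd            # all eight XorResult calls at once
                odd = NEW_ODD_TABLE[even]     # Shuffle(even) becomes the new odd
                even = NEW_EVEN_TABLE[f_out]  # Shuffle(F output) becomes the new even
            best = active if best is None else min(best, active)
    return best

print(min_active_functions(2))
```

Because the loop body now consists of two table lookups, one XOR and one popcount lookup, the same structure maps directly onto a handful of machine instructions in a compiled implementation.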