Random numbers are key ingredients in various applications, e.g., in cryptography (for generating cryptographic keys) or in simulations (in Monte Carlo methods), to mention just a few. No algorithm can produce truly random numbers. Instead, pseudorandom number generators (PRNGs) are used. These are deterministic algorithms producing numbers which we expect to resemble truly random ones in some sense. There are two classes of tests used to evaluate PRNGs: theoretical and statistical ones. Theoretical tests examine the intrinsic structure of a given generator; the sequence does not even need to be generated. Two classical examples are the lattice test MARSAGLIA1972 and the spectral test described in Knuth2 (Section 3.3.4). See also LEcuyer92 for a description of some standard tests from this class. This category of tests is very specific to each family of generators, e.g., some are designed only for linear congruential generators. On the other hand, the second class of tests – empirical tests – are conducted on a sequence generated by a PRNG and require no knowledge of how it was produced. The main goal of these tests is to check whether the sequence of numbers
(or bits, depending on the actual implementation) produced by a PRNG has properties similar to those of a sequence generated truly at random. These tests try to find statistical evidence against the null hypothesis
stating that the sequence is a sample from independent random variables with uniform distribution. Any function of a finite number of uniformly distributed random variables whose (sometimes approximate) distribution under the null hypothesis is known can be used as a statistical test. Due to the popularity and significance of the problem, a variety of testing procedures have been developed in recent years. Such statistical tests aim at detecting various deviations in generated sequences, which allows for revealing flawed PRNGs producing predictable output. Some of the procedures encompass classical tools from statistics like the Kolmogorov-Smirnov test or Pearson’s chi-squared test, which are used for comparing the theoretical and empirical distributions of appropriate statistics calculated for a PRNG’s output. It is also possible to adapt tests of normality, like the Anderson-Darling or Shapiro-Wilk tests, to appropriately transformed pseudorandom sequences. These methods exploit the properties of sequences of i.i.d. random variables. Based on the original sequence returned by the examined PRNG, we are able to obtain realizations of random variables with known theoretical distributions. Some examples of probabilistic laws used in practice in tests of this kind can be found e.g., in Knuth2 . They include procedures such as the gap test, the permutation test and the coupon collector’s test, just to name a few (see Knuth2
for a more detailed treatment). These methods also have the advantage that they implicitly test the independence of the generator’s output. The main issue with such methods is that a single statistical test looks only at some specific property that holds for sequences of truly random numbers. Hence, for practical purposes, bundles of diverse tests are created. Such a test bundle consists of a series of individual procedures based on various stochastic laws from probability theory. A PRNG is then considered good if the pseudorandom sequences it produces pass all tests in a given bundle. Note that formally this proves nothing, but it increases the confidence in the simulation results. Thus, they are actually tests for non-randomness, as pointed out in PareschiRS07 . Some examples of such test suites are Marsaglia’s Diehard Battery of Tests of Randomness from 1995, Dieharder developed by Brown et al. (see Dieharder ), TestU01 implemented by L’Ecuyer and Simard (see TestU01 ; TestU01_guide ) and the NIST Test Suite NISTtests . The last one, designed by the National Institute of Standards and Technology, is currently considered one of the state-of-the-art test bundles. It is often used in the preparation of formal certifications or approvals.
A result of a single statistical test is typically given in the form of a p-value, which, informally speaking, represents the probability that a perfect PRNG would produce a “less random” sequence than the sequence being tested w.r.t. the used statistic. We then reject the null hypothesis H0 if the p-value is smaller than the significance level, and accept H0 otherwise. Such an approach is usually called a one level or first level test. Although the interpretation of a single p-value has a clear statistical explanation, it is not quite obvious how to interpret the results of a test bundle, i.e., of multiple tests. Under H0 the distribution of p-values is uniform. However, in a test bundle several different tests are applied to the same output of a PRNG, hence the results are usually correlated. The documentation of the NIST Test Suite includes some clues on how to interpret the results of their bundle (Section 4.2 in Rukhin2010 ), but in the introduction it is frankly stated: “It is up to the tester to determine the correct interpretation of the test results”.
To disclose flaws of PRNGs, a very long sequence is often required. In such situations, the applicability of a statistical test can be limited (depending on the test statistic) by the memory size of the computer. An alternative approach is to use a so-called two level (a term used e.g., in LEcuyer92 ) or second level (a term used e.g., in PareschiRS07 ; Pareschi2007 ) test. In this approach we take into account several results from the same test over disjoint sequences generated by a PRNG. We obtain several p-values, which under H0 are uniformly distributed; this uniformity is then tested by e.g., some goodness-of-fit test (with a potentially different level of significance, such as the one suggested by NIST), obtaining a new “final” p-value. The authors in Ecuyer2002_Sparse observed that this method may be comparable to a first level test in terms of the power of the test (informally speaking, the probability of correctly rejecting H0 under an alternative hypothesis; see Ecuyer2002_Sparse for details), but often it produces much more accurate results, as shown in PareschiRS07 . Roughly speaking, the accuracy is related to the ability, given a non-random PRNG, of recognizing its sequences as non-random (for details see PareschiRS07 ). We will follow this approach.
In the second level approach one has to take into consideration the approximation errors in the computation of a p-value. For example, in a first level test one usually calculates a p-value of a statistic which – under H0 – is approximately normally distributed. The approximation then comes from the central limit theorem, which lets us substitute the distribution of a given sum with the standard normal distribution. These errors in the calculations of individual p-values may accumulate, resulting in an error of the p-value in a second level test, thus making the test unreliable. Following PareschiRS07 we say that the second level test is not reliable when, due to errors or approximations in the computation of p-values (in the first level), the distribution of p-values is not uniform under H0. Fortunately, this approximation error can be bounded using the Berry-Esseen inequality, and the final error of a second level test can be controlled (see PareschiRS07 ; Pareschi2007 for a detailed example based on the binary matrix rank test). The influence of approximations on the computation of p-values in a second level test was also considered in Matsumoto2002 ; Leopardi2009 . In this article we present a statistical test based on the arcsine law, in which at some point we approximate the distribution of some random variable with the arcsine distribution. We provide a Berry-Esseen type inequality which upper bounds the approximation error, which allows us to control the reliability of our second level test.
An interesting approach for testing PRNGs was presented by Kim et al. in Kim2008 . The concept of their tests is based on the properties of a random walk (the gambler’s ruin algorithm) on a cyclic group with 0 being an absorbing state – more precisely, on the time till absorption. The authors in Kim2008 propose three different variants of the test. The general idea of the basic procedure is the following. The output of a PRNG is treated as numbers from the unit interval and used to define a random walk on the cyclic group: depending on whether the current pseudorandom number exceeds a fixed threshold, the walk moves from its current state to one of the two neighboring states. The aim of this test is to compare the theoretical and the empirical distributions of the time to absorption in 0 when starting at a fixed state. Based on the value of the test statistic, the PRNG is then either accepted or rejected. The authors reported some “hidden defects” in the widely used Mersenne Twister generator. However, one has to be very careful when dealing with randomness. It seems that re-seeding a PRNG with a fixed seed was an error which led to wrong conclusions. The criticism was raised by Ekkehard and Grønvik in Ekke10 , where the authors also showed that the properly performed tests of Kim et al. Kim2008 do not reveal any defects in the Mersenne Twister PRNG. Recently, the authors in Lorek2017a proposed another gambler’s ruin based procedure for testing PRNGs. In their method they exploited formulas for winning probabilities for arbitrary parameter sequences, i.e., the winning and losing probabilities may depend on the current fortune.
In recent years a novel kind of testing technique has been introduced for more careful verification of generators. The core idea of this class of methods is based on the observation that the binary sequence produced by a PRNG, after being properly rescaled, can be interpreted as a one-dimensional random walk. For random walks defined by truly random binary sequences a wide range of statistics have been considered over the years and a variety of corresponding stochastic laws have been derived (see e.g., Feller1 ). For a good PRNG we may expect that its output will behave like a truly random walk. Hence, the following idea comes to mind: choose some probabilistic law that holds for truly random bit sequences and compare the theoretical distribution of the corresponding statistic with the empirical distribution calculated for sequences produced by a given PRNG in independent experiments. This comparison can be done e.g., by computing the p-value of an appropriate test statistic under the null hypothesis that the sequence generated by this PRNG is truly random.
Another concept, named statistical distance based testing, was suggested in Wang2015 . It relies on calculating some statistical distance, e.g., the total variation distance, between the theoretical and empirical distributions of considered characteristics, and rejecting a PRNG if the distance exceeds some threshold. We will also follow this approach, indicating the corresponding threshold. In Wang2015 the authors derive their test statistics from the law of iterated logarithm for random walks (the procedure is called the LIL test). The procedure we propose uses a similar methodology and is based on the arcsine law. We have made the code publicly available, see ArcsineTest_github . It includes the statistical tests based on the arcsine law as well as on the law of iterated logarithm, the implementation of many PRNGs (more than are described in this article) including the Flawed generator (see Section 4), and the seeds we used.
Organization of the paper
In Section 2 we define a general notion of a PRNG and recall the aforementioned stochastic laws for random walks. The testing method, along with the error analysis, is described in Section 3. A concise report on the experimental results (including the Flawed generator introduced in Section 4) is given in Section 5. In Section 6 we mention other implementations of tests based on the arcsine law. We conclude in Section 7.
2 Pseudorandom generators and stochastic laws for random walks
2.1 Pseudorandom generators
The intuition behind a pseudorandom number generator is clear. Nevertheless, let us give a strict definition, roughly following Asmussen and Glynn Asmussen2007 .
A pseudorandom number generator (PRNG) is a 5-tuple (S, V, s_0, f, g), where S is a finite state space, V is a set of values, s_0 ∈ S is a so-called seed, i.e., the initial state in the sequence of states, the function f : S → S describes the transition between consecutive states, and g : S → V maps the generator’s state into the output.
Usually V = (0, 1) or V = {0, 1, ..., m − 1} for some integer m; the latter is used throughout the paper. Recall that an LCG (linear congruential generator) is a generator which updates its state according to the formula x_{i+1} = (a x_i + c) mod m. Thus, it is defined by three integers: a modulus m, a multiplier a, and an additive constant c. In the case c = 0, the generator is called an MCG (multiplicative congruential generator). For a detailed description of some commonly used PRNGs see the surveys Kroese11 ; LEcuyer2017 ; Niederreiter_QuasiMonte or the book Woyczynski98 .
It is clear that both the input and the output of a random number generator can be viewed as a finite sequence of bits. For a PRNG to be considered good, the output sequences should have a particular property, namely each returned bit has to be generated independently, with equal probability of being 0 and 1. We say that a sequence of bits is truly random if it is a realization of a Bernoulli process with success probability 1/2.
Given a PRNG returning integers from the set {0, 1, ..., m − 1}, we may obtain a pseudorandom binary sequence of any given length using the following simple procedure. Namely, as long as the bit sequence is not sufficiently long, generate the next pseudorandom number and append its binary representation to the current content of the sequence. In the ideal model, with a truly random number generator, such an algorithm produces truly random bit sequences provided that m is a power of 2. Indeed, for m = 2^k there is a one-to-one correspondence between k-bit sequences and the set {0, 1, ..., m − 1}. Hence, if each number is generated independently with uniform distribution on {0, 1, ..., m − 1}, then each combination of bits is equally likely and therefore each bit of the output sequence is independent and equal to 0 or 1 with probability 1/2.
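For m a power of 2 the procedure above is plain bit concatenation. A minimal sketch in Python (the helper names `to_bits` and `bit_stream` are ours, not taken from the paper's code):

```python
def to_bits(x, k):
    """Binary representation of x on exactly k bits, most significant bit first."""
    return [(x >> (k - 1 - i)) & 1 for i in range(k)]

def bit_stream(next_number, k, length):
    """Concatenate the k-bit representations of consecutive PRNG outputs
    until `length` bits have been collected."""
    bits = []
    while len(bits) < length:
        bits.extend(to_bits(next_number(), k))
    return bits[:length]
```

For example, with k = 3 the outputs 5, 3 yield the bits 101 011; when the numbers are independent and uniform on {0, ..., 2^k − 1}, each bit of the stream is an independent fair coin flip.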
However, this is not true when m is not a power of 2. It is easy to observe that in such a case the generator is more likely to output 0s and the generated bits are no longer independent. Thus, rather than simply outputting the bits of the generated numbers, one may instead take the first few bits of the binary expansion of the corresponding fraction in the unit interval, for some fixed number of bits. Such a method has the advantage that it can be easily adopted for an underlying generator returning numbers from the unit interval, which is common for many PRNG implementations.
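The bias for m not a power of 2 can be made explicit by computing the exact per-position bit frequencies of the naive encoding; this small sketch (function name ours) does it for the smallest interesting case m = 3:

```python
from fractions import Fraction

def bit_position_probabilities(m):
    """Exact probability of emitting a 1 at each bit position when uniform
    numbers from {0, ..., m-1} are written on ceil(log2(m)) bits."""
    k = max(1, (m - 1).bit_length())
    counts = [0] * k
    for x in range(m):
        for i, b in enumerate(format(x, "0{}b".format(k))):
            counts[i] += int(b)
    return [Fraction(c, m) for c in counts]

# For m = 3 (codes 00, 01, 10) each position emits a 1 with probability
# only 1/3, so zeros are favored; for m = 4 both positions are fair.
```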
2.2 Stochastic laws for random walks
Let X_1, X_2, ... be a Bernoulli process with a parameter p, i.e., a sequence of independent random variables with identical distribution P(X_i = 1) = p = 1 − P(X_i = 0). A good PRNG should behave like a generator of a Bernoulli process with p = 1/2 (which we assume from now on). It will be, however, more convenient to consider the following transformed process

S_n = (2X_1 − 1) + (2X_2 − 1) + ... + (2X_n − 1), n = 1, 2, ...   (1)

The summands 2X_i − 1 are {−1, 1}-valued; the process S_1, S_2, ... is called a random walk.
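The rescaling (1) in code, as a short sketch (bit 1 is mapped to a +1 step, bit 0 to a −1 step):

```python
def random_walk(bits):
    """Partial sums S_1, ..., S_n of the steps 2*X_i - 1
    (bit 1 -> step +1, bit 0 -> step -1)."""
    walk, s = [], 0
    for b in bits:
        s += 2 * b - 1
        walk.append(s)
    return walk
```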
The law of iterated logarithm
Of course S_n ∈ [−n, n]. However, large values of |S_n| occur with small probability and in practice the values of S_n stay in a much narrower range than [−n, n].
The weak and the strong law of large numbers imply that S_n/n converges to 0 in probability and almost surely, respectively. Thus, the deviations of S_n from 0 grow much slower than linearly. On the other hand, the central limit theorem states that S_n/√n converges in distribution to the standard normal distribution, which is in some sense a lower bound on the fluctuations of S_n – they will leave any interval of order √n, since lim sup_{n→∞} |S_n|/√n = ∞ almost surely (implied by Kolmogorov’s 0-1 law, see e.g., Theorem 5.1 in Gut2005
). It turns out that the fluctuations can be estimated more exactly.
Theorem 2.2 (The law of iterated logarithm, Khintchine1924 , cf. also Chapter VIII.5 in Feller1 ).
For a random walk S_1, S_2, ... we have

lim sup_{n→∞} S_n/√(2n ln ln n) = 1 a.s. and lim inf_{n→∞} S_n/√(2n ln ln n) = −1 a.s.
Thus, to normalize S_n, dividing by n is too strong and dividing by √n is too weak. The fluctuations of S_n from 0 grow proportionally to √(2n ln ln n).
To depict the law of iterated logarithm, we took 500 output sequences from the Mersenne Twister MT19937 generator, each initialized with a random seed taken from http://www.random.org. In Figure 1 we present these 500 trajectories S_1, S_2, ..., S_n. Each trajectory is depicted by a single polyline. The darker the image, the higher the density of trajectories. We can see that √(2n ln ln n) roughly corresponds to the fluctuations of S_n. However, a few trajectories are still outside ±√(2n ln ln n) even after billions of steps. The law of iterated logarithm tells us that for appropriately large n the trajectories will not leave ±√(2n ln ln n) with high probability, which is not the case in Figure 1. It means that n must be much larger than the lengths used here.
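A much smaller-scale version of this experiment can be run directly in Python, whose `random` module is itself an MT19937 generator; the safety factor 3 below is an arbitrary margin for a finite run, not a statement of the theorem:

```python
import math
import random

random.seed(12345)          # Python's random module uses MT19937 underneath
n = 200_000
s, max_ratio = 0, 0.0
for i in range(1, n + 1):
    s += 2 * random.getrandbits(1) - 1      # one +/-1 step of the walk
    if i > 100:                             # envelope meaningful only for larger i
        envelope = math.sqrt(2 * i * math.log(math.log(i)))
        max_ratio = max(max_ratio, abs(s) / envelope)
# For a finite run we only expect the walk to stay within a small multiple of
# the envelope; occasional excursions beyond it are normal (cf. Figure 1).
```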
One could think that the following is a good test for randomness: fix some number and classify the considered PRNG as good if the difference between the numbers of ones and zeros never exceeds it. A large difference may suggest that zeros and ones have different probabilities of occurrence. However, the law of iterated logarithm tells us that this reasoning is wrong. Indeed, we should expect some fluctuations, and their absence means that a PRNG does not produce bits which can be considered random. This property of random walks was used by the authors in Wang2015 for designing a novel method of testing random number generators.
There is yet another interesting property. Define Y_n = S_n/√(2n ln ln n). The law of iterated logarithm implies that Y_n does not converge pointwise to any constant. However, it converges to 0 in probability. Let us fix some small ε > 0. For large n, with an arbitrarily high probability the process Y_n will not leave (−ε, ε). On the other hand, the law of iterated logarithm tells us that the process will be outside this interval infinitely many times. This apparent contradiction shows how unreliable our intuition can be about phenomena taking place at infinity.
The arcsine law
The observations described previously imply that, on average, the random walk will spend half of its time above the x-axis and half of its time below. However, the typical situation is counter-intuitive (at first glance): typically the random walk will either spend most of its time above or most of its time below the x-axis. This is expressed in Theorem 2.3 below (for reference see e.g., Feller1 ). Before we formulate the theorem, let us first introduce some notation. For a sequence S_1, S_2, ..., as defined in (1), let

A_i = 1{S_i > 0 or S_{i−1} > 0}, i = 1, ..., n,   (2)

where 1{·} is the indicator function. A_i is equal to 1 if the number of ones exceeds the number of zeros either at step i or at step i − 1, and 0 otherwise (in the case of a tie, i.e., S_i = 0, we look at the previous step). In other words, A_i = 1 corresponds to the situation in which the line segment of the trajectory of the random walk between steps i − 1 and i is above the x-axis.
Theorem 2.3 (The arcsine law).
Let X_1, X_2, ... be a Bernoulli process with p = 1/2. Define T_n = A_1 + ... + A_n and K_n = T_n/n (A_i is given in (2)). For x ∈ (0, 1) we have

lim_{n→∞} P(K_n ≤ x) = (2/π) arcsin √x.
The probability P(K_n ≤ x) is the chance that the random walk was above the x-axis for at most a fraction x of the time. The limiting distribution is called the arcsine distribution. Its density function is given by

f(x) = 1/(π √(x(1 − x))), x ∈ (0, 1),

and the cumulative distribution function (cdf) is F(x) = (2/π) arcsin √x. The shape of the pdf clearly indicates that the fractions of time spent above and below the x-axis are more likely to be unequal than close to each other.
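The limiting cdf and its finite-n counterpart are both easy to compute. In this sketch (helper names ours) we compare the exact distribution of the fraction of time above the axis, given by the discrete arcsine law P(T_{2n} = 2k) = u_{2k} u_{2n−2k} with u_{2k} = (2k choose k) 2^{−2k} (cf. Feller1 ), against F(x) = (2/π) arcsin √x:

```python
import math

def arcsine_cdf(x):
    """F(x) = (2/pi) * arcsin(sqrt(x))."""
    return 2.0 / math.pi * math.asin(math.sqrt(x))

def u(two_k):
    """u_{2k} = C(2k, k) * 2^(-2k): probability of being at 0 after 2k steps."""
    k = two_k // 2
    return math.comb(2 * k, k) * 0.25 ** k

def time_above_pmf(n):
    """Discrete arcsine law: P(T_{2n} = 2k) = u_{2k} * u_{2n-2k}, k = 0..n."""
    return [u(2 * k) * u(2 * n - 2 * k) for k in range(n + 1)]

# Exact P(T_{2n}/(2n) <= 1/2) for n = 500 versus the limit F(1/2) = 0.5.
pmf = time_above_pmf(500)
exact = sum(pmf[:251])
```

The U-shape of the density is visible in the pmf as well: the extreme values k = 0 and k = n are the most likely ones.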
3 Testing PRNGs based on the arcsine law
In this section we show how to exploit the theoretical properties of random walks from the preceding discussion to design a practical routine for testing PRNGs. We describe our approach based on the arcsine law, which we employ for the experimental evaluation of several commonly used generators (the results are presented in Section 5). We also perform an error analysis of the proposed testing procedure, providing corresponding bounds on the approximation errors. Finally, we make some remarks on the reliability of our second level test.
3.1 The arcsine law based testing
The general idea of the tests is the following. Take a sequence of n bits generated by a PRNG, rescale them as in (1) and compare the empirical distribution of K_n
(the fraction of time instants at which ones prevail over zeros) with its theoretical distribution assuming that truly random numbers were generated. In terms of hypothesis testing: given the null hypothesis H0 that the bits in the sequence were generated independently and uniformly at random (vs. the alternative H1 that the sequence was not randomly generated), the distribution of K_n follows the arcsine law (Theorem 2.3), i.e., we can conclude that for large n we have

P(K_n ≤ x) ≈ (2/π) arcsin √x   (3)

(we will be more specific on “≈” in Section 3.2). We follow the second level testing approach (cf. PareschiRS07 ; Pareschi2007 ), i.e., we take into account several results from the same test over different sequences. To test a PRNG we generate a number of sequences of length n each, thus obtaining that many realizations of the variable K_n. Calling each simulation a basic test, we then calculate the corresponding p-values
Under H0 the distribution of these p-values should be uniform on [0, 1]. We fix some partition of [0, 1] into intervals and count the number of p-values within each interval; in our tests we use a partition into intervals of equal length. For each interval of the partition we consider the uniform (theoretical) measure, the empirical measure, the expected number of p-values within the interval and the observed number of p-values within it.
We perform Pearson’s goodness-of-fit test, which uses the test statistic

χ² = Σ_i (O_i − E_i)² / E_i,

where O_i and E_i denote, respectively, the observed and the expected number of p-values in the i-th interval of the partition. Under the null hypothesis, χ² has approximately a chi-squared distribution with the number of degrees of freedom equal to the number of intervals minus one. We calculate the corresponding p-value P(Y ≥ χ²), where Y has this chi-squared distribution. Large values of χ² – and thus small p-values – let us suspect that a given PRNG is not good. Typically, we reject H0 (i.e., we consider the test failed) if the p-value is smaller than a predefined level of significance (for a second level test we use the value suggested by NIST). Note that the probability of rejecting H0 when the sequence is generated by a perfect random generator (the so-called Type I error) is exactly the significance level.
Another approach relies on statistical distance based testing, which is the technique presented in Wang2015 . We consider as the test statistic
the total variation distance between the theoretical distribution and the empirical distribution over the intervals of the partition. Similarly, large values of this distance indicate that a given PRNG is not good. Concerning the Type I error we will make use of the following lemma (see Lemma 3 in devroye1983 or its reformulation, Lemma 1 in Berend2012 ).
Assume and consider the partition . Then, for all we have
To summarize, for a given PRNG we generate a number of sequences of a fixed length each and we choose the partition of [0, 1]. We then calculate the chi-squared statistic together with its p-value, as well as the total variation distance. We specify the thresholds for the p-value and the distance indicating whether the test failed or not (the details are presented in Section 5). We denote the described procedure as the ASIN test.
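The whole pipeline can be sketched in a few dozen lines of Python. This is our illustrative implementation, not the paper's published code: we choose 11 bins so that the chi-squared tail with 10 (even) degrees of freedom has a closed form, and we take 1 − F(K_n) as the basic-test p-value; both choices are ours.

```python
import math
import random

def arcsine_cdf(x):
    """Limiting cdf of K_n: F(x) = (2/pi) * arcsin(sqrt(x))."""
    return 2.0 / math.pi * math.asin(math.sqrt(min(max(x, 0.0), 1.0)))

def fraction_above(bits):
    """K_n: fraction of steps whose chord lies above the axis (tie -> previous step)."""
    s_prev, above = 0, 0
    for b in bits:
        s = s_prev + 2 * b - 1
        if s > 0 or s_prev > 0:
            above += 1
        s_prev = s
    return above / len(bits)

def chi2_sf_even_df(x, df):
    """Exact P(X > x) for chi-squared with an EVEN number df of degrees of
    freedom: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    term, total = 1.0, 0.0
    for j in range(df // 2):
        total += term
        term *= (x / 2.0) / (j + 1)
    return math.exp(-x / 2.0) * total

def asin_test(bit_sequences, bins=11):
    """Second level test: bin the basic-test p-values, return the final
    chi-squared p-value and the total variation distance to uniformity."""
    pvals = [1.0 - arcsine_cdf(fraction_above(seq)) for seq in bit_sequences]
    observed = [0] * bins
    for p in pvals:
        observed[min(int(p * bins), bins - 1)] += 1
    expected = len(pvals) / bins
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    d_tv = 0.5 * sum(abs(o / len(pvals) - 1.0 / bins) for o in observed)
    return chi2_sf_even_df(chi2, bins - 1), d_tv

random.seed(2024)
sequences = [[random.getrandbits(1) for _ in range(4000)] for _ in range(330)]
p_second, d_tv = asin_test(sequences)
```

A small final p-value (or a large total variation distance) would flag the generator; the thresholds themselves are the subject of Section 5.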
Remark. Note that the described procedure for calculating the two statistics is equivalent to the following one. Instead of calculating the p-values of the basic tests, we could directly count the numbers of realizations of K_n falling into appropriate intervals and compare the resulting empirical distribution with the theoretical one, the latter being determined by the arcsine cdf. Then both statistics can be rewritten in terms of these counts.
We could also calculate just one p-value of the statistic for a single longer sequence – i.e., perform a first level test. However, as mentioned in Section 1, the second level approach produces more accurate results (roughly speaking, the accuracy is related to the ability, given a non-random PRNG, of recognizing its sequences as non-random; see details in PareschiRS07 ).
It is worth noting that the following approach can be applied when the chi-squared statistic or the distance is slightly outside the acceptance region, which suggests rejecting H0 but is not strong evidence. Namely, double the length of the sequence, take a new output from the PRNG and apply the test again. Repeat the procedure (at most some predefined number of times) until either the evidence is strong enough to reject H0 or H0 is accepted. This method, called “automation of statistical tests on randomness”, was proposed and analyzed in Haramoto2009 .
3.2 Error analysis
3.2.1 Bounding errors in approximating p-values in basic tests
In this subsection we show a bound on the approximation error in (3). Recall that K_n is the fraction of time the walk spends above the x-axis, as defined in Theorem 2.3.
Fix a partition of [0, 1] and an even n. Let F_n be the cdf of the empirical distribution of K_n under H0 (stating that the bits were generated uniformly at random). Then we have
Proof. We will show that for fixed and such that we have
Let us assume that n is even. Let P_{2k,2n} denote the probability that the random walk was above the x-axis during exactly 2k of the first 2n steps, i.e., P_{2k,2n} = P(T_{2n} = 2k). The classical results on a simple random walk state that

P_{2k,2n} = u_{2k} u_{2n−2k}, where u_{2j} = (2j choose j) 2^{−2j}.
The standard proof of Theorem 2.3 (see, e.g., Chapter XII.8 in Feller2 ) shows that the distribution of T_{2n}/(2n) converges to the arcsine distribution. In the following, we will bound the difference between the two. We will use a version of Stirling’s formula (Robbins’ bounds) stating that for each n there exists θ_n, 1/(12n + 1) < θ_n < 1/(12n), such that

n! = √(2πn) (n/e)^n e^{θ_n}.
Thus, we get
For any it holds that and for we have that . Note that , which is equivalent to , which holds for any . Hence,
Fix and assume furthermore that . The function achieves the minimum value at the endpoints of the considered interval, thus
We will estimate the approximation error in (3) in two steps. First, take two numbers such that . We have
The second kind of error in the probability estimates given by (3) is caused by approximating the sum by an integral. Let us consider an arbitrary function differentiable in the interval . Split into subintervals of length and let be an arbitrary point in the interval containing . Denote by and the maximum and the minimum value of on that interval, respectively. Using Lagrange’s mean value theorem we obtain
For we have and Hence, in the considered interval we have
We also have
Taking we obtain
which justifies the approximation (3) in the interior of the unit interval. To complete the analysis we need to investigate the errors “on the boundaries” of the unit interval, i.e., near 0 (and, by symmetry, near 1). We get
where the last inequality follows directly from the preceding calculations. ∎
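The version of Stirling's formula used in the proof is Robbins' two-sided bound; it is easy to verify numerically (a quick sanity check, not part of the argument):

```python
import math

def stirling_bounds(n):
    """Robbins' bounds: sqrt(2*pi*n)*(n/e)^n * e^(1/(12n+1)) < n! <
    sqrt(2*pi*n)*(n/e)^n * e^(1/(12n))."""
    base = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
    return base * math.exp(1.0 / (12 * n + 1)), base * math.exp(1.0 / (12 * n))

# The bounds bracket n! already for n = 1 (checked here up to n = 50).
for n in range(1, 51):
    lo, hi = stirling_bounds(n)
    assert lo < math.factorial(n) < hi
```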
Remark. Let X_1, X_2, ... be zero-mean i.i.d. random variables with variance σ² and finite third absolute moment ρ = E|X_1|³. Denote S_n = X_1 + ... + X_n. The central limit theorem states that the standard normal distribution (denote its cdf by Φ) is the limiting distribution of S_n/(σ√n) (denote its cdf by F_n). It means that for large n we can approximate F_n by Φ and the approximation error is bounded by the Berry-Esseen inequality

sup_x |F_n(x) − Φ(x)| ≤ C ρ/(σ³ √n),

where C is a positive constant (the original paper Esseen42 gave an explicit value of C, which was later substantially improved, e.g., in Tyurin2010 ). Lemma 3.5 is thus a Berry-Esseen type inequality for approximating the distribution of K_n by the arcsine distribution, tailored to our needs.
3.2.2 Reliability of the results from the second level test
Following Pareschi2007 , we say that a basic test (calculating a single p-value) is not reliable if, due to approximation errors in the computations of p-values, the distribution of the p-values for truly random numbers is not uniform. We test the uniformity via the chi-squared statistic and the total variation distance. Since we compare two continuous distributions, some discretization needs to be applied. In our testing procedure we use a partition of [0, 1] for this, splitting the interval into bins. Lemma 3.5 states that the maximum error in the computation of a p-value is bounded by a quantity which implicitly depends on n. It means that a p-value that should belong to a given bin can be found in the neighboring ones only if the distance between it and one of the endpoints of the bin is less than this error bound. Thus, this is also the fraction of p-values that can be found in wrong bins. The maximum propagated deviation is twice the error (since most bins have two neighbors), i.e.,
Under H0 the distribution of the numbers of p-values in the bins is multinomial. Indeed, this is equivalent to throwing balls independently into bins, with appropriate probabilities of choosing the first and the last bin and a common probability for all the remaining bins. The variance of the ratio of the number of balls in each bin can be computed explicitly; it determines the expected statistical deviation of the ratio of p-values found in a given bin. We expect that the error in approximating p-values propagates into an additional deviation. If this deviation is smaller than the statistical deviation, i.e., if
then we say that the second level test is reliable. Note that the reliability of a test imposes a restriction on the relation between the length of a sequence used for each basic test (i.e., n) and the number of basic tests. Inequality (6) implies a lower bound on n, namely
4 The Flawed PRNG
In this section we present a family of PRNGs which we call Flawed. The family depends on three parameters: a base PRNG (e.g., the Mersenne Twister), a small integer parameter, and a fraction of seeds. The Flawed generator produces the same output as the base PRNG for that fraction of all possible seeds. For each of the remaining seeds it outputs bits such that the corresponding walk spends exactly half of its time above zero and exactly half of its time below zero.
4.1 Dyck Paths
To generate walks with the aforementioned property we will use Dyck paths, i.e., walks starting and ending at 0 with the property that for each prefix the number of ones is not smaller than the number of zeros.
A sequence of 2n bits is called a Dyck path if the corresponding walk fulfills S_{2n} = 0 and S_i ≥ 0 for all i ≤ 2n. The set of all Dyck paths of length 2n is denoted by D_{2n}.
Thus, a Dyck path of length 2n corresponds to a valid grouping of n pairs of parentheses. We have |D_{2n}| = C_n, where C_n = (1/(n + 1)) (2n choose n) is the n-th Catalan number.
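This count is easy to check by brute force for small n (a sketch; helper names ours):

```python
from math import comb

def dyck_paths(two_n):
    """Enumerate all Dyck paths of length 2n as bit tuples (1 = up, 0 = down)."""
    n = two_n // 2
    paths = []
    def extend(path, height, ups):
        if len(path) == two_n:
            if height == 0:
                paths.append(tuple(path))
            return
        if ups < n:                       # we may still place an up-step
            extend(path + [1], height + 1, ups + 1)
        if height > 0:                    # never go below the axis
            extend(path + [0], height - 1, ups)
    extend([], 0, 0)
    return paths

def catalan(n):
    return comb(2 * n, n) // (n + 1)

# |D_{2n}| equals the n-th Catalan number, e.g. |D_10| = catalan(5) = 42.
```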
4.2 Sampling Dyck Paths
We are interested in generating Dyck paths uniformly at random. To achieve this goal we will use the following three ingredients.
(1) Walk sampling
Let B be the set of bit sequences of a given length such that the corresponding walk ends at a prescribed value. One can easily sample a sequence from B
uniformly at random – it is enough to apply a random permutation to the vector of bits consisting of the appropriate numbers of zeros and ones.
One can obtain a Dyck path from such a sequence using Algorithm 1.
Observe that Algorithm 1 transforms the sampled sequence into a Dyck path. This follows from simple observations:
has exactly zeros and ones;
after the removal, the sequence has exactly as many bits equal to 1 as bits equal to 0;
from the definition of the transformation, the walk that corresponds to the remaining bits cannot go below 0.
An example of the transformation is presented in Figure 2.
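Algorithm 1 is not reproduced here; the following sketch shows one standard way to sample Dyck paths uniformly, via the cycle lemma, which follows the same idea (sample a shuffled step sequence, then rotate and trim it into a Dyck path). Details may differ from the paper's Algorithm 1.

```python
import random

def sample_dyck_path(n, rng=random):
    """Uniform random Dyck path of length 2n (steps +1/-1) via the cycle lemma:
    shuffle n up-steps and n+1 down-steps (the walk ends at -1), rotate the
    sequence to start right after the first minimum of its prefix sums, and
    drop the final down-step."""
    steps = [1] * n + [-1] * (n + 1)
    rng.shuffle(steps)
    prefix, s = [], 0
    for step in steps:
        s += step
        prefix.append(s)
    cut = prefix.index(min(prefix)) + 1   # position right after the first minimum
    rotated = steps[cut:] + steps[:cut]
    return rotated[:-1]                   # 2n steps, starts and ends at level 0

def is_dyck(path):
    """Check: never below the axis and back at 0 at the end."""
    s = 0
    for step in path:
        s += step
        if s < 0:
            return False
    return s == 0
```

Each of the 2n + 1 distinct rotations of a shuffled sequence yields the same Dyck path, and exactly one rotation is valid, so the sampler is uniform over D_{2n}.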