1 Introduction
Distribution testing is a fundamental statistical problem that arises in a wide range of practical applications. At its core, the problem is to assess whether a dataset that is assumed to comprise samples from a known probability distribution is in fact consistent with that assumption. For example, if the end state of a computer simulation of a physical system is a set of points with an expected physics-prescribed distribution, then any detected deviation from that expected distribution could undermine confidence in the results obtained, and possibly in the integrity of the simulation system itself.
Data integrity verification is a related application for distribution testing in which the objective is to detect evidence of tampering, e.g., human-altered data. For example, many sources of numerical data produce numbers with first digits conforming to the Benford-Newcomb first-digit distribution^1 [1, 2], while digits other than the first and last are uniformly sampled from {0, 1, ..., 9} [3]. Digits in human-created numbers, by contrast, tend to exhibit high regularity, with all elements of {0, 1, ..., 9} represented with nearly equal cardinality. Statistically identified deviations of this kind have been used to uncover acts of scientific misconduct and accounting fraud [4, 5, 6, 7, 8, 9], but there is an increasing need for higher-sensitivity tests.
^1 This phenomenon is often referred to as "Benford's Law".
Manuscript received April 19, 2005; revised September 17, 2014. Published: Statistics Research Letters, vol. 4, pp. 11-17, 2015.
There is of course no way to make an unequivocal binary assessment of whether a dataset of samples conforms to a given distribution assumption, but it is possible to devise statistical tests which assign a rigorous likelihood estimate to the hypothesis that the dataset does (or does not) represent samples from the assumed distribution. In this paper we briefly review the most widely used method for distribution testing, the chi-square (χ²) test, and then develop alternative tests based on the statistics of gap widths between data items of consecutive rank. Our principal contribution is a max-gap test which is shown to provide superior sensitivity to regularity deviations from a uniform distribution that are relevant to data integrity testing [10, 11, 12]. We show that this test can be evaluated with the same optimal computational complexity (serial and parallel) as the conventional χ² test and is therefore suitable for extremely large-scale datasets.

2 Chi-square Test
The χ² test is a statistical measure that can be applied to a discrete dataset to assess the hypothesis that its elements were sampled from a particular distribution. More specifically, it is a histogram-based method for measuring the goodness-of-fit between the observed frequency distribution and the expected (theoretical) frequency distribution. The general procedure of the test includes the following steps:

Calculate the chi-square statistic, χ², which is a normalized sum of squared differences (deviations) between observed and expected frequencies.

Determine the degrees of freedom, df, of that statistic, which is essentially the number of frequencies reduced by the number of parameters of the fitted distribution.
An example of the complement of the cumulative distribution function of the χ² distribution is shown in Fig. 1 for several degrees-of-freedom values. For uniformity testing the procedure can be expressed as follows:

Given n observations, construct a k-bin histogram. Let O_i be the count for the i-th bin (1 ≤ i ≤ k); the O_i constitute the observed frequency distribution. As we are testing for uniformity, the expected frequency distribution is E_i = n/k.

Compute the chi-square test statistic:

    χ² = Σ_{i=1}^{k} (O_i - E_i)² / E_i.    (1)
The number of degrees of freedom is df = k - 1 for this case because the counts of any k - 1 bins, together with n, uniquely determine the count of the remaining bin.

Compute the complement of the cumulative distribution function of the χ² distribution with the statistic and df obtained from the previous steps. Compare this value with the significance level α to obtain the test result.
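The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name chi_square_uniform is ours, and for brevity the final step compares the statistic against a tabulated critical value (16.919 for df = 9 at the 5% level, from a standard χ² table) rather than evaluating the CDF complement directly.

```python
import random

def chi_square_uniform(data, k):
    """Chi-square statistic for uniformity of samples in [0, 1) with k bins."""
    n = len(data)
    observed = [0] * k
    for x in data:
        observed[min(int(x * k), k - 1)] += 1   # bin index, clamped at k-1
    expected = n / k                             # E_i = n/k under uniformity
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, k - 1                           # statistic and df = k - 1

random.seed(0)
stat, df = chi_square_uniform([random.random() for _ in range(10000)], 10)
# With df = 9 the 5% critical value is 16.919; truly uniform data should
# fall below it about 95% of the time.
```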
Despite being the de facto standard for assessing dataset consistency with respect to a given distribution assumption, the χ² test is not optimally sensitive to the types of deviation from uniformity that arise in many data integrity applications. One example involves narrowband missing data resulting from a corrupted sensor or measurement process. Another example involves data that is generated from a non-random process and exhibits a higher degree of regularity than is expected for a uniform distribution [14, 15]. Datasets of the latter kind are typical of artificial and human-generated data, e.g., a forged dataset that has been tailored to include deviations that qualitatively resemble (to humans) uniform random deviates. In the following section we demonstrate the advantage of the proposed max-gap test over χ² for narrowband and high-regularity deviations from uniformity.
3 Max-gap Test
The maximum gap, or max-gap, for a dataset of real values is defined as the maximum difference between elements of consecutive rank, which can be determined from a sorted ordering of the dataset. The distribution of spacings between consecutive-rank items in a dataset has been examined in the literature [16, 17, 18, 19], and we summarize here some of the results relevant to gap analysis. Assume we are given n observations on the open unit interval (0, 1) which divide the interval into n + 1 intervals whose lengths in ascending order are denoted by D_(1) ≤ D_(2) ≤ ... ≤ D_(n+1). For uniformity testing we are interested in D_(n+1), as it is the max-gap of the observations. The exact distribution of D_(n+1) is [19]:
    P(D_(n+1) ≤ x) = Σ_{j=0}^{n+1} (-1)^j C(n+1, j) (1 - jx)_+^n,    (2)

where (y)_+ = max(y, 0).
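The finite sum in Eqn. (2) is straightforward to evaluate directly. A minimal sketch (the function name max_gap_cdf is ours):

```python
from math import comb

def max_gap_cdf(x, n):
    """Exact P(max gap <= x) for n uniform points on (0, 1), i.e. for the
    largest of the n+1 spacings:
        F(x) = sum_{j=0}^{n+1} (-1)^j C(n+1, j) max(0, 1 - j*x)^n.
    """
    return sum((-1) ** j * comb(n + 1, j) * max(0.0, 1.0 - j * x) ** n
               for j in range(n + 2))
```

As a sanity check, for a single point (n = 1) the max gap is max(u, 1 - u), so F(x) = 2x - 1 on [1/2, 1], which the formula reproduces.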
From the p-value of the max-gap D_(n+1), denoted by p, we can perform a max-gap test for uniformity by checking the condition p ≥ α for the one-sided test, or α/2 ≤ p ≤ 1 - α/2 for the two-sided test, where α is the significance level. When n is large we may replace computation of the exact cumulative distribution of the max-gap in Eqn. (2) with the following asymptotic result [19]:
    P((n+1) D_(n+1) - ln(n+1) ≤ x) → exp(-e^{-x})  as n → ∞,    (3)
where the expected value of D_(n+1) is approximately

    E[D_(n+1)] ≈ (ln(n+1) + γ) / (n+1),    (4)

where γ ≈ 0.5772 is Euler's constant.
An efficient max-gap test for uniformity can then be formalized as follows: given n observations X_1, ..., X_n on (0, 1) and a significance level α, compute the max-gap g of the observations. Next, the p-value of the statistic is calculated as:
    p = exp(-e^{-((n+1)g - ln(n+1))}).    (5)
If the p-value satisfies p ≥ α for the one-sided test, or α/2 ≤ p ≤ 1 - α/2 for the two-sided test, the observations are deemed to pass the test. Otherwise the set of observations is assessed to be inconsistent with a uniform-sampling hypothesis and fails the test.
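The one-sided form of the test, using the Gumbel asymptotic of Eqn. (3), can be sketched as follows. This is our illustration (the function name max_gap_p_lower is ours); it reports the lower-tail probability P(max gap ≤ g), the tail relevant to detecting anomalous regularity, where a small value fails the one-sided test.

```python
import math

def max_gap_p_lower(obs):
    """Lower-tail p-value P(max gap <= g) under uniformity on (0, 1),
    via the Gumbel asymptotic exp(-e^{-t}) with t = (n+1)g - ln(n+1).
    A small value flags anomalously regular (too evenly spaced) data."""
    xs = sorted(obs)
    n = len(xs)
    pts = [0.0] + xs + [1.0]                      # spacings include endpoints
    g = max(b - a for a, b in zip(pts, pts[1:]))  # observed max gap
    t = (n + 1) * g - math.log(n + 1)
    return math.exp(-math.exp(-t))

# Perfectly regular data: the max gap equals the minimum possible spacing,
# so the lower-tail p-value is essentially zero and the test fails.
regular = [(i + 1) / 101 for i in range(100)]
p = max_gap_p_lower(regular)
```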
In the next section, we present results of experiments comparing the relative sensitivities of the χ² test and the max-gap test for, e.g., identifying anomalous regularity in a presumed-uniform distribution.
4 Experiments
In this section, we compare the max-gap test against the well-known and commonly used χ² test. We conducted four experiments, with the result for each experiment obtained as an average of one million independent tests. Sensitivity is assessed by comparing the respective p-values for the one-sided forms of the two tests, where smaller values indicate greater sensitivity. The first experiment was performed using a dataset of samples from a true uniform distribution. As expected, the dataset passed both tests for uniformity.
The second experiment examined sensitivity to the difference between a uniform distribution and a normal distribution with standard deviation σ, sampled within a fixed interval. The distinctive shape of the normal distribution is realized within the interval when σ is small but flattens with increasing σ and approaches uniformity. Both tests are equally sensitive for small σ, and both lose sensitivity for large σ, but the χ² test exhibits higher sensitivity for intermediate values (see Fig. 2). The latter is not surprising because the χ² test is ideally sensitive to this type of deviation.
The third experiment examined the sensitivity of the two tests to a uniform distribution with a narrowband exclusion (Fig. 3). This of course is a problem for which the max-gap test is ideally suited, and Eqn. (4) explains its superior sensitivity. What is possibly most interesting about the results is that the χ² test provides only modest sensitivity even as the exclusion width approaches one percent of the distribution window.
The fourth experiment is the most relevant for data integrity applications. It examined sensitivity to regularity in sample spacing. Anomalous distribution regularity is a common characteristic of human-altered data because people typically underestimate the degree of natural "clustering" that is present in data sampled from a truly uniform distribution. As a consequence, human-created or human-altered data tends to have higher regularity, i.e., tends to be "more evenly distributed", than what is expected for uniformly distributed data. More generally, high-regularity deviations from uniformity can arise from the unanticipated influence of a structured or non-random process, e.g., frequency-combing effects from a physical sensor or simulation artifacts resulting from a low-quality pseudorandom number generator.
A regularity parameter k was used for this experiment: n/k samples are uniformly distributed within each of k equal-width subdivisions of the distribution interval. Thus k = 1 represents a uniform sampling over the entire interval and produces a uniform distribution, and as k increases toward n the spacing between samples becomes increasingly regular. Although uniform and high-regularity distributions are difficult for humans to distinguish visually, Fig. 4 shows that the max-gap test provides significantly higher sensitivity than χ² to subtle regularity deviations from uniformity.
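The sampling scheme of this experiment is easy to reproduce. A sketch (the function name regular_uniform and the fixed seed are ours, for illustration only):

```python
import random

def regular_uniform(n, k, seed=0):
    """Draw n samples with regularity parameter k: the unit interval is split
    into k equal-width cells and n/k samples are drawn uniformly within each
    cell.  k = 1 is ordinary uniform sampling; k = n pins one sample per cell."""
    assert n % k == 0, "k must divide n"
    rng = random.Random(seed)
    w = 1.0 / k
    return [c * w + rng.random() * w for c in range(k) for _ in range(n // k)]

samples = regular_uniform(100, 100)   # maximally regular: one sample per cell
```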
5 Min-gap Test
The one-sided variants of the max-gap and χ² tests were used because they provide a practical balance between high sensitivity and low false-alarm rates, but either the one-sided or the two-sided form of either test may provide the optimal tradeoff for a particular application. In some applications the optimal tradeoff might be obtained from a min-gap test based on the smallest spacing, D_(1). The asymptotic distribution of the min-gap is given by [19]
    P(n(n+1) D_(1) ≤ x) → 1 - e^{-x}  as n → ∞,    (6)
and its expected value is [19]

    E[D_(1)] = 1 / (n+1)².    (7)
A min-gap test can be defined and performed analogously to the max-gap test and would be ideally suited for detecting spuriously replicated data items. However, simpler non-statistical methods can be applied to detect replicated data, so the potential applications of the min-gap test may be somewhat more limited than those of the max-gap test.
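For the min-gap, the exact tail is simple enough to use directly: a standard spacings result, consistent with (6) and (7), is that all n + 1 spacings exceed x with probability (1 - (n+1)x)^n for x ≤ 1/(n+1). A sketch of the resulting test statistic (the function name min_gap_p is ours):

```python
def min_gap_p(obs):
    """Exact P(min gap <= d) for n uniform points on (0, 1): the n+1
    spacings all exceed d with probability (1 - (n+1)d)^n, so
    p = 1 - (1 - (n+1)d)^n.  A tiny p flags a suspiciously close pair,
    e.g. a near-duplicated data item."""
    xs = sorted(obs)
    n = len(xs)
    pts = [0.0] + xs + [1.0]                      # spacings include endpoints
    d = min(b - a for a, b in zip(pts, pts[1:]))  # observed min gap
    return 1.0 - max(0.0, 1.0 - (n + 1) * d) ** n
```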
6 Computational Considerations
In terms of computational complexity, both the χ² and max-gap tests can be evaluated in optimal O(n) time and space. This complexity is achieved for the max-gap by use of the Gonzalez algorithm [20, 21] to determine the max-gap in linear time without sorting. The Gonzalez algorithm performs a special binning which guarantees, by the pigeonhole principle, that the max-gap data items will be found as the maximum and minimum values, respectively, in consecutive nonempty bins. This algorithm allows the max-gap test to be evaluated in optimal O(n) time and space, i.e., the same as χ², and is as efficiently parallelizable^2 as the χ² test.
^2 The max-gap and χ² tests are both highly amenable to parallelization, with O(n/p) time complexity on p processors.
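The bucketing idea can be sketched as follows for the unit interval (a Gonzalez-style sketch under our own assumptions, not the original formulation; the function name max_gap_linear is ours). With the endpoints 0 and 1 included there are n + 2 items and n + 1 gaps, so the max gap is at least the bucket width 1/(n+1), and by the pigeonhole principle its two endpoints fall in different buckets; tracking only each bucket's minimum and maximum therefore suffices.

```python
def max_gap_linear(obs):
    """Largest of the n+1 spacings induced by n points on [0, 1], in O(n)
    time without sorting: bucket the n+2 endpoints into n+1 equal-width
    buckets, then scan consecutive nonempty buckets, comparing each bucket's
    minimum against the previous nonempty bucket's maximum."""
    pts = [0.0] + list(obs) + [1.0]
    m = len(pts) - 1                       # number of buckets = n + 1
    lo = [None] * m
    hi = [None] * m
    for x in pts:
        b = min(int(x * m), m - 1)         # bucket index, clamped for x = 1.0
        lo[b] = x if lo[b] is None else min(lo[b], x)
        hi[b] = x if hi[b] is None else max(hi[b], x)
    gap, prev_hi = 0.0, 0.0
    for b in range(m):
        if lo[b] is not None:              # skip empty buckets
            gap = max(gap, lo[b] - prev_hi)
            prev_hi = hi[b]
    return gap
```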
7 Discussion and Future Work
We have defined and developed a max-gap test for distinguishing deviations from uniformity in a 1D dataset of size n. By using Gonzalez's algorithm we have shown that this test can be performed with efficiency commensurate, both serially and in parallel, with the conventional χ² test. Our experiments demonstrate that the max-gap test provides improved sensitivity in two particular applications of relevance to data integrity verification. More generally, the proposed max-gap and min-gap tests are of potential value as alternatives or complements to the χ² test for distribution testing and discrimination.
There are many statistical tests for equality of distributions beyond the χ² test, such as the Kolmogorov-Smirnov test [24, 25, 26] and the Cramér-von Mises test [27]. Of course there can be no test that is uniformly superior to all others for all possible distributions, but it appears that most of the standard tests examined in the literature would be challenged, much as the χ² test is, to distinguish uniform from regularly distributed data.
Potential future work could consider tests which jointly combine gap and χ² statistics into a more sophisticated single test [28], allowing greater flexibility to optimize the sensitivity/false-alarm tradeoff for problems of high practical interest, e.g., big data analytics and integrity verification. On the algorithmic side, the Gonzalez algorithm does not generalize to higher dimensions; however, relatively efficient subquadratic algorithms do exist for solving the largest empty circle and largest empty rectangle problems in two dimensions [29, 30]. Tests on 2D distributions could also potentially exploit information about the largest empty region of a Voronoi decomposition or the distribution of nearest-neighbor distances from a Delaunay triangulation. In more than two dimensions it may be possible to devise gap-related statistical tests based on results from efficient algorithms for identifying approximations to the largest empty sphere or rectangle, but this is purely speculative. In higher dimensions it may be better to abandon gap-type statistics and focus on statistics gleaned from efficiently computable k-d and orthant (quad, octant, etc.) tree decompositions of point sets.
If computational efficiency is less of a concern, a perhaps more fruitful direction for highly sensitive distribution testing in high dimensions is to examine the length of the Euclidean minimum spanning tree (EMST) for a dataset. The expected length of the EMST of uniformly distributed points can be determined using analysis similar to what has been described in this paper for estimating the expected values of the max and min gaps in 1D, and we conjecture that EMST length is likely to be more sensitive to many practically important types of deviations from uniformity than the conventional χ² test. Such an EMST test would be computationally expensive (though subquadratic), but this cost could be justified in applications for which subtle deviations are critically important, e.g., high-fidelity physics simulations.
References
 [1] M. Nigrini and J. Wells, Benford’s Law: Applications for Forensic Accounting, Auditing, and Fraud Detection, ser. Wiley Corporate F&A. Wiley, 2012. [Online]. Available: http://books.google.com/books?id=FdRPh787I7oC
 [2] C. Winter, M. Schneider, and Y. Yannikos, “Modelbased digit analysis for fraud detection overcomes limitations of benford analysis,” Seventh International Conference on Availability, Reliability and Security, vol. 0, pp. 255–261, 2012.
 [3] S. Dlugosz and U. Müller-Funk, “The value of the last digit: Statistical fraud detection with digit analysis,” Advances in Data Analysis and Classification, vol. 3, no. 3, pp. 281–290, 2009. [Online]. Available: http://dx.doi.org/10.1007/s1163400900485
 [4] R. Pirracchio, M. RescheRigon, S. Chevret, and D. Journois, “Do simple screening statistical tools help to detect reporting bias?” Annals of Intensive Care, vol. 3, no. 1, 2013. [Online]. Available: http://dx.doi.org/10.1186/21105820329
 [5] R. J. Bolton and D. J. Hand, “Statistical fraud detection: A review,” Statistical Science, vol. 17, no. 3, pp. 235–255, August 2002. [Online]. Available: http://dx.doi.org/10.1214/ss/1042727940
 [6] N. Kingston and A. Clark, Test Fraud: Statistical Detection and Methodology, ser. Routledge Research in Education. Taylor & Francis, 2014. [Online]. Available: http://books.google.com/books?id=3fzpAwAAQBAJ
 [7] M. Nigrini, Forensic Analytics: Methods and Techniques for Forensic Accounting Investigations, ser. Wiley Corporate F&A. Wiley, 2011. [Online]. Available: http://books.google.com/books?id=ct9CB4eJCXYC
 [8] A. Diekmann, Methodological Artefacts, Data Manipulation and Fraud in Economics and Social Science, ser. Jahrbücher für Nationalökonomie und Statistik. Lucius & Lucius, 2011. [Online]. Available: http://books.google.com/books?id=vzJlczjAz4sC
 [9] L. Leemann and D. Bochsler, “A systematic approach to study electoral fraud,” Electoral Studies, vol. 35, no. 0, pp. 33–47, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0261379414000390
 [10] Y. Fujii, “The analysis of 168 randomised controlled trials to test data integrity,” Anaesthesia, vol. 67, no. 6, pp. 669–670, 2012. [Online]. Available: http://dx.doi.org/10.1111/j.13652044.2012.07189.x
 [11] S. AlMarzouki, S. Evans, T. Marshall, and I. Roberts, “Are these data real? statistical methods for the detection of data fabrication in clinical trials,” BMJ, vol. 331, no. 7511, pp. 267–270, 2005.
 [12] U. Simonsohn, “Just post it: The lesson from two cases of fabricated data detected by statistics alone,” Psychological Science, vol. 24, no. 10, pp. 1875–1888, 2013.
 [13] M. Haggstrom. (2010) Complement of chi-square cumulative distribution. [Online]. Available: http://en.wikipedia.org/wiki/File:Chisquare_distributionCDFEnglish.png
 [14] J. H. Pitt and H. Z. Hill, “Statistical detection of potentially fabricated data: A case study,” ArXiv eprints, November 2013.
 [15] H. Z. Hill and J. H. Pitt, “Failure to replicate: A sign of scientific misconduct?” Publications, vol. 2, no. 3, pp. 71–82, 2014. [Online]. Available: http://www.mdpi.com/23046775/2/3/71
 [16] D. A. Darling, “On a class of problems related to the random division of an interval,” The Annals of Mathematical Statistics, vol. 24, no. 2, pp. 239–253, June 1953.
 [17] R. Pyke, “Spacings,” Journal of the Royal Statistical Society, pp. 395–449, 1965.
 [18] ——, “Spacings revisited,” pp. 417–427, 1972.
 [19] L. Holst, “On the lengths of the pieces of a stick broken at random,” Journal of Applied Probability, pp. 623–634, September 1980.
 [20] T. Gonzalez, “Algorithms on sets and related problems,” Department of Computer Science, University of Oklahoma, Norman, OK, Tech. Rep., 1975.
 [21] ——, “Clustering to minimize the maximum intercluster distance,” Theoretical Computer Science, vol. 38, pp. 293–306, 1985.
 [22] M. Golin, R. Raman, C. Schwarz, and M. Smid, “Simple randomized algorithms for closest pair problems,” Nordic Journal of Computing, vol. 2, no. 1, pp. 3–27, March 1995.
 [23] R. Lipton, “Rabin flips a coin,” in The P=NP Question and Gödel’s Lost Letter. Springer US, 2010, pp. 77–80.
 [24] Z. W. Birnbaum and F. H. Tingey, “Onesided confidence contours for probability distribution functions,” The Annals of Mathematical Statistics, vol. 22, no. 4, pp. 592–596, December 1951. [Online]. Available: http://dx.doi.org/10.1214/aoms/1177729550
 [25] W. Conover, Practical nonparametric statistics, ser. Wiley series in probability and statistics: Applied probability and statistics. Wiley, 1999. [Online]. Available: https://books.google.com/books?id=dYEpAQAAMAAJ
 [26] G. Marsaglia, W. W. Tsang, and J. Wang, “Evaluating Kolmogorov's distribution,” Journal of Statistical Software, vol. 8, no. 18, pp. 1–4, 2003. [Online]. Available: http://www.jstatsoft.org/index.php/jss/article/view/v008i18
 [27] H. Cramér, “On the composition of elementary errors,” Scandinavian Actuarial Journal, vol. 1928, no. 1, pp. 13–74, 1928. [Online]. Available: http://dx.doi.org/10.1080/03461238.1928.10416862
 [28] D. Maynes, “Combining statistical evidence for increased power in detecting cheating,” presentation given at the annual meeting of the National Council on Measurement in Education, San Diego, CA, April 2009.
 [29] B. Chazelle, R. L. Drysdale, and D. T. Lee, “Computing the largest empty rectangle,” SIAM Journal of Computing, vol. 15, no. 1, pp. 300–315, February 1986.
 [30] A. Naamad, D. Lee, and W. Hsu, “On the maximum empty rectangle problem,” Discrete Applied Mathematics, vol. 8, no. 3, pp. 267–277, 1984.