Review of Data Structures for Computationally Efficient Nearest-Neighbour Entropy Estimators for Large Systems with Periodic Boundary Conditions

by Joshua Brown et al.
Charles Sturt University

Information theoretic quantities are extremely useful in discovering relationships between two or more data sets. One popular method for estimating these quantities—particularly for continuous systems—is the family of nearest-neighbour estimators. When system sizes are very large or the systems have periodic boundary conditions, issues with performance and correctness surface; however, solutions are known for each problem. Here we show that these solutions are inappropriate in systems that simultaneously contain both features, and discuss a lesser known alternative solution, involving Vantage Point trees, that is capable of addressing both issues.




1 Introduction

Information Theory provides three useful metrics for determining relationships between data sets. Mutual Information (Shannon, 1948) measures the shared uncertainty between two variables and is often used as a marker for second-order phase transitions (Matsuda et al., 1996; Harré & Bossomaier, 2009; Wicks et al., 2007). Meanwhile, Transfer Entropy (Schreiber, 2000) and Global Transfer Entropy (Barnett et al., 2013) track the flow of information from one or more data sets onto another, with Barnett et al. demonstrating that the latter can be used as a predictor for an oncoming phase transition in the Ising (1925) spin model when system temperature is reduced over time.

If the underlying distribution for the data is known, these quantities can be straightforward to calculate. However, in general this distribution is unknown and can only be approximated from sampled data, thus requiring Entropy Estimators. Two well known approaches are the plug-in and nearest-neighbour (NN) estimators.

While simple, plug-in estimators tend to be less accurate, particularly for continuous variables (Ross, 2014), which are the focus of this paper. A high quality alternative is the Kozachenko-Leonenko entropy estimator (1987), which uses nearest neighbour statistics of realisations of a data series. Kraskov et al. (2004) extended this to estimate mutual information (MI), with transfer entropy (TE) and global transfer entropy (GTE) estimators further derived by Gómez-Herrero et al. (2015). For higher dimensional data, NN estimators become even more appropriate, as plug-in estimators experience a combinatorial explosion in storage requirements.

On the other hand, the time complexity of the plug-in estimator scales linearly in the number of points, N, while NN estimators—using a direct method, where distances between each and every point are considered—scale as O(N²), which becomes significant for large N. This problem is widely encountered in the fields of computer graphics and collision detection in computer games. The accepted solution in these scenarios is a spatial partitioning data structure—such as the well-known BSP Tree (Fuchs et al., 1980) or KD Tree (Bentley, 1975)—providing a “broad-phase” collision detection step, which can discard large swathes of points from consideration with relatively little computation.

KD Trees are directly applicable to the nearest neighbour searches required by the NN estimators, and are commonly applied when data sets are large. However, when the data lies in a space with periodic boundaries—e.g., headings of observed objects, which wrap around at the ends of their range—the basic KD Tree fails. While there are modifications to overcome this, it is shown later that a more appropriate data structure is ultimately needed, and is available in the Vantage Point (VP) Trees developed—independently—by Uhlmann (1991) and Yianilos (1993).

This paper serves to highlight the relative computational performance of the VP Tree compared with a modified KD Tree for the specific use case of measuring MI, TE and GTE of large periodic systems via a NN estimator. This use case also presents a unique constraint in that the search structures are typically constructed and used just once—meaning construction costs are just as important as search costs.

To establish the accuracy of the VP Tree approach, tests are performed against periodic and aperiodic canonical distributions as done by Kraskov et al. (2004). To further analyse the use of the VP Tree for large data sets with periodic boundary conditions, data sets—generated by the widely-used Vicsek model of self-propelled particles (Vicsek et al., 1995)—are analysed. Lastly, work by Gao et al. (2015) stresses that, for accurate estimation, the NN estimator requires a number of points scaling exponentially with the true value of the measured quantity. A decimation of the Vicsek data is employed to investigate this issue.

The three aforementioned information theoretic metrics are widely used in the scientific community. Mutual information has been used in many applications, including the study of phase transitions in financial markets (Harré & Bossomaier, 2009), while the range of applications for transfer entropy is building steadily, from the study of the physics of fundamental systems, such as the Ising spin model (Barnett et al., 2013), to cellular automata (Lizier et al., 2008), random Boolean networks (Lizier et al., 2012), and swarms (Wang et al., 2012). Ensuring the accuracy of the VP Tree method in periodic systems is important as it will allow processing of larger data sets with fewer computational resources—which is particularly important in the age of big data.

2 Methods

The simplest of the three quantities, MI, measures the amount of uncertainty common to two variables, say the headings of two particles in a simulation. The input for an MI estimator is a set of data points, (x_i, y_i), drawn from the two variables. TE instead measures the flow of information between two variables, that is, the reduction in uncertainty of Y given past knowledge of X and Y. TE estimators thus require a time series of data points, (x_t, y_t, y_{t+1}). Finally, GTE measures a similar statistic to TE, but is used in the case where Y is influenced by multiple variables X_1, …, X_n, and therefore the input to these estimators is a higher-dimensional data set, (x_{1,t}, …, x_{n,t}, y_t, y_{t+1}).
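As a concrete sketch of these input shapes, the following Python fragment (with hypothetical series names xs and ys, one entry per time step) assembles the point sets each estimator consumes:

```python
def mi_points(xs, ys):
    """MI input: simultaneous pairs (x_t, y_t)."""
    return list(zip(xs, ys))

def te_points(xs, ys):
    """TE input: triples (x_t, y_t, y_{t+1}) -- source past, target past,
    target next."""
    return [(xs[t], ys[t], ys[t + 1]) for t in range(len(ys) - 1)]

def gte_points(sources, ys):
    """GTE input: one higher-dimensional point per time step,
    (x_{1,t}, ..., x_{n,t}, y_t, y_{t+1})."""
    return [tuple(s[t] for s in sources) + (ys[t], ys[t + 1])
            for t in range(len(ys) - 1)]

print(te_points([1, 2, 3], [4, 5, 6]))  # [(1, 4, 5), (2, 5, 6)]
```

Note the TE and GTE inputs lose one time step, as each point pairs a state with its successor.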

The standard forms of the three metrics are (Shannon, 1948; Schreiber, 2000; Barnett et al., 2013):

I(X;Y) = Σ_{x,y} p(x, y) log [ p(x, y) / ( p(x) p(y) ) ] ,

T_{X→Y} = Σ p(y_{t+1}, y_t, x_t) log [ p(y_{t+1} | y_t, x_t) / p(y_{t+1} | y_t) ] ,

G = (1/N) Σ_i T_{W_i → Y_i} ,

where y_{t+1} denotes the state of the target at the following time step. MI and TE are pairwise quantities, and as such can be calculated using just two variables—and further averaged over every pairwise combination of variables if the data set contains many interacting variables. GTE on the other hand is a global quantity, so each random variable Y_i is considered in turn as the target variable. W_i is thus the set of all influencing variables on the chosen target, Y_i.

Throughout this work, the variables are drawn from a variety of sources where: (a) analytic results are available, (b) prior results for the metrics are available, or (c) statistics are unknown and hard to estimate. A collection of canonical distributions with analytic results and distributions with prior results is provided in Tables 1 and 2. Amongst these distributions is the ubiquitous Gaussian distribution, as well as the Von Mises distribution, which is accepted as the circular analogue of the Gaussian distribution. The Vicsek Model (described in §2.2) is used to generate difficult-to-estimate data.

There are some optimisations available for GTE depending on the nature of the data. If the variables of the data set are indistinguishable, then GTE can be calculated as a single ensemble calculation—treating it as one multivariate calculation rather than one calculation per target variable. This technique is employed in §3.1. A second such optimisation is dependent on how variables influence each other. In the case of the Vicsek model used later, variables influence one another in an aggregate fashion from which a consensus variable can be constructed. Thus GTE can be calculated as a 3-dimensional TE, rather than the above higher-dimensional form. This reduction is utilised in §3.2-3.3.

2.1 Nearest neighbour method

The plug-in estimation approach uses these data sets to determine the underlying distribution, often via a histogram. As mentioned in the introduction, once this distribution is revealed it is often trivial to solve the above equations. However, working with continuous data presents certain issues for traditional plug-in histogram entropy estimation. For example, tuning the bin width parameter for discretising the continuous data is sensitive to the number of points available, which can introduce error at high or low noise if the bin width is too low or high, respectively. While adaptive binning can mitigate this somewhat, it is not always an option, particularly when the data covers the entire domain.
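A minimal sketch of the bin-width problem, assuming Gaussian test data and hand-picked illustrative widths (not parameters from this paper):

```python
import math, random

def histogram_entropy(samples, bin_width, lo, hi):
    """Plug-in differential entropy from a fixed-width histogram:
    H ~ -sum_j p_j * log(p_j / w), with w the realised bin width."""
    nbins = max(1, round((hi - lo) / bin_width))
    w = (hi - lo) / nbins
    counts = [0] * nbins
    for s in samples:
        j = min(max(int((s - lo) / (hi - lo) * nbins), 0), nbins - 1)
        counts[j] += 1
    n = len(samples)
    return -sum(c / n * math.log((c / n) / w) for c in counts if c)

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(5000)]
# True differential entropy of N(0, 1): 0.5 * log(2*pi*e) ~ 1.4189 nats.
for width in (0.01, 0.25, 2.0):   # too fine / moderate / too coarse
    print(width, round(histogram_entropy(data, width, -5.0, 5.0), 3))
```

The moderate width tracks the true entropy; overly fine bins suffer from sparse counts and overly coarse bins from discretisation error.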

Instead, one can use Nearest-Neighbour Entropy Estimators. These estimators are based on the (continuous) statistics of the nearest neighbours to each data point and the distribution of points along marginal dimensions, as introduced by Kozachenko & Leonenko (1987). In these methods, the max-norm distance, ε_i, is found between each point i and its k-th nearest neighbour (kNN search). A fixed range (FR) search is then performed to count the number of points within the (hyper-)“ball” defined by ε_i around point i along the marginal spaces. These values are then used to compute the above information theoretic quantities. See Fig. 1 for an example of collecting nearest neighbour information.

Kraskov et al. (2004) provide the nearest neighbour estimator for MI as:

I(X;Y) = ψ(k) + ψ(N) − ⟨ψ(n_x + 1) + ψ(n_y + 1)⟩ ,

where ψ is the digamma function, n_x (respectively n_y) is the count of points within the distance ε_i from point i in the marginal space x (respectively y), and ⟨·⟩ is the average over all points.
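A direct O(N²) transcription of this estimator can serve as a reference implementation (a sketch, not the paper's C++ software; the digamma approximation is hand-rolled):

```python
import math, random

def digamma(x):
    """Rough digamma: recurrence up to x >= 6, then an asymptotic series."""
    r = 0.0
    while x < 6:
        r -= 1.0 / x
        x += 1.0
    return r + math.log(x) - 1/(2*x) - 1/(12*x**2) + 1/(120*x**4)

def ksg_mi(xs, ys, k=4):
    """Naive O(N^2) KSG estimator: eps_i is the max-norm distance to the
    k-th nearest neighbour in the joint space; n_x and n_y count the
    strictly closer points in each marginal."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        d = sorted(max(abs(xs[j] - xs[i]), abs(ys[j] - ys[i]))
                   for j in range(n) if j != i)
        eps = d[k - 1]
        nx = sum(1 for j in range(n) if j != i and abs(xs[j] - xs[i]) < eps)
        ny = sum(1 for j in range(n) if j != i and abs(ys[j] - ys[i]) < eps)
        total += digamma(nx + 1) + digamma(ny + 1)
    return digamma(k) + digamma(n) - total / n

# Correlated Gaussians: exact MI is -0.5 * log(1 - r^2).
random.seed(2)
r = 0.9
xs, ys = [], []
for _ in range(800):
    x = random.gauss(0.0, 1.0)
    xs.append(x)
    ys.append(r * x + math.sqrt(1 - r * r) * random.gauss(0.0, 1.0))
est = ksg_mi(xs, ys)
print(round(est, 2), round(-0.5 * math.log(1 - r * r), 2))
```

Replacing the quadratic scans with tree searches is precisely what the rest of this section addresses.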

Figure 1: Determining the k-th nearest neighbour—and the corresponding max-norm distance ε_i—for some point i, then counting the number of points in the marginals strictly within these bounds, giving n_x and n_y, after Kraskov et al. (2004).

The extension of this to TE and GTE is provided by Gómez-Herrero et al. (2015), and requires computing:

T_{X→Y} = ψ(k) − ⟨ψ(n_{y'y} + 1) + ψ(n_{yx} + 1) − ψ(n_y + 1)⟩ ,

where x and y represent the states of the source and target at time t, y' represents the state of the target at t + 1, and, for GTE, x is replaced by the set of influencing variables W. n_{yx} is thus the number of points that exist within the distance ε_i from point i when only considering the y and x coordinates of the data set.

Each of the above quantities requires N kNN searches and either 2N (for MI) or 3N (for TE and GTE) FR searches. In a naïve approach, each of these searches is an O(N) operation, clearly making this approach more computationally expensive than simple plug-in estimators. However, by choosing more appropriate searching algorithms we can significantly reduce the cost of these estimators such that they become a feasible approach.

2.1.1 Underlying data structure

The naïve approach for finding the k-th nearest neighbours and the marginal neighbour counts for point i is to calculate the distance between i and every other point, and only keep those points passing the criteria—i.e., amongst the k closest points, or within distance ε_i. This approach requires O(N) comparisons for each of the N points, resulting in a computational complexity of O(N²), and as such is not a suitable approach when N is large.

A common solution to this problem in computer graphics is a data structure such as the KD Tree (Bentley, 1975), which subdivides a space using lines to define front/back areas, as seen in Fig. 2. When considering point i, one can quickly cull large sections of space—those containing points which cannot possibly be among the k nearest neighbours, or within the bounds ε_i—from further consideration. This approach reduces the complexity of each search to O(log N), or O(N log N) over all points (Friedman et al., 1977). The subdivision process creates a tree hierarchy which can be seen in Fig. 3.

Figure 2: KD Tree constructed via lines intersecting each point, segregating the local area of the point into so-called front and back areas. In this example, point A is the root node of the binary tree, with points B and C being its immediate child nodes, as seen in Fig. 3.
Figure 3: Tree structure created by subdivision of space in the KD Tree. All nodes descending from the left of A appear in front of A in the space, while those descending from the right of A appear behind A. Grey leaf nodes represent the areas enclosed by the subdivisions. An algorithm for traversing this tree, presented by Friedman et al. (1977), allows finding the k-th nearest neighbour in O(log N) time, rather than O(N).

However, when the data lies in a space with periodic boundary conditions a structure such as a KD Tree will incorrectly cull points. Since the space wraps, a point is both in front of and behind any other point, while the KD Tree only considers points up to the boundary and not beyond. Three potential methods for correcting this—using Images, a hybrid involving naïve search and KD Trees, and VP Trees—are discussed below.

The images method duplicates and shifts the data—thus creating cloned images—such that when the KD Tree is constructed from the combination of the original data set and all images, it has the illusion of periodicity. This requires 3^d − 1 duplications for a d-dimensional space, which can quickly become intractable, especially for large d. Note, however, that typically only the shortest distance between points is considered, and as such the duplication can be reduced by only duplicating the closest region (half/quarter/eighth/etc.) as needed. Some specific cases—perhaps where direction is a factor—might require the full set of duplications; however, this will not be considered here, due to both the difficulty in imagining such a scenario—implying the data is both circular and pairwise ordered—as well as the fact that the two following approaches are only capable of solving the shortest distance form.
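A sketch of the full images construction (Python; the reduced variant would keep only the sliver of each image nearest the boundary):

```python
import itertools

def build_images(points, L):
    """Full images method: every d-dimensional point is duplicated into all
    3^d - 1 shifted copies, so an ordinary (non-periodic) tree built over
    originals + images behaves as if the space wrapped."""
    d = len(points[0])
    images = []
    for shift in itertools.product((-L, 0.0, L), repeat=d):
        if any(shift):  # skip the unshifted original
            images.extend(tuple(p[i] + shift[i] for i in range(d))
                          for p in points)
    return images

pts = [(0.1, 0.1), (9.9, 9.9)]
imgs = build_images(pts, 10.0)
print(len(imgs))  # (3^2 - 1) * 2 = 16
# A plain max-norm search over pts + imgs now sees the wrapped neighbour:
# (0.1, 0.1) lies only 0.2 from the image of (9.9, 9.9) at (-0.1, -0.1).
```

The 3^d blow-up in stored points is visible directly in the image count.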

The second method—a hybrid of duplication and naïve search—requires only a single duplication, but can degenerate to a naïve search for select points. The image in this method—queried separately—is shifted such that points originally at the corners are now in the centre of the space, and vice versa—i.e., each dimension is shifted by half of its width. When querying (kNN or FR), if the query point is closer to the boundary than the search distance—i.e., there are possibly points just on the other side of the boundary closer than those found—the query is repeated in the image. The boundary check is repeated on the image result, and if it fails again, the query steps down to a naïve search. This method requires less duplication than the previous method, but requires more computation per point.
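This tiered logic can be sketched as follows, with brute-force scans standing in for the two KD Trees (the fallback control flow, not the tree itself, is the point here):

```python
def knn_dist(points, q, k):
    """Max-norm distance to the k-th nearest neighbour (non-periodic)."""
    d = sorted(max(abs(a - b) for a, b in zip(p, q)) for p in points if p != q)
    return d[k - 1]

def knn_dist_periodic(points, q, k, L):
    """Naive fallback: k-th NN distance under wrapped (minimum-image) max-norm."""
    def wrap(a, b):
        t = abs(a - b) % L
        return min(t, L - t)
    d = sorted(max(wrap(a, b) for a, b in zip(p, q)) for p in points if p != q)
    return d[k - 1]

def hybrid_knn_dist(points, q, k, L):
    half = L / 2.0
    shifted = [tuple((c + half) % L for c in p) for p in points]
    eps = knn_dist(points, q, k)                  # first tree, T1
    if eps <= min(min(c, L - c) for c in q):      # ball fits inside the box
        return eps
    q2 = tuple((c + half) % L for c in q)         # second tree, T2
    eps = knn_dist(shifted, q2, k)
    if eps <= min(min(c, L - c) for c in q2):
        return eps
    return knn_dist_periodic(points, q, k, L)     # degenerate to naive

pts = [(0.2, 0.2), (9.8, 9.8), (5.0, 5.0), (4.0, 5.0), (0.3, 9.9)]
print(round(hybrid_knn_dist(pts, (0.2, 0.2), 1, 10.0), 3))  # wrapped NN gives 0.3
```

Here the corner query fails in T1, succeeds in the shifted T2, and never reaches the naive tier.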

Instead of wrangling the KD Tree to handle periodicity, the third approach is to use a different structure. As already discussed, KD Trees cannot deal with periodic conditions because the nature of front and back areas in periodic spaces is ill-defined. VP Trees work instead by choosing a point (the vantage point) and then subdividing the space into those points nearer to the vantage point than a given threshold and those points farther from it. Since distances—and thus near/far points—are trivial to calculate in a periodic space, this structure can be used to attain efficient searching without resorting to any tricks. To create a balanced tree, the threshold is chosen such that it divides the points in half. The subdivision process is repeated on each new subset, with new vantage points and thresholds chosen from only those points in the subregion. Fig. 4 shows how this subdivision strategy works in an example space. Note that the hierarchical structure of this tree is the same as the KD Tree's, just with the left children being near—instead of in front—and the right children being far.

Figure 4: Circles centred on vantage points delineating near/far subsets. Note that child circles do not cross the boundaries of parent circles, as vantage points only consider points in the same subset as itself. Shaded region represents the near subset of the root node (top left), accounting for periodic boundary conditions.
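The near/far recursion and its pruning rule can be sketched with an arbitrary metric plugged in, here a wrapped max-norm, so the same tree works unchanged in the periodic space (a Python sketch, not the Hanov (2012) implementation used in the paper):

```python
import random

def wrapped_maxnorm(a, b, L):
    """Max-norm under periodic boundaries (minimum image)."""
    d = 0.0
    for x, y in zip(a, b):
        t = abs(x - y) % L
        d = max(d, min(t, L - t))
    return d

def build_vp(points, dist):
    """Recursive VP Tree: pick a vantage point, split the rest at the
    median distance into 'near' and 'far' children."""
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return (vp, 0.0, None, None)
    rest.sort(key=lambda p: dist(vp, p))
    mid = len(rest) // 2
    mu = dist(vp, rest[mid])  # threshold: median distance
    return (vp, mu, build_vp(rest[:mid], dist), build_vp(rest[mid:], dist))

def nn_search(node, q, dist, best=None):
    """1-NN search; a child is pruned only when the current best ball
    cannot cross the vantage threshold."""
    if node is None:
        return best
    vp, mu, near, far = node
    d = dist(q, vp)
    if best is None or d < best[0]:
        best = (d, vp)
    first, second = (near, far) if d < mu else (far, near)
    best = nn_search(first, q, dist, best)
    if abs(d - mu) <= best[0]:   # best ball straddles the threshold
        best = nn_search(second, q, dist, best)
    return best

L = 10.0
dist = lambda a, b: wrapped_maxnorm(a, b, L)
random.seed(3)
pts = [(random.uniform(0, L), random.uniform(0, L)) for _ in range(500)]
tree = build_vp(list(pts), dist)
q = (0.05, 9.95)
d, p = nn_search(tree, q, dist)
print(round(d, 6) == round(min(dist(q, p2) for p2 in pts), 6))  # matches brute force
```

The pruning test relies only on the triangle inequality, which the wrapped max-norm satisfies, so no boundary special-casing is needed.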

One drawback to this near/far dichotomy is that a new tree is needed for each set of marginals, as the dimension reduction can change the distance between points. That is, if the max-norm distance between two d-dimensional realisations is attained along dimension a, then a marginal subset that does not include a will have a different distance between those two realisations. Similar arguments apply to other norms.

Thus a separate tree is needed for each kNN and FR search—totalling three for MI and four for TE and GTE. However, the FR searches are independent of each other; thus if the kNN search is performed first and the ε-ball sizes stored, the remaining trees can be used serially. Since each tree is only used once (with all points processed), only one tree is ever needed at any given time, reducing the space requirements for this approach to that of a single tree, rather than three or four.

When performing an FR search in just one dimension (i.e., for counting the marginal neighbour counts n_x and n_y), it was found that performing a binary search to find the range of elements covered by the ε-ball—and thus the neighbour count—was faster than using VP Trees. As such, in the one dimensional searches, all test cases revert to this simpler algorithm. This also reduces the number of VP Trees required to one for MI and three for TE and GTE.
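The one-dimensional shortcut amounts to two binary searches over a pre-sorted marginal; a sketch using Python's bisect, with the strict bounds of Fig. 1 and assuming the query value itself is in the list:

```python
import bisect

def marginal_count(sorted_vals, x, eps):
    """Count points strictly within eps of x along one marginal, assuming
    sorted_vals is pre-sorted and contains x itself (which is excluded)."""
    lo = bisect.bisect_right(sorted_vals, x - eps)   # first value > x - eps
    hi = bisect.bisect_left(sorted_vals, x + eps)    # first value >= x + eps
    return hi - lo - 1                               # minus the point itself

vals = sorted([0.1, 0.2, 0.25, 0.4, 0.8, 0.9])
print(marginal_count(vals, 0.2, 0.15))  # 0.1 and 0.25 fall in (0.05, 0.35) -> 2
```

For a wrapped marginal the range may straddle the boundary, in which case the count is the sum over the two resulting intervals.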

The KD Tree implementation used in this paper was provided by the ANN library (Mount & Arya, 2010), which is a standard package in the area of nearest neighbour searches with over 330 citations on Google Scholar, while the VP Tree implementation is a slightly modified version of the public domain code by Hanov (2012). Our software is provided online. Both implementations are written in C++.

It should also be noted that other structures and approaches exist for solving this issue. For example, the periodic KD Tree library (Varilly, 2014) (written in Python) instead duplicates the query points rather than the data points. While this halves the storage requirements, each query point needs to be processed multiple times—once per image in the typical case, and more in the general, if not odd, case—rather than 1 to 3 times as in the hybrid model. This can be an obstacle in the case of measuring information theoretic quantities, where each data point is also a query point. Another two structures capable of addressing the kNN problem are Locality Sensitive Hashing (LSH) (Zhang et al., 2013) and higher order Voronoi Diagrams (specifically, k-order) (Lee, 1982; Dwyer, 1991). For the purposes of estimating information theoretic quantities, however, these algorithms will not perform significantly better than the KD Tree or VP Tree due to space requirements, the one-shot nature of the estimation (that is, only using each structure once), and the distribution of data points. As such, the authors shall simply note their existence (and better performance in other use cases) and consider just the KD Tree and VP Tree structures for the remainder of this paper.

2.2 Vicsek Model

The standard Vicsek Model (SVM) (Vicsek et al., 1995) is a simple model that is widely used in collective motion studies. The simplicity of the model allows it to easily scale to large system sizes (both in terms of time and space). As such, it easily generates a large number of points—with periodic boundary conditions—perfect for testing VP Trees. In the model, a flock of N particles move in a continuous space (off-lattice) with periodic boundary conditions of linear size L. Particle density is defined as ρ = N/L². Particles move with constant speed, v, and are simulated for T discrete time steps. Formally, the particles are updated according to:

x_i(t + 1) = x_i(t) + v_i(t) Δt ,
θ_i(t + 1) = ⟨θ(t)⟩_r + Δθ ,

where x_i(t) is the position of particle i at time t, v_i(t) is its velocity, which is constructed with speed v and heading θ_i. ⟨θ(t)⟩_r defines the average angle of all particles within r units of particle i (including itself), and Δθ is a realisation of uniform noise on the range [−η/2, η/2], where η is our temperature variable and 0 ≤ η ≤ 2π.

Particle alignment is measured via the instantaneous order (or disorder) parameter:

φ(t) = (1 / (N v)) | Σ_i v_i(t) | ,

where 0 ≤ φ(t) ≤ 1. When φ = 1, the flock is completely aligned and collective motion has been attained, while φ ≈ 0 represents the disordered, chaotic state of particles moving with no alignment. φ = 1 is associated with no noise in the system—i.e., η = 0—while φ ≈ 0 occurs at high noise, η = 2π.
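The update rule and order parameter can be sketched as follows (Python; an unoptimised O(N²) neighbour search, with illustrative parameters rather than those used in the paper):

```python
import math, random

def vicsek_step(pos, theta, L, r, v, eta, rng):
    """One SVM update: each particle takes the average heading of all
    particles within r (itself included), plus uniform noise on
    [-eta/2, eta/2], then moves a distance v (dt = 1), wrapping at L."""
    n = len(pos)
    new_theta = []
    for i in range(n):
        sx = sy = 0.0
        for j in range(n):
            dx = abs(pos[i][0] - pos[j][0]); dx = min(dx, L - dx)
            dy = abs(pos[i][1] - pos[j][1]); dy = min(dy, L - dy)
            if dx * dx + dy * dy <= r * r:      # wrapped interaction radius
                sx += math.cos(theta[j]); sy += math.sin(theta[j])
        new_theta.append(math.atan2(sy, sx) + rng.uniform(-eta / 2, eta / 2))
    new_pos = [((x + v * math.cos(t)) % L, (y + v * math.sin(t)) % L)
               for (x, y), t in zip(pos, new_theta)]
    return new_pos, new_theta

def order(theta):
    """Instantaneous order parameter: phi = |sum of unit velocities| / N."""
    return math.hypot(sum(math.cos(t) for t in theta),
                      sum(math.sin(t) for t in theta)) / len(theta)

rng = random.Random(4)
L, r, v, n = 5.0, 1.0, 0.03, 100
pos = [(rng.uniform(0, L), rng.uniform(0, L)) for _ in range(n)]
theta = [rng.uniform(-math.pi, math.pi) for _ in range(n)]
for _ in range(150):
    pos, theta = vicsek_step(pos, theta, L, r, v, 0.5, rng)  # low noise
print(round(order(theta), 2))  # low noise drives phi toward 1
```

The headings produced by such runs are exactly the wrapped variables the estimators below consume.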

Note that while particle positions are indeed in a wrapped space, the random variables evaluated below are actually the heading angles of neighbouring (interacting) particles. That is, the quantities describe the various reductions in uncertainty of a particle's heading given knowledge of its past heading and its neighbours' headings, not their positions. To measure the information theoretic quantities—using the notation from Eqs. (4-6)—the first variable is defined as the heading of particle i, the second as the heading of an interacting neighbour particle j, and the target as the next heading of particle i. In the case of MI and TE, which are pairwise quantities, this is repeated for each interacting neighbour for every time step t. On the other hand, GTE measures all interacting neighbours in one multi-dimensional variable, and thus generates just a single point per particle per time step. Particle indistinguishability is employed and the above is repeated for all particles, coalescing all points into a single point cloud. The pairwise quantities generate a separate data point for each interacting pair, the number of which is not constant over time or noise. See Fig. 5 for a visualisation of how the number of interacting pairs changes as a function of noise temperature, η. GTE on the other hand generates a single point per particle for all of its interacting neighbours at each time step, and thus contains a fixed number of data points at all noise values.

Figure 5: Logarithmic scale of the number of interactions for a Vicsek system; error bars represent the standard deviation over repeated runs.


There is one major difficulty in calculating GTE, however. GTE uses all interacting neighbours of particle i at time step t to define the multivariate input. The number of interacting neighbours is not constant, which causes issues when collecting all realisations into a single point cloud for calculation. One solution is to pick an arbitrary neighbourhood size—say the minimum, maximum or mean number of neighbours—however these options will either miss pertinent variables and thus give an inaccurate result, or will include irrelevant particles and thus increase the complexity of the problem. The alternative is to note that particles only influence each other via the consensus heading, and to use it instead of each particle individually. The latter approach is used for all Vicsek results below.

Finally, to provide an example for the earlier claim regarding adaptive binning techniques for plug-in estimators, note that the average heading of Vicsek flocks will proceed on a random walk about the unit circle over time. As such, the angles generated by each particle will be roughly uniform over each marginal (although in the joint space, the data accumulates around the diagonal), degenerating adaptive bins to simple fixed-width bins.

3 Results

3.1 Accuracy of NN estimator

Gaussian 3.4 3.5 3.2
0.0 0.0 0.0
5.1 3.9 4.0
0.8304 0.1270 0.1816
LogNormal 3.4 3.4 3.0
0.0 0.0 0.0
5.1 3.7 4.0
0.9741 0.1314 0.1871
Cauchy 4.6 4.1 4.0
0.2242 0.0827 0.1251
7.2 4.6 4.9
1.0545 0.2098 0.3067
Uniform 3.4 3.4 3.2
0.0 0.0 0.0
Uniform Wrapped 3.4 3.4 3.2
0.0 0.0 0.0
Hahs & Pethel N/A 3.7 N/A
Von Mises a 0.3489 4.3 N/A N/A
b 0.4384 4.5 N/A N/A
c 0.1940 4.3 N/A N/A
Table 1: Estimated and exact (italicised) values to 4 decimal places for MI, TE and GTE, using realisations of variables drawn from the listed bi- and multi-variate distributions and auto-correlated processes. Evaluation was performed over repeated runs to establish the mean estimate—and thus bias—and the standard error of the estimator († scaled for readability). Bias is dependent on the number of realisations and the covariance of the random variables, improving as realisations increase and covariance decreases. *Gaussian, LogNormal and Cauchy distributions were evaluated at two covariance settings. Parameters for the Von Mises and Hahs & Pethel evaluations are found in the text.

To ensure the integrity of the VP Tree implementation of the NN estimator, MI, TE and GTE are estimated from data series sampled from distributions with known closed form entropies. Kraskov et al. (2004) performed this test for MI against Gaussian data with multiple covariances, which is repeated and extended here. This establishes the accuracy of the TE and GTE estimators as well. The estimator is also tested against the Log-Normal, Cauchy (via a Student's t distribution with one degree of freedom) and Uniform distributions. For these tests—and all tests below—k is set as per the suggestion of Kraskov et al. (2004), who propose a small k as it provides a good balance between statistical and systematic errors in the NN estimator. For larger systems, larger k is allowed as systematic errors tend to 0. Simulations with larger k confirmed no change in estimate, and so only the single value is presented below.

To extend the test to account for periodic cases, all three metrics are tested against a wrapped uniform distribution and against the von Mises distribution (sine variant). Exact MI for the von Mises distribution is provided in Table 1 of Hnizdo et al. (2008) for the three parameter sets, a), b) and c), used in Table 1.

Lastly, the estimator is tested against the autoregressive process described by Hahs & Pethel (2013) (Example 1), which includes an analytical expression for the TE; the variances of the Gaussian noise terms and the initial value follow the original material.

The estimated and exact values for these distributions can be seen in Table 1. Estimates were measured over repeated sets of realisations. Note that as the number of realisations increases, the estimated values converge to the exact values, as observed in Kraskov et al. (2004). Closed form expressions for the information theoretic quantities can be seen in Table 2.

Distribution Closed Form
Uniform Wrapped
Hahs & Pethel See Hahs & Pethel (2013)
Von Mises See Hnizdo et al. (2008)

Here ρ is the scalar covariance between X and Y, and Σ is the covariance matrix between the subscripted variables. Additionally, Γ is the Gamma function while ψ is the digamma function. Information sharing and flow are non-existent for uniform distributions by definition, and the corresponding metrics thus become 0.

Table 2: Closed form analytic expressions for , , and . Note that closed form expressions for the Von Mises distribution and the Hahs & Pethel auto-regressive process require more context than can reasonably be provided here and as such the reader is referred to the original material for this context.

Given the results in Table 1, and the converging nature as the number of realisations is increased, it is determined that the VP Tree implementation of the NN estimator is accurate.

3.2 Comparison of underlying data structure

Yianilos (1993) and Friedman et al. (1977) provide Big-O analyses of VP Trees and KD Trees, respectively, showing both can be constructed in O(N log N) time, searched in O(log N) time per query, and use O(N) space. Here, an estimate of the complexity of the hybrid KD Tree approach is provided, followed by empirical data comparing computation time and space of the VP Tree and hybrid approaches.

As mentioned above, the metrics for the Vicsek model are calculated from a two or three dimensional space with periodic boundary conditions, with sides of length L. To perform kNN searches, the hybrid method constructs two KD Trees: T1 partitions the original points, while T2 partitions the same points shifted by L/2 along every dimension, with wrapping.

To refresh, the hybrid search works as follows: search the first tree, T1, for the k-th nearest neighbour of point i, which gives a distance, ε. If point i is closer than ε units to the boundary, then the search progresses to the equivalent point in the second tree, T2, to find a corresponding distance, ε'. If the shifted point is also closer than ε' units to the boundary, the search reverts to a naïve linear search over all points.

Thus a set of areas can be defined. First is the total area, A_T, containing all points. Second is the inner area, A_I, given by the box of sides L − 2ε centred in the space, containing all points successfully processed by T1 in the first step. Following this is the border area, A_B = A_T − A_I, covering those points that T2 will process (successfully or otherwise). Finally, the naïve area, A_V ⊂ A_B, contains those points that will ultimately be processed using the linear search, and is defined as the area within a max-norm distance ε of the boundary midpoints.

Due to the tiered approach of the hybrid method, the time complexity will follow the form

T(N) = O( (1 + p_B) N log N + p_V N² ) ,

where p_B is the proportion of points in the border area and p_V is the proportion of points in the naïve area. Note that p_V ≤ p_B.

In the high noise Vicsek case, where the data is uniformly distributed with density ρ, the average distance to the k-th nearest neighbour scales as (k/ρ)^{1/d} in d dimensions (Bhattacharyya & Chakrabarti, 2008; Gaboune et al., 1993; Hertz, 1909). This allows further expansion of p_B and p_V.

For the high noise uniform case, the proportions are simply the area fractions, p_B = A_B / A_T and p_V = A_V / A_T; substituting these, along with the scaling of the average neighbour distance above, into the form above yields the expanded time complexities for the two-dimensional and three-dimensional cases.

In other distributions, such as the low noise Vicsek model, p_B and p_V will change. In the specific case of low noise Vicsek data, points will accumulate along the diagonals of the joint space for the two- and three-dimensional calculations. This is ideal, as it results in minimal points in the naïve area—i.e., p_V ≈ 0. Furthermore, the number of points in the border area will also be significantly reduced, since only points near the corners will be within the required threshold, noting that the average neighbour distance will not match the high noise scaling above due to the distribution change.

Given that p_B and p_V are small, it is clear that the expanded complexities are of order N log N and thus equivalent to the VP Tree. In practical terms, however, non-zero p_B and p_V—as well as the need for twice the construction time, which is itself O(N log N)—will result in lower processing throughput.

To demonstrate the difference between the two approaches, empirical data is provided below, where Vicsek data sets are analysed against the number of interactions generated. Only MI and TE are investigated, as these have a variable number of data points, while GTE has a constant number and thus does not provide any additional insight. As mentioned earlier, both algorithm implementations are provided by off the shelf software—with slight modification of the VP Tree implementation such that it uses contiguous memory allocation similar to the KD Tree implementation.

Fig. 6 shows the time taken to calculate MI and TE for a range of data set sizes. While all four measurements seem to scale according to N log N—as described above—the VP Tree methods are faster. The difference in running time for the calculation of MI is strictly in finding the ε-ball sizes, as both FR searches are performed with the same binary search function. Calculation of TE is worse still due to relying on the hybrid method more, with computations taking almost ten hours with the hybrid method compared to only 30 minutes under the VP Tree method.

Figure 6: Time (s) vs number of data points measured for a Vicsek system using VP Tree and hybrid methods.

Fig. 7 shows the memory usage for the same calculations as above. All four metrics are rigidly linear in their scaling—as expected—but there is a marked increase in memory usage between the two methods, with the hybrid method performing definitively worse with both metrics.

Figure 7: Memory usage (GB) vs number of data points measured for a Vicsek system using VP Tree and hybrid methods.

3.3 Additional checks of numerical stability

Two additional checks of numerical stability were performed with the Vicsek data—a random shuffle and a data decimation. These checks were not performed for the canonical distributions, as exact results are known for those.

In the first check, a random shuffle is applied to the coordinates corresponding to the variable(s) for all three metrics. This should eliminate most of the information sharing between the data sets and thus drive all three estimates towards zero. Results can be seen in Fig. 8.
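As a concrete illustration, the shuffle check can be sketched as follows. Here `estimator` is a hypothetical stand-in for any dependence estimator over two equal-length arrays (such as the kNN estimators used in this paper); the interface is an assumption for illustration, not the implementation used here.

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_surrogates(x, y, estimator, n_surrogates=20):
    """Permute one variable to destroy any X-Y coupling.

    `estimator` is any callable returning a dependence estimate for two
    equal-length 1-D arrays. For an unbiased estimator, the surrogate
    values should cluster around zero once the coupling is destroyed.
    """
    return [estimator(x, rng.permutation(y)) for _ in range(n_surrogates)]
```

Comparing the estimate on the unshuffled data against this surrogate distribution then gives a simple sanity check that the estimator is not reporting spurious dependence.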

Figure 8: Results of random shuffle (dashed) and decimation (solid bold) for a Vicsek system with parameters: . Results for unmodified data are shown washed out. Shuffling has removed almost all information sharing between variables, as evidenced by the near-zero values of all three metrics at all noise levels. Decimation of the data (using a random 10% of the available data) has little impact on the results.

The decimation method is widespread and is employed here to test the precision of the NN estimator implementation in a manner analogous to cross-validation (Witten & Frank, 2005). Specifically, data is generated from the Vicsek simulations, a random subset containing 10% of the entire population is chosen, and this subset is processed with the NN estimators. This is repeated ten times per run. Figure 8 shows that the decimation had no significant impact, in line with the assertions of Gao et al. (2015).
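The subsampling step itself is straightforward. A minimal sketch, assuming the realisations are stored row-wise in a NumPy array (the 10% fraction and ten repeats follow the text above):

```python
import numpy as np

rng = np.random.default_rng(1)

def decimated_subsets(data, fraction=0.1, repeats=10):
    """Draw `repeats` independent random subsets of the rows of `data`,
    each containing `fraction` of the population, for re-running the
    estimators in a cross-validation-like fashion."""
    n = len(data)
    m = max(1, int(round(fraction * n)))
    return [data[rng.choice(n, size=m, replace=False)]
            for _ in range(repeats)]
```

The spread of the estimates across the subsets then indicates how sensitive the estimators are to the reduced sample size.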

3.4 Periodicity in data structure

Hnizdo et al. (2008) use the ANN library for their KD Tree implementation, as here, in which they measure for the circular von Mises distribution. Interestingly, however, they make no mention of the inability of KD Trees to handle periodic situations. Furthermore, their parameters for the von Mises distribution use a mean of for both random variables—with a periodic range of —so one would expect periodic artefacts to be quite pronounced, as the distribution is centred on the boundary.

However, this is not the case. Sets of realisations of parameter set c (Table 1) were tested with and without wrapping enabled in the VP Tree implementation. The estimated for the two tests was and for the wrapped and non-wrapped VP Trees, respectively, against an exact result of .

To see why the difference in results is not larger, consider not accounting for periodic conditions. This only affects points near the boundary, where instead of counting the neighbours between —with wrapping handled—the process counts the neighbours in the range (or ), noting that will not necessarily be the same as , since the th nearest neighbour is likely to be a different point. In a data set centred on , even though many points lie near the boundary, only a small subset will cross it in either the kNN or FR searches. Additionally, the mirrored density on either side of the boundary means that for these points—and likewise for . Note that the actual distance to the th nearest neighbour is not used in Eqs. (4-6), only the neighbour counts, to which the digamma function is applied—and the digamma function is approximately logarithmic for arguments above 10. These facts, combined with the averaging of the result over all points, lead to the small change in the result when periodic boundaries are ignored.
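The wrapping discussed above reduces to a minimum-image distance on the circle. A minimal sketch, assuming angular data on [0, 2π); because a VP Tree requires only a metric, substituting a function like this for the plain Euclidean distance is enough for both the kNN and FR searches to respect the boundary:

```python
import numpy as np

def wrapped_distance(a, b, period=2.0 * np.pi):
    """Minimum-image distance between angles `a` and `b` on a circle of
    circumference `period`: the shorter of the direct separation and the
    separation through the boundary."""
    d = np.abs(a - b) % period
    return np.minimum(d, period - d)
```

For multivariate points, the per-axis wrapped differences can be combined with the usual max- or Euclidean norm before being handed to the tree.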

To test the mirrored-density assertion above, the data set is shifted so that the density at the boundary is asymmetric. This is achieved using . The estimate from the wrapped trees is (rightly) unaffected; however, the non-wrapped trees' estimate worsens, to .

4 Conclusion

In this paper, we discuss estimating information theoretic quantities for continuous variables using nearest neighbour estimators, specifically for systems with a large amount of circular data. The direct method of calculation is shown to be infeasible, as it scales as O(N²). While spatial partitioning structures can reduce this to O(N log N), special care must be taken when dealing with periodic boundary conditions.

We discuss three data structures capable of handling these boundary conditions and demonstrate that choosing a more appropriate data structure, such as the VP Tree, can significantly improve performance in both time and memory. Furthermore, since the VP Tree is fundamentally similar to the KD Tree but relies on distances between points rather than their coordinate structure, it should perform no worse than the KD Tree in any metric space, including closed manifolds such as a sphere or a Klein bottle.

We have also shown that employing a decimation technique to reduce the number of points processed does not have a significant impact on estimation.

By combining the considerations put forth in this paper, estimation of , , and for large systems with periodic boundary conditions is feasible and will show the same performance characteristics as systems with fewer constraints.


Joshua Brown would like to acknowledge the support of his Ph.D. program and this work from the Australian Government Research Training Program Scholarship.

The National Computational Infrastructure (NCI) facility provided computing time for the simulations under project e004, with part funding under Australian Research Council Linkage Infrastructure grant LE140100002.


  • Barnett et al. (2013) Barnett, L., Harré, M., Lizier, J., Seth, A. K., & Bossomaier, T. (2013). Information flow in a kinetic Ising model peaks in the disordered phase. Phys. Rev. Lett., 111, 177203.
  • Bentley (1975) Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18, 509–517.
  • Bhattacharyya & Chakrabarti (2008) Bhattacharyya, P., & Chakrabarti, B. K. (2008). The mean distance to the nth neighbour in a uniform distribution of random points: an application of probability theory. European Journal of Physics, 29, 639.
  • Dwyer (1991) Dwyer, R. A. (1991). Higher-dimensional voronoi diagrams in linear expected time. Discrete & Computational Geometry, 6, 343–367.
  • Friedman et al. (1977) Friedman, J. H., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3, 209–226.
  • Fuchs et al. (1980) Fuchs, H., Kedem, Z. M., & Naylor, B. F. (1980). On visible surface generation by a priori tree structures. In ACM Siggraph Computer Graphics (pp. 124–133). ACM volume 14.
  • Gaboune et al. (1993) Gaboune, B., Laporte, G., & Soumis, F. (1993). Expected distances between two uniformly distributed random points in rectangles and rectangular parallelpipeds. Journal of the Operational Research Society, 44, 513–519.
  • Gao et al. (2015) Gao, S., Ver Steeg, G., & Galstyan, A. (2015). Efficient estimation of mutual information for strongly dependent variables. In AISTATS (pp. 277–286).
  • Gómez-Herrero et al. (2015) Gómez-Herrero, G., Wu, W., Rutanen, K., Soriano, M. C., Pipa, G., & Vicente, R. (2015). Assessing coupling dynamics from an ensemble of time series. Entropy, 17, 1958–1970.
  • Hahs & Pethel (2013) Hahs, D. W., & Pethel, S. D. (2013). Transfer entropy for coupled autoregressive processes. Entropy, 15, 767–788. doi:10.3390/e15030767.
  • Hanov (2012) Hanov, S. (2012). VP trees: A data structure for finding stuff fast.
  • Harré & Bossomaier (2009) Harré, M., & Bossomaier, T. (2009). Phase-transition-like behaviour of information measures in financial markets. Europhysics Letters, 87, 18009.
  • Hertz (1909) Hertz, P. (1909). Über den gegenseitigen durchschnittlichen abstand von punkten, die mit bekannter mittlerer dichte im raume angeordnet sind. Mathematische Annalen, 67, 387–398.
  • Hnizdo et al. (2008) Hnizdo, V., Tan, J., Killian, B. J., & Gilson, M. K. (2008). Efficient calculation of configurational entropy from molecular simulations by combining the mutual-information expansion and nearest-neighbor methods. Journal of computational chemistry, 29, 1605–1614.
  • Ising (1925) Ising, E. (1925). Beitrag zur theorie des ferromagnetismus. Z. Phys., 31, 253–258.
  • Kozachenko & Leonenko (1987) Kozachenko, L., & Leonenko, N. N. (1987). Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23, 9–16.
  • Kraskov et al. (2004) Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69, 066138–066153. doi:10.1103/PhysRevE.69.066138.
  • Lee (1982) Lee, D.-T. (1982). On k-nearest neighbor voronoi diagrams in the plane. IEEE Transactions on Computers, 100, 478–487.
  • Lizier et al. (2012) Lizier, J. T., Atay, F. M., & Jost, J. (2012). Information storage, loop motifs, and clustered structure in complex networks. Physical Review E, 86, 026110+. doi:10.1103/physreve.86.026110.
  • Lizier et al. (2008) Lizier, J. T., Prokopenko, M., & Zomaya, A. Y. (2008). Local information transfer as a spatiotemporal filter for complex systems. Physical Review E, 77, 026110+.
  • Matsuda et al. (1996) Matsuda, H., Kudo, K., Nakamura, R., Yamakawa, O., & Murata, T. (1996). Mutual information of Ising systems. Int. J. Theor. Phys., 35, 839–845.
  • Mount & Arya (2010) Mount, D. M., & Arya, S. (2010). Ann: A library for approximate nearest neighbor searching (2006).
  • Ross (2014) Ross, B. C. (2014). Mutual information between discrete and continuous data sets. PloS one, 9, e87357.
  • Schreiber (2000) Schreiber, T. (2000). Measuring information transfer. Phys. Rev. Lett., 85, 461–464.
  • Shannon (1948) Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
  • Uhlmann (1991) Uhlmann, J. K. (1991). Satisfying general proximity / similarity queries with metric trees. Information Processing Letters, 40, 175–179.
  • Varilly (2014) Varilly, P. (2014). Periodic kdtree.
  • Vicsek et al. (1995) Vicsek, T., Czirók, A., Ben-Jacob, E., Cohen, I., & Shochet, O. (1995). Novel type of phase transition in a system of self-driven particles. Physical Review Letters, 75, 1226–1229. doi:10.1103/PhysRevLett.75.1226.
  • Wang et al. (2012) Wang, X. R., Miller, J. M., Lizier, J. T., Prokopenko, M., & Rossi, L. F. (2012). Quantifying and tracing information cascades in swarms. PLoS ONE, 7, e40084+. doi:10.1371/journal.pone.0040084.
  • Wicks et al. (2007) Wicks, R. T., Chapman, S. C., & Dendy, R. (2007). Mutual information as a tool for identifying phase transitions in dynamical complex systems with limited data. Physical Review E, 75. doi:10.1103/PhysRevE.75.051125.
  • Witten & Frank (2005) Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
  • Yianilos (1993) Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA (pp. 311–321), volume 93.
  • Zhang et al. (2013) Zhang, Y.-m., Huang, K., Geng, G., & Liu, C.-l. (2013). Fast knn graph construction with locality sensitive hashing. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 660–674). Springer.