1 Introduction
Information Theory provides three useful metrics for determining relationships between data sets. Mutual Information () (Shannon, 1948)
measures the shared uncertainty between two variables and is often used as a marker for secondorder phase transitions
(Matsuda et al., 1996; Harré & Bossomaier, 2009; Wicks et al., 2007). Meanwhile, Transfer Entropy () (Schreiber, 2000) and Global Transfer Entropy () (Barnett et al., 2013) track the flow of information from one or more data sets onto another, with Barnett et al. demonstrating that can be used as a predictor for an oncoming phase transition in the Ising (1925) spin model when system temperature is reduced over time.If the underlying distribution for the data is known, these quantities can be straightforward to calculate. However, in general this distribution is unknown and can only be approximated from sampled data, thus requiring Entropy Estimators. Two well known approaches are the plugin and nearestneighbour (NN) estimators.
While simple, plugin estimators tend to be less accurate, particularly for continuous variables (Ross, 2014) which are the focus of this paper. A high quality alternative is the KozachenkoLeonenko entropy estimator (1987) which uses nearest neighbour statistics of realisations of a data series. Kraskov et al. (2004) extended this for estimating , with and estimators further derived by GómezHerrero et al. (2015). For higher dimensional data, NN estimators become even more appropriate as plugin estimators experience a combinatorial explosion in storage requirements.
On the other hand, the time complexity for the plugin estimator scales linearly on the number of points, , while NN estimators—using a direct method, where distances between each and every point are considered—scale on the order of , which becomes significant for large . This problem is widely encountered in the field of computer graphics and collision detection in computer games. The accepted solution for these scenarios are spatial partitioning data structures—such as wellknown BSP Trees (Fuchs et al., 1980) and KD Trees (Bentley, 1975)—resulting in a “broadphase” collision detection step, which provides the ability to discard large swathes of points from consideration with relatively little computation.
KD Trees are directly applicable to the nearest neighbour searches required by the NN estimators, and are commonly applied when data sets are large. However, when the data lies in a space with periodic boundaries—e.g., headings of observed objects, where —the basic KD Tree fails. While there are modifications to overcome this, it is shown later that ultimately a more appropriate data structure is needed and available in the Vantage Point (VP) Trees developed—independently—by Uhlmann (1991) and Yianilos (1993).
This paper serves to highlight the relative computational performance of the VP Tree compared with a modified KD Tree for the specific use case of measuring , and of large periodic systems via a NN estimator. This use case also presents a unique constraint in that the search structures are typically constructed and used just one time—meaning construction costs are just as important as search costs.
To establish the accuracy of the VP Tree approach, tests are performed against periodic and aperiodic canonical distributions as done by Kraskov et al. (2004). To further analyse the use of the VP Tree for large data sets with periodic boundary conditions, data sets—generated by the widelyused Vicsek model of selfpropelled particles (Vicsek et al., 1995)—are analysed. Lastly, work by Gao et al. (2015) stresses that for accurate estimation, the NN estimator requires a number of points scaling exponentially with the true . A decimation of the Vicsek data is employed to investigate this issue.
The three aforementioned information theoretic metrics are widely used in the scientific community, where mutual information has been used in many applications, including the study of phase transitions in financial markets (Harré & Bossomaier, 2009), while the range of applications for transfer entropy is building steadily, from the study of physics of fundamental systems, such as the Ising spin model (Barnett et al., 2013) to cellular automata (Lizier et al., 2008), random Boolean networks(Lizier et al., 2012), and swarms (Wang et al., 2012). Ensuring the accuracy of the VP Tree method in periodic systems is important as it will allow processing of larger data sets with fewer computational resources—which is particularly important in the age of big data.
2 Methods
The most simple of the three quantities, , measures the amount of uncertainty common to two variables, say the headings of particles in a simulation. The input for an estimator for is a set of data points, , drawn from the two variables. instead measures the flow of information between two variables, that is, the reduction in uncertainty of given past knowledge of and . Estimators for thus requires a time series of data points . Finally, measures a similar statistic as , but is used in the case where is influenced by multiple variables , and therefore input to these estimators is dimensional data set, .
The standard forms of the three metrics are (Shannon, 1948; Schreiber, 2000; Barnett et al., 2013):
(1) 
(2) 
(3) 
where . and are pairwise quantities, and as such, can be calculated using just two variables—and further averaged over every pairwise combination of variables if the data set contains many interacting variables.
on the other hand is a global quantity, so each random variable is considered in turn for the target variable.
is thus the set of all influencing variables on the chosen target, .Throughout this work, and are drawn from a variety of sources where: (a) analytic results are available, (b) prior results for metrics are available, or (c) statistics are unknown and hard to estimate. A collection of canonical distributions with analytic results and distributions with prior results are provided in Tables 1 and 2
. Amongst these distributions is the ubiquitous Gaussian distribution, as well as the Von Mises distribution which is accepted as the circular analogue of the Gaussian distribution. The Vicsek Model (described in §
2.2) is used to generate difficulttoestimate data.There are some optimisations available for depending on the nature of the data. If the variables of the data set are indistinguishable then can be calculated as a single ensemble calculation treating it as a multivariate calculation, rather than calculations. This technique is employed in §3.1. A second such optimisation is dependent on how variables influence each other. In the case of the Vicsek model used later, variables influence one another in an aggregate fashion from which a consensus variable, , can be constructed. Thus can be calculated as the 3dimensional , rather than the above dimensional form. This reduction is utilised in §3.23.3.
2.1 Nearest neighbour method
The plugin estimation approach uses these data sets to determine the underlying distribution often via a histogram. As mentioned in the introduction, once this distribution is revealed it is often trivial to solve the above equations. However, working with continuous data presents certain issues for traditional plugin histogram entropy estimation. For example, tuning the bin width parameter for discretising the continuous data is susceptible to the number of points available, which can introduce error at high or low noise if the bin width is too low or high, respectively. While adaptive binning can mitigate this somewhat, it is not always an option, particularly when the data covers the entire domain.
Instead, one can use NearestNeighbour Entropy Estimators. These estimators are based on the (continuous) statistics of the nearest neighbours to each data point and the distribution of points along marginal dimensions as introduced by Kozachenko & Leonenko (1987). In these methods, the maxnorm distance, , is found for each point and its
th nearest neighbour (kNN search). A fixed range (FR) search is then performed to count the number of points within the (hyper)
“ball” defined by from along the marginal spaces. These values are then used to compute the above information theoretic quantities. See Fig. 1 for an example of collecting nearest neighbour information.Kraskov et al. (2004) provide the nearest neighbour estimator for as:
(4) 
where is the digamma function, is the count of points within the distance from point in the marginal space (either or ), and is the average over all points.
The extension to this for and is provided by GómezHerrero et al. (2015), and requires computing:
(5) 
(6) 
where and represent the state at , represents the current state of at and represents the set of variables used for the . is thus the number of points that exist within the distance from point when only considering the and coordinates of the data set.
Each of the above quantities requires kNN searches and either or FR searches. In a naïve approach, each of these searches is a operation, clearly making this approach more computationally expensive than simple plugin estimators. However, by choosing more appropriate searching algorithms we can significantly reduce the cost of these estimators such that they become a feasible approach.
2.1.1 Underlying data structure
The naïve approach for finding the th nearest neighbours and the marginal neighbour counts for point is to calculate the distance between and every other point, and only keep those points passing the criteria—i.e., amongst the closest points, or within distance . This approach requires comparisons for all points, resulting in a computational complexity of and as such is not a suitable approach when is large.
A common solution to this problem in computer graphics are data structures such as the KD Tree (Bentley, 1975), which subdivide a space using lines to define front/back areas, as seen in Fig. 2. When considering point , one can quickly cull large sections of space—those containing points which cannot possibly be among the nearest neighbours, or within the bounds —from further consideration. This approach reduces the complexity to (Friedman et al., 1977). The subdivision process creates a tree hierarchy which can be seen in Fig. 3.
However, when the data lies in a space with periodic boundary conditions a structure such as a KD Tree will incorrectly cull points. Since the space wraps, a point is both in front of and behind any other point, while the KD Tree only considers points up to the boundary and not beyond. Three potential methods for correcting this—using Images, a hybrid involving naïve search and KD Trees, and VP Trees—are discussed below.
The images method duplicates and shifts the data—thus creating cloned images—such that when the KD Tree is constructed from the combination of the original data set and all images, it has the illusion of periodicity. This requires duplications for a dimensional space which can quickly become intractable, especially for . Note however, that typically only the shortest distance between points and are considered and as such this can be reduced to by only duplicating the closest region (half/quarter/eighth/etc.) as needed. Some specific cases—perhaps where direction is a factor—might require full duplications, however, this will not be considered here, due to both the difficulty in imagining such a scenario—implying the data is both circular and pairwise ordered—as well as the fact that the two following approaches are only capable of solving the shortest distance form.
The second method—a hybrid of duplication and naïve search—requires only a single duplication, but can degenerate to a naïve search for select points. The image in this method—queried separately—is shifted such that points originally at the corners are now in the centre of the space, and vice versa—i.e., each dimension is shifted by half of its width. When querying (kNN or FR), if it is closer to the boundary than —i.e., there are possibly points just on the other side of the boundary closer than —the query is repeated in the image. The boundary check is repeated on the image result, and if it fails again, the query steps down to a naïve search. This method requires less duplication than the previous method but requires more computation per point.
Instead of wrangling the KD Tree to handle periodicity, the third approach is to use a different structure. As already discussed, KD Trees cannot deal with periodic conditions because the nature of front and back areas in periodic spaces is illdefined. VP Trees work instead by choosing a point, (the vantage point), and then subdividing the space into those points nearer to and those points farther from than a given threshold. Since distances—and thus near/far points—are trivial to calculate in a periodic space, this structure can be used to attain efficient searching without resorting to any tricks. To create a balanced tree, the threshold is chosen such that it divides the points in half. The subdivision process is repeated on each new subset with new vantage points and thresholds chosen from only those points in the subregion. Fig. 4 shows how this subdivision strategy works in an example space. Note that the hierarchical structure of this tree is the same as the KD Tree, just with the left children being near—instead of in front—and the right children being far.
One drawback to this near/far dichotomy is that a new tree is needed for each set of marginals as the dimension reduction can change the distance between points. That is, if the maxnorm distance between dimensional realisations, , is the distance along dimension , then a marginal subset that does not include will have a different distance between and . Similar arguments apply to other norms.
Thus a separate tree is needed for each kNN and FR search—totalling three for and four for and . However, the FR searches are independent of each other, thus if the kNN search is performed and the ball sizes stored the remaining trees can be used in serial. Since each tree is only used once (with all points processed), only one tree is ever needed at any given time, reducing the space requirements for this approach to , rather than or .
When performing an FR search in just one dimension (i.e., for counting and ), it was found that performing a binary search to find the range of elements covered by —and thus the neighbour count—was faster than VP Trees. As such in the one dimensional searches, all test cases revert to this simpler algorithm. This also reduces the number of VP Trees required to one for and three for and .
The KD Tree implementation used in this paper was provided by the ANN library (Mount & Arya, 2010), which is a standard package in the area of nearest neighbour searches with over 330 citations on Google Scholar, while the VP Tree implementation is a slightly modified version of the public domain code by Hanov (2012). Our software is provided online. Both implementations are written in C++.
It should also be noted that other structures and approaches exist for solving this issue. For example, the periodic KD Tree library (Varilly, 2014) (written in Python) instead duplicates the query points rather than the data points. While this halves the storage requirements, each query point needs to be processed times (in the typical case) or
times (in the general, if not odd, case), rather than 1 to 3 times as in the hybrid model. This can be an obstacle in the case of measuring information theoretic quantities where each data point is also a query point (i.e.,
rather than ). Another two structures capable of addressing the kNN problem are Locality Sensitive Hashing (LSH) (Zhang et al., 2013) and higher order Voronoi Diagrams (specifically, korder) (Lee, 1982; Dwyer, 1991). For the purposes of estimating information theory quantities however, these algorithms will not perform significantly better than the KD Tree or VP Tree due to space requirements, the oneshot nature of the estimation (that is, only using each structure once), and the distribution of data points. As such, the authors shall simply note their existence (and better performance in other use cases) and consider just the KD Tree and VP Tree structures for the remainder of this paper.2.2 Vicsek Model
The standard Vicsek Model (SVM) (Vicsek et al., 1995) is a simple model that is widely used in collective motion studies. The simplicity of the model allows it to easily scale to large system sizes (both in terms of time and space). As such, it easily generates a large number of points—with periodic boundary conditions—perfect for testing VP trees. In the model, a flock of particles move in a continuous space (offlattice) with periodic boundary conditions of linear size . Particle density is defined as . Particles move with constant speed, , and are simulated for discrete time steps. Formally, the particles are updated according to:
(7)  
(8) 
where is the position of particle at time , is its velocity, which is constructed with speed and heading . defines the average angle of all particles within units of (including itself) and is a realisation of uniform noise on the range , where is our temperature variable and .
Particle alignment is measured via the instantaneous order (or disorder) using the parameter:
(9) 
where . When , the flock is completely aligned and collective motion has been attained, while represents the disordered, chaotic state of particles moving with no alignment. is associated with no noise in the system—i.e., —while occurs at high noise, .
Note that while particle positions are indeed in a wrapped space, the random variables evaluated below are actually the heading angles of neighbouring (interacting) particles. That is, the quantities describe the various reductions in uncertainty of a particle’s heading given knowledge of its past heading and its neighbours’ headings, not their positions. To measure the information theoretic quantities—using the notation from Eqs. (46)— is defined as the heading of particle , as the heading of an interacting neighbour particle , and as the next heading of particle . In the case of and , which are pairwise quantities, this is repeated for each interacting neighbour for every time step . On the other hand, measures all interacting neighbours in one multidimensional variable , and thus generates just a single point. Particle indistinguishability is employed and repeated the above for all , coalescing all points into a single point cloud. The pairwise quantities generate a separate data point for all interacting pairs, which is not constant over time or noise. See Fig. 5 for a visualisation of how the number of interacting pairs changes as a function of noise temperature, . on the other hand generates a single point per for all of its interacting neighbours at time and thus contains precisely data points at all noise values.
There is one major difficulty in calculating however. uses all interacting neighbours, , to particle at time step to define the multivariate —i.e., . is not constant which causes issues when collecting all realisations into a single point cloud for calculation. One solution is to pick an arbitrary value of —say the minimum, maximum or mean of —however these options will miss pertinent variables and thus give an inaccurate result, or will include irrelevant particles and thus increase the complexity of the problem. The alternative is to note that particles only influence each other via the consensus heading, , and to use it instead of each particle individually. The latter approach is used for all Vicsek results below.
Finally, to provide an example for the earlier claim regarding adaptive binning techniques for plugin estimators, note that the average heading of Vicsek flocks will proceed on a random walk about the unit circle over time. As such, the angles generated by each particle will be roughly uniform over each marginal (although in the joint space, accumulates around the diagonal), degenerating adaptive bins to simple fixedwidth bins.
3 Results
3.1 Accuracy of NN estimator
Metric  

Distribution*  
Gaussian  3.4  3.5  3.2  
0.0  0.0  0.0  
5.1  3.9  4.0  
0.8304  0.1270  0.1816  
LogNormal  3.4  3.4  3.0  
0.0  0.0  0.0  
5.1  3.7  4.0  
0.9741  0.1314  0.1871  
Cauchy  4.6  4.1  4.0  
0.2242  0.0827  0.1251  
7.2  4.6  4.9  
1.0545  0.2098  0.3067  
Uniform  3.4  3.4  3.2  
0.0  0.0  0.0  
Uniform Wrapped  3.4  3.4  3.2  
0.0  0.0  0.0  
Hahs & Pethel  N/A  —  3.7  N/A  —  
0.1651  
Von Mises  a  0.3489  4.3  N/A  —  N/A  — 
0.3488  
b  0.4384  4.5  N/A  —  N/A  —  
0.4384  
c  0.1940  4.3  N/A  —  N/A  —  
0.1928 
repetitions to establish the mean estimate—and thus bias—and standard error of the estimator († Multiplied by
for readability). Bias is dependent on number of realisations and covariance of random variables, with improved bias as realisations increase and covariance decreases. *Gaussian, LogNormal and Cauchy distributions performed with covariance,
(i.e., ) and . Parameters for Von Mises and Hahs & Pethel evaluations found in text.To ensure the integrity of the VP Tree implementation of the NN estimator, , and are estimated from data series sampled from distributions with known closed form entropies. Kraskov et al. (2004) performed this test for against Gaussian data with multiple covariances, which is repeated and extended here. This establishes the accuracy of the estimator for the and as well. The estimator is also tested against the LogNormal, Cauchy (Via a Student’s t distribution with degree of freedom) and Uniform distributions. For these tests—and all tests below— is set to as per the suggestion of Kraskov et al. (2004), who propose the choice of as it provides a good balance between statistical and systematic errors in the NN estimator. For larger systems, larger is allowed as systematic errors tend to 0. Simulations with confirmed no change in estimate, and so only is presented below.
To extend the test to account for periodic cases, all three metrics are tested against a wrapped uniform distribution and against the von Mises distribution (sine variant). Exact for the von Mises distribution is provided in Table 1 of Hnizdo et al. (2008), for three parameter sets: a) , b) , and c) , with for all three sets.
Lastly, the estimator is tested against the autoregressive process described by Hahs & Pethel (2013) (Example 1) which includes an analytical expression for
. Variances of the Gaussian noise terms in each process were
for and , respectively, with and , and an initial value of .The estimated and exact values for these distributions can be seen in Table 1. Estimates were measured from realisations and repeated times. Note that as the number of realisations increases, the estimate values converge to the exact values, as observed in Kraskov et al. (2004). Closed form expressions for the information theoretic quantities can seen in Table 2.
Distribution  Closed Form 

Uniform  
Uniform Wrapped  
Hahs & Pethel  See Hahs & Pethel (2013) 
Von Mises  See Hnizdo et al. (2008) 
Where is the scalar covariance between and and is the covariance matrix between the subscripted variables. Additionally, is the Gamma function while is the digamma function. Information sharing and flow are nonexistent for uniform distributions by definition and thus become 0.
Given the results in Table 1, and the converging nature as the number of realisations is increased, it is determined that the VP Tree implementation of the NN estimator is accurate.
3.2 Comparison of underlying data structure
Yianilos (1993) and Friedman et al. (1977) provide Big analysis of VP Trees and KD Trees, respectively, showing both can be constructed and searched in time and use space. Here, estimation of the complexity of the hybrid KD Tree approach is provided, followed by empirical data comparing computation and space of the VP Tree and hybrid approaches.
As mentioned above, the metrics for the Vicsek model are calculated from a two or three dimensional space with periodic boundary conditions, with sides . To perform kNN searches the hybrid method constructs two KD Trees: partitions the points , while partitions the points —the original points shifted by along every dimension, with wrapping.
To refresh, the hybrid search works as follows: search the first tree, , for the th nearest neighbour to , which gives a distance, . If is closer than units to the boundary then the search progresses to the equivalent point in the second tree, , to find a corresponding distance, . If is also closer than units to the boundary, the search reverts to a naïve linear search over all points.
Thus areas can be defined. First is the total area, , containing all points. Second is the inner area, given by the box of sides centred in the space, containing all points successfully processed by in the first step. Following this is the border area, , covering those points that will process (successfully or otherwise). Finally, the naïve area, , contains those points that will ultimately be processed using the linear search, and is defined as the area within a maxnorm distance of from the boundary midpoints, such that .
Due to the tiered approach of the hybrid method, the time complexity will follow the form
(10) 
where is the proportion of points in the border area and is the proportion of points in the naïve area. Note that .
In the high noise Vicsek case, where is uniformly distributed with density the average distance to the th nearest neighbour is (Bhattacharyya & Chakrabarti, 2008; Gaboune et al., 1993; Hertz, 1909):
(11) 
where . This allows further expansion of and .
For the high noise uniform case the proportions are simply and , which gives time complexity of
(12) 
for the twodimensional case and
(13) 
for the threedimensional case.
In other distributions, such as the low noise Vicsek model, and will change. In the specific case of low noise Vicsek data, points will accumulate along the and diagonals for and , respectively. This is ideal as it results in minimal points in —i.e., . Furthermore, the amount of points in will also be significantly reduced since only points near the corners—that is near —will be within the required threshold, with , noting that will not match Eq. (11) due to the distribution change.
Given that , it is clear that Eqs. (12) and (13) are of order and thus equivalent to the VP Tree. In practical terms, however, a nonzero and as well as requiring twice the construction time—which is itself—will result in lower processing throughput.
To demonstrate the difference between the two approaches, empirical data is provided below where Vicsek data sets are analysed against the number of interactions generated. Only and are investigated as these have variable , while has constant and thus does not provide any additional insight. As mentioned earlier, both algorithm implementations are provided with off the shelf software—with slight modification of the VP Tree implementation such that it uses contiguous memory allocation similar to the KD Tree implementation.
Fig. 6 shows the time taken to calculate and for a range of . While all four measurements seem to scale according to —as described above—the VP Tree methods are faster. The difference in running time for calculation of is strictly in finding the ball sizes, as both FR searches are performed with the same binary search function. Calculation of is worse still due to relying on the hybrid method more, with computations taking almost ten hours with the hybrid method compared to only 30 minutes under the VP Tree method.
Fig. 7 shows the memory usage for the same calculations as above. All four metrics are rigidly linear in their scaling—as expected—but there is a marked increase in memory usage between the two methods, with the hybrid method performing definitively worse with both metrics.
3.3 Additional checks of numerical stability
Two additional checks for numerical stability were performed with the Vicsek data—a random shuffle and a data decimation. These additional checks were not performed for the canonical distributions as exact results are known for these.
In the first check, a random shuffle is employed on the coordinates corresponding to the variable(s) for all three metrics. This should eliminate most of the information sharing between data sets and thus result in . Results can be seen in Fig. 8.
The decimation method is widespread and is employed to test the precision of the NN estimator implementation in a manner analogous to crossvalidation (Witten & Frank, 2005). Specifically, data is generated from the Vicsek simulations, and a random subset containing 10% of the entire population is chosen, which were then processed with the NN estimators. This was repeated ten times per run. Figure 8 shows that there was not a significant impact from this decimation, in alignment with assertions by Gao et al. (2015).
3.4 Periodicity in data structure
Hnizdo et al. (2008) uses the ANN library for their KD Tree implementation, as here, in which they measure for the circular von Mises distribution. Interestingly however, they make no mention of the inability for KD Trees to handle periodic situations. Furthermore, their parameters for the von Mises distribution use a mean of for both random variables—with a periodic range of —meaning one would expect periodic artefacts to be quite pronounced as the distribution is centred on the border.
However, this is not the case. sets of realisations of parameter set c (Table 1) were tested with and without wrapping enabled in the VP Tree implementation. The estimated for the two tests were and for wrapped and nonwrapped VP trees, respectively, with an exact result of .
To see why the difference in results is not larger, consider not accounting for periodic conditions. This will only affect points near the boundary, where instead of counting the neighbours between —with wrapping handled—the process counts the neighbours in the range (or ), noting that will not necessarily be the same as since it is likely the th nearest neighbour will be a different point. In a data set centred around , even though many points are near the boundary, only a small subset will cross the boundary in either the kNN or FR searches. Additionally, the mirrored density on either side of the boundary means that for these points—likewise for . Note that the actual distance to the th nearest neighbour is not used in Eqs. (46), only the neighbour counts to which the digamma function is applied—which is approximately logarithmic for values over 10. These facts, combined with the result being averaged over all leads to the small change in result when periodic boundaries are ignored.
To test for the mirrored density assertion above, the data set is shifted such that the data on the boundary is asymmetric. This is achieved using . The estimate for wrapped trees is (rightfully) unaffected, however the nonwrapped trees estimate worsens, to .
4 Conclusion
In this paper, we discuss estimating information theoretic quantities for continuous variables using nearest neighbour estimators, specifically for systems with a large amount of circular data. The direct method of calculation is shown to be infeasible as it scales proportional to . While spatial partitioning structures can reduce this to , special care must be taken when dealing with periodic boundary conditions.
We discuss three data structures which are capable of handling these boundary conditions and have demonstrated that choosing a more appropriate data structure like the VP Tree can significantly increase performance in terms of time and memory used. Furthermore, as the VP Tree is fundamentally similar to the KD Tree, instead relying on distance between points rather than their specific topology, it should perform no worse than the KD Tree in any metric space on any manifold of a closed form, such as a sphere or Klein bottle.
We have also shown that employing a decimation technique to reduce the number of points processed does not have a significant impact on estimation.
By combining the considerations put forth in this paper, estimation of , , and for large systems with periodic boundary conditions is feasible and will show the same performance characteristics as systems with fewer constraints.
Acknowledgements
Joshua Brown would like to acknowledge the support of his Ph.D. program and this work from the Australian Government Research Training Program Scholarship.
The National Computing Infrastructure (NCI) facility provided computing time for the simulations under project e004, with part funding under Australian Research Council Linkage Infrastructure grant LE140100002.
References
 Barnett et al. (2013) Barnett, L., Harré, M., Lizier, J., Seth, A. K., & Bossomaier, T. (2013). Information flow in a kinetic ising model peaks in the disordered phase. Phys. Rev. Lett., 111, 177203.
 Bentley (1975) Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18, 509–517.

Bhattacharyya & Chakrabarti (2008)
Bhattacharyya, P., & Chakrabarti, B. K.
(2008).
The mean distance to the nth neighbour in a uniform distribution of random points: an application of probability theory.
European Journal of Physics, 29, 639.  Dwyer (1991) Dwyer, R. A. (1991). Higherdimensional voronoi diagrams in linear expected time. Discrete & Computational Geometry, 6, 343–367.
 Friedman et al. (1977) Friedman, J. H., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3, 209–226.
 Fuchs et al. (1980) Fuchs, H., Kedem, Z. M., & Naylor, B. F. (1980). On visible surface generation by a priori tree structures. In ACM Siggraph Computer Graphics (pp. 124–133). ACM volume 14.
 Gaboune et al. (1993) Gaboune, B., Laporte, G., & Soumis, F. (1993). Expected distances between two uniformly distributed random points in rectangles and rectangular parallelpipeds. Journal of the Operational Research Society, 44, 513–519.
 Gao et al. (2015) Gao, S., Ver Steeg, G., & Galstyan, A. (2015). Efficient estimation of mutual information for strongly dependent variables. In AISTATS (pp. 277–286).
 GómezHerrero et al. (2015) GómezHerrero, G., Wu, W., Rutanen, K., Soriano, M. C., Pipa, G., & Vicente, R. (2015). Assessing coupling dynamics from an ensemble of time series. Entropy, 17, 1958–1970.
 Hahs & Pethel (2013) Hahs, D. W., & Pethel, S. D. (2013). Transfer entropy for coupled autoregressive processes. Entropy, 15, 767–788. doi:10.3390/e15030767.
 Hanov (2012) Hanov, S. (2012). Vp trees: A data structure for finding stuff fast http://stevehanov.ca/blog/index.php?id=130.
 Harré & Bossomaier (2009) Harré, M., & Bossomaier, T. (2009). Phasetransition – behaviour of information measures in financial markets. Europhysics Letters, 87, 18009.
 Hertz (1909) Hertz, P. (1909). Über den gegenseitigen durchschnittlichen abstand von punkten, die mit bekannter mittlerer dichte im raume angeordnet sind. Mathematische Annalen, 67, 387–398.
 Hnizdo et al. (2008) Hnizdo, V., Tan, J., Killian, B. J., & Gilson, M. K. (2008). Efficient calculation of configurational entropy from molecular simulations by combining the mutualinformation expansion and nearestneighbor methods. Journal of computational chemistry, 29, 1605–1614.
 Ising (1925) Ising, E. (1925). Beitrag zur theorie des ferromagnetismus. Z. Phys., 31, 253–258.

Kozachenko & Leonenko (1987)
Kozachenko, L., & Leonenko, N. N.
(1987).
Sample estimate of the entropy of a random vector.
Problemy Peredachi Informatsii, 23, 9–16.  Kraskov et al. (2004) Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69, 066138–066153. doi:10.1103/PhysRevE.69.066138.
 Lee (1982) Lee, D.T. (1982). On knearest neighbor voronoi diagrams in the plane. IEEE Transactions on Computers, 100, 478–487.
 Lizier et al. (2012) Lizier, J. T., Atay, F. M., & Jost, J. (2012). Information storage, loop motifs, and clustered structure in complex networks. Physical Review E, 86, 026110+. doi:10.1103/physreve.86.026110.
 Lizier et al. (2008) Lizier, J. T., Prokopenko, M., & Zomaya, A. Y. (2008). Local information transfer as a spatiotemporal filter for complex systems. Physical Review E, 77, 026110+.
 Matsuda et al. (1996) Matsuda, H., Kudo, K., Nakamura, R., Yamakawa, O., & Murata, T. (1996). Mutual information of Ising systems. Int. J. Theor. Phys., 35, 839–845.
 Mount & Arya (2010) Mount, D. M., & Arya, S. (2010). Ann: A library for approximate nearest neighbor searching (2006).
 Ross (2014) Ross, B. C. (2014). Mutual information between discrete and continuous data sets. PloS one, 9, e87357.
 Schreiber (2000) Schreiber, T. (2000). Measuring information transfer. Phys. Rev. Lett., 85, 461–464.
 Shannon (1948) Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
 Uhlmann (1991) Uhlmann, J. K. (1991). Satisfying general proximity / similarity queries with metric trees. Information Processing Letters, 40, 175 – 179. doi:http://dx.doi.org/10.1016/00200190(91)90074R.
 Varilly (2014) Varilly, P. (2014). Periodic kdtree. URL: https://github.com/patvarilly/periodic_kdtree.
 Vicsek et al. (1995) Vicsek, T., Czirók, A., BenJacob, E., Cohen, I., & Shochet, O. (1995). Novel type of phase transition in a system of selfdriven particles. Physical Review Letters, 75, 1226–1229. doi:10.1103/PhysRevLett.75.1226.
 Wang et al. (2012) Wang, X. R., Miller, J. M., Lizier, J. T., Prokopenko, M., & Rossi, L. F. (2012). Quantifying and tracing information cascades in swarms. PLoS ONE, 7, e40084+. doi:10.1371/journal.pone.0040084.
 Wicks et al. (2007) Wicks, R. T., Chapman, S. C., & Dendy, R. (2007). Mutual information as a tool for identifying phase transitions in dynamical complex systems with limited data. Physical Review E, 75. doi:10.1103/PhysRevE.75.051125.

Witten & Frank (2005)
Witten, I. H., & Frank, E.
(2005).
Data Mining: Practical machine learning tools and techniques
. Morgan Kaufmann.  Yianilos (1993) Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA (pp. 311–21). volume 93.
 Zhang et al. (2013) Zhang, Y.m., Huang, K., Geng, G., & Liu, C.l. (2013). Fast knn graph construction with locality sensitive hashing. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 660–674). Springer.