Fast Spatial Autocorrelation
Physical or geographic location proves to be an important feature in many data science models, because many diverse natural and social phenomenon have a spatial component. Spatial autocorrelation measures the extent to which locally adjacent observations of the same phenomenon are correlated. Although statistics like Moran's I and Geary's C are widely used to measure spatial autocorrelation, they are slow: all popular methods run in Ω(n^2) time, rendering them unusable for large data sets, or long time-courses with moderate numbers of points. We propose a new S_A statistic based on the notion that the variance observed when merging pairs of nearby clusters should increase slowly for spatially autocorrelated variables. We give a linear-time algorithm to calculate S_A for a variable with an input agglomeration order (available at https://github.com/aamgalan/spatial_autocorrelation). For a typical dataset of n ≈ 63,000 points, our S_A autocorrelation measure can be computed in 1 second, versus 2 hours or more for Moran's I and Geary's C. Through simulation studies, we demonstrate that S_A identifies spatial correlations in variables generated with spatially-dependent model half an order of magnitude earlier than either Moran's I or Geary's C. Finally, we prove several theoretical properties of S_A: namely that it behaves as a true correlation statistic, and is invariant under addition or multiplication by a constant.
READ FULL TEXT