Estimation of entropy measures for categorical variables with spatial correlation

by Linda Altieri, et al.

Entropy is a measure of heterogeneity widely used in applied sciences, often when data are collected over space. Recently, a number of approaches have been proposed to include spatial information in entropy. The aim of entropy is to synthesize the observed data in a single, interpretable number. In other studies the objective is, instead, to use data for entropy estimation; several proposals can be found in the literature, which are essentially corrections of the estimator obtained by substituting the probabilities involved with proportions. In this case, independence is assumed and spatial correlation is not considered. We propose a path for spatial entropy estimation: instead of correcting the global entropy estimator, we focus on improving the estimation of its components, i.e. the probabilities, in order to account for spatial effects. Once probabilities are suitably evaluated, estimating entropy is straightforward since it is a deterministic function of the distribution. Following a Bayesian approach, we derive the posterior probabilities of a multinomial distribution for categorical variables, accounting for spatial correlation. A posterior distribution for entropy can be obtained, which may be synthesized as desired and displayed as an entropy surface for the area under study.




1 Introduction

Shannon’s entropy is a successful measure in many fields, as it is able to synthesize several concepts in a single number: entropy, information heterogeneity, surprise, contagion. The entropy of a categorical variable $X$ with $I$ outcomes is

$$H(X) = \sum_{i=1}^{I} p(x_i) \log \frac{1}{p(x_i)} \tag{1}$$

where $p(x_i)$ is the probability associated with the $i$-th outcome (coverthomas). The flexibility of this index and its ability to describe any kind of data, including categorical variables, motivate its diffusion across applied fields such as geography, ecology, biology and landscape studies. Often, such disciplines deal with spatial data, and the inclusion of spatial information in entropy measures has been the target of intensive research (see, e.g., batty76, leibovici09, leibovici14). In several case studies, the interest lies in describing and synthesizing what is observed. This is usually not a simple task: large amounts of data require advanced computational tools, qualitative variables have limited possibilities, and, when data are georeferenced, spatial correlation should be accounted for. When it comes to measuring the entropy of spatial data, we suggest an approach proposed in nostro, which allows entropy to be decomposed into a term quantifying the spatial information and a second term quantifying the residual heterogeneity.
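As a small illustration of the definition of entropy given above, the following Python sketch computes Shannon's entropy of a probability mass function. The function name `shannon_entropy` and the choice of the natural logarithm are assumptions made for illustration only; the text does not fix the base of the logarithm.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (natural log) of a probability mass function."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with zero probability contribute nothing to the sum.
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])   # maximum heterogeneity for 4 outcomes
skewed = shannon_entropy([0.7, 0.1, 0.1, 0.1])        # one outcome dominates
degenerate = shannon_entropy([1.0, 0.0, 0.0, 0.0])    # no heterogeneity at all
```

The uniform distribution attains the maximum, the logarithm of the number of outcomes, while a degenerate distribution has zero entropy.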

In other cases, though, the aim lies in estimating the entropy of a phenomenon, i.e. in making inference rather than description. Under this perspective, a stochastic process is assumed to generate the data according to an unknown probability function and an unknown entropy. One realization of the process is observed and employed to estimate such entropy. The standard approach relies on the so-called ’plug-in’ estimator, presented in paninski, which substitutes probabilities with observed relative frequencies in the computation of entropy:

$$\hat{H}_p(X) = \sum_{i=1}^{I} \hat{p}(x_i) \log \frac{1}{\hat{p}(x_i)} \tag{2}$$

where $\hat{p}(x_i) = n_i/n$ is the relative amount of observations of category $i$ over $n$ data. It is the non-parametric as well as the maximum likelihood estimator (paninski), and performs well when $I$ is known (antos). For unknown or infinite $I$, (2) is known to be biased; the most popular proposals in this regard consist of corrections of the plug-in estimator: see, for example, the Miller-Madow (miller) and the jack-knifed corrections (efron). Recently, zhang12 proposed a non-parametric solution with faster decaying bias and an upper limit for the variance when

. Under a Bayesian framework, the most widely known proposal is the NSB estimator (nemenman), improved by archer as regards the prior distribution. In all the above-mentioned works, independence among realizations is assumed. Other approaches, linked to machine learning methods, directly estimate entropy relying on the availability of huge amounts of data.
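The plug-in estimator and its Miller-Madow correction can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper: the function names are hypothetical, the natural logarithm is assumed, and the correction is written in its usual form, adding (m - 1) / (2n) to the plug-in value, with m the number of categories actually observed and n the sample size. Both functions treat the observations as independent, which is exactly the assumption questioned in the text.

```python
import math
from collections import Counter

def plugin_entropy(sample):
    """Plug-in estimator: entropy of the observed relative frequencies."""
    counts = Counter(sample)
    n = len(sample)
    return sum((c / n) * math.log(n / c) for c in counts.values())

def miller_madow_entropy(sample):
    """Plug-in estimator plus the Miller-Madow first-order bias correction."""
    if not sample:
        raise ValueError("sample must not be empty")
    observed_categories = len(set(sample))
    return plugin_entropy(sample) + (observed_categories - 1) / (2 * len(sample))

sample = ["a", "a", "b", "c", "a", "b", "a", "a"]
h_plugin = plugin_entropy(sample)
h_mm = miller_madow_entropy(sample)
```

With 8 observations over 3 observed categories, the correction adds 2/16 to the plug-in value, reflecting the known downward bias of the plug-in estimator in small samples.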


Two main limitations affect entropy estimation. First, the above-mentioned proposals referring to (2) focus on correcting or improving the performance of (2), while different perspectives might be considered. Second, no study is available on estimating entropy for variables presenting spatial association: the assumption of independence is never relaxed, while spatial entropy studies do not consider inference.

In this paper, we take a different perspective on entropy estimation, which moves the focus from the index itself to its components. As can be seen from (1), entropy is a deterministic function of the probability mass function (pmf) of the variable of interest. Therefore, once the pmf is properly estimated, the subsequent steps are straightforward. In the case of categorical variables following, e.g., a multinomial distribution, the crucial point is to estimate the distribution parameters. A Bayesian approach allows the pmf of such a distribution to be derived, and can be extended to account for spatial correlation among categories. After obtaining a posterior distribution for the parameters, the posterior distribution of entropy is also available as a transformation. Thus, a point estimator of entropy can be, e.g., the mean of the posterior distribution of the transformation; credibility intervals and other syntheses may be obtained via the standard tools of Bayesian inference. This approach can be used for non-spatial settings as well; in the spatial context, consistently with standard procedures for variables linked to areal and point data, the estimation output is a smooth spatial surface for the entropy over the area under study.
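The idea of obtaining the posterior distribution of entropy as a transformation of the posterior of the parameters can be sketched in the simplest non-spatial binary setting. The sketch below is a toy illustration under stated assumptions, not the spatial model of the paper: it assumes independent Bernoulli observations and a uniform prior on the probability of success, so that the posterior is Beta(1 + k, 1 + n - k) for k successes out of n; the function names and the Monte Carlo settings are hypothetical.

```python
import math
import random

def binary_entropy(p):
    """Entropy (natural log) of a binary variable with success probability p."""
    return sum(q * math.log(1.0 / q) for q in (p, 1.0 - p) if q > 0)

def entropy_posterior(successes, n, draws=5000, seed=0):
    """Monte Carlo summary of the posterior of entropy in the i.i.d. binary case."""
    rng = random.Random(seed)
    # Assumption: uniform prior, hence a Beta(1 + k, 1 + n - k) posterior for p.
    # Each posterior draw of p is transformed into a draw of entropy.
    samples = sorted(
        binary_entropy(rng.betavariate(1 + successes, 1 + n - successes))
        for _ in range(draws)
    )
    mean = sum(samples) / draws
    lower = samples[int(0.025 * draws)]
    upper = samples[int(0.975 * draws) - 1]
    return mean, (lower, upper)

mean, (lower, upper) = entropy_posterior(successes=30, n=100)
```

The posterior mean serves as a point estimate and the two empirical quantiles give an approximate 95% credibility interval; any other synthesis can be computed from the same transformed draws.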

The paper is organized as follows. Section 2 summarizes the methodology for Bayesian spatial regression and shows how to obtain the posterior distribution and the Bayesian estimator of entropy. Then, Section 3 assesses the performance of the proposed method on simulated data for different spatial configurations. Lastly, Section 4 discusses the main results.

2 Bayesian spatial entropy estimation

For simplicity of presentation, we focus on the binary case. Let $X$ be a binary variable with $P(X=1)=p$ and $P(X=0)=1-p$; consider a series of $n$ realizations indexed by $u=1,\dots,n$, each carrying an outcome $x_u \in \{0,1\}$. This may be thought of as an $n$-variate variable, or alternatively as a sequence of variables $X_1,\dots,X_n$, which are independent, given the distribution parameters and any effects modelling them. For a generic $X_u$, the simplest model is:

$$X_u \sim \mathrm{Bernoulli}(p_u), \qquad \mathrm{logit}(p_u) = z_u'\beta \tag{3}$$

in absence of random effects, where $z_u$ are the covariates associated with the $u$-th unit.

In order to include spatial correlation, consider realizations from a binary variable over the two-dimensional space, where $u$ identifies a unit via its spatial coordinates. Let us consider the case of realizations over a regular lattice of $n$ cells, where $u$ identifies each cell centroid. The sequence $X_1,\dots,X_n$ is now no longer independent, but spatially correlated. In order to define the extent of the correlation for grid data, the notion of neighbourhood must be introduced, linked to the assumption that occurrences at certain locations are influenced by what happens at surrounding locations, i.e. their neighbours. The simplest way of representing a neighbourhood system is via an adjacency matrix: for $n$ spatial units, $A$ is an $n \times n$ square matrix such that $a_{uu'}=1$ when unit $u$ and unit $u'$ are neighbours, and $a_{uu'}=0$ otherwise; in other words, $a_{uu'}=1$ if $u' \in N(u)$, the neighbourhood of area $u$, and diagonal elements are all zero by default. In the remainder of the paper, the word ’adjacent’ is used accordingly to mean ’neighbouring’, even when this does not correspond to a topological contact. The most common neighbourhood systems for grid data are the ’4 nearest neighbours’, i.e. a neighbourhood formed by the 4 pixels sharing a border along the cardinal directions, and the ’12 nearest neighbours’, i.e. two consecutive pixels along each cardinal direction plus the four along the diagonals.
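The two neighbourhood systems just described can be made concrete with a short Python sketch that builds the adjacency matrix of a regular grid. The function names, the row-major indexing of cells and the treatment of border cells (which simply have fewer neighbours, with no wrapping around the edges) are assumptions made for illustration.

```python
def neighbour_offsets(system):
    """Row and column offsets defining the 4 or 12 nearest neighbours of a pixel."""
    cardinal = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if system == 4:
        return cardinal
    if system == 12:
        # Second pixel along each cardinal direction, plus the four diagonal pixels.
        two_steps = [(2 * dr, 2 * dc) for dr, dc in cardinal]
        diagonal = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
        return cardinal + two_steps + diagonal
    raise ValueError("system must be 4 or 12")

def adjacency_matrix(n_rows, n_cols, system=4):
    """Binary adjacency matrix for a regular grid, cells indexed row by row."""
    n = n_rows * n_cols
    A = [[0] * n for _ in range(n)]
    for r in range(n_rows):
        for c in range(n_cols):
            u = r * n_cols + c
            for dr, dc in neighbour_offsets(system):
                rr, cc = r + dr, c + dc
                # Border cells have fewer neighbours: offsets outside the grid are skipped.
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    A[u][rr * n_cols + cc] = 1
    return A

A4 = adjacency_matrix(5, 5, system=4)
A12 = adjacency_matrix(5, 5, system=12)
centre = 2 * 5 + 2  # index of the central cell of the 5 x 5 grid
```

For the central cell of a 5 x 5 grid the row sums of the two matrices are 4 and 12 respectively, while a corner cell has only 2 neighbours under the 4 nearest neighbours system.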

Auto-models provide a way of including spatial correlation: they construct a statistical model that explains a response via the response values of its neighbours. For binary responses, such a model is obtained by combining a logistic regression model with autocorrelation effects. Auto-models were initially developed for the analysis of plant competition experiments and then extended to spatial data in general. Besag (1974) proposed to model spatial dependence among random variables directly (rather than hierarchically) and conditionally (rather than jointly). The autologistic model for spatial data with binary responses emphasises that the explanatory variables are the surrounding array variables themselves; a joint Markov random field is imposed for the binary data. A recent variant of this model, substituting Equation (4) with (5), is proposed by Caragea and Kaiser (2009):

$$\mathrm{logit}(p_u) = z_u'\beta + \eta \sum_{u' \in N(u)} (x_{u'} - \kappa_{u'}), \qquad \kappa_{u'} = \frac{\exp(z_{u'}'\beta)}{1+\exp(z_{u'}'\beta)} \tag{5}$$

where $\eta$ parametrizes dependence on the neighbourhood and $z_u'\beta$, in the simplest case, only includes an intercept. Parameter $\kappa_u$ represents the expected probability of success in the situation of spatial independence.
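To make the role of the dependence parameter concrete, the sketch below evaluates the conditional probability of success of one unit given its neighbours in an intercept-only centred autologistic model. It assumes the usual centred form, in which the log-odds equal the intercept plus the dependence parameter times the sum, over the neighbours, of the observed values minus their expected value under independence. The function name, the argument names (`beta0`, `eta`) and the toy data are all hypothetical.

```python
import math

def conditional_success_prob(u, x, neighbours, beta0, eta):
    """P(X_u = 1 | neighbours) in an intercept-only centred autologistic model."""
    # Expected probability of success under spatial independence.
    kappa = 1.0 / (1.0 + math.exp(-beta0))
    # Neighbouring values are centred around their independence expectation.
    centred_sum = sum(x[v] - kappa for v in neighbours[u])
    log_odds = beta0 + eta * centred_sum
    return 1.0 / (1.0 + math.exp(-log_odds))

# Toy example: three units on a line, the middle one has two neighbours equal to 1.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
x = [1, 0, 1]
p_indep = conditional_success_prob(1, x, neighbours, beta0=0.0, eta=0.0)
p_attract = conditional_success_prob(1, x, neighbours, beta0=0.0, eta=1.0)
```

With a zero dependence parameter the probability reduces to the independence value; with a positive one, neighbours equal to 1 pull the probability of success upwards.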

An analogous and even more recent formulation of the model [REFERENCE], in absence of covariate information, is

$$\mathrm{logit}(p_u) = \beta_0 + \phi_u, \qquad \boldsymbol{\phi} \sim N(\mathbf{0}, \Sigma) \tag{6}$$

where $\boldsymbol{\phi}$ is a spatial effect with a structured covariance matrix $\Sigma$, which is built from the matrices $A$ and $D$ and depends on a precision parameter $\tau$ and a dependence parameter $\rho$ quantifying the strength and type of the correlation between neighbouring units. $A$ is the adjacency matrix reflecting the neighbourhood structure, and $D$ is a diagonal matrix whose diagonal elements are the row sums of $A$.
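The text does not spell out how the covariance of the spatial effect is built from the adjacency matrix and the diagonal matrix of its row sums. One widely used structure matching this description is the proper conditional autoregressive (CAR) model, whose precision matrix (the inverse of the covariance) is the precision parameter times the diagonal matrix minus the dependence parameter times the adjacency matrix. The sketch below builds such a matrix; treating the model as a proper CAR is an assumption, and the names `car_precision`, `tau` and `rho` are hypothetical.

```python
def car_precision(A, tau, rho):
    """Precision matrix tau * (D - rho * A) of a proper CAR spatial effect.

    Assumption: this is one common structured form, not necessarily the one
    used in the paper. D is diagonal and holds the row sums of A.
    """
    n = len(A)
    row_sums = [sum(row) for row in A]
    return [
        [tau * ((row_sums[i] if i == j else 0) - rho * A[i][j]) for j in range(n)]
        for i in range(n)
    ]

# Three units on a line: the middle unit neighbours the other two.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
Q = car_precision(A, tau=2.0, rho=0.5)
```

Under this structure a dependence parameter of zero gives independent effects with unit-specific variances, while values approaching one give increasingly strong positive correlation between neighbours.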

2.1 Entropy estimation

The estimation of the parameters for Bayesian spatial logit regression models may proceed via MCMC methods or the INLA approach. We exploit the latter (rue) and obtain a posterior distribution for the parameters of the probability of success for each grid cell. A synthesis, such as the posterior mean, is chosen in order to obtain an estimate $\hat{p}_u$ for $p_u$ over each cell. Such an estimate is used for the computation of a local estimated entropy value for each pixel:

$$\hat{H}(X)_u = \hat{p}_u \log \frac{1}{\hat{p}_u} + (1-\hat{p}_u) \log \frac{1}{1-\hat{p}_u} \tag{7}$$

This way, an entropy surface is obtained for estimating the process entropy, whose smoothness may be tuned by the neighbourhood choice, or by the introduction of splines for the spatial effect. Any other surface can be obtained following the same approach for different aims, e.g. for plotting the entropy standard error or the desired credibility interval extremes.
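The construction of the entropy surface from the cell-wise estimates can be sketched as follows. The grid of estimated probabilities of success is made up for illustration and stands in for the posterior means that a fitted spatial model would return; the function names and the natural logarithm are likewise assumptions.

```python
import math

def binary_entropy(p):
    """Entropy (natural log) of a binary variable with success probability p."""
    return sum(q * math.log(1.0 / q) for q in (p, 1.0 - p) if q > 0)

def entropy_surface(p_hat):
    """Apply the binary entropy cell by cell to a grid of estimated probabilities."""
    return [[binary_entropy(p) for p in row] for row in p_hat]

# Hypothetical posterior means: nearly homogeneous on the left, mixed on the right.
p_hat = [
    [0.05, 0.10, 0.50],
    [0.02, 0.30, 0.55],
    [0.01, 0.20, 0.45],
]
surface = entropy_surface(p_hat)
```

Cells where one outcome is almost certain receive entropy values near zero, while cells with probabilities near one half approach the binary maximum. Since entropy is a concave function of the probability, plugging in the posterior mean generally gives a value at least as large as the posterior mean of entropy itself, so the two syntheses need not coincide.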

3 Simulation study

To the aim of assessing the performance of the proposed entropy estimator, we generate binary data on a grid under two spatial configurations: clustered and random. Figure 1 shows an example of the generated datasets. The underlying model is model (6), with a 12 nearest neighbour structure for , and for the two scenarios, respectively. For each scenario, 200 datasets are generated with varying values for so that the expectation of in a situation of independence varies between and ; values for differ across replicates but are constant across scenarios, so that the proportion of pixels of each type is comparable.

Figure 1: Clustered scenario (left) and random scenario (right) - example with .

Results show that fitting the model over the generated data leads to good estimates for the s. For all scenarios, the estimated parameters are very close to the true ones, which are always included within the 95% credibility intervals. An example is shown in Figure 2.

Figure 2: Posterior distributions for the model parameters, true values (vertical line) and 95% credibility intervals (dashed vertical lines).

The proposed approach is able to produce good estimates for the probabilities of success, ensuring the goodness of the estimates for spatial entropy, which is a function of such probabilities.

Obtaining the entropy surface proceeds as follows. First, the posterior distribution of the probability of success in each cell is synthesized with the posterior mean. This way, for each scenario and replicate we obtain a single estimated probability on every cell. Then, an entropy value is computed over each cell following Equation (7) and a smooth spatial function is produced. An example is shown in Figure 3, where values range from low (dark areas in the figure) to high (white areas). The clustered situation (left panel) shows a smoothly varying surface. By comparing the left panels of Figures 1 and 3, one can see that the entropy surface takes low values in areas where pixels are of the same type: white pixels in the top-right part of Figure 1, and black pixels in the bottom-right part of Figure 1, correspond to the darker areas of Figure 3 where entropy values are low. In the areas where white and black pixels mix, the entropy surface tends to higher values (whiter areas in Figure 3). The random configuration (right panel of Figure 3) has a constant entropy close to the maximum, $\log 2$ for a binary variable; this is expected, as there is no spatial correlation influencing the entropy surface in this scenario. Therefore, such a spatial function properly estimates the entropy of the underlying spatial process. Thanks to the availability of the marginal posterior distribution of all parameters, any other useful information or synthesis is straightforward to compute.

Figure 3: Example of estimated entropy surface for the two scenarios.

4 Concluding remarks

In this paper, we describe an approach to entropy estimation which starts from rigorous posterior evaluation of its components, i.e. the probabilities. This way, we frame entropy within the theory of Bayesian models for spatial data, thus assembling all the available results in this field.

Results from the simulation study support the validity of the approach in providing good estimates for the distribution parameters and, consequently, for entropy. The flexibility of the Bayesian paradigm allows the posterior distribution of entropy to be synthesized as desired, in order to answer different potential questions.

Our procedure ensures realistic results, since, when the behaviour of a spatial process is under study, the basic hypothesis is that it is not constant but smoothly varying over space. In the same spirit, an appropriate spatial entropy measure is not a single number; rather, it must be allowed to vary over space as a smooth function.