A maximum value for the Kullback-Leibler divergence between quantum discrete distributions

by Vincenzo Bonnici, et al.

This work presents an upper-bound for the maximum value that the Kullback-Leibler (KL) divergence from a given discrete probability distribution P can reach. In particular, the aim is to find a discrete distribution Q which maximizes the KL divergence from a given P under the assumption that P and Q have been generated by distributing a fixed discretized quantity. In addition, infinite divergences are avoided. The theoretical findings are used for proposing a notion of normalized KL divergence that is empirically shown to behave differently from already known measures.





1 Introduction

The Kullback-Leibler (KL) divergence, also called entropic divergence, is a widely used measure for comparing two discrete probability distributions [7]. Such a divergence is derived from the notion of entropy, and it aims at evaluating the amount of information that is gained by switching from one distribution to another. The applications of the divergence range over several scientific areas, for example, testing random variables [1, 9, 2], selecting the right sample size [4], optimizing sampling in bioinformatics [11] or analysing magnetic resonance images [18]. However, it has two important properties that also limit its applicability. It cannot be used as a metric because it is not symmetric: in general $KL(P\|Q) \neq KL(Q\|P)$, with $P$ and $Q$ being two probability distributions. Moreover, its value is 0 if equal distributions are compared, but it has no upper bound. One of the reasons is that it results in an infinite divergence if the probability of a specific event is equal to 0 in $Q$ but is greater than 0 in $P$. However, even if infinite divergences are discarded, an upper bound for the entropic divergence cannot be established in general.

The search for bounded divergences is an important topic in Information Theory, and some attempts have been made in past years, for example the so-called Jensen-Shannon divergence (JS) [10]. Its main goal is to provide a notion of symmetric divergence, and it is also shown to be upper-bounded by the value 1 if the base of the used logarithm is 2. Its square root is a metric, but the values of the divergence are not uniformly distributed within the range $[0,1]$, as is empirically shown in this study. The Kullback-Leibler and Jensen-Shannon measures belong to the class of $f$-divergences [15], which aim at representing the divergence as an average of the odds ratio given by $P$ and $Q$, weighted by a function $f$. Each divergence has a specific meaning and behaviour, and the relations among different types of $f$-divergences are a well-studied topic [16]. The Hellinger distance [6] is one of the most used measures among the $f$-divergences, together with KL and JS. It avoids infinite divergences by definition and it is bounded between 0 and 1.

The present work aims at proving that, given a discrete probability distribution $P$, there exists a distribution $Q$ which maximizes the entropic divergence from $P$. Thus, for each distribution $P$, an upper bound to the divergence from $P$ can be obtained by constructing $Q$. The assumptions are that infinite divergences must be avoided and that the two distributions must be formed by distributing a discretized quantity. In addition, it is shown that if monotonically decreasing ordered distributions are taken into account for $P$, then $Q$ assumes the same shape for every one of those distributions. The fact that the ordering of probability distributions does not affect the value of the Kullback-Leibler divergence implies that the same (reordered) $Q$ can be used for maximizing the divergence of every distribution. These theoretical results allow the introduction of a notion of entropic divergence that is normalized in the range $[0,1]$, independently of the base of the used logarithm. Such a measure is compared with the more commonly used notions of divergence, and distance, between distributions by showing that it behaves in a very specific way. In addition, it is empirically shown that its values are better distributed in the range $[0,1]$ w.r.t. the compared measures.

2 An upper-bound to the entropic divergence.

A distribution can be defined as a function $f$ which distributes a given discrete quantity $M$ to a finite set $X$ of $n$ cells. Thus, $f(x) \geq 0$, for $x \in X$, and $\sum_{x \in X} f(x) = M$. We refer to such kind of distributions as discrete multiplicity distributions. They are discrete because their domain is a finite set of discrete elements, the cells. They are multiplicity distributions because of the type of their codomain: they assign a multiplicity value to each element of the domain. Discrete multiplicity distributions are commonly transformed into discrete probability (frequency) distributions by converting them to a distribution such that the sum of its outcomes equals 1. Thus, a discrete probability distribution $P$ is obtained by dividing the assigned quantity by the total quantity, namely $P(x) = f(x)/M$. In analogy with Ferrers diagrams [13], the distributed quantity is a finite set of $M$ dots which are assigned to $n$ cells. In this context, the notion of quantum discrete distribution relates to the fact that a distribution is defined on a discrete domain and that the assigned values are formed by quanta, namely discretized unitary pieces of information.

It has to be noticed that what is defined here is a special type of discrete probability distribution. In fact, in general it is not required that a probability distribution is sourced by a discrete quantity distributed over a finite set of cells. Such a type of distribution is of great importance in the field of Computer Science, where probabilities are estimated by looking at frequencies calculated from discrete quantities, for example for representing biological information [14, 12, 19].

Given two probability distributions, the entropic divergence, also called the Kullback-Leibler (KL) divergence after the authors who introduced it [7], aims at measuring the information gain from one distribution to another. For two probability distributions, $P$ and $Q$, that are defined on the same domain $X$, the divergence of $Q$ from $P$ is defined as:

$$KL(P\|Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)} \tag{1}$$

The divergence is not symmetric, thus $KL(P\|Q) \neq KL(Q\|P)$ in general, and its possible values range between $0$ and $+\infty$. In fact, the divergence is $0$ if the two distributions are equal in their outcomes, namely $P(x) = Q(x)$ for every $x \in X$. The lower bound follows from Gibbs' inequality [3], while no finite upper bound holds in general. However, such a result is obtained by comparing two general distributions and by stating that the entropic divergence is a difference between the two quantities $\sum_{x \in X} P(x) \log P(x)$ and $\sum_{x \in X} P(x) \log Q(x)$, which implies

$$\sum_{x \in X} P(x) \log P(x) \geq \sum_{x \in X} P(x) \log Q(x) \tag{2}$$

and thus

$$KL(P\|Q) \geq 0 \tag{3}$$

Here, we are interested in finding a distribution $Q$ that maximizes the value $KL(P\|Q)$ for a given fixed distribution $P$, under the assumption that $P$ and $Q$ have been generated by distributing a limited quantity $M$. The assumption is of crucial importance in order to obtain, in practical situations, an upper bound to the divergence from a given distribution $P$.

The general concept of distribution, and thus of probability distribution, is independent of any ordering of the elements in $X$. However, for practical purposes, an ordering of the domain set is usually assumed. Therefore, the set $X$ is assumed to be an ordered set such that $x_i < x_{i+1}$, for $1 \leq i < n$. In addition, the value associated with the $i$-th element is here referred to as $P_i$, rather than its parenthesized version $P(x_i)$. Consequently, the KL formula can be rewritten as:

$$KL(P\|Q) = \sum_{i=1}^{n} P_i \log \frac{P_i}{Q_i} \tag{4}$$

However, it has to be pointed out that the ordering does not affect, in any way, the value of the KL divergence.
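The definition above, its lack of symmetry and its independence from the ordering of the cells can be illustrated with a minimal Python sketch. The function names, the base-2 logarithm and the example count vectors are assumptions made for illustration only; they are not taken from the authors' repository.

```python
import math

def to_probabilities(counts):
    """Convert a multiplicity distribution (dots per cell) into frequencies."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, base=2.0):
    """KL(P||Q) = sum_i P_i * log(P_i / Q_i) for two equal-length probability vectors."""
    if len(p) != len(q):
        raise ValueError("P and Q must be defined on the same domain")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # a zero-probability term contributes nothing, by convention
        if qi == 0:
            return math.inf  # the infinite case that the text rules out
        total += pi * math.log(pi / qi, base)
    return total

# Hypothetical example: 11 dots distributed over 5 cells in two different ways.
p = to_probabilities([5, 3, 1, 1, 1])
q = to_probabilities([1, 1, 2, 3, 4])

forward = kl_divergence(p, q)
backward = kl_divergence(q, p)

# Apply the same reordering of the cells to both distributions.
perm = [4, 2, 0, 3, 1]
permuted = kl_divergence([p[i] for i in perm], [q[i] for i in perm])
```

Here `forward` and `backward` differ, while `permuted` agrees with `forward` up to floating-point rounding.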

If the ordering is not taken into account, the total number of distinct distributions that can be formed by arranging a quantity $M$ in $n$ distinct cells, with at least one element per cell, is $\binom{M-1}{n-1}$, namely the binomial coefficient with parameters $M-1$ and $n-1$. The aim of the present study is to show that for each of these distributions, when it is converted into a probability distribution, there exists one and only one other distribution that maximizes the value of the entropic divergence.

Two distributions $P$ and $Q$ are considered equal, thus not distinct, if $P_i = Q_i$ for every $1 \leq i \leq n$. The introduction of the ordering decreases the possible ways of arranging the dots into cells, thus the total number of possible distributions decreases too. In fact, multiple unordered distributions may become equal after the ordering. The number of distinct ordered distributions is equivalent to the number of partitions of the integer $M$ into $n$ addends (because the constraint is that each cell must have at least one element) such that each addend does not exceed the value $M-n+1$. Such a number can be obtained by the recursive formula $p(M,n) = p(M-1,n-1) + p(M-n,n)$ with $p(0,0) = 1$ and $p(M,n) = 0$ for $n > M$ or $n = 0 < M$, where $n$ is the number of cells and $M$ is the number of distributed dots [17].
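The two counts discussed above can be checked with a short Python sketch. It assumes that every cell receives at least one dot, as required later in the text, and it uses the standard recurrence for partitions into a fixed number of parts. The function names are hypothetical and the brute-force enumeration is only a cross-check for small inputs.

```python
from functools import lru_cache
from itertools import product
from math import comb

def count_unordered(m, n):
    """Ways to place m dots in n distinguishable cells with at least one dot per cell."""
    return comb(m - 1, n - 1)

@lru_cache(maxsize=None)
def count_ordered(m, n):
    """Partitions of m into exactly n positive parts: p(m, n) = p(m-1, n-1) + p(m-n, n)."""
    if m == 0 and n == 0:
        return 1
    if n <= 0 or m < n:
        return 0
    return count_ordered(m - 1, n - 1) + count_ordered(m - n, n)

def brute_force(m, n):
    """Enumerate every arrangement explicitly; only feasible for small m and n."""
    arrangements = [c for c in product(range(1, m + 1), repeat=n) if sum(c) == m]
    shapes = {tuple(sorted(c, reverse=True)) for c in arrangements}
    return len(arrangements), len(shapes)

# Setup used in the first experiment of the paper: 15 dots in 5 cells.
unordered_15_5 = count_unordered(15, 5)
ordered_15_5 = count_ordered(15, 5)
```

With 15 dots and 5 cells the sketch returns 1001 unordered and 30 ordered distributions, which agrees with the figures reported in the experimental section. Calling `count_ordered(32, 8)` gives the corresponding count for the second experimental setup.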

However, since the ordering does not affect the KL value, the discovery of one diverging distribution that maximizes the KL value of a given ordered distribution can be reused for the unordered distribution that generated it, and for all the ordered distributions generated from it.

Given a set of ordered distributions that are obtained by rearrangements from the same unordered distribution, a monotonically decreasing order is taken into account for choosing the distribution that is representative of the set. The order is applied to the values $P_i$, thus for the resultant distribution it happens that $P_i \geq P_{i+1}$. Moreover, if $P_i = P_{i+1}$ then no distinction is made between the two positions $i$ and $i+1$. Thus, the goal is, given a monotonically decreasing distribution $P$, to define the shape of the distribution $Q$ which maximizes the entropic divergence from $P$.

In order to avoid infinite divergences, it is required that the compared distributions, $P$ and $Q$, are defined on the same set and that for each cell the two distributions are non-zero valued, namely $P_i > 0$ and $Q_i > 0$ for $1 \leq i \leq n$. This constraint, together with the discretization of the quantity that is distributed to the cells, implies that a quantity of at least 1 is assigned to each cell. Thus, $P_i \geq 1/M$ and $Q_i \geq 1/M$ for every $i$. Consequently, for constructing the distribution $Q$, the quantity that remains to be arranged is $M-n$.

Intuitively, distributions that have a shape similar to $P$, for example other monotonically decreasing distributions, result in a low divergence. In contrast, distributions that have a completely rearranged shape, w.r.t. $P$, should show high entropic divergences. One of these shapes should be the distribution that mirrors $P$ with respect to the middle position $n/2$. However, here it is shown that the distribution that maximizes the divergence takes a very different shape.

The entropic divergence is a sum of terms of the form $P_i \log \frac{P_i}{Q_i}$. If $P_i < Q_i$, a negative contribution is given to the sum because of the logarithmic function, while for $P_i > Q_i$ positive contributions are given. Thus, the aim is to reduce the number of positions with negative contributions. In addition, each term is weighted by the $P_i$ factor, thus it is preferable to assign positive contributions to the greatest $P_i$ values. On the contrary, negative contributions should be assigned to the smallest $P_i$ values. This means that, if $P$ is monotonically decreasing ordered (from left to right), then positive contributions should be on the left side of the distribution, and negative terms should be on the right side. Furthermore, the greater $P_i$ is w.r.t. $Q_i$, the higher is the value of the divergence. This translates into increasing as much as possible the difference between the greatest $P_i$ values and their corresponding $Q_i$ counterparts. Of course, reducing the quantity that is assigned to the initial positions of $Q$ results in increasing the quantity that is assigned to the right positions of it. If $P$ and $Q$ are not the same distribution (which would imply a divergence equal to 0), negative contributions are unavoidably present. Thus, a further goal is to reduce the number of terms with negative contribution.

All of these considerations lead to the intuition that the distribution that maximizes the entropic divergence is the one that minimizes the quantity assigned to positions from 1 to $n-1$, and that assigns all the remaining amount to the last position. Since the minimum amount of quantity is equal to 1, such a distribution assigns the remaining quantity $M-n+1$ to the last position $n$. In what follows, it is shown that if $P$ is monotonically decreasing ordered then such a distributional shape maximizes the entropic divergence independently of how the quantity is distributed in $P$. This fact also implies that such a maximization is independent of the ordering of $P$. In fact, it is only necessary that the quantity $M-n+1$ is assigned to the position $i$, rather than $n$, where $P_i$ is minimal. However, the ordering is helpful to prove the initial statement.
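The construction described above can be sketched in Python and compared against an exhaustive search on a small instance. This is an illustrative check under stated assumptions, not the authors' implementation: the function names and the example counts are hypothetical, every cell is required to hold at least one dot, and logarithms are taken in base 2.

```python
import math
from itertools import product

def kl_from_counts(p_counts, q_counts, base=2.0):
    """KL(P||Q) after converting two strictly positive count vectors to frequencies."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    return sum(
        (p / p_total) * math.log((p / p_total) / (q / q_total), base)
        for p, q in zip(p_counts, q_counts)
    )

def maximizing_counts(p_counts):
    """One dot per cell, all remaining dots on a cell where P is minimal."""
    n, m = len(p_counts), sum(p_counts)
    z = [1] * n
    z[p_counts.index(min(p_counts))] = m - n + 1
    return z

def brute_force_maximum(p_counts):
    """Largest KL(P||Q) over every Q with the same total and at least one dot per cell."""
    n, m = len(p_counts), sum(p_counts)
    candidates = (q for q in product(range(1, m - n + 2), repeat=n) if sum(q) == m)
    return max(kl_from_counts(p_counts, q) for q in candidates)

# Hypothetical example: 11 dots in 5 cells, monotonically decreasing.
p_counts = [5, 3, 1, 1, 1]
z_counts = maximizing_counts(p_counts)
constructed = kl_from_counts(p_counts, z_counts)
exhaustive = brute_force_maximum(p_counts)
```

On this instance `constructed` and `exhaustive` coincide up to rounding. When several cells of $P$ share the minimal value, any of them can receive the remaining dots; the sketch simply picks the first.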

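From here on, the maximizing distribution is referred to as $Z$ and any other competitor distribution is referred to as $Q$. The proof that the entropic divergence from $P$ to $Z$ is greater than the divergence from $P$ to any other distribution $Q$ is split into three parts. Firstly, two special cases are taken into account; then the results obtained for them are exploited for proving the general case.

The first special case is presented in Figure 1. A total amount of $M$ elements is arranged into $n$ cells in order to compose the distributions. As introduced above, the distribution $P$ has a monotonically decreasing order and the distribution $Z$ assigns a quantity of $M-n+1$ to the last cell. The special case is represented by the distribution $Q$, which assigns a quantity of 2 to the $(n-1)$-th position and a quantity of $M-n$ to the last position. For all the distributions, a minimal quantity of 1 is assigned to every cell. The goal is to show that:

$$KL(P\|Z) > KL(P\|Q) \tag{5}$$

Figure 1: First special case. Each element is represented as a dot that is assigned to one of the cells. A total of 11 elements are assigned to a total of 5 cells, for each of the three distributions, $P$, $Z$ and $Q$, that are present in the case.

A first consideration is that from position $1$ to $n-2$, the two divergences have identical contributions, thus they can be ignored in the comparison. Therefore, it has to be proven that:

$$P_{n-1} \log \frac{P_{n-1}}{Z_{n-1}} + P_n \log \frac{P_n}{Z_n} > P_{n-1} \log \frac{P_{n-1}}{Q_{n-1}} + P_n \log \frac{P_n}{Q_n} \tag{6}$$

By construction, $Z_{n-1} = \frac{1}{M}$ and $Z_n = \frac{M-n+1}{M}$, while $Q_{n-1} = \frac{2}{M}$ and $Q_n = \frac{M-n}{M}$. Thus Equation 6 can be written as:

$$P_{n-1} \log \frac{P_{n-1}}{1/M} + P_n \log \frac{P_n}{(M-n+1)/M} > P_{n-1} \log \frac{P_{n-1}}{2/M} + P_n \log \frac{P_n}{(M-n)/M} \tag{7}$$

which equals

$$P_{n-1} \log P_{n-1} - P_{n-1} \log \frac{1}{M} + P_n \log P_n - P_n \log \frac{M-n+1}{M} > P_{n-1} \log P_{n-1} - P_{n-1} \log \frac{2}{M} + P_n \log P_n - P_n \log \frac{M-n}{M} \tag{8}$$

and therefore, by removing equal terms from the left and right sides of the inequality,

$$- P_{n-1} \log \frac{1}{M} - P_n \log \frac{M-n+1}{M} > - P_{n-1} \log \frac{2}{M} - P_n \log \frac{M-n}{M} \tag{9}$$

that is

$$- P_{n-1} (\log 1 - \log M) - P_n (\log (M-n+1) - \log M) > - P_{n-1} (\log 2 - \log M) - P_n (\log (M-n) - \log M) \tag{10}$$

therefore, since $\log 1 = 0$ and by removing equal terms,

$$- P_n \log (M-n+1) > - P_{n-1} \log 2 - P_n \log (M-n) \tag{11}$$
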

For this specific case, the difference between and is given by a single element. However, since is ordered, it can be assumed that there is a discretized gap between the two positions such that , for . Thus, the inequality can be written, by also changing the verse of it, as









It can be assumed that , for a given factor . Which implies that can be greater or smaller that . In addition, equals . Thus the inequality can be written as


, and, therefore


. If , which is always true because a minimum amount of 1 is assigned to each cell and the two distribution must be different, then is always less than . This implies that is always less than or equal to zero. Thus, independently from the value of , that must be in any case , the inequality is always satisfied.

More in general, Equation 17 can be written as:


, because a given quantity , that is at least and at most , is moved from position to position . If Equation 18 is always verified, it implies that independently of how the quantity is arranged in the last two positions, the distribution is the one that maximizes the entropic divergence. In addition, it also implies two other assertions. The first assertion is that if the number of cells is equal to then is always the maximizing distribution. The second assertion is that if the quantity is moved from the last cell to a specific other cell, not necessary the second-last, the is still the maximizing distribution. In fact, the inequality is independent from the specific cell position and it only requires that and that , thus which means that must be greater than . This consideration highlights the fact that is the distribution that assigns all the available quantity to the cell having the smallest probability in , thus it is independent from the ordering.

In what follows, the proof that Equation 18 is always true is given. In the equation, we can put and thus, in order to assert that the result of the logarithm must be always less than 0, it has to be shown that


. The discriminant is given by that is: equal to 0 for , which is impossible because ; less than 0 for that is still impossible because ; greater than 0 for . Thus, the discriminant is always greater than 0 and the inequality is less than 0 which means that it admits two solutions and such that it is true for . The two solutions are given by . The discriminant can also be written as . For practical applications, the discriminant can be approximated to , thus the inequality is satisfied for , namely that is always true because by definition.

Moving forward, the final goal is to show that is maximizing the divergence w.r.t. any possible distribution that is obtained by arranging the quantity to all the cells. A quantity equal to is assigned to each cell of , such that and . It is recalled that a minimum quantity of 1 must be assigned in order to avoid infinite divergences. The following inequality must be verified:


, that is


. The left side of the inequality is composed of a series of terms each of which equals , and the entire inequality can be written as


, that is



Since is ordered, for each position it happens that , namely . The inequality can be written as


. The arguments of the logarithms are always greater than 1, thus the values of the logarithms are always positive. Moreover, the factors that multiply the logarithms are always positive, because they are probabilities. The inequality can be written as


, with and . Taking into account the fact that the sum of logarithms is greater than the logarithm of the sum, it is trivial to show that the inequality is always satisfied.

3 A notion of normalized divergence and its relation with other measures.

This section investigates the relation between a proposed notion of normalized Kullback-Leibler divergence and other measures of divergence and distance, that are the common unnormalized Kullback-Leibler divergence, the symmetric entropic divergence and the generalized Jaccard distance. The investigations are empirically conducted by computationally generating the distributions for the comparison. The source code for generating the unordered and ordered distributions, together with the computational experiments, is available at the following link https://github.com/vbonnici/KL-maxima.

The generalized Jaccard similarity is a measure that is suitable for comparing multiplicity distributions. It is defined as:

$$J(P,Q) = \frac{\sum_{i=1}^{n} \min(P_i, Q_i)}{\sum_{i=1}^{n} \max(P_i, Q_i)} \tag{26}$$

It can be shown that such a measure ranges from 0 to 1, both included. The minimum value is reached when the two distributions have no multiplicity in common, which means that $Q_i = 0$ when $P_i > 0$ and vice versa. It reaches the maximum value when the two distributions have equal values. It is a notion of similarity, therefore it is in contrast with the meaning of the entropic divergence. Thus, for the purpose of this study, it is converted to $1 - J(P,Q)$ in order to have it as a notion of distance.
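As a minimal illustration of the measure just described, the following Python sketch computes the generalized Jaccard similarity and the derived distance directly on multiplicity (count) vectors. The function names are hypothetical. The example compares the uniform arrangement of 12 dots in 6 cells with the arrangement that puts all the spare dots on the first cell.

```python
def generalized_jaccard_similarity(p_counts, q_counts):
    """Sum of cell-wise minima divided by the sum of cell-wise maxima."""
    if len(p_counts) != len(q_counts):
        raise ValueError("both distributions must have the same number of cells")
    numerator = sum(min(a, b) for a, b in zip(p_counts, q_counts))
    denominator = sum(max(a, b) for a, b in zip(p_counts, q_counts))
    if denominator == 0:
        raise ValueError("at least one cell must be non-empty")
    return numerator / denominator

def generalized_jaccard_distance(p_counts, q_counts):
    """Distance obtained as one minus the similarity."""
    return 1.0 - generalized_jaccard_similarity(p_counts, q_counts)

uniform = [2, 2, 2, 2, 2, 2]   # 12 dots, 6 cells
skewed = [7, 1, 1, 1, 1, 1]    # same total, spare dots on the first cell
distance = generalized_jaccard_distance(uniform, skewed)  # 1 - 7/17
```

Because every cell holds at least one dot in the distributions considered here, the numerator is never zero, so the distance cannot reach 1.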

The generalized Jaccard distance is directly applied to multiplicity distributions, while entropic divergences are applied after converting the distributions into probability/frequency distributions. The construction of the maximizing distribution is exploited in order to normalize the entropic divergence in the range $[0,1]$, both included. Thus, given two distributions $P$ and $Q$, the normalized entropic divergence is calculated as

$$NKL(P\|Q) = \frac{KL(P\|Q)}{KL(P\|Z)} \tag{27}$$

where $Z$ is the distribution for which the maximum entropic divergence from $P$ is reached. Such a maximizing distribution is built by exploiting the results obtained in the previous section. Namely, it distributes a minimum value of 1 to each cell and the remaining quantity is assigned to the cell for which the value in $P$ is the minimum.
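A minimal Python sketch of the normalization follows. It assumes count vectors with at least one dot per cell and equal totals; the function names and the example vectors are hypothetical and the code is not taken from the authors' repository.

```python
import math

def kl_from_counts(p_counts, q_counts, base=2.0):
    """KL(P||Q) after converting two strictly positive count vectors to frequencies."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    return sum(
        (p / p_total) * math.log((p / p_total) / (q / q_total), base)
        for p, q in zip(p_counts, q_counts)
    )

def maximizing_counts(p_counts):
    """One dot per cell, all remaining dots on a cell where P is minimal."""
    n, m = len(p_counts), sum(p_counts)
    z = [1] * n
    z[p_counts.index(min(p_counts))] = m - n + 1
    return z

def normalized_kl(p_counts, q_counts, base=2.0):
    """KL(P||Q) divided by the largest divergence reachable from P."""
    upper = kl_from_counts(p_counts, maximizing_counts(p_counts), base)
    if upper == 0.0:
        return 0.0  # only happens when every cell holds exactly one dot
    return kl_from_counts(p_counts, q_counts, base) / upper

p_counts = [5, 3, 1, 1, 1]
q_counts = [1, 1, 2, 3, 4]
value_base2 = normalized_kl(p_counts, q_counts, base=2.0)
value_natural = normalized_kl(p_counts, q_counts, base=math.e)
```

Since the same logarithm appears in the numerator and in the denominator, `value_base2` and `value_natural` agree up to rounding, which reflects the independence from the base of the logarithm.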

Furthermore, as explained before, the proposed divergence is compared with the unnormalized one, namely $KL(P\|Q)$, and with the commonly used symmetric divergence, also called Jensen-Shannon divergence (JS). The JS divergence is defined as

$$JS(P\|Q) = \frac{1}{2} KL(P\|R) + \frac{1}{2} KL(Q\|R) \tag{28}$$

with $R = \frac{1}{2}(P + Q)$, and it is known to be upper-bounded by 1 if the base of the logarithm is 2 [10].

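Another important divergence is the Hellinger distance, which is defined as

$$H(P,Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{n} \left(\sqrt{P_i} - \sqrt{Q_i}\right)^2} \tag{29}$$

and it can also be written as $H(P,Q) = \sqrt{1 - \sum_{i=1}^{n} \sqrt{P_i Q_i}}$. Important properties of such a divergence are that it implicitly avoids infinite divergences and that it is bounded in the range $[0,1]$.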
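The two bounded measures recalled above can be sketched in Python as follows. The code uses the standard textbook definitions with a base-2 logarithm; the function names are hypothetical, and `mixture` denotes the midpoint distribution of the two inputs.

```python
import math

def to_probabilities(counts):
    """Convert a multiplicity distribution (dots per cell) into frequencies."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, base=2.0):
    """KL(P||Q); zero-probability terms of P contribute nothing."""
    return sum(a * math.log(a / b, base) for a, b in zip(p, q) if a > 0)

def jensen_shannon(p, q, base=2.0):
    """Average of the divergences of P and Q from their midpoint distribution."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture, base) + 0.5 * kl_divergence(q, mixture, base)

def hellinger(p, q):
    """Hellinger distance as a scaled Euclidean distance between square roots."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def hellinger_from_coefficient(p, q):
    """Equivalent form based on the sum of sqrt(P_i * Q_i)."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - coefficient))

uniform = to_probabilities([2, 2, 2, 2, 2, 2])
skewed = to_probabilities([7, 1, 1, 1, 1, 1])
js_value = jensen_shannon(uniform, skewed)
h_value = hellinger(uniform, skewed)
```

Both measures stay finite even when one distribution has empty cells, because the midpoint distribution is non-zero wherever either input is non-zero and the Hellinger distance involves no logarithm.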

Unordered distributions are built by using the computational procedure, then two-by-two comparisons are performed. A scatter plot is made by using two measures, for example the generalized Jaccard distance and the normalized entropic divergence, for locating each two-by-two comparison. The chart is also equipped with two histograms, located beside the axes, that report the number of instances that fall within a given range of values.

Figure 2: Relation between the proposed normalized Kullback-Leibler divergence and (a) unnormalized Kullback-Leibler divergence; (b) symmetric Kullback-Leibler divergence; (c) generalized Jaccard distance; (d) Hellinger distance.

Figure 2 reports the relations between the proposed normalized divergence and the other investigated measures. Calculations were performed by setting a number of cells equal to 5 and a total distributed quantity of 15. The experiment generated 1001 unordered distributions, of which 30 were monotonically decreasing ordered. Thus, a total of two-by-two distribution comparisons were performed.

The higher the value of the entropic divergence, the less the normalized measure is related to the other $f$-divergences (see Figure 2 (a), (b) and (d)). However, the proposed measure is more correlated with the non-symmetric divergence than with the other measures. The Pearson correlation coefficient [8] reaches a value of 0.97 between the proposed divergence and the unnormalized one, and a value of 0.96 between the proposed measure and the symmetric divergence. Table 1 reports the complete list of Pearson's correlations between the compared measures.

Figure 2 (c) shows the relation between the proposed measure and the generalized Jaccard distance. The two measures are strongly correlated; in fact, the shape of the plotted dots runs along the diagonal of the chart. Moreover, a correlation equal to 0.90 is obtained by calculating the Pearson correlation between two vectors, one with the values of the generalized Jaccard distance and the other one with the values of the normalized Kullback-Leibler divergence, such that the values of the two vectors in a specific position correspond to the same compared distributions. However, some important differences emerge. The generalized Jaccard distance is influenced by the fact that values close to 1 cannot be reached because the compared distributions have no term equal to 0. In fact, the maximum observable distance is 0.8. In addition, the obtained distances form clusters in specific portions of the chart. This behaviour directly emerges from Equation 26, since the Jaccard distance tends to flatten the pointwise comparison among the elements in the domain of the distribution into a sum of values. Moreover, the distance between such clusters decreases on approaching the value 0.8. In contrast, values of the normalized Kullback-Leibler divergence are spread from 0 to 1 without forming any visible cluster.

In order to investigate such a clustering phenomenon, the distance between consecutive values of the two measures has been taken into account. Given a set of comparisons, a vector with one entry per comparison is built from the values of the specific measure on such comparisons. The vector is sorted, and runs within the vector reporting the same value are substituted with one single value. The differences between adjacent positions of the vector are extracted. Then, the mean and the standard deviation are computed. The elimination of the runs on the vector of the generalized Jaccard measure decreases the size of the vector to 11, as can be observed in the figure. The distances of the generalized Jaccard measure have a mean equal to 0.08 and a standard deviation of 0.02. On the contrary, the distances of the normalized entropic divergence have an average of 0.00004 and a standard deviation of 0.0005. Thus, it seems that the divergence is not forming clusters.
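The procedure described above (sort, collapse runs of equal values, take adjacent differences, then compute mean and standard deviation) can be sketched in Python as follows. The rounding used to decide when two floating-point values count as the same value, and the use of the population standard deviation, are assumptions of this sketch; the text does not specify either choice.

```python
import statistics

def gap_statistics(values, ndigits=12):
    """Return (number of distinct values, mean gap, standard deviation of the gaps)."""
    # Collapse runs of equal values; rounding guards against floating-point noise.
    distinct = sorted({round(v, ndigits) for v in values})
    if len(distinct) < 2:
        raise ValueError("at least two distinct values are needed")
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    return len(distinct), statistics.mean(gaps), statistics.pstdev(gaps)

# Hypothetical measure values with repeated entries.
count, mean_gap, std_gap = gap_statistics([0.5, 0.1, 0.1, 0.3, 0.5])
```

A measure whose values pile up on a few levels yields a small `count` and a comparatively large `mean_gap`, whereas a measure that spreads its values yields many distinct values separated by small gaps.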

Lastly, the histograms of the values of the two measures both form a shape similar to a Poisson distribution. The normalized entropic divergence histogram has a mode of about 0.15 and its tail tends to the divergence value 1. In contrast, the histogram of the generalized Jaccard distance has a mode of about 0.5 and its tail goes in the opposite direction, thus it tends to the value 0.

Measure 1                     Measure 2                       Pearson correlation
Normalized Kullback-Leibler   Kullback-Leibler                0.9893
Normalized Kullback-Leibler   Jensen-Shannon divergence       0.9888
Normalized Kullback-Leibler   Generalized Jaccard distance    0.9549
Normalized Kullback-Leibler   Hellinger distance              0.9881
Kullback-Leibler              Jensen-Shannon divergence       0.9926
Kullback-Leibler              Generalized Jaccard distance    0.9232
Kullback-Leibler              Hellinger distance              0.9932
Jensen-Shannon divergence     Generalized Jaccard distance    0.9441
Jensen-Shannon divergence     Hellinger distance              0.9999
Hellinger distance            Generalized Jaccard distance    0.9411
Table 1: Pearson's correlation among the investigated measures on two-by-two comparisons of ordered distributions generated by distributing a quantity of 15 to 5 cells.

Entropic divergences, as well as other measures, can be used for prioritizing elements w.r.t. their deviance from randomness or, generically, from a background model. Thus, it can be interesting to study how the rank assigned to elements, based on their divergence, changes when the four different measures are used. In what follows, the uniform distribution is used as background model and the measure of divergence from it is calculated for the set of ordered distributions that can be formed by taking into account the same quantity that is distributed in the uniform shape. For the experiments, a number of cells equal to 8 and a total quantity of 32 have been taken into account. In this way, the uniform distribution assigns a quantity of 4 to each cell. The difference w.r.t. the previous experiments, where 5 cells and 15 elements are considered, is due to the fact that the previous experiment generates only 30 distinct ordered distributions, which is a relatively small number. On the contrary, a setup with 8 cells and 32 elements generates a high number of unordered distributions (2,629,575) that leads to a huge number of two-by-two comparisons. On the other hand, the new setup generates 919 ordered distributions, which can be considered a sufficient amount for drawing experimental conclusions.

Firstly, the correlation between the measures and the properties of the compared distributions is investigated. Entropy, coefficient of variation, skewness and kurtosis index are the considered properties. It has to be noticed that some values of the skewness and kurtosis statistics may appear unexpected; however, such unexpected behaviour is due to the fact that relatively small discrete distributions are taken into account. In addition, the generated distributions are more similar to exponential distributions than to unimodal ones. For example, only positive values of skewness are expected because the examined distributions are monotonically ordered; however, the distribution whose values are

has a skewness of 0 because mean, mode and median of the distribution have the same value. The distribution has a negative skewness because the mode (1) is smaller than the mean (4).

Figure 3 shows the relation between the four investigated measures and the entropy of the ordered distribution that is compared with the uniform distribution. The simple Kullback-Leibler divergence is the measure which better correlates with the entropy, followed by the proposed normalized divergence. Table 4 reports the correlations between the measures and the entropy. The numeric correlations confirm what is shown by the graphics.

Figure 3: Scatter plots generated by putting in relation four of the investigated measures and the entropy of the set of monotonically ordered distributions, generated with 8 cells and 32 dots, and the corresponding uniform distribution.

Figure 4 shows the relation between the four measures and the coefficient of variation of the ordered distribution that is compared with the uniform one. Pearson's correlations are reported in Table 4. Differently from entropy-related correlations, the proposed normalized measure is the one which best correlates with the coefficient of variation, followed by the unnormalized entropic divergence. In addition, differently from the unnormalized Kullback-Leibler divergence and the Jensen-Shannon divergence, the proposed normalized divergence forms a sigmoid curve rather than an exponential trend.

Figure 4: Scatter plots generated by putting in relation four of the investigated measures and the coefficient of variation of the set of monotonically ordered distributions, generated with 8 cells and 32 dots, and the corresponding uniform distribution.

Entropy and coefficient of variation are the distributional properties that best correlate with the investigated measures. In fact, as can be seen from Figures 5 and 6, and from Table 4, the skewness and the kurtosis index of the compared distributions weakly correlate with the measures. However, both distributional properties form shapes similar to grids when they are plotted. This behaviour is possibly due to the discrete nature of the compared distributions.

Figure 5: Scatter plots generated by putting in relation four of the investigated measures and the Kurtosis index of the set of monotonically ordered distributions, generated with 8 cells and 32 dots, and the corresponding uniform distribution.
Figure 6: Scatter plots generated by putting in relation four of the investigated measures and the skewness of the set of monotonically ordered distributions, generated with 8 cells and 32 dots, and the corresponding uniform distribution.

Histograms on Figures 4, 5 and 6 are also useful for studying the range of values that the investigated measures can assume and how those values are distributed. In fact, this properties can be extracted by looking at the histograms on the right side of each chart. The proposed normalized divergence ranges form 0 to circa 0.5, because one of the two compared distribution is always the uniform distribution. In fact, the monotonically ordered distribution that more diverges from the uniform distribution is the one which assigns all the available quantity to the first cell. Such a distribution is completely opposed to and the uniform distribution is exactly in the middle of them. Thus the divergence from the distribution to the uniform one is half of the divergence from . Table 2 shows the maximum value that each investigated measure reaches at varying the number of cells and dots with which distributions are built from. All the measures have a minimum value of 0 because the uniform distribution is among the distribution that are compared to itself. The proposed normalized divergence takes values that are closed to 0.5 but never equal to such a value. The reason resides in the discretized nature of the compared distributions. However, some pattern emerge from the table. In fact, the values of the measures are directly related to the number of dots that are distribute. The smaller is the number of dots, the higher is the value of the proposed normalized measure. This behaviour is opposite to the one of the other three measures which increase their value on increasing the number of distributed dots. Intuitively, the distribution which maximizes the divergence/distance from the uniform distribution is the one which assigns all the available dots to the first cell, thus it is specular to . This intuition is also confirmed by computational experiments. 
The fact that the measure takes different values depends on the ratio between the dots that are assigned to the first cell and the number of cells. For example, the uniform distributions obtained for 6 cells and 12 dots and for 7 cells and 14 dots are almost the same: both of them assign 2 dots to each cell. However, the number of available dots, after assigning one dot to each cell, is 6 in the first case and 7 in the second case. Thus the two generalized Jaccard distances differ, 0.5882 versus 0.6000 in Table 2, because, except for the first cell, all the other cells carry the same value in both configurations, and the configuration with 7 cells has one additional cell. This difference leads to a different resulting value. Similar considerations can be made for the other measures.
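
The comparison between the two configurations can be checked with a few lines of exact arithmetic. The sketch below is not taken from the paper: it assumes that the generalized Jaccard distance is one minus the ratio between the sum of cell-wise minima and the sum of cell-wise maxima, and that the most divergent distribution assigns one dot to each cell and all the remaining dots to the first cell. The function names are hypothetical.

```python
from fractions import Fraction


def generalized_jaccard_distance(p, q):
    """Return 1 - sum(min) / sum(max), computed exactly on dot counts."""
    num = sum(min(a, b) for a, b in zip(p, q))
    den = sum(max(a, b) for a, b in zip(p, q))
    return 1 - Fraction(num, den)


def extreme_and_uniform(cells, dots):
    """Build the two compared count vectors (dots must be a multiple of cells)."""
    # Assumption: one dot per cell, every remaining dot goes to the first cell.
    extreme = [1 + (dots - cells)] + [1] * (cells - 1)
    uniform = [dots // cells] * cells
    return extreme, uniform


for cells, dots in [(6, 12), (7, 14)]:
    extreme, uniform = extreme_and_uniform(cells, dots)
    d = generalized_jaccard_distance(extreme, uniform)
    print(cells, dots, d, round(float(d), 4))
```

Under these assumptions the two distances are 10/17 and 3/5, which round to the 0.5882 and 0.6000 reported in Table 2. Counts can be used in place of probabilities because both vectors sum to the same number of dots, so the normalization cancels in the ratio.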

Cells Dots Norm. KL Unnorm. KL Jensen-Shannon Hellinger Gen. Jaccard
6 12 0.5078 0.6376 0.1395 0.0989 0.5882
6 18 0.4498 1.0876 0.2399 0.1719 0.7143
6 24 0.4297 1.3629 0.3046 0.2201 0.7692
6 30 0.4164 1.5480 0.3500 0.2546 0.8000
7 14 0.5151 0.7143 0.1518 0.1082 0.6000
7 21 0.4687 1.2057 0.2578 0.1857 0.7273
7 28 0.4502 1.5038 0.3257 0.2364 0.7826
7 35 0.4374 1.7033 0.3731 0.2726 0.8136
8 16 0.5233 0.7831 0.1622 0.1161 0.6087
8 24 0.4845 1.3103 0.2727 0.1973 0.7368
8 32 0.4672 1.6280 0.3429 0.2500 0.7925
8 40 0.4546 1.8397 0.3919 0.2876 0.8235
9 18 0.5315 0.8455 0.1711 0.1230 0.6154
9 27 0.4981 1.4043 0.2851 0.2072 0.7442
9 36 0.4815 1.7391 0.3573 0.2616 0.8000
9 45 0.4691 1.9614 0.4076 0.3002 0.8312
10 20 0.5394 0.9027 0.1789 0.1291 0.6207
10 30 0.5098 1.4897 0.2958 0.2158 0.7500
10 40 0.4938 1.8395 0.3696 0.2716 0.8060
10 50 0.4815 2.0713 0.4208 0.3112 0.8372
Table 2: Maximum values of the five investigated measures when varying the number of cells and dots from which the distributions are formed.
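
As a rough cross-check of Table 2, the following sketch recomputes three of the columns for a given number of cells and dots. It rests on assumptions that are inferred rather than stated in this section: logarithms are in base 2, the unnormalized KL is taken from the extreme distribution (one dot per cell, the rest on the first cell) to the uniform one, and the Hellinger column corresponds to one minus the Bhattacharyya coefficient. The normalized KL is left out because its definition is given elsewhere in the paper. All function names are hypothetical.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence of p from q, in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL of p and q from their midpoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def one_minus_bhattacharyya(p, q):
    """Assumed form of the Hellinger column: 1 - sum(sqrt(p_i * q_i))."""
    return 1 - sum(math.sqrt(a * b) for a, b in zip(p, q))


def table2_row(cells, dots):
    """Return (unnormalized KL, Jensen-Shannon, Hellinger) for one setting."""
    # Assumption: the maximizing distribution keeps one dot in every cell.
    counts = [1 + (dots - cells)] + [1] * (cells - 1)
    p = [c / dots for c in counts]
    u = [1 / cells] * cells
    return kl(p, u), jensen_shannon(p, u), one_minus_bhattacharyya(p, u)


for cells, dots in [(6, 12), (8, 32)]:
    print(cells, dots, [round(v, 4) for v in table2_row(cells, dots)])
```

With these choices the rows for 6 cells and 12 dots and for 8 cells and 32 dots agree with the tabulated values to the reported precision, which supports the assumptions but does not prove that they match the original implementation.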

The difference in how the measures spread their values along the range from 0 to the maximum value is summarized in Table 3. Each experiment regards a specific number of cells and dots, as for the previous analysis. As a measure of spread, the average value divided by the maximum value is used. The closer this ratio is to 0.5, the more evenly spread the values are. On the contrary, if the ratio tends to 0, then the values are concentrated towards 0, and, similarly, they are concentrated towards the maximum if the ratio tends to 1. The proposed normalized divergence is the one that comes closest to 0.5, with an average value of 0.4349 over the complete set of experiments. The unnormalized KL tends to 0 more than the Jensen-Shannon divergence, which is in contrast with the mode observed in the figures, and the generalized Jaccard distance tends more towards the maximum value, with an average of 0.6103.

Cells Dots Norm. KL Unnorm. KL Jensen-Shannon Hellinger Gen. Jaccard
6 12 0.5301 0.4089 0.4418 0.4379 0.6203
6 18 0.4403 0.3217 0.3540 0.3495 0.5979
6 24 0.3987 0.2865 0.3159 0.3110 0.5848
6 30 0.3739 0.2672 0.2941 0.2886 0.5766
7 14 0.5277 0.3987 0.4365 0.4315 0.6277
7 21 0.4396 0.3139 0.3523 0.3466 0.6069
7 28 0.3965 0.2778 0.3131 0.3071 0.5910
7 35 0.3709 0.2583 0.2910 0.2846 0.5815
8 16 0.5318 0.3931 0.4379 0.4314 0.6483
8 24 0.4390 0.3066 0.3505 0.3436 0.6144
8 32 0.3956 0.2711 0.3117 0.3046 0.5967
8 40 0.3694 0.2514 0.2891 0.2818 0.5860
9 18 0.5332 0.3871 0.4369 0.4292 0.6578
9 27 0.4404 0.3017 0.3508 0.3427 0.6218
9 36 0.3961 0.2660 0.3114 0.3033 0.6022
9 45 0.3692 0.2462 0.2885 0.2803 0.5906
10 20 0.5362 0.3832 0.4384 0.4294 0.6708
10 30 0.4421 0.2977 0.3514 0.3423 0.6284
10 40 0.3972 0.2621 0.3119 0.3029 0.6076
10 50 0.3696 0.2422 0.2886 0.2796 0.5951
avg 0.4349 0.3071 0.3483 0.3414 0.6103
Table 3: Average divided by maximum value of the five investigated measures when varying the number of cells and dots from which the distributions are formed.
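
The spread statistic of Table 3 can be sketched as follows. The code assumes that the set of monotonically ordered distributions consists of all non-increasing arrangements of the dots in which every cell receives at least one dot, and that the measures are defined as one-sided KL in bits and as one minus the min/max ratio for the generalized Jaccard distance. These are assumptions made for illustration, and the helper names are hypothetical.

```python
import math


def partitions_exact(total, parts, largest=None):
    """Yield non-increasing tuples of `parts` positive integers summing to `total`."""
    if largest is None:
        largest = total
    if parts == 0:
        if total == 0:
            yield ()
        return
    lo = max(1, -(-total // parts))          # smallest feasible first part
    hi = min(largest, total - (parts - 1))   # leave one dot for every other cell
    for first in range(hi, lo - 1, -1):
        for rest in partitions_exact(total - first, parts - 1, first):
            yield (first,) + rest


def kl_from_uniform(counts):
    """KL divergence (bits) of the count distribution from the uniform one."""
    total, cells = sum(counts), len(counts)
    return sum((c / total) * math.log2((c / total) * cells) for c in counts)


def jaccard_from_uniform(counts):
    """Generalized Jaccard distance between the counts and the uniform counts."""
    u = sum(counts) / len(counts)
    return 1 - sum(min(c, u) for c in counts) / sum(max(c, u) for c in counts)


def spread(measure, cells, dots):
    """Average value divided by maximum value over all ordered distributions."""
    values = [measure(q) for q in partitions_exact(dots, cells)]
    return (sum(values) / len(values)) / max(values)


print(round(spread(kl_from_uniform, 6, 12), 4))
print(round(spread(jaccard_from_uniform, 6, 12), 4))
```

For 6 cells and 12 dots there are 11 such distributions, and the two ratios come out close to the 0.4089 and 0.6203 listed in the first row of Table 3. The ratio is undefined when the number of dots equals the number of cells, since the only distribution is then the uniform one and the maximum is 0.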

Lastly, the difference in the ranking produced by the four measures has been investigated. Experimental results were obtained by using 8 cells and 32 dots. The uniform distribution was compared to the set of monotonically decreasing ordered distributions, as for the previous experiment. Then, distributions were ranked depending on the value each measure assigned to them. Figure 7 shows the comparison between the normalized entropic divergence and the three other measures in assigning the rank to the distributions. Each point, in one of the three plots, is a given distribution whose coordinates, in the Cartesian plane, are given by the ranks assigned by the two compared measures. These charts give an idea of how different a ranking can be when different measures are applied. A mathematical way of comparing rankings is Spearman’s rank correlation coefficient [5], whose values are reported in Table 4. The reported correlations may appear significantly high; however, there is a discordance between the measures from circa 0.05 to 0.001, which means that from 5% to 0.1% of the elements are ranked differently. Such a difference may, for example, lead to different empirical p-values, which may change the results of a study.
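
For readers who want to reproduce this kind of ranking comparison, a minimal sketch of Spearman's rank correlation is given here. It computes average ranks (so that tied values share a rank) and then the Pearson correlation of the two rank vectors. The scores in the usage example are invented for illustration and are not values from the experiments.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores assigned by two measures to the same four distributions.
measure_a = [0.10, 0.40, 0.35, 0.80]
measure_b = [0.05, 0.30, 0.45, 0.90]
print(spearman(measure_a, measure_b))
```

In the toy example the two measures swap the order of the second and third distributions, which lowers the coefficient from 1 to about 0.8. Ties are detected by exact equality, so measure values computed in floating point may need rounding before ranking.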

Figure 7: Scatter plots obtained by taking into account the rank assigned by the proposed normalized Kullback-Leibler and the other investigated measures. The complete set of monotonically ordered distributions generated with 8 cells and 32 dots was used for extracting the rankings.
Normalized Kullback-Leibler Entropy -0.9892
Kullback-Leibler Entropy -0.9999
Jensen-Shannon divergence Entropy -0.9804
Generalized Jaccard distance Entropy -0.9232
Hellinger distance Entropy -0.9932
Normalized Kullback-Leibler Coefficient of variation 0.9872
Kullback-Leibler Coefficient of variation 0.9832
Jensen-Shannon divergence Coefficient of variation 0.9678
Generalized Jaccard distance Coefficient of variation 0.9181
Hellinger distance Coefficient of variation 0.9649
Normalized Kullback-Leibler Skewness 0.6343
Kullback-Leibler Skewness 0.6096
Jensen-Shannon divergence Skewness 0.6554
Generalized Jaccard distance Skewness 0.5143
Hellinger distance Skewness 0.5475
Normalized Kullback-Leibler Kurtosis 0.4715
Kullback-Leibler Kurtosis 0.4795
Jensen-Shannon divergence Kurtosis 0.5170
Generalized Jaccard distance Kurtosis 0.2622
Hellinger distance Kurtosis 0.3995
Normalized Kullback-Leibler Kullback-Leibler 0.9989
Normalized Kullback-Leibler Jensen-Shannon divergence 0.9909
Normalized Kullback-Leibler Generalized Jaccard distance 0.9695
Normalized Kullback-Leibler Hellinger distance 0.9905
Kullback-Leibler Jensen-Shannon divergence 0.9947
Kullback-Leibler Generalized Jaccard distance 0.9695
Kullback-Leibler Hellinger distance 0.9946
Jensen-Shannon divergence Generalized Jaccard distance 0.9742
Jensen-Shannon divergence Hellinger distance 1.0000
Hellinger distance Generalized Jaccard distance 0.9728
Table 4: Pearson’s and Spearman rank correlations among the investigated measures on comparing ordered distributions with the uniform one generated by distributing a quantity of 32 to 8 cells.

4 Conclusion

This study shows that, given a probability distribution P that has been built via a discretized process, there exists another distribution Q that maximizes the entropic divergence from P, if infinite divergences are avoided. Furthermore, the shape of such a distribution is here characterized. This result is used for providing a notion of entropic divergence that is normalized between 0 and 1, and an empirical evaluation of such a divergence w.r.t. other commonly used measures is reported. The evaluation shows that the proposed divergence has its own specific behaviour, on varying the properties of the compared distributions, that differs from already known measures.

If we think about quantum theory, the real world is made of discretized quantities called quanta (singular quantum). Quantum theory is strongly based on the concept of probability distribution, which is used, for example, to define the probability of an electron being in a certain place at a given moment. Thus, quantum probability distributions are in their essence multiplicity distributions. This consideration implies that the findings presented in the current study may have a wider range of applications.


  • [1] Ikuo Arizono and Hiroshi Ohta. A test for normality based on Kullback-Leibler information. The American Statistician, 43(1):20–22, 1989.
  • [2] Dmitry I Belov and Ronald D Armstrong. Automatic detection of answer copying via Kullback-Leibler divergence and K-index. Applied Psychological Measurement, 34(6):379–392, 2010.
  • [3] Pierre Brémaud. An introduction to probabilistic modeling. Springer Science & Business Media, 2012.
  • [4] Bertrand S. Clarke. Asymptotic normality of the posterior in relative entropy. IEEE Transactions on Information Theory, 45(1):165–176, 1999.
  • [5] Wayne W Daniel et al. Applied nonparametric statistics. Houghton Mifflin, 1978.
  • [6] Ernst Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik (Crelles Journal), 1909(136):210–271, 1909.
  • [7] Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
  • [8] Joseph Lee Rodgers and W Alan Nicewander. Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1):59–66, 1988.
  • [9] Yulin Li and Liuxia Wang. Testing for homogeneity in mixture using weighted relative entropy. Communications in Statistics - Simulation and Computation, 37(10):1981–1995, 2008.
  • [10] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
  • [11] Xiaodong Lin, Jennifer Pittman, and Bertrand Clarke. Information conversion, effective samples, and parameter size. IEEE Transactions on Information Theory, 53(12):4438–4456, 2007.
  • [12] Vincenzo Manca. Infobiotics. Springer, 2013.
  • [13] Sriram Pemmaraju and Steven Skiena. Computational Discrete Mathematics: Combinatorics and Graph Theory with Mathematica®. Cambridge University Press, 2003.
  • [14] Luca Pinello, Giosuè Lo Bosco, Bret Hanlon, and Guo-Cheng Yuan. A motif-independent metric for DNA sequence specificity. BMC Bioinformatics, 12(1):408, 2011.
  • [15] Alfréd Rényi et al. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.
  • [16] Igal Sason and Sergio Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
  • [17] Richard P Stanley. Enumerative combinatorics volume 1 second edition. Cambridge studies in advanced mathematics, 2011.
  • [18] Ihar Volkau, KN Bhanu Prakash, Anand Ananthasubramaniam, Aamer Aziz, and Wieslaw L Nowinski. Extraction of the midsagittal plane from morphological neuroimages using the Kullback-Leibler’s measure. Medical Image Analysis, 10(6):863–874, 2006.
  • [19] Federico Zambelli, Francesca Mastropasqua, Ernesto Picardi, Anna Maria D’Erchia, Graziano Pesole, and Giulio Pavesi. RNentropy: an entropy-based tool for the detection of significant variation of gene expression across multiple RNA-Seq experiments. Nucleic Acids Research, 46(8):e46–e46, 2018.