    # Jensen-Shannon Divergence as a Goodness-of-Fit Measure for Maximum Likelihood Estimation and Curve Fitting

The coefficient of determination, known as R^2, is commonly used as a goodness-of-fit criterion for fitting linear models. R^2 is somewhat controversial when fitting nonlinear models, although it may be generalised on a case-by-case basis to deal with specific models such as the logistic model. Assume we are fitting a parametric distribution to a data set using, say, the maximum likelihood estimation method. A general approach to measure the goodness-of-fit of the fitted parameters, which we advocate herein, is to use a nonparametric measure for model comparison between the raw data and the fitted model. In particular, for this purpose we put forward the Jensen-Shannon divergence (JSD) as a metric, which is bounded and has an intuitive information-theoretic interpretation. We demonstrate, via a straightforward procedure making use of the JSD, that it can be used as part of maximum likelihood estimation or curve fitting as a measure of goodness-of-fit, including the construction of a confidence interval for the fitted parametric distribution. We also propose that the JSD can be used more generally in nonparametric hypothesis testing for model selection.



## 1 Introduction

We assume a general scenario where we have some data from which we derive an empirical distribution, which is then fitted, using maximum likelihood or curve fitting, to some, possibly parametric, distribution.

The coefficient of determination, R², is a well-known measure of goodness-of-fit for linear regression models. Despite its wide use, in its original form it is not fully adequate for nonlinear models. One proposed remedy is to define R² as a comparison of a given model to the null model, a view that allows for the generalisation of R². Further, the inappropriateness of R² for nonlinear models has been clearly demonstrated via a series of Monte Carlo simulations. In addition, a novel R²-type measure based on the Kullback-Leibler divergence was proposed as a measure of goodness-of-fit for regression models in the exponential family.

Alternative nonparametric methods have also been proposed. In particular, the Akaike information criterion (AIC) and its counterpart, the Bayesian information criterion (BIC) [4, 41], are widely used estimators for model selection. Both the AIC and the BIC are asymptotically valid maximum likelihood estimators, with penalty terms to discourage overfitting. The likelihood ratio test is also an established method for model selection between a null model and an alternative maximum likelihood model [42, 27]. Despite the popularity of maximum likelihood methods, there is some controversy in their application as goodness-of-fit tests.

The Jensen-Shannon divergence (JSD) [28, 9] is a symmetric form of the nonparametric Kullback-Leibler divergence (KLD), providing a measure of distance between two probability distributions. It has been employed in a wide range of applications, such as detecting edges in digital images, measuring the similarity of texts, comparing genomes in bioinformatics, distinguishing between quantum states in physics, and measuring distances between distributions in a social setting.

Here we apply the JSD as an alternative measure of the goodness-of-fit of a parametric distribution, acting as the model, to an empirical distribution, which comprises the raw data. The JSD provides a direct measure of goodness-of-fit without requiring the maximum value of the likelihood function, as used in the AIC and BIC, or any linearity assumptions about the model being fitted, as often made when using R².

The rest of the paper is organised as follows. In Section 2, we introduce the JSD and some of its characteristics. In Section 3, we define the JSD as a measure of goodness-of-fit within the context of distribution fitting, and define the notion of the JSD factor. In Section 4, we describe experiments we performed, with simulated data in Subsection 4.1 and empirical data in Subsection 4.2, to test the viability of using the JSD as a measure of goodness-of-fit. Finally, in Section 5, we give our concluding remarks.

## 2 The Jensen-Shannon Divergence

Let P and Q be finite distributions over n outcomes, and let M = (P + Q)/2 be the mixture distribution of P and Q. Then the JSD between P and Q, denoted by JSD(P,Q), is given by

JSD(P,Q) = H(M) − ½H(P) − ½H(Q), (1)

where the entropy of a distribution P, denoted by H(P), is defined as

H(P) = −∑_{i=1}^{n} p_i ln p_i. (2)

We note that the JSD is bounded between 0 and ln 2, and can thus be readily normalised; for convenience we will assume that the JSD is normalised. Moreover, the JSD may still be used if P and Q are improper, i.e. their sums are less than one. We further note that in order for the JSD to be a metric we need to take the square root of JSD(P,Q), and thus whenever we compute the value of the JSD we will assume that its square root is taken.
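As a concrete illustration, the normalised square-root JSD of equations (1) and (2) can be computed directly from two probability vectors; the following is a minimal Python sketch (the function name and interface are our own):

```python
import numpy as np

def jsd(p, q):
    """Normalised square-root Jensen-Shannon divergence of two
    probability vectors, following equations (1) and (2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # the mixture distribution M

    def entropy(d):
        d = d[d > 0]   # the convention 0 ln 0 = 0
        return -np.sum(d * np.log(d))

    div = entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)  # equation (1)
    div /= np.log(2)   # normalise from [0, ln 2] to [0, 1]
    return np.sqrt(max(div, 0.0))  # the square root makes it a metric

print(jsd([0.5, 0.5], [0.5, 0.5]))  # identical distributions: 0.0
print(jsd([1.0, 0.0], [0.0, 1.0]))  # disjoint distributions: 1.0
```

Identical distributions are at distance 0, while distributions with disjoint support are at the maximal distance 1.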

The intuition behind the JSD is as follows. We have knowledge of both distributions P and Q and we would like to know how distant they are from each other. To this end, we set up a simple experiment, where we take a sample of length one from the mixture distribution M. Now, how much information about which of P and Q generated the sample do we gain from observing it? The answer is exactly JSD(P,Q). If we are none the wiser, i.e. the sample could have equally come from P or Q, then JSD(P,Q) = 0, and for all intents and purposes we consider P to be equal to Q. On the other hand, we may have some information about whether the sample came from P or Q, and the information we gain, on a scale between 0 and 1, is exactly the normalised JSD.

Now, let X be a discrete random variable distributed according to M, and let Z be a binary indicator random variable, which is associated with P if Z = 0 and with Q if Z = 1. It can then be shown that

JSD(P,Q) = MI(X,Z), (3)

where

MI(X,Z) = H(X) − H(X|Z), (4)

and H(X|Z) is the conditional entropy of X conditioned on Z.
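The identity (3) can be checked numerically by building the joint distribution of X and Z explicitly; the following sketch uses two arbitrarily chosen illustrative distributions P and Q:

```python
import numpy as np

def entropy(d):
    d = d[d > 0]                 # the convention 0 ln 0 = 0
    return -np.sum(d * np.log(d))

P = np.array([0.6, 0.3, 0.1])    # illustrative distributions
Q = np.array([0.2, 0.5, 0.3])
M = 0.5 * (P + Q)

# JSD via equation (1)
jsd_val = entropy(M) - 0.5 * entropy(P) - 0.5 * entropy(Q)

# mutual information MI(X, Z) from the joint p(z, x), where Z is a
# fair indicator choosing between P (Z = 0) and Q (Z = 1)
joint = np.vstack([0.5 * P, 0.5 * Q])
px, pz = joint.sum(axis=0), joint.sum(axis=1)
mi = sum(joint[z, x] * np.log(joint[z, x] / (pz[z] * px[x]))
         for z in range(2) for x in range(3) if joint[z, x] > 0)

assert np.isclose(jsd_val, mi)   # equation (3): JSD(P,Q) = MI(X,Z)
```

Note that the marginal of X under the joint is exactly the mixture M, as required by the construction above.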

It is worth noting that, asymptotically, the JSD is distributed as a quarter of a Chi-squared distribution with n − 1 degrees of freedom, i.e.,

JSD(P,Q) ≈ ¼ ∑_{i=1}^{n} (p_i − q_i)² / q_i, (5)

where the sum on the right-hand side of (5) is the Chi-squared goodness-of-fit test statistic. When the number of degrees of freedom is large, a Normal approximation of the Chi-squared statistic is often used.

The JSD extends to cumulative distributions in a natural manner by replacing probability mass functions with their cumulative counterparts. More specifically, this is formalised using the extension of the Kullback-Leibler divergence (KLD) in [45, Definition 2.1] to cumulative distributions, and the fact that

JSD(P,Q) = ½KLD(P,M) + ½KLD(Q,M), (6)

where the KLD of two distributions, P and Q, denoted by KLD(P,Q), is defined as

KLD(P,Q) = ∑_{i=1}^{n} p_i ln(p_i / q_i). (7)

An important fact to note is that the square root of the cumulative version of the JSD is also a metric, and thus it essentially possesses the same properties as the non-cumulative JSD, but with a different normalisation constant. Moreover, it is often advantageous to use the cumulative distribution instead of the probability mass function, as it may be easier to interpret and manipulate, and it also acts to smooth the data. For these reasons, we prefer to employ the cumulative JSD in our experiments, and from now on, for convenience, will not make any distinction between the two versions, referring to both simply as the JSD.
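A direct transcription of this cumulative variant, obtained by substituting CDF values for probability masses in equation (1), might look as follows (a sketch with our own function names; the cumulative normalisation constant is omitted):

```python
import numpy as np
from math import erf, sqrt

def cumulative_jsd(sample, cdf):
    # empirical CDF of the sample, evaluated at the sorted sample points
    x = np.sort(np.asarray(sample, dtype=float))
    ecdf = np.arange(1, len(x) + 1) / len(x)
    fcdf = cdf(x)                      # fitted parametric CDF at the same points
    m = 0.5 * (ecdf + fcdf)            # mixture of the two CDFs
    h = lambda d: -np.sum(d[d > 0] * np.log(d[d > 0]))
    return np.sqrt(max(h(m) - 0.5 * h(ecdf) - 0.5 * h(fcdf), 0.0))

# a well-fitting CDF yields a smaller distance than a badly mis-specified one
std_normal_cdf = np.vectorize(lambda t: 0.5 * (1 + erf(t / sqrt(2))))
shifted_cdf = np.vectorize(lambda t: 0.5 * (1 + erf((t - 2) / sqrt(2))))
data = np.random.default_rng(0).normal(size=1000)
assert cumulative_jsd(data, std_normal_cdf) < cumulative_jsd(data, shifted_cdf)
```

The comparison at the end illustrates the smoothing effect of working with CDFs: even for a modest sample, the correctly specified model is clearly separated from the mis-specified one.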

In the experiments we will make use of the bootstrap method, which is a technique for computing a confidence interval that relies on random resampling with replacement from a given sample data set. The bootstrap method is usually nonparametric, making no distributional assumptions about the data set employed.
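A minimal sketch of the percentile bootstrap (function names are our own) for an arbitrary statistic, which could equally be a JSD computation:

```python
import numpy as np

def bootstrap_ci(data, statistic, n_boot=1000, alpha=0.05, seed=0):
    # resample with replacement and take empirical percentiles of the statistic
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    stats = [statistic(rng.choice(data, size=len(data), replace=True))
             for _ in range(n_boot)]
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# e.g. a 95% interval for the mean of simulated Normal data
sample = np.random.default_rng(1).normal(loc=5.0, scale=1.0, size=200)
lb, ub = bootstrap_ci(sample, np.mean)
assert lb < ub and 4.5 < lb and ub < 5.5
```

The only distributional input is the sample itself, which is what makes the method nonparametric.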

## 3 The JSD as a Goodness-of-Fit Measure

Making use of the JSD as a measure of goodness-of-fit is quite straightforward. Assume that D is a sample data set giving rise to an empirical distribution, ^P, and that a parametric distribution, with parameters ϕ, is fitted with maximum likelihood or curve fitting to ^P.

The goodness-of-fit of the fitted distribution, with parameters ϕ, to the empirical distribution ^P is now defined as

JSD(P(D,ϕ), ^P), (8)

where P(D,ϕ) is a finite distribution over the sample D, distributed according to the fitted distribution with parameters ϕ. We note that (8) does not restrict P(D,ϕ) to be parametric, and so it is also possible to measure one empirical distribution against any other.

The Bayes factor is a method for model comparison, taking the ratio of the likelihood of the data under the alternative hypothesis to the likelihood of the data under the null hypothesis. In particular, the Bayes factor is advocated as an alternative to null hypothesis significance testing, as it depends only on the data and considers the models arising from both the null and alternative hypotheses.

The JSD factor is a reformulation of the Bayes factor in terms of the JSD, defined as

JSD(P(D₁,ϕ₁), ^P) / JSD(P(D₀,ϕ₀), ^P), (9)

which is the odds ratio of choosing the alternative hypothesis, H₁, in preference to the null hypothesis, H₀.

## 4 Experiments and Analysis

To assess the use of the JSD as a goodness-of-fit measure we provide experimental results with simulated data from various parametric distributions, including the Uniform, Normal, Log-normal, Exponential, Gamma, Beta, Weibull, Pareto and q-Gaussian distributions. Our methodology for the experiments with simulated data (see Subsection 4.1) was as follows:


1. First we generated a data set, say D, of size N from a given distribution, say G, with chosen parameters, say θ; the distribution of D was then taken to be the empirical distribution.

2. We then considered D to be distributed according to a hypothesised distribution, F, where F may not be the same as G, and used the maximum likelihood method to obtain the parameters of D, say ϕ, assuming its distribution was F. (Obviously, if F = G, then ϕ is expected to be very close to θ.)

3. Next, assuming that D was distributed according to F with parameters ϕ, we generated a second data set, D′, from the distribution F with parameters ϕ.

4. Finally, we evaluated the JSD as a measure of the goodness-of-fit of F, with parameters ϕ, to the empirical distribution, and computed a 95% confidence interval for the JSD from 1000 bootstrap resamples using the basic bootstrap percentile method [8, Section 5.3.1].
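To make the four steps concrete, here is a sketch in Python for a Normal given distribution with Normal and Uniform as the hypothesised distributions. For simplicity this sketch bins the data and uses the probability mass version of the JSD, whereas our experiments used the cumulative version; all function and variable names are our own:

```python
import numpy as np

rng = np.random.default_rng(42)

def sqrt_jsd(p, q):
    # square-root JSD of two probability vectors, as in equation (1)
    m = 0.5 * (p + q)
    h = lambda d: -np.sum(d[d > 0] * np.log(d[d > 0]))
    return np.sqrt(max(h(m) - 0.5 * h(p) - 0.5 * h(q), 0.0))

def binned(sample, edges):
    counts, _ = np.histogram(sample, bins=edges)
    return counts / counts.sum()

# Step 1: generate the "empirical" data set from a Normal distribution
data = rng.normal(loc=0.0, scale=1.0, size=5000)
edges = np.linspace(data.min(), data.max(), 40)

# Step 2: maximum likelihood fits under two hypothesised distributions
mu, sigma = data.mean(), data.std()   # Normal MLE
lo, hi = data.min(), data.max()       # Uniform MLE

# Step 3: generate second data sets from the fitted distributions
fit_normal = rng.normal(mu, sigma, size=5000)
fit_uniform = rng.uniform(lo, hi, size=5000)

# Step 4: JSD of each fitted model against the empirical distribution
p_hat = binned(data, edges)
jsd_normal = sqrt_jsd(p_hat, binned(fit_normal, edges))
jsd_uniform = sqrt_jsd(p_hat, binned(fit_uniform, edges))

# the true (Normal) model fits far better, so the JSD factor between
# the Normal and Uniform hypotheses is well below one
assert jsd_normal < jsd_uniform
jsd_factor = jsd_normal / jsd_uniform
```

The final ratio is the JSD factor of equation (9) for this pair of hypothesised distributions.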

For the experiments with empirical data sets we followed the same methodology, with the difference that the data set was an empirical data set rather than a generated one.

For each set of experiments we followed the methodology described above for several possible alternative parametric distributions, F, and then computed the JSD factor between the best and a lower performing distribution. As mentioned towards the end of Section 2, the cumulative version of the JSD was used in all the experiments.

The tables showing the results are given in the appendix at the end of the paper. For the Normal and Log-normal distributions, the first parameter is the mean and the second the standard deviation, while for the Gamma and Weibull distributions, the first parameter is the shape and the second the scale. On the other hand, for the Uniform distribution the first parameter is the lower bound and the second the upper bound, for the Beta distribution the first parameter is α and the second β, while for the q-Gaussian the first parameter is the shape (λ) and the second the scale (x₀), as in (10). The Exponential and Pareto distributions are characterised by a single parameter. In addition, the lower bound of the 95% bootstrap confidence interval is denoted by lb and the upper bound by ub.

### 4.1 Experiments with Simulated Data

We now provide commentary on the results for the simulated data, shown in Tables 4, 5, 6, 7, 8, 9 and 10, which are given in an appendix at the end of the paper. We note that all the computations described in this subsection were carried out using the Matlab software package. Table 2 summarises, for all experiments, the JSD factors between the best (highlighted in bold) and second best (highlighted in italics) performing distributions. In all cases, apart from experiment 6 for the Beta distribution, shown in Table 9, the JSD factor overwhelmingly supports the given distribution, as we would expect. The reason for the relatively low JSD factor in this particular case is the known fact that the Beta distribution can be approximated by the Normal distribution when its parameters α and β are large.

It is evident that the larger the size of the data set, the more accurate the JSD will be. In Table 1 we demonstrate how the accuracy of the JSD increases as the data set size increases, when both the given and hypothesised distributions are Normal; it is also noticeable that the maximum likelihood estimates of the parameters converge to the correct values as the size of the data increases. In Figure 1 we show that the decrease of the JSD follows a power law with an exponent of approximately 0.5, i.e. the JSD is proportional to 1/√N, where N represents the data size.

In the first experiment the given distribution was Normal, and the hypothesised distributions were Normal and Uniform; the JSD factor between the JSDs of the Normal and Uniform distributions can be derived from Table 4. In the second experiment the given distribution was Log-normal, and the hypothesised distributions were Normal, Uniform, Log-normal, Gamma and Weibull; the JSD factor between the JSDs of the Log-normal distribution, which is the smallest, and the Gamma distribution, whose JSD is the closest to it, can be derived from Table 5. In the third experiment the given distribution was Gamma, and the hypothesised distributions were Normal, Uniform, Log-normal, Weibull and Gamma; the JSD factor between the JSDs of the Gamma distribution, which is the smallest, and the Weibull distribution, whose JSD is the closest to it, can be derived from Table 6. In the fourth experiment the given distribution was again Gamma, with different parameters, and the hypothesised distributions were Normal, Uniform, Log-normal, Weibull and Gamma; the JSD factor between the JSDs of the Gamma distribution, which is the smallest, and the Log-normal distribution, whose JSD is the closest to it, can be derived from Table 7. In the fifth experiment the given distribution was Beta, and the hypothesised distributions were Normal, Log-normal, Gamma, Weibull and Beta; the JSD factor between the Beta distribution, which has the smallest JSD, and the Normal distribution, whose JSD is the closest to it, can be derived from Table 8. In the sixth experiment the given distribution was again Beta, with different parameters, and the same hypothesised distributions; the JSD factor between the Beta distribution, which has the smallest JSD, and the Normal distribution, whose JSD is the closest to it, can be derived from Table 9. In the seventh and final experiment the given distribution was once more Beta, with different parameters, and the same hypothesised distributions; the JSD factor between the Beta distribution, which has the smallest JSD, and the Normal distribution, whose JSD is the closest to it, can be derived from Table 10.

### 4.2 Analysis of Empirical Data

We now provide commentary on the results for the empirical data sets, shown in Tables 11, 12 and 13, which are given in an appendix at the end of the paper. We note that all the computations described in this subsection were carried out using Python. Table 3 summarises, for all three data sets, the JSD factors between the best (highlighted in bold) and a lower performing distribution.

The first empirical data set we consider contains detailed voting results of party vote shares in different polling stations during the Lithuanian parliamentary election of 1992 (the data was obtained from the Lithuanian parliamentary election data set of Kononovicius). Note that we consider only the top three parties and have renormalised the original data so that the total vote share of the top three parties sums to one in each polling station. This data set was first considered in earlier work by Kononovicius, where an agent-based model generating the Beta distribution, and reasonably well reproducing detailed election results, was proposed. There, a statistical comparison between the four distributions commonly used in sociophysics, the Normal, Log-normal, Beta and Weibull, was carried out using the Watanabe-Akaike information criterion (WAIC), which is a generalisation of the AIC. The comparison concluded that the Beta and Weibull distributions provide the best fits for the empirical data; however, their respective WAIC scores were within each other's confidence intervals, and therefore no final conclusion was drawn. Here we obtain a similar result: the Beta and Weibull distributions clearly have the overall best scores, but, as before, their confidence intervals overlap (see Table 11). As was previously noted, the Beta and Weibull distributions are similar when the observed mean is well within the unit interval and the observed variance is reasonably small. In the empirical analysis this similarity is further increased when the sample size is small. In addition, for the estimated parameter values, the Gamma and Weibull distributions behave similarly. Therefore, we report the JSD factor between the best performing distribution (highlighted in bold) and the next best distribution which is neither a Beta, Gamma nor Weibull distribution (see Table 3).

The second data set we consider contains the log-returns of two different exchange rates. We consider the BTC/JPY exchange rate on the bitFlyer exchange during the period between July 4, 2017 and July 4, 2018 (the data was obtained from Bitcoincharts.com), as well as the EUR/USD exchange rate during the period between June 1, 2000 and September 1, 2010 (the data was obtained from HistData.com). We consider the daily and one-minute log-returns. For this data set we use the moving block bootstrap with a block size of one day.
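The moving block bootstrap differs from the ordinary bootstrap in that it resamples overlapping blocks of consecutive observations rather than individual points, thereby preserving short-range temporal dependence in the series. A minimal sketch (the function name and block length are illustrative):

```python
import numpy as np

def moving_block_resample(series, block_len, rng):
    # draw random overlapping blocks and concatenate them to the original length
    series = np.asarray(series)
    n = len(series)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    return np.concatenate([series[s:s + block_len] for s in starts])[:n]

rng = np.random.default_rng(0)
returns = rng.normal(size=250)                  # stand-in for daily log-returns
resample = moving_block_resample(returns, block_len=5, rng=rng)
assert len(resample) == len(returns)
```

Repeating the resampling step and recomputing the JSD on each resample then yields a confidence interval exactly as in the ordinary bootstrap.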

In the econophysics literature it is commonly accepted that log-returns are power-law distributed. One of the commonly used fits for log-returns is the so-called q-Gaussian distribution, which we add to our analysis for this empirical data set. Here we use the following parametrization of the q-Gaussian distribution:

p_{λ,x₀}(x) = Γ(λ/2) / (√(π x₀²) Γ((λ−1)/2)) · (x₀² / (x₀² + x²))^{λ/2}, (10)

which is equivalent to Student's t-distribution.
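Equation (10) can be transcribed directly into code; the following is a small sketch (the function name is our own), with a numerical check that the density integrates to one for one illustrative choice of parameters:

```python
import numpy as np
from math import gamma, pi, sqrt

def q_gaussian_pdf(x, lam, x0):
    # density of equation (10) with shape lam (> 1) and scale x0
    c = gamma(lam / 2.0) / (sqrt(pi * x0 ** 2) * gamma((lam - 1) / 2.0))
    return c * (x0 ** 2 / (x0 ** 2 + np.asarray(x, dtype=float) ** 2)) ** (lam / 2.0)

# for lam = 3 and x0 = 1 the density reduces to 0.5 * (1 + x^2)^(-3/2),
# which integrates to one; check this numerically on a wide grid
xs = np.linspace(-200.0, 200.0, 400001)
ys = q_gaussian_pdf(xs, lam=3.0, x0=1.0)
area = np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs))  # trapezoidal rule
assert abs(area - 1.0) < 1e-3
```

The heavy power-law tails mean the numerical integral converges slowly, which is why a wide grid is needed even for this check.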

However, as can be seen in Table 12, we find that the Gamma and Weibull distributions noticeably outperform the q-Gaussian distribution. The performance of the Gamma and Weibull distributions is similar, because for the estimated parameter values both of these distributions behave reasonably similarly; for these parameter values they are reasonably close to the Exponential distribution. Therefore, we report the JSD factor between the best performing distribution (highlighted in bold) and the next best distribution which is neither a Gamma nor a Weibull distribution (see Table 3).

For our fourth sample in the second empirical data set, i.e. the EUR/USD one-minute log-returns, the Log-normal and q-Gaussian distributions unexpectedly had the best performance. Though they are far from similar for the considered range of observable values and parameter values, they most likely attained similar scores due to the shape of the empirical distribution: the Log-normal distribution seems to represent the smaller log-returns well, while the q-Gaussian is better at describing the tail events.

The third data set we consider is the European soccer data set, which contains thousands of matches played in European national championships throughout 2008–2016. From this data set we have extracted five random teams and computed the inter-goal times for each team. We have treated goals scored during extra time as scored on the 45th minute (if scored during the first half) or the 90th minute (if scored during the second half). In this analysis we have added the Exponential and Pareto distributions. For the estimated parameter values the Gamma and Weibull distributions behave similarly to the Exponential distribution. Note that the shape parameter values of the Gamma and Weibull distributions are very close to one and the respective scale parameter values are similar; in this case it is known that the Gamma and Weibull distributions are equivalent to the Exponential distribution with the appropriate scale parameter value. We therefore report the JSD factor between the best performing distribution (highlighted in bold) and the next best distribution which is neither an Exponential, Gamma nor Weibull distribution (see Table 3). We observe that for the ELC sample the obtained JSD factor is the lowest and the JSD score is the largest. This is most likely due to this team having played opponents with a larger variety of skill. In particular, it played in both the top and second tiers of the national championship during the considered time period, resulting in a goal scoring rate with higher variation.

## 5 Concluding remarks

We have proposed the Jensen-Shannon divergence (JSD) as a goodness-of-fit measure for data fitted with maximum likelihood estimation or curve fitting. Our experiments with simulated and empirical data in Section 4, for a variety of parametric distributions, show that for simulated data the method is unequivocal in its preference for the true distribution (see Subsection 4.1), and that for empirical data the method is effective in selecting the more likely distributions from a selection of hypothesised distributions (see Subsection 4.2).

As we have shown in Section 2, the JSD has a precise information-theoretic meaning, and the JSD factor has an intuitive meaning in terms of an odds ratio, in analogy to the Bayes factor. Moreover, the implementation of the JSD as a measure of goodness-of-fit or for model comparison is relatively straightforward; see the pyjsd package for a Python implementation of the JSD.

Ultimately, more experience with empirical data sets is needed for a definitive assessment of how the JSD performs in practice.

## References

•  R. Anderson-Sprecher. Model comparisons and R². The American Statistician, 48:113–117, 1994.
•  Bitcoincharts.com.
•  J. Borges and M. Levene. Testing the predictive power of variable history web usage. Soft Computing, 11:717–727, 2006.
•  K.P. Burnham and D.R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-Verlag, New York, second edition, 2002.
•  A.C. Cameron and F.A.G. Windmeijer. An R-squared measure of goodness of fit for some common nonlinear regression models. Journal of Econometrics, 77:329–342, 1997.
•  R. Cont. Empirical properties of asset returns: Stylized facts and statistical issues. Quantitative Finance, 1:1–14, 2001.
•  T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. John Wiley & Sons, Hoboken, New Jersey, second edition, 2006.
•  A.C. Davison and D.V. Hinkley. Bootstrap Methods and their Applications. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK, 1997.
•  D. Endres and J. Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 49:1858–1860, 2003.
•  T. Fenner, M. Levene, and G. Loizou. A multiplicative process for generating the rank-order distribution of UK election results. Quality & Quantity, 52:1069–1079, 2018.
•  J. Fox. Applied Regression Analysis and Generalized Linear Models. Sage Publications, Thousand Oaks, CA, third edition, 2016.
•  R.J. Freund, W.J. Wilson, and P. Sa. Regression Analysis: Statistical Modeling of a Response Variable. Academic Press, San Diego, CA, second edition, 2006.
•  J.F. Gómez-Lopera, J. Martínez-Aroza, A.M. Robles-Pérez, and R. Román-Roldán. An analysis of edge detection by using the Jensen-Shannon divergence. Journal of Mathematical Imaging and Vision, 13:35–56, 2000.
•  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of Advances in Neural Information Processing Systems (NIPS) 27, pages 2672–2680, Montreal, 2014.
•  I. Grosse, P. Bernaola-Galván, P. Carpena, R. Román-Roldán, J. Oliver, and H.E. Stanley. Analysis of symbolic sequences using the Jensen-Shannon divergence. Physical Review E, 65:041905–1 – 041905–16, 2002.
•  J. Heinrich. Pitfalls of goodness-of-fit from likelihood. In Proceedings of the Conference on Statisical Problems in Particle Physics, Astrophysics and Cosmology (PHYSTAT2003), pages 52–55, Stanford, Ca., 2003.
•  HistData.com.
•  A.F. Jarosz and J. Wiley. What are the odds? A practical guide to computing and reporting Bayes factors. Journal of Problem Solving, 7:Article 2, 8pp, 2014.
•  N.L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions, Volume 1, chapter 18 Chi-square distributions including Chi and Rayleigh, pages 415–493. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York, NY, second edition, 1994.
•  R.E. Kass and A.E. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773–795, 1995.
•  A. Kononovicius. Lithuanian parliamentary election data.
•  A. Kononovicius. Empirical analysis and agent-based modeling of the Lithuanian parliamentary elections. Complexity, Article ID 7354642:15 pages, 2017.
•  A. Kononovicius. Modeling of the parties’ vote share distributions. Acta Physica Polonica A, 133:1450–1458, 2018.
•  A. Kononovicius and M. Levene. PyJSD: Python implementation of the Jensen-Shannon divergence.
•  J.-P. Kreiss and S.N. Lahiri. Bootstrap methods for time series. In T.S. Rao, S.S. Rao, and C.R. Rao, editors, Handbook of Statistics Volume 30, Time Series Analysis: Methods and Applications, pages 3–26. North-Holland, Oxford, 2012.
•  K. Krishnamoorthy. Handbook of Statistical Distributions with Applications. CRC Press, Boca Raton, FL, second edition, 2015.
•  F. Lewis, A. Butler, and L. Gilbert. A unified approach to model selection using the likelihood ratio test. Methods in Ecology and Evolution, 2:155–162, 2011.
•  J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37:145–151, 1991.
•  A.P. Majtey, P.W. Lamberti, and D.P. Prato. Jensen-Shannon divergence as a measure of distinguishability between mixed quantum states. Physical Review A, 72:052310–1–052310–6, 2005.
•  H. Mathien. European soccer dataset.
•  A. Mehria, M. Jamaati, and H. Mehri. Word ranking in a single document by Jensen-Shannon divergence. Physics Letters A, 379:1627–1632, 2015.
•  H. Motulsky. Intuitive Biostatistics. Oxford University Press, Oxford, 1995.
•  I.J. Myung. Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology, 47:90–100, 2003.
•  S. Nadarajah and S. Kotz. On the q-type distributions. Physica A, 377:465–468, 2007.
•  H.-V. Nguyen and J. Vreeken. Non-parametric Jensen-Shannon divergence. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 173–189, Porto, 2015.
•  D.B. Peizer and J.W. Pratt. A normal approximation for binomial, F, beta, and other common, related tail probabilities, I. Journal of the American Statistical Association, 63:1416–1456, 1968.
•  G.E. Sims, S.-R. Jun, G.A. Wu, and S.-H. Kim. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America, 106:2677–2682, 2009.
•  A.-N. Spiess and N. Neumeyer. An evaluation of R² as an inadequate measure for nonlinear models in pharmacological and biochemical research: a Monte Carlo approach. BMC Pharmacology, 10, 2010. 11 pages.
•  C. Tsallis. Economics and finance: q-statistical stylized features galore. Entropy, 19(9):457, 2017.
•  V. Voinov, M. Nikulin, and N. Balakrishnan. Chi-Squared Goodness of Fit Tests with Applications. Academic Press, London, UK, 2013.
•  S.I. Vrieze. Model selection and psychological theory: A discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychological Methods, 17:228–243, 2012.
•  Q.H. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57:307–333, 1989.
•  S. Watanabe. A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14:867–897, 2013.
•  E. Wilson and M. Hilferty. The distribution of chi-square. Proceedings of the National Academy of Sciences of the United States of America, 17:684–688, 1931.
•  G. Yari and A. Saghafi. Unbiased Weibull modulus estimation using differential cumulative entropy. Communications in Statistics – Simulation and Computation, 41:1372–1378, 2012.