Simple models for macro-parasite distributions in hosts

02/23/2022
by   Gonzalo Maximiliano Lopez, et al.

The negative binomial distribution is the distribution most often used to model macro-parasite burden in hosts. However, reliable maximum likelihood parameter estimation from data is far from trivial: no closed formula is available, and numerical estimation requires sophisticated methods. Using data from the literature, we show that simple alternatives to the negative binomial, such as the zero-inflated geometric or hurdle geometric distributions, fit the data as well as, and in some cases better than, the negative binomial distribution. We derive simple closed formulas for the maximum likelihood parameter estimates, which constitutes a significant advantage of these distributions over the negative binomial distribution.


1 Introduction

Macroparasites usually present over-dispersed distributions, in which a few hosts account for most of the parasites in the population (see, for example, [4][14]).

The most widely used distribution is the negative binomial distribution [2][15], which provides an accurate description of the observations.

However, in many cases the negative binomial distribution (or other similar distributions) cannot account for the "excess" of zeros observed. A simple and widely used solution is to consider zero-inflated distributions [9][6][7].

The negative binomial distribution is a two-parameter distribution. The parameters are usually the mean parasite burden in the host population, $m$, and the inverse dispersion parameter, $k$, which is related to the degree of over-dispersion [2]. Moreover, it can be shown that the limiting distribution of the NB($m$, $k$) distribution, as $k \to \infty$, is a Poisson($m$) distribution, and if $k = 1$ the distribution is a geometric distribution with success probability $1/(1+m)$.

However, one problem with the negative binomial distribution is parameter estimation from the observations. The method of moments is simple but not always precise [3]. Maximum likelihood estimation (MLE) provides one of the best parameter estimates [11], but for the negative binomial distribution there is no closed formula for the parameter estimates in terms of the observations. They must be obtained numerically, which presents some complexities compared with numerical parameter estimation for other distributions [5][1].

In this article we show alternatives to the negative binomial distribution which describe the observations equally well (and in some cases provide a more precise description) but for which we can derive a closed formula for the maximum likelihood parameter estimates.

2 Zero-inflated(deflated) distributions and hurdle distributions

Macroparasites usually show over-dispersed distributions, in which a few individuals harbour most of the parasites in the host population. These distributions are said to follow the 20-80 rule: 20% of the individuals account for 80% of the parasite burden [18]. The negative binomial distribution, among others, produces such over-dispersion, but in some cases an excess of zeros is observed that these distributions fail to account for. A simple and widely used solution is to consider zero-inflated (in some cases zero-deflated) distributions and hurdle distributions [16][10].

2.1 Zero-inflated(deflated) distributions

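A discrete random variable $X$ follows a zero-inflated(deflated) distribution if its probability mass function is given by

$$P(X = x) = \begin{cases} \pi + (1-\pi)\, f(0;\theta), & x = 0, \\ (1-\pi)\, f(x;\theta), & x = 1, 2, \dots \end{cases} \tag{1}$$

where we denote by $\theta$ the vector of parameters of the associated distribution $f(x;\theta)$, and then

$$-\frac{f(0;\theta)}{1 - f(0;\theta)} \le \pi \le 1 .$$

When $\pi < 0$ we have a zero-deflated distribution.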

If $G(s)$ is the probability generating function of the associated distribution $f$, then the probability generating function of the corresponding zero-inflated(deflated) distribution is given by

$$G_{ZI}(s) = \pi + (1-\pi)\, G(s). \tag{2}$$

From the probability generating function we may obtain the mean and the variance of the distribution straightforwardly,

$$E(X) = (1-\pi)\,\mu, \qquad \mathrm{Var}(X) = (1-\pi)\left(\sigma^2 + \pi \mu^2\right), \tag{3}$$

where $\mu$ and $\sigma^2$ are the mean and variance of the distribution $f$. The coefficient of dispersion, or variance-to-mean ratio, is therefore

$$\frac{\mathrm{Var}(X)}{E(X)} = D + \pi \mu, \tag{4}$$

where $D = \sigma^2/\mu$ is the variance-to-mean ratio for the associated distribution $f$. As expected, zero-inflated distributions ($\pi > 0$) are more over-dispersed than the associated distribution.
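The moment formulas (3) and (4) can be checked numerically. The following is a minimal sketch, assuming the zero-inflated(deflated) probability mass function takes the standard form $P(X=0) = \pi + (1-\pi) f(0)$ and $P(X=x) = (1-\pi) f(x)$ for $x \ge 1$, and using a geometric base distribution purely for illustration. The function names, the parameter values and the truncation point are choices made here, not part of the original work.

```python
def geometric_pmf(x, p):
    """Base pmf f(x) = p * (1 - p)**x on x = 0, 1, 2, ... (illustrative choice)."""
    return p * (1.0 - p) ** x


def zero_inflated_pmf(x, pi, base_pmf):
    """Assumed zero-inflated(deflated) form: extra mass pi is placed at zero."""
    if x == 0:
        return pi + (1.0 - pi) * base_pmf(0)
    return (1.0 - pi) * base_pmf(x)


def mean_and_variance(pmf, upper=5000):
    """Moments by direct summation, truncating the support at `upper`."""
    mean = sum(x * pmf(x) for x in range(upper))
    second = sum(x * x * pmf(x) for x in range(upper))
    return mean, second - mean ** 2


p, pi = 0.3, 0.2                             # hypothetical parameter values
mu, sigma2 = (1 - p) / p, (1 - p) / p ** 2   # geometric mean and variance

mean, var = mean_and_variance(
    lambda x: zero_inflated_pmf(x, pi, lambda y: geometric_pmf(y, p))
)

# Numerical moments next to (1 - pi) * mu and (1 - pi) * (sigma2 + pi * mu**2).
print(mean, (1 - pi) * mu)
print(var, (1 - pi) * (sigma2 + pi * mu ** 2))
# Variance-to-mean ratio next to D + pi * mu, with D = sigma2 / mu.
print(var / mean, sigma2 / mu + pi * mu)
```

Each printed pair should agree up to floating-point and truncation error. Setting `pi` to a small negative value gives the zero-deflated case.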

2.2 Hurdle distributions.

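Another common way to account for an excess of zeros is to use hurdle distributions,

$$P(X = x) = \begin{cases} \pi, & x = 0, \\ (1-\pi)\, \dfrac{f(x;\theta)}{1 - f(0;\theta)}, & x = 1, 2, \dots \end{cases} \tag{5}$$

If $G(s)$ is the probability generating function for the associated distribution $f$, then the probability generating function for the hurdle distribution is

$$G_{H}(s) = \pi + (1-\pi)\, \frac{G(s) - f_0}{1 - f_0}, \qquad f_0 = f(0;\theta), \tag{6}$$

from which we obtain the mean and the variance of the hurdle distribution as

$$E(X) = \frac{1-\pi}{1-f_0}\,\mu, \qquad \mathrm{Var}(X) = \frac{1-\pi}{1-f_0}\left(\sigma^2 + \frac{\pi - f_0}{1 - f_0}\,\mu^2\right), \tag{7}$$

where $\mu$ and $\sigma^2$ are the mean and variance of the associated distribution $f$. Finally, the variance-to-mean ratio for the hurdle distribution is given by

$$\frac{\mathrm{Var}(X)}{E(X)} = D + \frac{\pi - f_0}{1 - f_0}\,\mu, \qquad D = \frac{\sigma^2}{\mu}. \tag{8}$$

If $\pi > f_0$, the hurdle distribution is more over-dispersed than the associated distribution.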
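The role of the zero probability in the dispersion of a hurdle distribution can be illustrated numerically. This is a sketch under stated assumptions: the hurdle form is taken to be $P(X=0)=\pi$ with the positive counts following the zero-truncated base distribution, the base is a Poisson distribution (so its variance-to-mean ratio is 1), and all names and values are hypothetical.

```python
from math import exp, lgamma, log


def poisson_pmf(x, lam):
    """Base pmf, computed on the log scale to avoid huge factorials."""
    return exp(-lam + x * log(lam) - lgamma(x + 1))


def hurdle_pmf(x, pi, base_pmf):
    """Assumed hurdle form: P(0) = pi, positive counts follow the zero-truncated base."""
    if x == 0:
        return pi
    return (1.0 - pi) * base_pmf(x) / (1.0 - base_pmf(0))


def variance_to_mean(pmf, upper=200):
    """Variance-to-mean ratio by direct summation over 0 .. upper - 1."""
    mean = sum(x * pmf(x) for x in range(upper))
    second = sum(x * x * pmf(x) for x in range(upper))
    return (second - mean ** 2) / mean


lam = 2.0                  # hypothetical Poisson mean, so D = 1 and mu = lam
f0 = poisson_pmf(0, lam)   # probability of zero under the base, exp(-2)

for pi in (0.05, f0, 0.4):
    ratio = variance_to_mean(
        lambda x: hurdle_pmf(x, pi, lambda y: poisson_pmf(y, lam))
    )
    predicted = 1.0 + (pi - f0) / (1.0 - f0) * lam
    print(f"pi={pi:.3f}  numeric={ratio:.6f}  formula={predicted:.6f}")
```

With `pi` below `f0` the ratio drops below that of the base distribution, with `pi` equal to `f0` the base distribution is recovered, and with `pi` above `f0` the hurdle distribution is more over-dispersed.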

3 Parameter estimation and Maximum likelihood

A simple, but in general inaccurate, way to fit the parameters of a distribution from a sample consists of using the sample moments. For example, the negative binomial distribution has two parameters, which can be expressed in terms of the first two moments. While this method is quite simple, it does not provide a reliable fit [3][11]. Usually, the method of choice is maximum likelihood [11]. However, maximum likelihood estimation (MLE) does not always produce a closed formula, and the parameters then need to be estimated numerically [2][5][1]. In the following we pose the problem of maximum likelihood parameter estimation for zero-inflated and hurdle distributions.

3.1 Maximum likelihood estimation for zero-inflated(deflated) and hurdle distributions

The problem of parameter estimation by maximum likelihood for zero-inflated and hurdle distributions is presented in the following.

3.1.1 Zero-inflated(deflated) distribution

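We denote by $\theta = (\theta_1, \dots, \theta_r)$ the vector of parameters of the associated distribution $f$, and therefore the set of parameters for the zero-inflated(deflated) distribution is $(\pi, \theta)$. If $x_1, \dots, x_n$ are observations and $n_0$ is the number of observations with zero counts, then the log-likelihood function is given by

$$\ell(\pi, \theta) = n_0 \ln\left[\pi + (1-\pi) f_0\right] + (n - n_0)\ln(1-\pi) + \sum_{x_i > 0} \ln f(x_i;\theta). \tag{9}$$

Maximizing with respect to $(\pi, \theta)$ we obtain the following system,

$$\frac{n_0\,(1 - f_0)}{\pi + (1-\pi) f_0} - \frac{n - n_0}{1 - \pi} = 0, \qquad \frac{n_0\,(1-\pi)}{\pi + (1-\pi) f_0}\,\frac{\partial f_0}{\partial \theta_j} + \sum_{x_i > 0} \frac{\partial \ln f(x_i;\theta)}{\partial \theta_j} = 0, \quad j = 1, \dots, r, \tag{10}$$

where we denote $f_0 = f(0;\theta)$.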

3.1.2 Hurdle distributions

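The log-likelihood function for a hurdle distribution is given by

$$\ell(\pi, \theta) = n_0 \ln \pi + (n - n_0)\ln(1-\pi) - (n - n_0)\ln(1 - f_0) + \sum_{x_i > 0} \ln f(x_i;\theta), \tag{11}$$

with $f_0 = f(0;\theta)$, and therefore the parameter values are obtained from the system

$$\frac{n_0}{\pi} - \frac{n - n_0}{1 - \pi} = 0, \qquad \frac{n - n_0}{1 - f_0}\,\frac{\partial f_0}{\partial \theta_j} + \sum_{x_i > 0} \frac{\partial \ln f(x_i;\theta)}{\partial \theta_j} = 0, \quad j = 1, \dots, r. \tag{12}$$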

4 Maximum likelihood for the negative binomial distribution and two simple alternatives.

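The probability mass function for the negative binomial distribution is given by

$$P(X = x) = \frac{\Gamma(k+x)}{\Gamma(k)\,x!}\left(\frac{m}{m+k}\right)^{x}\left(\frac{k}{m+k}\right)^{k}, \qquad x = 0, 1, 2, \dots \tag{13}$$

where $m > 0$ and $k > 0$. Differentiating the log-likelihood function with respect to each parameter and setting the partial derivatives equal to zero yields the following system of likelihood equations

$$\sum_{i=1}^{n}\left[\frac{x_i}{m} - \frac{x_i + k}{m + k}\right] = 0, \qquad \sum_{i=1}^{n}\left[\psi(x_i + k) - \psi(k) + \ln\frac{k}{m+k} + \frac{m - x_i}{m + k}\right] = 0, \tag{14}$$

where $\psi$ is the digamma function. Substituting in the second equation the value $\hat m = \bar x$ obtained from the first equation gives:

$$\sum_{i=1}^{n}\psi(x_i + k) - n\,\psi(k) + n \ln\frac{k}{\bar x + k} = 0. \tag{15}$$

This equation cannot be solved for $k$ in closed form and must be solved numerically. Iterative techniques such as the Newton-Raphson method can be used, but they may fail to find the MLE value. The literature indicates that finding the MLE value is a challenge, since equation (15) may have no root or more than one root [3][11][5][17][13].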
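To make the numerical step concrete, the following sketch solves the likelihood equation for $k$ by plain bisection. It is not the procedure used in the cited works: bisection is swapped in here because it needs no derivative, and the digamma differences are replaced by the identity $\psi(x+k) - \psi(k) = \sum_{j=0}^{x-1} 1/(k+j)$, valid for integer $x \ge 0$. The function names, the bracket and the sample counts are assumptions made for illustration.

```python
from math import log1p


def nb_profile_score(k, counts):
    """Left-hand side of the likelihood equation for k, with m set to the sample mean."""
    n = len(counts)
    mean = sum(counts) / n
    # For integer x >= 0: psi(x + k) - psi(k) = sum_{j=0}^{x-1} 1 / (k + j).
    digamma_sum = sum(1.0 / (k + j) for x in counts for j in range(x))
    # n * ln(k / (mean + k)) equals -n * ln(1 + mean / k).
    return digamma_sum - n * log1p(mean / k)


def nb_mle(counts, lo=1e-6, hi=1e4, iterations=200):
    """Return (m_hat, k_hat) using bisection on the bracket [lo, hi]."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((x - mean) ** 2 for x in counts) / n
    if variance <= mean:
        raise ValueError("sample is not over-dispersed; no finite root for k is expected")
    if nb_profile_score(lo, counts) <= 0 or nb_profile_score(hi, counts) >= 0:
        raise ValueError("no sign change on the bracket; widen [lo, hi]")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if nb_profile_score(mid, counts) > 0:
            lo = mid   # the located root lies to the right of mid
        else:
            hi = mid   # the located root lies to the left of mid
    return mean, 0.5 * (lo + hi)


counts = [0, 0, 0, 0, 1, 1, 2, 3, 5, 8]   # hypothetical parasite counts
m_hat, k_hat = nb_mle(counts)
print(m_hat, k_hat)
```

Even this simple approach needs a bracket, an over-dispersion check and a failure path, and bisection only locates one root inside the bracket.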

4.1 Zero-inflated geometric distribution

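The geometric distribution is a special case of the negative binomial distribution for $k = 1$, and its probability mass function is given by

$$f(x; p) = p\,(1-p)^{x}, \tag{16}$$

where $x = 0, 1, 2, \dots$ and $0 < p < 1$. The mean is $\mu = (1-p)/p$ while the variance is $\sigma^2 = (1-p)/p^2$.

From (1) the corresponding zero-inflated(deflated) distribution is

$$P(X = x) = \begin{cases} \pi + (1-\pi)\,p, & x = 0, \\ (1-\pi)\,p\,(1-p)^{x}, & x = 1, 2, \dots \end{cases} \tag{17}$$

The mean and variance are given by (3),

$$E(X) = \frac{(1-\pi)(1-p)}{p}, \qquad \mathrm{Var}(X) = \frac{(1-\pi)(1-p)\left[1 + \pi(1-p)\right]}{p^{2}}. \tag{18}$$

The variance-to-mean ratio for the zero-inflated(deflated) geometric distribution is $[1 + \pi(1-p)]/p$, which is greater than one whenever $\pi > -1$. In particular, every zero-inflated geometric distribution ($\pi \ge 0$) is over-dispersed.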

4.1.1 Maximum likelihood estimation

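According to (10), the system of likelihood equations is given by

$$\frac{n_0\,(1-p)}{\pi + (1-\pi)\,p} - \frac{n - n_0}{1 - \pi} = 0, \qquad \frac{n_0\,(1-\pi)}{\pi + (1-\pi)\,p} + \frac{n - n_0}{p} - \frac{n \bar x}{1 - p} = 0. \tag{19}$$

Therefore, the maximum likelihood estimates are

$$\hat p = \frac{n - n_0}{n \bar x}, \qquad \hat \pi = \frac{n_0/n - \hat p}{1 - \hat p} = \frac{n_0\,(1 + \bar x) - n}{n\,(\bar x - 1) + n_0}, \tag{20}$$

which are expressed in terms of the observed sample mean $\bar x$, the sample size $n$ and the number of zeros in the sample $n_0$.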
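Because the estimates depend only on the sample size, the number of zeros and the sample mean, they can be computed in a few lines. The sketch below assumes the parameterization $P(X=0) = \pi + (1-\pi)p$, $P(X=x) = (1-\pi)\,p\,(1-p)^x$ for $x \ge 1$, under which the estimates are $\hat p = (n - n_0)/(n \bar x)$ and $\hat\pi = (n_0/n - \hat p)/(1 - \hat p)$. The function name, the guard clauses and the counts are hypothetical.

```python
def zig_mle(counts):
    """Closed-form ML estimates (pi_hat, p_hat) for the zero-inflated geometric model."""
    n = len(counts)
    n0 = sum(1 for x in counts if x == 0)   # number of zero counts
    mean = sum(counts) / n
    if mean == 0:
        raise ValueError("all counts are zero; p is not identifiable")
    p_hat = (n - n0) / (n * mean)
    if p_hat >= 1.0:
        raise ValueError("every positive count equals 1; estimate lies on the boundary p = 1")
    pi_hat = (n0 / n - p_hat) / (1.0 - p_hat)
    return pi_hat, p_hat


counts = [0, 0, 0, 0, 1, 1, 2, 3, 5, 8]   # hypothetical parasite counts
pi_hat, p_hat = zig_mle(counts)
print(pi_hat, p_hat)   # p_hat = 0.3 and pi_hat is about 0.1429 for these counts

# The fitted model reproduces the observed zero fraction and the sample mean.
print(pi_hat + (1 - pi_hat) * p_hat, 4 / 10)
print((1 - pi_hat) * (1 - p_hat) / p_hat, sum(counts) / len(counts))
```

A negative `pi_hat` is a legitimate result here and corresponds to the zero-deflated case.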

4.2 Hurdle geometric distribution

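The hurdle distribution for the associated geometric distribution is given by

$$P(X = x) = \begin{cases} \pi, & x = 0, \\ (1-\pi)\,p\,(1-p)^{x-1}, & x = 1, 2, \dots \end{cases} \tag{21}$$

The mean and variance are obtained in a straightforward way,

$$E(X) = \frac{1-\pi}{p}, \qquad \mathrm{Var}(X) = \frac{(1-\pi)(1 - p + \pi)}{p^{2}}. \tag{22}$$

The variance-to-mean ratio for the hurdle geometric distribution is $(1 - p + \pi)/p$, which is greater than one, so that the distribution is over-dispersed, whenever $\pi > 2p - 1$.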

4.2.1 Maximum likelihood estimation

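According to (12) the system of likelihood equations is given by

$$\frac{n_0}{\pi} - \frac{n - n_0}{1 - \pi} = 0, \qquad \frac{n - n_0}{p} - \frac{n \bar x - (n - n_0)}{1 - p} = 0. \tag{23}$$

Therefore, the maximum likelihood estimates are

$$\hat \pi = \frac{n_0}{n}, \qquad \hat p = \frac{n - n_0}{n \bar x}, \tag{24}$$

which are expressed in terms of the observed sample mean $\bar x$, the sample size $n$ and the number of zeros in the sample $n_0$.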

Note that the pair $(\hat\pi, \hat p)$ given by (20) is a maximum of the log-likelihood function (9). By the change of variables $\pi_{hg} = \pi_{zig} + (1 - \pi_{zig})\,p$, with $p$ unchanged, the probability mass functions of the zero-inflated geometric (zig) and hurdle geometric (hg) models coincide, and the corresponding values of Akaike's information criterion are the same. Consequently, both models give the same fit to the data. In what follows we carry out the analysis only for the zero-inflated geometric model.
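The equivalence of the two models can be verified directly. The sketch below assumes the hurdle geometric form $P(X=0)=\pi_{hg}$, $P(X=x) = (1-\pi_{hg})\,p\,(1-p)^{x-1}$ for $x \ge 1$, the zero-inflated geometric form with base $p(1-p)^x$, and the mapping $\pi_{hg} = \pi_{zig} + (1-\pi_{zig})\,p$. All function names and the counts are hypothetical.

```python
def zig_pmf(x, pi, p):
    """Zero-inflated(deflated) geometric pmf (assumed parameterization)."""
    if x == 0:
        return pi + (1 - pi) * p
    return (1 - pi) * p * (1 - p) ** x


def hurdle_geometric_pmf(x, pi_h, p):
    """Hurdle geometric pmf: mass pi_h at zero, shifted geometric on 1, 2, ..."""
    if x == 0:
        return pi_h
    return (1 - pi_h) * p * (1 - p) ** (x - 1)


def hurdle_geometric_mle(counts):
    """Closed-form ML estimates: zero fraction and (n - n0) / sum of counts."""
    n = len(counts)
    n0 = sum(1 for x in counts if x == 0)
    total = sum(counts)
    if total == 0:
        raise ValueError("all counts are zero; p is not identifiable")
    return n0 / n, (n - n0) / total


def hurdle_to_zig(pi_h, p):
    """Invert pi_h = pi + (1 - pi) * p to recover the zero-inflation parameter."""
    return (pi_h - p) / (1 - p), p


counts = [0, 0, 0, 0, 1, 1, 2, 3, 5, 8]   # hypothetical parasite counts
pi_h, p_hat = hurdle_geometric_mle(counts)
pi_z, _ = hurdle_to_zig(pi_h, p_hat)

# Largest pointwise difference between the two fitted pmfs on 0..50.
gap = max(
    abs(zig_pmf(x, pi_z, p_hat) - hurdle_geometric_pmf(x, pi_h, p_hat))
    for x in range(51)
)
print(pi_h, p_hat, pi_z, gap)
```

The printed gap should be at the level of floating-point rounding, which is why the two models share the same likelihood and AIC.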

5 Some examples

5.1 Study of frequency distribution of Ascaris lumbricoides infection

The nematode Ascaris lumbricoides is one of the most common intestinal parasites of humans. It is highly prevalent in tropical and temperate populations where poverty and lack of sanitation are common [12].

The burden of infestation is computed from the number of parasites in each host of the sample [14]. At present this type of study is no longer conducted, so we use the results of Seo et al. [14], who studied six rural populations in Korea where an endemic situation was observed.

The samples showed over-dispersed distributions which could be accurately fitted by a negative binomial distribution (see Figure 1).

Figure 1: Fitting the parasite count data (black) by NB (red) and ZIG (blue) distributions for the Seo data set [14]. Except in the first case, the simple zero-inflated geometric distribution fits the data as well as the negative binomial distribution (see Table 1).

In Figure 1 we show the observed values (black) and the expected values of the fitted models (negative binomial and zero-inflated geometric). In addition, Table 1 includes the maximum likelihood estimates, the chi-squared statistics and their corresponding p-values, and Akaike's information criterion (AIC). As we see, the zero-inflated geometric distribution fits the data as well as the negative binomial distribution in most cases. The AIC results show that the negative binomial and zero-inflated geometric models are similar. Indeed, in samples E and F the fit in Figure 1 improves on the results obtained with the negative binomial distribution. According to the AIC, the zero-inflated geometric model showed the best performance in samples C, E and F. Hence, the ZIG($\pi$, $p$) distribution is a suitable candidate model to fit such data.
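For readers who wish to reproduce this kind of comparison on their own data, the sketch below evaluates the zero-inflated geometric log-likelihood and the usual definition $\mathrm{AIC} = 2q - 2\ln L$, where $q$ is the number of fitted parameters ($q = 2$ for both NB and ZIG). The Seo data are not reproduced here, so the counts are hypothetical, and the parameterization $P(X=0) = \pi + (1-\pi)p$, $P(X=x) = (1-\pi)\,p\,(1-p)^x$ is assumed.

```python
from math import log


def zig_log_likelihood(counts, pi, p):
    """Log-likelihood of the zero-inflated geometric model (assumed parameterization)."""
    n = len(counts)
    n0 = sum(1 for x in counts if x == 0)
    total = sum(counts)
    # Skip the zero term when there are no zeros, to avoid evaluating log(0).
    zero_term = n0 * log(pi + (1 - pi) * p) if n0 else 0.0
    positive_term = (n - n0) * (log(1 - pi) + log(p)) + total * log(1 - p)
    return zero_term + positive_term


def aic(log_likelihood, n_params):
    """Akaike's information criterion; lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


counts = [0, 0, 0, 0, 1, 1, 2, 3, 5, 8]   # hypothetical parasite counts
n = len(counts)
n0 = counts.count(0)

# Closed-form estimates, written out so that the example is self-contained.
p_hat = (n - n0) / sum(counts)
pi_hat = (n0 / n - p_hat) / (1 - p_hat)

best = zig_log_likelihood(counts, pi_hat, p_hat)
print("log-likelihood:", best)
print("AIC:", aic(best, n_params=2))
```

Computing the same quantity for a competing model fitted to the same counts gives the AIC comparison reported in the tables.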

5.2 Parasite distribution in crabs

Crofton [4] contributed significantly to the study of parasite distributions in hosts. In his work he observed that over-dispersion is one of the main characteristics of parasite distributions in hosts. Analyzing data from Hynes and Nicholas [8] on the infestation of the crustacean Gammarus pulex by the acanthocephalan parasite Polymorphus minutus, Crofton showed that the negative binomial distribution provides a good fit to the data.

In Figure 2 we compare the fit to the data obtained by Crofton using the negative binomial distribution and the zero-inflated(deflated) geometric distribution. In most cases our simple proposal provides a better fit than the negative binomial distribution.

Based on the AIC and the chi-square goodness-of-fit test reported in Table 2 we conclude that the zero-inflated(deflated) geometric distribution provides a fit of the data as good as the negative binomial distribution.

Figure 2: Fitting the parasite count data (black) by NB (red) and ZIG (blue) distributions for the data on parasites in crabs [4]. The zero-inflated(deflated) geometric distribution fits the data as well as or better than the negative binomial distribution (see Table 2).

6 Discussion and Conclusions

The negative binomial distribution is widely used to describe parasite burden in populations. It provides good fits to the observations, which can be improved by the corresponding zero-inflated(deflated) distribution [4][14].

However, parameter estimation is far from trivial, and maximum likelihood estimates must always be found numerically. Simple numerical methods (such as Newton's method) are not easy to implement, as there is no closed expression for the derivative of the Gamma function, and they fail if the starting value is not chosen appropriately [1].

The negative binomial distribution may also fail to fit the zero counts [4]. This problem may be overcome by using the zero-inflated negative binomial distribution, but in this case parameter estimation by maximum likelihood is even more complex, and the AIC also penalizes models with a larger number of parameters.

For the zero-inflated geometric distribution and the hurdle geometric distribution we found simple formulas for the maximum likelihood parameter estimates.

In the examples analyzed in this work, the zero-inflated geometric and hurdle geometric distributions give in most cases a fit to the data similar to that of the negative binomial distribution. In a few cases these distributions significantly improve the fit. The AIC results show that the models are similar.

However, the major advantage of these distributions is not a small improvement in the fit to the data but the fact that a simple formula is available for the maximum likelihood parameter estimates.

A simple formula for the distribution's parameters, which avoids the use of complex numerical methods for parameter estimation, may be of practical convenience for many researchers working in the area who are not familiar with programming and the numerical implementation of algorithms.

For all the models considered, the tail of the distribution is not well fitted. However, this does not necessarily indicate the need to consider other distributions. Because the samples are small, rare events in the tail, when observed, act like outliers, and too much weight is given to these rare observations.

Acknowledgements

This work was partially supported by grant CIUNSA 2018-2467. JPA is a member of the CONICET. GML is a doctoral fellow of CONICET.

Conflict of Interest

The authors have declared no conflict of interest.

References

  • [1] U. Bandara, R. Gill, and R. Mitra (2019) On computing maximum likelihood estimates for the negative binomial distribution. Statistics & Probability Letters 148, pp. 54–58.
  • [2] C. Bliss and R. Fisher (1953) Fitting the negative binomial distribution to biological data. Biometrics 9 (2), pp. 176–200.
  • [3] S. Clark and J. Perry (1989) Estimation of the negative binomial parameter by maximum quasi-likelihood. Biometrics 45 (1), pp. 309–316.
  • [4] H. Crofton (1971) A quantitative approach to parasitism. Parasitology 62 (2), pp. 179–193.
  • [5] H. Dai, Y. Bao, and M. Bao (2013) Maximum likelihood estimate for the dispersion parameter of the negative binomial distribution. Statistics & Probability Letters 83 (1), pp. 21–27.
  • [6] W. Greene (1994) Accounting for excess zeros and sample selection in Poisson and negative binomial regression models. NYU Working Paper No. EC-94-10.
  • [7] D. Hall (2000) Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics 56 (4), pp. 1030–1039.
  • [8] H. Hynes and W. Nicholas (1963) The importance of the acanthocephalan Polymorphus minutus as a parasite of domestic ducks in the United Kingdom. Journal of Helminthology 37 (3), pp. 185–198.
  • [9] D. Lambert (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34 (1), pp. 1–14.
  • [10] Y. Min and A. Agresti (2005) Random effect models for repeated measures of zero-inflated count data. Statistical Modelling 5 (1), pp. 1–19.
  • [11] W. Piegorsch (1990) Maximum likelihood estimation for the negative binomial dispersion parameter. Biometrics 46 (3), pp. 863–867.
  • [12] R. Pullan and S. Brooker (2012) The global limits and population at risk of soil-transmitted helminth infections in 2010. Parasites & Vectors 5 (81), pp. 1–14.
  • [13] K. Saha and S. Paul (2005) Bias-corrected maximum likelihood estimator of the negative binomial dispersion parameter. Biometrics 61 (1), pp. 179–185.
  • [14] B. Seo, S. Cho, and J. Chai (1979) Frequency distribution of Ascaris lumbricoides in rural Koreans with special reference on the effect of changing endemicity. Korean J Parasitol 17 (2), pp. 105–113.
  • [15] D. Shaw, B. Grenfell, and A. Dobson (1998) Patterns of macroparasite aggregation in wildlife host populations. Parasitology 117 (6), pp. 597–610.
  • [16] A.H. Welsh, R.B. Cunningham, C.F. Donnelly, and D.B. Lindenmayer (1996) Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling 88 (1), pp. 297–308.
  • [17] L. Willson, J. Folks, and J. Young (1984) Multistage estimation compared with fixed-sample-size estimation of the negative binomial parameter k. Biometrics 40 (1), pp. 109–117.
  • [18] M. Woolhouse, C. Dye, J. Etard, T. Smith, J. Charlwood, G. Garnett, P. Hagan, J. Hii, P. Ndhlovu, R. Quinnell, et al. (1997) Heterogeneities in the transmission of infectious agents: implications for the design of control programs. Proceedings of the National Academy of Sciences 94 (1), pp. 338–342.

7 Tables

| Theoretical distribution | Calculated parameters | Sample A | Sample B | Sample C | Sample D | Sample E | Sample F |
|---|---|---|---|---|---|---|---|
| NB | $\hat m$ | 1.0167 | 2.8235 | 2.3125 | 2.5106 | 6.6410 | 4.6102 |
| NB | $\hat k$ | 0.3546 | 0.4240 | 0.5761 | 0.5893 | 0.8726 | 0.6193 |
| NB | chi-squared statistic | 50.0660 | 25.8883 | 6.8092 | 17.4885 | 15.6914 | 11.2927 |
| NB | p-value | | 0.0556 | 0.8699 | 0.3547 | 0.4747 | 0.7911 |
| NB | AIC | 1438.0235 | 574.5621 | 131.7761 | 198.1410 | 235.0388 | 310.0239 |
| ZIG | $\hat \pi$ | 0.3653 | 0.2687 | 0.2011 | 0.1819 | 0.0971 | 0.1581 |
| ZIG | $\hat p$ | 0.3843 | 0.2057 | 0.2568 | 0.2458 | 0.1197 | 0.1544 |
| ZIG | chi-squared statistic | 213.1743 | 43.2591 | 6.8197 | 22.2336 | 14.4552 | 10.6473 |
| ZIG | p-value | | 0.0003 | 0.8693 | 0.1358 | 0.5648 | 0.8307 |
| ZIG | AIC | 1458.0335 | 579.2921 | 131.5360 | 198.1585 | 233.3277 | 308.9315 |
| | df | 16 | 16 | 12 | 16 | 16 | 16 |

Table 1: Parameters of the NB and ZIG distributions calculated from the observed Seo data [14], and results of the chi-squared test and AIC.
| Theoretical distribution | Calculated parameters | Sample A | Sample B | Sample C | Sample D | Sample E | Sample F |
|---|---|---|---|---|---|---|---|
| NB | $\hat m$ | 2.2732 | 1.4165 | 0.6003 | 1.3189 | 0.8913 | 0.2670 |
| NB | $\hat k$ | 1.2564 | 1.5837 | 0.2974 | 3.0544 | 1.2679 | 0.6069 |
| NB | chi-squared statistic | 20.6558 | 3.1086 | 10.5075 | 2.9993 | 2.3843 | 0.2776 |
| NB | p-value | 0.0081 | 0.8748 | 0.0621 | 0.8089 | 0.6655 | 0.8704 |
| NB | AIC | 2211.4460 | 1662.1623 | 1279.5742 | 1506.5751 | 724.9286 | 252.4139 |
| ZIG | $\hat \pi$ | -0.0256 | -0.1304 | 0.4875 | -0.3313 | -0.1020 | 0.2195 |
| ZIG | $\hat p$ | 0.3109 | 0.4438 | 0.4605 | 0.5023 | 0.5528 | 0.7451 |
| ZIG | chi-squared statistic | 23.4185 | 6.0467 | 6.5825 | 8.6706 | 2.1793 | 0.4542 |
| ZIG | p-value | | | 0.2536 | 0.1930 | 0.7028 | 0.7969 |
| ZIG | AIC | 2215.4026 | 1665.9254 | 1274.8293 | 1514.0280 | 724.8346 | 252.4989 |
| | df | 8 | 7 | 5 | 6 | 4 | 2 |

Table 2: Parameters of the NB and ZIG distributions calculated from the observed data on parasites in crabs [4], and results of the chi-squared test and AIC.