On the "Poisson Trick" and its Extensions for Fitting Multinomial Regression Models

by   Jarod Y. L. Lee, et al.

This article is concerned with the fitting of multinomial regression models using the so-called "Poisson Trick". The work is motivated by Chen & Kuo (2001) and Malchow-Møller & Svarer (2003) which have been criticized for being computationally inefficient and sometimes producing nonsense results. We first discuss the case of independent data and offer a parsimonious fitting strategy when all covariates are categorical. We then propose a new approach for modelling correlated responses based on an extension of the Gamma-Poisson model, where the likelihood can be expressed in closed-form. The parameters are estimated via an Expectation/Conditional Maximization (ECM) algorithm, which can be implemented using functions for fitting generalized linear models readily available in standard statistical software packages. Compared to existing methods, our approach avoids the need to approximate the intractable integrals and thus the inference is exact with respect to the approximating Gamma-Poisson model. The proposed method is illustrated via a reanalysis of the yogurt data discussed by Chen & Kuo (2001).



There are no comments yet.


page 1

page 2

page 3

page 4


Optimal Designs for Poisson Count Data with Gamma Block Effects

The Poisson-Gamma model is a generalization of the Poisson model, which ...

Bayesian inference, model selection and likelihood estimation using fast rejection sampling: the Conway-Maxwell-Poisson distribution

Bayesian inference for models with intractable likelihood functions repr...

Faster estimation for constrained gamma mixture models using closed-form estimators

Mixture models are useful in a wide array of applications to identify su...

Software defect prediction with zero-inflated Poisson models

In this work we apply several Poisson and zero-inflated models for softw...

Improved log-Gaussian approximation for over-dispersed Poisson regression: application to spatial analysis of COVID-19

In the era of open data, Poisson and other count regression models are i...

Fitting Infinitely divisible distribution: Case of Gamma-Variance Model

The paper examines the Fractional Fourier Transform (FRFT) based techniq...

3D Nanoscale Mapping of Short-Range Order in GeSn Alloys

GeSn on Si has attracted much research interest due to its tunable direc...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Data with correlated categorical responses arise frequently in applications. This may arise from units grouped into clusters (clustered data) or multiple measurements taken on the same unit (longitudinal data). For instance, we might expect the unemployment outcomes (employed, unemployed, not in labour force) of residents living in the same region to be correlated, due to similar job opportunities and socioeconomic levels. Ignoring the correlation structure and assuming that all observations are independent by fitting an ordinary multinomial regression model may result in biased estimates and inaccurate predictions. Multinomial mixed models can account for correlation by using group level random effects Daniels & Gatsonis (1997); Hartzel et al. (2001); Hedeker (2003).

For multinomial mixed models, it is a common practice to assume a multivariate normal distribution for the random effects. The multivariate normal distribution is easy to interpret and is convenient when we want to build more complicated correlation structures into our model. However, the resulting likelihood involves multidimensional integrals that cannot be solved analytically. The computational effort to evaluate the likelihood increases with the number of groups and categories, making it not suitable for large scale applications. In fact,

Lee, Green & Ryan (2017) showed that closed-form likelihoods for multinomial mixed models do not exist regardless of the random effect distribution, except for the special case when there are no covariates.

Various methods have been proposed to circumvent the computational obstacle for fitting multinomial mixed models. Among them are quadrature Hartzel et al. (2001); Hedeker (2003), Monte Carlo EM algorithm, pseudo-likelihood approach Hartzel et al. (2001)

and Markov Chain Monte Carlo methods

Daniels & Gatsonis (1997). Jain et al. (1994)

proposed a random effects estimation approach using a discrete probability distribution approximation. Simulation based methods such as method of simulated moments

McFadden (1989) and method of simulated maximum likelihood Gong et al. (2004); Hann & Uhlendorff (2006) are widely used in the econometrics literature. Recently, Perry (2016) proposed a fast moment-based estimation method that scales well for large samples and which arguably can be extended for fitting multinomial mixed models. Kuss & McLerran (2007) used the fact that the multinomial model is a multivariate binary model and exploited a procedure proposed by Wright (1998) for model fitting. Their approach has been criticized by de Rooij & Worku (2012) as they failed to realize that a multivariate link function is needed in the context of multinomial models. An alternative strategy using clustered bootstrap was subsequently proposed by de Rooij & Worku (2012). Although some authors have considered the Dirichlet-multinomial model that results in a closed-form likelihood, it does not allow the incorporation of individual level covariates.

Chen & Kuo (2001)

advocate using Poisson log-linear or non-linear mixed models, both with random effects, as surrogates to multinomial mixed models. Their method capitalizes on existing mixed models software packages for fitting generalized linear models with random effects. This allows multinomial mixed models to be fitted using an approximate likelihood from the Poisson surrogate models. Their results are based on extensions of the well known “Poisson Trick”

Baker (1994); McCullagh & Nelder (1989); Venables & Ripley (2002) that relates multinomial models with Poisson models via a respecification of the model formulae. Although clever, their methods have been criticized for being computationally inefficient Malchow-Møller & Svarer (2003); Kuss & McLerran (2007) and sometimes producing nonsense results Kuss & McLerran (2007). This might be due to the intractable likelihoods of their models and the various approximation methods being used in different software packages. The considerable execution time is especially problematic, where it can take up to months to fit the model to a moderate-sized dataset Malchow-Møller & Svarer (2003)! In this article, we propose a new approach based on an extension of the Gamma-Poisson model Lee, Brown & Ryan (2017), where the likelihood can be expressed in closed-form. Using the proposed estimation procedure, the parameters can be estimated via readily available packages for fitting generalized linear models.

The remaining paper is organized as follows. Section 2 reviews the “Poisson Trick” for multinomial regression models with independent responses, and suggests a parsimonious fitting strategy when all covariates are categorical. In Section 3 we propose a new approach for approximating the likelihood of multinomial regression models with random effects. The empirical performance of the proposed model is demonstrated via a simulation study and a reanalysis of the yogurt brand choice dataset as discussed by Chen & Kuo (2001) in Section 4. Finally, we conclude with a discussion and a summary of our findings in Section 5.

2 "Poisson Trick" for Independent Multinomial Responses

This section describes the relationship between multinomial and Poisson regression models for independent responses, which we refer to as the “Poisson Trick”. The results are based on the well known fact that given the sum, Poisson counts are jointly multinomially distributed McCullagh & Nelder (1989).

2.1 Derivation

Let be the

response vector for observation

with the corresponding probability vector , where indexes the multinomial category. A common approach to satisfy the two characteristics of probability (i) for all and ; and (ii) is via


where is a positive user-specified function of covariates and fixed effects , and . Depending on the type of variable under consideration, and can be indexed by various combinations of and (Croissant, 2013, pp. 7-8). Conditional on the multinomial sums , are independently multinomially distributed for each , i.e.


In multinomial models, is treated as fixed. Suppose we instead treat as random and assume


independently for each . This results in a multinomial-Poisson mixture with the following joint probability function for each :


The marginal probability of can then be obtained by summing the joint probability over all possible values of :


Thus, allowing the multinomial sums to be random according to a Poisson distribution results in


independently for each and . Summing over all observations, the log-likelihood is


where and denote the Poisson and multinomial log-likelihood functions respectively, and . The second term on the left hand side is the model we would like to fit, and the term on the right hand side is the model we actually fit.

To show that the Poisson surrogate model is an exact fit to the multinomial model, first note the log-likelihood corresponding to the multinomial model is


and the log-likelihood of the Poisson surrogate model is


Differentiating Equation 9 with respect to and setting it to , we obtain . Plugging in the maximizing value of into Equation 9, we have


Equation 10 is identical to Equation 8

, up to an additive constant. It follows that the maximum likelihood estimates, their asymptotic variances and tests for the fixed effects can be

exactly recovered under the Poisson surrogate model Richards (1961). That is, likelihood inference for is the same whether we regard as fixed (multinomial) or randomly sampled from independent Poissons. This result applies to the fixed effects model, with any parameterization of , including:

  • Exponential transformations of linear combinations of categorical variables and regression coefficients McCullagh & Nelder (1989); Agresti (2013),

  • Exponential transformations of linear combinations of continuous variables and regression coefficients,

  • Any monotonic transformations of linear combinations of covariates and regression coefficients,

  • Nonlinear functions of covariates and regression coefficients,

  • Nonparametric formulations.

The Poisson surrogate model eliminates from the denominator of the multinomial probabilities. This makes sense intuitively, as we do not expect the multinomial sums to provide any useful information in estimating the fixed effects. Given that can also be obtained by setting the fitted values of the multinomial sums equal to the observed counts in the Poisson surrogate model, has the effect of recovering the multinomial sums. The key idea is to include a separate constant for each unique combination of covariates in the Poisson surrogate models.

2.2 Specifying Model Formulae for Poisson Surrogate Models

For purposes of exposition, the model formulae in this section are written in terms of the R language R Development Core Team (2017), although this article is not concerned with software packages per se. Multinomial models are fitted using the multinom() function within the nnet package Ripley & Venables (2016); Poisson models are fitted using the glm() function within the stats package.

For concreteness, consider the non-parallel baseline category logit models. The “baseline category logit” assumption refers to the following: treating category as the baseline category with without loss of generality, we model = as a linear function of covariates and regression coefficients . This assumption is not necessary, but chosen so that the model formulae can be illustrated using functions within the stats package. The “non-parallel” assumption refers to covariate effects that vary across categories Fullerton & Xu (2016), i.e. all elements of the vector are indexed by . That is, if the set of logits are plotted against the covariate on the same graph, a set of straight lines with slopes that are in general different will be obtained. Later we shall discuss cases where we relax this assumption.

Consider a hypothetical dataset with two predictors and (these can be categorical or continuous) and a multinomial outcome vector with categories. In short format, each row of data represents an observation with a -dimensional outcome vector . Poisson models treat the outcomes of each observation as independent and glm() requires data to be presented in long format. This requires an additional factor that denotes the category memberships. Each row now comprises a scalar outcome, resulting in rows of data per observation. The first few rows of data in both short and long format are shown in Table 1.

1 0 0 3 5 2
2 0 1 5 5 0
3 1 0 7 2 1
4 1 1 1 3 6
(a) Short format.
1 0 0 1 3
1 0 0 2 5
1 0 0 3 2
2 0 1 1 5
2 0 1 2 5
2 0 1 3 0
(b) Long format.
Table 1: Multinomial data.

Table 2 shows the equivalant relationship between non-parallel multinomial models and the corresponding Poisson models, where the parameters satisfy the usual constraints for identifiability. The Poisson surrogate models possess several important features:

  1. The model includes an indicator variable (that corresponds to in Section 2.1) for each observation, although this can be simplified when all covariates are categorical. This is to ensure the exact recovery of the multinomial sums, as the fixed sums are treated as random in the Poisson models. As a result, we do not interpret the coefficients of since they are just nuisance parameters.

  2. The category membership indicator enters as a covariate in the Poisson models, where the coefficients correspond to the intercepts in the multinomial modelsv so that the counts are allowed to vary by category.

  3. The model includes interaction terms between and (denoted by in the model formula), where the coefficients correspond to the slopes in the multinomial models. This is due to the non-parallel assumption where each category has a separate slope, and also the fact that multinomial models treat the response counts jointly for each observation, whereas Poisson models treat each response count as a separate observation. It is important that these interaction terms are included even if they are not significant. For multinomial models where some (partial models) or all (parallel models) of the covariate effects do not vary across categories, the equivalent Poisson models can be obtained by modifying the interaction structure between and accordingly. For instance, in parallel models where all categories share the same covariate effects, there is no need to include the interation terms between and , since the slopes do not vary across categories.

Multinomial Poisson
  • Syntax for using multinom() in R, where data are presented in short format and is a vector of response counts.

  • Syntax for using glm() in R, where data are presented in long format and is a scalar response count.

Table 2: Equivalent relationship between non-parallel multinomial models and the corresponding Poisson models.

When writing the model formula, it is important to specify and as factors due to their categorical nature. This can be achieved via the factor() function in R.

Special Case: Categorical Covariates
When all covariates are categorical, the model formulae in Table 3 offer a more parsimonious option for fitting the Poisson models without having to estimate a separate parameter for each observation.

Multinomial Poisson
  • Syntax for using multinom() in R, where data are presented in short format and is a vector of response counts.

  • Syntax for using glm() in R, where data are presented in long format and is a scalar response count.

Table 3: Equivalent relationship between non-parallel multinomial models and the corresponding Poisson models, when all the covariates are categorical.

As stated above, the key to achieving the 1-1 correspondence between multinomial and Poisson models (with the same link function) is to include a separate constant for each unique combination of covariates. For categorical covariates, this can be achieved by including the full interaction among the predictors in the Poisson model. When all the covariates are categorical, the interaction term has the precise effect of pooling groups of observations with identical covariates. Fitting such models is equivalent to fitting the observation index as a factor (Table 2), but the pooling results in a smaller effective data frame, and therefore smaller storage requirements and faster fitting speed, with no loss of information. Of course, if there are many factors, there may not be much saving, because it will be comparatively rare for different observations to have all the same factor level combinations.

3 Extending the “Poisson Trick” for Correlated Multinomial Responses

3.1 Derivation

Consider a set of observations which fall into a collection of groups and let be a vector-valued random effect for group . Each observation belongs to only a single group. Extending the notation in Section 2, the response vector for observation in group is , with the corresponding probability vector , where . Conditional on the multinomial sums and the random effects , the counts are Multinomially distributed:


In analogy to the results in Section 2, given the random effects, we treat as random and assume


independently for each and . This gives


independently for each and . The probability argument in Equation 7 still holds, now conditional on the random effects:


where . If the random effects are observed, the conditional probability statement above imply a 1-1 exact correspondence between the multinomial and the Poisson surrogate models. However, due to the unobserved nature of the random effects, interest lies in the marginal distribution, obtained by integrating out the random effects. This results in an approximate relationship between the two models. It turns out that the marginal likelihood of the approximating Poisson surrogate model (right hand side of Equation 14) can be expressed in closed-form if we assume an independent Gamma model for the random effects, with and , i.e. Lee, Green & Ryan (2017).

With this assumption for the distribution of the random effects, the marginal likelihood of the multinomial model that we would like to fit (second term on the left hand side of Equation 14) is given by


This does not generally exhibit a closed-form solution regardless of the random effect distribution, unless in the special cases of no covariate or with only group specific covariates Lee, Green & Ryan (2017). Numerical or simulation methods can be used to approximate the likelihood, with computational efforts increasing with increasing number of groups and categories. On the other hand, the marginal likelihood of the Poisson surrogate model can be expressed in closed-form:


The Poisson surrogate model is an extension of the Gamma-Poisson model as proposed by Lee, Brown & Ryan (2017) and Lee, Green & Ryan (2017) to allow the modelling of counts for multiple categories.

As a consequence of Equation 16, we have


Refer to the appendix for details. This is the population-averaged expected value and is not suitable for prediction in general, as it does not take into account the cluster effect. However, it can be useful for out of sample prediction, when there are no samples present in a particular group.

Special Case: Var() approaches
When Var() approaches for all , the model reduces to the special case of no random effects as outlined in Section 2, and the exact correspondence between the multinomial and the Poisson models can be regained.

3.2 Identifiability

There is some lack of identifiability with the model formulation given by Equation 16, characterized by non-uniqueness of the maximum likelihood estimates. There is an identifiability issue between and , and also between and the category intercepts. To fix this, we impose the constraint of so that . As a consequence, and . Also, we only require a random effect for each logit, and thus a constraint for the random effects associated with the baseline category is needed. Denote . Several authors such as Agresti (2013) (pp.514) and Hartzel et al. (2001) considered a multivariate normal distribution for the random effects, i.e. , where is a by variance-covariance matrix. This is equivalent to saying that for all , or . The equivalent statement in our proposed model is to fix for all . This is tantamount to saying approaches , and thus approaches .

3.3 Prediction of Random Effects and Fitted Values

We focus on the best predictor (BP) for random effects prediction, i.e. the predictor that minimises the overall mean squared error of prediction. McCulloch et al. (2008) shows that the BP is given by the posterior expectation of the random effect. Under the proposed Poisson surrogate model, the BP is given by


which can be calculated via


Solving for the integral, the BP is


where . depends on the parameters , and , in which we replace by their estimators, leading to the empirical best predictor (EBP). The fitted values can then be defined as


where we replace and by their respective estimators and .

3.4 Parameter Estimation

Consider the parameterization which is widely adopted in practice, where , where and are both vectors. The chosen index structure for and encompasses a variety of possible scenarios: (i) category-specific predictors with generic coefficients , (ii) category-specific predictors with category-specific coefficients , and (iii) observation-specific predictors with category-specific coefficients

. This can be achieved by creating the appropriate interaction terms between the predictor and the category indicator variable, thus modifying the model matrix. Note that observation-specific predictors must be paired with choice-specific coefficients. Otherwise they will disappear in the differentiation when we consider the log-odds.

Denote , where includes the incidental parameters for all and . Algorithm 1 presents an Expectation/Conditional Maximization (ECM) algorithm Meng & Rubin (1993) for parameter estimation of the Poisson surrogate model. Refer to the appendix for a detailed derivation.

  Initialize . Cycle: while relative differences in the parameter estimates are not negligible do
       E-Step: Calculate for each and :
where is the digamma function.CM-Step:
  • Obtain by fitting a Poisson log-linear model with as the response and as the design matrix, with as offset. includes indicator variables for each unique combination of covariates.

  • Obtain for each for to by maximizing

    where is the gamma function.

end while
Algorithm 1 Expectation/Conditional Maximization (ECM) algorithm for fitting the Poisson surrogate model.

4 Yogurt Brand Choice Dataset

We consider the yogurt brand choice dataset previously analyzed by Jain et al. (1994) and Chen & Kuo (2001). Jain et al. (1994) approximated the likelihood of a multinomial logit model with Gaussian random effects using a discrete distribution. Chen & Kuo (2001) approximated the multinomial logit model using the Poisson log-linear model and Poisson nonlinear model, both with Gaussian random effects.

The dataset consists of purchases of yogurt by a panel of households in Springfield, Missouri, and were originally provided by A. C. Nielsen. The data were collected by optical scanners for about two years and correspond to purchases. Variables collected include brand, price and presence of newspaper feature advertisements for each purchase made by households in the panel. Price and feature advertisements are choice-specific variables. We assume a parallel baseline logit model by assigning generic coefficients to these variables, as we do not expect the effect of price and feature advertisements on the probability of purchase to vary according to brands. The four brands of yogurt: Yoplait, Dannon, Weight Watchers, and Hiland account for market shares of 34%, 40%, 23%, and 3% respectively. Following Chen & Kuo (2001), we put Hiland as the reference brand. Table 4 presents the yogurt data in both long and short format. The letters ‘f’ and ‘p’ represent the feature and price variables respectively, with the letter that follows denoting the brand. For instance, ‘fy’ stands for ‘feature of Yoplait’ and ‘pd’ stands for ‘price of Dannon’.

id obs yoplait dannon weight hiland fy fd fw fh py pd pw ph
1 1 0 1 0 0 0 0 0 0 0.108 0.081 0.079 0.061
1 2 0 1 0 0 0 0 0 0 0.108 0.098 0.075 0.064
1 3 0 1 0 0 0 0 0 0 0.108 0.098 0.086 0.061
2 9 1 0 0 0 0 0 0 0 0.108 0.098 0.079 0.050
100 2412 0 0 1 0 0 0 0 0 0.108 0.086 0.079 0.043
(a) Short format.
id obs feature price count brand
1 1 0 0.108 0 yoplait
1 1 0 0.081 0 dannon
1 1 0 0.079 1 weight
1 1 0 0.061 0 hiland
1 2 0 0.108 0 yoplait
1 2 0 0.098 1 dannon
100 2412 0 0.043 1 hiland
(b) Long format.
Table 4: Yogurt data.

We fit the models proposed in Sections 2 and 3, and compare our results to that of Chen & Kuo (2001), fitted using the SAS macro GLIMMIX and the SAS procedure NLMIXED. The results are presented in Table 5. The preference ordering of the brands are the same for all models, i.e. Yoplait is the most preferred brand, followed by Dannon, Weight Watchers and Hiland. The slope parameters estimates have the expected signs for all models. An increase in price is associated with a decrease in the probability of purchase. Feature advertisement tends to increase the chance of purchase. The household-to-household variation in the probability of purchase for Weight Watchers is much larger than the other brands, although none are significant (p > 0.05).

In comparing the estimates between models, we note that the fixed effects model is likely to produce biased estimates as it did not take into account of the correlation induced by multiple purchases from the same household. The parameter estimates of NLMIXED and Gamma-Poisson are uniformly larger than that of GLIMMIX, except for the intercept associated with Weight Watchers (for NLMIXED) and for the slope associated with price (for Gamma-Poisson). The estimates of the standard errors from NLMIXED and Gamma-Poisson are also uniformly larger than that of GLIMMIX. These differences can be attributed to the different distributional assumptions of the random effects, and also the different approximations used in GLIMMIX and NLMIXED to estimate the intractable likelihood. In this regard, our model exhibit a closed-form likelihood that allows exact inference to be performed with respect to the approximating model.

We tried to fit a simplified version of GLIMMIX using the glmer() function within the lme4 package in R, with just a random effect per household (ignoring the choice effect). However, the model failed to converge within a few months, even though Chen & Kuo (2001) claimed that the GLIMMIX model coverged in SAS.

Random Effects
Variables Fixed Effects GLIMMIX NLMIXED Gamma-Poisson
Dannon 3.716 (0.145) 3.838 (0.231) 4.130 (0.648) 4.616 (0.309)
Weight 3.074 (0.145) 2.242 (0.241) 1.046 (0.671) 3.677 (0.392)
Yoplait 4.450 (0.187) 4.626 (0.261) 4.805 (0.699) 5.275 (0.342)
Feature 0.491 (0.120) 0.730 (0.121) 0.956 (0.185) 0.785 (0.178)
Price -36.658 (2.437) -40.012 (2.562) -36.686 (3.725) -40.881 (3.778)
N/A N/A N/A 2.203 (0.134)
N/A N/A N/A 6.067 (0.374)
N/A N/A N/A 1.918 (0.135)
  • Fitted using the glm() function in R, using the “Poisson Trick” as outlined in Section 2.

  • Poisson log-linear model with Gaussian random effects, fitted by Chen & Kuo (2001) using the SAS macro GLIMMIX.

  • Poisson nonlinear model with Gaussian random effects, fitted by Chen & Kuo (2001) using the SAS procedure NLMIXED.

  • Poisson log-linear model with Gamma (multiplicative) random effects fitted using the ECM algorithm, as outlined in Section 3.

Table 5: Regression estimates for the yogurt data, and the associated standard errors.

5 Concluding Remarks

In this article, we presented methods for fitting various multinomial regression models via the so-called “Poisson Trick” and its extensions. The “Poisson Trick” for fitting fixed effects multinomial regression models is handy when the direct fitting of multinomial models is not supported, for instance the INLA package Rue et al. (2009) in R. For multinomial regression models with random effects, there exist a variety of experience for using the existing extensions proposed by Chen & Kuo (2001), from taking months to fit a moderate sized dataset Malchow-Møller & Svarer (2003), producing nonsense results Kuss & McLerran (2007) to non-convergence in our experience of fitting the yogurt brand choice dataset. We proposed an extension of the “Poisson Trick” using Gamma (multiplicative) random effects. In contrast to the models by Chen & Kuo (2001), our model exhibits a closed-form likelihood and can be maximized using existing functions for fitting generalized linear models that are stable and heavily optimized, without having to approximate the integrals.

6 Appendix

6.1 Derivation of the Population-Averaged Expected Values in Equation 17

Equation 16 is also equivalent to


This results in two different interpretations for the extended Gamma-Poisson surrogate model:

  1. For each and , is independent negative multinomial (Equation 16).

  2. For each and , the category sums are independent Negative Binomial
    , and conditional on the , is independent multinomial (Equation 22).

Taking expectation of both Equations 16 and 22 with respect to gives rise to the population-averaged expected value given in Equation 17

. The definitions of negative multinomial and negative binomial distributions are given in the following subsections.

6.1.1 Negative Multinomial Distribution

This is the distribution on the non-negative integers outcomes , with corresponding probability of occurence and probability mass function

for parameters and , where for all , and is the Gamma function. We write . For positive integer

, the negative multinomial distribution can be recognized as the joint distribution of the n-tuple

when performing sampling until reaches the predetermined value . The mean vector of negative multinomial distribution is given by .

6.1.2 Negative Binomial Distribution

This is the distribution on the non-negative integers outcome , with corresponding probability of occurence and probability mass function

for parameters and . We write . For positive integer , the negative binomial distribution can be recognized as the distribution for the number of heads before the th tail in biased coin-tossing, but it is a valid distribution for all . In engineering, it is sometimes called the Pólya distribution in the case where is not integer.

6.2 Derivation of the Expectation/Conditional Maximixation (ECM) Algorithm in Section 3.4

Treating for all and to as missing data and for all , and as observed data, the complete data is . Denote , where includes the incidental parameters for all and . The complete data log-likeliood is


The th E-step involves finding the conditional expectation of the complete data log-likelihood with respect to to the conditional distribution of given and the current estimated parameter . Straightforward algebra establishes that


independently for each and

, where the gamma distribution is parameterized in terms of scale parameter. It follows that


Thus, in the th E-step, we replace and in Equation 23 with and , giving . The th CM-step then finds to maximize via a sequence of conditional maximization steps, each of which maximizes the function over a subset of , with the rest fixed at its previous value. In our application, it is natural to partition into and for each to . Differentiating Equation 23 with respect to , we obtain


which is the score equation of the Poisson log-linear model McCullagh & Nelder (1989) with an additional offset . This allows us to leverage existing functions for fitting generalized linear models available in most statistical software packages for maximizing in the CM step. This is an important feature as often contains a huge amount of parameters in our applications, due to the inclusion the incidental parameter for every unique combination of covariates. Existing functions for fitting generalized linear models are typically stable and heavily optimized, even for a large number of parameters. Maximizing in the CM step for each q is straightforward, as it only involves univariate optimization.


Lee’s research is partially supported by the Australian Bureau of Statistics. The authors are grateful to Zhen Chen for providing the yogurt data.


  • (1)
  • Agresti (2013) Agresti, A. (2013), Categorical Data Analysis, Wiley.
  • Baker (1994) Baker, S. G. (1994), ‘The multinomial-poisson transformation’, Journal of the Royal Statistical Society: Series D 43(4), 495–504.
  • Chen & Kuo (2001) Chen, Z. & Kuo, L. (2001), ‘A note on the estimation of the multinomial logit model with random effects’, The American Statistician 55(2), 89–95.
  • Croissant (2013) Croissant, Y. (2013), Estimation of multinomial logit models in R: The mlogit Packages. R package version 0.2-4.
  • Daniels & Gatsonis (1997) Daniels, M. J. & Gatsonis, C. (1997), ‘Hierarchical polytomous regression models with applications to health services research’, Statistics in Medicine 16, 2311–2325.
  • de Rooij & Worku (2012) de Rooij, M. & Worku, H. M. (2012), ‘A warning concerning the estimation of multinomial logistic models with correlated rresponse in SAS’, Computer methods and programs in biomedicine 107, 341–346.
  • Fullerton & Xu (2016) Fullerton, A. S. & Xu, J. (2016), Ordered Regression Models: Parallel, Partial, and Non-Parallel Alternatives, CRC Press.
  • Gong et al. (2004) Gong, X., van Soest, A. & Villagomez, E. (2004), ‘Mobility in the urban labor market: A panel data analysis for Mexico’, Economic Development and Cultural Change 53(1), 1–36.
  • Hann & Uhlendorff (2006) Hann, P. & Uhlendorff, A. (2006), ‘Estimation of multinomial logit models with unobserved heterogeneity using maximum simulated likelihood’, The Stata Journal 6(2), 229–245.
  • Hartzel et al. (2001) Hartzel, J., Agresti, A. & Caffo, B. (2001), ‘Multinomial logit random effects models’, Statistical Modelling 1(2), 81–102.
  • Hedeker (2003)

    Hedeker, D. (2003), ‘A mixed-effects multinomial logistic regression model’,

    Statistics in Medicine 22(9), 1433–1446.
  • Jain et al. (1994) Jain, D. C., Vilcassim, N. J. & Chintagunta, P. K. (1994), ‘A random-coefficients logit brand-choice model applied to panel data’, Journal of Business & Economic Statistics 12(3), 317–328.
  • Kuss & McLerran (2007) Kuss, O. & McLerran, D. (2007), ‘A note on the estimation of the multinomial logistic model with correlated rresponse in SAS’, Computer Methods and Programs in Biomedicine 87, 262–269.
  • Lee, Brown & Ryan (2017) Lee, J. Y. L., Brown, J. J. & Ryan, L. (2017), ‘Sufficiency revisited: Rethinking statistical algorithms in the big data era’, The American Statistician (to appear) .
  • Lee, Green & Ryan (2017) Lee, J. Y. L., Green, P. J. & Ryan, L. M. (2017), ‘Conjugate generalised mixed models for two-level data’, Paper in preparation .
  • Malchow-Møller & Svarer (2003) Malchow-Møller, N. & Svarer, M. (2003), ‘Estimation of the multinomial logit model with random effects’, Applied Economics Letters 10(7), 389–392.
  • McCullagh & Nelder (1989) McCullagh, P. & Nelder, J. A. (1989), Generalized Linear Models, Chapman & Hall.
  • McCulloch et al. (2008) McCulloch, C. E., Searle, S. R. & Neuhaus, J. M. (2008), Generalized, linear and mixed models, John Wiley & Sons.
  • McFadden (1989) McFadden, D. (1989), ‘A method of simulated moments for estimation of discrete response models without numerical integration’, Econometrica 57(5), 995–1026.
  • Meng & Rubin (1993) Meng, X.-L. & Rubin, D. B. (1993), ‘Maximum likelihood estimation via the ECM algorithm: A general framework’, Biometrika 80(2), 267–278.
  • Perry (2016) Perry, P. O. (2016), ‘Fast moment-based estimation for hierarchical models’, Journal of the Royal Statistical Society: Series B 79(1), 267–291.
  • R Development Core Team (2017) R Development Core Team (2017), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org.
  • Richards (1961) Richards, F. (1961), ‘A method of maximum-likelihood estimation’, Journal of Royal Statistical Society (Series B) 23, 469–475.
  • Ripley & Venables (2016) Ripley, B. & Venables, W. (2016),

    nnet: Feed-Forward Neural Networks and Multinomial Log-Linear Models

    R package version 7.3-12.
  • Rue et al. (2009)

    Rue, H., Martino, S. & Chopin, N. (2009), ‘Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion)’,

    Journal of the Royal Statistical Society, Series B 71(2), 319–392.
  • Venables & Ripley (2002) Venables, W. & Ripley, B. (2002), Modern Applied Statistics with S, Springer.
  • Wright (1998)

    Wright, S. (1998), Multivariate analysis using the MIXED procedure (paper 229-23),

    in ‘Proceedings of the 23rd Annual SAS Users Group (SUGI) International Conference’.