Count data often arise in healthcare and epidemiology data sets. For example, the number of outcomes observed (i.e. cases of a disease) in a specific group of people (i.e.
citizens residing in Knox county). We tend to treat count data as realizations from a Poisson distribution, so that the probability of observingevents given an expected number of events, , is precisely
In regression analysis, Poissonian data are most often related to a linear combination of independent variables (i.e., covariates, exposures) via a log-link function. Hence, the Poisson regression model is defined as
where , , are the measured effects of indpendent exposures, . For a binary exposure, , the amount is the increase (or decrease) in outcome incidence for subjects in which the exposure was observed, relative to subjects in which the exposure was not observed . The amount is the expected number of counts in baseline subjects. Researchers are responsible for determining the criteria defining a baseline measurement.
A major assumption in Poisson regression modeling is that the mean and variance of the observed counts are equivalent. When this assumption is violated, as is often true of real data, we say the data are overdispersed. A negative binomial distribution may more appropriate than a Poisson for describing overdispersed count data, due to its excessive tail behavior.
The difference between negative binomial regression (also referred to as NB-2 models) and Poisson regression comes from our treatment of the observed count data. We still make the assumption that these counts come from a Poisson distribution, but we make the added assumption that the mean number of events,
, comes from a 2-parameter Gamma distribution.
1.1 Mixture models
Finite mixture (FM) modeling has emerged within the past 20 years as a popular means for handling unsupervised classification tasks. The underlying principle behind mixture modeling is that we treat our observed data as having been sampled from a convex sum of distributions [5, 15]. This concept is particularly useful for explainining overdispersion in count data as being due to an underlying heterogeneous population.
FM modeling has appeared in tandem with Poisson regression models. The extent of this type of research has mostly been limited to zero-inflated, zero-truncated, and hurdle-type models . These models are designed to handle zero-counts, and are largely inappropriate for explaining heterogeneities due to covariates. Papastamoulis et al. (2014) described a methodology that can address covariate heterogeneity using FMs of Poisson regression models in the context of high-throughput sequencing data 
. Their EM algorithms for estimatingmixtures are freely available online from CRAN.
In addition, Park and Lord (2009) have developed FMs of Poisson and negative binomial regression models for explaining hetergeneities in vehicle crash data . Of the modeling frameworks that have been suggested in the literature, theirs is most similar to ours in that they proposed using the information criteria AIC, BIC, and DIC to choose the optimal number of mixing components. However, these criteria did not give conclusive results, and so they were forced to subjectively choose a mixture of 2 NB-2 models. In the proceeding sections, we will discuss potential reasons for these inconclusive results. For more information on finite mixtures of negative binomial regression models, see also Zou et al. (2013) .
1.2 Model selection
Model selection exists as an alternative approach to classical hypothesis-drive statistics which require distributional assumptions and an arbitrary selection of a confidence level to evaluate -values. Akaike was the first to propose a criterion for conducting model selection when he introduced his celebrated Akaike information criterion (AIC) in 1973. The formula, which involves a tradeoff between a candidate model’s lack of fit given by the maximized log-likelihood function, and a penalty term for the number of parameters, , is shown below:
When we compare 2 or more models that have been fit to the same set of data, we prefer to choose the model that minimizes the AIC score.
As it turns out, the original penalty term that Akaike proposed, , is not enough to prevent model overfitting . Numerous other information criteria (IC) have been proposed, including but not limited to Schwarz’s Bayesian Criterion (SBC or BIC) , the deviance information criterion (DIC) (which we will not consider here), the consistent form of AIC (CAIC) , and the information-theoretic measure of complexity (ICOMP)  which involves evaluating the inverse Fisher’s information matrix (IFIM). The reader is directed to the respective citations for derivations of the formulae. We will point out that each of the criteria we consider in this paper all share the lack-of-fit term in common – what makes them different is how they penalize overparameterization.
2.1 Log-likelihood functions
Our ultimate goal when we carry out model selection in this framework is to determine the optimal number of components in a FM model, given a fixed maximum number of components (see 
for heuristics on choosing this number). This optimal number is reflective of the number of subpopulations composing the entire population from which we have sampled our data. Before we can perform model selection, we must be able to generate the various criteria discussed previously. To do so requires that we compute the log-likelihood function.
Recall equation (2) when dealing with Poissonian data whose mean and variance are similar. It follows that . Since our outcome is sampled from a convex sum of Poisson distributions we have the following probability for observing -dimensional count data, :
In equation (4), is the total number of components in the FM model; is the weight assigned to component ; and, refers to the effects in the -th component, as this will vary within components. It follows directly from equation (2) that the log-likelihood is
When data are significantly overdispersed, recall that we assume our counts, , come from a Poisson distribution whose mean parameter, , is Gamma-distributed. This additional constraint gives us the following probability density for :
The interpretation of and follows as before. The are overdispersion parameters, satisfying (note: ). The quadratic dependence of the variance on the mean is why we refer to these as NB-2 models.
Note that when , the data are not overdispersed and so the assumption that is satisfied. Using our methodology, if the overdispersion parameter is below a certain threshold, we elect to perform Poisson regression, otherwise we perform NB-2 regression. This is especially important for computing the log-likelihood functions once we carry out scoring.
The choice of an appropriate initialization scheme was a source of great concern for Papastamoulis et al. in their algorithm . They used a “Small-EM” strategy from Biernacki et al. (2003), which was necessary for dealing with a large number of component mixtures () .
We are not interested in exploring the possibility of more than components unless there is some justifiable epidemiological reason for doing so. Because of this, we opted for using a Poisson mixture model to perform unsupervised classification of our observed counts (see ). As a result, our methodology involves first sorting the count data into groups, and then assigning observations in each group to an initial Poisson or NB-2 regression model.
All numerical simulations were carried out in MATLAB 2015a. We used ArcGIS 10.2 for generating the map of Tennessee counties.
3.1 Simulated data
We begin using our proposed model framework to choose the optimal number of components in a simple set of simulated data. Each observation in the data is an ordered pair,, where , and , . The parameters and for all are randomly generated integers between 1 and 50. In this way, each observation comes from 1 of 4 potential groups.
We tried fitting FMs of 1 to 5 Poisson regression models to these data; their scores are shown in Table 1. We see that all 3 scores select as the optimal number of components. The regression models corresponding to the solution when are shown in the right pane of Figure 1. The regression models each seem to do well modeling these data.
3.2 Tennessee HIV counts
We now use real data taken from a 2014 county-level study on persons living with HIV in the U.S. South . Readers are encouraged to refer to  for a summary of the entire data set. For demonstrating our methodology, we are interested only in the TN counties that were included in the study. These data are summarized in Table 4. For reference, the abbreviations we use when referring to individual covariates are summarized below in Table 2.
|–||HIV||Number of persons living with HIV||Continuous (count)|
|1||SCH||Proportion of persons with less than a HS education||Continuous|
|2||POV||Proportion of persons living below the poverty line||Continuous|
|3||INC||Natural logarithm of median income||Continuous|
|4||URB||Urbanicity indicator (population 50,000/50,000)||Categorical (2)|
|6||NHB||Proportion of Non-Hispanic Black persons||Continuous|
|7||NHW||Proportion of Non-Hispanic White persons||Continuous|
|8||HSP||Proportion of Hispanic persons||Continuous|
Description of the variables in the Tennessee HIV data. The outcome is HIV. Categorical variables have the number of levels listed in parentheses.
The count data are largely overdispersed (mean(HIV) = 166.45, var(HIV) 500,000), so we proceed by using mixtures of NB-2 models. In Figure 2, we see that the model selected 4 regression models, which is indicative of 4 separate homogeneous subpopulations within Tennessee counties.
Given that there are 4 subpopulations, we used a genetic algorithm (GA) to select the subset of variables from the original 8 that contributes most to sorting the data. GA have been used numerous times for carrying out variable subset selection, although never for Poisson or NB-2 regression (see [1, 7, 16] for GA applied to other regression models). The results for performing GA subsetting in our model are summarized in Table 3. We see that variables 4 and 7 appear in each of the subsets most frequently chosen, and the model fitted with only variables 1, 3, 4, 5, and 7 is the best at minimizing the IC scores.
|1, 3, 4, 5, 7||17.00%||708.76||776.89||756.89|
|1, 2, 3, 4, 5, 7||10.00%||703.51||781.86||758.86|
In the interest of parsimony, Table 5 shows coefficients from each of the 4 components in a FM model fitted with just the urban indicator and the non-Hispanic white proportion. We see that the proportion of whites in a county is significantly protective against HIV count in 3 out of 4 components, while the urban indicator is a significant risk for HIV count.
In components 1 and 4, urban indicator was not included in the regression analysis because all of the counties belonging to those components had urban indicators of 0 or 1, respectively. Based on Table 6, this makes sense. We see that component 1 contains counties with HIV counts less than 15, while component 4 contains counties with counts greater than 100. The proportion of individuals with a high school diploma or lower decreases in each component, as do the non-Hispanic white proportions. Interestingly, the poverty rate and unemployment rates also decrease. At the same time, the urban indicator, and the non-Hispanic black and Hispanic proportions increase with each component.
Our results indicate that the proportion of white persons in a county seems to be a significant predictor of HIV count in that county. This is slightly inconsistent with the conclusions of Gray, et al. , who found that the proportion of black persons was the most significant contributor to HIV rate. That being said, our GA the non-Hispanic black variable was selected numerous times as a significant variable, however it was not one the most significant (results not shown).
That urban indicator was selected as a significant variable comes as no surprise. This result tells us that as the population increases, we should also expect to see an increase in HIV count. For numerous different reasons, this result makes perfect sense. Moreover, in Figure 3, we see that the counties with the highest HIV counts correspond to the some of the largest cities in Tennessee, including Knox (Knoxville), Davidson (Nashville), Hamilton (Chattanooga), and Shelby (Memphis).
It was also interesting to find that poverty rates and education levels tended to be negatively associated with HIV counts. Our best explanation is that, at least in Tennessee, population size is a confounder for education level and poverty rate. That is to say that as the population of a county increases, we also expect to see access to jobs and education increasing. It would be worth investigating these results further to determine what potential public health impacts exist.
We have proposed a novel framework for analyzing data sets in which the response variable is a count. Our approach allows us to systematically identify homogeneous subpopulations arising in the dataset, based on the covariates and their contributions to the count data in a Poisson regression model. We are also able to select the covariates that contribute most to determining the aforementioned subpopulations. Using this approach on a set of HIV count data in Tennessee has led to reasonable and quantifiable results. This would suggest that we could apply this approach to other sets of data involving counts of other infectious diseases, such as influenza, dengue fever, or even the Ebola virus disease.
Akbilgic, O. and H. Bozdogan (2011).
Predictive Subset Selection using Regression Trees and RBF Neural Networks Hybridized with the Genetic Algorithm.European Journal of Pure and Applied Mathematics 4(4): 467 – 486.
Biernacki, C., C. Celeux, and G. Govaert (2003).
Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models.Computational Statistics & Data Analysis, 41(1): 561 – 575.
-  Biswas, S. (2013). nbreg.m – Negative Binomial Regression. MATLAB Central File Exchange. Accessed 2015.
-  Bozdogan, H. (1987). Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3): 345 – 370.
Bozdogan, H. (1994).
Mixture-model cluster analysis using model selection criteria and a new informational measure of complexity, inProceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, Vol. 2, H. Bozdogan (ed.). Kluwer Academic Publishers, Dordrecht, NED, 1994, pp. 69 – 113.
-  Bozdogan, H. (2000). Akaike’s information criterion and recent developments in information complexity. Journal of Mathematical Psychology: 62 – 91.
Broadhurst, D., R. Goodacre, A. Jones, J. J. Rowland, and D. B. Kell (1997).
Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry.Analytica Chimica Acta 348: 71 – 86.
-  Dohoo, I., W. Martin, and H. Stryhn (2012). Methods in epidemiologic research. VER Inc, Charlottetown, PEI, CAN: 890 pages.
-  Gray, S., T. J. Massaro, I. Chen, C. J. Edholm, R. Grotheer, Y. Zheng, and H. H. Chang (2015). A county-level analysis of persons living with HIV in the southern United States. Submitted to AIDS Care, ID: AC-2015-01-0069. 15 pages.
-  Massaro, T. J., and H. Bozdogan (2015). On the selection of the number of mixtures in Poisson data using information complexity. Unpublished.
-  Papastamoulis, P., M.-L. Martin-Magniette, and C. Maugis-Rabusseau (2014). On the estimation of mixtures of Poisson regression models with large number of components. Computational Statistics and Data Analysis, in press: 10 pages.
-  Park, B.-J., and D. Lord (2009). Application of finite mixture models for vehicle crash data analysis. Accident Analysis and Prevention, 41: 683 – 691.
Paterlini, S., and T. Minerva (2000).
Application of Genetic Algorithm and Neural Network in Forecasting with Good Date.
Proceedings of the 6th WSEAS International Conference on RECENT Advances in Neural Networks, Fuzzy Systems & Evolutionary Computing: 19 – 27.
-  Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2): 461 – 464.
-  Titterington, D. M., A. F. M. Smith, and U. E. Makov (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons Ltd., Chichester, GBR.
Vinterbo, S. and L. Ohno-Machado (1999).
A genetic algorithm to select variables in logistic regression: Example in the domain of myocardial infarction.Journal of the American Medical INformatics Association 6: 984 – 988.
-  Zou, Y., Y. Zhang, and D. Lord (2013). Application of finite mixture of negative binomial regression models with varying weight parameters for vehicle crash data analysis. Accident Analysis and Prevention 50: 1042 – 1051.