Variable subset selection via GA and information complexity in mixtures of Poisson and negative binomial regression models

05/20/2015
by T. J. Massaro, et al.

Count data, for example the number of observed cases of a disease in a city, often arise in the fields of healthcare analytics and epidemiology. In this paper, we consider performing regression on multivariate data in which our outcome is a count. Specifically, we derive log-likelihood functions for finite mixtures of regression models involving counts that come from a Poisson distribution, as well as a negative binomial distribution when the counts are significantly overdispersed. Within our proposed modeling framework, we carry out optimal component selection using the information criteria scores AIC, BIC, CAIC, and ICOMP. We demonstrate applications of our approach on simulated data, as well as on a real data set of HIV cases in Tennessee counties from the year 2010. Finally, using a genetic algorithm within our framework, we perform variable subset selection to determine the covariates that are most responsible for categorizing Tennessee counties. This leads to some interesting insights into the traits of counties that have high HIV counts.
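To make the quantities named in the abstract concrete, here is a minimal sketch of a log-likelihood for a finite mixture of Poisson regressions, together with the AIC, BIC, and CAIC scores computed from it. It is an illustration under stated assumptions, not the authors' implementation: the log link, the function names (`poisson_logpmf`, `mixture_loglik`, `information_criteria`), and the parameter count K*p + (K - 1) are choices made here for the example. ICOMP is left out because it also needs the estimated inverse Fisher information matrix, which this sketch does not compute.

```python
import math


def poisson_logpmf(y, eta):
    # log P(Y = y) for a Poisson whose mean is exp(eta) (log link, assumed here).
    return y * eta - math.exp(eta) - math.lgamma(y + 1)


def mixture_loglik(X, y, weights, betas):
    """Log-likelihood of a K-component mixture of Poisson regressions.

    X       : list of covariate rows (include a 1.0 for the intercept)
    y       : list of observed counts
    weights : K mixing proportions that sum to 1
    betas   : K coefficient vectors, one per component
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        # log(pi_k) + log f_k(y_i | x_i) for each component k
        terms = []
        for pi_k, beta_k in zip(weights, betas):
            eta = sum(b * x for b, x in zip(beta_k, x_i))
            terms.append(math.log(pi_k) + poisson_logpmf(y_i, eta))
        # log-sum-exp keeps the sum over components numerically stable
        m = max(terms)
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total


def information_criteria(loglik, n_params, n_obs):
    # Standard penalized-likelihood scores; lower is better.
    return {
        "AIC": -2.0 * loglik + 2.0 * n_params,
        "BIC": -2.0 * loglik + n_params * math.log(n_obs),
        "CAIC": -2.0 * loglik + n_params * (math.log(n_obs) + 1.0),
    }


# Toy usage with made-up numbers (hypothetical data, not from the paper).
X = [[1.0, 0.2], [1.0, 1.5], [1.0, 0.7], [1.0, 2.1]]
y = [1, 6, 2, 9]
weights = [0.6, 0.4]
betas = [[0.1, 0.8], [0.5, 1.0]]

ll = mixture_loglik(X, y, weights, betas)
k = len(betas) * len(betas[0]) + (len(weights) - 1)  # K*p + (K - 1) free parameters
scores = information_criteria(ll, k, len(y))
```

In practice the parameters would be estimated (for example with an EM algorithm) for each candidate number of components, and the number of components with the lowest score would be selected.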

research
03/27/2020

Transition Models for Count Data: a Flexible Alternative to Fixed Distribution Models

A flexible semiparametric class of models is introduced that offers an a...
research
04/15/2020

A parsimonious family of multivariate Poisson-lognormal distributions for clustering multivariate count data

Multivariate count data are commonly encountered through high-throughput...
research
01/10/2020

Review of Probability Distributions for Modeling Count Data

Count data take on non-negative integer values and are challenging to pr...
research
07/19/2021

Dim but not entirely dark: Extracting the Galactic Center Excess' source-count distribution with neural nets

The two leading hypotheses for the Galactic Center Excess (GCE) in the F...
research
07/29/2020

Regression-based imputation of explanatory discrete missing data

Imputation of missing values is a strategy for handling non-responses in...
research
08/27/2021

Machine learning on DNA-encoded library count data using an uncertainty-aware probabilistic loss function

DNA-encoded library (DEL) screening and quantitative structure-activity ...
research
08/06/2019

Statistical modeling of groundwater quality assessment in Iran using a flexible Poisson likelihood

Assessing water quality and recognizing its associated risks to human he...
