Missingness in categorical data is a common problem in many real applications. For examples, in social surveys, data collected by questionnaires are often incomplete because subjects may be unwilling or unable to respond to some items huisman1999item . In biological experiments, data may be incomplete due to high system complexity and low sample quality troyanskaya2001missing . In recommendation system problems marlin2011recommender , analysts often have a dataset as the toy example in Table 1. The goal is to predict the potential purchase behavior without direct observation, e.g., whether Customer3 and Customer4 would buy Book5.
When dealing with datasets with missingness, naive approaches, such as the complete-case analysis (CCA) and overall mean imputation, would waste the information in the missing data and may bias the inference donders2006review . Advanced methods, such as multiple imputation rubin1987multiple
, impose a parametric model on the data and then draw multiple sets of samples to account for the uncertainty of the missing information. For categorical cases,schafer1997analysis
advocated the log-linear model for multiple imputation, which can capture certain types of association among the categorical variables. However, this model works only when the number of variables is small, as the full multi-way cross-tabulation required for the log-linear analysis increases exponentially with the number of variablesvermunt2008multiple . An improved method is known as Multiple Imputation by Chained Equation (MICE), which specifies a sequence of regression models iteratively van1999flexible . buuren2011mice provided the R package mice
to implement this method efficiently. Although it has been shown to work well for many datasets and become a popular method, MICE still has some drawbacks. A typical application of MICE is to use multinomial logistic regression for the categorical data, but the relationship among variables may be nonlinear and may involve complex interaction or higher-order effectsmurray2016multiple . Besides, there is no guarantee that the iterations of sequential regression model will converge to the true posterior distribution of missing values vermunt2008multiple .
To capture general relationship among categorical variables, vermunt2008multiple proposed a latent class model, i.e., the finite mixture of product-multinomial model. The latent class model can characterize both simple association and complex higher-order interactions among the categorical variables as long as the number of latent class is large enough mclachlan2000finite . dunson2009nonparametric proposed the Dirichlet Process Mixture of Products of Multinomial distributions (DPMPM) to model complete multivariate categorical datasets, which avoids the arbitrary setting of the number of mixture components. Furthermore, dunson2009nonparametric proved that any multivariate categorical data distribution can be approximated by DPMPM for a sufficiently large number of mixture components. si2013nonparametric generalized the DPMPM framework to analyze incomplete categorical datasets, which works well for low missingness but performs poorly for high missingness since the number of parameters would increase dramatically. Other related works include the divisive latent class model van2015divisive , the hierarchically coupled mixture model with local dependence murray2016multiple and so on. vidotto2014multiple presented a detailed overview of recent researches on multiple imputation using the latent class model.
In this paper, we propose a Bayesian nonparametric approach, DPMCPM, which extends DPMPM for modelling incomplete multivariate categorical data efficiently. DPMCPM inherits some nice properties of DPMPM. It avoids the arbitrary setting of the number of mixture components by using the Dirichlet process. In addition, DPMCPM is applicable for any categorical data regardless of the true distribution and can capture the complex association among variables. Different from the slice Gibbs sampler implementation of DPMPM in dunson2009nonparametric , DPMCPM formulates the Dirichlet process using the Chinese restaurant process, which sidesteps the drawback of the slice Gibbs sampler walker2007sampling . Under the framework of the latent class analysis, we show that DPMCPM can model general missing mechanisms by creating an extra category to denote the missingness. Through simulation studies and a real application, we demonstrated DPMCPM performs better statistical inference and imputation than existing approaches. As to our knowledge, this is the first non-parametric tool which can model arbitrary categorical distributions and handle high missingness.
2.1 Dirichlet Process Mixture of Product-Multinomials
We begin with the finite mixture product-multinomial model for the case of complete dataset , where and . Suppose comprises of independent samples associated with categorical variables, and the -th variable has categories. Let denote the observed category for the -th sample in the -th variable. The finite mixture product-multinomial model assumes that those are generated from a multinomial distribution indexed by a latent variable . A finite mixture model with latent components can be expressed as:
where and . We further define and .
dunson2009nonparametric proved that any multivariate categorical data distribution can be represented by the mixture distribution in Equation 1 and 2 for a sufficiently large . However, specifying a good to avoid over-fitting and over-simplification is non-trivial, and it becomes even harder when the dataset is highly incomplete dunson2009nonparametric ; si2013nonparametric . This motivates the use of an infinite extension of the finite mixture model, i.e., the Dirichlet process mixture. A Dirichlet process can be represented by various schemes, including the Pólya urn scheme, the Chinese restaurant process and the stick-breaking construction teh2006hierarchical . dunson2009nonparametric chose the stick-breaking construction to model the Dirichlet process. However, in practice, the slice Gibbs sampler in their construction may often be trapped in a single component when is relatively large due to numeric limits and thus fail to identify the correct number of components (see the supplementary materials). To avoid this drawback, we construct the Dirichlet process using the Chinese restaurant process in DPMCPM:
where denotes the number of previously occupied components and denotes the number of samples in the -th component excluding the -th sample. Here is a prior hyper-parameter that acts as the pseudo-count of the number of samples in the new component regardless of the input data.
2.2 Multinomials with Incomplete Data
rubin1976inference introduced a class of missing mechanisms, which are commonly referred as Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). MCAR is the most restrictive, which assumes that missingness does not depend on the missing or observed data. MAR is less restrictive than MCAR, which assumes that missingness only depends on the observed data. MNAR is the least restrictive, which assumes that missingness depends on both the missing and observed data.
where the extra -th category denotes the case when is missing. This idea comes from formann2007mixture for the log-linear model but we provide a strict proof for the first time. In fact, because any relationship among variables can be approximated by the mixture product-multinomial model for a sufficiently large , DPMCPM can also accommodate any dependencies among the -th categories and other parameters. We have the following theorem:
Theorem: DPMCPM can capture the MNAR, MAR and MCAR missing mechanisms.
A construction proof for this theorem is provided in the supplementary materials.
Once we get the parameter estimates using Equation 4
, we shall rescale the multinomial probabilities as follows to recover the corresponding parameters in Equation1:
This procedure implicitly integrates out the missing values according their estimated conditional distributions. As compared to si2013nonparametric , this “collapse” step dramatically shrinks the dimension of the posterior sampling space, thus enables DPMCPM to handle high missingness.
Based on the estimation of and , we can obtain an approximate of the true distribution up to the Monte Carlo error. With this estimated distribution, any statistical inference and imputation can be performed. For example, if is missing and a single imputation is desired, we can impute by
where denotes all of the observed data in the -th sample.
2.3 Posterior Inference
We use the Bayesian approach to fit the non-parametric model in Equation2-4
to the data. More specifically, we introduce prior distributions for unknown parameters, and then use a Markov chain Monte Carlo algorithm to sample from their posterior distribution. The converged samples will be used for statistical inference.
For prior distributions, we assume the conjugate prior for, i.e.,
for and . If the sample size is small, we suggest using a flat and weak prior by setting for all . For the Dirichlet prior in Equation 3, teh2006hierarchical suggested that the hyper-parameter should be a small number, thus we set by default.
Since ’s are independent conditional on the latent component ’s and ’s as in Equation 4, the joint likelihood of given and can be written as
Since and are independent, the augmented joint posterior distribution is .
To sample from the joint posterior distribution, we use the following Gibbs sampler algorithm:
In Step 2, for , the conditional posterior distribution is derived as follows:
For , we have:
In Step 4, the conditional posterior distribution comes from:
We implemented the above Gibbs sampler algorithm in the R package MMDai.
dunson2009nonparametric proved that any multivariate categorical distribution can be decomposed to a finite mixture product-multinomial model for some . It should be noted that this decomposition is not identifiable if no restrictions are placed, where “identifiable” means the decomposition is unique up to label-switching. In fact, elmore2003identifiability proved that the -component finite mixture of univariate multinomial distribution is identifiable if and only if the number of trials in multinomial distributions satisfies the condition that . For the multivariate multinomial distribution, we can also prove that it is identifiable if and only if for all . In most real applications, we only have and thus the decomposition is always unidentifiable.
Fortunately, identifiability is not a problem in our analysis. Firstly, we are only interested in the estimation of the joint distribution, which is unique though the decomposition is not. The main issue here is whether the joint distribution can be approximated well enough.vermunt2008multiple discussed the above difference between estimating joint distribution and clustering when the latent class model is adopted. Secondly, with the Dirichlet prior in our Bayesian approach, different decompositions are weighted unequally and the simpler model is preferred. For the multiple non-identifiable decompositions, our algorithm usually converges to the decomposition with the smallest .
In this section, we check the performance of our method on imputation and correlation estimation, and compare it with other methods based on synthetic datasets. In each simulation experiment, we first synthesize a set of complete data following a given data model, then mask a certain percentage of the observations according to the corresponding missing mechanism to generate the incomplete dataset as our input data, finally use the masked entries to test the performance of statistical inference and imputation. We consider the following missing mechanisms in simulation studies:
MCAR: The missingness does not depend on observed and missing values,
for all and .
MAR: The first variable is fully observed and the missingness in other variables depends on the observation in the first variable:
for and .
MNAR: The missingness depends on the missing value itself:
for all and .
Considering the uncertainty of Monte Carlo simulations, we repeat the experiments, including the data synthesis step, for 100 times independently for each data model. To compare the performance with existing methods, we also apply the DPMPM model in si2013nonparametric and MICE on these datasets. See the supplementary materials for the detailed R code.
3.1 Case 1: Data from Mixture Model
In this case, we generate binary data ( for all ) from a finite mixture of product-multinomial distributions with and . The true parameters in the mixture model are set as follows: the number of components , and are sampled from and for and , respectively.
First we show that DPMCPM can identify the correct number of components in the mixture model. Figure 1 presents the histogram of the estimated in 100 replications under different missing mechanisms. In most cases, DPMCPM can capture the true number of components . In other cases, DPMCPM may select the simpler model to interpret datasets.
Then we show that DPMCPM can outperform DPMPM and MICE in the imputation of missing values, where the imputation accuracy is defined as follows:
presents the mean and standard error of imputation accuracy of the three methods for 100 replications under different missing mechanisms. It shows that DPMCPM significantly outperformed other methods.
Besides the imputation accuracy, we can also compare the performance on other statistical inference problems. From DPMCPM, we obtain an estimation of the multivariate multinomial distribution with the estimates of and . Although the distribution estimation can be a little different from the true distribution due to the existence of missingness and the limited sample size, DPMCPM can perform more or better statistical inference than DPMPM and MICE based on the estimated distribution. Table 3 presents the gap between the estimated correlation matrix and the true correlation matrix. The gap is defined as follows:
where and are the estimated and true correlation between the -th and -th variables, respectively. The mean and standard error of the gap for 100 replications are summarized in Table 3. Because MICE can not provide an estimation of the distribution directly, we use instead the sample distribution from its imputation. The table shows that DPMCPM have better correlation estimation than other methods under all missing mechanisms.
3.2 Case 2: Nonlinear Association
In addition to the data generated from mixture models, we also perform simulation studies on more general cases. In practice, it is cumbersome to design and simulate general categorical data distribution with large
due to the exponentially increasing size of the contingency table. Here we design a simple nonlinear experiment to demonstrate the power of DPMCPM in this aspect.
As an example, we generate three Bernoulli random variables,, and , as follows. Assume that and are independent, where denotes that
follows a Bernoulli distribution with probability. Let be the output of the exclusive-or operator on and with probability 95% and be an independent Bernoulli random error, i.e., , with probability 5%. Essentially, this is a data model with strong nonlinear association. We set , and the same missing mechanisms as in Section 3.1. The mean and standard error of the imputation accuracy in this study are summarized in Table 4. The results showed that MICE did not capture the nonlinear association among variables while DPMCPM did it well. DPMCPM also performs better than DPMPM here.
4 Application on Recommendation System Problem
In this section, we present a real application on a recommendation system problem. harper2016movielens contributed a dataset about the ratings of movies. The entire ratings table in harper2016movielens contains 20000263 ratings on 27278 movies from 138493 users. For the purpose of cross-validation, we extract a subset of data with low missingness such that we mask a high percent of them for evaluating the imputation accuracy. First, we select the movies which were rated by more than 25% users. Then we remain the users who rated more than 95% of the selected movies. The resulting dataset contains 68861 ratings on 38 movies from 1837 users, where the percentage of missingness is 1.35%. The data matrix MovieRate in the MMDai package is the resulted dataset.
Since the original ratings in harper2016movielens are ordinal data made on a 5-star scale with half-star increments (0.5 stars - 5.0 stars), we transform the ordinal data into two categories according to a cutoff: dislike (4.0 stars) and like (4.0 stars). To evaluate the performance of DPMCPM, we mask 40% of the ratings in MovieRate under MCAR, resulting in the incomplete “observed data”, where the percentage of missing values is 40.81%. Predicting the unobserved favor of a user on a particular movie is equivalent to impute a missing value in incomplete data.
The analyses in Section 3 are repeated on this real dataset. The data MovieRate is masked and imputed for 100 times independently. The mean and standard error of the imputation accuracy are summarized in Table 5. To check the effect of the cutoff used to dichotomize the original ordinal data, we repeated the experiments by varying the cutoff from 4.0 to 3.5 and 3.0. Table 5 shows that DPMCPM has significantly higher imputation accuracy than MICE on MovieRate. The performance improvement is not sensitive to the choice of the cutoff. The improved accuracy can be very helpful for reducing the cost of bad recommendations in real life.
Except for the imputation of missing values, we also explore the clustering structure among movies according to the fitting of DPMCPM. Figure 2 shows the heatmap plots of an input data from one of the 100 experiments (both unordered and ordered versions) and the true data MovieRate, where the rows correspond to users and the columns correspond to movies. The order is based on the latent class inferred by DPMCPM. The figure demonstrated that DPMCPM could detect the cluster structure efficiently despite the high missingness in the input data.
To demonstrate the usage of the estimated distribution from DPMCPM, we performed the Fisher exact test to test the independence of each pair of movies according to the estimated joint distribution. The five pairs with the least and largest-values are reported in Table 6. The association results agree with common sense based on the nature of these movies.
|Movie 1||Movie 2||-value|
|Star Wars: Episode IV (1977)||Star Wars: Episode V (1980)||0.00004|
|Star Wars: Episode V (1980)||Star Wars: Episode VI (1983)||0.00023|
|Star Wars: Episode IV (1977)||Star Wars: Episode VI (1983)||0.00069|
|The Fugitive (1993)||Jurassic Park (1993)||0.00086|
|Jurassic Park (1993)||Men in Black (1997)||0.00106|
|Twelve Monkeys (1995)||Ace Ventura: Pet Detective (1994)||1.00000|
|Apollo 13 (1995)||Fight Club (1999)||1.00000|
|Pulp Fiction (1994)||Mission: Impossible (1996)||1.00000|
|Pulp Fiction (1994)||Independence Day (1996)||1.00000|
|Speed (1994)||Fight Club (1999)||1.00000|
In this paper, we introduced the DPMCPM method, which implemented multivariate multinomial distribution estimation for categorical data with missing values. DPMCPM inherits some nice properties of DPMPM. It avoids the arbitrary setting of the number of mixture components by using the Dirichlet process. Also, DPMCPM is applicable for any categorical data regardless of the true distribution and can capture the complex association among variables. Unlike DPMPM, DPMCPM can model general missing mechanisms and handle high missingness efficiently, thus DPMCPM can achieve more accurate results in empirical studies. Through simulation studies and a real application, we demonstrated that DPMCPM performed better statistical inference and imputation than other methods.
It shall be noted that approximating a general distribution of high-dimensional categorical data is still a hard task, although DPMCPM could approximate the true general distribution at arbitrary precision theoretically if is big enough. To estimate the underlying general distribution accurately, the required sample size still increases exponentially with the number of variables . Fortunately, in many applications, there is underlying cluster structures in the dataset. In these cases, the underlying distribution is close to a mixture of product-multinomials, thus the sample size needed for DPMCPM to estimate the underlying distribution accurately is much smaller. For example, in recommendation system problems, it is reasonable to assume that the customers with similar historical activities have similar tastes. For handling such problems, DPMCPM would have more advantages than traditional approaches. Future work shall try to improve the efficiency for the case that there is no underlying cluster structures in the high-dimensional dataset.
This research is partially supported by two grants from the Research Grants Council of the Hong Kong SAR, China (CUHK400913, CUHK14203915) and two direct grants from The Chinese University of Hong Kong (4053135, 3132753).
The supplemental materials include a discussion on the drawback of the stick-breaking construction of Dirichlet process in dunson2009nonparametric , a construction proof for the statement latent class model can capture general missing mechanisms in Section 2.2 and all other independence test results in Table 6. The related R codes are also provided with instructions.
- (1) M. Huisman, Item Nonresponse: Occurrence, Causes, and Imputation of Missing Answers to Test Items, DSWO Press, Leiden, 1999.
- (2) O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R. B. Altman, Missing value estimation methods for dna microarrays, Bioinformatics 17 (6) (2001) 520–525. doi:10.1093/bioinformatics/17.6.520.
B. M. Marlin, R. S. Zemel, S. T. Roweis, M. Slaney, Recommender systems, missing data and statistical model estimation, in: IJCAI Proceedings-International Joint Conference on Artificial Intelligence, Barcelona, Spain, Vol. 22, 2011, p. 2686.
- (4) A. R. T. Donders, G. J. van der Heijden, T. Stijnen, K. G. Moons, Review: A gentle introduction to imputation of missing values, Journal of Clinical Epidemiology 59 (10) (2006) 1087–1091. doi:10.1016/j.jclinepi.2006.01.014.
- (5) D. B. Rubin, Multiple Imputation for Nonresponse in Surveys, John Wiley & Sons, New York, 1987.
- (6) J. L. Schafer, Analysis of Incomplete Multivariate Data, CRC Press, FL, 1997.
- (7) J. K. Vermunt, J. R. Van Ginkel, V. Der Ark, L. Andries, K. Sijtsma, Multiple imputation of incomplete categorical data using latent class analysis, Sociological Methodology 38 (1) (2008) 369–397. doi:10.1111/j.1467-9531.2008.00202.x.
- (8) S. Van Buuren, K. Oudshoorn, Flexible Mutlivariate Imputation by MICE, TNO, Leiden, 1999.
- (9) S. Van Buuren, K. Groothuis-Oudshoorn, mice: Multivariate imputation by chained equations in r, Journal of Statistical Software 45 (3).
- (10) J. S. Murray, J. P. Reiter, Multiple imputation of missing categorical and continuous values via bayesian mixture models with local dependence, Journal of the American Statistical Association 111 (516) (2016) 1466–1479. doi:10.1080/01621459.2016.1174132.
- (11) G. McLachlan, D. Peel, Finite Mixture Models, John Wiley & Sons, New York, 2000.
- (12) D. B. Dunson, C. Xing, Nonparametric bayes modeling of multivariate categorical data, Journal of the American Statistical Association 104 (487) (2012) 1042–1051. doi:10.1198/jasa.2009.tm08439.
- (13) Y. Si, J. P. Reiter, Nonparametric bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys, Journal of Educational and Behavioral Statistics 38 (5) (2013) 499–521. doi:10.3102/1076998613480394.
- (14) D. W. Van der Palm, L. A. Van der Ark, J. K. Vermunt, Divisive latent class modeling as a density estimation method for categorical data, Journal of Classification (2015) 1–21doi:10.1007/s00357-016-9195-5.
- (15) D. Vidotto, J. K. Vermunt, M. C. Kaptein, Multiple imputation of missing categorical data using latent class models: State of art, Psychological Test and Assessment Modeling.
- (16) S. G. Walker, Sampling the dirichlet mixture model with slices, Communications in Statistics-Simulation and Computation 36 (1) (2007) 45–54. doi:10.1080/03610910601096262.
- (17) Y. W. Teh, M. I. Jordan, M. J. Beal, D. M. Blei, Hierarchical dirichlet processes, Journal of the American Statistical Association 101 (476) (2006) 1566–1581. doi:10.1198/016214506000000302.
- (18) D. B. Rubin, Inference and missing data, Biometrika 63 (3) (1976) 581–592. doi:10.1093/biomet/63.3.581.
- (19) A. K. Formann, Mixture analysis of multivariate categorical data with covariates and missing entries, Computational Statistics & Data Analysis 51 (11) (2007) 5236–5246. doi:10.1016/j.csda.2006.08.020.
- (20) R. T. Elmore, S. Wang, Identifiability and estimation in finite mixture models with multinomial coefficients, Tech. rep., 03-04, Penn State University (2003).
- (21) F. M. Harper, J. A. Konstan, The movielens datasets: History and context, ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (4) (2016) 19.
Drawback of the stick-breaking construction
Here we explain why the following stick-breaking construction of Dirichlet process in dunson2009nonparametric may fail in practice:
They introduced a vector of latent variables,, and defined the joint likelihood of u and X given V & as
where , for .
Their slice Gibbs sampler iterates through the following sampling steps as listed in dunson2009nonparametric :
Update for , by sampling from the conditional posterior Uniform(0,).
For with , update from the full conditional posterior distribution of the -th response for subjects in the -th component:
For , sample from the full conditional posterior, which is truncated to fall into the interval
To update for , sample from the multinomial full conditional with
where . To identify the elements in , first update , for , where is the smallest value satisfying .
Unfortunately, this slice Gibbs sampler tends to be trapped in the case of in practice. To illustrate this, assume at some . Then for all and . In Step 1, we sample from Uniform(0,1). is a random variable and we have the probability
When is large, at an exponential rate. Then in Step 3, the interval converges to the point . Since , the probability of new component is , we have
When is large, at an exponential rate. For the sampling in Step 3, because of numeric accuracy limitation in practice, it is easy to obtain and then . Thus, the whole Gibbs sampler chain is trapped in the case of and can not identify the correct number of components.
In our method, we use the Chinese restaurant process to construct the Dirichlet process. The probability of proposing a new component is , which is a fixed constant and it avoids the above sampling problem. Thus it produces a Markov chain with better mixing across the spaces of different numbers of components.
Proof of the theorem
In this section, we show that the general missing mechanisms can be captured by the latent class model with an extra category denoting the missingness. This idea comes from formann2007mixture for the log-linear model but they did not provide a strict proof. Here we present a construction proof for this statement.
Assume that there are variables and the -th variable has categories. Let , then .
For all missing mechanisms belonging to the MCAR, MAR or MNAR family, the missing rate can be written as , which is a function of . For each variable, let
For any joint distribution with the missing mechanism Q where is a table with cells, we can estimate a new joint probability table by adding an extra category to each variable to denote the corresponding missingness:
We need to prove that there exists a set of parameters that can recover the original joint distribution table and the missing mechanism Q after rescaling the extra categories in the latent classes. In other words, given and Q, we shall prove that there exists a set of parameters under our parameterization.
Note that for any general joint distribution of categorical variables, we always have the following decomposition:
Define a series of latent classes where is a notation of latent class. Here we design a set of parameters :
where is a -dimension vector. For the vector , we have and with other elements are 0.
Then we re-scale the extra category where . Thus, we have where is a binary vector of length with only the -th element being 1.
Under this setting, the joint distribution is
In this decomposition, the posterior probability of latent class
For the missing mechanism, on the level of latent class, we have
Thus, integrating the level of latent class, we have
Thus, any joint distribution with general missing mechanism can be captured by our latent class model. Proven.
This proof is a constructive proof. We show that there exists a construction of the latent class model that can capture any general missing mechanism under any joint distributions. In specific cases, the solution may be not unique since there may be multiple decompositions of the same joint distribution, and may be not optimal in terms of the Bayes factor. Actually, this proof shows that the estimate can always attain the true joint distribution with the right missing mechanism.