Multiple Imputation and Synthetic Data Generation with the R package NPBayesImputeCat

07/12/2020
by   Jingchen Hu, et al.
Duke University
Vassar College

In many contexts, missing data and disclosure control are ubiquitous and difficult issues. In particular at statistical agencies, the respondent-level data they collect from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents' privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines to i) create multiple imputations for missing data, and ii) create synthetic data for disclosure control, for multivariate categorical data, with or without structural zeros. We describe the Dirichlet Process mixtures of products of multinomials (DPMPM) models used in the package, and illustrate various uses of the package using data samples from the American Community Survey (ACS).

0.1 Introduction and background

0.1.1 Multiple imputation for missing data

Missing data problems arise in many statistical analyses. To impute missing values, multiple imputation, first proposed by Rubin (1987), has been widely adopted. This approach estimates predictive models based on the observed data, fills in missing values with draws from the predictive models, and produces multiple imputed, completed datasets. Data analysts then apply standard statistical analyses (e.g., a regression analysis) on each imputed dataset, and use appropriate combining rules to obtain valid point estimates and variance estimates (Rubin 1987).

As a brief review of the multiple imputation combining rules for missing data, let $q^{(l)}$ be the completed-data estimator of some estimand of interest $Q$, and let $u^{(l)}$ be the estimator of the variance of $q^{(l)}$, computed from the $l$th completed dataset, for $l = 1, \dots, m$. The multiple imputation estimate of $Q$ is $\bar{q}_m = \sum_{l=1}^{m} q^{(l)} / m$, and the estimated variance associated with $\bar{q}_m$ is $T_m = (1 + 1/m)\, b_m + \bar{u}_m$, where $b_m = \sum_{l=1}^{m} (q^{(l)} - \bar{q}_m)^2 / (m - 1)$ and $\bar{u}_m = \sum_{l=1}^{m} u^{(l)} / m$. Inferences for $Q$ are based on $(\bar{q}_m - Q) \sim t_{\nu_m}(0, T_m)$, where $t_{\nu_m}$ is a $t$-distribution with $\nu_m = (m - 1)\left(1 + \bar{u}_m / \left[(1 + 1/m)\, b_m\right]\right)^2$ degrees of freedom.

For more details and reviews of multiple imputation, see Rubin (1996); Harel and Zhou (2007); Reiter and Raghunathan (2007).

0.1.2 Synthetic data for disclosure control

The multiple imputation methodology has been generalized to disclosure control. After collecting respondent-level microdata from surveys and censuses, statistical agencies are obliged to protect survey respondents’ privacy when making these respondent-level microdata available to the public. One approach to facilitating microdata release is to provide synthetic data. First proposed by Little (1993) and Rubin (1993), the synthetic data approach estimates predictive models based on the original, confidential data, simulates synthetic values with draws from the predictive models, and produces multiple synthetic datasets. Data analysts then apply standard statistical analyses (e.g., a regression analysis) on each synthetic dataset, and use appropriate combining rules (different from those in multiple imputation) to obtain valid point estimates and variance estimates (Reiter and Raghunathan 2007; Drechsler 2011).

When dealing with fully synthetic data, $\bar{q}_m$ estimates $Q$ as in the multiple imputation setting, but the estimated variance associated with $\bar{q}_m$ becomes $T_f = (1 + 1/m)\, b_m - \bar{u}_m$, where $b_m$ and $\bar{u}_m$ are defined as in the previous section on multiple imputation. Inferences for $Q$ are now based on $(\bar{q}_m - Q) \sim t_{\nu_f}(0, T_f)$, where the degrees of freedom is $\nu_f = (m - 1)\left(1 - \bar{u}_m / \left[(1 + 1/m)\, b_m\right]\right)^2$.

For partially synthetic data, however, $\bar{q}_m$ still estimates $Q$, but the estimated variance associated with $\bar{q}_m$ is $T_p = b_m / m + \bar{u}_m$, where $b_m$ and $\bar{u}_m$ are defined as in the multiple imputation setting. Inferences for $Q$ are based on $(\bar{q}_m - Q) \sim t_{\nu_p}(0, T_p)$, where the degrees of freedom is $\nu_p = (m - 1)\left(1 + \bar{u}_m / (b_m / m)\right)^2$.
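
The three sets of combining rules differ only in the total variance and the degrees of freedom. As a compact reference, the following is a minimal R sketch collecting them in one helper; the function name combine_estimates and its interface are ours, not part of NPBayesImputeCat.

combine_estimates <- function(q, u, type = c("mi", "fully", "partially")){
  ## q: vector of m point estimates, one per completed or synthetic dataset
  ## u: vector of m within-dataset variance estimates
  type <- match.arg(type)
  m <- length(q)
  qbar <- mean(q)   # combined point estimate
  b <- var(q)       # between-dataset variance
  ubar <- mean(u)   # average within-dataset variance
  if (type == "mi"){              # multiple imputation for missing data
    Tvar <- (1 + 1/m) * b + ubar
    v <- (m - 1) * (1 + ubar / ((1 + 1/m) * b))^2
  } else if (type == "fully"){    # fully synthetic data
    Tvar <- (1 + 1/m) * b - ubar  # note: can be negative in small samples
    v <- (m - 1) * (1 - ubar / ((1 + 1/m) * b))^2
  } else {                        # partially synthetic data
    Tvar <- b / m + ubar
    v <- (m - 1) * (1 + ubar / (b / m))^2
  }
  list(estimate = qbar, variance = Tvar, df = v,
       ci95 = qbar + c(-1, 1) * qt(0.975, v) * sqrt(Tvar))
}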

0.1.3 Structural zeros

An important feature of survey data is the existence of impossible combinations of variables, also known as structural zeros. For example, in the cross-tabulation of pregnancy status and gender, there should not exist a pregnant male. For household surveys, in the cross-tabulation of relationship and age, there should not exist a household where a son is older than his biological father. To deal with structural zeros, statistical models should be designed to assign zero probability to impossible combinations, which is a challenging task.

0.1.4 What NPBayesImputeCat does

The NPBayesImputeCat package specializes in estimating joint latent class models for multivariate categorical data, and in using them for multiple imputation and synthetic data generation. It specifies a joint latent class model on the multivariate categorical variables, and uses Dirichlet Process priors to allow effective clustering of the observations. For multiple imputation, the package handles data both without and with structural zeros. For synthetic data, the package currently only handles data without structural zeros.

0.1.5 The structure of this paper

The rest of the paper is organized as follows. We first introduce the joint latent class models for multivariate categorical data that the NPBayesImputeCat package implements, that is, the Dirichlet Process mixtures of products of multinomials (DPMPM) models. This section presents the details of the two versions of the DPMPM models: i) without structural zeros, and ii) with structural zeros. In addition, we review applications of multiple imputation and synthetic data generation using the DPMPM in the literature. Next, we introduce the sample datasets used in the illustrations. We then provide illustrations of multiple imputation using the package, followed by illustrations of synthetic data generation. The paper concludes with a summary and discussion.

0.2 DPMPM models for multivariate categorical data

Dirichlet Process mixtures of products of multinomials (DPMPM) models are Bayesian versions of latent class models developed for multivariate categorical data. Proposed by Dunson and Xing (2009), the DPMPM models are non-parametric joint models for multivariate categorical variables. To allow effective clustering of the observations based on all categorical variables, Dirichlet Process (DP) priors are specified for the mixture probabilities and the multinomial probability vectors of the categorical data. DPMPM models have been demonstrated to capture the complex dependencies in multivariate categorical data while being computationally efficient. The models also empower the data to select the number of latent classes to be used in the model estimation.

The DPMPM models have also been extended to account for structural zeros in categorical data (Manrique-Vallier and Reiter 2014). Structural zeros refer to impossible combinations of categorical variables that cannot exist in a data sample. For example, if a dataset contains information on a record’s gender and pregnancy status in the form of categorical variables, there can be no record with the combination of male and pregnant. As another example, if a dataset contains information on a record’s age and educational attainment in the form of categorical variables, there can be no record with the combination of being younger than 5 and having a doctorate degree.

The NPBayesImputeCat package includes two versions of the DPMPM models: i) DPMPM without structural zeros, and ii) DPMPM with structural zeros. In this section, we introduce the details of the two versions of the DPMPM models and review previous work of DPMPM for multiple imputation and synthetic data.

0.2.1 DPMPM models without structural zeros

The description of the DPMPM models without structural zeros follows closely the description in Hu and Hoshino (2018). Consider a sample $\mathbf{X}$ consisting of $n$ records, where each $i$th record, with $i = 1, \dots, n$, has $p$ unordered categorical variables. The basic assumption of the DPMPM is that every record $\mathbf{X}_i = (X_{i1}, \dots, X_{ip})$ belongs to one of $K$ underlying unobserved/latent classes. Given the latent class assignment $z_i$ of record $i$, as in Equation (2), each variable $X_{ij}$ independently follows a multinomial distribution, as in Equation (1), where $d_j$ is the number of categories of variable $j$, and $j = 1, \dots, p$:

$$X_{ij} \mid z_i, \theta \sim \text{Multinomial}\left(\theta_{z_i 1}^{(j)}, \dots, \theta_{z_i d_j}^{(j)}; 1\right) \quad \text{for all } i \text{ and } j, \qquad (1)$$
$$z_i \mid \pi \sim \text{Multinomial}(\pi_1, \dots, \pi_K; 1) \quad \text{for all } i. \qquad (2)$$

The marginal probability $\Pr(X_{i1} = x_{i1}, \dots, X_{ip} = x_{ip} \mid \pi, \theta)$ can be expressed as averaging over the latent classes:

$$\Pr(X_{i1} = x_{i1}, \dots, X_{ip} = x_{ip} \mid \pi, \theta) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \theta_{k x_{ij}}^{(j)}. \qquad (3)$$

As pointed out in Si and Reiter (2013), Hu, Reiter, and Wang (2014), and Akande, Li, and Reiter (2017), such averaging over the latent classes results in dependence among the variables. Equation (3) will also help illustrate the DPMPM models with structural zeros in the next section.

The DPMPM effectively clusters records with similar characteristics based on all $p$ variables. Relationships among these categorical variables are induced by integrating out the latent class assignment $z_i$. To empower the DPMPM to pick the effective number of occupied latent classes, the truncated stick-breaking representation (Sethuraman 1994) of the DP prior is used as in Equation (4) through Equation (7),

$$\pi_k = V_k \prod_{l < k} (1 - V_l) \quad \text{for } k = 1, \dots, K, \qquad (4)$$
$$V_k \sim \text{Beta}(1, \alpha) \quad \text{for } k = 1, \dots, K - 1, \quad V_K = 1, \qquad (5)$$
$$\alpha \sim \text{Gamma}(a_\alpha, b_\alpha), \qquad (6)$$
$$\theta_k^{(j)} = \left(\theta_{k1}^{(j)}, \dots, \theta_{k d_j}^{(j)}\right) \sim \text{Dirichlet}(1, \dots, 1) \quad \text{for } j = 1, \dots, p \text{ and } k = 1, \dots, K, \qquad (7)$$

and a blocked Gibbs sampler is implemented for the Markov chain Monte Carlo (MCMC) sampling procedure (Ishwaran and James 2001; Si and Reiter 2013; Hu, Reiter, and Wang 2014; Akande, Li, and Reiter 2017; Manrique-Vallier and Hu 2018; Drechsler and Hu 2018; Hu and Savitsky 2018).

When used as an imputation engine, missing values are handled within the Gibbs sampler. As described in Akande, Li, and Reiter (2017), at MCMC iteration $t$, one samples a value of the latent class indicator $z_i$ using Equation (2), given a draw of the parameters and the observed data. In the same iteration $t$, given the sampled $z_i$, one samples missing values using independent draws from Equation (1). This process is repeated for every missing value in the dataset in iteration $t$, obtaining one imputed dataset.

When used as a data synthesizer, the fully observed confidential dataset is used for model estimation through MCMC, and sensitive variable values are synthesized as an extra step at chosen MCMC iterations. For example, at MCMC iteration $t$, one samples a value of the latent class indicator $z_i$ using Equation (2). Given the sampled $z_i$, one samples synthetic values of the sensitive variables using independent draws from Equation (1). This process is repeated for every record that has sensitive values to be synthesized, obtaining one synthetic dataset.

0.2.2 DPMPM models with structural zeros

When there are structural zeros in the dataset, we need to modify the likelihood in Equation (1) to enforce zero probability for impossible combinations. That is, we need to truncate the support of the DPMPM. Following the general description in Manrique-Vallier and Reiter (2014) and Manrique-Vallier and Hu (2018), let $\mathcal{C}$ represent all combinations of individuals, including impossible combinations; let $\mathcal{S}$ denote the set of impossible combinations to be excluded. We restrict $\mathbf{X}_i$ to the set $\mathcal{C} \setminus \mathcal{S}$, with $\Pr(\mathbf{X}_i \in \mathcal{S}) = 0$. The marginal probability in the DPMPM models without structural zeros in Equation (3) then becomes

$$\Pr(\mathbf{X}_i = \mathbf{x} \mid \pi, \theta) = \frac{\sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \theta_{k x_j}^{(j)}}{\sum_{\mathbf{c} \in \mathcal{C} \setminus \mathcal{S}} \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \theta_{k c_j}^{(j)}} \quad \text{for } \mathbf{x} \in \mathcal{C} \setminus \mathcal{S}. \qquad (8)$$

Let $\mathcal{X}$ denote the sample that only contains possible combinations; the joint likelihood is then

$$p(\mathcal{X} \mid \pi, \theta, \mathcal{S}) \propto \prod_{i=1}^{n} \mathbb{1}\{\mathbf{X}_i \in \mathcal{C} \setminus \mathcal{S}\} \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \theta_{k X_{ij}}^{(j)}. \qquad (9)$$

To get the Gibbs sampler to work, we follow the general data augmentation technique proposed by Manrique-Vallier and Reiter (2014), and assume the existence of an observed sample $\mathcal{X}^0$ of unknown size generated from the DPMPM model without structural zeros (i.e., the unrestricted DPMPM). $\mathcal{X}^0$ only contains records that fall into $\mathcal{S}$.

The same set of DP priors in Equation (4) through Equation (7) is used in the DPMPM models with structural zeros. In the Gibbs sampler, we need to keep the generated $\mathcal{X}^0$ and combine it with $\mathcal{X}$ when estimating the model parameters. When used as either an imputation engine or a data synthesizer, missing values or synthetic data are generated from the truncated likelihood in Equation (9).

0.2.3 Applications of DPMPM for multiple imputation

The DPMPM models have been adapted as multiple imputation engines to deal with missing values in categorical data. Some imputation applications have focused on DPMPM models without structural zeros, while others have dealt with DPMPM models with structural zeros.

Among the work on multiple imputation using DPMPM models without structural zeros, Si and Reiter (2013) applied the DPMPM imputation model to impute missing background data (categorical) in the 2007 Trends in International Mathematics and Science Study (TIMSS). The 2007 TIMSS data contain 80 background variables on 90,505 students. Among the 80 categorical background variables, 68 have fewer than 10% missing values, 6 variables have between 10% and 30% missing values, and 1 variable has more than 75% missing values.

Akande, Li, and Reiter (2017) designed simulation studies using data from the American Community Survey (ACS), and compared the DPMPM imputation engine to two other widely used imputation engines based on multiple imputation by chained equations (MICE): i) chained equations using generalized linear models, and ii) chained equations using classification and regression trees (CART). From a population of 671,153 housing units and 35 categorical variables collected and cleaned from the 2012 ACS data, Akande, Li, and Reiter (2017) performed repeated sampling and empirically compared the three multiple imputation models.

Among the work on multiple imputation using DPMPM models with structural zeros, Manrique-Vallier and Reiter (2014) followed the data augmentation approach described above and imputed missing data for repeated samples from the 5% public use microdata sample from the 2000 United States Census for the state of New York, a population of 953,076 individuals and 10 categorical variables, with the number of levels ranging from 2 to 11.

Murray (2018) provides an excellent review of practical and theoretical findings in multiple imputation research, in which the DPMPM imputation models are highlighted as a recent development.

0.2.4 Applications of DPMPM for synthetic data

The DPMPM models have also been used as synthetic data generators for the public release of useful, privacy-protected micro-level categorical data. Some work focused on DPMPM models without structural zeros, while other work dealt with synthetic data problems using DPMPM models with structural zeros.

Among the work on synthetic data generation using DPMPM models without structural zeros, Hu, Reiter, and Wang (2014) used the DPMPM models to generate fully synthetic data for a subset of 10,000 individuals and 14 categorical variables from the 2012 ACS public use microdata sample for the state of North Carolina. Drechsler and Hu (2018) generated partially synthetic data for large-scale administrative data containing detailed geographic information in Germany. Hu and Savitsky (2018) also used the DPMPM models to generate partially synthetic data for the Consumer Expenditure Surveys (CE) at the U.S. Bureau of Labor Statistics (BLS), disseminating detailed county-level geographic information.

Among the work on synthetic data generation using DPMPM models with structural zeros, Manrique-Vallier and Hu (2018) proposed a data augmentation approach, and generated fully synthetic data for repeated samples from the 5% public use microdata from the 2000 United States Census for the state of California, a population of 1,690,642 records measured on 17 categorical variables, with the number of levels ranging from 2 to 11.

0.3 Two ACS samples for illustrations

Before presenting detailed step-by-step illustrations of using the NPBayesImputeCat package for multiple imputation and data synthesis applications, we introduce two samples from the 2016 1-year American Community Survey (ACS), which will both be used in our illustrations.

ACS sample 1, ss16pusa_sample_zeros, contains structural zeros. It will be used to illustrate how to perform multiple imputation and data synthesis tasks when structural zeros are present.

ACS sample 2, ss16pusa_sample_nozeros, is a subset of ACS sample 1 and contains no structural zeros. It will be used to illustrate how to perform multiple imputation and data synthesis tasks when structural zeros are not present.

0.3.1 ACS sample 1, with structural zeros

Variable Description Number of categories Category details
AGEP Age 7 16; 17; [18, 24]; [25, 35]; [36, 50]; [51, 70]; (70, ∞)
MAR Marital status 5 Married; Widowed; Divorced; Separated; Never married.
SCHL Education attainment 9 Up to K0; Some K12, no diploma; High school diploma or GED; Some college, no degree; Associate’s degree; Bachelor’s degree; Master’s degree; Professional degree; Doctorate degree.
SEX Gender 2 Male; Female
WKL When last worked 3 Within the last 12 months; 1-5 years ago; Over 5 years ago or never worked.
Table 1: Variables used in ACS sample 1: ss16pusa_sample_zeros.

We take a random sample of observations on 5 variables; see Table 1 for the list of the variables, their descriptions, numbers of categories, and category details. The sample is saved as ss16pusa_sample_zeros, and it contains structural zeros: 8 combinations, all related to the AGEP and SCHL variables, listed in Table 2. These 8 cases are derived from the original 2016 1-year ACS data (treated as the population).

Index Description
1 AGEP = 16 & SCHL = Bachelor’s degree.
2 AGEP = 16 & SCHL = Doctorate degree.
3 AGEP = 16 & SCHL = Master’s degree.
4 AGEP = 16 & SCHL = Professional degree.
5 AGEP = 17 & SCHL = Bachelor’s degree.
6 AGEP = 17 & SCHL = Doctorate degree.
7 AGEP = 17 & SCHL = Master’s degree.
8 AGEP = 17 & SCHL = Professional degree.
Table 2: 8 cases of structural zeros in the ACS sample.
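
As a quick sanity check, one can verify that a sample respects these constraints by counting the records that fall in any of the 8 structural-zero cells; the count should be 0 for ACS sample 1. The sketch below uses base R only; the object names are ours, and the AGEP and SCHL level labels are assumed to match Table 2 exactly.

data("ss16pusa_sample_zeros")
acs_zeros <- ss16pusa_sample_zeros
degrees <- c("Bachelor’s degree", "Doctorate degree",
             "Master’s degree", "Professional degree")
## flag records in a structural-zero cell: AGEP 16 or 17 with an advanced degree
in_zero_cell <- acs_zeros$AGEP %in% c("16", "17") & acs_zeros$SCHL %in% degrees
sum(in_zero_cell)   # should return 0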

0.3.2 ACS sample 2, without structural zeros

To obtain a sample without structural zeros, we take a subset of ACS sample 1 and drop the variables AGEP and SCHL, which eliminates the structural zeros (all of which involve AGEP and SCHL). This ACS sample 2 is saved as ss16pusa_sample_nozeros; see Table 3 for the list of the 3 variables, their descriptions, numbers of categories, and category details.

Variable Description Number of categories Category details
MAR Marital status 5 Married; Widowed; Divorced; Separated; Never married.
SEX Gender 2 Male; Female
WKL When last worked 3 Within the last 12 months; 1-5 years ago; Over 5 years ago or never worked.
Table 3: Variables used in ACS sample 2: ss16pusa_sample_nozeros.

0.4 Illustrations for multiple imputation

For illustration, missing values are created according to the missing completely at random (MCAR) mechanism, at a rate of 30% for each variable. The resulting datasets with missing values, corresponding to ACS sample 1 and ACS sample 2, are saved as ss16pusa_sample_zeros_miss and ss16pusa_sample_nozeros_miss, respectively. The DPMPM imputation engines are designed to perform multiple imputation of missing at random (MAR) categorical data.
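
For readers who wish to reproduce this setup on their own data, the following is a minimal sketch of one way to introduce 30% MCAR missingness independently in each variable. The helper name make_MCAR is ours; this is an illustration, not the exact script used to create the _miss datasets.

set.seed(1234)
make_MCAR <- function(data, miss_rate = 0.3){
  for (j in seq_len(ncol(data))){
    ## each entry of variable j is set to NA independently with probability miss_rate
    miss <- runif(nrow(data)) < miss_rate
    data[miss, j] <- NA
  }
  data
}
## example: X_miss <- make_MCAR(ss16pusa_sample_nozeros, 0.3)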

In this section, we first illustrate how to impute missing values for ACS sample 2 with 30% missingness using NPBayesImputeCat, where no structural zeros are present. We then present how to impute missing values for ACS sample 1 with 30% missingness using NPBayesImputeCat, where structural zeros are present.

0.4.1 Multiple imputation for data without structural zeros

We aim to impute missing values for ACS sample 2 with 30% missingness, ss16pusa_sample_nozeros_miss, where there are no structural zeros present.

Load the sample

First, we load ACS sample 2 with 30% missingness and make sure that all variables are unordered factors.

data("ss16pusa_sample_nozeros_miss")
X = ss16pusa_sample_nozeros_miss
p = dim(X)[2]
for (j in 1:p){
  X[,j] = as.factor(X[,j])
}

Create and initialize the DPMPM imputation engine

Second, we create and initialize the DPMPM imputation engine without structural zeros. The function CreateModel takes 7 arguments as input: i) X, the original sample; ii) MCZ, the data frame containing the structural zeros definition - use NULL when structural zeros are not present; iii) K, the maximum number of mixture components (i.e., the maximum number of latent classes in the DPMPM imputation engine); iv) Nmax, an upper truncation limit for the augmented sample size, that is, the maximum number of observations allowable in the augmented sample - use 0 when structural zeros are not present; v) aalpha, the hyper parameter $a_\alpha$ in the stick-breaking prior distribution in Equation (6); vi) balpha, the hyper parameter $b_\alpha$ in the stick-breaking prior distribution in Equation (6); and vii) seed, the seed value.

As a demonstration, we let K be 30 and both aalpha and balpha be 0.25, and we choose 1234 as the seed. The one-line script below creates and initializes the DPMPM imputation engine without structural zeros for ACS sample 2 with 30% missingness, stored in X.

model <- CreateModel(X, NULL, 30, 0, 0.25, 0.25, 1234)

Run the DPMPM imputation engine

We then run the model object for a set of user-specified numbers of burn-ins, MCMC iterations, and thinning. For example, to run 5 burn-ins, 10 MCMC iterations and thin every 1 iteration, run the following code.

> model$Run(5, 10, 1)
Initializing...
Run model without structural zeros.
iter = 0  kstar = 30 alpha = 1 Nmis = 0
iter = 0  kstar = 30 alpha = 9.90264 Nmis = 0
iter = 1  kstar = 30 alpha = 9.9866 Nmis = 0
iter = 2  kstar = 30 alpha = 12.9047 Nmis = 0
iter = 3  kstar = 29 alpha = 9.94832 Nmis = 0
iter = 4  kstar = 30 alpha = 9.20156 Nmis = 0
iter = 5  kstar = 29 alpha = 9.08214 Nmis = 0
iter = 6  kstar = 28 alpha = 8.40744 Nmis = 0
iter = 7  kstar = 28 alpha = 9.80164 Nmis = 0
iter = 8  kstar = 30 alpha = 8.0682 Nmis = 0
iter = 9  kstar = 29 alpha = 9.04139 Nmis = 0

We can see that the output prints out the iteration index as iter, the number of occupied mixture components (i.e., the number of occupied latent classes) as kstar, the posterior draw of $\alpha$ (the concentration parameter in the stick-breaking prior distribution in Equation (6)) as alpha, and the size of the augmented sample as Nmis. In our demonstration, Nmis is always 0, as the size of the augmented sample is 0 when there are no structural zeros.

It is important to keep track of the value of kstar, as the NPBayesImputeCat package uses the truncated stick-breaking representation of the DP prior (Sethuraman 1994). If the value of kstar is always K, the maximum number of mixture components, we should re-run the DPMPM model with a larger value of K, to allow a large enough number of mixture components to cluster the observations. The above initial run suggests that the estimation uses almost all latent classes (kstar is close to or equal to 30, the maximum number of latent classes K). It is therefore prudent to increase the value of K when executing the CreateModel command, for example:

> model <- CreateModel(X, NULL, 80, 0, 0.25, 0.25, 1234)
> model$Run(5, 10, 1)
Initializing...
Run model without structural zeros.
iter = 0  kstar = 80 alpha = 1 Nmis = 0
iter = 0  kstar = 75 alpha = 21.7874 Nmis = 0
iter = 1  kstar = 71 alpha = 24.1254 Nmis = 0
iter = 2  kstar = 74 alpha = 27.6968 Nmis = 0
iter = 3  kstar = 74 alpha = 23.9416 Nmis = 0
iter = 4  kstar = 69 alpha = 21.548 Nmis = 0
iter = 5  kstar = 72 alpha = 25.159 Nmis = 0
iter = 6  kstar = 68 alpha = 23.7602 Nmis = 0
iter = 7  kstar = 72 alpha = 23.6195 Nmis = 0
iter = 8  kstar = 71 alpha = 25.9578 Nmis = 0
iter = 9  kstar = 70 alpha = 28.4868 Nmis = 0

This time, setting K equal to 80 seems sufficiently large. Users should keep track of the value of kstar for the entire run, and adjust K accordingly.

Impute missing values for ACS sample 2 with 30% missingness

Finally, to create imputed datasets, we write a function DPMPMimp_nozeros that takes the following arguments as input: i) X, the original sample with missing values; ii) dj, a vector recording the number of categories of the variables; iii) nrun, the total number of MCMC iterations; iv) burn, the number of burn-ins; v) thin, the number of thinnings; vi) K, the maximum number of mixture components; vii) aalpha and balpha, the hyper parameters $a_\alpha$ and $b_\alpha$ in the stick-breaking prior distribution for $\alpha$; viii) m, the number of imputed datasets; and ix) seed, the seed value.

DPMPMimp_nozeros <- function(X, dj, nrun, burn, thin, K, aalpha, balpha, m, seed){
  num_obs <- nrow(X)
  p = dim(X)[2]
  origdata = X

  ## start the DP model and run the burn-in
  model = CreateModel(origdata, NULL, K, 0, aalpha, balpha, seed)
  model$Run(0, burn, 1)

  ## initialize matrices to save parameter draws:
  ## alpha, and the number of occupied latent classes
  eff.sam <- (nrun - burn)/thin
  alphaout <- matrix(rep(0, eff.sam), nrow = eff.sam, ncol = 1)
  uniquezout <- matrix(rep(0, eff.sam), nrow = eff.sam, ncol = 1)

  ## run the MCMC and save parameter draws:
  ## alpha, and the number of occupied latent classes
  for (i in 1:(eff.sam - m)){
    ## run model for "thin" more iterations
    model$Run(0, thin, 1)

    ## extract output from model$snapshot
    output <- model$snapshot

    ## save alpha and the number of occupied latent classes
    alphaout[i] <- output$alpha
    uniquezout[i] <- length(unique(output$z))
  }

  #####################################################
  #####      impute missing data with DPMPM       #####
  #####################################################

  ## initialize matrix to store one imputed dataset
  impX <- matrix(NA, dim(origdata)[1], dim(origdata)[2])

  ## initialize list to store the m imputed datasets
  impdata <- vector("list", m)

  ## loop from 1 to m; each iteration generates one imputed dataset
  for (i in 1:m){
    ## run model for "thin" more iterations
    model$Run(0, thin, 1)

    ## save alpha and the number of occupied latent classes
    output <- model$snapshot
    alphaout[eff.sam - (m - i)] <- output$alpha
    uniquezout[eff.sam - (m - i)] <- length(unique(output$z))

    ## retrieve parameters from the final iteration
    result <- model$snapshot

    ## convert the ImputedX matrix to a data frame, using proper factors/names etc.
    impX <- GetDataFrame(result$ImputedX, X)

    ## store the i-th imputed dataset in the list: impdata
    impdata[[i]] = impX
  }
  ## return results: impdata, origdata, alphaout, uniquezout
  res <- list(impdata = impdata,
              origdata = origdata,
              alphaout = alphaout,
              uniquezout = uniquezout)
  return(res)
}

The DPMPMimp_nozeros function returns a list containing: i) impdata, the list of imputed datasets; ii) origdata, the original data X; iii) alphaout, the saved draws of $\alpha$, which can be used to check MCMC convergence; and iv) uniquezout, the saved numbers of occupied mixture components, which can be used to track whether the upper bound K is set large enough.

To run the DPMPMimp_nozeros function to impute missing data for ACS sample 2 with 30% missingness, we can write a simple script as below. For this demonstration, we run the MCMC for 10,000 iterations, with the first 5,000 as burn-ins. We thin every 50 iterations, and let K = 80. The hyper parameters $a_\alpha$ and $b_\alpha$ for $\alpha$ are both set to 0.25, and we generate m = 5 imputed datasets. We use 1234 for the seed.

dj = c(5, 2, 3)
output = DPMPMimp_nozeros(X, dj, 10000, 5000, 50, 80, 0.25, 0.25, 5, 1234)

The output prints the progress of each iteration, which we omit here. To access the first imputed dataset, we can do the following.

impdata1 = output$impdata[[1]]
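
Before analyzing the imputed datasets, it is prudent to examine the saved draws in alphaout and uniquezout; simple traceplots give a quick check of MCMC convergence and of whether the upper bound K was set large enough. The sketch below uses base R graphics only.

par(mfrow = c(1, 2))
plot(as.vector(output$alphaout), type = "l", xlab = "saved iteration",
     ylab = "alpha", main = "Traceplot of alpha")
plot(as.vector(output$uniquezout), type = "l", xlab = "saved iteration",
     ylab = "occupied latent classes", main = "Traceplot of kstar")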

We also include a demonstration of using the combining rules to obtain a 95% confidence interval for the proportion of records who are married and last worked within the last 12 months (entry [2, 3] in the contingency table of MAR and WKL).

m = 5
Qm = rep(NA, m)
Um = rep(NA, m)

n = dim(output$origdata)[1]
for (l in 1:m){
  ## point estimate: proportion in cell [2, 3] of the MAR by WKL table
  Qm[l] = (table(output$impdata[[l]]$MAR, output$impdata[[l]]$WKL)/n)[2,3]
  ## within-imputation variance of the proportion
  Um[l] = Qm[l]*(1-Qm[l])/n
}

Qbarm = mean(Qm)   # combined point estimate
Bm = var(Qm)       # between-imputation variance
Ubarm = mean(Um)   # average within-imputation variance
Tm = (1+1/m)*Bm + Ubarm
v = (m-1)*(1+Ubarm/((1+1/m)*Bm))^2

Qbarm - qt(0.975, v)*sqrt(Tm)
Qbarm + qt(0.975, v)*sqrt(Tm)

0.4.2 Multiple imputation for data with structural zeros

In this section, we aim to impute missing values for ACS sample 1 with 30% missingness, ss16pusa_sample_zeros_miss, where there are structural zeros present. The general procedure is very similar to the one in the previous section where structural zeros are not present. However, we need to first create MCZ, the data frame containing the structural zeros definition. Moreover, some inputs in the CreateModel function will need to be adjusted for the structural zeros cases.

Create a file to store structural zeros cases

Recall that one of the inputs that the NPBayesImputeCat package takes is a data frame MCZ, which defines the structural zeros. Previously, when there were no structural zeros, MCZ was set to NULL. When structural zeros are present in the application, one should construct the MCZ data frame following two general rules:

  1. Variables in MCZ must be factors with the same levels as the original data.

  2. Placeholder components are represented with NAs.

Now we demonstrate how to create a data frame MCZ to store the structural zeros definition shown in Table 2. The script below accomplishes this task.

AGEP = c(16, 16, 16, 16, 17, 17, 17, 17)
SCHL = c("Bachelor’s degree", "Doctorate degree", "Master’s degree",
"Professional degree", "Bachelor’s degree", "Doctorate degree",
"Master’s degree", "Professional degree")
MAR = rep(NA, 8)
SEX = rep(NA, 8)
WKL = rep(NA, 8)

MCZ = as.data.frame(cbind(AGEP, MAR, SCHL, SEX, WKL))

First, we create a vector AGEP consisting of 4 replicates of the value 16 and 4 replicates of the value 17, and a vector SCHL consisting of the degree types that induce structural zeros with AGEP. Second, we create the vectors MAR, SEX, and WKL, each of length 8 with every element NA. These are placeholder components; since the structural zeros cases do not involve these three variables, all of their elements are NAs. Third, we create a data frame using as.data.frame and cbind. It is necessary to input the variables in the same order as in the original data (the order of the variables in Table 1). We save the data frame MCZ for later use.

Load the sample

Before creating and initializing the DPMPM model with structural zeros, load ss16pusa_sample_zeros_miss and make sure that the variables in both the sample X and the structural zeros data frame MCZ are factors with matching levels. Below is a short script to achieve this task.

data(ss16pusa_sample_zeros_miss)
X = ss16pusa_sample_zeros_miss
p = dim(X)[2]
for (j in 1:p){
  X[,j] = as.factor(X[,j])
  MCZ[,j] = factor(MCZ[,j], levels = levels(X[,j]))
}

Create and initialize the DPMPM imputation engine

Next, we can create and initialize the DPMPM imputation engine with structural zeros. Recall the 7 inputs that the function CreateModel takes: i) X, the original sample; ii) MCZ, the data frame containing the structural zeros definition; iii) K, the maximum number of mixture components (i.e., the maximum number of latent classes in the DPMPM imputation engine); iv) Nmax, an upper truncation limit for the augmented sample size; v) aalpha, the hyper parameter $a_\alpha$ in the stick-breaking prior distribution in Equation (6); vi) balpha, the hyper parameter $b_\alpha$ in the stick-breaking prior distribution in Equation (6); and vii) seed, the seed value.

As a demonstration, we let K be 20, Nmax be 200,000, and both aalpha and balpha be 0.25, and we use 1234 as the seed. The one-line script below creates and initializes the DPMPM imputation engine with structural zeros for ACS sample 1 with 30% missingness (stored in X), with the structural zeros definition MCZ.

model <- CreateModel(X, MCZ, 20, 200000, 0.25, 0.25, 1234)

Run the DPMPM imputation engine

We can then run model for a set of user-specified numbers of burn-ins, MCMC iterations, and thinning. For example, to run 5 burn-ins, 10 MCMC iterations and thin every 1 iteration:

> model$Run(5, 10, 1)
Initializing...
Run model with structural zeros.
iter = 0  kstar = 7 alpha = 0.015 Nmis = 54
iter = 0  kstar = 7 alpha = 0.960284 Nmis = 11
iter = 1  kstar = 7 alpha = 1.32786 Nmis = 19
iter = 2  kstar = 8 alpha = 1.67261 Nmis = 12
iter = 3  kstar = 7 alpha = 1.2456 Nmis = 9
iter = 4  kstar = 8 alpha = 0.63193 Nmis = 17
iter = 5  kstar = 7 alpha = 0.619098 Nmis = 20
iter = 6  kstar = 7 alpha = 1.16728 Nmis = 22
iter = 7  kstar = 8 alpha = 1.14599 Nmis = 15
iter = 8  kstar = 10 alpha = 1.87188 Nmis = 22
iter = 9  kstar = 10 alpha = 1.79102 Nmis = 18

We can see that the output prints out the iteration index as iter, the number of occupied mixture components (i.e., the number of occupied latent classes) as kstar, the value of $\alpha$ (the concentration parameter in the stick-breaking prior distribution in Equation (6)) as alpha, and the size of the augmented sample as Nmis. For the first 10 iterations, we can see the fluctuation of the augmented sample size. It is important to keep track of the value of kstar, as the NPBayesImputeCat package uses the truncated stick-breaking representation of the DP prior (Sethuraman 1994). If the value of kstar is always K, the maximum number of mixture components, we should re-run the DPMPM model with a larger value of K, to allow a large enough number of mixture components to cluster the observations.

Impute missing values for ACS sample 1 with 30% missingness

Finally, to create imputed datasets, we can write a function DPMPMimp_zeros that takes the following arguments as input: i) X, the original sample; ii) dj, a vector recording the number of categories of the variables; iii) MCZ, the data frame containing the definition of structural zeros; iv) Nmis, the maximum size of the augmented sample; v) nrun, the total number of MCMC iterations; vi) burn, the number of burn-ins; vii) thin, the number of thinnings; viii) K, the maximum number of mixture components; ix) aalpha and balpha, the hyper parameters $a_\alpha$ and $b_\alpha$ in the stick-breaking prior distribution for $\alpha$; x) m, the number of imputed datasets; and xi) seed, the seed value.

DPMPMimp_zeros <- function(X, dj, MCZ, Nmis, nrun, burn, thin, K, aalpha, balpha, m, seed){
  num_obs = nrow(X)
  p = dim(X)[2]
  origdata = X

  ## start the DP model and run the burn-in
  model = CreateModel(origdata, MCZ, K, Nmis, aalpha, balpha, seed)
  model$Run(0, burn, 1)

  ## initialize matrices to save parameter draws:
  ## alpha, and the number of occupied latent classes
  eff.sam <- (nrun - burn)/thin
  alphaout <- matrix(rep(0, eff.sam), nrow = eff.sam, ncol = 1)
  uniquezout <- matrix(rep(0, eff.sam), nrow = eff.sam, ncol = 1)

  ## run the MCMC and save parameter draws:
  ## alpha, and the number of occupied latent classes
  for (i in 1:(eff.sam - m)){
    ## run model for "thin" more iterations
    model$Run(0, thin, 1)

    ## extract output from model$snapshot
    output <- model$snapshot

    ## save alpha and the number of occupied latent classes
    alphaout[i] <- output$alpha
    uniquezout[i] <- length(unique(output$z))
  }

  #####################################################
  #####      impute missing data with DPMPM       #####
  #####################################################

  ## initialize matrix to store one imputed dataset
  impX <- matrix(NA, dim(origdata)[1], dim(origdata)[2])

  ## initialize list to store the m imputed datasets
  impdata <- vector("list", m)

  ## loop from 1 to m; each iteration generates one imputed dataset
  for (i in 1:m){
    ## run model for "thin" more iterations
    model$Run(0, thin, 1)

    ## save alpha and the number of occupied latent classes
    output <- model$snapshot
    alphaout[eff.sam - (m - i)] <- output$alpha
    uniquezout[eff.sam - (m - i)] <- length(unique(output$z))

    ## retrieve parameters from the final iteration
    result <- model$snapshot

    ## convert the ImputedX matrix to a data frame, using proper factors/names etc.
    impX <- GetDataFrame(result$ImputedX, X)

    ## store the i-th imputed dataset in the list: impdata
    impdata[[i]] = impX
  }
  ## return results: impdata, origdata, alphaout, uniquezout
  res <- list(impdata = impdata,
              origdata = origdata,
              alphaout = alphaout,
              uniquezout = uniquezout)
  return(res)
}

The DPMPMimp_zeros function returns a list containing: i) impdata, the list of imputed datasets; ii) origdata, the original data X; iii) alphaout, the saved draws of $\alpha$, which can be used to check MCMC convergence; and iv) uniquezout, the saved numbers of occupied mixture components, which can be used to track whether the upper bound K is set large enough.

To run the DPMPMimp_zeros function to impute missing data for ACS sample 1 with 30% missingness, we can write a simple script as below. For this demonstration, we run the MCMC for 10,000 iterations, with the first 5,000 as burn-ins. We thin every 50 iterations, and let K = 80. The upper limit of the augmented sample size is set to 200,000. The hyper parameters $a_\alpha$ and $b_\alpha$ for $\alpha$ are both set to 0.25, and we generate m = 5 imputed datasets. We use 1234 for the seed.

dj = c(7, 5, 9, 2, 3)
output = DPMPMimp_zeros(X, dj, MCZ, 200000, 10000, 5000, 50, 80, 0.25, 0.25, 5, 1234)

The output prints the progress of each iteration, which we omit here. To access the first imputed dataset, we can do the following.

impdata1 = output$impdata[[1]]

We also include a demonstration of using the combining rules to obtain a 95% confidence interval for the proportion of records who are married and last worked within the last 12 months (entry [2, 3] in the contingency table of MAR and WKL).

m = 5
Qm = rep(NA, m)
Um = rep(NA, m)

n = dim(output$origdata)[1]
for (l in 1:m){
  ## point estimate: proportion in cell [2, 3] of the MAR by WKL table
  Qm[l] = (table(output$impdata[[l]]$MAR, output$impdata[[l]]$WKL)/n)[2,3]
  ## within-imputation variance of the proportion
  Um[l] = Qm[l]*(1-Qm[l])/n
}

Qbarm = mean(Qm)   # combined point estimate
Bm = var(Qm)       # between-imputation variance
Ubarm = mean(Um)   # average within-imputation variance
Tm = (1+1/m)*Bm + Ubarm
v = (m-1)*(1+Ubarm/((1+1/m)*Bm))^2

Qbarm - qt(0.975, v)*sqrt(Tm)
Qbarm + qt(0.975, v)*sqrt(Tm)

0.5 Illustrations for synthetic data

Without loss of generality, suppose we want to generate fully synthetic datasets for the ACS samples. Interested readers can easily adjust the procedure to generate partially synthetic datasets.

In this section, we illustrate how to generate fully synthetic datasets for ACS sample 2, where no structural zeros are present. The NPBayesImputeCat package currently does not accommodate data synthesis with structural zeros.

0.5.1 Synthetic data for data without structural zeros

We aim to generate fully synthetic datasets for ACS sample 2, ss16pusa_sample_nozeros, where there are no structural zeros present.

Load the sample

First, we need to load ACS sample 2 and make sure that all variables are unordered factors.

data("ss16pusa_sample_nozeros")
X = ss16pusa_sample_nozeros
p = dim(X)[2]
for (j in 1:p){
  X[,j] = as.factor(X[,j])
}

Create and initialize the DPMPM synthesizer

Second, we create and initialize the DPMPM synthesizer without structural zeros. The function CreateModel takes 7 arguments as input: i) X, the original sample; ii) MCZ, the data frame containing the structural zeros definition - use NULL when structural zeros are not present; iii) K, the maximum number of mixture components (i.e., the maximum number of latent classes in the DPMPM synthesizer); iv) Nmax, an upper truncation limit for the augmented sample size - use 0 when structural zeros are not present; v) aalpha, the hyper parameter $a_\alpha$ in the stick-breaking prior distribution in Equation (6); vi) balpha, the hyper parameter $b_\alpha$ in the stick-breaking prior distribution in Equation (6); and vii) seed, the seed value.

As a demonstration, we let K be 80, both aalpha and balpha be 0.25, and the seed be 1234. The one-line script below creates and initializes the DPMPM synthesizer without structural zeros for ACS sample 2, stored in X.

model <- CreateModel(X, NULL, 80, 0, 0.25, 0.25, 1234)

Run the DPMPM synthesizer

We can then run model for a set of user-specified numbers of burn-ins, MCMC iterations, and thinning. For example, to run 5 burn-ins, 10 MCMC iterations and thin every 1 iteration:

> model$Run(5, 10, 1)
Initializing...
Run model without structural zeros.
iter = 0  kstar = 80 alpha = 1 Nmis = 0
iter = 0  kstar = 72 alpha = 21.5913 Nmis = 0
iter = 1  kstar = 73 alpha = 24.2025 Nmis = 0
iter = 2  kstar = 74 alpha = 18.8123 Nmis = 0
iter = 3  kstar = 65 alpha = 23.1972 Nmis = 0
iter = 4  kstar = 68 alpha = 21.7837 Nmis = 0
iter = 5  kstar = 67 alpha = 26.4672 Nmis = 0
iter = 6  kstar = 67 alpha = 19.5297 Nmis = 0
iter = 7  kstar = 68 alpha = 23.57 Nmis = 0
iter = 8  kstar = 69 alpha = 26.5521 Nmis = 0
iter = 9  kstar = 67 alpha = 23.4778 Nmis = 0

We can see that the output prints out the iteration index as iter, the number of occupied mixture components (i.e., the number of occupied latent classes) as kstar, the value of $\alpha$ (the concentration parameter in the stick-breaking prior distribution in Equation (6)) as alpha, and the size of the augmented sample as Nmis. In our demonstration, Nmis is always 0, as the size of the augmented sample is 0 when there are no structural zeros. It is important to keep track of the value of kstar, as the NPBayesImputeCat package uses the truncated stick-breaking representation of the DP prior (Sethuraman 1994). If the value of kstar is always K, the maximum number of mixture components, we should re-run the DPMPM synthesizer with a larger value of K, to allow a large enough number of mixture components to cluster the observations.

Generate synthetic datasets for ACS sample 2

Finally, to generate synthetic datasets, we can write a function DPMPMsyn_nozeros that takes the following arguments as input: i) X, the original sample; ii) dj, a vector recording the number of categories of the variables; iii) nrun, the total number of MCMC iterations; iv) burn, the number of burn-ins; v) thin, the number of thinnings; vi) K, the maximum number of mixture components; vii) aalpha and balpha, the hyper parameters $a_\alpha$ and $b_\alpha$ in the stick-breaking prior distribution for $\alpha$; viii) m, the number of synthetic datasets; and ix) seed, the seed value.

DPMPMsyn_nozeros <- function(X, dj, nrun, burn, thin, K, aalpha, balpha, m, seed){
  num_obs <- nrow(X)
  p = dim(X)[2]
  origdata = X

  ## start the DP model and run the burn-in
  model = CreateModel(origdata, NULL, K, 0, aalpha, balpha, seed)
  model$Run(0, burn, 1)

  ## initialize matrices to save parameter draws:
  ## alpha, and the number of occupied latent classes
  eff.sam <- (nrun - burn)/thin
  alphaout <- matrix(rep(0, eff.sam), nrow = eff.sam, ncol = 1)
  uniquezout <- matrix(rep(0, eff.sam), nrow = eff.sam, ncol = 1)

  ## run the MCMC and save parameter draws:
  ## alpha, and the number of occupied latent classes
  for (i in 1:(eff.sam - m)){
    ## run model for "thin" more iterations
    model$Run(0, thin, 1)

    ## extract output from model$snapshot
    output <- model$snapshot

    ## save alpha and the number of occupied latent classes
    alphaout[i] <- output$alpha
    uniquezout[i] <- length(unique(output$z))
  }

  #####################################################
  ###      generate synthetic data with DPMPM       ###
  #####################################################

  ## initialize matrix to store one synthetic dataset
  synX <- matrix(NA, dim(origdata)[1], dim(origdata)[2])

  ## initialize list to store the m synthetic datasets
  syndata <- vector("list", m)

  ## loop from 1 to m; each iteration generates one synthetic dataset
  for (i in 1:m){
    ## run model for "thin" more iterations
    model$Run(0, thin, 1)

    ## save alpha and the number of occupied latent classes
    output <- model$snapshot
    alphaout[eff.sam - (m - i)] <- output$alpha
    uniquezout[eff.sam - (m - i)] <- length(unique(output$z))

    ## save multinomial probabilities and latent class assignments
    psiout_i <- output$psi
    zout_i <- output$z

    ## generate synthetic data record by record with the DPMPM
    for (j in 1:num_obs){
      ## latent class assignments are 0-indexed; shift to 1-indexed
      zj <- zout_i[j] + 1

      ## synthesize all variables, i.e. fully synthetic data
      for (k in 1:p){
        Xprob_k <- psiout_i[1:dj[k], zj, k]
        synX[j, k] <- which(rmultinom(1, 1, Xprob_k) == 1)
      }
    }

    ## transform synX to the data frame syndata0
    syndata0 = GetDataFrame(synX - 1, origdata, cols = 1:ncol(origdata))

    ## store the i-th synthetic dataset in the list: syndata
    syndata[[i]] = syndata0
  }
  ## return results: syndata, origdata, alphaout, uniquezout
  res <- list(syndata = syndata,
              origdata = origdata,
              alphaout = alphaout,
              uniquezout = uniquezout)
  return(res)
}

The DPMPMsyn_nozeros function returns a list containing: i) syndata, the list of synthetic datasets; ii) origdata, the original data X; iii) alphaout, the saved draws of $\alpha$, which can be used to check MCMC convergence; and iv) uniquezout, the saved numbers of occupied mixture components, which can be used to track whether the upper bound K is set large enough.

To run the DPMPMsyn_nozeros function to generate synthetic datasets for ACS sample 2, we can write a simple script as below. For this demonstration, we run the MCMC for 10,000 iterations, with the first 5,000 as burn-ins. We thin every 50 iterations, and let K = 80. The hyper parameters $a_\alpha$ and $b_\alpha$ for $\alpha$ are both set to 0.25, and we generate m = 5 synthetic datasets. We use 1234 for the seed.

dj = c(5, 2, 3)
output = DPMPMsyn_nozeros(X, dj, 10000, 5000, 50, 80, 0.25, 0.25, 5, 1234)

The output prints the progress of each iteration, which we omit here. To access the first synthetic dataset, we can do the following.

syndata1 = output$syndata[[1]]
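
Before release, a basic utility check is to compare simple summaries of the synthetic data against the original data; close agreement in the marginal distributions is a minimal requirement. A short sketch for the variable MAR follows; the same comparison can be repeated for SEX and WKL.

round(cbind(original  = prop.table(table(output$origdata$MAR)),
            synthetic = prop.table(table(syndata1$MAR))), 3)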

We also include a demonstration of using the combining rules to obtain a 95% confidence interval for the proportion of records who are married and last worked within the last 12 months (entry [2, 3] in the contingency table of MAR and WKL).

m = 5
Qm = rep(NA, m)
Um = rep(NA, m)

n = dim(output$origdata)[1]
for (l in 1:m){
  ## point estimate: proportion in cell [2, 3] of the MAR by WKL table
  Qm[l] = (table(output$syndata[[l]]$MAR, output$syndata[[l]]$WKL)/n)[2,3]
  ## within-dataset variance of the proportion
  Um[l] = Qm[l]*(1-Qm[l])/n
}

Qbarm = mean(Qm)   # combined point estimate
Bm = var(Qm)       # between-dataset variance
Ubarm = mean(Um)   # average within-dataset variance
Tf = (1+1/m)*Bm - Ubarm
v = (m-1)*(1-m*Ubarm/((m+1)*Bm))^2

Qbarm - qt(0.975, v)*sqrt(Tf)
Qbarm + qt(0.975, v)*sqrt(Tf)

0.6 Concluding remarks

In this paper, we have presented the DPMPM models for multivariate categorical data, and illustrations of using the NPBayesImputeCat package for multiple imputation and synthetic data applications. Users can take the output and extract imputed and synthetic datasets, then conduct statistical analyses of their choice and use the appropriate combining rules to obtain valid estimates. Interested readers can refer to the package documentation for additional features.

While the NPBayesImputeCat package has been developed primarily for multiple imputation and synthetic data purposes, users can also use it for DPMPM model estimation. For example, following the illustrations for synthetic data, a data analyst is able to obtain draws of several key parameters from the MCMC chain: i) the Dirichlet Process concentration parameter $\alpha$, ii) the mixture probability vector $\pi = (\pi_1, \dots, \pi_K)$, and iii) the multinomial probability vectors $\theta_k^{(j)}$. The analyst can then further conduct analyses of the clustering of the observations in the MCMC chain, and of other questions of interest.
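
For example, reusing the model object from the synthetic data illustration, the snapshot at the current MCMC iteration can be inspected directly; the sketch below extracts the concentration parameter, the latent class assignments, and the multinomial probability vectors, and tabulates the occupied latent class sizes.

snap <- model$snapshot
snap$alpha                          # current draw of the DP concentration parameter
psi <- snap$psi                     # multinomial probability vectors, by class and variable
z <- snap$z                         # latent class assignments (0-indexed)
sort(table(z), decreasing = TRUE)   # sizes of the occupied latent classes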
