Robust Clustering with Subpopulation-specific Deviations

by   Briana Stephenson, et al.
Duke University

The National Birth Defects Prevention Study (NBDPS) was a case-control study of birth defects conducted across 10 U.S. states. Researchers are interested in characterizing the etiologic role of maternal diet, using data tools such as the food frequency questionnaire. Maternal diet and behaviors have been shown to influence the development of congenital malformations. In a large, heterogeneous population, traditional clustering methods, such as latent class analysis, used to estimate dietary patterns can produce a large number of clusters due to a variety of factors, including study size and regional diversity. These factors result in a loss of interpretability of patterns that may differ due to minor consumption pattern changes. Based on adaptation of the local partition process, we propose a new method, Robust Profile Clustering, to handle these data complexities. Here, participants may be clustered at two levels: (1) globally, where women are assigned to an overall population-level cluster via an overfitted mixture model, and (2) locally, where regional variations in diet are accommodated via a beta-Bernoulli process dependent on subpopulation differences. We use our method to analyze the NBDPS data, deriving pre-pregnancy dietary patterns women in the NBDPS while accounting for regional variability.


page 18

page 21

page 24


Supervised Robust Profile Clustering

In many studies, dimension reduction methods are used to profile partici...

Multiview Hierarchical Agglomerative Clustering for Identification of Development Gap and Regional Potential Sector

The identification of regional development gaps is an effort to see how ...

A Bernoulli Mixture Model to Understand and Predict Children Longitudinal Wheezing Patterns

In this research, we estimate that around 27.99(±2.15)% of the populatio...

Defying the Circadian Rhythm: Clustering Participant Telemetry in the UK Biobank Data

The UK Biobank dataset follows over 500,000 volunteers and contains a di...

Bayesian regional food frequency analysis for large catchments

Regional flood frequency analysis is commonly applied in situations wher...

Fuzzy-Klassen Model for Development Disparities Analysis based on Gross Regional Domestic Product Sector of a Region

Analysis of regional development imbalances quadrant has a very importan...

Geographical Distribution of Biomedical Research in the USA and China

We analyze nearly 20 million geocoded PubMed articles with author affili...

1 Introduction

1.1 Multivariate Categorical Dietary Data

Food frequency questionnaires (FFQ) are often used to measure an individual’s dietary intake over a period of time. The standard FFQ queries on consumption/intake levels of over 100 foods (Subar et al., 2001). Some researchers focus on individual foods or nutrients, but these foods are not consumed in isolation, and many nutritionists argue that a more holistic approach is needed (Motulsky et al., 1989). When data includes a large number of exposures or in the case of an FFQ, food items, data reduction techniques such as factor analysis, latent class analysis, or other clustering approaches are often used (Kant, 2004; Venkaiah et al., 2011; Sotres-Alvarez et al., 2010; Keshteli et al., 2015).

Clustering methods generally assume participants within each cluster share dietary habits and aim to maximize differences across groups. At times, these methods may oversimplify dietary behaviors, and it can be difficult to determine when dimension reduction is generalizable across different populations. The generalizability issue poses a concern for large heterogeneous populations. Subjects from these populations may often share a combination of behaviors that could be general to an overall population, but also specific to a subject’s subpopulation. A subpopulation is defined as any group indicated by a categorical covariate (e.g. state residence, ethnicity, SES, etc.). For example in the United States, if a subpopulation was defined by a subject’s state of residence, the foods consumed to characterize “American” diets would look different, due to regional differences. A healthy diet may incorporate an increased consumption of regional foods indigenous to a specific state (e.g. more avocados in Texas). Reconciling these regional differences with a single overall clustering method presents a loss of granularity.

On the other hand, creating separate models for each subpopulation can greatly diminish statistical power, and can lead to misleading characterizations of diets when generalizing across the entire population sample. Individuals that are classified as having a “healthy” diet in North Carolina, or a subpopulation where poor eating behaviors are prevalent, may be classified as having an “unhealthy” diet in Massachusetts, where a more ‘health-conscious’ population is prevalent. The differences found within these regional patterns are crucial for the improvement of national dietary recommendations that can accommodate heterogeneity of dietary behaviors.

1.2 Standard Clustering Methods

The latent class model, introduced by Lazarsfeld and Henry (1968), is the most common clustering method for handling multivariate categorical response data. Because the number of clusters is typically unknown, models with a varying number of clusters are fit. The best number of clusters is chosen via likelihood ratio tests, Bayesian Information Criteria, Akaike Information Criteria, or the Lo-Mendell-Rubin test (Nylund et al., 2007). In practice, these criteria tend to be “greedy” and select solutions with a large number of clusters. To avoid the challenge of interpreting a large number of clusters, researchers will add interpretability as a criterion for selecting the number of clusters and look for the best solution with a manageable number (Ford et al., 2010; Silverwood et al., 2011). With no separation across subpopulations, this restriction on the number of overall classes masks any localized dietary behaviors that could be pronounced at a subpopulation level.

Nonparametric Bayesian methods allow the number of clusters represented in a sample to group as dimension (sample size, number of variables) increases, namely, the Dirichlet process or overfitted finite mixture models (Figueiredo and Jain, 2002; Zhang et al., 2004; Teh, 2006; Miller and Harrison, 2016; Rousseau and Mengersen, 2011). In heterogeneous populations, a largely prevalent subpopulation may have its behaviors reflected in one of the overall clusters; whereas smaller subpopulation behaviors may still remain hidden in one of these general clusters. Further, while flexibly convenient, these models tend to overestimate the true number of clusters, permitting, at times, nonexistent clusters to appear (Miller and Harrison, 2013)

. Outliers are often assigned to singleton clusters, which measure lack of fit in the model more than a new pattern.

In multi-site studies, often a hierarchical or nested approach is used to accommodate any potential differences amongst subpopulations. The hierarchical Dirichlet process assumes common clusters across groups(Teh et al., 2006). Nested approaches cluster subjects within a subpopulation and only borrow information across subpopulations that share similar behaviors (Rodriguez et al., 2008; Hu et al., 2017)

. While useful in many applications, these techniques contain drawbacks. The number of nonempty clusters derived is highly sensitive to the selection of tuning parameters or hyperparameters. Though unrealistic, strong priors on these parameters are necessary to enforce sparsity and ensure subjects aggregate to a reasonable number of clusters, as interpretability once again becomes an issue.

What strains these clustering methods is the assumption of global clustering, where subjects belonging to the same cluster will exhibit the same expected responses for all variables included in the set. This is where subpopulation granularity is lost, as differences may exist for a subset of variables within that subpopulation. Local partition and hybrid Dirichlet mixture models break this global clustering assumption by apply a two-tiered clustering scheme at a global and local level (Petrone et al., 2009; Dunson, 2009).

The hybrid Dirichlet mixture model assigns local cluster assignments to each individual variable, and then clusters globally based on shared combinatorial similarities of these local clusterings with other subjects using a copula construction. FFQ data are considered semi-quantitative. Quantity of consumption is collected based on a choice from several standardized portion sizes. The frequency of consumption is collected in ordinal group levels (e.g. X times per day, daily, weekly, monthly) (Subar et al., 2001). Given this data structure, the hybrid DP is unable to generate discriminating global clusters because the copula construction used to derive the global clustering scheme is not unique when handling discrete data (Smith and Khaled, 2012).

Instead of clustering every variable individually, the local partition model allows an entire subset of variables to be partitioned to a local or global clustering system. Here, subjects cluster with other subjects who behave similarly for most or some of the variables analyzed. It is useful in characterizing the global cluster patterns because variables that do not provide much information to the overall population patterns can be removed. However, what is considered noise at the global level could be valuable at a subpopulation level. In order to identify which items are important in a general population setting and which items are important in a subpopulation setting, a statistically principled method is needed to identify and discriminate between the two levels of patterns, while still preserving a level of interpretability.

We organize this article as follows. Section 2 introduces the RPC framework. Section 3 describes the algorithm for posterior computation and inference. Section 4 explores performance versus other methods via a simulation study. Section 5 presents a comprehensive analysis of the NBDPS data and describes insights provided by the new methodology. We conclude with a short discussion in Section 6.

2 Robust Profile Clustering

In this section, we propose a novel class of Robust Profile Clustering (RPC) processes, which are designed to produce a robust set of “global” clusters summarizing the overall nutritional profile of an individual. The robustness is achieved by not following the typical approach and restricting all of the measurements from individual to have the same cluster membership, but instead to allow local deviations from the global profile.

As our data are nested within subpopulations, we allow these local deviations to have a subpopulation-specific form. Introducing some notation, we let index individuals in a study, index the known subpopulation (essentially a categorical covariate) of individual , and index the (unknown) global profile membership of subject

. In addition, each subject has a multivariate data vector,

. Individual may not follow her global cluster allocation for all elements of this multivariate vector but may deviate for some of them. We let if item is attributed to global cluster for individual and otherwise. We let denote the local cluster allocation conditionally on and .

An RPC process is then induced through probability models containing 3 components: (

i) the global clustering, , (ii) the local deviation indicator, , and (iii) the local clustering membership, . There are a wide variety of choices that can be used for (i)-(iii), and to put in general form we let


The data are assumed to be drawn independently from , where , . We use an equal number of categories for simplicity in exposition, but the extension to a variable number of categories is easily extendible. The ’s are clustered according to the values of , , , and . In particular, we let


where the cluster- and food item-specific probability vectors a priori for , , and . Although and are conditionally independent given , dependence is induced in marginalizing out the global cluster index , as shown in expression 6 below.

We assume the binary allocation vectors, are independent and identically distributed given with probability of allocation . We model each subpopulation with a Beta-Bernoulli process,


The hyperparameters control the overall weight given to each local component (deviated food item) within each subpopulation. We let as a default to place equal probability a priori on the global and local components, while allowing substantial uncertainty.

For the global clustering process, we assume an overfitted finite mixture model (Rousseau and Mengersen, 2011), which greatly simplifies computation relative to the LPP. Let be a conservative upper bound on the number of clusters (say, K=50). Then we have


For the local clustering process, we use a parallel formulation, letting


The induced likelihood for the dietary indicators for subject conditionally on and but marginalizing out and is given by


3 Posterior Computation and Inference

We propose a simple Gibbs sampler for posterior computation.

  1. Update the global component indicators , where

    for each subject with respective subpopulation index .

  2. Update global cluster index , from its multinomial distribution where

  3. Update local cluster index for all and , repeating for each , from its multinomial distribution conditional on where

  4. Update the global clustering weights

  5. Update the local clustering weights in subpopulation ,

  6. Update the multinomial parameters, where is a flat, symmetric Dirichlet hyperparameter preset at 1

  7. Update .

  8. Update Beta-Bernoulli hyperparameter: .

4 Simulation Example

We use this section to explore the performance of existing methods with the newly proposed RPC via a simulation study. The simulation was designed to mimic the NBDPS dietary dataset containing several subpopulations. Three global patterns were created from categorical variables, with four response levels each. These patterns were defined to be distinctly unique such that no two global patterns shared the same response level for the same variable. Global profile 1 had a response level of 3 for the first 25 variables and a response level of 1 for the last 25 variables, , where is the defined response level of variable in global profile . Global profile 2 had a response level of 2 for the first ten variables and a response level of 4 for all remaining variables, . Global profile 3 had a response level of 1 for the first ten variables, a response level of 2 for the subsequent 20 variables, and a response level of 3 for the remaining twenty variables, . Local clusters were defined by permuting a subset of responses to differ from subpopulation to subpopulation.

Observed variables were randomly drawn from a multinomial distribution, such that if subject belongs to Global profile 2, . The desired variable response for a given global pattern was favored with a heavier probability weight compared to all other possible responses. Globally allocated variables had modal response weights where , all other possible responses were given equal weight, .

Each subpopulation contained a subset of variables that were designated to locally deviate from their assigned global profile. The probability of allocation was pre-assigned for each subpopulation , where , for all and

. The decision to maintain it’s global pattern response value was determined by drawing each variable of each subject from a Bernoulli distribution, where

. If , then . Let denote the desired response level of a locally deviated variable belonging to local cluster from subpopulation , then . All other possible responses were given equal weight, .

A total of 500 replicate datasets were created to evaluate model reproducibility and validity. Each dataset contained 28 different simulated patterns (3 strictly global, 25 global/local hybrids) across ten subpopulations. Nine subpopulations were created with 1200 subjects drawn equally across the three global pattern densities. The tenth subpopulation was created with 1600 subjects containing only two of the three global pattern densities, 400 and 1200 subjects respectively. Each of the ten subpopulations varied in the number of variables allocated to a local cluster. The variables that deviated to a local cluster pattern were consistent across all global clusters in that subpopulation. This served to evaluate the model’s robustness in distinguishing between globally and locally allocated variables.

Each replicate was analyzed under 5 different models: (1) traditional LCA model with four classes (tLCA) (Lazarsfeld and Henry, 1968), (2) Dirichlet process mixture model (DPM) (Dunson and Xing, 2009), (3) over-fitted finite mixture model of classes (oFMM) (Rousseau and Mengersen, 2011), (4) local partition process model (LPP) (Dunson, 2009), and (5) the newly proposed robust profile clustering model (RPC). All of these methods being compared to the RPC ignore subpopulations, but we include them still to compare sparsity and specificity of identifying the true number of clusters in the overall population. Estimation was performed using an MCMC Gibbs sampler of 20,000 runs after 5,000 burn-in. Parameter estimation of large sample studies tend to gravitate to a preferred node and remain there for subsequent iterations. To encourage mixing in the various methods, label switching moves were imposed to prevent multimodality and allow subjects to explore all possible nodes. The random permutation sampler was applied for finite mixture model cases (tLCA, oFMM, RPC) (Frühwirth-Schnatter, 2001). The DPM and LPP, which involve a Dirichlet process mixture model had two label switching moves imposed to favor swapping of both equal and unequal size clusters (Papaspiliopoulos and Roberts, 2008).

Mixing efficiency and convergence were evaluated using trace plots of respective concentration parameters and randomly selected variables. Parameters were relabelled using the Stephens’ label switching method (Stephens, 2000). Dirichlet hyperparameters of the mixture component weights in the models built from a finite mixture model (models 1,3,5) were preset to , where is the preset maximum number of clusters allowed in the model. The concentration hyperparameter of models containing a Dirichlet process (2,4) was preset to 1. Parameters estimated from models 1-4 were sampled in accordance with algorithms presented in their respective literature (Nylund et al., 2007; Dunson and Xing, 2009; Rousseau and Mengersen, 2011; Dunson, 2009).

The cluster patterns identified from each respective method were derived using the posterior median of cluster density parameter estimates. Each cluster density contained a vector of posterior probabilities of a subject responding at a given level within that cluster. The response level containing the maximum probability of each variable was designated as the modal cluster pattern response. Let

, denote the modal pattern response of variable for cluster , such that , where denotes a posterior estimate of model-derived parameter . Heat maps were used to determine concordance of model cluster patterns to the true cluster patterns. Concordance was measured as the sum of variables containing matching modal outcomes with the truth divided by the total number of variables. The expected outcome should identify each subpopulation containing the 3 global clusters, with unique deviation patterns, with the exception of subpopulation 10. That subpopulation should identify one global cluster pattern and one uniquely local cluster pattern.

4.1 Results

All models were run using MATLAB (version 2017a). Examination of trace plots indicated quick convergence of model parameters. Comparison of model performance is summarized in Table 1. The tLCA had the shortest computational run time. The LPP and RPC had the longest computation run times. The RPC showed considerable sparseness in deriving the number of global nonempty clusters with a median of 5 compared to the other models. The DPM and oFMM derived an excess number of nonempty global clusters. The LPP derived a median of 15 nonempty clusters and tLCA filled all four of its predefined clusters.

Model Computational Number of Clusters
Time Median IQR Small clusters
tLCA 1.5 hrs 4 (4, 4) 0
DPM 11 hrs 17 (17, 18) 14
oFMM 12.3 hrs 22 (22, 23) 19
LPP 24.2 hrs 15 (15,16) 11
RPC 22 hrs 5 (4, 5) 2
Table 1: Simulation results of models. A small cluster is defined as any cluster containing less than 5% of the population size

All of the models identified the true global patterns, but differed in specificity, as extra patterns were identified as global. The oFMM and DPM models showed the lowest level of specificity, with a median of 14 and 19 additional clusters, respectively. These clusters were notably small in size (cluster weights ). The LPP identified a median of 11 additional clusters, all of which were less than 5% in size. The RPC identified a median of 2 additional clusters, which was less than 5%. The fourth extraneous cluster identified in the tLCA was comparably larger in size with a median cluster weight of 10%.

The parameter in the RPC model illustrated the weight of each subpopulation to the model’s global or localized components. Lower values indicate a strong representation of global components, whereas higher values indicate a stronger representation of local components. For example, the tenth subpopulation that had an oversized sample of a completely local pattern, had a considerably higher value compared to the other nine subpopulations that did not exceed 3.

5 Analysis of National Birth Defects Prevention Study Data

5.1 Multivariate Categorical Dietary Data

The National Birth Defects Prevention Study is an ongoing multi-state population-based, case-control study of birth defects in the United States (Yoon et al., 2001). Infants were identified via birth defect surveillance in Arkansas, California, Iowa, Massachusetts, New Jersey, New York, Texas, Georgia, North Carolina, and Utah. We focus in this analysis on dietary habits of control participants. Participants for this study included mothers with expected due dates from 1997 to 2009, totaling 9747 controls. Controls were defined as any live-born infant without any birth defects and were randomly selected from birth certificates or hospital records. Following prior analyses, subjects were excluded who had multiple births, a prior history of birth defects, preexisting diabetes, or folate antagonist medication use from three months before pregnancy to the end of pregnancy. In accordance with NBDPS standard practices, mothers with daily energy intake below the 2.5th and above the 97.5th percentiles were also excluded to prevent inclusion of unlikely intake data. After exclusion criteria were applied, a total of 9010 controls were included for this analysis.

Food consumption was measured in grams per day and calculated by multiplying frequency of consumption by the standard portion size for each food item as outlined in Sotres-Alvarez et al. (2013). Single food items are listed in the FFQ. As a result, respondents are prone to over-estimate or under-estimate total intake (Jønneland et al., 1991; Haraldsdottir, 1993). To control for this, percentiles were computed by dividing an individual food item consumed over the total food items consumed, using grams per day as the consumption metric. Grams per day of a food item was determined by multiplying the standard portion size listed on the FFQ and the self-reported frequency consumed. Distribution of food items showed a spike at zero, which is well known in the literature (Kipnis et al., 2009; Zhang et al., 2011). Keeping with common dietary analysis practices, these percentiles were aggregated into four relative consumption levels: no consumption, low consumption (0-33% consumed), medium consumption (33-66% consumed), and high consumption (66-100% consumed). A total of food items were included in the study with four consumption levels fit into a categorical distribution.

Given the standard mixture model setup, where is the membership indicator of subject to cluster ,


where represents the probability of a subject with consumption level from food item given allocation to dietary cluster . We transition to the RPC model, where is split into two variables: and . The former represents the probability of a subject having a consumption level of from food item , given allocation to global dietary profile . The latter representing the probability of a subject from subpopulation having a consumption level from food item given allocation to local dietary cluster . Similarly, would be split into and variables, where is the membership indicator of subject for the global variables and is the membership indicator of subject for deviated variable .

The cluster-specific parameters were each drawn from a flat, symmetric Dirichlet distribution, , , where . The hyperparameters of the Beta-Bernoulli process component of the RPC were drawn from a gamma prior, . To encourage a flat prior setup of the Beta-Bernoulli process, was set to 1 for simplicity.

5.2 MCMC Performance

As performed in simulation, sampling was performed with an MCMC run of 20,000 iterations, after a 5,000 burn-in. Given the tendency, acknowledged in the simulation study, of parameters gravitating to a preferred node and remaining there for subsequent iterations in large samples, the random permutation sampler was applied to encourage mixing (Frühwirth-Schnatter, 2001). Furthermore, the overfitted model is also prone to generating extraneous and redundant clustering (van Havre et al., 2015)

. Redundancies were removed by creating a posterior pairwise comparison matrix based on the full MCMC output. Hierarchical clustering was then performed using the complete linkage approach, restricting to the median number of nonempty clusters larger than 5% in size

(Krebs et al., 1989; Medvedovic and Sivaganesan, 2002). This threshold was determined from the simulation study in order to focus on identifying the clusters of global interest. The trace plots of and

showed a good mixing and rapid convergence. All model parameters were estimated by calculating the posterior median and 95% credible intervals.

5.3 NBDPS Results

The previous models discussed in this paper were also fit using the NBDPS data. Similar to the simulation results, the DPM and oFMM models derived an excessive amount of nonempty clusters. The DPM derived 19 nonempty clusters, of these 11 of them were considered small. We define a small cluster as any nonempty cluster less than 0.05 but larger than 0.005. The oFMM derived 20 nonempty clusters, of these ten of them were classified as small. The tLCA model required post hoc testing. The BIC indicates 15 clusters is the preferred model, which is excessive in handling interpretability and power. Of the 15 clusters fit in the LCA model, 2 of these clusters were classified as small. The LPP model derived 9 nonempty clusters, none of which were classified as small. As illustrated previously, none of these existing methods were able to distinguish deviating behaviors by a known subpopulation. Yet, a subpopulation effect was evident as all of the nonempty clusters identified in each of these methods did not see representation in all 10 states included in the dataset.

Figure 1: Heat map showing modal consumption level of RPC global profiles. Blue - no consumption, yellow - low consumption, orange - medium consumption, red - high consumption.

The RPC model identified a total of seven nonempty global cluster patterns. A heat map illustrating the patterns of the global profiles is provided in Figure 1. Global profile behaviors were described by foods most likely to have a given consumption level within each cluster. Figure 2 illustrates the top five foods with the highest probability of having a given consumption level for each global profile. Note some foods were listed under multiple consumption levels. This demonstrates similar modalities for more than one consumption compared to the other foods. The greater of these modalities is reflected in the consumption map of Figure 1.

Figure 2: Top 5 foods most likely to be consumed at each level of global profiles

The distribution of these profiles by each subpopulation is illustrated in Figure 3. Global profile 3 (gray), which was the largest cluster, was the most prominent profile in all states except for Arkansas and California. This profile indicated a high consumption of chicken, cheese, chicken, and beef. In Arkansas, global profile 5 (light blue) was most prominent, which had a high consumption of soda, tea, potato chips and french fries. Texas had a strong representation of global profile 4 which had a high consumption of foods commonly found in a ‘Tex-Mex’ style diet (tortillas, refried beans, salsa).

Figure 3: Frequency distribution of global profiles by subpopulation (state) for the global profiles identified in the model

While the food patterns found at the global level were shared amongst subjects from different subpopulations. Unique behaviors were more pronounced within each state at the local level. A subset of foods were found to deviate to the subpopulation level for all ten states. A food variable with a tendency to deviate in a given subpopulation, if . Eight foods were found to share a tendency to deviate in favor of non consumption in all ten states (butter, squash, beef liver, beef tongue, coffee, tea, diet soda, folate cereal). Four foods had a tendency to not deviate to any of the local clusters, and remain in accordance with the global profiles (pork, french fries, potato chips, wheat bread). All other food variables with varying consumption patterns by state, are illustrated in Figure 4. These foods at the local level were isolated into a heavily weighted single cluster for each subpopulation.

Figure 4: Heat map illustrating foods with a tendency to deviate from the global profile , by subpopulation. Blue - no consumption, yellow - low consumption, orange - medium consumption, red - high consumption.

States with a high demographic of Latino populations (Texas and California) showed a tendency for a high consumption of foods commonly found in a latino diet (tortillas, refried beans, salsa, avocados), whereas states, such as Massachusetts, New Jersey, and New York showed a tendency of no consumption of those foods. Another highly discriminating food was spaghetti. Spaghetti is a food item that is consumed at a high level for those subjects in global profile 3. This was the most populous global profile in North Carolina and Utah, yet subjects from those states were likely to deviate in favor of a low or medium consumption, respectively. Similarly, subjects from Massachusetts and New Jersey that may not have been allocated to that global profile (i.e. global profile 2 and global profile 5) still favor a high consumption of spaghetti. Utah subjects had a tendency to deviate in favor of a high consumption of fruits (apples, bananas, fruit cocktail) and dairy (egg, reduced milk).

6 Discussion

The RPC method provides a convenient and informative population-based model that is able to adapt and account for potential deviations occurring within subpopulations. This was evident in the application of the NBDPS dietary data. Many preconceived notions of regional food trends were identified in this model. It calls attention to the overgeneralization of dietary practices in the United states as regionally indigenous foods can sometimes be drivers of diet characterization. With an extension to case-control analysis, this method will be able to identify how deviations in a national or state diet can associate with various health outcomes.

Within each subpopulation, deviating items favored a heavier mass into a single cluster. This may be due to the relatively smaller sample size and more homogeneity that exists with a single subpopulation. Clustering at the global level removes information that is not pertinent to the overall population. That information is subjugated to the subpopulation level where it can assume a separate local clustering of individual deviated variables. On occasion, some foods may share a local deviation in favor of non-consumption across all subpopulations. In this case, the global profile reflected the patterns of subjects that shared consumption behaviors with those from other subpopulations, while all non-consumers were localized to the state level.

Specificity of global profiles can be a concern when a subpopulation contains a highly deviant local cluster pattern. The simulation study indicated that these local cluster patterns can be misclassified as a single global pattern, or hybridized with other highly deviant local clusters to a global profile. While it does provide the necessary information about the characteristics of that pattern, it may not be able to differentiate that pattern from the rest of the general population discriminately. Further research is needed to address these concerns.

Appendix A: Supporting Figure

Using one of the replicates from the simulation study, the figure below provides a comparative illustration of variables that were identified as deviating from the global cluster pattern by each subpopulation. The figure on the left shows the variables that deviated as a result of the RPC model and the figure on the right illustrates the true deviated variables. Discrepancies in the predicted and actual results were found in subpopulations 7 and 10. These subpopulations contained patterns where all variables deviated from the global pattern. The RPC model misspecified the completely local pattern as a global pattern for subpopulation 7. Most of the deviation was identified as deviating to local for subpopulation 10. However, some locally allocated variables overlapped with the globally expected value and were therefore considered global in those cases.

Figure 5: Simulation example comparing variables that were allocated to the global cluster set (gray) and local cluster set (black). Side-by-side figure compares RPC-derived results (left) with true deviation results (right).


An example dataset to illustrate the RPC method all source code, data, and expected output are publicly available at

RPC source code:

MATLAB code to perform the RPC method and supporting files can be found at

Simulated data set:

Simulated example data set to illustrate RPC method is available here.

Expected Output:

Execution of the RPC example should produce the following output found at


  • Dunson (2009) Dunson, D. B. (2009). Nonparametric Bayes Local Partition Models for Random Effects. Biometrika, 96(2):249–262.
  • Dunson and Xing (2009) Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes Modeling of Multivariate Categorical Data. Journal of the American Statistical Association, 104(487):1042–1051.
  • Figueiredo and Jain (2002) Figueiredo, M. A. T. and Jain, A. K. (2002). Unsupervised Learning of Finite Mixture Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381–396.
  • Ford et al. (2010) Ford, J. D., Elhai, J. D., Connor, D. F., and Frueh, B. C. (2010). Poly-victimization and Risk of Posttraumatic, Depressive, and Substance use Disorders and Involvement in Delinquency in a National Sample of Adolescents. Journal of Adolescent Health, 46(6):545–552.
  • Frühwirth-Schnatter (2001) Frühwirth-Schnatter, S. (2001). Markov Chain Monte Carlo Estimation of Classical and Dynamic Switching and Mixture Models. Journal of the American Statistical Association, 96(453):194–209.
  • Haraldsdottir (1993) Haraldsdottir, J. (1993). Minimizing Error in the Field: Quality Control in Dietary Surveys. European Journal of Clinical Nutrition (United Kingdom).
  • Hu et al. (2017) Hu, J., Reiter, J. P., and Wang, Q. (2017). Dirichlet Process Mixture Models for Modeling and Generating Synthetic Versions of Nested Categorical Data. Bayesian Analysis, TBA(TBA):1–18.
  • Jønneland et al. (1991) Jønneland, A., Overvad, K., Haraldsdóttir, J., Bang, S., Ewertz, M., and Jensen, O. M. (1991). Validation of a Semiquantitative Food Frequency Questionnaire Developed in Denmark. International Journal of Epidemiology, 20(4):906–912.
  • Kant (2004) Kant, A. K. (2004). Dietary Patterns and Health Outcomes. Journal of the American Dietetic Association, 104(4):615–635.
  • Keshteli et al. (2015) Keshteli, A. H., Feizi, A., Esmaillzadeh, A., Zaribaf, F., Feinle-Bisset, C., Talley, N. J., and Adibi, P. (2015). Patterns of Dietary Behaviours Identified by Latent Class Analysis are Associated with Chronic Uninvestigated Dyspepsia. British Journal of Nutrition, 113(05):803–812.
  • Kipnis et al. (2009) Kipnis, V., Midthune, D., Buckman, D. W., Dodd, K. W., Guenther, P. M., Krebs-Smith, S. M., Subar, A. F., Tooze, J. A., Carroll, R. J., and Freedman, L. S. (2009). Modeling Data with Excess Zeros and Measurement Error: Application to Evaluating Relationships Between Episodically Consumed Foods and Health Outcomes. Biometrics, 65(4):1003–1010.
  • Krebs et al. (1989) Krebs, C. J. et al. (1989). Ecological Methodology. Technical report, Harper & Row New York.
  • Lazarsfeld and Henry (1968) Lazarsfeld, P. and Henry, N. (1968). Latent Structure Analysis. Houghton, Mifflin.
  • Medvedovic and Sivaganesan (2002) Medvedovic, M. and Sivaganesan, S. (2002). Bayesian Infinite Mixture Model Based Clustering of Gene Expression Profiles. Bioinformatics, 18(9):1194–1206.
  • Miller and Harrison (2013) Miller, J. W. and Harrison, M. T. (2013). A Simple Example of Dirichlet Process Mixture Inconsistency for the Number of Components. In Advances in neural information processing systems, pages 199–206.
  • Miller and Harrison (2016) Miller, J. W. and Harrison, M. T. (2016). Mixture Models with a Prior on the Number of Components. Journal of the American Statistical Association, TBA(just accepted).
  • Motulsky et al. (1989) Motulsky, A. G. et al. (1989). Diet and Health: Implications for Reducing Chronic Disease Risk. National Academies.
  • Nylund et al. (2007) Nylund, K. L., Asparouhov, T., and Muthén, B. O. (2007). Deciding on the Number of Classes in Latent Class Analysis and Growth Mixture Modeling: A Monte Carlo Simulation Study. Structural Equation Modeling, 14(4):535–569.
  • Papaspiliopoulos and Roberts (2008) Papaspiliopoulos, O. and Roberts, G. O. (2008). Retrospective Markov Chain Monte Carlo Methods for Dirichlet Process Hierarchical Models. Biometrika, 95(1):169–186.
  • Petrone et al. (2009) Petrone, S., Guindani, M., and Gelfand, A. E. (2009). Hybrid Dirichlet Mixture Models for Functional Data. Journal of the Royal Statistical Society: Series B, 71(4):755–782.
  • Rodriguez et al. (2008) Rodriguez, A., Dunson, D. B., and Gelfand, A. E. (2008). The Nested Dirichlet Process. Journal of the American Statistical Association, 103(483):1131–1154.
  • Rousseau and Mengersen (2011) Rousseau, J. and Mengersen, K. (2011). Asymptotic Behaviour of the Posterior Distribution in Overfitted Mixture Models. Journal of the Royal Statistical Society: Series B, 73(5):689–710.
  • Silverwood et al. (2011) Silverwood, R. J., Nitsch, D., Pierce, M., Kuh, D., and Mishra, G. D. (2011). Characterizing Longitudinal Patterns of Physical Activity in Mid-adulthood Using Latent Class Analysis: Results from a Prospective Cohort Study. American Journal of Epidemiology, 174(12):1406.
  • Smith and Khaled (2012) Smith, M. S. and Khaled, M. A. (2012). Estimation of Copula Models with Discrete Margins via Bayesian Data Augmentation. Journal of the American Statistical Association, 107(497):290–303.
  • Sotres-Alvarez et al. (2010) Sotres-Alvarez, D., Herring, A. H., and Siega-Riz, A. M. (2010). Latent Class Analysis is Useful to Classify Pregnant Women into Dietary Patterns. The Journal of Nutrition, 140(12):2253–2259.
  • Sotres-Alvarez et al. (2013) Sotres-Alvarez, D., Siega-Riz, A. M., Herring, A. H., Carmichael, S. L., Feldkamp, M. L., Hobbs, C. A., Olshan, A. F., et al. (2013). Maternal Dietary Patterns are Associated with Risk of Neural Tube and Congenital Heart Defects. American Journal of Epidemiology, 177(11):1279–1288.
  • Stephens (2000) Stephens, M. (2000). Dealing with Label Switching in Mixture Models. Journal of the Royal Statistical Society, Series B, 62(4):795–809.
  • Subar et al. (2001) Subar, A. F., Thompson, F. E., Kipnis, V., Midthune, D., Hurwitz, P., McNutt, S., McIntosh, A., and Rosenfeld, S. (2001). Comparative Validation of the Block, Willett, and National Cancer Institute Food Frequency Questionnaires the Eating at America’s Table Study. American Journal of Epidemiology, 154(12):1089–1099.
  • Teh (2006) Teh, Y. W. (2006). A Hierarchical Bayesian Language Model Based on Pitman-Yor Processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992. Association for Computational Linguistics.
  • Teh et al. (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566–1581.
  • van Havre et al. (2015) van Havre, Z., White, N., Rousseau, J., and Mengersen, K. (2015). Overfitting Bayesian Mixture Models with an Unknown Number of Components. PloS one, 10(7):e0131739.
  • Venkaiah et al. (2011) Venkaiah, K., Brahmam, G., and Vijayaraghavan, K. (2011). Application of Factor Analysis to Identify Dietary Patterns and Use of Factor Scores to Study their Relationship with Nutritional Status of Adult Rural Populations. Journal of Health, Population, and Nutrition, 29(4):327–338.
  • Yoon et al. (2001) Yoon, P. W., Rasmussen, S. A., Lynberg, M., Moore, C., Anderka, M., Carmichael, S., Costa, P., Druschel, C., Hobbs, C., Romitti, P., et al. (2001). The National Birth Defects Prevention Study. Public Health Reports, 116(1 suppl):32–40.
  • Zhang et al. (2004) Zhang, J., Ghahramani, Z., and Yang, Y. (2004).

    A Probabilistic Model for Online Document Clustering with Application to Novelty Detection.

    Advances in Neural Information Processing Systems, 4:1617–1624.
  • Zhang et al. (2011) Zhang, S., Midthune, D., Guenther, P. M., Krebs-Smith, S. M., Kipnis, V., Dodd, K. W., Buckman, D. W., Tooze, J. A., Freedman, L., and Carroll, R. J. (2011). A New Multivariate Measurement Error Model with Zero-inflated Dietary Data, and its Application to Dietary Asssessment. The Annals of Applied Statistics, 5(2B):1456–1487.