1 Introduction
Factor models are routinely used to characterize high-dimensional, correlated data. The most commonly used factor model for multivariate data is
(1.1) $y_i = \Lambda \eta_i + \epsilon_i, \quad \eta_i \sim N_k(0, I_k), \quad \epsilon_i \sim N_p(0, \Sigma),$
where it is assumed that the data $y_i \in \mathbb{R}^p$ are centered prior to analysis, the $\eta_i \in \mathbb{R}^k$ are latent factors, $\Lambda$ is a $p \times k$ factor loadings matrix, and $\Sigma$ is a diagonal matrix of residual variances. Then the overall covariance matrix of the data is $\Omega = \Lambda \Lambda^\top + \Sigma$.
We are particularly motivated to analyze phthalate concentrations in blood or urine across demographic groups, collected in the National Health and Nutrition Examination Survey (NHANES). Exposures to phthalates are ubiquitous. They are present in soft plastics, including vinyl floors, toys and food packaging. Medical supplies such as blood bags and tubes contain phthalates. They are also found in fragrant products such as soap, shampoo, lotion, perfume and scented cosmetics. Because metabolites of a single parent compound or chemicals in the same family are often moderately to highly correlated, it is common to attempt to identify a small set of underlying factors (for example, weissenburger2017principal). Epidemiologists are interested in interpreting these factors and in using them in analyses relating exposures to health outcomes. However, our collaborators have noticed a tendency to estimate very different factors when applying factor analysis to similar datasets or groups of individuals. For reliable interpretability and generalizability, it is important to reduce this brittleness, and it may be desirable to have a method that can produce a single set of factors that hold across groups.
There has been work on developing factor analysis methods that are robust in the presence of outliers (pison2003robust). However, our motivation is different. Previous work extending factor analysis to multiple groups has focused on estimation of shared covariance structure. Simple methods include extensions of PCA to multiple groups, while more recently many customized methods have been developed. A key contribution was JIVE (Joint and Individual Variation Explained), which identifies low-rank covariance structure that is shared among multiple data types collected for the same subject (lock2013joint; feng2015non; feng2018angle). In a Bayesian framework, roy2019bayesian proposed TACIFA (Time-Aligned Common and Individual Factor Analysis) for a related problem. Other methods include MSFA and BMSFA (Bayesian Multi-study Factor Analysis), which include inference over the number of shared latent factors; BMSFA takes a Bayesian approach to sampling and uses a new post-hoc method for recovering the loadings matrix (assmann2016bayesian; de2018bayesian). In these approaches, the total covariance is decomposed into three parts: a common/shared covariance, a group-specific covariance, and an error or residual variance. One drawback of these methods is the computational complexity of determining the number of shared versus group-specific factors.
We aim to identify a single set of factors under the assumption that the data in each group can be aligned to a common latent space via multiplication by a perturbation matrix. We represent the perturbed covariance in group $j$ in terms of a common covariance $\Omega$ and a group-specific perturbation matrix $Q_j$. As in the common factor model in (1.1), the overall covariance can be decomposed into a component due to common factors and a residual variance. The utility of the perturbation model also extends beyond multigroup settings. In the common factor model, the error terms only account for additive measurement error.
We can obtain robust estimates of factor loadings in a single dataset by allowing for observation-specific perturbations. This accounts for both multiplicative and additive measurement error. In this case, we define a separate perturbation matrix $Q_i$ for each data vector $y_i$. Here the $Q_i$'s are multiplicative random effects with mean $I_p$, and the covariance structure on the $Q_i$'s determines the variability of the multiplicative error.
Another challenge in factor analysis is non-identifiability. In particular, due to non-identifiability of the loading matrices and factors of (1.1), posterior samples of the loading matrices are not interpretable without identifiability restrictions (seber2009multivariate; lopes2004bayesian; rovckova2016fast; fruehwirth2018sparse). Instead of imposing arbitrary constraints, with corresponding computational complications, a common technique is post-processing an unconstrained MCMC chain (assmann2016bayesian; roy2019bayesian). assmann2016bayesian calculate a post-processing estimate by solving an Orthogonal Procrustes problem, but without uncertainty quantification (UQ). roy2019bayesian propose post-processing the entire MCMC chain iteratively to be able to draw inference with UQ. The latent factors in our model are assumed to be heteroscedastic, unlike in the usual factor model in (1.1). This removes rotational ambiguity in the loading matrix, except for permutations, and helps to estimate loading matrices much more accurately even without any post-processing. We find this heteroscedastic model can also improve performance in estimating the covariance.
The next section describes the data and the model in detail. In Section 3, prior specifications are discussed. Our computational scheme is outlined in Section 4. We study the performance of our method in different simulation setups in Section 5. Section LABEL:real considers an application to NHANES data. We end with some concluding remarks in Section 6.
2 Data description and modeling
In the National Health and Nutrition Examination Survey (NHANES), chemical levels in urine are recorded over the span 2009 to 2013 for 2749 individuals. We consider in total eight phthalate metabolite chemicals, Mono-n-butyl (MnBP), Mono-isobutyl (MiBP), Mono-ethyl (MEP), Mono-benzyl (MBeP), Mono-2-ethyl-5-carboxypentyl (MECPP), Mono(2-ethyl-5-hydroxyhexyl) (MEHHP), Mono(2-ethyl-5-oxohexyl) (MEOHP) and Mono(2-ethyl)hexyl (MEHP), and 5 racial groups, Group 1: Mexican American (Mex), Group 2: Other Hispanic (OH), Group 3: Non-Hispanic White (NH White), Group 4: Non-Hispanic Black (NH Black) and Group 5: Other Race, including Multi-Racial (Other/Multi). We consider race and ethnicity because previous work has shown differences in patterns of use of products that contain phthalates (taylor2018associations) and in measured phthalate concentrations (james2017racial) across racial or ethnic groups. Excess levels of phthalates in blood/urine have been linked to a variety of health outcomes, including obesity (zhang2014age; kim2014phthalate; benjamin2017phthalates).
Mex  OH  NH White  NH Black  Other/Multi  

Number of participants  566  293  1206  516  168 
MnBP  3.85  4.31  3.56  4.21  3.76 
(1.96)  (4.96)  (2.17)  (1.81)  (2.11)  
MiBP  2.88  3.21  2.52  3.47  2.79 
(1.36)  (1.70)  (1.20)  (1.67)  (1.21)  
MEP  8.32  9.51  6.91  11.16  7.33 
(6.64)  (6.99)  (5.43)  (9.38)  (7.57)  
MBeP  2.79  2.78  2.70  3.06  2.56 
(1.58)  (1.61)  (1.67)  (1.77)  (1.68)  
MECPP  4.80  4.55  4.16  4.36  4.34 
(3.68)  (2.68)  (2.40)  (2.38)  (2.64)  
MEHHP  3.83  3.73  3.45  3.79  3.57 
(3.01)  (2.59)  (2.25)  (2.26)  (2.45)  
MEOHP  3.11  3.00  2.79  3.05  2.86 
(2.41)  (1.92)  (1.69)  (1.71)  (1.86)  
MEHP  1.58  1.59  1.33  1.61  1.58 
(1.23)  (1.29)  (0.89)  (0.99)  (1.31) 
Comparison across different groups in terms of number of participants and average chemical levels along with standard deviations in brackets.
Mex  OH  NH White  NH Black  Other/Multi  

Mex  0.00  31.84  108.59  191.93  28.04 
OH  31.84  0.00  156.45  63.48  28.69 
NH White  108.59  156.45  0.00  409.64  55.51 
NH Black  191.93  63.48  409.64  0.00  101.90 
Other/Multi  28.04  28.69  55.51  101.90  0.00 
We provide a summary of the average chemical level across different groups in Table 1. In Table 2, we compute the Hotelling $T^2$ statistic between each pair of groups. The three groups Mex, OH and Other/Multi seem to be closer to each other than the other two groups. For each phthalate level, we also fit a one-way ANOVA model to determine differences across the groups, taking the group Mex as baseline. The results are included in Tables 6 to 13 in Section "More exploratory analysis on NHANES data". In almost all cases, the effect of the group NH White is significantly different from the others. We plot group-specific covariances in Figure 1. It is evident that there are shared structures with minor differences across the groups. However, the estimated loading matrices are different when we fit the common factor model in (1.1) separately for each group. In the next subsection, we propose a novel methodology for multigroup factor analysis to identify the common factors.
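The pairwise comparisons in Table 2 use the two-sample Hotelling $T^2$ statistic. As a reference for how such a table can be computed, here is a minimal sketch with synthetic data standing in for the NHANES groups (the function and data are illustrative, not the authors' code):

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling T^2 statistic between groups x (n1 x p) and y (n2 x p)."""
    n1, n2 = x.shape[0], y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    # Pooled sample covariance of the two groups
    s_pooled = ((n1 - 1) * np.cov(x, rowvar=False) +
                (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    return (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(s_pooled, diff)

rng = np.random.default_rng(0)
g1 = rng.normal(size=(60, 4))              # stand-in for one group's chemical levels
g2 = rng.normal(loc=0.5, size=(50, 4))     # stand-in for another group
t2 = hotelling_t2(g1, g2)                  # larger values: more separated group means
```

Filling a symmetric matrix of these statistics over all group pairs reproduces the layout of Table 2, with zeros on the diagonal.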
2.1 Multigroup model
Assume that we observe multiple groups, with the “groups” corresponding to different studies, batches of data, or levels of a covariate within a single study. Each $p$-dimensional response $y_{ij}$, belonging to group $j$, for $i = 1, \ldots, n_j$ and $j = 1, \ldots, m$, is modeled as:
(2.1) $Q_j y_{ij} = \Lambda \eta_{ij} + \epsilon_{ij}, \quad \eta_{ij} \sim N_k(0, \Delta), \quad \epsilon_{ij} \sim N_p(0, \Sigma).$
The perturbation matrices $Q_j$ are of dimension $p \times p$, and follow a matrix normal distribution centered at the identity, with diagonal row and column covariances. The latent factors are heteroscedastic, so that $\Delta = \mathrm{cov}(\eta_{ij})$ is diagonal with non-identical entries. We discuss the advantages of choosing heteroscedastic latent factors in more detail in Section 2.3. After integrating out the latent factors, the observations are marginally distributed as mean-zero Gaussians with group-specific covariance. If we write the group-specific loadings as the perturbed shared loadings, then $\Lambda$ is the shared loadings matrix, and $\Lambda_j$ is a group-specific loadings matrix. We can quantify the magnitude of perturbation as $\|Q_j - I_p\|_F$, where $\|\cdot\|_F$ stands for the Frobenius norm. For identifiability of the $Q_j$'s, we center their distribution at the identity and restrict them to small perturbations of $I_p$. We call our model Perturbed Factor Analysis (PFA).
2.1.1 Model properties
Let $A$ and $B$ be two positive definite (p.d.) matrices. Then there exist nonsingular matrices $C$ and $D$ such that $A = CC^\top$ and $B = DD^\top$. By choosing $Q = CD^{-1}$, we have $QBQ^\top = A$. If the two matrices $A$ and $B$ are close, then $Q$ will be close to the identity. However, $Q$ is not required to be symmetric. In our multigroup model in (2.1), the $Q_j$'s allow for small perturbations around the shared covariance matrix. We define the following class for the $Q_j$'s,
$\mathcal{Q}(\delta) = \{Q : \|Q - I_p\|_F \leq \delta\}.$
The index $\delta$ controls the amount of deviation across groups. In (2.1) we set the row and column covariances of the matrix normal distribution of $Q_j$ to be diagonal with identical entries governed by a small $\sigma > 0$, which we call the perturbation parameter. This choice imposes uniform perturbation across all the rows and columns around $I_p$. However, the perturbation matrices themselves are not required to be symmetric.
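To make the perturbation model concrete, the following sketch simulates multigroup data under one plausible reading of (2.1), in which group observations are aligned to the common latent space as $Q_j y_{ij} = \Lambda \eta_{ij} + \epsilon_{ij}$. The i.i.d. Gaussian entries of $Q_j - I_p$ are a simplification of the matrix normal prior, and all dimensions and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, m, n = 8, 3, 5, 200      # variables, factors, groups, per-group sample size
sigma_pert = 0.05              # perturbation parameter (illustrative)

Lam = rng.normal(size=(p, k))                 # shared loadings
Delta = np.diag(rng.uniform(0.5, 2.0, k))     # heteroscedastic factor variances
Sig = np.diag(rng.uniform(0.1, 0.5, p))       # residual variances

groups = []
for j in range(m):
    # Perturbation matrix centered at the identity; entries i.i.d. here
    # as a simplification of the matrix normal prior.
    Q = np.eye(p) + sigma_pert * rng.normal(size=(p, p))
    eta = rng.multivariate_normal(np.zeros(k), Delta, size=n)
    eps = rng.multivariate_normal(np.zeros(p), Sig, size=n)
    # Q_j y_ij = Lam eta_ij + eps_ij  =>  y_ij = Q_j^{-1} (Lam eta_ij + eps_ij)
    y = np.linalg.solve(Q, (eta @ Lam.T + eps).T).T
    groups.append((Q, y))
```

With a small `sigma_pert`, each $Q_j$ stays near the identity, so the groups share most of their covariance structure while differing in the small perturbations.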
Lemma 1.
.
The proof follows from Chebyshev’s inequality. This result allows us to summarize the level of induced perturbation for any given perturbation parameter. Using this lemma, we can show the following result.
Lemma 2.
Therefore, the Kullback–Leibler divergence between the marginal distributions of any two groups $j$ and $j'$ can be bounded, up to some constant, in terms of the perturbation parameter. We define a divergence statistic between groups $j$ and $j'$ as the symmetrized divergence between their marginal distributions. This generates a divergence matrix $D$, where larger entries imply a greater difference between the two groups.
In our multigroup model, unlike other methods, there is no computational complication regarding specification of the ranks of the shared and individual spaces. We can directly rely on current methods for accommodating an unknown number of factors in the single-group setting (bhattacharya2011sparse). Also, our method is computationally faster than other multigroup factor models such as BMSFA, due to the smaller number of parameters and better identifiability, leading to better mixing of Markov chain Monte Carlo (MCMC).
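The divergence matrix $D$ described above can be sketched using the closed-form Kullback–Leibler divergence between mean-zero Gaussians, symmetrized across each pair of groups. The group covariances below are illustrative placeholders, not estimates from the paper:

```python
import numpy as np

def kl_gauss0(s1, s2):
    """KL divergence KL(N(0, s1) || N(0, s2)) between mean-zero Gaussians."""
    p = s1.shape[0]
    _, logdet1 = np.linalg.slogdet(s1)
    _, logdet2 = np.linalg.slogdet(s2)
    return 0.5 * (np.trace(np.linalg.solve(s2, s1)) - p + logdet2 - logdet1)

def divergence_matrix(covs):
    """Symmetrized pairwise divergences D between group covariances."""
    m = len(covs)
    D = np.zeros((m, m))
    for a in range(m):
        for b in range(m):
            D[a, b] = 0.5 * (kl_gauss0(covs[a], covs[b]) +
                             kl_gauss0(covs[b], covs[a]))
    return D

# Illustrative group covariances: identity plus growing common perturbations
covs = [np.eye(3) + 0.1 * j * np.ones((3, 3)) for j in range(3)]
D = divergence_matrix(covs)
```

The resulting matrix is symmetric with zeros on the diagonal, and entries grow as the group covariances move further apart, mirroring the interpretation of $D$ in the text.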
2.1.2 Tuning the parameter for multigroup data
The hyperparameter $\sigma$ controls the level of perturbation across groups. In Section 5, we show that properly tuning $\sigma$ is necessary to obtain accurate estimates of the loading matrix $\Lambda$. We propose a cross-validation technique to choose the optimal $\sigma$. We randomly divide each group into training and test sets. Then, for a range of $\sigma$ values, we fit the model on the training data and calculate the predictive log-likelihood of the test set. After integrating out the latent factors, the predictive distribution is multivariate normal. If there are multiple values of $\sigma$ with similar predictive log-likelihoods, then the smallest $\sigma$ is chosen as optimal.
Alternatively, we can take a fully Bayesian approach and put a prior on $\sigma$. We call this method Fully Bayesian Perturbed Factor Analysis (FBPFA). We find that FBPFA performs similarly to PFA in practice. FBPFA involves a slightly more complex MCMC implementation. PFA avoids sensitivity to the prior for $\sigma$, but requires the computational overhead of cross-validation, can potentially be less efficient in requiring a hold-out sample, and does not characterize uncertainty in estimating $\sigma$.
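The cross-validation rule above, score each candidate $\sigma$ by held-out predictive log-likelihood and take the smallest $\sigma$ among near-ties, can be sketched as follows. Here `fitted_covs` stands in for covariance estimates obtained by fitting the model at each $\sigma$, and the tolerance `tol` is an illustrative choice:

```python
import numpy as np

def gauss_loglik(y, cov):
    """Mean-zero Gaussian log-likelihood of the rows of y under covariance cov."""
    p = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,ij->i', y, np.linalg.solve(cov, y.T).T)
    return -0.5 * (len(y) * (p * np.log(2 * np.pi) + logdet) + quad.sum())

def select_sigma(sigmas, logliks, tol=1.0):
    """Smallest sigma whose predictive log-likelihood is within tol of the best."""
    best = max(logliks)
    return min(s for s, ll in zip(sigmas, logliks) if ll >= best - tol)

rng = np.random.default_rng(2)
y_test = rng.normal(size=(50, 4))            # held-out data (illustrative)
fitted_covs = {0.01: np.eye(4),              # stand-ins for marginal covariances
               0.10: 1.05 * np.eye(4),       # fitted at each candidate sigma
               1.00: 2.00 * np.eye(4)}
lls = [gauss_loglik(y_test, c) for c in fitted_covs.values()]
sigma_hat = select_sigma(list(fitted_covs), lls)
```

In a full implementation, each entry of `fitted_covs` would come from refitting PFA on the training split at that $\sigma$; the selection rule itself is unchanged.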
2.1.3 Posterior consistency
Let the parameter spaces of $\Lambda$, $\Sigma$, the marginal covariance (a positive semidefinite matrix), and the $Q_j$'s be given, with corresponding priors. We restate some of the results from bhattacharya2011sparse for our modified factor model. With minor modifications, the proofs remain the same.
Let be a continuous map such that .
Lemma 3.
For any , we have .
The proof is similar to Lemma 1 of bhattacharya2011sparse. In our Bayesian approach, we choose independent priors for the parameters, and these induce a prior on the marginal covariance through the map above. We also have the following proposition.
Proposition 4.
If , then .
The proof is similar to Proposition 1 of bhattacharya2011sparse with minor modifications. Now, we proceed to establish that the posterior of our multigroup model is weakly consistent under a fixed-$p$, increasing-$n$ regime, with the complete parameter space and the true parameter value defined accordingly.
Assumptions:

For some , the true perturbation matrices are , with defined in Section 2.1.1.

There exists some , and such that and for all .
Theorem 5.
Under Assumptions 1 and 2, the posterior for the model parameters is weakly consistent at the truth.
We first show that our proposed prior has large support, in the sense that the truth belongs to the Kullback–Leibler support of the prior. Thus the posterior probability of any neighbourhood around the truth converges to one in probability, under the true data-generating distribution, as the sample size goes to infinity, as a consequence of schwartz1965bayes. Hence, the posterior is weakly consistent. The proof is in the Appendix.
2.2 Measurement-error model
We can modify the multigroup model to obtain improved factor estimates on a single dataset by considering observation-level perturbations. Here we observe, for each subject, proxies of a true observation $x_i$, contaminated by multiplicative measurement errors given by perturbation matrices $Q_i$. The modified factor model is
(2.2) 
In this model, the $Q_i$'s apply a multiplicative perturbation to each data vector, so that the measurement error is a linear transformation of the true observation. This model is different from the multiplicative measurement error model of sarkar2018bayesian. In sarkar2018bayesian, the observations are modeled through an element-wise product of the true signal with a random error vector, so that the measurement error in a given component depends on the true observation primarily through the corresponding component. In our construction, it depends on the entire vector $x_i$: the measurement errors are a linear function of the entire true observation $x_i$.
This is a much more general setup than sarkar2018bayesian. Identifiability of the distributions of the signal and the perturbations is difficult in general. For simplicity, we again center the perturbations at the identity. In this case, only the diagonal elements of the marginal covariance are not identifiable, and the perturbation parameter does not influence the dependence structure among the variables. Hence, with our heteroscedastic latent factors we can still recover the loading structure. To tune $\sigma$, we can use the multivariate normal marginal distributions to develop a cross-validation technique as in Section 2.1.2 in some settings; when this is not possible, we propose to estimate $\sigma$ as in FBPFA by using a weakly informative prior concentrated on small values, while assessing sensitivity to hyperparameter choice.
2.3 The special case of identity perturbations
For a single-study dataset without any measurement error, we can modify our PFA method by taking the perturbation matrices equal to the identity. Then the model reduces to a traditional factor model with heteroscedastic latent factors:
(2.3) $y_i = \Lambda \eta_i + \epsilon_i, \quad \eta_i \sim N_k(0, \Delta), \quad \epsilon_i \sim N_p(0, \Sigma),$
where $\Delta$ is assumed to be diagonal with non-identical entries. Integrating out the latent factors, the marginal likelihood is that of $N_p(0, \Lambda \Delta \Lambda^\top + \Sigma)$. Except for the diagonal matrix $\Delta$, the marginal likelihood is the same as for a traditional factor model. As $\Delta$ has non-identical diagonal entries, the likelihood is no longer invariant under arbitrary rotations. For the factor model in (1.1), $\Lambda$ and $\Lambda P$ have equivalent likelihoods for any orthonormal matrix $P$. This is not the case in our model unless $P$ is a permutation matrix. Thus, this simple modification of the traditional factor model helps to efficiently recover the true loading structure. This is demonstrated in Case 1 of Section 5. We can also show that the posterior is weakly consistent based on the results established in Section 2.1.3.
3 Prior
As in bhattacharya2011sparse, we put the following multiplicative gamma process shrinkage prior on $\Lambda$ to allow for automatic selection of the rank and easy posterior computation:
$\lambda_{jh} \mid \phi_{jh}, \tau_h \sim N(0, \phi_{jh}^{-1}\tau_h^{-1}), \quad \phi_{jh} \sim \mathrm{Ga}(\nu/2, \nu/2), \quad \tau_h = \prod_{l=1}^{h}\delta_l, \quad \delta_1 \sim \mathrm{Ga}(a_1, 1), \quad \delta_l \sim \mathrm{Ga}(a_2, 1) \text{ for } l \geq 2.$
The parameters $\phi_{jh}$ control local shrinkage of the elements in $\Lambda$, whereas $\tau_h$ controls global shrinkage of the $h$-th column. For $a_2 > 1$, we have $\tau_h$ stochastically increasing with $h$, which imposes greater shrinkage for columns with higher column index.
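A draw from the multiplicative gamma process prior of bhattacharya2011sparse can be sketched as below; the hyperparameter values are illustrative defaults, and the code uses the shape/scale parameterization of the gamma distribution:

```python
import numpy as np

def sample_mgp_loadings(p, k, a1=2.1, a2=3.1, nu=3.0, rng=None):
    """One draw of a p x k loadings matrix under the multiplicative gamma
    process shrinkage prior (Bhattacharya & Dunson, 2011)."""
    if rng is None:
        rng = np.random.default_rng()
    # Column-wise multiplicative gamma process: tau_h = prod_{l<=h} delta_l
    delta = np.concatenate([rng.gamma(a1, 1.0, size=1),
                            rng.gamma(a2, 1.0, size=k - 1)])
    tau = np.cumprod(delta)                           # global column precisions
    phi = rng.gamma(nu / 2.0, 2.0 / nu, size=(p, k))  # local precisions, Ga(nu/2, nu/2)
    lam = rng.normal(scale=1.0 / np.sqrt(phi * tau), size=(p, k))
    return lam, tau

rng = np.random.default_rng(4)
lam, tau = sample_mgp_loadings(p=10, k=6, rng=rng)  # later columns shrink toward zero
```

Because the column precisions $\tau_h$ grow stochastically with $h$ when $a_2 > 1$, higher-index columns are increasingly shrunk toward zero, which is what enables automatic rank selection.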
For the heteroscedastic latent factors, each diagonal element of $\Delta$ has an independent prior governed by a shape constant $d$. In our simulations, we see that $d$ does not influence the predictive performance of our method much. However, as $d$ increases, it puts more shrinkage on the latent factors. We fix $d$ at a default value for most of our simulations. For the residual error variances, we place a weakly informative prior on the diagonal elements of $\Sigma$. In our simulations, the weakly informative IG(0.1, 0.1) prior on these elements works well for obtaining good predictive performance and estimating the loading structure, even for a single-study dataset.
4 Computation
Posterior inference is straightforward with a Gibbs sampler, because all of the parameters have conjugate full conditional distributions. For the model in (2.1), the full conditional of the perturbation matrix $Q_j$ is matrix normal, with mean and covariance available in closed form. The full conditionals for all other parameters are as described in bhattacharya2011sparse, with the data replaced by their perturbed versions. For the model in (2.2), the full conditional of each observation-level perturbation matrix has the same form, and the other parameters can again be updated using the results in bhattacharya2011sparse with the analogous replacement. If we put an inverse-gamma prior on $\sigma^2$, its full conditional is again inverse-gamma for both the model in (2.1) and the model in (2.2).
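When a prior is placed on $\sigma^2$ (as in FBPFA), its update is a standard inverse-gamma step. The sketch below uses a simplification in which each entry of $Q_j - I_p$ is treated as i.i.d. $N(0, \sigma^2)$ a priori; the hyperparameters `a0`, `b0` and all values are illustrative, not the paper's specification:

```python
import numpy as np

def sigma2_full_conditional(Qs, a0=1.0, b0=1.0):
    """Inverse-gamma full conditional for sigma^2, in a simplified setting where
    each entry of Q_j - I is treated as i.i.d. N(0, sigma^2) a priori."""
    p = Qs[0].shape[0]
    n_entries = len(Qs) * p * p
    ss = sum(np.sum((Q - np.eye(p)) ** 2) for Q in Qs)
    a_post = a0 + n_entries / 2.0
    b_post = b0 + ss / 2.0
    return a_post, b_post

def sample_inverse_gamma(a, b, rng):
    """Draw from IG(a, b) via the reciprocal of a Gamma(a, rate=b) draw."""
    return 1.0 / rng.gamma(a, 1.0 / b)

rng = np.random.default_rng(5)
Qs = [np.eye(4) + 0.1 * rng.normal(size=(4, 4)) for _ in range(3)]
a_post, b_post = sigma2_full_conditional(Qs)
sigma2_draw = sample_inverse_gamma(a_post, b_post, rng)
```

Within the Gibbs sampler, this draw would alternate with the updates of the loadings, latent factors, residual variances, and perturbation matrices.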
5 Simulation Study
In this section, we study the performance of our method in various simulation settings. As ground truth, we consider the two loading matrices given in Figure 3. These are similar to the ones considered in rovckova2016fast. The two loading matrices have the same number of columns, namely 5. The first loading matrix has 21 rows and the second one has 128 rows. We compare the estimated loading matrices for different choices of hyperparameters. We investigate the perturbation parameter $\sigma$ and the shape parameter $d$, which controls the level of shrinkage on the diagonal entries of $\Delta$.
For the single-dataset case, we compare with the approach of bhattacharya2011sparse (B&D), which corresponds to the special case of our approach that fixes the perturbation matrices and latent factor covariances equal to the identity. A point estimate of the loading matrix for B&D is calculated by post-processing the posterior samples. We use the algorithm of assmann2016bayesian to rotationally align the samples of $\Lambda$, as used in de2018bayesian. In contrast, our method does not require any post-processing. We simply take the mean of the posterior samples to estimate the loading matrix. For the multigroup case, we compare our estimates with Bayesian Multi-study Factor Analysis (BMSFA), which also requires post-processing to remove rotational ambiguity. We use the MSFA package from https://github.com/rdevito/MSFA.
We compare the different methods using predictive likelihood. Simulated data are randomly divided into training and test sets. After fitting each model on the training data, we calculate the predictive likelihood of the test set. All methods are run for a fixed number of Gibbs sampler iterations after discarding burn-in samples, and each simulation is replicated multiple times.
5.1 Case 1: Single dataset, identity perturbations
We first consider factor analysis of a single dataset. Starting with the two loading matrices in Figure 3, we simulate latent factors from a standard normal distribution, and generate datasets of observations with diagonal residual variance. We compare B&D with our method with the perturbation matrices fixed at the identity, so the only adjustment is the use of heteroscedastic latent factors. Note that the simulated latent factors actually have identical variances.
In Tables 3 and 4, we compare methods by MSE of the estimated versus true covariance matrix. For the first loading matrix, the MSE of B&D FA is dominated by our method across a range of values of $d$.
d  MSE 

0.1  1.04 
10  1.32 
100  1.38 
For the second loading matrix, our method again beats B&D FA across a range of values of $d$.
d  MSE 

0.1  5.21 
10  9.45 
100  10.32 
We compare the estimated and true loading matrices in Figure 4. Our method performs substantially better at estimating the true loading structure than B&D FA. The first five columns of the estimated loadings based on our method are very close to the true loading structure under some permutation. Having heteroscedastic latent factors gives us this improved performance.
Mex  OH  NH White  NH Black  Other/Multi  

Mex  0.00  0.75  0.70  0.59  0.78 
OH  0.75  0.00  0.63  0.75  0.73 
NH White  0.70  0.63  0.00  0.68  0.84 
NH Black  0.59  0.75  0.68  0.00  0.91 
Other/Multi  0.78  0.73  0.84  0.91  0.00 
6 Discussion
Our overarching goal was to develop a less brittle method for estimating factors that hold across similar datasets and groups within a dataset. In many applications, it is important to define a more generalizable set of factors, improving on the current status in which one often obtains dramatically different factors when fitting a factor model separately to different groups or using current additive ANOVA-type factor models. A primary innovation of the article is to incorporate a multiplicative perturbation of the data. As we have shown, this can be included in many different contexts, ranging from meta-analysis, to multiple-group data, to measurement error modeling, and even to obtaining improved performance when there is no group or replicated structure in the data. We have demonstrated exciting improvements in performance in all of these settings.
It is our hope that the proposed method can be used directly by epidemiologists, and scientists in other fields, to obtain more generalizable factors in their applications. With this goal in mind, we have provided code for implementing the proposed approach at https://github.com/royarkaprava/Perturbedfactormodel. To make the method more broadly useful, there are a number of interesting extensions, some trivial and some less so. The first is to adjust for covariates; this can be done very easily, without modifying the fundamental approach, with a simple tweak to the proposed Gibbs sampler. Another important modification is to allow data that are not simply all continuous variables but may be categorical or mixed categorical and continuous. This can rely on an underlying variable model, as is often used in factor analysis (see, for example, carvalho2008high; zhou2015bayesian). In this case, we would incorporate the perturbation in the underlying variables instead of in the observed data directly. Again, this leads to a minor modification of the proposed Gibbs sampling algorithm.
We conjecture that both the perturbation ideas and the idea of including heterogeneous factor variances to improve identifiability will have impact beyond the specific setting we have considered in this paper. For example, one natural extension is to nonlinear factor models, such as Gaussian process latent variable models (lawrence2004gaussian; lawrence2006local) and variational autoencoders (kingma2013auto; pu2016variational).
Appendix
Proof of Theorem 5
Proof.
For the space of probability measures, let the Kullback–Leibler divergences be given by
Let KL denote the Kullback–Leibler divergence
For our model, we have
To prove Theorem 5, we rely on the following Lemma.
Lemma 6.
For any $\epsilon > 0$, there exist choices of the parameters satisfying the stated bounds such that the Kullback–Leibler divergence from the truth is less than $\epsilon$.
Due to continuity of functions such as the determinant and the trace, the above result is immediate following the proof of Theorem 2 in bhattacharya2011sparse. For our proposed priors, the prior probability of the event on the R.H.S. of Lemma 6 is positive. Thus the prior probability of any Kullback–Leibler neighborhood around the truth is positive. This proves Theorem 5. ∎
More exploratory analysis on NHANES data
For each chemical level, we fit a one-way ANOVA model to analyze group-specific effects on each phthalate level separately.
Estimate  Std. Error  t value  Pr(>|t|)  

(Intercept)  3.85  0.11  36.27  0.00 
NH Black  0.36  0.15  2.33  0.02 
NH White  0.29  0.13  2.26  0.02 
OH  0.46  0.18  2.55  0.01 
Other/Multi  0.09  0.22  0.41  0.68 
Estimate  Std. Error  t value  Pr(>|t|)  

(Intercept)  2.88  0.06  49.23  0.00 
NH Black  0.59  0.08  6.97  0.00 
NH White  0.35  0.07  4.98  0.00 
OH  0.33  0.10  3.28  0.00 
Other/Multi  0.08  0.12  0.68  0.50 
Estimate  Std. Error  t value  Pr(>|t|)  

(Intercept)  8.32  0.29  28.80  0.00 
NH Black  2.84  0.42  6.79  0.00 
NH White  1.41  0.35  4.02  0.00 
OH  1.18  0.49  2.40  0.02 
Other/Multi  0.99  0.60  1.64  0.10 
Estimate  Std. Error  t value  Pr(>|t|)  

(Intercept)  2.79  0.07  39.90  0.00 
NH Black  0.27  0.10  2.66  0.01 
NH White  0.09  0.08  1.05  0.29 
OH  0.01  0.12  0.06  0.95 
Other/Multi  0.23  0.15  1.57  0.12 
Estimate  Std. Error  t value  Pr(>|t|)  

(Intercept)  4.80  0.12  41.52  0.00 
NH Black  0.44  0.17  2.63  0.01 
NH White  0.64  0.14  4.56  0.00 
OH  0.25  0.20  1.28  0.20 
Other/Multi  0.46  0.24  1.89  0.06 
Estimate  Std. Error  t value  Pr(>|t|)  

(Intercept)  3.83  0.10  36.85  0.00 
NH Black  0.04  0.15  0.26  0.79 
NH White  0.39  0.13  3.06  0.00 
OH  0.10  0.18  0.56  0.58 
Other/Multi  0.26  0.22  1.19  0.23 
Estimate  Std. Error  t value  Pr(>|t|)  

(Intercept)  3.11  0.08  39.04  0.00 
NH Black  0.06  0.12  0.54  0.59 
NH White  0.33  0.10  3.39  0.00 
OH  0.11  0.14  0.83  0.41 
Other/Multi  0.26  0.17  1.54  0.12 
Estimate  Std. Error  t value  Pr(>|t|)  

(Intercept)  1.58  0.04  35.44  0.00 
NH Black  0.03  0.06  0.44  0.66 
NH White  0.25  0.05  4.57  0.00 
OH  0.01  0.08  0.10  0.92 
Other/Multi  0.00  0.09  0.02  0.99 
7 Acknowledgments
This research was partially supported by grant R01ES02749801A1 from the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH).