# Generalized probabilistic principal component analysis of correlated data

Principal component analysis (PCA) is a well-established tool in machine learning and data processing. tipping1999probabilistic proposed a probabilistic formulation of PCA (PPCA) by showing that the principal axes in PCA are equivalent to the maximum marginal likelihood estimator of the factor loading matrix in a latent factor model for the observed data, assuming that the latent factors are independently distributed as standard normal distributions. However, the independence assumption may be unrealistic for many scenarios such as modeling multiple time series, spatial processes, and functional data, where the output variables are correlated. In this paper, we introduce the generalized probabilistic principal component analysis (GPPCA) to study the latent factor model of multiple correlated outcomes, where each factor is modeled by a Gaussian process. The proposed method provides a probabilistic solution of the latent factor model with the scalable computation. In particular, we derive the maximum marginal likelihood estimator of the factor loading matrix and the predictive distribution of the output. Based on the explicit expression of the precision matrix in the marginal likelihood, the number of the computational operations is linear to the number of output variables. Moreover, with the use of the Matérn covariance function, the number of the computational operations is also linear to the number of time points for modeling the multiple time series without any approximation to the likelihood function. We discuss the connection of the GPPCA with other approaches such as the PCA and PPCA, and highlight the advantage of GPPCA in terms of the practical relevance, estimation accuracy and computational convenience. Numerical studies confirm the excellent finite-sample performance of the proposed approach.

## Authors

• 9 publications
• 15 publications
• ### Bayesian nonparametric Principal Component Analysis

Principal component analysis (PCA) is very popular to perform dimension ...
09/17/2017 ∙ by Clément Elvira, et al. ∙ 0

• ### XPCA: Extending PCA for a Combination of Discrete and Continuous Variables

Principal component analysis (PCA) is arguably the most popular tool in ...
08/22/2018 ∙ by Clifford Anderson-Bergman, et al. ∙ 6

• ### Multivariate time-series modeling with generative neural networks

Generative moment matching networks (GMMNs) are introduced as dependence...
02/25/2020 ∙ by Marius Hofert, et al. ∙ 9

• ### Optimal Latent Representations: Distilling Mutual Information into Principal Pairs

Principal component analysis (PCA) is generalized from one to two random...
02/09/2019 ∙ by Max Tegmark, et al. ∙ 0

• ### Generalized Matrix Factorization

Unmeasured or latent variables are often the cause of correlations betwe...
10/06/2020 ∙ by Łukasz Kidziński, et al. ∙ 13

• ### Consistent estimation of high-dimensional factor models when the factor number is over-estimated

A high-dimensional r-factor model for an n-dimensional vector time serie...
11/01/2018 ∙ by Matteo Barigozzi, et al. ∙ 0

• ### Divide-and-Conquer: A Distributed Hierarchical Factor Approach to Modeling Large-Scale Time Series Data

This paper proposes a hierarchical approximate-factor approach to analyz...
03/26/2021 ∙ by Zhaoxing Gao, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Principal component analysis (PCA) is one of the oldest and most widely known approaches for dimension reduction. It has been used in many applications, including exploratory data analysis, regression, time series analysis, image processing, and functional data analysis. The most common solution of the PCA is to find a linear projection that transforms the set of original correlated variables onto a projected space of new uncorrelated variables by maximizing the variation of the projected space (Jolliffe, 2011). This solution, despite its wide use in practice, lacks a probabilistic description of the data.

A probabilistic formulation of the PCA was first introduced by Tipping and Bishop (1999), where the authors considered a Gaussian latent factor model, and then obtained the PCA (principal axes) as the solution of a maximum marginal likelihood problem, where the latent factors were marginalized out. This approach, known as the probabilistic principal component analysis (PPCA), assumes that the latent factors are independently distributed following a standard normal distribution. However, the independence assumption of the factors is usually too restrictive for many applications, where the variables of interest are correlated between different inputs, e.g. times series, image studies, and spatial statistics. A few efforts have been made to extend the latent factor model to incorporate the dependent structure of the factors in the literature. For example, the linear model of coregionalization (LMC) is studied in modeling multivariate outputs of spatially correlated data (Gelfand et al., 2004, 2010), where each factor is modeled by a Gaussian process (GP) to account for the spatial correlation in the data. When the factor loading matrix is shared, the LMC becomes a semiparameteric latent factor model, introduced in machine learning literature (Seeger et al., 2005; Alvarez et al., 2012) and is widely applied in emulating computationally expensive computer models with multivariate outputs (Higdon et al., 2008; Fricker et al., 2013)

, where each factor is modeled by a GP over a set of inputs such as the physical parameters and initial condition of the partial differential equations. However, the PCA solution is no longer the maximum marginal likelihood estimator of the factor loading matrix when the factors at two inputs are correlated.

In this work, we propose a new approach called generalized probabilistic principal component analysis (GPPCA), as an extension of the PPCA for the correlated output data. We assume each column of the factor loading matrix is orthonormal for the identifiability purpose. Based on this assumption, we obtain a closed-form solution for the maximum marginal likelihood estimation of the factor loading matrix when the covariance function of the factor processes is shared. This result is an extension of the PPCA for the correlated factors, and the connection between these two approaches is studied. When the covariance functions of the factor processes are different, the maximum marginal likelihood estimation of the factor loading matrix is equivalent to an optimization problem with orthogonal constraints, sometimes referred as the Stiefel manifold. A fast numerical search algorithm for the optimization problem on the Stiefel manifold is introduced by (Wen and Yin, 2013) to find the numerical solution.

There are several approaches for estimating the factor loading matrix for the latent factor model and semiparameteric latent factor model in the Frequentist and Bayesian literature. One of the most popular approaches for estimating the factor loading matrix is PCA (see e.g., Bai and Ng (2002); Bai (2003); Higdon et al. (2008)

). Under the orthonormality assumption for the factor loading vectors, the PCA can be obtained from the maximum likelihood estimator of the factor loading matrix. However, the correlation structure of each factor is not incorporated for the estimation. In

Lam et al. (2011) and Lam and Yao (2012), the authors considered estimating the factor loading matrix based on the sample covariance of the output data at the first several time lags when modeling high-dimensional time series. We will numerically compare our approach to the aforementioned Frequentist approaches.

Bayesian approaches have also been widely studied for factor loading matrix estimation. West (2003) points out the connection between PCA and a class of generalized singular g-priors, and introduces a spike-and-slab prior that induces the sparse factors in the latent factor model assuming the factors are independently distributed. Another prior that induces the sparsity is introduced by Bhattacharya and Dunson (2011) under the independent assumptions of the factors, and its asymptotic behaviors are also discussed. Nakajima and West (2013); Zhou et al. (2014) introduce a method to directly threshold the time-varying factor loading matrix in Bayesian dynamic linear models. When modeling spatially correlated data, priors are also discussed for the spatially varying factor loading matrices (Gelfand et al., 2004) in LMC. The closed-form marginal likelihood obtained in this work is more computationally feasible than the previous results, as the inverse of the covariance matrix is shown to have an explicit form.

Our proposed method is also connected to the popular kernel approach, which has been used for nonlinear component analysis (Schölkopf et al., 1998) by mapping the output data to a high-dimensional feature space through a kernel function. This method, known as the kernel PCA, is widely applied in various problems, such as the image analysis (Mika et al., 1999)

(Hoffmann, 2007). However, the main focus of our method is to apply the kernel function for capturing the correlation of the outputs at different inputs (e.g. the time point, the location of image pixels or the physical parameters in the PDEs).

We highlight a few contributions of this paper. First of all, we derive the closed-form maximum marginal likelihood estimator (MMLE) of the factor loading matrix, when the factors are modeled by GPs. Note our expression of the marginal likelihood (after integrating out the factor processes) is more computationally feasible than the previous result, because the inverse of the covariance matrix is shown to have an explicit form, which makes the computational complexity linear to the number of output variables. Based on this closed-form marginal likelihood, we are able to obtain the MMLE of the other parameters, such as the variance of the noise and kernel parameters, and the predictive distribution of the outcomes. Our second contribution is that we provide a full probabilistic analysis when some covariates are included in the mean structure of the factor model. The previous work on PPCA and LMC

(Tipping and Bishop, 1999; Higdon et al., 2008) typically subtracts the empirical mean of data before estimating the factor loading matrix, which does not fully addresses the uncertainty quantification when the output is linearly dependent on some covariates. Here we model the mean structure and manage to marginalize out the parameters in the mean structure to obtain a closed-form expression of the marginal likelihood for estimating the factor loading matrix without increasing the computational complexity. Our real data application examples demonstrate the improvements in terms of better prediction accuracy when the mean structure is incorporated in the data analysis. Lastly, the proposed GPPCA estimators of the factor loading matrix are closely connected to the PCA and PPCA, and we will discuss how the correlation in the factors affects the estimators of the factor loading matrix and predictive distributions. Both the simulated and real examples show the improved accuracy in estimation and prediction, if the output data are correlated.

The rest of the paper is organized as follows. The main results of the closed-form marginal likelihood and the maximum marginal likelihood estimator of the factor loading matrix are introduced in Section 2.1. In Section 2.2, we provide the maximum marginal likelihood estimator for the noise parameter and kernel parameters, after marginalizing out the factor processes. Section 2.3 discusses the estimators of the factor loading matrix and other parameters when some covariates are included in the model. The comparison between our approach and other approaches in estimating the factor loading matrix is studied in Section 3, with a focus on the connection between GPPCA and PPCA. Simulation results are provided in Section 4, for both the correctly specified and mis-specified models with unknown noise and covariance parameters. Two real data examples are shown in Section 5 and we conclude this work with discussion on several potential extensions in Section 6.

## 2 Main results

We state our main results in this section. In section 2.1, we derive a computationally feasible expression of the marginal distribution for the latent factor model after marginalizing out the factor processes, based on which we show the maximum marginal likelihood estimator of the factor loading matrix. In Section 2.2, we discuss the parameter estimation and predictive distribution. We extend our method to study the factor model by allowing the intercept and additional covariates in the mean structure in Section 2.3.

### 2.1 Generalized probabilistic principal component analysis

To begin with, let be a -dimensional real-valued output vector at a -dimensional input vector . Let be a matrix of the observations at inputs . In this subsection and the next subsection, we assume that each row of the is centered at zero.

Consider the following latent factor model

 y(x)=Az(x)+ϵ, (1)

where is a vector of the independent normally distributed noises, with being the identity matrix. The factor loading matrix relates the -dimensional outputs to a -dimensional factor processes , where .

In many applications, each output is correlated. For example, model (1) is widely used in analyzing multiple time series, where ’s are correlated across different time points for every . Model (1) is also used for analyzing multivariate outputs from spatially correlated data, often referred as the linear model of coregionalization (LMC) (Gelfand et al., 2010). In these studies, each factor is modeled by a zero-mean Gaussian process (GP), meaning that for any set of inputs , follows a multivariate normal distribution

 ZTl∼MN(0,Σl), (2)

where the entry of is parameterized by a covariance function for and . We defer the discussion of the kernel in the Section 2.2.

Denote the vectorization of the output and the latent factor matrix at inputs . After marginalizing out , follows a multivariate normal distribution as follows,

 Yv∣A,σ20,Σ1,...,Σd∼MN(0,d∑l=1Σl⊗(alaTl)+σ20Ink). (3)

The form in (3) appeared in the previous literature (e.g. Gelfand et al. (2004)) and its derivation is given in Appendix B. However, directly computing the marginal likelihood by expression (3) may be expensive, as the covariance matrix is . A computationally more feasible form of the marginal likelihood will be introduced in Lemma 5.

Note that the model (1) is unchanged if one replaces the pair by

for any invertible matrix

. As pointed out in Lam et al. (2011), only the -dimensional linear subspace of , denoted as , can be uniquely identified, since for any invertible matrix . Due to this reason, we assume the columns of in model (1) are orthonormal for identifiablity purpose (Lam et al., 2011; Lam and Yao, 2012).

###### Assumption 1
 ATA=Id. (4)

Note the Assumption 1 can be relaxed by assuming where is a positive constant which can potentially depend on , e.g. . As each factor process has the variance , typically estimated from the data, we thus derive the results based on Assumption 1 herein. We first give the marginal distribution of with the explicit form for the inverse of the covariance matrix in Lemma 5 .

Under Assumption 1, the marginal distribution of in model (1) is the multivariate normal distribution as follows

 Yv∣A,σ20,Σ1,...,Σd∼MN⎛⎝0,σ20(Ink−d∑l=1(σ20Σ−1l+In)−1⊗(alaTl))−1⎞⎠, (5)

for .

Our new expression (5) of the marginal likelihood is computationally more feasible than the expression (3), as the inverse of the covariance matrix of is derived explicitly in (5). Based on the marginal likelihood in (5), we derive the maximum marginal estimation of where the covariance matrix for each latent factor is assumed to be the same as in Theorem 7 below.

For model (1), assume . Under Assumption 1, after marginalizing out , the likelihood function is maximized when

 ^A=UR, (6)

where is a matrix of the first

principal eigenvectors of

 G=Y(σ20Σ−1+In)−1YT, (7)

and is an arbitrary orthogonal rotation matrix.

By Theorem 7, the solution is not unique because of the arbitrary rotation matrix. However, the linear subspace of the column space of the estimated factor loading matrix, denoted by , is uniquely determined by (6).

In general, the covariance function of each factor can be different. We are able to express the maximum marginal likelihood estimator as the solution to an optimization problem with the orthogonal constraints, stated in Theorem 2.1.

Under Assumption 1, after marginalizing out , the maximum marginal likelihood estimator of in model (1) is

where

 Gl=Y(σ20Σ−1l+In)−1YT. (9)

The subset of matrices that satisfies the orthogonal constraint is often referred as the Stiefel manifold. Unlike the case where the covariance of each factor processes is shared, no closed-form solution of the optimization problem in (2.1) has been found. A numerical optimization algorithm that preserves the orthogonal constraints in (8) is introduced in Wen and Yin (2013). The main idea of their algorithm is to find the gradient of the objective function in the tangent space at the current step, and iterates by a curve along the projected negative descent on the manifold. The curvilinear search is applied to find the appropriate step size that guarantees the convergence to a stationary point. We implement this approach to numerically optimize the marginal likelihood to obtain the estimated factor loading matrix in Theorem 2.1.

We call the method of estimating in Theorem 7 and Theorem 2.1 the generalized probabilistic principal component analysis (GPPCA) of correlated data, which is a direct extension of the PPCA in Tipping and Bishop (1999). Although both approaches obtain the maximum marginal likelihood estimator of the factor loading matrix, after integrating out the latent factors, the key difference is that in GPPCA, the latent factors at different inputs are allowed to be correlated, whereas the latent factors in PPCA are assumed to be independent. A detailed numerical comparison between our method and other approaches including the PPCA will be given in Section 3.

Another nice feature of the proposed GPPCA method is that the estimation of the factor loading matrix can be applied to any covariance structure of the factor processes. In this paper, we use kernels to parameterize the covariance matrix as an illustrative example. There are many other ways to specify the covariance matrix or the inverse of the covariance matrix, such as the Markov random field and the dynamic linear model, and these approaches are readily applicable in our latent factor model (1).

### 2.2 Parameter estimation and predictive distribution

The probabilistic estimation of the factor loading matrix depends on the variance of the noise and the covariances of the factor processes. We discuss the estimation of these parameters by assuming that the covariances of the factors are parameterized by a product of the kernel functions for demonstration purposes. We also obtain the predictive distribution of the data in this subsection. The probabilistic estimation of the factor loading matrix in the GPPCA can be also applied when the covariances of the factors are specified or estimated in other ways.

For a function with a -dimensional input, a product kernel is often used to model the covariance function (Sacks et al., 1989), meaning that for the th factor,

 σ2lKl(xa,xb)=σ2lp∏m=1Klm(xam,xbm), (10)

for any input and , where is a one-dimensional kernel function of the th factor that models the correlation of the th coordinate of any two inputs.

Some widely used one-dimensional kernel functions include the power exponential kernel and the Matérn kernel. For any two inputs , the Matérn kernel is

 Klm(xam,xbm)=12νlm−1Γ(νlm)(|xam−xbm|γlm)νlmKνlm(|xam−xbm|γlm), (11)

where is the gamma function and is the modified Bessel function of the second kind with a positive roughness parameter and a nonnegative range parameter for and . The Matérn kernel contains a wide range of different kernel functions. In particular, when , the Matérn kernel becomes the exponential kernel, , and the corresponding factor process is the Ornstein-Uhlenbeck process, which is a continuous autoregressive process with order 1. When , the Matérn kernel becomes the Gaussian kernel, i.e., , where the factor process is infinitely differentiable. The Matérn kernel has a closed-form expression when . For example, the Matérn kernel with has the following form

 Klm(xam,xbm)=(1+√5|xam−xbm|γlm+5|xam−xbm|23γ2lm)exp(−√5|xam−xbm|γlm), (12)

for any inputs and with and . In this work, we use the Matérn kernel in (12) for the simulation and real data analysis for demonstration purposes. Specifying a sensible kernel function depends on real applications and our results in this work apply to all commonly used kernel functions. We will also numerically compare different approaches when the kernel function is misspecified in Section 4.2.

We denote as the signal’s variance to noise ratio (SNR) for the th factor process, as a transformation of in (10). Furthermore, let the correlation matrix of the th factor process be with the th term being . After this transformation, the estimator of in Theorems 7 and 2.1 becomes a function of the parameters and . Under Assumption 1, after marginalizing out , the maximum likelihood estimator of becomes a function of , and as

 ^σ20 =^S2nk, (13)

where . Ignoring the constants, the likelihood of and by plugging and satisfies

 L(τ,γ∣Y,^A,^σ20)∝{d∏l=1|τlKl+In|−1/2}|^S2|−nk/2. (14)

A derivation of Equation (14) is given in the Appendix. Since there is no closed-form expression for the parameter estimates in the kernels, one often numerically maximizes the Equation (14) to estimate these parameters

 (^τ,^γ):=argmax(τ,γ)L(τ,γ∣Y,^A,^σ20). (15)

After obtaining and from (13) and (15), respectively, we transform the expressions back to get the estimator of as

 ^σ2l=^τl^σ20,

for . Since both the estimator of and in Theorem 7 and 2.1 can be expressed as a function of , in each iteration, one can use the Newton’s method (Nocedal, 1980) to find based on the likelihood in (14), after plugging the estimator of and .

We have a few remarks regarding the expressions in (13) and (15). First, under Assumption 1, the likelihood of in (14) can also be obtained by marginalizing out using the objective prior , instead of maximizing over .

Second, consider the first term at the right hand side of (13). As each row of has a zero mean, let be the sample covariance matrix for . One has , where is the

th eigenvalue of

. The second term at the right hand side of (13) is the variance explained by the projection. In particular, when the conditions in Theorem 7 hold, i.e. , one has , where is the th largest eigenvalues of . The estimation of the noise is then the average variance being lost in the projection. Note that the projection in the GPPCA takes into account the correlation of the factor processes, whereas the projection in the PPCA assumes the independent factors. This difference makes the GPPCA more accurate in estimating the subspace of the factor loading matrix when the factors are correlated, as shown in various numerical examples in Section 4.

Thirdly, although the model in (1) is regarded as a nonseparable model (Fricker et al., 2013), the computational complexity of our algorithm is the same with that for the separable model (Gu and Berger, 2016; Conti and O’Hagan, 2010). Instead of inverting an covariance matrix, the expression of the likelihood in (14) allows us to proceed in the same way when the covariance matrix for each factor has a size of . The number of computational operations of the likelihood is at most , which is much smaller than the for inverting an covariance matrix, because one often has . When the input is one-dimensional and the Matérn kernel in (11) is used, the computational operations are only for computing the likelihood in (14) without any approximation (see e.g. Hartikainen and Sarkka (2010)). There is an R package in CRAN that implements the fast algorithm (Gu, 2019).

Note that the estimator in (14) is known as the Type II maximum likelihood estimator, which is widely used in estimating the kernel parameters. When the number of the observations is small, the estimator in (14) is not robust, in the sense that the estimated range parameters can be very small or very large, which makes the covariance matrix either a diagonal matrix or a singular matrix. This might be unsatisfactory in certain applications, such as emulating computationally expensive computer models (Oakley, 1999). An alternative way is to use the maximum marginal posterior estimation that prevents the two unsatisfying scenarios of the estimated covariance matrix. We refer to Gu et al. (2018b) and Gu (2018) for the theoretical properties of the maximum marginal posterior estimation and an R package is available on CRAN (Gu et al. (2018a)).

Given the parameter estimates, we can also obtain the predictive distribution for the outputs. Let be the th kernel function after plugging the estimates and let be the estimator of the covariance matrix for the th factor, where the element of is , with and . We have the following predictive distribution for the output at any given input.

Under the Assumption 1, for any , one has

 Y(x∗)∣Y,^A,^γ,^σ2,^σ20∼MN(^μ∗(x∗),^Σ∗(x∗)),

where

 ^μ∗(x∗)=^A^z(x∗), (16)

with , with , for , and

 (17)

with being a diagonal matrix, and its th diagonal term, denoted as , has the following expression

 ^Dl(x∗)=(^σ2l^Kl(x∗,x∗)+^σ20−^ΣTl(x∗)(^Σl+^σ20In)−1^Σl(x∗)),

for .

Next we give the posterior distribution of in Corollary 2.2.

[Posterior distribution of ] Under the Assumption (1), the posterior distribution of is

 (AZ∣Y,^A,^γ,^σ2,^σ20)∼MN(^A^Z,^σ20d∑l=1^Dl⊗^al^aTl),

where , , and , for .

The Corollary 2.2 is a direct consequence of Theorem 2.2, so the proof is omitted. Note that the uncertainty of the parameters and the factor loading matrix are not taken into consideration for predictive distribution of in Theorem 2.2 and the posterior distribution of in Corollary 2.2, because of the use of the plug-in estimator for

. The resulting posterior credible interval may be narrower than it should be when the sample size is small to moderate. The uncertainty in

and other model parameters can be obtained by a full Bayesian analysis with a prior placed on these parameters. This is a possible future working direction based on the computationally feasible marginal likelihood approach we proposed in this paper.

### 2.3 Mean structure

In many applications, the outputs are not centered at zero. For instance, Bayarri et al. (2009) and Gu and Berger (2016) studied emulating the height of the pyroclastic flow generated from TITAN2D computer model, where the flow volume in the chamber is positively correlated to height of the flow at each spatial coordinate. Thus, modeling the flow volume as a covariate in the mean function typically improves the accuracy of the emulator. When is not centered around zero, one typically normalizes each row of when modeling each factor process by a GP in computer emulation and calibration (Higdon et al., 2008; Paulo et al., 2012)

. The full Bayesian inference of the regression parameters are discussed in coregionalization models of multivariate spatially correlated data (see e.g.

Gelfand et al. (2004)

) using the Markov Chain Monte Carlo (MCMC) algorithm, but the computation may be too complex to implement in many studies.

Here we derive the closed-form expression of the marginal likelihood and predictive distribution of the GPPCA, by marginalizing out the regression parameters using the objective prior. Consider the latent factor model with a mean structure for a -dimensional output vector at the input ,

 y(x)=(h(x)B)T+Az(x)+ϵ, (18)

where is known mean basis function related to input and possibly other covariates, is a matrix of the mean (or trend) parameters. The mean parameters could be different for each row of the outcomes, and is a vector of the independent normally distributed noises, with being the identity matrix.

For any set of inputs , we assume follows a multivariate normal distribution

 ZTl∼MN(0,Σl), (19)

where the entry of is parameterized by a kernel function for and .

Denote . We have the following lemma for the marginal likelihood estimator of the variance. Consider an objective prior . Under Assumption 1, after marginalizing out and , the maximum likelihood estimator for is

 ^σ20=S2Mk(n−q),

where . Moreover, the marginal density of the data satisfies

 p(Y∣A,τ,γ,^σ20) ∝{d∏l=1|τlKl+In|−1/2∣∣HT(τlKl+In)−1H∣∣−12}∣∣S2M∣∣−(k(n−q)2). (20)

Under Assumption 1, the likelihood for in (20) can also be obtained by marginalizing out both and using the objective prior , instead of maximizing over .

Since there is no closed-form expression for the parameters in the kernels, one can numerically maximize the Equation (20) to estimate and other parameters.

 ^A =argmaxAd∑l=1aTlGl,Mal,s.t.ATA=Id, (21) (^τ,^γ) =argmax(τ,γ)p(Y∣^A,τ,γ). (22)

When , the closed-form expression of can be obtained similarly in Theorem 7. In general, we can use the approach in Wen and Yin (2013) for solving the optimization problem in (22). After obtaining and , we transform them to get for .

Let be a matrix with the -term as , where is the kernel function after plugging the estimator for . We have the following predictive distribution for the outputs at any given input.

Under the Assumption 1, after marginalizing out and by the objective prior , the predictive distribution of model (18) for any is

 Y(x∗)∣Y,^A,^γ,^σ2,^σ20∼MN(^μ∗M(x∗),^Σ∗M(x∗)).

Here

 (23)

where , , with , and with , for . Moreover,

 (24)

where is a diagonal matrix with the th diagonal term, denoted by , having the following expression

 ^Dl,M(x∗) =^σ2l^Kl(x∗,x∗)+^σ20−^ΣTl(x∗)~Σ−1l^Σl(x∗) +(hT(x∗)−HT~Σ−1l^Σl(x∗))T(HT~Σ−1lH)−1(hT(x∗)−HT~Σ−1l^Σl(x∗)),

with for .

In Theorem 2.3, the estimated mean parameters are , which could be used for inferring the trend of some given covariates (e.g. the gridded temperature example in Section 5.2).

Denote where and are two vectors of dimensions and (), respectively. Assuming the same conditions in Theorem 2.3 hold, if one observes both and , the predictive distribution of follows

 Y2(x∗)∣Y1(x∗),Y,^A,^γ,^σ2,^σ20∼MN(^μ∗M,2|1(x∗),^Σ∗M,2|1(x∗)). (25)

where with and being the first and last entries of ; with , and being the first , last entries in the diagonals and entries in the off-diagonals of , respectively.

## 3 Comparison to other approaches

In this section, we compare our method to various other frequently used approaches and discuss their connections and differences using examples. First of all, note that the MLE of the factor loading matrix under the Assumption 1 is (without marginalizing out ), where is the first ordered eigenvectors of and is an arbitrary orthogonal rotation matrix. This corresponds to the solution of principal component analysis, which is widely used in the literature for the inference of the latent factor model. For example, Bai and Ng (2002) and Bai (2003) assume that and estimate by in modeling high-dimensional time series. The estimation of factor loading matrix by the PCA is also applied in emulating multivariate outputs from a computer model (Higdon et al., 2008)

, where the factor loading matrix is estimated by the singular value decomposition of the standardized output matrix.

The principal axes of the PCA are the same with those obtained from the PPCA, in which the factor loading matrix is estimated by the maximum marginal likelihood, after marginalizing out the independent and normally distributed factors (Tipping and Bishop, 1999). The estimator of the factor loadings is found to be the first columns of , where is a diagonal matrix whose th diagonal term is the th largest eigenvalues of and is an arbitrary orthogonal rotation matrix.

The PPCA gives a full probabilistic model of the PCA by modeling via independent normal distributions. However, when outputs are correlated across different inputs, modeling the factor processes as independent normal distributions along with the latent factor model in (1) may not be a good sampling model in some applications. In comparison, in GPPCA the factors are allowed to be correlated; and we marginalize the factors out to estimate to take account for the uncertainty. This is why our approach can be viewed as a generalized approach of the PPCA for the correlated data.

The second observation is that the estimation of the factor loading matrix in the PCA or PPCA typically assumes the data are standardized. However, the standardization process could cause a loss of information and the uncertainty in the standardization is typically not considered. This problem is also resolved by GPPCA, where the intercept and other covariates can be included in the model and the mean parameters can be marginalized out in estimating the factor loading matrix, as discussed in Section 2.3.

Next we illustrate the difference between the GPPCA and PCA using Example 3.

The data is sampled from the model (1) with the shared covariance matrix , where is equally spaced from to and the kernel function is assumed to follow (12) with and . We choose , and . Two scenarios are implemented with and , respectively. The parameters are assumed to be unknown and estimated from the data.

Note the linear subspace spanned from the column space of estimated loading matrix by the PCA or PPCA is the same, which is , even though the form of these estimators are slightly different. Thus we only compare the GPPCA to the PCA in Figure 1 where

is a two-dimensional vector generated from a uniform distribution on the Stiefel manifold

(Hoff, 2018). The signal to noise ratio (SNR) is and for the upper and lower panels in Figure 1, respectively.

From Figure 1, we observe that when the SNR is large, two rows of the outputs are strongly correlated, as shown in the upper left panel, with the empirical correlation being around between two rows of the output . The estimated subspaces by the PCA and GPPCA both match the true equally well in this scenario, shown in the upper right panel. When the variance of the noise gets large, the outputs are no longer very correlated. For example, the empirical correlation between two simulated output variables is only around . As a result, the angle between the estimated subspace and the column space of by the PCA is large, as shown in the right lower panel.

The GPPCA by Theorem 7 essentially transforms the output by , graphed in the middle panels, where with and being a matrix of eigenvectors and a diagonal matrix of the eigenvalues from the eigendecomposition of , respectively, where variance parameter and kernel parameter are estimated by the MMLE discussed in Section 2.2. The two rows of the transformed outputs are strongly correlated, shown in the middle panels. The empirical correlation between two rows of the transformed outputs graphed in the lower panel is about , even though the variance of the noise is as large as the variance of the signal. By Theorem 7, the subspace by the GPPCA is equivalent to the first eigenvector of the transformed output for this example, and it is graphed as the blue dashed curves in the right panels. The estimated subspace by the GPPCA is very close to the truth in both scenarios, even when the variance of the noise is large in the second scenario.

For PCA, the mean of the outputs is typically estimated by the maximum likelihood estimator , where (Bai and Ng, 2002). In Figure 2, the PCA estimation of the mean for Example 3 is graphed as the red curves and the posterior mean of the output in the GPPCA in Corollary 2.2 is graphed as the blue curves. The PCA underestimates the variance of the noise and hence has a large estimation error. In comparison, the estimated mean of the output by the GPPCA is more accurate, as the correlation in each output variable is properly modeled through the GPs of the latent factors.

Note that we restrict to satisfy when simulating data examples in Figure 1. In practice, we find this constraint only affects the estimation of the variance parameter in the kernel, , because the meaning of this parameter changes.

There are some other estimators of the factor loading matrix in modeling high-dimensional time series. For example, Lam et al. (2011); Lam and Yao (2012) estimate the factor loading matrix of model (1) by , where is the sample covariance at lag of the output and is fixed to be a small positive integer. This approach is sensible, because is shown to be spanned from under some reasonable assumptions, where is the underlying lag- covariance of the outputs. It is also suggested in Lam and Yao (2012) to estimate the latent factor by , meaning that the mean of the output is estimated by . This estimator and the PCA are both included for comparison in Section 4..

## 4 Simulated examples

In this section, we numerically compare different approaches studied before. We use several criteria to examine the estimation. The first criterion is the largest principal angle between the estimated subspace and the true subspace . Let be the principal angles between and , recursively defined by

subject to

 ||a||=||^a||=1,aTai=0,^aT^ai=0,i=1,...,d−1,

where denotes the norm. The largest principal angle is , which quantifies how close two linear subspaces are. When two subspaces are identical, all principal angles are zero. When the columns of the and form orthogonal bases of the and , the cosine of the largest principal angle is equal to the smallest singular value of (Björck and Golub, 1973; Absil et al., 2006; Reynkens, 2018). Thus the largest principal angle can be calculated efficiently through the singular value decomposition of .

We numerically compare four approaches for estimating . The first approach is the PCA, which estimates by , where is the first eigenvectors of . Note the other version of the PCA and the PPCA both have the same largest principal angle between the estimated subspace of and the true subspace of , so the results are omitted. The GPPCA is the second approach. When the covariance of the factor processes is the same, the closed-form expression of the estimator of the factor loading matrix is given in Theorem 7. When the covariance of the factor processes is different, we implement the optimization algorithm in Wen and Yin (2013) that preserves the orthogonal constraints to obtain the maximum marginal likelihood estimation of the factor loading matrix in Theorem 2.1. In both cases, the estimator