Gaussian process (GP) is a non-parametric supervised learning technique that estimates a posterior distribution of predictors from the dataset. In this study, we consider meta-learning for GP. Meta-learning is a framework used to improve the precision of new tasks by estimating a structure of a set of tasks [25, 11, 12]. Most conventional meta-learning methods for GP estimate a prior distribution for new tasks [17, 5, 13, 9]. We propose an extension of principal component analysis for a set of Gaussian process posteriors (GP-PCA) to estimate a low-dimensional subspace on a space of GP posteriors. Since GP-PCA estimates a subspace on a space of GPs, the method can generate GP posteriors for new tasks. Therefore, we can estimate the GP posterior for the new tasks accurately from a small size of the dataset.
To this end, we have to consider a space of GPs with an infinite-dimensional parameter. A structure of a probability space is nontrivial since Euclidean space is inappropriate as a structure of the space. For a finite parametric probability distribution, we can define a structure of its space using information geometry. However, even if we use the information geometry, it is not easy to define a space of GPs.
To overcome this problem, we consider defining the space of GP posteriors under the assumption that GP posteriors have the same prior. Then, we can show that the set of GP posteriors lies on a finite-dimensional subspace in an infinite-dimensional space of GP. By using this fact, we can reduce the task of GP-PCA to a task of estimating a subspace on finite-dimensional space. Additionally, we developed a fast approximation method for GP-PCA using a sparse GP based on variational inference.
The remainder of the paper is organized as follows. In Section 2, we explain the information geometry, principal component analysis for exponential families and GP regression. In Section 3, after defining a set of GP posteriors in terms of information geometry, we propose the GP-PCA and show that the task can be reduced to a finite-dimensional case. In Section 4, we present the related works. In Section 5, we demonstrate the effectiveness of the proposed method. Finally, Section 6 presents the conclusion.
In this section, we explain the information geometry of the exponential family, dimensionality reduction technique on the exponential family, and GP.
2.1 Information geometry of the exponential family
The exponential family is a distribution parameterized by as follows.
In information geometry, a set of is regarded as a Riemannian manifold denoted by . Then, a metric of is defined by Fisher information
and a connection is defined by -connection, where is a parameter of -connection. When , can be regarded as a flat manifold, i.e., curvature and torsion of are zero. When , is a flat manifold defined in a coordinate system by , which is called e-coordinate system. On the other hand, when , is a flat manifold defined in a coordinate system by , which is called m-coordinate system.
There is a bijection between and , and the bijection can be described as Legendre transform. The following equation with respect to and holds.
where and are potential functions of and , respectively. From the equation, and can be mutually transformed by Legendre transformation as follows.
From this fact, and are in a nonlinear relationship in general. Therefore, e-flat manifold does not always become m-flat manifold and vice versa. If a manifold becomes e-flat and m-flat simultaneously, the manifold is called a dually flat manifold. Since holds e-flat and m-flat, is a dually flat manifold.
In a dually flat manifold, we can consider two kinds of linear subspaces: e-flat and m-flat subspaces. Let and be an e-coordinate and m-coordinate of . While an e-flat subspace is defined as a linear combination of , an m-flat subspace is defined as a linear combination of . Let and be e-flat and m-flat subspaces, respectively. Then, and are described as follows.
where . When , and are called e-geodesic and m-geodesic, respectively.
By using and , we define a Kullback-Leibler (KL) divergence between the two points as follows.
We denote the KL divergence by using e-coordinates or m-coordinates depending on the situation, i.e., and . The following theorems show an interesting duality of e-coordinate and m-coordinate.
Theorem 1 (Pythagorean theorem ).
Let , and be points on . If an e-geodesic between and and an m-geodesic between and are orthogonal, i.e., holds. Then, the following relationship holds.
When an m-geodesic between and are orthogonal, is called m-projection from to . Similarly, when an e-geodesic between and are orthogonal, is called e-projection from to . From the Pythagorean theorem, the following theorem holds.
Theorem 2 (Projection theorem).
An m-projection from to uniquely exists and it minimizes . Similarly, an e-projection from to uniquely exists and it minimizes .
2.2 PCA for exponential families
Let be a set of exponential families. Since there are two types of subspaces on : e-flat and m-flat subspaces, we can consider two PCAs for a dataset . One is e-PCA, which estimates an e-flat affine subspace . The other is m-PCA, which estimates an m-flat affine subspace . Although we only explain e-PCA, the same argument holds for m-PCA.
Assume that can be described by basis and offset , where
. It means that using a weight vectorand basis , any point on can be represented as
When is obtained, the task of e-PCA is to estimate and minimizing the following objective function.
Because and minimizing cannot be obtained analytically in general, e-PCA alternatively estimates and using a gradient method. Let and be m-coordinates of and , respectively. We denote matrices of and by and , respectively. The gradients of Eq. (4) with respect to and are given by the following equations.
For multivariate normal distributions, each probabilistic distribution can be parameterized a mean vectorand covariance matrix . Letting and , the e-coordinate can be described as follows:
where , . On the other hand, the m-coordinate can be described as follows:
where , . Then, the transform between and can be described as follows:
2.3 Gaussian process (GP)
First, we present the definition of notations. An output vector of function corresponding to input set is denoted by or . When an input set is denoted with a subscript, such as , the corresponding output vector is also denoted with the subscript such as . Similarly, while a vector of kernel between and is denoted by , a gram matrix between and is denoted by . The treatment of the subscript is the same as a function.
GP is a stochastic process with respect to a function . It is parameterized by the mean function and covariance function . The GP has a marginalization property. It means that a vector corresponding to an arbitrary input set can consistently follow a multivariate normal distribution , where and . Therefore, GP can be regarded as an infinite-dimensional multivariate normal distribution intuitively.
2.1 Gaussian process regression (GPR)
Let and be an input vector and output, respectively. We assume that the relationship between and is denoted as , where is a noise. The task of regression is to estimate a function from an input set and corresponding output vector
. We assume that a likelihood function is a Gaussian distribution with mean
and variance, and the prior distribution is a GP with a mean function and covariance function . For any , is obtained as
Since the posterior distribution is a conditional distribution of given , the mean and covariance function for a new input of the posterior distribution can be obtained by closed form as , where
We can also interpret that the posterior is obtained using Bayes’ theorem. Whenand are observed, the posterior for is derived as follows:
By using , the predictive distribution for new input data is described as follows:
where is a conditional prior. Letting , and are obtained as follows:
Furthermore, the predictive distribution is derived as
Since is a prior distribution, is determined uniquely when is given. By using this property, we define a space of GP posteriors and propose GP-PCA.
3 PCA for Gaussian processes (GP-PCA)
Similar to e-PCA and m-PCA, we consider two types of GP-PCA: GP-ePCA and GP-mPCA. In this study, we only explain the GP-ePCA, but the same argument holds for GP-mPCA. Let be a GP posterior obtained . When a set of posteriors is given, the task of GP-ePCA is to estimate an e-flat subspace minimizing KL divergence between GP posteriors and their corresponding points on the subspace. However, it is nontrivial to define a structure of GPs since GP has an infinite-dimensional parameter.
This study shows that a set of GP posteriors is a finite-dimensional dually flat space under the assumption that each posterior has the same prior and reduces the task of GP-ePCA to a task of estimating a subspace on finite space. To explain our approach, we introduce the two probabilistic spaces shown in Fig 1. One is a space consisting of Gaussian distributions for an output vector corresponding to a training set . The other is a space consisting of Gaussian distributions for an output vector corresponding to a set which is a union of the training input set and an arbitrary test input set . The former is denoted by and the latter is denoted by . Both and are dually flat spaces since they are a set of Gaussian distributions. Note that can be regarded as an infinite-dimensional space since the cardinal of can be any number. In our approach, we define a space consisting of GP posteriors as a subspace on denoted by . Then, we estimate an e-flat subspace on and transform to using an affine map instead of estimating an e-flat subspace on . Since there is no guarantee that is equivalent to , this study proves this.
In this Section, after defining and GP-ePCA, we prove that and are equivalent. Next, we describe the standard algorithm and its sparse approximation algorithm.
3.1 Definition of the structure of GP posteriors and GP-ePCA
Let and be a union set of and test set. We consider estimating given in each task. Then, -th task’s predictive distribution for is derived as . Suppose that GP posteriors have a common prior, is determined uniquely given . From this fact, the affine subspace spanned by GP posteriors is defined as follows:
Let , and be an input set, test set, and a union set of input and test sets, respectively. We denote the size of by . Let be a Gaussian distribution with , where is a pair of -dimensional vector and positive-definite symmetric matrix . Then, a probability space consisting of GP posteriors corresponding to with a common prior is defined by the following equation:
where is a conditional distribution of the prior and is any Gaussian distribution with a parameter . In particular, when holds, i.e., is an empty set, and are denoted by and , respectively.
Satisfying the assumption, is contained in . Let be a parameter of . can be described as follows.
where , , , and . Therefore, we can define a space of GP posteriors as .
Since is a dually flat space, can be represented by e-coordinate and m-coordinate denoted by and , respectively. We denote e-coordinate and m-coordinate for a point on parameterized by as and , respectively. From the definition of , when , holds since and hold. It means that is also a dually flat space. Therefore, we denote e-coordinate and m-coordinate of by and , respectively.
By using the definition of , we define GP-ePCA in the respective spaces of and .
Let be a set of GPs on the . Then, the objective function of GP-ePCA on is defined as follows:
GP-ePCA estimating e-flat submanifold minimizing Eq. (14) is called GP-ePCA. Here, is e-coordinate of denoted by a linear combination of with weight , where .
Similarly, when is observed, we call the ePCA minimizing the following equation GP-ePCA.
Here, and are e-coordinate of and , which is a linear combination of with weight , where .
In this study, we guarantee that GP-ePCA() is equivalent to GP-ePCA() by the following theorem.
3.2 Proof of Theorem 3
The proof of the Theorem 3 is composed of the proof of the following three statements.
For , there is satisfying .
For , , holds.
For a subspace minimizing , holds.
From (S1) and (S2), denoting a subspace minimizing Eq. (15) by , we can prove that also minimizes Eq. (14) in a set of subspaces on . However, since a subspace minimizing Eq. (14) does not always lie on , we confirm this by (S3).
To prove the statements, we present the following Lemmas.
Let be a parameter of . Then, there is an affine map satisfying the following equation.
Let and are two arbitrary parameters, and let us take two points and in , and and in . Then, the following equation holds:
Suppose be a dually flat manifold and be a -dimensional submanifold. If is a dually flat and a set of points , the -dimensional e-flat submanifold minimizing Eq. (14) for is included in when .
Let be a parameter of . Then, there is a linear mapping satisfying the following equation.
The proofs of (S1) and (S2) are obvious from Lemma 1 and Lemma 2. From Lemma 3, (S3) can be proved by showing that is a dually flat for arbitrary test set . When , i.e., the test set is empty, then is a dually flat since . When , by the linear relation proved in Lemma 1 and Lemma 4, the Lemma also holds in the general case. Thus, Theorem 3 is proved.
3.3 Algorithm of GP-ePCA
From the above discussion, GP-ePCA can be reduced to GP-ePCA. In this Section, we explain a concrete algorithm of GP-ePCA.
Let and be a training input and corresponding output dataset of -th task, where is the size of . We denote a union set of the input sets by , i.e., and define the probability space of GP posteriors as Eq. (11). Then, we denote the GP posterior given by . From Theorem 3, the task of GP-ePCA is to estimate a subspace for and transform to .
In training phase, GP-ePCA calculates the and transforms the m-coordinates . The is calculated using Eqs. (12) and (13) and from is transformed from by and . Next, GP-ePCA estimates the subspace using e-PCA. That is, estimating and minimizing Eq. (15) through gradient descent iterations. Algorithm 1 shows the summary of the algorithm.
In the prediction phase, GP-ePCA predict outputs corresponding to a test data using the following equations.
Since this algorithm requires calculating the inverse matrix, the calculation cost of the algorithm becomes , where . Since this algorithm is impractical, we derive a faster approximation below.
3.2 Sparse GP-ePCA
Most sparse approximation methods for GP reduce a calculation cost by approximating the gram matrix for input set using inducing points . Let and be a set of inducing points and gram matrix between inputs. The gram matrix is approximated as
where , . By using this approximation, we consider a set of GPs for instead of a set of GPs for . Denoting the set of GPs for by , the sparse GP-ePCA estimates a subspace on and transforms the subspace to . Then, we reduce the calculation cost of GP-ePCA from to , where is the size of inducing points.
We adopt a sparse GP based on variational inference proposed by Titsias . The variational inference-based sparse GP minimizes the KL-divergence between a true posterior and variational distribution , that is,
Then, the variational distribution minimizing the equation is derived as follows:
where . The predictive distribution for new input is as follows:
We regard and as a parameter of Eq. (11). That is, denoting a parameter of -th task’s variational distribution by , the sparse GP-ePCA estimates a subspace minimizing Eq. (15) for and transforms the subspace to by the affine map .
In practice, to stabilize the sparse GP-ePCA, we re-parametrize as follows:
We denote a space of by . Letting , , and , the following relationships between and hold.
Furthermore, using the equations, we can show the equivalence between the KL-divergence of and that of . That is, for any and , the following equation holds:
From the above relationships, and are isomorphic. Therefore, we estimate a subspace on instead of estimating a subspace on . The algorithm is summarized by algorithm 2.
4 Related works
4.1 Meta-learning and multi-task learning
Meta-learning is a framework that estimates a common knowledge of tasks through similar but different learning tasks and adapts to new tasks [25, 11, 12]. As a framework similar to meta-learning, multi-task learning improves the predictive accuracy of each task by estimating a common knowledge of tasks . Since the approach of the meta-learning for GP is the same as that of the multi-task learning for GP, we explain the conventional meta-learning and multi-task learning methods.
Most conventional meta-learning methods for GP estimate a prior of each task. The simplest approach is to estimate a common prior between tasks, and the prior is estimated based on hierarchical Bayes modeling or deep neural network (DNN)[9, 18, 28, 16]
. The approach models common knowledge between tasks but does not model individual knowledge of each task. As an approach for estimating common and individual knowledge of the tasks, there are feature learning and cross-covariance approaches. In the feature learning approach, meta-learning selects input features in each task by estimating hyperparameters of auto-relevant determination kernel or multi-kernel[20, 22]. In the cross-covariance approach, meta-learning assumes that a covariance function of priors is defined by the Kronecker product of a covariance function of samples and that of tasks and estimates the covariance function of tasks . In geostatistics, the approach is called linear models of coregionalization (LMC) and various methods have been proposed . The conbination of feature learning and cross-covariance approaches has been proposed . Although these approaches estimate common and individual knowledge of the task, they estimate covariance function but do not estimate the mean function. GP-PCA estimates a subspace on a space of GP posteriors. Therefore, GP-PCA enables the estimation of mean and covariance functions of each task, including a new task.
This study interprets meta-learning for GP from the information geometry viewpoint. Transfer learning and meta-learning are often addressed from the information geometry perspective[21, 26, 8]. However, to our best knowledge, there is no research of meta-learning for GP addressed from the information geometry viewpoint.
4.2 Dimension reduction methods for probabilistic distributions
Dimensionality reduction techniques for probability distributions have been proposed in various fields. For example, there are dimension reduction techniques of a set of categorical distributions  and a set of mixture models [2, 7]. Especially, e-PCA and m-PCA are closely related to this study [6, 1]
. e-PCA and m-PCA are proposed in the context of information geometry for the dimension reduction method of a set of exponential distribution families, which becomes the basic framework for conducting this study. This study differs from previous studies in that it deals with GP sets that are infinite-dimensional stochastic processes.
4.3 Functional PCA
GP-PCA can also be interpreted as a functional PCA (fPCA). fPCA is a method for estimating eigenfunctions from a set of functions. Let be a set of functions. fPCA estimates eigenfunctions to minimize the following objective function.
where . In fPCA, each function is represented as a linear combination of basis functions. Let , is obtained as