1 Introduction
A Gaussian process (GP) is a nonparametric supervised learning technique that estimates a posterior distribution over predictors from a dataset [15]. In this study, we consider metalearning for GPs. Metalearning is a framework that improves predictive accuracy on new tasks by estimating the structure shared across a set of tasks [25, 11, 12]. Most conventional metalearning methods for GPs estimate a prior distribution for new tasks [17, 5, 13, 9]. We propose an extension of principal component analysis to a set of Gaussian process posteriors (GPPCA), which estimates a low-dimensional subspace of the space of GP posteriors. Because GPPCA estimates a subspace of the space of GPs, the method can generate GP posteriors for new tasks. Therefore, we can accurately estimate the GP posterior for a new task from a small dataset.
To this end, we have to consider a space of GPs, which have an infinite-dimensional parameter. Defining the structure of a space of probability distributions is nontrivial, since Euclidean space is inappropriate as a structure for it. For a probability distribution with finitely many parameters, we can define the structure of its space using information geometry [4]. However, even with information geometry, it is not easy to define a space of GPs.
To overcome this problem, we define the space of GP posteriors under the assumption that the GP posteriors share the same prior. Then, we can show that the set of GP posteriors lies on a finite-dimensional subspace of the infinite-dimensional space of GPs. Using this fact, we can reduce the task of GPPCA to that of estimating a subspace of a finite-dimensional space. Additionally, we develop a fast approximation method for GPPCA using a sparse GP based on variational inference.
The remainder of the paper is organized as follows. In Section 2, we explain information geometry, principal component analysis for exponential families, and GP regression. In Section 3, after defining a set of GP posteriors in terms of information geometry, we propose GPPCA and show that the task can be reduced to a finite-dimensional case. In Section 4, we review related work. In Section 5, we demonstrate the effectiveness of the proposed method. Finally, Section 6 presents the conclusion.
2 Preliminaries
In this section, we explain the information geometry of the exponential family, dimensionality reduction technique on the exponential family, and GP.
2.1 Information geometry of the exponential family
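The exponential family is a family of distributions parameterized by $\theta$ as follows.
$$p(x;\theta) = \exp\left(\theta^{\top} F(x) - \psi(\theta)\right),$$
where $F(x)$ is the sufficient statistic and $\psi(\theta)$ is the log-partition function.
In information geometry, the set of $p(x;\theta)$ is regarded as a Riemannian manifold denoted by $S$. The metric of $S$ is defined by the Fisher information
$$g_{ij}(\theta) = \mathbb{E}_{p(x;\theta)}\left[\frac{\partial \log p(x;\theta)}{\partial \theta^{i}}\,\frac{\partial \log p(x;\theta)}{\partial \theta^{j}}\right],$$
and a connection is defined by the $\alpha$-connection, where $\alpha$ is a parameter of the connection. When $\alpha = \pm 1$, $S$ can be regarded as a flat manifold, i.e., the curvature and torsion of $S$ are zero. When $\alpha = 1$, $S$ is a flat manifold in the coordinate system given by $\theta$, which is called the e-coordinate system. On the other hand, when $\alpha = -1$, $S$ is a flat manifold in the coordinate system given by $\eta = \mathbb{E}_{p(x;\theta)}[F(x)]$, which is called the m-coordinate system.
There is a bijection between $\theta$ and $\eta$, which can be described as a Legendre transform. The following equation with respect to $\theta$ and $\eta$ holds.
$$\psi(\theta) + \phi(\eta) - \theta^{\top}\eta = 0,$$
where $\psi$ and $\phi$ are the potential functions of $\theta$ and $\eta$, respectively. From this equation, $\theta$ and $\eta$ can be mutually transformed by the Legendre transformation as follows.
$$\eta = \frac{\partial \psi(\theta)}{\partial \theta}, \qquad \theta = \frac{\partial \phi(\eta)}{\partial \eta}.$$
Hence, $\theta$ and $\eta$ are nonlinearly related in general. Therefore, an e-flat manifold is not always m-flat, and vice versa. A manifold that is e-flat and m-flat simultaneously is called a dually flat manifold. Since $S$ is both e-flat and m-flat, $S$ is a dually flat manifold.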
In a dually flat manifold, we can consider two kinds of linear subspaces: e-flat and m-flat subspaces. Let and be an e-coordinate and m-coordinate of . While an e-flat subspace is defined as a linear combination of , an m-flat subspace is defined as a linear combination of . Let and be e-flat and m-flat subspaces, respectively. Then, and are described as follows.
(1)
(2)
where . When , and are called an e-geodesic and an m-geodesic, respectively.
By using and , we define the Kullback-Leibler (KL) divergence between the two points as follows.
(3)
We denote the KL divergence by using e-coordinates or m-coordinates depending on the situation, i.e., and . The following theorems show an interesting duality between the e-coordinate and the m-coordinate.
Theorem 1 (Pythagorean theorem [4]).
Let , and be points on . If an e-geodesic between and and an m-geodesic between and are orthogonal, i.e., holds, then the following relationship holds.
When an m-geodesic between and are orthogonal, is called the m-projection from to . Similarly, when an e-geodesic between and are orthogonal, is called the e-projection from to . From the Pythagorean theorem, the following theorem holds.
Theorem 2 (Projection theorem [4]).
An m-projection from to uniquely exists and it minimizes . Similarly, an e-projection from to uniquely exists and it minimizes .
2.2 PCA for exponential families
Let be a set of exponential families. Since there are two types of subspaces on , namely e-flat and m-flat subspaces, we can consider two PCAs for a dataset . One is ePCA, which estimates an e-flat affine subspace . The other is mPCA, which estimates an m-flat affine subspace . Although we only explain ePCA, the same argument holds for mPCA.
Assume that can be described by basis and offset , where . This means that, using a weight vector and basis , any point on can be represented as
When is obtained, the task of ePCA is to estimate and minimizing the following objective function.
(4) 
Because and minimizing cannot be obtained analytically in general, ePCA alternately estimates and using a gradient method. Let and be the m-coordinates of and , respectively. We denote the matrices of and by and , respectively. The gradients of Eq. (4) with respect to and are given by the following equations.
(5)  
(6) 
where .
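As a minimal sketch of the gradient-based estimation described above, the following NumPy code runs ePCA on Bernoulli distributions instead of the Gaussians considered in this paper, because the Bernoulli e- and m-coordinates (logit and mean) are scalars with simple closed forms. It assumes the parameterization `theta = W @ U + u0` and uses the standard fact that the derivative of KL(p || q) with respect to the e-coordinate of q is the difference of m-coordinates; Eqs. (5) and (6) are assumed to have this form. The function names, step size, and iteration count are hypothetical, and the sketch updates all parameters simultaneously rather than alternately for brevity.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def kl_bernoulli(eta_p, theta_q):
    """Sum of KL(p || q) with p given by means (m-coordinates)
    and q given by logits (e-coordinates)."""
    psi = np.logaddexp(0.0, theta_q)  # potential psi(theta) = log(1 + e^theta)
    phi = eta_p * np.log(eta_p) + (1 - eta_p) * np.log(1 - eta_p)  # dual potential
    return np.sum(psi + phi - eta_p * theta_q)

def epca_bernoulli(eta, n_components=2, lr=0.02, n_iter=1000, seed=0):
    """Gradient-descent ePCA sketch: fit theta = W @ U + u0 to the data eta."""
    rng = np.random.default_rng(seed)
    n, d = eta.shape
    W = 0.1 * rng.standard_normal((n, n_components))  # weights, one row per point
    U = 0.1 * rng.standard_normal((n_components, d))  # basis of the e-flat subspace
    u0 = np.zeros(d)                                  # offset
    losses = []
    for _ in range(n_iter):
        theta = W @ U + u0
        losses.append(kl_bernoulli(eta, theta))
        resid = sigmoid(theta) - eta   # m-coordinate of the fit minus that of the data
        grad_W = resid @ U.T
        grad_U = W.T @ resid
        grad_u0 = resid.sum(axis=0)
        W -= lr * grad_W
        U -= lr * grad_U
        u0 -= lr * grad_u0
    return W, U, u0, np.array(losses)
```

For Gaussian data, the same loop would apply with `sigmoid` replaced by the map from e-coordinates to m-coordinates of the multivariate normal distribution.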
For multivariate normal distributions, each probability distribution can be parameterized by a mean vector and covariance matrix . Letting and , the e-coordinate can be described as follows:
(7)
where , . On the other hand, the m-coordinate can be described as follows:
(8)
where , . Then, the transform between and can be described as follows:
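The coordinate transforms for the multivariate normal distribution can be checked numerically. The sketch below assumes the common convention in which the e-coordinates are (Σ⁻¹μ, −½Σ⁻¹) and the m-coordinates are (μ, Σ + μμᵀ); the convention in Eqs. (7) and (8) may differ by scaling or vectorization. It also evaluates the KL divergence through the two potential functions and compares it with the closed-form Gaussian KL divergence. All function names are hypothetical.

```python
import numpy as np

def gaussian_to_e(mu, Sigma):
    """(mu, Sigma) -> e-coordinates (Sigma^{-1} mu, -1/2 Sigma^{-1}) (assumed convention)."""
    P = np.linalg.inv(Sigma)
    return P @ mu, -0.5 * P

def gaussian_to_m(mu, Sigma):
    """(mu, Sigma) -> m-coordinates (E[x], E[x x^T])."""
    return mu, Sigma + np.outer(mu, mu)

def e_to_m(theta1, theta2):
    Sigma = -0.5 * np.linalg.inv(theta2)
    return gaussian_to_m(Sigma @ theta1, Sigma)

def m_to_e(eta1, eta2):
    return gaussian_to_e(eta1, eta2 - np.outer(eta1, eta1))

def psi(theta1, theta2):
    """Potential (log-partition) function in e-coordinates."""
    Sigma = -0.5 * np.linalg.inv(theta2)
    return 0.5 * theta1 @ Sigma @ theta1 + 0.5 * np.linalg.slogdet(2 * np.pi * Sigma)[1]

def phi(eta1, eta2):
    """Dual potential (negative entropy) in m-coordinates."""
    Sigma = eta2 - np.outer(eta1, eta1)
    return -0.5 * np.linalg.slogdet(2 * np.pi * Sigma)[1] - 0.5 * len(eta1)

def kl_from_potentials(eta_p, theta_q):
    """KL(p || q) = psi(theta_q) + phi(eta_p) - <eta_p, theta_q>."""
    inner = eta_p[0] @ theta_q[0] + np.sum(eta_p[1] * theta_q[1])
    return psi(*theta_q) + phi(*eta_p) - inner

def kl_closed_form(mu_p, S_p, mu_q, S_q):
    """Textbook KL divergence between two multivariate normals."""
    diff = mu_q - mu_p
    return 0.5 * (np.trace(np.linalg.solve(S_q, S_p))
                  + diff @ np.linalg.solve(S_q, diff) - len(mu_p)
                  + np.linalg.slogdet(S_q)[1] - np.linalg.slogdet(S_p)[1])
```

Calling `kl_from_potentials(gaussian_to_m(mu_p, S_p), gaussian_to_e(mu_q, S_q))` should agree with `kl_closed_form(mu_p, S_p, mu_q, S_q)` up to floating-point error, which illustrates why the divergence can be written in either coordinate system.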
2.3 Gaussian process (GP)
First, we introduce the notation. An output vector of a function corresponding to an input set is denoted by or . When an input set is denoted with a subscript, such as , the corresponding output vector is also denoted with the subscript, such as . Similarly, while a vector of kernel values between and is denoted by , a Gram matrix between and is denoted by . Subscripts are treated in the same way as for functions.
A GP is a stochastic process over a function . It is parameterized by the mean function and covariance function . The GP has a marginalization property: a vector corresponding to an arbitrary input set consistently follows a multivariate normal distribution , where and . Therefore, a GP can be intuitively regarded as an infinite-dimensional multivariate normal distribution.
2.3.1 Gaussian process regression (GPR)
Let and be an input vector and output, respectively. We assume that the relationship between and is denoted as , where is a noise. The task of regression is to estimate a function from an input set and corresponding output vector . We assume that the likelihood function is a Gaussian distribution with mean and variance , and the prior distribution is a GP with a mean function and covariance function . For any , is obtained as
Since the posterior distribution is a conditional distribution of given , the mean and covariance function of the posterior distribution for a new input can be obtained in closed form as , where
(9)  
(10) 
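The closed-form posterior can be sketched in a few lines of NumPy. The code below uses the standard GP regression formulas under two assumptions that are not taken from the text: a zero prior mean function and an RBF covariance function. The names `rbf_kernel` and `gp_posterior` and the hyperparameter values are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """RBF covariance between the rows of A (n x d) and B (m x d)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_posterior(X, y, X_star, noise_var=0.1, lengthscale=1.0):
    """Posterior mean and covariance at X_star for a zero-mean GP prior."""
    K = rbf_kernel(X, X, lengthscale) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X, X_star, lengthscale)
    K_ss = rbf_kernel(X_star, X_star, lengthscale)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    V = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha                            # k_*^T (K + s^2 I)^{-1} y
    cov = K_ss - V.T @ V                            # k_** - k_*^T (K + s^2 I)^{-1} k_*
    return mean, cov
```

The Cholesky factorization is used instead of an explicit inverse for numerical stability; its cost is cubic in the number of training points.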
We can also interpret the posterior as being obtained using Bayes' theorem. When and are observed, the posterior for is derived as follows:
By using , the predictive distribution for new input data is described as follows:
where is a conditional prior. Letting , and are obtained as follows:
Furthermore, the predictive distribution is derived as
Since is a prior distribution, is determined uniquely when is given. By using this property, we define a space of GP posteriors and propose GPPCA.
3 PCA for Gaussian processes (GPPCA)
Similar to ePCA and mPCA, we consider two types of GPPCA: GPePCA and GPmPCA. In this study, we only explain GPePCA, but the same argument holds for GPmPCA. Let be a GP posterior obtained . When a set of posteriors is given, the task of GPePCA is to estimate an e-flat subspace minimizing the KL divergence between the GP posteriors and their corresponding points on the subspace. However, it is nontrivial to define a structure of GPs since a GP has an infinite-dimensional parameter.
This study shows that a set of GP posteriors is a finite-dimensional dually flat space under the assumption that all posteriors share the same prior, and reduces the task of GPePCA to that of estimating a subspace of a finite-dimensional space. To explain our approach, we introduce the two probability spaces shown in Fig. 1. One is a space consisting of Gaussian distributions for an output vector corresponding to a training set . The other is a space consisting of Gaussian distributions for an output vector corresponding to a set which is the union of the training input set and an arbitrary test input set . The former is denoted by and the latter is denoted by . Both and are dually flat spaces since they are sets of Gaussian distributions. Note that can be regarded as an infinite-dimensional space since the cardinality of can be any number. In our approach, we define a space consisting of GP posteriors as a subspace of denoted by . Then, we estimate an e-flat subspace on and transform to using an affine map instead of estimating an e-flat subspace on . Since it is not guaranteed a priori that is equivalent to , we prove this equivalence.
In this section, after defining and GPePCA, we prove that and are equivalent. Next, we describe the standard algorithm and its sparse approximation.
3.1 Definition of the structure of GP posteriors and GPePCA
Let and be a union set of and test set. We consider estimating given in each task. Then, the th task's predictive distribution for is derived as . Supposing that the GP posteriors share a common prior, is determined uniquely given . From this fact, the affine subspace spanned by GP posteriors is defined as follows:
Definition 1.
Let , and be an input set, test set, and a union set of input and test sets, respectively. We denote the size of by . Let be a Gaussian distribution with , where is a pair of a dimensional vector and a positive-definite symmetric matrix . Then, a probability space consisting of GP posteriors corresponding to with a common prior is defined by the following equation:
(11) 
where is a conditional distribution of the prior and is any Gaussian distribution with a parameter . In particular, when holds, i.e., is an empty set, and are denoted by and , respectively.
Under this assumption, is contained in . Let be a parameter of . can be described as follows.
(12)
(13)
where , , , and . Therefore, we can define a space of GP posteriors as .
Since is a dually flat space, can be represented by the e-coordinate and m-coordinate denoted by and , respectively. We denote the e-coordinate and m-coordinate for a point on parameterized by as and , respectively. From the definition of , when , holds since and hold. This means that is also a dually flat space. Therefore, we denote the e-coordinate and m-coordinate of by and , respectively.
By using the definition of , we define GPePCA in the respective spaces of and .
Definition 2.
Let be a set of GPs on the . Then, the objective function of GPePCA on is defined as follows:
(14) 
The ePCA estimating an e-flat submanifold that minimizes Eq. (14) is called GPePCA. Here, is the e-coordinate of denoted by a linear combination of with weight , where .
Similarly, when is observed, we call the ePCA minimizing the following equation GPePCA.
(15)
Here, and are the e-coordinates of and , which is a linear combination of with weight , where .
In this study, we guarantee that GPePCA() is equivalent to GPePCA() by the following theorem.
3.2 Proof of Theorem 3
The proof of Theorem 3 consists of the proofs of the following three statements.
(S1) For , there is satisfying .
(S2) For , , holds.
(S3) For a subspace minimizing , holds.
From (S1) and (S2), denoting a subspace minimizing Eq. (15) by , we can prove that also minimizes Eq. (14) in a set of subspaces on . However, since a subspace minimizing Eq. (14) does not always lie on , we confirm this by (S3).
To prove the statements, we present the following Lemmas.
Lemma 1.
Let be a parameter of . Then, there is an affine map satisfying the following equation.
(17) 
Proof.
See Appendix B. ∎
Lemma 2.
Let and be two arbitrary parameters, and let us take two points and in , and and in . Then, the following equation holds:
(18)
Proof.
See Appendix B. ∎
Lemma 3.
Suppose that is a dually flat manifold and is a dimensional submanifold. If is dually flat and a set of points , the dimensional e-flat submanifold minimizing Eq. (14) for is included in when .
Proof.
See Appendix B. ∎
Lemma 4.
Let be a parameter of . Then, there is a linear mapping satisfying the following equation.
(19)
Proof.
See Appendix B. ∎
The proofs of (S1) and (S2) follow directly from Lemma 1 and Lemma 2. From Lemma 3, (S3) can be proved by showing that is dually flat for an arbitrary test set . When , i.e., the test set is empty, is dually flat since . When , by the linear relation proved in Lemma 1 and Lemma 4, the lemma also holds in the general case. Thus, Theorem 3 is proved.
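The role of the shared prior in this argument can be illustrated numerically. The sketch below lifts two Gaussians over the training outputs to joint Gaussians over the training and test outputs by multiplying each with the same prior conditional, and then compares the KL divergences before and after lifting. That the two values coincide follows from the chain rule of the KL divergence; reading this as the content of Lemma 2 is an interpretation, since the equation itself is not reproduced here. A zero-mean prior with an RBF kernel is assumed, and all names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def kl_gauss(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) )."""
    diff = m1 - m0
    return 0.5 * (np.trace(np.linalg.solve(S1, S0))
                  + diff @ np.linalg.solve(S1, diff) - len(m0)
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

def lift_to_joint(m, S, X, X_star, jitter=1e-10):
    """Map q(f_X) = N(m, S) to the joint over (f_X, f_X*) using p(f_X* | f_X)."""
    n = len(X)
    K = rbf_kernel(X, X) + jitter * np.eye(n)
    K_sx = rbf_kernel(X_star, X)
    B = np.linalg.solve(K, K_sx.T).T            # K_*X K_XX^{-1}
    A = np.vstack([np.eye(n), B])               # linear map on the mean
    cond_cov = rbf_kernel(X_star, X_star) - B @ K_sx.T
    joint_cov = A @ S @ A.T                     # linear in S ...
    joint_cov[n:, n:] += cond_cov               # ... plus a constant block
    return A @ m, joint_cov
```

In this sketch the lifted mean is linear in `m` and the lifted covariance is affine in `S`, which is consistent with the affine map between the two spaces used in the proof.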
3.3 Algorithm of GPePCA
From the above discussion, GPePCA can be reduced to GPePCA. In this section, we explain a concrete algorithm of GPePCA.
3.3.1 GPePCA
Let and be a training input and corresponding output dataset of the th task, where is the size of . We denote the union of the input sets by , i.e., , and define the probability space of GP posteriors as in Eq. (11). Then, we denote the GP posterior given by . From Theorem 3, the task of GPePCA is to estimate a subspace for and transform to .
In the training phase, GPePCA calculates the and transforms the m-coordinates . The is calculated using Eqs. (12) and (13) and from is transformed from by and . Next, GPePCA estimates the subspace using ePCA, that is, it estimates and minimizing Eq. (15) through gradient descent iterations. Algorithm 1 summarizes the procedure.
In the prediction phase, GPePCA predicts the outputs corresponding to test data using the following equations.
Since this algorithm requires calculating the inverse matrix, the calculation cost of the algorithm becomes , where . Since this cost makes the algorithm impractical, we derive a faster approximation below.
3.3.2 Sparse GPePCA
Most sparse approximation methods for GPs reduce the calculation cost by approximating the Gram matrix for the input set using inducing points [14]. Let and be a set of inducing points and the Gram matrix between inputs. The Gram matrix is approximated as
where , . By using this approximation, we consider a set of GPs for instead of a set of GPs for . Denoting the set of GPs for by , the sparse GPePCA estimates a subspace on and transforms the subspace to . Then, we reduce the calculation cost of GPePCA from to , where is the number of inducing points.
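The inducing-point approximation of the Gram matrix can be sketched as follows. The code assumes the usual Nyström form, in which the Gram matrix is replaced by the product of the cross-kernel matrix, the inverse of the inducing-point Gram matrix, and the transposed cross-kernel matrix, together with an RBF kernel and a small jitter term for numerical stability. These choices and the function names are assumptions, not the paper's stated implementation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def nystrom_gram(X, Z, lengthscale=1.0, jitter=1e-6):
    """Low-rank approximation K_XX ~ K_XZ K_ZZ^{-1} K_ZX with inducing points Z."""
    K_xz = rbf_kernel(X, Z, lengthscale)
    K_zz = rbf_kernel(Z, Z, lengthscale) + jitter * np.eye(len(Z))
    return K_xz @ np.linalg.solve(K_zz, K_xz.T)

def approximation_error(X, Z, lengthscale=1.0):
    """Largest absolute entry of the residual K_XX - Q_XX."""
    return np.max(np.abs(rbf_kernel(X, X, lengthscale) - nystrom_gram(X, Z, lengthscale)))
```

Only the small inducing-point matrix is factorized, so the dominant cost scales with the number of inducing points rather than with the number of training inputs.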
We adopt a sparse GP based on variational inference proposed by Titsias [23]. The variational-inference-based sparse GP minimizes the KL divergence between the true posterior and a variational distribution , that is,
Then, the variational distribution minimizing the equation is derived as follows:
where . The predictive distribution for new input is as follows:
We regard and as a parameter of Eq. (11). That is, denoting a parameter of th task’s variational distribution by , the sparse GPePCA estimates a subspace minimizing Eq. (15) for and transforms the subspace to by the affine map .
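For reference, the optimal variational distribution and the resulting predictive distribution can be sketched as below. The formulas are the standard ones from the variational sparse GP of Titsias, written here for a zero-mean prior with an RBF kernel. Both are assumptions about the setting, and the variable names (`m`, `S`, `Z`) are hypothetical rather than the paper's notation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def titsias_q(X, y, Z, noise_var=0.1, lengthscale=1.0, jitter=1e-10):
    """Mean m and covariance S of the optimal Gaussian over inducing outputs."""
    K_zz = rbf_kernel(Z, Z, lengthscale) + jitter * np.eye(len(Z))
    K_zx = rbf_kernel(Z, X, lengthscale)
    Sigma = K_zz + K_zx @ K_zx.T / noise_var
    m = K_zz @ np.linalg.solve(Sigma, K_zx @ y) / noise_var
    S = K_zz @ np.linalg.solve(Sigma, K_zz)
    return m, S

def sparse_predict(X_star, Z, m, S, lengthscale=1.0, jitter=1e-10):
    """Predictive mean and covariance at X_star given q(u) = N(m, S)."""
    K_zz = rbf_kernel(Z, Z, lengthscale) + jitter * np.eye(len(Z))
    K_sz = rbf_kernel(X_star, Z, lengthscale)
    B = np.linalg.solve(K_zz, K_sz.T).T              # K_*Z K_ZZ^{-1}
    mean = B @ m
    cov = rbf_kernel(X_star, X_star, lengthscale) - B @ K_sz.T + B @ S @ B.T
    return mean, cov
```

The pair `(m, S)` plays the role of the finite-dimensional parameter on which the subspace is estimated. When the inducing points coincide with the training inputs, the prediction reduces to that of exact GP regression.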
In practice, to stabilize the sparse GPePCA, we reparametrize as follows:
We denote a space of by . Letting , , and , the following relationships between and hold.
Furthermore, using the equations, we can show the equivalence between the KL divergence of and that of . That is, for any and , the following equation holds:
From the above relationships, and are isomorphic. Therefore, we estimate a subspace on instead of estimating a subspace on . The algorithm is summarized in Algorithm 2.
4 Related works
4.1 Metalearning and multitask learning
Metalearning is a framework that estimates knowledge common to tasks from similar but different learning tasks and adapts to new tasks [25, 11, 12]. As a framework similar to metalearning, multitask learning improves the predictive accuracy of each task by estimating knowledge common to the tasks [29]. Since the approach of metalearning for GPs is the same as that of multitask learning for GPs, we review conventional methods of both.
Most conventional metalearning methods for GPs estimate a prior of each task. The simplest approach is to estimate a prior common to the tasks, where the prior is estimated based on hierarchical Bayesian modeling or a deep neural network (DNN) [9, 18, 28, 16]. This approach models knowledge common to the tasks but does not model knowledge specific to each task. Approaches that estimate both common and task-specific knowledge include the feature learning and cross-covariance approaches. In the feature learning approach, metalearning selects input features in each task by estimating the hyperparameters of an automatic relevance determination kernel or a multi-kernel [20, 22]. In the cross-covariance approach, metalearning assumes that the covariance function of the priors is defined by the Kronecker product of a covariance function of samples and that of tasks, and estimates the covariance function of tasks [5]. In geostatistics, this approach is called the linear model of coregionalization (LMC), and various methods have been proposed [3]. A combination of the feature learning and cross-covariance approaches has also been proposed [13]. Although these approaches estimate common and task-specific knowledge, they estimate the covariance function but not the mean function. GPPCA estimates a subspace of the space of GP posteriors. Therefore, GPPCA enables the estimation of the mean and covariance functions of each task, including a new task.
This study interprets metalearning for GPs from the information geometry viewpoint. Transfer learning and metalearning are often addressed from the information geometry perspective [21, 26, 8]. However, to the best of our knowledge, no study has addressed metalearning for GPs from the information geometry viewpoint.
4.2 Dimension reduction methods for probabilistic distributions
Dimensionality reduction techniques for probability distributions have been proposed in various fields. For example, there are dimension reduction techniques for a set of categorical distributions [10] and a set of mixture models [2, 7]. In particular, ePCA and mPCA are closely related to this study [6, 1]. ePCA and mPCA were proposed in the context of information geometry as dimension reduction methods for a set of exponential family distributions, and they form the basic framework of this study. This study differs from previous studies in that it deals with sets of GPs, which are infinite-dimensional stochastic processes.
4.3 Functional PCA
GPPCA can also be interpreted as a functional PCA (fPCA). fPCA is a method for estimating eigenfunctions from a set of functions [19]. Let be a set of functions. fPCA estimates eigenfunctions to minimize the following objective function.
where . In fPCA, each function is represented as a linear combination of basis functions. Letting , is obtained as
where