## 1 Introduction

In standard settings of learning from independent and identically distributed *(iid)* data, labels of training and test instances are drawn independently and are governed by a fixed conditional distribution p(y | x).
A great variety of problem settings relax this assumption; they are widely referred to as *transfer learning*. We study a general transfer learning setting in which the conditional p(y | x) is assumed to vary as a function of additional observable variables t. The variables t can identify a specific domain that an observation was drawn from (as in *multitask learning*), or can be continuous attributes that describe, for instance, the time or location at which an observation was made (sometimes called *concept drift*).

A natural model for this setting is to assume a conditional p(y | x; θ) with parameters θ that vary with t. Such models are known as *varying-coefficient models* (*e.g.,* Hastie and Tibshirani, 1993; Gelfand et al., 2003). In iid learning, it is common to assume an isotropic Gaussian prior over model parameters. When the parameters vary as a function of a task variable t, it is natural to instead assume a Gaussian process (GP) prior over functions that map values of t to values of θ. A Gaussian process implements a prior over functions that couple parameters for different values of t and make it possible to generalize over different domains, time, or space. While this model extends Bayesian inference naturally to a variety of transfer learning problems, inference in these varying-coefficient models is often impractical for large problems: it involves Kronecker products that result in matrices of size nm × nm, with n the number of instances and m the number of attributes (Gelfand et al., 2003; Wheeler and Calder, 2006).

Alternatively, varying-coefficient models can be derived in a regularized risk minimization framework. Such models infer point estimates of the parameters θ for different observed values of t under some model that expresses how θ changes smoothly with t (Fan and Zhang, 2008). At test time, point estimates of θ are required for all values of t observed at the test data points. This is again computationally challenging, because typically a separate optimization problem needs to be solved for each test instance. Most prominent are estimation techniques based on kernel-local smoothing (Fan and Zhang, 2008; Wu and Chiang, 2000; Fan and Huang, 2005).

In this paper, we explore Bayesian varying-coefficient models in conjunction with isotropic Gaussian process priors. An isotropic prior encodes the assumption that the elements of the vector of model parameters are generated independently of one another; isotropic GP priors are in direct analogy to the isotropic Gaussian priors that are widely used in iid learning. Our main theoretical result is that Bayesian inference in varying-coefficient models with isotropic Gaussian process priors is equivalent to Bayesian inference in a standard Gaussian process with a specific product kernel. The main practical implication of this result is that inference for varying-coefficient models becomes practical using standard GP tools. Our theoretical result also leads to insights regarding existing transfer learning methods: first, we identify the exact modeling assumptions under which Bayesian inference amounts to multitask learning using a Gaussian process with task kernels and instance kernels (Bonilla et al., 2007). Secondly, we show that hierarchical Bayesian multitask models (*e.g.*, Gelman et al., 1995; Finkel and Manning, 2009) can be represented as Gaussian process priors; inference then resolves to inference in standard Gaussian processes with multitask kernels based on graph Laplacians (Evgeniou et al., 2005; Álvarez et al., 2011).

Our main empirical result is that varying-coefficient models with GP priors are an effective and efficient model for prediction problems in which the conditional distribution of the output given the input varies with time and geographical location. In our experiments, varying-coefficient models outperform reference models for the problems of predicting rents and real-estate prices.

The paper is structured as follows. Section 2 describes the problem setting and the varying-coefficient model. Section 3 studies Bayesian inference and presents our main results. Section 4 presents experiments on prediction of real estate sales prices and monthly rents; Section 5 discusses related work and concludes.

## 2 Problem Setting and Model

This section defines a generative process that models a wide class of applications characterized by a conditional distribution p(y | x; θ) whose parameterization θ varies as a function of additional variables t. Figure 1 shows a plate representation of the model.

A fixed set of instances x_1, …, x_n with x_i ∈ R^m is observable, along with values t_1, …, t_n of a *task variable*. The process starts by drawing a function θ(·) according to a prior p(θ). The function associates any task variable t with a corresponding parameter vector θ(t) that defines the conditional distribution p(y | x; θ(t)) for task t. The domain 𝒯 of the task variable depends on the application at hand. In the simplest case of multitask learning, 𝒯 is a set of task identifiers. In hierarchical Bayesian multitask models, a tree over the tasks reflects how tasks are related; we represent this tree by its adjacency matrix A. We also study the setting of concept drift or non-stationary learning, in which the conditional distribution of y given x varies smoothly in the task variables, which can, for instance, comprise time or space. In this case, 𝒯 is a continuous-valued space.

We model θ(·) using a zero-mean Gaussian process

θ(·) ∼ GP(0, K(·, ·))    (1)

that generates vector-valued functions θ: 𝒯 → R^m. The process is specified by a matrix-valued kernel function K(t, t′) that reflects closeness in 𝒯. Here, K(t, t′) ∈ R^{m×m} is the matrix of covariances between the components of the vectors θ(t) and θ(t′) for t, t′ ∈ 𝒯. We assume that the kernel function is isotropic; that is, K(t, t′) = k(t, t′) I for a positive semidefinite kernel function k. This corresponds to the assumption that each dimension of the vector-valued function θ is generated by an independent Gaussian process, and these Gaussian processes share a common kernel function k. Note that this decoupling is not an independence assumption on attributes; it is instead analogous to the assumption of an isotropic normal prior for model parameters that justifies the standard ℓ2-regularization. We use K to denote the n × n matrix given by the evaluations K_{ij} = k(t_i, t_j) of the kernel function. The process evaluates the function θ for all t_i to create parameter vectors θ_i = θ(t_i). The process then concludes by generating labels from an appropriate observation model,

y_i ∼ p(y | x_i; θ_i),    (2)

for instance, a standard linear model with Gaussian noise for regression, or a logistic function of the inner product of x_i and θ_i for classification.
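The generative process above can be sketched in a few lines of code. The following is a minimal illustration (not the paper's implementation), assuming a scalar task variable, a squared-exponential task kernel, and a linear-Gaussian observation model; all names and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3

X = rng.normal(size=(n, m))          # instances x_i in R^m
T = rng.uniform(size=(n, 1))         # scalar task variable (e.g., time)

def rbf(A, B, length_scale=0.3):
    # Squared-exponential kernel k(t, t') as a stand-in for the task kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

K = rbf(T, T) + 1e-8 * np.eye(n)     # task-kernel Gram matrix with jitter
L = np.linalg.cholesky(K)

# Isotropy: each of the m coefficient dimensions is an independent draw
# from a GP with the same kernel k, so each column of Theta has covariance K.
Theta = L @ rng.normal(size=(n, m))  # Theta[i] = theta(t_i)

sigma = 0.1
y = (X * Theta).sum(axis=1) + sigma * rng.normal(size=n)  # linear-Gaussian labels
```

Each row of `Theta` is the coefficient vector in effect at the corresponding task-variable value; nearby task values receive similar coefficients because of the GP coupling.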

The prediction problem is to infer the distribution of the label y_* for a new observation x_* with task variable t_*. For notational convenience, we aggregate the training instances into the matrix X with row vectors x_i, the task variables into the matrix T with row vectors t_i, the parameter vectors associated with the training observations into the matrix Θ with row vectors θ_i, and the labels into the vector y.

In this model, the Gaussian process prior over functions θ(·) couples the parameter vectors for different values of the task variable. The hierarchical Bayesian model of multitask learning assumes a coupling of parameters based on a hierarchical Bayesian prior (*e.g.*, Gelman et al., 1995; Finkel and Manning, 2009). We will now show that the varying-coefficient model with isotropic GP prior subsumes hierarchical Bayesian multitask models by choice of an appropriate kernel function for the Gaussian process that defines p(θ). Together with the results on inference presented in Section 3, this shows how inference for hierarchical Bayesian multitask models can be carried out using a standard Gaussian process.

The following definition formalizes the hierarchical Bayesian multitask model.

[Hierarchical Bayesian Multitask Model] Let the tree structure over a set of tasks {1, …, T} be given by an adjacency matrix A, with task r the root node. Let σ² denote a vector with entries σ_τ². The following process generates the distribution over labels given instances, task variables, the task hierarchy A, and variances σ²: the process first samples parameter vectors according to

θ_r ∼ N(0, σ_r² I)    (3)

θ_τ ∼ N(θ_{π(τ)}, σ_τ² I)    (4)

where, for τ ≠ r, π(τ) is the unique node with A_{π(τ),τ} = 1; then, the process generates labels y_i ∼ p(y | x_i; θ_{τ_i}), where p(y | x; θ) is the same conditional distribution over labels given an instance and a parameter vector as was chosen for the varying-coefficient model in Equation 2. This process defines the hierarchical Bayesian multitask model.

The following proposition shows that the varying-coefficient model presented in Section 2 subsumes the hierarchical Bayesian multitask model. Let the tree structure over a set of tasks {1, …, T} be given by an adjacency matrix A. Let σ² be a vector with entries σ_τ², and let S denote the diagonal matrix with entries σ_τ². Let the task kernel be given by k(τ, τ′) = [(I − A^⊤)^{−1} S (I − A)^{−1}]_{τ,τ′}, where [M]_{τ,τ′} denotes the entry at row τ and column τ′ of the matrix M. Then the marginal distribution over labels defined by the varying-coefficient model coincides with the marginal distribution over labels defined by the hierarchical Bayesian multitask model.

Proposition 2 implies that performing Bayesian prediction in the varying-coefficient model with the specified kernel function is identical to performing Bayesian inference in the hierarchical Bayesian multitask model. The proof is included in the appendix. In Proposition 2, the entries of k(τ, τ′) represent a task similarity derived from the tree structure. Instead of a tree structure over tasks, feature vectors describing individual tasks may also be given (Bonilla et al., 2007; Yan and Zhang, 2009). In this case, the task kernel can be computed from the task features; the varying-coefficient model then subsumes existing approaches for multitask learning with task features (see Section 3.3).
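The tree-induced task covariance can be made concrete for a small example. In the hierarchical model above, each task's parameters are its parent's parameters plus independent Gaussian noise, so (per coefficient dimension) the stacked task parameters are jointly Gaussian with covariance (I − A^⊤)^{−1} S (I − A)^{−1}. The sketch below checks this against the intuitive path-sum covariance; the tree and variances are illustrative.

```python
import numpy as np

# Hypothetical 3-task tree: task 0 is the root, tasks 1 and 2 are its children.
# A[i, j] = 1 iff i is the parent of j (directed adjacency matrix).
A = np.array([[0., 1., 1.],
              [0., 0., 0.],
              [0., 0., 0.]])
sigma2 = np.array([1.0, 0.5, 0.25])   # per-task variances (illustrative)
S = np.diag(sigma2)

# theta_root ~ N(0, sigma2_root I); theta_child ~ N(theta_parent, sigma2_child I).
# Per coefficient dimension: theta = A^T theta + eps, so theta = (I - A^T)^{-1} eps
# and the task covariance is (I - A^T)^{-1} S (I - A)^{-1}.
I = np.eye(3)
K_tasks = np.linalg.inv(I - A.T) @ S @ np.linalg.inv(I - A)

# Sanity check: cov(theta_u, theta_v) sums the variances of the common
# ancestors of u and v (including u, v themselves).
assert np.isclose(K_tasks[1, 2], sigma2[0])            # only the root is shared
assert np.isclose(K_tasks[1, 1], sigma2[0] + sigma2[1])
```

Siblings are correlated exactly through the variance accumulated at their shared ancestors, which is the coupling the hierarchical prior expresses.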

## 3 Inference

We now address the problem of inferring predictions for a test instance x_* with task variable t_*. Section 3.1 presents exact Bayesian solutions for regression; Section 3.2 discusses approximate Bayesian inference for classification. Section 3.3 derives existing multitask models as special cases.

### 3.1 Regression

This subsection studies linear regression models of the form p(y | x; θ) = N(y; x^⊤θ, σ²). Note that by substituting a feature map φ(x) for x, this treatment also covers finite-dimensional feature maps. The predictive distribution for a test instance x_* with task variable t_* is obtained by integrating over the possible parameter values θ_* of the conditional distribution that has generated the value y_*:

p(y_* | x_*, t_*, X, T, y) = ∫ p(y_* | x_*; θ_*) p(θ_* | t_*, X, T, y) dθ_*,    (5)

where the posterior over θ_* is obtained by integrating over the joint parameter values Θ that have generated the labels y for instances X and task variables T:

p(θ_* | t_*, X, T, y) ∝ ∫ p(y | X, Θ) p(θ_*, Θ | t_*, T) dΘ.    (6)

The posterior distribution in Equation 6 depends on the likelihood function (the linear model) and the GP prior p(θ_*, Θ | t_*, T). The extrapolated posterior for the test instance x_* with task variable t_* depends on the Gaussian process. The following theorem states how the predictive distribution given by Equation 5 can be computed.

[Bayesian Predictive Distribution] Let p(y | x; θ) = N(y; x^⊤θ, σ²), let y denote the vector of training labels, and let the kernel matrix K be positive definite. Let G be the matrix with components G_{ij} = k(t_i, t_j) x_i^⊤ x_j, and let k_* be the vector with components (k_*)_i = k(t_i, t_*) x_i^⊤ x_*. Then, the predictive distribution for the varying-coefficient model defined in Section 2 is given by

p(y_* | x_*, t_*, X, T, y) = N(y_*; μ_*, σ_*²)    (7)

with

μ_* = k_*^⊤ (G + σ² I)^{−1} y,
σ_*² = k(t_*, t_*) x_*^⊤ x_* − k_*^⊤ (G + σ² I)^{−1} k_* + σ².

Before we prove Theorem 3.1, we highlight three observations about this result. First, the distribution has a surprisingly simple form: it is identical to the predictive distribution of a standard Gaussian process that uses the concatenated vectors (x_i, t_i) as training instances, labels y_i, and the product kernel function k̃((x, t), (x′, t′)) = k(t, t′) x^⊤ x′.

Secondly, instances only enter Equation 7 in the form of inner products. The model can therefore directly be kernelized by defining the kernel matrix as G_{ij} = k(t_i, t_j) k_X(x_i, x_j), with a kernel function k_X(x, x′) = φ(x)^⊤ φ(x′), where φ maps instances to a reproducing kernel Hilbert space. When the feature space is finite, φ maps the instances to a finite-dimensional space, and Theorem 3.1 implies a Bayesian predictive distribution derived from the generative process that Section 2 specifies. When the reproducing kernel Hilbert space does not have finite dimension, Section 2 no longer specifies a corresponding proper generative process, because θ would otherwise have to follow an infinite-dimensional normal distribution. However, given the finite sample X and x_*, a Mercer map (see, e.g., Schölkopf and Smola, 2002, Section 2.2.4) constitutes a finite-dimensional space for which Section 2 again characterizes a corresponding generative process.

Thirdly and finally, Theorem 3.1 shows how Bayesian inference in varying-coefficient models with isotropic priors can be implemented much more efficiently than in general varying-coefficient models. Bayesian inference in varying-coefficient models in the parameter space generally involves matrices of size nm × nm, because it needs to take the overall covariance structure into account; the algorithm of Gelfand et al. (2003) infers the covariance matrix under an inverse Wishart prior using a sliced Gibbs sampler over parameter values. This makes inference impractical for large-scale problems. Theorem 3.1 shows that under the isotropy assumption, the latent parameter vectors can be integrated out, which results in a GP formulation in which the covariance structure over parameter vectors resolves to an n × n product-kernel matrix.
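The practical upshot can be illustrated in a few lines: prediction reduces to standard GP regression in which the Gram matrix is an elementwise product of a task kernel and a linear instance kernel. The sketch below uses synthetic data and illustrative hyperparameters; it is a minimal rendering of the predictive equations, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 40, 2
X = rng.normal(size=(n, m))          # training instances
T = rng.uniform(size=(n, 1))         # training task variables
y = rng.normal(size=n)               # training labels (synthetic)
sigma2 = 0.1                         # observation noise variance

def k_task(A, B, ell=0.5):
    # Squared-exponential task kernel (illustrative choice).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

# Product-kernel Gram matrix: entries k(t_i, t_j) * <x_i, x_j>.
K = k_task(T, T) * (X @ X.T)

x_star = rng.normal(size=m)
t_star = np.array([[0.5]])
k_star = k_task(t_star, T).ravel() * (X @ x_star)   # cross-covariances

# Standard GP predictive mean and variance with the product kernel.
A = K + sigma2 * np.eye(n)
alpha = np.linalg.solve(A, y)
mean = k_star @ alpha
var = (k_task(t_star, t_star)[0, 0] * (x_star @ x_star)
       - k_star @ np.linalg.solve(A, k_star))
```

The cost is that of a single n × n solve, rather than operations on the nm × nm covariance structure of the parameter-space formulation.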

Proof of Theorem 3.1. Let θ_d(t_i) and θ_d(t_*) denote the d-th elements of the vectors θ(t_i) and θ(t_*), and let x_{i,d} and x_{*,d} denote the d-th elements of the vectors x_i and x_*. Let f = (f_1, …, f_n)^⊤ with f_i = x_i^⊤ θ(t_i), and let f_* = x_*^⊤ θ(t_*). Because the values θ_d(t_1), …, θ_d(t_n), θ_d(t_*) are evaluations of the function θ_d drawn from a Gaussian process (Equation 1), they are jointly Gaussian distributed, and thus f_1, …, f_n, f_* are also jointly Gaussian (e.g., Murphy, 2012, Chapter 10.2.5). Because θ is drawn from a zero-mean process, it holds that E[f_i] = 0 as well as E[f_*] = 0, and therefore (f, f_*) ∼ N(0, Σ), where Σ denotes the covariance matrix. For the covariances it holds that

cov(f_i, f_j) = Σ_d Σ_{d′} x_{i,d} x_{j,d′} E[θ_d(t_i) θ_{d′}(t_j)]    (8)

= k(t_i, t_j) x_i^⊤ x_j.    (9)

In Equations 8 and 9 we exploit the isotropy of the Gaussian process prior: the covariance E[θ_d(t_i) θ_{d′}(t_j)] is the element in row d and column d′ of the matrix K(t_i, t_j) obtained by evaluating the kernel function at (t_i, t_j); the isotropy assumption means that this matrix is diagonal, with E[θ_d(t_i) θ_d(t_j)] = k(t_i, t_j) for d = d′ and E[θ_d(t_i) θ_{d′}(t_j)] = 0 for d ≠ d′ (see Section 2). We analogously derive

cov(f_*, f_j) = k(t_*, t_j) x_*^⊤ x_j    (10)

cov(f_*, f_*) = k(t_*, t_*) x_*^⊤ x_*.    (11)

Equations 9, 10 and 11 define the covariance matrix Σ, yielding

(f, f_*) ∼ N(0, Σ) with Σ = [[G, k_*], [k_*^⊤, k_{**}]],

where G is the matrix with entries G_{ij} = k(t_i, t_j) x_i^⊤ x_j, k_* is the vector with entries k(t_i, t_*) x_i^⊤ x_*, and k_{**} = k(t_*, t_*) x_*^⊤ x_*. For y = f + ε with ε ∼ N(0, σ² I) it now follows that

(y, f_*) ∼ N(0, [[G + σ² I, k_*], [k_*^⊤, k_{**}]]).    (12)

The claim now follows by applying standard Gaussian identities to compute the conditional distribution from Equation 12.

### 3.2 Classification

The result given by Theorem 3.1 can be extended to classification settings with y ∈ {−1, 1} by using non-Gaussian likelihoods that generate labels given the outputs of the linear model.

[Bayesian predictive distribution for non-Gaussian likelihoods] Let y ∈ {−1, 1}^n. Let p(y | x; θ) be given by a generalized linear model, defined by a response function s and p(y | x; θ) = s(y · x^⊤θ). Let f_i = x_i^⊤ θ(t_i) and f_* = x_*^⊤ θ(t_*), and let furthermore f = (f_1, …, f_n)^⊤.

Let the kernel matrix K be positive definite, and let G be the matrix with components G_{ij} = k(t_i, t_j) x_i^⊤ x_j and k_* the vector with components (k_*)_i = k(t_i, t_*) x_i^⊤ x_*. Then, the predictive distribution for the GP model defined in Section 2 is given by

p(y_* | x_*, t_*, X, T, y) = ∫ p(y_* | f_*) p(f_* | t_*, X, T, y) df_*    (13)

with

p(f_* | t_*, X, T, y) ∝ ∫ p(f_* | f) p(y | f) p(f | X, T) df.

A straightforward calculation shows that Equation 13 is identical to the predictive distribution of a standard Gaussian process that uses the concatenated vectors (x_i, t_i) as training instances, labels y_i, the product kernel k̃((x, t), (x′, t′)) = k(t, t′) x^⊤ x′, and likelihood function p(y | f) = s(y · f). For non-Gaussian likelihoods, exact inference in Gaussian processes is generally intractable, but approximate inference methods based on, *e.g.*, the Laplace approximation, variational inference, or expectation propagation are available.

### 3.3 Product Kernels in Transfer Learning

Sections 3.1 and 3.2 have shown that inference in the varying-coefficient model is equivalent to inference in standard Gaussian processes with products of task kernels and instance kernels. Similar product kernels are used in several existing transfer learning models. Our results identify the generative assumptions that underlie these models by showing that the product kernels which they employ can be derived from the assumption of a varying-coefficient model with isotropic GP prior and an appropriate kernel function.

Bonilla et al. (2007) study a setting in which there is a discrete set of tasks, which are described by task-specific attribute vectors. They study a Gaussian process model based on concatenated feature vectors and a product kernel k̃((x, t), (x′, t′)) = k_T(t, t′) k_X(x, x′), where k_X reflects instance similarity and k_T reflects task similarity. Theorems 3.1 and 3.2 identify the generative assumptions underlying this model: a varying-coefficient model with isotropic Gaussian process prior and task kernel k_T generates task-specific parameter vectors in a reproducing kernel Hilbert space of the instance kernel k_X; a linear model in that Hilbert space generates the observed labels.

Evgeniou et al. (2005) and Álvarez et al. (2011) study multitask-learning problems in which task similarities are given in terms of a task graph. Their method uses the product of an instance kernel and the graph-Laplacian kernel of the task graph. We will now show that, when the task graph is a tree, that kernel emerges from Proposition 2. This signifies that, when the task graph is a tree, the graph regularization method of Evgeniou et al. (2005) is the dual formulation of hierarchical Bayesian multitask learning, and therefore Bayesian inference for hierarchical Bayesian models can be carried out efficiently using a standard Gaussian process with a graph-Laplacian kernel.

[Graph-Laplacian Multitask Kernel] Let the weighted undirected graph structure over a set of tasks be given by a symmetric adjacency matrix W, where W_{τ,τ′} defines the positive weight of the edge between tasks τ and τ′, or W_{τ,τ′} = 0 if no such edge exists. Let D denote the weighted degree matrix of the graph, and let L = D + R − W denote the graph Laplacian, where a diagonal matrix R that acts as a regularizer has been added to the degree matrix (Álvarez et al., 2011). The kernel function given by

k(τ, τ′) = [L^+]_{τ,τ′},

where L^+ is the pseudoinverse of L, will be referred to as the *graph-Laplacian multitask kernel*.
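The graph-Laplacian multitask kernel is simple to compute for a concrete graph. The following sketch assumes the regularized Laplacian form L = D + R − W described above, with an illustrative three-task graph and regularizer.

```python
import numpy as np

# Hypothetical weighted task graph over three tasks (symmetric adjacency).
W = np.array([[0., 2., 0.],
              [2., 0., 1.],
              [0., 1., 0.]])
D = np.diag(W.sum(axis=1))     # weighted degree matrix
R = 0.5 * np.eye(3)            # diagonal regularizer (illustrative value)
L = D + R - W                  # regularized graph Laplacian

# Graph-Laplacian multitask kernel: pseudoinverse of L.
# With a strictly positive regularizer, L is positive definite,
# so the pseudoinverse coincides with the ordinary inverse.
K_tasks = np.linalg.pinv(L)
```

Multiplying `K_tasks` elementwise into an instance-kernel Gram matrix yields the product kernel used by the graph-regularized multitask methods discussed above.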

The following proposition states that the graph-Laplacian multitask kernel is equal to the kernel that emerges in the dual formulation of hierarchical Bayesian multitask learning (Definition 1). Let the directed tree structure be given by an adjacency matrix A. Let σ² be a vector with entries σ_τ². Let S denote the diagonal matrix with entries σ_τ², and let S^{−1} denote the diagonal matrix with entries σ_τ^{−2}. Let W = A S^{−1} + (A S^{−1})^⊤, and let k be defined as in Proposition 2. Then the graph-Laplacian multitask kernel of W equals k.

Note that in Proposition 3.3, W is an adjacency matrix in which an edge from node τ to node τ′ is weighted by the respective precision of the conditional distribution (Equation 4); adding the transpose yields a symmetric matrix of task-relationship weights. The precision of the root-node prior is subsumed in the regularizer R. The proof is included in the appendix.

## 4 Empirical Study

In this section, we study the efficiency and accuracy of different varying-coefficient models and baselines for geospatial and temporal regression and classification problems. We focus on the problems of predicting real estate prices and monthly housing rents.

For real estate price prediction, we acquire records of real-estate sales in New York City dating from January 2003 to December 2009; the data were retrieved in June 2013 through the NYC Open Data initiative (https://nycopendata.socrata.com/). Input variables include the floor space, plot area, property class (such as family home, residential condominium, office, or store), date of construction of the building, and the number of residential and commercial units in the building. After binarization of multi-valued attributes there are 94 numeric attributes in the data set. For regression, the sales price serves as the target variable y; we also study a classification problem in which y is a binary indicator that distinguishes transactions with a price above the median of 450,000 dollars from transactions below it. The date and address of every sale are available; we transform addresses into geographical latitude and longitude using an inverse geocoding service based on OpenStreetMap data. We encode the sales date and the geographical latitude and longitude of the property as the task variable t.

Prices and attributes in the sales records vary widely; for instance, prices range from one dollar to four billion dollars, and the floor space from one square foot to fourteen million square feet. A substantial number of records contain either errors or document transactions in which the valuations do not reflect the actual market values: for instance, Manhattan condominiums that sold for one dollar, and one-square-foot lots that sold for massive prices. In order to filter most off-market transactions by means of a simple policy, we only include records of sales within a price range of 100,000 to 1,000,000 dollars, a property area range of 500 to 5,000 square feet, and a land area range of 500 to 10,000 square feet. Approximately 80% of all records fall into these brackets. Additionally, we remove all records with missing values. After preprocessing, the data set contains 231,708 sales records. We divide the records, which span dates from January 2003 to December 2009, into 25 consecutive blocks. Models are trained on a set of n instances sampled randomly from a window of five blocks of historical data and evaluated on the subsequent block; results are averaged over all blocks.

For rent prediction, we acquire records on the monthly rent paid for privately rented apartments and houses in the states of California and New York from the 2013 American Community Survey (ACS) public use microdata sample files (http://factfinder.census.gov/faces/affhelp/jsf/pages/metadata.xhtml?lang=en&type=document&id=document.en.ACS_pums_csv_2013#main_content). Input variables include the number of rooms, the number of bedrooms, the duration for which the contract has been running, the construction year of the building, the type of building (mobile home, trailer, or boat; attached or detached family house; apartment building), and variables that describe technical facilities (e.g., variables related to internet access, type of plumbing, and type of heating). After binarization of multi-valued attributes there are 24 numeric attributes in the data.
We study a regression problem in which the target variable y is the monthly rent, and a classification problem in which y is a binary indicator that distinguishes contracts with a monthly rent above the median of 1,200 dollars from those with a rent below the median.
For each record, the geographical location is available in the form of a public use microdata area (PUMA) code (https://www.census.gov/geo/reference/puma.html). We translate PUMA codes to geographical latitude and longitude by associating each record with the longitude-latitude centroid of the corresponding public use microdata area; these geographical latitudes and longitudes constitute the task variable t.
We remove all records with missing values. The preprocessed data sets contain 36,785 records (state of California) and 17,944 records (state of New York).
Models are evaluated using 20-fold cross-validation; in each fold, a random subset of n training instances is sampled from the respective training fold.

We study the varying-coefficient model with isotropic GP prior introduced in Section 2 with a Matérn kernel over the task variables. Predictions are obtained from Theorem 3.1, using either a linear or a Matérn kernel function over instances (denoted isoVCM-lin and isoVCM-Mat, respectively). We compare with the varying-coefficient model with nonisotropic GP prior by Gelfand et al. (2003), in which the covariances are inferred from data (denoted Gelfand). Furthermore, we compare with the kernel-local smoothing varying-coefficient model of Fan and Zhang (2008), which infers point estimates of model parameters. We study this model using a linear feature map (Fan & Zhang-lin) and a nonlinear feature map constructed from a Matérn kernel (Fan & Zhang-Mat). Fan and Zhang (2008) do not regularize parameter estimates in their original model; we added an ℓ2-regularizer, as this improved predictive performance.

We finally compare against an *iid* baseline that assumes that θ is constant in t, implemented by a standard Gaussian process with a linear (GP-lin) or Matérn (GP-Mat) kernel, and with a standard Gaussian process that simply concatenates instance and task attribute vectors (denoted GP-lin-concat and GP-Mat-concat).

For classification, we use logistic likelihood functions in our model (Theorem 3.2), and also in the GP baselines and in the kernel-local smoothing varying-coefficient model of Fan and Zhang (2008). All kernel parameters, as well as the observation noise parameter σ² of Theorem 3.1 and the observation noise parameters of the standard GP models, are tuned according to the marginal likelihood on the training data. The regularization parameter of the kernel-local smoothing varying-coefficient model and its kernel parameter (see Fan and Zhang, 2008) are tuned on the training data by cross-validation. The isoVCM models and all GP baselines are implemented based on the GPML Gaussian process toolbox (Rasmussen and Nickisch, 2010). Inference is carried out using the FITC approximation, based on a low-rank approximation to the exact covariance matrix with 1,000 randomly sampled inducing points (Snelson and Ghahramani, 2005), and using the Laplace approximation for classification.
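For reference, a Matérn task kernel of the kind used in these experiments can be written in a few lines. We assume the common ν = 5/2 form here; the smoothness parameter actually used in the experiments is not stated, and the hyperparameters below are illustrative.

```python
import numpy as np

def matern52(A, B, length_scale=1.0, variance=1.0):
    """Matérn kernel with nu = 5/2 (an assumed choice; the smoothness
    parameter used in the experiments is not specified here)."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    r = np.sqrt(5.0) * d / length_scale
    return variance * (1.0 + r + r**2 / 3.0) * np.exp(-r)

# Task variables as used in the experiments: geographic and temporal
# coordinates, e.g., (latitude, longitude, time).
T = np.random.default_rng(2).uniform(size=(10, 3))
K = matern52(T, T)   # task-kernel Gram matrix, to be multiplied
                     # elementwise with an instance-kernel Gram matrix
```

A Matérn kernel over space and time encodes the assumption that the regression coefficients drift smoothly, but with only finitely many mean-square derivatives, across geography and time.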

First, we compare the execution time of the GP inference that results from Theorem 3.1 with the execution time of the primal inference procedure of Gelfand et al. (2003) and the execution time of the kernel-local smoothing varying-coefficient model of Fan and Zhang (2008). Figure 2 shows the execution time for model training and prediction on one block of test instances in the real estate price prediction task as a function of the training set size n (CPU core seconds, Intel Xeon 5520, 2.26 GHz). For the model of Gelfand et al., the most expensive step during inference is the computation of the inverse of a Cholesky decomposition of an nm × nm matrix, which needs to be performed in each Gibbs sampling iteration. Figure 2 shows the execution time of 5,000 iterations of this step (3,000 burn-in and 2,000 sampling iterations, following Gelfand et al., 2003), yielding a lower bound on the overall execution time. An experimental run with Bayesian inference for nonisotropic GP priors requires 230 CPU core days even for 100 training instances; as matrix inversion scales nearly cubically in nm, this approach is impractical for this application. We therefore exclude this method from the remaining experiments. By contrast, full Bayesian inference in our GP model takes less than a second. The execution time of the kernel-local smoothing varying-coefficient model of Fan and Zhang (2008) differs substantially between the regression and classification tasks. In this model, separate point estimates of model parameters have to be inferred for each test instance, for which a separate optimization problem needs to be solved. For regression, efficient closed-form solutions for parameter estimates are available, while for classification more expensive numerical optimization is required (Fan and Zhang, 2008).

Figure 3: Mean absolute error for predicting real estate prices in New York City (left) and mean zero-one loss for classifying real estate transactions (right) over training set size n. Error bars indicate the standard error.

In all subsequent experiments, each method is given 30 CPU core days of execution time; experiments are run sequentially for increasing numbers of training instances n, and results are reported for values of n for which the cumulative execution time is below this limit.

Figure 3 shows the mean absolute error for real estate price predictions (left) and the mean zero-one loss for classifying sales transactions (right) as a function of the training set size n. For regression, the two Fan & Zhang variants partially completed the experiments; for classification, both methods did not complete the experiment even for the smallest value of n. All other methods completed the experiments within the time limit. For regression, we observe that isoVCM is substantially more accurate than the GP baselines and Fan & Zhang; the differences to the GP baselines are significant for all training set sizes according to a paired t-test. Significance values of the paired t-test comparing isoVCM and Fan & Zhang fluctuate for different n, indicating that isoVCM is likely more accurate than Fan & Zhang. For classification, the isoVCM variants substantially outperform the GP baselines.

Figure 4 shows the mean absolute error for predicting monthly housing rent (left) and the mean zero-one loss for classifying rental contracts (right) for rental contracts in the state of California (upper row) and the state of New York (lower row) as a function of the training set size n. One Fan & Zhang variant completed the regression experiments within the time limit and partially completed the classification experiment; the other partially completed the regression experiment but did not complete the classification experiment even for the smallest value of n. We again observe that isoVCM yields the most accurate predictions for both the classification and regression problems; isoVCM always yields more accurate predictions than Fan & Zhang, and more accurate predictions than the GP baselines for larger training set sizes.

## 5 Discussion and Related Work

Varying-coefficient models reflect applications in which the conditional distribution of y given x is a function of task variables t. The task variables can, for instance, be continuous, discrete, or nodes in a tree, as in hierarchical Bayesian multitask learning. The functional dependency between the conditional distribution of the output given the input and the task variables can be modeled with a GP prior. Theorem 3.1 shows that, for isotropic GP priors, Bayesian inference in varying-coefficient models can be carried out efficiently by using a standard Gaussian process with a kernel that is defined as the product of a task kernel and an instance kernel. This result clarifies the exact modeling assumptions required to derive the multitask kernel of Bonilla et al. (2007). It also highlights that Bayesian inference for hierarchical Bayesian learning can be carried out efficiently by using a standard Gaussian process with a graph-Laplacian kernel (Evgeniou et al., 2005).

Product kernels play a role in other multitask learning models. In the linear coregionalization model, several related functions are modeled as linear combinations of Gaussian processes; the covariance function then resolves to a product of a kernel function on instances and a matrix of mixing coefficients (Journel and Huijbregts, 1978; Álvarez et al., 2011). A similar model is studied by Wang et al. (2007) in the context of style-content separation in human locomotion data; here mixing coefficients are given by latent variables that represent an individual’s movement style. Zhang and Yeung (2010) study a model for learning task relationships, and show that under a matrix-normal regularizer the solution of a multitask-regularized risk minimization problem can be expressed using a product kernel. Theorem 3.1 can be seen as a generalization of their result in which the regularizer is replaced by a prior over functions, and the regularized risk minimization perspective by a fully Bayesian analysis.

Non-stationarity can also be modeled in Gaussian processes by assuming that the residual variance (Wang and Neal, 2012), the length scale of the covariance function (Schmidt and O'Hagan, 2003), or the amplitude of the output (Adams and Stegle, 2008) is input-dependent. The varying-coefficient model differs from these models in that the source of non-stationarity is observed in the task variable.

In the domain of real estate price prediction, the dependency between property attributes and the market price changes continuously with geographical coordinates and time. We observe that primal Bayesian inference in varying-coefficient models with nonisotropic GP priors is impractical in this domain, while for isotropic GP priors, inference based on Theorem 3.1 is faster by several orders of magnitude. Empirically, we observe that the linear and kernelized isoVCM models predict real estate prices and housing rents over time and space more accurately than kernel-local smoothing varying-coefficient models, and are also more accurate than linear and kernelized models that append the task variables to the attribute vector or ignore the task variables.

We would like to thank Jörn Malich and Ahmed Abdelwahab for their help in preparing the data sets of monthly housing rents. We gratefully acknowledge support from the German Research Foundation (DFG), grant LA 3270/1-1.

## Appendix

Proof of Proposition 2.

The marginal distribution defined by the varying-coefficient model is given by the generative process of drawing θ(·), evaluating θ for the different tasks to create parameter vectors θ_1, …, θ_T, and then drawing the labels y_i. The marginal distribution defined by the hierarchical Bayesian multitask model is given by the generative process of generating parameter vectors according to Equations 3 and 4 in Definition 1, and then drawing the labels y_i. Here, the observation models are identical. It therefore suffices to show that the two processes induce the same joint distribution over parameter vectors.

The distribution over parameter vectors in the hierarchical Bayesian multitask model can be derived from standard results for Gaussian graphical models. Let Θ denote the matrix with row vectors θ_τ, and let θ̄ denote the vector of random variables obtained by stacking the vectors θ_1, …, θ_T on top of one another. According to Equations 3 and 4, the distribution over the random variables within θ̄ is given by a Gaussian graphical model (e.g., Murphy (2012), Chapter 10.2.5) whose weight matrix is determined by the adjacency matrix A. It follows that the distribution over θ̄ is given by a zero-mean Gaussian with covariance

((I − A^⊤)^{−1} S (I − A)^{−1}) ⊗ I

(see Murphy (2012), Chapter 10.2.5), where S denotes the diagonal matrix with entries σ_τ².

The distribution over parameter vectors in the varying-coefficient model is given directly by the Gaussian process defining the prior over vector-valued functions (see Equation 1). Let Θ denote the matrix with row vectors θ(τ); then the Gaussian process prior implies that θ̄ is zero-mean Gaussian with covariance K ⊗ I (see, e.g., Álvarez et al. (2011), Section 3.3). A straightforward calculation now shows that the two covariance matrices coincide for the kernel function specified in Proposition 2, and thereby proves the claim.

Proof of Proposition 3.3. In the following we use the notation introduced in Proposition 2 and Definition 3.3. We first observe that, by the definition of the graph-Laplacian multitask kernel, it is sufficient to show that the regularized Laplacian equals the inverse of the task covariance matrix. Since that matrix is invertible, this is equivalent to showing that their product is the identity.

We prove the claim by induction over the number of nodes in the tree. If the tree consists of a single node, the claim follows directly from the definitions, which proves the base case. Let us now assume that we have a tree with T nodes. Let τ be a leaf of this tree and let ρ be its unique parent; without loss of generality we assume that τ = T. Let furthermore the reduced tree be the tree obtained by removing the node τ and its adjacent edge from the original tree, and let the adjacency matrices, degree matrices, variance vectors, and the corresponding diagonal and Laplacian matrices of the two trees be defined as above, with primed symbols referring to the reduced tree.

In the following, we write to denote a diagonal matrix with entries . We then have

is the ()-dimensional unit vector. Using this notation we can write

In the last line we applied the induction hypothesis to the tree . Using the definitions of , , and , we can easily finish the proof:
