I Introduction
Multitask learning [48]
with Gaussian process (GP) regression tasks is of major concern in the statistical signal processing and machine learning communities. Many contemporary instances of this framework have been reported, including methods based on convolved GPs
[40, 6, 27, 1], convolutional GPs [43], product GPs [47, 26], deep (nested) GPs [9, 14, 7], and deep neural networks
[46]. All these methods adopt a single global model to describe the relationship between source and target tasks, involving a joint probability distribution of all latent and output-data processes, which is consistent with the foundational methodology of Bayesian learning [4]. The flexibility of multitask GP regression is evidenced by a wide range of applications, such as reinforcement learning
[20], Bayesian optimization [39], Earth observation [8, 49], and clinical data analysis [11]. This paper is specifically interested in Bayesian transfer learning (BTL) [42, 29, 45, 23]. Without loss of generality, we define BTL as the transfer of knowledge from a source task to a target task, a definition which is widely extensible to multitask settings. Under the Bayesian framework, source and target modellers separately adopt distinct probability distributions to model local unknown quantities of interest, i.e. our approach involves two stochastically independent modellers without requiring consistency to be established between them, whereas conventional multitask learning involves—as already stated—a single global modeller. These modellers each have exclusive access to their respective local data but not to each other’s data. Therefore, the target cannot condition its model on the source data, the essential assumption of conventional global multitask learning. The target’s model can be a local model, as in [10, 30, 32, 13, 31, 33]. Alternatively, it may itself extend its focus beyond its local target task and adopt a global model involving both source and target tasks, as in the standard multitask setting. This is the setup we adopt in this paper for the first time. To emphasize: the target modeller globally models the interactions between its own task and the source task, but is informed only by its local target data. Meanwhile, the source modeller, adopting a local Bayesian model, is informed by its local (source) data, and then communicates its knowledge to the target only via a source probability distribution instantiated by its (source) sufficient statistics. Therefore, only the target models the task interactions, and only the source communicates probabilistic source knowledge. The key distinctions between global and local modelling of the source and target tasks are captured in Fig. 1, whose technical details will be explained fully in Section II.
The target modeller has no prescriptive rule of probability calculus with which to condition on the source distribution in this BTL scenario (Fig. 1(c)). The reason for this is that the joint model of the source distribution and the target’s quantities of interest is not available. The challenge for the target modeller then lies in deciding how to condition on the source probability distribution in this incompletely modelled case. Fully probabilistic design (FPD) [16, 35]—the optimal Bayesian decision-making strategy based on the minimum cross-entropy principle [38]—addresses this problem. Under this framework, the target modeller conditions on the transferred source knowledge by solving a constrained optimization problem.
The principal novelty of this paper, therefore, is that the target modeller is a global modeller (in sympathy with the aforementioned standard multitask learning) that conditions on a local source probability distribution. Recall, from Bayesian foundations, that this latter distribution is a function of both its local source data and local source model. What is implied by this BTL framework is that the global target modeller independently adopts a different model from the source. This leads to important advances beyond the current available notions and techniques for BTL, all of which are explored in this paper:


The introduction of the global target modeller facilitates learning about the correlation structure between the source and target tasks, extending our own previous BTL framework, which involved only locally modelled tasks [10, 30, 32, 31, 33]. It also generalizes FPD-based approaches to pooling [18, 15], as well as conventional (global) Bayesian multitask learning [40, 6, 27, 1, 43, 47, 26, 9, 14, 7, 46]. We demonstrate this formally, and also via extensive experimental evidence.

We stand to benefit from the fact that the local source modeller is independent of the target and is therefore, potentially, a more informed—i.e. expressive or expert—modeller of its local data. This can provide a major performance dividend over standard multitask learning solutions that adopt only a single global (typically centralized) model.

The source probability distribution is represented by its sufficient statistics, thereby achieving an optimal encoding (compression) of source knowledge extracted from its (often high-dimensional) data and (possibly complex) model structure. This can avoid transferred message overheads that arise in standard multitask learning, which relies on the transfer of unprocessed source data for its conditioning task (i.e. learning).

Our previous FPD-based BTL schemes were unable to transfer higher moments of the source distribution, a resource that is necessary in achieving robust transfer (i.e. the rejection of imprecise source knowledge) [3, 10, 30, 31]. By careful design of the FPD-based constrained optimization problem in this paper, we formally solve this problem, obviating the informal adaptations in [10, 30] and the computationally expensive augmentation strategies in [32, 33].
The paper is organized as follows: Section II defines the BTL problem, where the central aim is to transfer the source output-data predictor to the target modeller. The new concept of the global target modeller is introduced and the general solution based on the FPD framework is provided. Section III instantiates this general setting of BTL in the case of GP regression tasks. Section IV provides a thorough exploration of these advances via synthetic-data experiments, focussing particularly on our method’s ability to capture correlation structure, to achieve robust transfer, and to benefit from the transfer of an expressive local model in the source. This will allow us to comment in an informed way on the performance limits of our approach. Section V validates our BTL scheme in a real-data context. Section VI provides further discussion of the key properties of our method, including its source knowledge compression and computational complexity aspects. Section VII concludes the paper with suggestions for future work.
II Bayesian transfer learning from a local to a global modeller
Consider a (nonparametric) regression task where the aim is to infer unknown (latent) function values, based on known input data (regressors) and known output data, where each output is an indirect (e.g. noisy) observation of the corresponding function value. A Bayesian modeller (Fig. 1(a)) addresses this problem by adopting a complete joint probability model, and then computing the posterior distribution for some fixed and finite-dimensional parameter, given a realization of the output data. This constitutes standard Bayesian conditioning in the context of the complete joint model. (In the sequel, we suppress the details of the conditioning where this is evident. Moreover, known (fixed-form) and unknown (variational-form) distributions are denoted distinctly. We use "" to denote "is defined to be equal to".)
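This complete-model conditioning can be illustrated with the simplest conjugate case: a Gaussian prior on a scalar latent mean and i.i.d. Gaussian observations. The sketch below is a hypothetical illustration (the function and variable names are ours, not the paper’s notation); it shows how the posterior follows mechanically once the joint model is fully specified.

```python
import numpy as np

def gaussian_posterior(y, noise_var, prior_mean, prior_var):
    """Standard Bayesian conditioning for a conjugate Gaussian model:
    prior N(prior_mean, prior_var) on a latent mean theta, with i.i.d.
    observations y_i ~ N(theta, noise_var). Precisions add; the posterior
    mean is a precision-weighted average of the prior mean and the data."""
    n = len(y)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + np.sum(y) / noise_var)
    return post_mean, post_var

# Three observations at 2.0 pull the posterior mean from 0 towards 2.
mean, var = gaussian_posterior(np.array([2.0, 2.0, 2.0]), 1.0, 0.0, 1.0)
```

With three unit-variance observations, the posterior variance is 1/(1 + 3) = 0.25 and the posterior mean is 0.25 × 6 = 1.5, between the prior mean and the sample mean, as expected under complete modelling.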
This paper addresses the problem of learning in a pair of GP regression tasks, called the source and target learning tasks, respectively. The source task is local, being informed by local output data, , available only to the source. Its local processing of knowledge, , and data, , is to be exploited in order to improve learning at the target task. The standard approach is that a global multitask modeller (Fig. 1(b)) elicits a joint model,
, capturing interactions between the source and target processes, and conditions on source and target output data,
and , respectively, to which it must have access [40, 6, 27, 1, 43, 47, 26, 9, 14, 7, 46]. This yields (1)
This scheme requires the prior elicitation of the joint model of the twotask learning system. We refer to this as complete modelling.
In this work, our aim is to relax these restrictive assumptions of the conventional multitask learning framework, as follows (Fig. 1(c)):


The target will not have direct access to the source data, ; and

the target’s joint model, , is distinct from the source’s local model, , and each is elicited independently. (We adopt the following notation: undecorated distributions are target models, with decorated ones denoting source distributions.)
Specifically, the goal of the present paper is to substitute the inaccessible source data, , with appropriately specified probabilistic source knowledge, . Technically, the inferential objective of the global target modeller is to compute , conditioned now on the transferred source distribution, , but not on . This conditioning on is incompletely specified because the target does not have a joint model of the type, , i.e. one that involves the target’s hierarchical specification of the source’s transferred distribution. Instead, the target’s learning task becomes one of optimally choosing its knowledge-conditioned model, , in this incompletely modelled case (necessarily, then, is unknown). The fact that the global target modeller (Fig. 1(c)) processes (a result of learning) is an important progression beyond conventional global multitask learning (Fig. 1(b)), and we reserve the term Bayesian transfer learning for the case in Fig. 1(c). Among its consequences are the following:


the target learns only from the sufficient statistics of the source, without itself receiving the source data, (of course, it also processes its local target data, ); and

the target can be enriched by the (possibly expert) local source model.
It remains to decide on the optimal form of the target’s conditioned joint model, . Here, , i.e. the source transfers its source-data-conditioned predictor of an unrealized (i.e. unobserved) output, denoted by , for which the known input is . Hence, the target’s unspecified joint model must be augmented, to yield
(2) 
which follows from the chain rule. Fully probabilistic design (FPD) is the axiomatically justified Bayesian decision-making framework [17, 16] for optimally choosing the required conditional distribution. FPD is closely related to the minimum cross-entropy principle [38], as further explained in [35]. It solves the following constrained optimization problem: (3)
where
is the Kullback-Leibler divergence [19] from to , and is the expected value under . The FPD-optimal model (3) is the solution (i) that respects the knowledge constraints given by the set of possible models, , and (ii) that is closest to the ideal model, . The latter must declare the target’s preferences about . These are as follows:

The set of possible models, , is delineated by restricting the functional form of (2),
(4) where we assume that is conditionally independent of given . (2) then has the form
(5) where the conditioned factor remains unknown and unrestricted. Since is fixed, is the only unknown quantity in (5), and the set of possible models is therefore
(6) 
Adopting the same chain rule expansion as in (2), the target’s ideal model is specified as
(7) Comparing with (2), the following ideal declarations have been made by the target:
Specifically, in (8), the target assumes as its ideal that is conditionally independent of given , and—in order to transfer the source output-data predictor, —adopts the source’s predictor of , i.e. , as its ideal model for . Similarly, in (9), the target ideally models conditionally independently of given its own , and adopts its own joint model, , as its ideal, .
These specifications imply a unique design for the target’s source-knowledge-constrained optimal distribution (3), as established in the following proposition.
Proposition 1.
The target modeller constrains the unknown model (2) to belong to the set of possible models, (6), and adopts the ideal model (7). Then the target’s FPD-optimal model (3) is
(10) 
where
(11) 
Proof.
See Appendix A-B. ∎
The design (11) is the optimal Bayesian conditioning mechanism we have been looking for. It forms the update from the preposterior, , to the FPD-optimal, conditioned, posterior, . The exponential structure in (11) constitutes the updating term which incorporates . It has the Boltzmann structure that is typical of entropic designs [35, 36].
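The Boltzmann structure of such entropic designs can be sketched in a discrete toy setting: the minimizer of a KLD-plus-expected-penalty objective is the exponentially tilted distribution. The names and numbers below are illustrative assumptions, not the paper’s operators.

```python
import numpy as np

def boltzmann_tilt(prior, penalty):
    """Entropic (Boltzmann) update in a discrete setting: the minimizer of
    KL(q || prior) + E_q[penalty] over distributions q is
    q* proportional to prior * exp(-penalty)."""
    w = prior * np.exp(-penalty)
    return w / w.sum()

def objective(q, prior, penalty):
    """KL(q || prior) plus the expected penalty under q."""
    return float(np.sum(q * np.log(q / prior)) + q @ penalty)

prior = np.array([0.5, 0.3, 0.2])
penalty = np.array([1.0, 0.2, 0.5])
q_star = boltzmann_tilt(prior, penalty)
```

Analytically, the minimum value of the objective is -log(sum(prior * exp(-penalty))), and any other candidate (e.g. the prior itself) scores worse, which mirrors the exponential updating term in (11).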
(AL12a)  
(AL12b) 
(AL13a)  
(AL13b) 
III Local and global Gaussian process regression modellers
We instantiate the framework in Section II to the case of source and target GP regression tasks [5, 24]. For this purpose, consider a stochastic process, , over a scalar random function, , such that any dimensional collection of evaluation points (i.e. regressors), , induces a joint probability distribution over the function values, . The GP, , is conditioned on known mean function, , assumed (without loss of generality) to be zero. The known covariance (kernel) function is
. In this way, all joint distributions are specified to be multivariate Gaussian,
. Here, the mean vector and covariance matrix are known. We suppress the parameters of the mean and covariance functions in the sequel for notational simplicity. This GP is indirectly (i.e. noisily) observed at the evaluation points via an additive, uncorrelated, Gaussian noise process. Hence, the joint distribution is
(14a)  
(14b) 
with the GP as prior (14b). Here,
is the conditional output-data variance,
is the identity matrix, and
are all known parameters. In the rest of this section, we restore the explicit conditioning, as it is relevant to the calculus. We now adopt this stochastic structure in the isolated (i.e. unconditioned on transferred knowledge) local source task of BTL (Fig. 1(c)). Our purpose is to transfer an output-data predictor via construction of the GP posterior at unobserved test points, as follows.
Remark 1.
The isolated local source modeller adopts (14), which is assumed to be strict-sense stationary. Then, the posterior distribution—evaluated at a single test point (now shown explicitly in the conditioning)—is
where
(15a)  
(15b) 
with , , and . Here, is a second-order test pair. The posterior predictor of unobserved outputs, , for which the (known) test points are , is
(16) 
where .
Proof.
See, for example, [37], Chapter 2. ∎
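A minimal numerical sketch of the posterior computations in Remark 1, assuming a squared-exponential kernel and the standard Cholesky-based implementation of the posterior equations (15), as in [37], Chapter 2; the output-data predictor (16) is then obtained by adding the noise variance to the predictive covariance diagonal. All names and values here are illustrative.

```python
import numpy as np

def sq_exp_kernel(x1, x2, signal_var=1.0, lengthscale=1.0):
    """Squared-exponential covariance function (Table II, SE)."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(x, y, x_star, noise_var=1e-6):
    """Posterior mean and covariance of the latent function at test points
    x_star, given noisy outputs y observed at inputs x (cf. (15))."""
    K = sq_exp_kernel(x, x) + noise_var * np.eye(len(x))
    Ks = sq_exp_kernel(x, x_star)
    Kss = sq_exp_kernel(x_star, x_star)
    L = np.linalg.cholesky(K)                      # stable solve with K
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    return Ks.T @ alpha, Kss - v.T @ v

x = np.array([0.0, 1.0, 2.0])
y = np.sin(x)
mean, cov = gp_posterior(x, y, x)   # predict back at the training inputs
```

With near-noiseless observations, the posterior mean at the training inputs reproduces the data and the posterior variance there collapses, as the conditioning in Remark 1 requires.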
Recall that only the source is a local modeller of the kind above. In contrast, the target modeller augments its model to assess the global sourcetarget system (Fig. 1(c)). Therefore, the joint prior distribution, , adopted by the global target modeller before processing and , is specified by
(17a)  
(17b)  
(17c) 
where
(18) 
is the symmetric covariance matrix describing the target’s prior beliefs about the interactions between the source and target GP function values. It is expressed in block-matrix form, involving three matrix degrees-of-freedom (the superscript denotes matrix transposition). Specific cases will be considered in Section IV. The target also specifies the conditional variance of the source output data at any test point, and the conditional variance of any target output datum. It remains to compute the target’s posterior joint inference of the GP function values under its joint model (17), i.e. after it has processed its local data and the transferred knowledge, respectively.
Proposition 2.
The global target modeller elicits the joint prior distribution (17) and processes the source output-data predictor (16). Then, the FPD-optimal posterior distribution (11)—evaluated at a test point—is
(19) 
where
with and being specified in (AL12) for .
Proof.
See Appendix A-C. ∎
The proposed global target GP regression modeller, supported by FPD-optimal BTL from the local source GP regression modeller, is summarized in Algorithm 1.
IV Synthetic Data Experiments
In this section, we investigate the performance of our algorithm (Algorithm 1) against a number of alternative transfer learning algorithms, in particular focussing on distinctions between the analysis model underlying each algorithm and the synthesis model used to generate the synthetic data. We illustrate the following key properties of Algorithm 1:


the ability to achieve robust knowledge transfer—i.e. rejection of imprecise source knowledge—due to successful transfer of all moments of (16);

the ability to process known correlation between the source and target latent functions, , via the target’s specification of the covariance structure, (18);

the experimental bounding of the performance of Algorithm 1, which holds under the ideal condition when the synthesis and analysis models are equal; and

the ability to benefit from the transfer of an expressive local analysis model in the source.
The first three issues are explored in Section IV-A, and the fourth in Section IV-B.
Algorithm  Description
Source No Transfer  (SNT)  Remark 1  (Fig. 1(a))
Target No Transfer  (TNT)  Remark 1  (Fig. 1(a))
Intrinsic Coregionalization Model  (ICM)  [1]  (Fig. 1(b))
Fully Probabilistic Design  (FPDa)  Algorithm 1  (Fig. 1(c))
Fully Probabilistic Design  (FPDb)  [31]
Throughout, the synthesis model is the (rank-1) intrinsic coregionalization model (ICM) [1],
(20)  
and so . Here, and . It follows that the covariance matrix (18) is
(21) 
where
(22) 
is the coregionalization matrix [2], and denotes the Kronecker product of the matrix arguments. We adopt the mean absolute error (MAE) as the performance measure:
(23) 
where is the true target function value (Fig. 1),
is the lower subvector of the posterior predictive mean estimate (
AL12a), is the absolute value, and are the test points.

Covariance function  Expression

Constant  (C)  
Linear  (L)  
Polynomial  (P)  
Cosine  (Co)  
Squared Exponential  (SE)  
Rational Quadratic  (RQ)  
Matérn ()  (M) 
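The rank-1 ICM covariance construction in (21)-(22) amounts to a Kronecker product of a coregionalization matrix with a shared base kernel matrix. The sketch below, with hypothetical task weights and a squared-exponential base kernel, illustrates this; the weights and grid are illustrative, not the experiment’s settings.

```python
import numpy as np

def icm_covariance(K, a):
    """Joint ICM covariance over len(a) tasks: (a a^T) kron K, where a holds
    one weight per task and K is the shared base kernel matrix (cf. (21)-(22))."""
    B = np.outer(a, a)          # rank-1 coregionalization matrix
    return np.kron(B, K)

x = np.linspace(0.0, 1.0, 5)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # SE base kernel matrix
a = np.array([1.0, 0.8])                            # source/target weights
C = icm_covariance(K, a)
```

The off-diagonal block of C equals the product of the two task weights times K, which is exactly how the coregionalization matrix encodes source-target correlation; the full matrix remains symmetric and positive semidefinite.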
IV-A The bounding performance of our BTL algorithm (Fig. 2)
Settings: We choose in (21) the squared-exponential (covariance) kernel function with the parameters (the full list of kernel functions involved in Section IV is defined in Table II). The complete parameterization of (20) is therefore . Furthermore, scalar input data (i.e. ) are generated via , where
is the uniform distribution on the open interval
. The test points are placed on a uniform grid in the interval.

Performance: We compare the performance of the algorithms in Table I as we adapt the source knowledge quality via (20) and the correlation between the source and target latent functions via in (22), holding all other parameters constant. In this experiment, we eliminate misspecification between the ICM synthesis model (20) and the analysis models of the various algorithms in Table I; i.e. they all have perfect knowledge of . Fig. 2(a) and (b) show the MAE for target weights, and , respectively, as a function of source variance, . The TNT algorithm (i.e. the isolated target task) defines the baseline performance since it does not depend on the source knowledge, and so it is invariant to all source settings. If any of the BTL algorithms (FPDa and FPDb) yields an MAE below or above this level, it is said to deliver positive or negative transfer, respectively. If any algorithm saturates at this level, it is said to achieve robust transfer. The source MAE of the SNT algorithm (also an isolated task) is computed in terms of the quantities and , and is therefore unrelated to the target MAE (23). Indeed, we present this source MAE to track the influence of changing and on the performance of the SNT algorithm. The latter provides the transferred source output-data predictor to the FPDa and FPDb algorithms. Consequently, the performance of SNT will impact FPDa and FPDb, as explained in the next paragraph.
The ICM algorithm delimits the optimal performance for all , since it adopts the ICM synthesis model (20) as its analysis model (i.e. misspecification is completely eliminated). Our FPDa algorithm (Algorithm 1) very closely follows the performance of the ICM algorithm (Fig. 2), despite (i) adopting incomplete modelling, and (ii) processing source statistics rather than the raw source output data, ; i.e. FPDa is a robust BTL algorithm. This improves on our previous FPDb algorithm, which achieves positive transfer only for (high-quality source knowledge) but negative transfer for (low-quality source knowledge); i.e. it is non-robust.
Fig. 2 depicts the MAE for fixed target weight , and varying source weight, (Fig. 2(c) and (d)). The FPDa algorithm again achieves close tracking of the ICM algorithm for all the parameter settings. We note that the critical point where the performance curves converge occurs when . For this setting, we explore the influence of in Fig. 2(c), holding . The main feature here is the symmetry of all the performance curves: around for FPDa, ICM and SNT; around for FPDb; and (trivially) everywhere for TNT. In all cases, this is due to the fact that quadratically enters the correlation structure, (22), of the global target modeller. When (e.g. the case in Fig. 2(d) where ), the main point to note is that FPDb cannot benefit from the improved positive transfer available from this more precise source. FPDa does benefit from this source precision because it, uniquely, exploits the target’s correlation structure, .
In summary, Fig. 2 illustrates the fact that the ICM algorithm—with exactly matched synthesis and analysis models—delimits the best predictive performance that can be attained by the FPDa algorithm. That said, misspecification of the ICM analysis model—as will occur with probability one in real-data settings—will undermine its performance, and provide opportunity for FPDa to outperform it. This is so, since FPDa allows source and target to elicit their models independently (Fig. 1(c)). We demonstrate this crucial model robustness feature of our BTL algorithm in the next section.
IV-B Transfer of the source’s (local) analysis model (Fig. 3)
In our BTL framework (i.e. FPDa), the source modeller independently chooses a different model from the global target modeller to drive its source learning task. The source distribution, (16), therefore encodes not only the information from the local source data, , but also the source model structure. In this section, we demonstrate the features of this approach under conditions of mismatch (i.e. disagreement) between the source’s and target’s model of . Specifically, we will now design an experiment where the source modeller more closely captures the synthesis model than the global target modeller does; i.e. the source exhibits local expertise, as is often the case in practice.
Settings: The purpose of this study is to explore the impact of kernel structure mismatch between the global target’s analysis model and the ICM synthesis model. In all these cases, we arrange for the source analysis model to match the synthesis model, achieving local source expertise (above). The structures of the synthesis and analysis models are varied only via the kernel function, (21), and its parameters, , for the seven cases in Table II. The specific choices of these parameters in our current study are as follows: . Recall that—in the previous simulation study (Fig. 2)—the remaining parameters, , influence the transfer learning properties of the ICM and FPDa algorithms. We choose the settings, and , where all the transfer algorithms benefit equally from the source data, and so the only differential benefit between the algorithms will be in respect of the quality of the analysis model adopted by the source in computing its transferred knowledge.
Performance: Fig. 3 shows the MAE for all possible combinations of the kernel functions in the ICM synthesis model (Table II), again adopted in the TNT, ICM, and FPDa analysis models of the respective algorithms in Table I (i.e. these are kernel combinations in toto). Since the SNT algorithm is arranged to capture perfectly the source part of the ICM synthesis model, it requires only a single column in Fig. 3(a). We do not present the FPDb algorithm, since—with the currently adopted experimental settings—it reduces to the FPDa algorithm (Fig. 2(c)). The TNT algorithm delineates the baseline MAE performance of no transfer (as it did in Section IV-A), allowing us to assess the extent of positive transfer for the various kernel settings in ICM and FPDa. Note that both algorithms stay in the positive-transfer regime for all combinations of the synthesis and analysis models, thus benefiting from the source knowledge even in the misspecification cases. For this reason, we have illustrated the absolute-value differential MAE in Fig. 3(c) and (d) (i.e. higher values are better in these two subfigures). Furthermore, the red asterisks indicate those kernel combinations for which our FPDa algorithm outperforms the ICM algorithm. Fig. 3(c) and (d) reveal that the ICM algorithm slightly outperforms the FPDa algorithm on all diagonal entries, i.e. where there is no mismatch between the synthesis and analysis models. This is consistent with our finding in Section IV-A, namely that the ICM algorithm achieves the optimal performance in this (unrealistic) case of perfect analysis-synthesis model matching. Conversely, the FPDa algorithm demonstrates improved robustness to kernel structure misspecification when compared to the ICM algorithm, particularly when FPDa adopts a more complex analysis model—i.e. SE, RQ, M (Table II)—and the synthesis model is simple, i.e. C, L, P, Co. This can be seen from the upper-right yellow quadrant in Fig. 3(d).
These more complex kernel functions are, indeed, a common choice for analysis of real data, where it is often hard to choose a kernel function based on visual inspection alone. The remaining analysis-synthesis combinations—especially those under the diagonal in Fig. 3(d)—yield FPDa predictive performances in the target that are almost indistinguishable from those of the ICM algorithm.
V Real-Data Experiments
The Swiss Jura geological dataset (https://sites.google.com/site/goovaertspierre/) records the concentrations of seven heavy metals—measured in units of parts-per-million (ppm)—at (spatial) locations in a 14.5 region. This dataset has been widely studied in the context of multitask learning (see, for example, [1]). In our study, the objective is to predict concentrations of the less easily detected metal, cadmium (Cd), based on knowledge of the more easily detected metal, nickel (Ni). In order to demonstrate our BTL approach (Algorithm 1) in this context, the source learning task processes Ni concentrations, and transfers its probabilistic output-data predictor to the target learning task of Cd concentration prediction. Note that we perform transfer with only this Ni source task, whereas, in [1], zinc (Zn) concentrations were also processed at the source.
Alignment with the FPD framework: The SNT algorithm processes the Ni concentrations measured at all locations (i.e. source data, ) and uses these to compute the Ni output-data predictor (16). The FPDa algorithm processes of the measured Cd concentrations (i.e. target data, ), along with the transferred Bayesian predictive moments, and , of the source Ni output-data predictor (16), in order to improve the prediction of the Cd concentrations at the holdout locations, , with spatial coordinates . The aim is to compute and , where and are the lower subvector and lower-right submatrix of (AL12a) and (AL12b), respectively. In the Ni source task (SNT), the MAE for the GP predictive mean (15a) was computed across a wide range of candidate kernel functions, weighting the inputs via automatic relevance determination (ARD) [37]. The optimal choice was the rational quadratic kernel (Table II). In the isolated Cd target task (also with ARD), the Matérn kernel proved to be optimal, and was adopted for all the target learning algorithms (Table I).
Parameter learning: The notion of a synthesis model (of the type in (20)) does not, of course, arise in this real-data context. Instead, we adopt maximum likelihood (ML) estimation [37] in order to learn the parameters of the analysis models underlying the various algorithms in Table I. For this purpose, we use the iterative L-BFGS-B algorithm [50]—widely available as a library tool—to compute a local optimum of the log-marginal likelihood surface. The maximum number of iterations is limited to 20,000. We initialize the parameters (Table II) of the rational quadratic kernel with and the parameters of the Matérn kernel with , where the value of the lengthscale, , now applies to both input dimensions of in accordance with the ARD mechanism. The coefficients, and , parameterizing the coregionalization matrix, (22), are initialized as . For this Jura dataset, we found that the adopted ML procedure was insensitive to perturbations of these initial parameters.
Algorithm  MAE [ppm]

TNT  5.9273e-01 ± 0.0001e-01
ICM  5.2808e-01 ± 0.0003e-01
FPDa  4.9966e-01 ± 0.0005e-01
FPDb  1.9510e+00 ± 0.0003e+00
Performance: To assess quantitatively the prediction performance of the algorithms in Table I, we compute the MAE (23) across the holdout locations where we wish to predict cadmium (Cd). We find that the FPDa algorithm delivers positive transfer (i.e. it has a lower MAE than the baseline TNT algorithm), and, importantly, it outperforms the ICM algorithm. This improves on our former FPDb algorithm, which suffers negative transfer because—as explained in Section IV—it is not equipped to adopt and learn the target’s correlation structure via (22). Our FPDa performance is close to that reported in [1], despite the fact that we transfer predictive source knowledge only from Ni concentrations, whereas Zn concentrations are also processed in [1], as already noted above. Returning to our own FPDa algorithm and the alternatives in Table I, we summarize our findings in Fig. 4. In the top row, we illustrate the spatial maps of the marginal predictive mean concentration of Ni (SNT) and Cd (TNT, ICM, FPDa), equipping these with uncertainty represented by the corresponding marginal predictive standard deviation maps (bottom row). These have been computed on a uniform spatial grid of the input, . In this sense, the current study can be characterized as an image reconstruction task, driven by non-uniformly sampled and noisy measurements (i.e. pixel values), with registration between the source and target images. The Matérn kernel provides the required prior regularization that induces spatial smoothness in the (Cd) reconstruction [28, 41]. Note that the FPDb algorithm is not illustrated in Fig. 4, since it performs poorly (Table III).
Visual inspection of the TNT, ICM and FPDa columns of Fig. 4 reveals that our algorithm (FPDa) localizes the Cd deposits more sharply than either the ICM algorithm, or (unsurprisingly) the TNT algorithm, which does not benefit from any Ni source learning. In particular, note the horizontal blurring of the Cd predictive mean map for ICM (third column, top), an artefact that is avoided in FPDa (fourth column, top). This localization of the FPDa prediction supports the exploration of Cd deposits in a focussed area around . In contrast, the ICM predictive mean map does not sufficiently resolve the coordinate to support an exploration decision in this area.
VI Discussion
We will now assess our FPD-optimal BTL framework (i.e. Algorithm 1 (FPDa)) in the context of the conventional Bayesian multitask learning framework (via the ICM algorithm). We will focus on a number of specific themes that emerge from the evidence in Sections IV and V.
Transfer of moments: In the previously considered Gaussian settings [10, 30, 31] of our FPD-optimal BTL framework, the second-order moments of the output-data predictor, , were not successfully transferred, leading to negative transfer. This is also true of FPDb in this paper (Fig. 2(a) and (b) and Table III). In all those papers, the order of arguments in the KLD was the reverse of the one proposed in this paper (11), a reversal which is essential for the robust transfer property of our FPDa algorithm. This reversal has ensured the successful transfer of the second-order moment, , of (16), even in the current Gaussian setting, yielding the block-diagonal covariance structure (AL12). Equipping the FPDa algorithm with delivers robust knowledge transfer, as demonstrated in the following remark.
Remark 2.
Consider (AL12) for (i.e. the target after transfer) in the limit where the source knowledge becomes uniform (and, therefore, noninformative [12]). In this case,
being the moments of the isolated TNT learning task. This rejection of noninformative source knowledge by the target in our FPDa algorithm is what we mean by robust transfer.
The role of in effecting robust transfer—as well as in strengthening the positive transfer above threshold which we observed in Fig. 2(a) and (b)—will be studied more technically in a forthcoming paper.
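A scalar caricature of the limiting behaviour in Remark 2 can be sketched as follows. This is not the matrix update (AL12): the precision-weighted fusion below is a simplifying assumption for illustration only, showing how uniform (noninformative) source knowledge is rejected.

```python
import numpy as np

def fuse(mean_t, var_t, mean_s, var_s):
    """Precision-weighted combination of the isolated target moments
    with transferred source moments (scalar Gaussian caricature of
    robust transfer; an assumed form, not the paper's (AL12))."""
    var = 1.0 / (1.0 / var_t + 1.0 / var_s)
    mean = var * (mean_t / var_t + mean_s / var_s)
    return mean, var

mean_t, var_t = 1.0, 0.5      # isolated (TNT-style) target moments
mean_s = 3.0                  # source predictive mean
for var_s in (1.0, 1e2, 1e8):  # source knowledge becomes uniform
    print(fuse(mean_t, var_t, mean_s, var_s))
# as var_s grows, the fused moments return to (mean_t, var_t):
# noninformative source knowledge is rejected, i.e. robust transfer
```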
Weighting of the transferred knowledge: Recall that the matrix, (18)—which is instantiated in the ICM case in (21) and (22)—expresses the target’s interaction with the source. Inspecting (AL12) and (AL13), we also see that this matrix controls the weighting mechanism in the FPDa algorithm, tuning the influence of the transferred knowledge, , on the optimal target conditional, (11). This is seen in the fact that its target moments, and , are functions of it. The study of this behaviour can be formalized by adopting limit arguments similar to those in Remark 2. Furthermore, the optimization of this hyperparameter (see (19)) via maximum likelihood estimation (Section V) ensures that the weighting is data-driven (i.e. adaptive). This is also an advance beyond our previous proposals for FPD-optimal BTL [10, 30, 31], in which no transfer weighting mechanism was induced. This progress has been achieved only because the target modeller is an extended modeller of the interacting source-target tasks, unlike the situation in our previous work, where the target models only its local task. Technical details relating to this induced data-driven transfer weighting will be developed in future work.

Robustness to analysis-synthesis model mismatch: In simulation contexts—such as in Section IV—the data are simulated from the chosen synthesis model, while each assessed algorithm is consistent with an analysis model. This provides an opportunity to assess the robustness of each algorithm to model misspecification, as studied in Section IV-B. We have already noted that the ICM algorithm achieved a performance optimum across all the studied algorithms in Section IV, since its (ICM) analysis model equals the (ICM) synthesis model used to generate all data in that study. Nevertheless, the FPDa algorithm performs almost as well (Fig. 2), despite the fact that it does not rely on complete instantiation of the synthesis model. Indeed, as explained in Section II, FPD-optimal BTL does not require completion of the interaction model between the source and target at all, and has therefore proved more robust to ignorance of the synthesis model than the ICM algorithm. In real-data applications—such as in Section V—there is, of course, no synthesis model. The benefit of our incompletely modelled approach in terms of robustness has been demonstrated: FPDa can provide a closer fit to the data than ICM, as presented in Fig. 4 and Table III. The SNT task involves a second independent local modeller (Fig. 1(c)) which constructs its source output-data predictor, (16), using the same source data that the ICM algorithm processes. The source’s local modelling expertise (Section IV-B) has enriched the knowledge transfer, and provided a supplementary learning resource which is not available to ICM.
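The data-driven weighting discussed above rests on maximum-likelihood estimation of kernel hyperparameters (Section V). A generic sketch of that estimation step—using a hypothetical one-dimensional RBF kernel and a grid search, as a stand-in for the paper's model and optimizer—is:

```python
import numpy as np

def log_marginal_likelihood(X, y, ell, noise=0.1):
    """GP log marginal likelihood as a function of the length-scale
    ell, under a hypothetical RBF kernel (a stand-in for the paper's
    hyperparameterized covariance structure)."""
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell ** 2)
    K += noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(-0.5 * y @ alpha - np.log(np.diag(L)).sum()
                 - 0.5 * len(X) * np.log(2.0 * np.pi))

X = np.linspace(0.0, 1.0, 20)
y = np.sin(6.0 * X)
grid = np.linspace(0.05, 1.0, 40)
# data-driven choice: the length-scale maximizing the likelihood
ell_ml = max(grid, key=lambda e: log_marginal_likelihood(X, y, e))
```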
Computational load of our BTL algorithm: The principal computational cost of all the algorithms in Table I resides in inverting the linear systems involved in the standard Gaussian (conditional) data update (15). This cost scales cubically with the number of data in the source and target tasks, respectively, i.e. , [22]. Owing to the sequential nature of the knowledge processing in FPDb (i.e. source processing of raw data , then target processing of the transferred source Bayesian predictor, , along with the target raw data, ), the net computational load of FPDb is . In contrast, our new BTL algorithm (FPDa) shares the property with ICM of bidirectional knowledge flow (Fig. 1(c)). This leads to the matrix augmentation form of the inverse systems in (AL12) and, correspondingly, a more computationally expensive——algorithm. There is a rich literature on efficient inversions for GP regression, exploiting possible reductions (notably sparsity) in the covariance matrix, (18). These same reductions have been applied in multitask GP learning [22], such as in ICM. Since our new FPDa BTL algorithm shares the same augmented matrix structure (AL12) as ICM, it, too, can benefit from these reductions.
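The cubic costs quoted above arise from factorizing the (augmented) covariance. A sketch contrasting the sequential and augmented solves follows; the random PSD matrices are illustrative stand-ins for the kernel matrices, and the zero off-diagonal blocks of the augmented system are a simplification (the true augmented covariance couples the tasks):

```python
import numpy as np

rng = np.random.default_rng(0)

def gp_solve(K, y, jitter=1e-6):
    """Cholesky-based solve of (K + jitter*I) alpha = y: the O(n^3)
    core of the Gaussian (conditional) data update (15)."""
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
    return np.linalg.solve(L.T, np.linalg.solve(L, y))

def random_psd(n):
    """Well-conditioned random PSD stand-in for a kernel matrix."""
    A = rng.standard_normal((n, n))
    return A @ A.T / n + np.eye(n)

n_s, n_t = 50, 40
K_s, K_t = random_psd(n_s), random_psd(n_t)
y_s, y_t = rng.standard_normal(n_s), rng.standard_normal(n_t)

# sequential (FPDb-style) processing: two solves, O(n_s^3) + O(n_t^3)
alpha_s = gp_solve(K_s, y_s)
alpha_t = gp_solve(K_t, y_t)

# augmented (FPDa/ICM-style) processing: one solve of size n_s + n_t,
# i.e. O((n_s + n_t)^3)
K_aug = np.block([[K_s, np.zeros((n_s, n_t))],
                  [np.zeros((n_t, n_s)), K_t]])
alpha_aug = gp_solve(K_aug, np.concatenate([y_s, y_t]))
```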
Source knowledge compression: Conventional multitask learning—such as in Fig. 1(b)—requires the communication of all of the source’s extended (training) data, , to the processor (either centrally or in some distributed manner), i.e. the transferred message size (quantified as the number of scalars) is . Conversely, our FPDa algorithm (Fig. 1(c)) is a truly Bayesian transfer learning algorithm, whose communication overhead resides only in the transfer of the source’s sufficient statistics, , of its output-data predictor at predictive inputs, , i.e. (16). Therefore, the transferred knowledge is fully encoded by these source sufficient statistics, without the requirement to transfer the raw extended source data, . The message size for our BTL algorithm is therefore . It follows that the condition which must hold for compression to be achieved in our BTL algorithm versus conventional raw data transfer is⁴, i.e. the number of predictive points must be lower than two times the dimension of each input (minus one). This condition expresses the objective of avoiding dilution of source knowledge in the predictor [44]. Note also that the source sufficient statistics are functions of the source GP kernel structure, (Remark 1). This can be highly expressive—with many degrees of freedom—in cases of local source expertise. We have provided a preliminary study of the benefit to the target of a distinct source covariance structure (Section IV-B, and see also the third discussion theme above). Indeed, far more expressive covariance structures in —involving weighted combinations of canonical kernel functions (Table II)—can also be adopted by the source [34]. In all cases, our Bayesian knowledge transfer compresses this resource into dimension-invariant sufficient statistics for subsequent processing by the target task.

⁴Here, we quote the implied inequality for the special case, .
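The compression condition can be checked numerically. Assuming—as a hedged reading of the text—that the raw message comprises n_s inputs of dimension d plus n_s scalar outputs, and that the sufficient statistics comprise a predictive mean of length n_p plus a symmetric n_p-by-n_p predictive covariance, the quoted inequality follows:

```python
def raw_message_size(n_s: int, d: int) -> int:
    """Scalars transferred with raw source data: n_s inputs of
    dimension d plus n_s scalar outputs (assumed form)."""
    return n_s * (d + 1)

def btl_message_size(n_p: int) -> int:
    """Scalars in the source sufficient statistics: predictive mean
    (n_p) plus symmetric predictive covariance (n_p*(n_p+1)/2)."""
    return n_p + n_p * (n_p + 1) // 2

# special case n_p == n_s (footnote 4): compression iff n_p < 2*d - 1,
# i.e. the number of predictive points is below twice the input
# dimension, minus one
d = 5
for n_p in range(1, 12):
    compressed = btl_message_size(n_p) < raw_message_size(n_p, d)
    assert compressed == (n_p < 2 * d - 1)
```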
———
We now summarize key findings in respect of our BTL framework (and its incomplete modelling of interaction between the independent source and target modellers (Fig. 1(c))) versus conventional multitask learning (and its complete and unitary modelling of all source-target variables (Fig. 1(b))).


Our FPD-optimal BTL algorithm (Fig. 1(c)) is truly a transfer learning algorithm—as opposed to a multitask learning algorithm—because it does not require complete modelling of source-target interactions. Among the consequences we discovered in this paper are (i) the possibility to avoid problems of model misspecification, and (ii) the opportunity to transfer potentially compressed source sufficient statistics of the data-predictive distribution, , instead of the raw data themselves.

The source and target modellers (Fig. 1(c)) are independent, and, indeed, our BTL algorithm is truly a multiple model algorithm, in contrast to the single global modeller of multitask learning (Fig. 1(b)). We have shown that a direct consequence is the ability of our algorithm to transfer source knowledge enriched by an expressive local kernel structure.

Our BTL algorithm delegates the processing of local data to the local (source) task before the resulting stochastic message, , is transferred for subsequent processing by the target task (along with its local (target) data). This opens up the possibility for distributed and parallel processing in a way that is not intrinsic to multitask learning.

In our formulation of BTL (Fig. 1(c)), we have exploited the opportunity to transfer knowledge to a target task that is, itself, a global modeller. The paper has presented evidence to show that this extension of the target ensures that unreliable source knowledge is rejected (i.e. robust transfer is achieved (Remark 2)), and our positive transfer above threshold is as good as that of conventional multitask learning, such as ICM.
VII Conclusion
A vulnerability of conventional multitask learning arises from the fact that there must exist a global modeller of all the tasks in the system. This imposes a requirement on the global modeller to capture task interactions accurately, and performance is undermined otherwise. Furthermore, it is intrinsic to that framework that the global modeller must process raw data from the remote sources. This potentially incurs a communication overhead and forces the target to process remote source data about which it may lack local expertise. Against this, the complete nature of multitask modelling ensures its optimal performance in the (admittedly unlikely) case where task interactions are accurately modelled.
The FPD-optimal Bayesian transfer learning (BTL) framework developed and tested in this paper has achieved important progress beyond the conventional state of the art above. Its key advance is that it does not require elicitation of a model of dependence between the interacting tasks (a defining aspect of BTL, in our opinion, called incomplete modelling), and so our approach avoids the misspecification that inevitably arises in conventional multitask learning. Instead, it chooses the conditional target distribution—from among the uncountable cases consistent with the partial model and with the knowledge constraints—in a decision-theoretically optimal manner. A number of important practical benefits flow from this. Firstly, the local source processes its local data—exploiting its local modelling expertise—into a Bayesian (i.e. probabilistic) predictor. Secondly, only the sufficient statistics of this predictor need be transferred to the target. Thirdly, the target then sequentially has the opportunity to process this source knowledge along with its local target data. Fourthly, our framework allows this target to be a global modeller of both tasks, ensuring that optimal positive transfer is preserved in the case specified at the end of the previous paragraph.
The success of our algorithm hinges on its ability to transfer all moments of the source predictor, an advance beyond our earlier variants of BTL, achieved via careful specification of the KLD objective (11). The source covariance matrix, , has been vital in driving the weighting attached to the transferred predictor by the target. This has obviated the need for the hierarchical relaxation of the target model which was necessary in our previous work in order to learn this weight, and which incurred relatively expensive variational approximation [32, 33].
The FPD-optimal BTL design of the target’s source-conditional learning (11) can be interpreted as a binary operator, closed within the class of probability distributions. It operates (noncommutatively) on the transferred source output-data predictor, , and on the target’s posterior distribution, , yielding the optimal knowledge-conditional target posterior inference, . This closure of our BTL operator within the class of probability distributions recommends it in a wide range of continual learning tasks [28, 25]. For instance, in incompletely modelled networks, the requirement for all BTL knowledge objects to be distributions ensures that local computational resources onboard local nodes are exploited in processing local data into these local distributions. Furthermore, target nodes can recursively act as source nodes in a continual learning process, so that probabilistic decision-making is distributed across a network, subject to specification of the network architecture. Such applications of our work in Bayesian networks—and, indeed, in deep learning contexts
[21]—can provide a rich opportunity for our Bayesian transfer learning paradigm to address key technology challenges at this time.

Appendix A Proofs
A-A Preliminaries
Lemma 1.
Let us consider the following factorized, jointly Gaussian model:
Then, the joint distribution is
(24) 
Proof.
The proof follows from standard analysis. ∎
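For the reader's convenience, the standard result underlying Lemma 1 can be stated in generic notation (the symbols below are ours and need not match the paper's): a Gaussian prior composed with a Gaussian linear-conditional model yields a jointly Gaussian distribution,

```latex
% factorized model (generic notation)
p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_1), \qquad
p(x_2 \mid x_1) = \mathcal{N}(x_2 \mid A x_1 + b, \Sigma_2),
% implied joint distribution
p\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right)
= \mathcal{N}\!\left(
\begin{bmatrix} \mu_1 \\ A\mu_1 + b \end{bmatrix},
\begin{bmatrix}
\Sigma_1 & \Sigma_1 A^{\top} \\
A \Sigma_1 & A \Sigma_1 A^{\top} + \Sigma_2
\end{bmatrix}\right).
```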
Lemma 2.
Let us consider a slight generalization of the Gaussian distribution in (24). Then, the conditional and marginal densities are
where
Proof.
The proof follows from standard analysis. ∎
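Similarly, the conditional and marginal densities in Lemma 2 follow from the standard Gaussian conditioning identities, stated here in generic notation (again, not necessarily the paper's symbols):

```latex
p\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right)
= \mathcal{N}\!\left(
\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
\right)
\;\Longrightarrow\;
\begin{aligned}
p(x_1 \mid x_2) &= \mathcal{N}\!\big(
\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\;
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big), \\
p(x_2) &= \mathcal{N}(\mu_2, \Sigma_{22}).
\end{aligned}
```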