Methods of data and information fusion are receiving much attention at present because of their range of applications in Industry 4.0, in navigation and localization problems, in sensor networks, robotics, and so on Fou:17 – AlaHan:20 .
The terms data fusion and information fusion are often used as synonyms. However, in some scenarios, the term data fusion is used for raw data that are obtained directly from the sensors, while the term information fusion concerns processed or transformed data. Other terms associated with data fusion include data combination, data aggregation, multi-sensor data fusion, and sensor fusion Cas:13 .
Data fusion techniques combine multiple sources in order to obtain improved (less expensive, higher quality, or more relevant) inferences and decisions compared to a single source. These techniques can be classified into three non-exclusive categories: (i) data association, (ii) state estimation, and (iii) decision fusion Cas:13 . In this paper, we focus on state estimation methods.
Conventional data fusion methods work with multiple data channels from one common domain, and originating from the same source. In contrast, cross-domain fusion methods work with data in different domains, but related by a common latent object Zhe:2015 . Data from different domains cannot be merged directly. Instead, knowledge—or “information” above—has to be extracted from these data and only then fused. One method of knowledge fusion is transfer learning, also known as knowledge transfer. This framework aims to extract knowledge from a source domain via a source learning task and to use it in the target domain with a given target learning task. The domains or tasks may differ between the source and target PanYan:10 . Examples of successful deployment of transfer learning in data fusion are found in OuyLow:20 – LinHuXiaAlhPir:20 . In accordance with the DIKW classification scheme proposed in Bed:20 , we will refer to transfer learning-based fusion as knowledge fusion.
The performance of transfer learning methods can be improved using computational intelligence Leetal:15 . Bayesian inference provides a consistent approach to building in computational intelligence. It does so via probabilistic uncertainty quantification in decision-making, taking into consideration the uncertainty associated with model parameters, as well as the uncertainty associated with combining multiple sources of data. In the Bayesian transfer learning (BTL) framework, to be championed in this paper, the source and target can be related through a joint prior distribution, as in KarQiaDou:18 – WanTsu:20 . BTL usually adopts a complete stochastic modelling framework, such as Bayesian networks LiWanLiWan:20 , Bayesian neural networks ChaKap:19 or hierarchical Bayesian approaches WilFerTad:12 . As already noted, these methods require a complete model of source-target interaction. In contrast, in PapQui:21 , BTL is defined as the task of conditioning a target probability distribution on a transferred source distribution. A dual-modeller framework is adopted, where the target modeller conditions on a probabilistic data predictor provided by an independent local source modeller. No joint interaction model between the source and target is specified, and so the source-conditional target distribution is non-unique and can be optimized in this incomplete modelling scenario. The target undertakes this distributional decision-making task optimally, by minimizing an approximate Kullback-Leibler divergence KulLei:51 . This generalized approach to Bayesian conditioning in incomplete modelling scenarios is known as fully probabilistic design (FPD) QuiKarGuy:16 .
Our aim in this paper is to derive a BTL algorithm for knowledge fusion that will use knowledge from several source state-space filters to improve state estimation in a single target state-space filter. All (observational and modelling) uncertainties are assumed to be bounded. State estimation under bounded noises represents a significant focus for state filtering methods, since, in practice, the statistical properties of noises are rarely known, with only their bounds being available. Bounded-noise methods avoid the adoption of unbounded noise models, which can lead to over-conservative design Ono:13 . To the best of our knowledge, the topic of BTL-based multi-task/filter state estimation with bounded noises has not yet been addressed in the literature, except in the authors’ previous publications JirPavQui:19a ; JirPavQui:20 . In those papers, BTL between a pair of filters affected by bounded noises is presented. The source knowledge is represented by a bounded output (i.e. data) predictor. The optimal target state filtering distribution is then designed via FPD. In JirPavQui:19a , the support of the state inference is an orthotope, while in JirPavQui:20 , it is relaxed to a parallelotope.
There are fusion techniques for state estimation with bounded noises, but these are conventional fusion methods as defined above. Data fusion methods using set membership estimation are addressed, for instance, in WanSheXiaZhu:19 – XiaYanQin:18 . In HanHor:00 , set membership and stochastic estimation are combined. In CheHoYu:17 , local Kalman-like estimates are computed in the presence of bounded noises. Particle filters Lietal:16 can also effectively solve the Bayesian estimation problem with bounded noises. However, they are computationally demanding. When used in a data fusion context, reduced computational complexity is obtained in HoaDenHarSlo:15 . In WanSheZhuPan:16 and BalCaiCri:06 , particle filtering techniques and set membership approaches are combined.
The current paper significantly extends and formalizes results on BTL reported in the above-mentioned authors’ papers JirPavQui:19a and JirPavQui:20 . Both of those papers report an improvement in target performance in the case of concentrated source knowledge (positive transfer) and rejection of diffuse source knowledge (robust transfer). However, the improvement was only minor compared to the performance of the isolated target, whereas ad hoc proposed variants exhibited significantly improved positive transfer. In the current paper, we formalize the above-mentioned informal variant, showing it to be FPD-optimal. The task of transfer learning-based knowledge fusion with bounded noises is solved in the case where the transferred knowledge is the source’s probabilistic state predictor. An extension to multiple sources is also provided in this paper.
The paper is organized as follows: This section ends with a brief summary of the notation used throughout the paper. Section 2 presents the general problem of FPD-optimal Bayesian state inference and estimation in the target, conditioning on transferred knowledge from a source in the form of the probabilistic state predictor. In Section 3, these general results are specialized to source and target state-space models with uniform noises, and are finally extended to the case of multiple sources in Section 3.3. Section 4 provides extensive simulation evidence to illustrate the performance of our FPD-optimal BTL scheme. A comparison with a contemporary (non-Bayesian) fusion method for uniformly driven state-space models is also provided, as well as a comparison with a completely modelled Bayesian network approach. Section 5 concludes the paper. The proofs of all the theorems are provided in Appendix A.
Matrices are denoted by capital letters (e.g. ), vectors and scalars by lowercase letters (e.g. ). is the (i,j)-th element of matrix . denotes the -th row of . denotes the length of a (column) vector , and denotes the set of . Vector inequalities, e.g. , as well as vector maximum and minimum operators, e.g. , are meant entry-wise. is the identity matrix. is the set indicator, equalling if and otherwise. is the value of a time-variant column vector , at a discrete time instant, ; is the -th entry of ; . is the Euclidean norm of and is the norm of . Note that no notational distinction is made between a random variable and its realisation, . The context will make clear which is meant.
2 FPD-optimal Bayesian transfer learning (FPD-BTL)
Assume two stochastically independent modellers, the source (with subscript S) and the target (without subscript), each modelling their local environment. Here, we will formulate the task of FPD-optimal Bayesian transfer learning (FPD-BTL) between this source and target, the aim being to improve the target’s model of its local environment via transfer of probabilistic knowledge from the source’s local environment, as depicted in Figure 1.
Before addressing the two-task context, let us recall the state estimation problem (filtering) for an isolated target, i.e. in the absence of knowledge transfer from a source.
In the Bayesian filtering framework Karat:05 , a system of interest is described by the following probability density functions (pdfs):
Here, is an -dimensional observable output, is an optional -dimensional known (exogenous) system input, and is an -dimensional unobservable (hidden) system state. We assume that (i) the hidden state process, , satisfies the Markov property; (ii) no direct relationship between input and output exists in the observation model; and (iii) the optional inputs constitute a known sequence , , as already stated.
Bayesian filtering, i.e. the inference task of learning the unknown state process, , given the data history , involves sequential computation of the posterior pdf, . Specifically, is a (multivariate) sequence of observed data, , . Evolution of is described by a two-step recursion (the data update and time update) initialized with the prior pdf, (2), and ending with a data update at the final time, .
The data update (Bayes’ rule) processes the latest datum, :
The time update (marginalization) infers the evolution of the state at the next time:
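The two-step recursion can be illustrated on a finite state grid, where the integral in the time update becomes a matrix-vector product. The following is only a minimal sketch under that discretization assumption: the function names, the three-state grid, and the numerical values of the prior, likelihood and transition matrix are all hypothetical and are not taken from the paper.

```python
import numpy as np

def data_update(prior, likelihood):
    """Bayes' rule on a grid: posterior is proportional to likelihood * prior."""
    unnormalized = likelihood * prior
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation is inconsistent with the prior support")
    return unnormalized / total

def time_update(posterior, transition):
    """Marginalization: predictor[j] = sum_i transition[j, i] * posterior[i]."""
    return transition @ posterior

# Hypothetical 3-point state grid (illustrative numbers only).
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.1, 0.8, 0.1])       # observation model evaluated at the datum
transition = np.array([[0.9, 0.1, 0.0],
                       [0.1, 0.8, 0.1],
                       [0.0, 0.1, 0.9]])     # each column sums to one

posterior = data_update(prior, likelihood)   # filtering pmf after the datum
predictor = time_update(posterior, transition)  # state predictor for the next step
```

Running the two functions alternately, starting from the prior and ending with a data update, mirrors the recursion described in the text.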
Next, we return to two stochastically independent modellers, i.e. the source and target (Figure 1). Each filter models its local system, and , respectively. The target has access only to the (probabilistic) state predictor of the source, , but not to the actual data or states of the source (Figure 1).
In the isolated target task, the modeller’s complete knowledge about the evolution of its local state and output is expressed uniquely by the joint pdf (i.e. the numerator in (2)):
where and .
Now, performing knowledge transfer as depicted in Figure 1, the target joint pdf (4) must be conditioned by the transferred source state predictor, , and so the target’s knowledge-conditional joint pdf takes the form . Since no joint model of the source and target relationship is assumed, this pdf is non-unique, and unknown. Specifically, it is a variational quantity, , in a function set, , of possible candidates, as follows:
We now separately examine the two factors on the right-hand side of (5):
The factor represents the target’s knowledge about its (local) state , after transfer of the source’s state predictor, , to the target. The target chooses to accept the source’s predictor as its own state model with full acceptance. The consequences of this definition will be discussed in Section 4.7. Based on this full acceptance, the target accepts that and are equal in distribution:
The factor now remains as the only variational factor, being a consequence of the target’s choice not to elicit an interaction model between the source and target (Figure 1). According to (2), , i.e. the observation model is conditionally independent, given , of . This conditional independence is preserved by the knowledge transfer. Therefore,
The main design—i.e. decision—problem for the target is now to choose an optimal form of (7).
where and . The set, , of the target’s admissible joint models, , following knowledge transfer from the source, is therefore
The optimal pdf, , respecting both the transferred knowledge and the target filter behaviour, is sought using fully probabilistic design (FPD), which is an axiomatically justified procedure for distributional decision-making KarKro:12 , Ber:79 . It seeks , being the joint pdf (8) that minimizes the Kullback-Leibler divergence (KLD) (below) KulLei:51 from to the target’s fixed ideal, . This ideal is defined as (4), i.e. the joint pdf of the isolated target filter, modelling its behaviour prior to (i.e. without) the transfer of source knowledge. To summarize, the ideal pdf, and the knowledge-conditional pdf to be designed by the target are, according to (4), (8) and (9):
Recall that the KLD KulLei:51 from to is defined as
(13) conditions the target’s knowledge about future on the transferred in an FPD-optimal manner. For simplicity, the superscript will be omitted in the resulting FPD-optimal pdf, i.e. .
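For the bounded-support pdfs used later in the paper, the KLD takes a particularly simple form. As a hedged illustration (scalar intervals only; the function name and the interval representation are our own assumptions), the KLD from a uniform pdf on an interval a to a uniform pdf on an interval b is ln(|b|/|a|) when a is contained in b, and is infinite otherwise:

```python
import math

def kld_uniform_intervals(a, b):
    """KLD from Uniform(a) to Uniform(b), with a = (lo, hi) and b = (lo, hi).

    Finite only if the support of the first pdf lies inside the support
    of the second; in that case it equals ln(|b| / |a|).
    """
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    if not (a_lo < a_hi and b_lo < b_hi):
        raise ValueError("intervals must have positive length")
    if a_lo < b_lo or a_hi > b_hi:
        return math.inf          # first pdf puts mass where the second is zero
    return math.log((b_hi - b_lo) / (a_hi - a_lo))

narrow_to_wide = kld_uniform_intervals((0.0, 1.0), (0.0, 2.0))   # ln 2
wide_to_narrow = kld_uniform_intervals((0.0, 2.0), (0.0, 1.0))   # infinite
```

The asymmetry is the relevant point: a candidate pdf whose support leaves the support of the ideal incurs an infinite penalty.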
We note the following:
The transferred source knowledge, , can be elicited in various ways that are unknown to the target; e.g. as an empirical distribution of a quantity similar to , or some unspecified distributional approximation, etc. QuiKarGuy:17 . In this paper, involving multiple state filtering tasks, we will assume that is the output of the source’s synchronized time update at (3).
3 FPD-BTL between LSU-UOS filtering tasks
As noted in Section 1, we are specifically interested in knowledge processing among interacting Bayesian state-space filters with uniform noises (LSU models, see below). We therefore instantiate the FPD-optimal scheme (13) for conditioning the target’s observation model on the source’s transferred state predictor in this specific context. Firstly, in Section 3.1, we review the isolated LSU-UOS filter, and derive the approximate solution to the related state estimation problem. Then, the required instantiation of FPD-BTL to a pair of these LSU-UOS filters is presented in Section 3.2. In Section 3.3, the framework is extended to multiple LSU-UOS source filters, transferring probabilistic state knowledge to a single target.
3.1 LSU-UOS filtering task for the isolated target
where , , . , , are known model matrices of appropriate dimensions; and are additive random processes expressing observational and modelling uncertainties, respectively, and their stochastic model must now be specified. We assume that and are uniformly distributed on known supports of finite measure:
where , , with finite positive entries, and denotes the uniform pdf on an orthotopic support (UOS), as now defined.
Consider a finite-dimensional vector random variable, , with realisations in the following bounded subset of :
where . This convex polytope, , is called an orthotope.
The uniform pdf of on the orthotopic support (17), called the UOS pdf, is defined as
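As a numerical illustration of the UOS pdf, the sketch below evaluates a uniform density on an axis-aligned box: the indicator of the orthotope divided by its volume. The function name and the lower/upper-bound parameterization are assumptions made for this example.

```python
import numpy as np

def uos_pdf(x, lower, upper):
    """Uniform pdf on the orthotope {x : lower <= x <= upper} (entry-wise)."""
    x = np.asarray(x, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if np.any(upper <= lower):
        raise ValueError("the orthotope must have positive volume")
    volume = np.prod(upper - lower)
    inside = np.all((lower <= x) & (x <= upper))   # set indicator
    return float(inside) / volume

inside_value = uos_pdf([0.5, 0.5], lower=[0.0, 0.0], upper=[1.0, 2.0])
outside_value = uos_pdf([1.5, 0.5], lower=[0.0, 0.0], upper=[1.0, 2.0])
```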
Model (14), (15), (16), together with (18), defines the linear state-space model with uniform additive noises on orthotopic supports, denoted the LSU-UOS model. Its observation and state evolution models (2) are equivalently specified as
Exact Bayesian filtering for the LSU model (19) and (20), i.e. computation of following (2) and (3), is intractable, since the UOS class of pdfs (Remark 1) is not closed under those filtering operations. One consequence is that the dimension of the sufficient statistic of the filtering pdf (2) is unbounded as grows, i.e. at an infinite filtering horizon, and so exact filtering cannot be implemented (the curse of dimensionality Karat:05 ). In JirPavQui:19b ; PavJir:18 , approximate Bayesian filtering with the LSU model (19) and (20), closed within the UOS class (18), is proposed. This involves a local approximation after each data update (2) and time update (3), as recalled below. This tractable but approximate Bayesian filtering procedure will be called LSU-UOS Bayesian filtering.
3.1.1 LSU-UOS data update
Define a strip, , as a set in bounded by two parallel hyperplanes, as follows:
Here are scalars, and .
In the data update (2), prior is processed together with in (19), and with the latest observation, , via Bayes’ rule. It starts at with . The resulting filtering pdf is uniformly distributed on a polytopic support that results from the intersection of the orthotopic support of and strips induced by the latest observation, :
The approximate Bayesian sufficient statistics, and , process tractably , yielding an implementable algorithm. The details are provided in JirPavQui:19b .
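To give a feel for how an orthotope can be re-imposed after intersecting with a strip, the sketch below computes the bounding orthotope of the intersection of a box with a single strip a <= c·x <= b. This is an illustrative interval-arithmetic construction, not the approximation of JirPavQui:19b; the function name and arguments are hypothetical. Applying it strip by strip yields an outer bound that is not necessarily the tightest one for several strips jointly.

```python
import numpy as np

def tighten_box_with_strip(lower, upper, c, a, b):
    """Bounding box of {lower <= x <= upper} intersected with {a <= c @ x <= b}."""
    lower = np.array(lower, dtype=float)
    upper = np.array(upper, dtype=float)
    c = np.asarray(c, dtype=float)
    # Range of each term c_j * x_j over the current box.
    term_lo = np.minimum(c * lower, c * upper)
    term_hi = np.maximum(c * lower, c * upper)
    new_lower, new_upper = lower.copy(), upper.copy()
    for i in np.flatnonzero(c):
        rest_lo = term_lo.sum() - term_lo[i]
        rest_hi = term_hi.sum() - term_hi[i]
        # c_i * x_i must lie in [a - rest_hi, b - rest_lo].
        lo_i, hi_i = (a - rest_hi) / c[i], (b - rest_lo) / c[i]
        if c[i] < 0:
            lo_i, hi_i = hi_i, lo_i
        new_lower[i] = max(lower[i], lo_i)
        new_upper[i] = min(upper[i], hi_i)
    if np.any(new_lower > new_upper):
        raise ValueError("box and strip do not intersect")
    return new_lower, new_upper

# Hypothetical example: box [0, 2] x [0, 2] and strip 0.9 <= x1 + x2 <= 1.1.
box_lo, box_hi = tighten_box_with_strip([0, 0], [2, 2], c=[1, 1], a=0.9, b=1.1)
```

In the LSU setting, one would expect each scalar observation with a bounded noise to induce one such strip in the state space; that reading is an assumption here, since the defining equations are given in the cited reference.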
3.1.2 LSU-UOS time update
It now remains to ensure that each data update (above) is, indeed, presented with a UOS output from the preceding time update, as presumed. In each time update, the UOS posterior, (23), is processed together with (20)—uniform on -dependent strips—via the marginalization operator in (3). The resulting pdf does have an orthotopic support, but is not uniform on it. In JirPavQui:19b , the following local approximation projects back into the UOS class, :
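A simple way to see how an orthotopic predictor support can arise is to propagate the box through the linear state equation with interval arithmetic. The sketch below computes the component-wise bounding box of A x + B u + w for x in a box and w bounded entry-wise by rho. It is an assumption-laden illustration (hypothetical function and argument names), and the local approximation actually used in JirPavQui:19b may differ.

```python
import numpy as np

def predict_box(lower, upper, A, rho, Bu=None):
    """Bounding box of A @ x + Bu + w for lower <= x <= upper and |w| <= rho."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    A = np.asarray(A, dtype=float)
    rho = np.asarray(rho, dtype=float)
    center = 0.5 * (lower + upper)
    half_width = 0.5 * (upper - lower)
    new_center = A @ center
    if Bu is not None:
        new_center = new_center + np.asarray(Bu, dtype=float)
    # |A| maps half-widths to half-widths; the noise bound adds on top.
    new_half_width = np.abs(A) @ half_width + rho
    return new_center - new_half_width, new_center + new_half_width

# Hypothetical example: unit box, double-integrator-like state matrix.
pred_lo, pred_hi = predict_box([0, 0], [1, 1],
                               A=[[1, 1], [0, 1]], rho=[0.1, 0.1])
```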
3.2 FPD-BTL between a pair of LSU-UOS filtering tasks
We now return to the central concern of this paper: the static FPD-optimal transfer of the state predictor, , from the source LSU-UOS filter (“the source task”) to the target LSU-UOS filter (“the target task”). The transfers will occur statically, , meaning that the marginal state predictor, is transferred in each step of FPD-BTL. (For a derivation of joint source knowledge transfer, i.e. dynamic transfer, in Kalman filters, see PapQui:18 .)
Although there exists an explicit functional minimizer of (13) (see QuiKarGuy:17 ), our specific purpose here is to instantiate this FPD-optimal solution for UOS-closed filtering in the source and target tasks, as defined in Section 3.1.
We propose that the FPD-optimal target knowledge-constrained observation model (13) (i.e. after transfer), , be uniform with its support, , bounded (here, our set notation emphasizes the fact that the support is a function of ). We now prove that this choice is closed under the FPD optimization (13). While the following theorem is formulated for uniform pdfs on general bounded sets, it is applied to our UOS class in the sequel.
Let the target’s ideal pdf in FPD (13) be its isolated joint predictor (10). Assume that the target’s (pre-transfer) state predictor, is uniform on bounded support, . is defined in (19). The transferred source state predictor, , is also uniform, with bounded support, . Define the bounded intersection (Figure 2):
Assume that the (unoptimized) variational target observation model, (11), is also uniform with bounded support.
If , then the optimal choice of minimizing the FPD objective (13) is
where the FPD-optimal set of after transfer of the source knowledge is deduced to be .
If , a testable condition before transfer, then knowledge transfer is stopped and , i.e. the optimal target conditional observation model is defined to be that of the isolated target. (This decision is consistent with the definition of conditional probability.)
See A.1 ∎
[Figure 2, panels: case 1 | case 2]
The sets and are functions only of and , respectively, i.e. they are local statistics of the target or source tasks, respectively. In this way, FPD-BTL is effecting transfer of optimal statistics (knowledge) from source to target, in the spirit of knowledge fusion (Section 1). This is in contrast to any requirement to transfer raw data from the source, for processing in the target, as occurs in conventional multi-task inference (see Section 4). This property of transfer of source-optimal statistics to the target is a defining characteristic of FPD-BTL.
Corollary 3 (Specialization to the UOS case).
Effectively, then, the FPD-optimal transfer restricts the support of the target’s (prior isolated) state predictor, , to (26), and this then forms the prior for the subsequent processing of the target’s local datum, via a conventional data update (29).
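For orthotopic supports, the support restriction described above reduces to an entry-wise maximum of lower bounds and minimum of upper bounds, with a fallback to the isolated target when the intersection is empty. The following is a minimal sketch under that box parameterization; the function name and return convention are hypothetical, and treating a zero-volume intersection as empty is our own assumption (a uniform pdf is not defined on it).

```python
import numpy as np

def transfer_support(target_lo, target_hi, source_lo, source_hi):
    """Intersect target and source predictor boxes; keep the target box if empty.

    Returns (lower, upper, transferred), where `transferred` reports
    whether the source knowledge was actually used.
    """
    target_lo = np.asarray(target_lo, dtype=float)
    target_hi = np.asarray(target_hi, dtype=float)
    lo = np.maximum(target_lo, np.asarray(source_lo, dtype=float))
    hi = np.minimum(target_hi, np.asarray(source_hi, dtype=float))
    if np.any(lo >= hi):          # empty (or zero-volume) intersection
        return target_lo, target_hi, False
    return lo, hi, True

# Overlapping supports: the target support shrinks.
lo_a, hi_a, used_a = transfer_support([0, 0], [2, 2], [1, -1], [3, 1])
# Disjoint supports: transfer is stopped, the isolated target is retained.
lo_b, hi_b, used_b = transfer_support([0, 0], [2, 2], [5, 5], [6, 6])
```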
The knowledge is processed sequentially in the target, i.e. firstly the target processes local , yielding , (27); secondly, the target filter processes local (data update (29)); thirdly, the target predicts via the local target time update (3), making available to the next (3-part) step of FPD-BTL its knowledge-conditional state predictor, . Knowledge transfer is therefore interleaved between the time and data updates.
The implied algorithmic sequence for FPD-BTL between a pair of LSU-UOS filters is provided in Algorithm 1.
3.3 FPD-BTL for multiple LSU-UOS sources and a single LSU-UOS target
Here, we extend FPD-BTL (Sec. 3.2) to the case of multiple bounded-support sources, which can be specialized to the case of multiple interacting LSU-UOS tasks, again via Corollaries 3 and 4. Assume the same scenario as in Figure 1, with one target but, now, sources, (i.e. interacting LSU-UOS tasks in total). Once again, the instantiation of the tasks is avoided (i.e. incomplete modelling). Each source provides its state predictor , , statically, , to the target in the same way as in the single source setting.
Let there be state-space filters, , , …, , having bounded supports of their state predictors, , ,…, , respectively. Assume is the target filter, and , …, the source filters. Then the FPD-optimal target observation model after transfer of the source state predictors is
See A.2. ∎
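With orthotopic supports, fusing several source predictors amounts to intersecting a list of boxes with the target box. The sketch below does this sequentially. Note the labelled assumption: a source whose box would empty the running intersection is skipped, by analogy with the single-source rule; the theorem's own treatment of that case is not reproduced here, and this skipping rule makes the result depend on the order of the sources. All names are hypothetical.

```python
import numpy as np

def fuse_supports(target_box, source_boxes):
    """Sequentially intersect the target box with each source box.

    Boxes are (lower, upper) pairs. A source is skipped (assumption) if
    intersecting with it would leave an empty or zero-volume set.
    Returns the fused (lower, upper) and the indices of the sources used.
    """
    lo = np.asarray(target_box[0], dtype=float)
    hi = np.asarray(target_box[1], dtype=float)
    used = []
    for index, (s_lo, s_hi) in enumerate(source_boxes):
        cand_lo = np.maximum(lo, np.asarray(s_lo, dtype=float))
        cand_hi = np.minimum(hi, np.asarray(s_hi, dtype=float))
        if np.all(cand_lo < cand_hi):
            lo, hi = cand_lo, cand_hi
            used.append(index)
    return lo, hi, used

sources = [([1, 1], [5, 5]), ([10, 10], [11, 11]), ([0, 2], [3, 6])]
fused_lo, fused_hi, used_sources = fuse_supports(([0, 0], [4, 4]), sources)
```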
4 Simulation studies
In this section, we provide a detailed study of the performance of the proposed Bayesian transfer learning algorithm (FPD-BTL) between LSU-UOS filtering tasks. We compare it to Bayesian complete (network) modelling (BCM, to be defined below) for the UOS class and to the distributed set-membership fusion algorithm (DSMF) for ellipsoidal sets WanSheXiaZhu:19 , which also involves complete modelling of the networked LSU filters.
In the design of these comparative experiments, our principal concerns are the following:
To study the influence of the number of sources on the performance of the target filter in FPD-BTL (experiment #1).
To compare FPD-BTL to complete modelling alternatives (BCM and DSMF, experiment #2).
To study the robustness of FPD-BTL, which does not require a model of the interaction between tasks (i.e. it is incompletely modelled), to model mismatches that inevitably occur between source and target tasks in the complete modelling approaches (BCM and DSMF) (experiments #3–#5).
To assess the computational demands of the proposed FPD-BTL algorithm in comparison to the competing methods (BCM and DSMF).
Section 4.1 explains the necessary background, emphasizing the important distinction between the synthetic and analytic model in these simulation studies. Then, model mismatch and its types are specified (Sections 4.1.1 and 4.1.2). The specific LSU-UOS systems (19), (20) used in our studies are described in Section 4.2. The completely modelled alternatives (BCM and DSMF) are reviewed in Section 4.3, and the evaluation criteria are defined in Section 4.4. Then, the experimental results are presented and discussed in Section 4.5, before overall findings are collected and interpreted in Section 4.7.
4.1 Synthetic vs. analytic models
In computer-based simulations—such as those which follow—we explicitly distinguish between the synthetic model, used for data generation, and the analytic model on which the derived state estimation algorithm depends. The synthetic model can be understood as an abstraction of a natural (physical) data-generating process, while the analytic model is a subjective (i.e. epistemic Jay:03 )—and inevitably approximate—description of this process adopted by the inference task (here, the LSU-UOS filters).
Figure 3 shows three models for a pair of state-space filters adopted in this paper, as either synthetic or analytic models (or both). If the V-shaped graph (Figure 3a) is used as the synthetic model, state sequence is realized commonly for all the filters, via the Markov state process (15), and then locally corrupted via independent, additive, white UOS observation noise processes (14). If the analytic model is also the V-shaped graph (Figure 3a) with known parameters, then we refer to this as complete modelling, as adopted in BCM and DSMF.
The U-shaped graph (Figure 3b) is adopted as the synthetic model in some of the experiments below. Here, the target state sequence, and source state sequence are synthesized as distinct, but mutually correlated, processes, with an appropriate fully specified interaction model between and . The U-shaped graph is not used as an analytic model in this paper.
As already explained in Sections 2 and 3, the multiple modeller approach (Figure 3c) is adopted as the analytic model only in our proposed FPD-BTL approach, expressing the fact that the target elicits no model of the source process or of its relationship to it. The source and target analytic models are therefore stochastically independent, and can be interpreted as independent 2-node marginals of an (unspecified) 4-node complete model. This arrangement respects the key notion of local expertise, i.e. the commonly encountered situation in distributed inference where the source is a better local analytic modeller (i.e. expert) of its local data than the remote target modeller ever can be.
In the simulation studies most frequently encountered in the literature, the synthetic and analytic models are implicitly assumed to be identical. In the computer-based synthetic-data experiments below, modelling mismatch can be explored, and is, indeed, our priority. However, in real-data studies, the notion of a synthetic model is inadmissible Jay:03 . It follows that there is almost-sure mismatch between the (typically unknowable) “truth theory” of synthesis Jay:03 and the analysis model which prescribes the adopted algorithm. It is for this reason that a study of analysis-synthesis modelling mismatch, as provided below, is of key importance, particularly in assessment of the robustness or fragility of the algorithm to such mismatches.
In the forthcoming simulations, modelling mismatch will be arranged at the level of the state sequence(s), either via mismatches in the process noise (16), or state matrix mismatches, (15). These are detailed in the next two subsections.
4.1.1 State noise mismatch
The target filter’s state at time is synthesized according to (15) with a uniform noise, (16). If the sequence is common for both the filters, data synthesis is described by the V-shaped graph (Figure 3a), as already noted in Section 4.1. Synthesis via the U-shaped graph (Figure 3b) realizes distinct state processes, and , via the operating parameter, , which controls the interaction (i.e. correlation) between them:
Here, and are mutually independent, white UOS state processes in . The source’s analytic model is also given by (15) with perfectly matched parameters. However, we will assume that the target modeller is unaware of the mismatching noise process, , and so the target’s analytic model of its local state process, , is also (15). This enforces a mismatch between the target’s synthetic model (4.1.1), and its analysis model (15), via the state noise mismatch process, (4.1.1). Note that if , the source and target synthesized states are identical (Figure 3a) and matched to the source and target analysis model(s). In contrast, if , then the marginal synthesis model (i.e. pdf) of is trapezoidal (being the convolution of two uniform pdfs) with increased variance, while the target’s mismatched local analytic state model is uniform (20).
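The trapezoidal shape and the variance increase can be checked numerically. The sketch below convolves two zero-mean scalar uniform pdfs with hypothetical half-widths a and b on a fine grid; for independent summands the variances add, giving (a² + b²)/3, and the plateau height is 1/(2·max(a, b)). The function name, the grid step and the chosen half-widths are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def sum_of_uniforms_pdf(a, b, step=1e-3):
    """Grid pdf of w1 + w2, with independent w1 ~ U[-a, a] and w2 ~ U[-b, b]."""
    n1, n2 = int(round(2 * a / step)), int(round(2 * b / step))
    p1 = np.full(n1, 1.0 / n1)          # cell probabilities of U[-a, a]
    p2 = np.full(n2, 1.0 / n2)          # cell probabilities of U[-b, b]
    p = np.convolve(p1, p2)             # cell probabilities of the sum
    grid = -(a + b) + step * (np.arange(p.size) + 1.0)   # sums of cell midpoints
    return grid, p / step               # convert probabilities to a density

grid, pdf = sum_of_uniforms_pdf(a=1.0, b=0.5)
variance = np.sum(grid**2 * pdf) * 1e-3     # zero mean by symmetry
uniform_variance = 1.0**2 / 3.0             # variance of U[-1, 1] alone
```

Here the variance of the sum exceeds that of the single uniform noise, which is the mismatch that the target's uniform analytic model does not capture.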
4.1.2 State matrix mismatch in the analytic models
In this section, we distinguish between the state matrix (15) in synthesis, , of the common state process, , (Figure 3a), and the state matrix/matrices used in analysis, . Specifically, we set (i.e. no synthesis-analysis mismatch in the source), but (i.e. mismatch in the target). There are several ways to achieve :
Modification of the eigenvalues of invertible. In the target analytic model, we modify the eigenvalues of geometrically in one of two ways:
Radial shift: a selected eigenvalue , of is multiplied by a real scalar operating parameter , i.e. , while maintaining Hermitian symmetry.
Rotation: here, is multiplied by a factor, , where the angle of rotation, , is the operating parameter, i.e. . Once again, Hermitian symmetry is maintained.
Multiplication of by a scalar, , i.e. . In this case, all eigenvalues of experience the same radial shift.
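The eigenvalue modifications listed above can be sketched via an eigendecomposition: modify one eigenvalue, apply the conjugate modification to its complex-conjugate partner so that the spectrum stays conjugate-symmetric, and reassemble a real matrix. This is only one possible construction under the assumption of a diagonalizable real state matrix; the function name, arguments and the example matrix are hypothetical.

```python
import numpy as np

def modify_eigenpair(A, index, radial=1.0, angle=0.0):
    """Scale eigenvalue `index` of real A by `radial` and rotate it by `angle`.

    A complex eigenvalue's conjugate partner is changed consistently, so
    the returned matrix is real. Assumes A is diagonalizable.
    """
    A = np.asarray(A, dtype=float)
    eigvals, eigvecs = np.linalg.eig(A)
    lam = eigvals[index]
    new_vals = eigvals.astype(complex)
    new_vals[index] = lam * radial * np.exp(1j * angle)
    if abs(lam.imag) > 1e-12:
        partner = int(np.argmin(np.abs(eigvals - np.conj(lam))))
        new_vals[partner] = np.conj(new_vals[index])
    elif angle != 0.0:
        raise ValueError("rotating a real eigenvalue breaks conjugate symmetry")
    modified = eigvecs @ np.diag(new_vals) @ np.linalg.inv(eigvecs)
    return modified.real

# Hypothetical state matrix with eigenvalues 0.9 * exp(+/- 0.3j).
theta = 0.3
A = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
A_radial = modify_eigenpair(A, index=0, radial=1.05)   # radial shift
A_rotated = modify_eigenpair(A, index=0, angle=0.1)    # rotation
A_scaled = 1.05 * A                                    # scalar multiplication
```

In this two-dimensional example, the radial shift of the single conjugate pair coincides with scalar multiplication of the whole matrix, which matches the remark that scalar multiplication shifts all eigenvalues radially by the same factor.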
4.2 The synthesis models
The following specific LSU systems (14), (15), are simulated in the upcoming experiments, i.e. they specify the synthesis model for both and in the V-shaped graph (Figure 3a) or in the U-shaped graph (Figure 3b), as specified in (4.1.1). The uncertainty parameters, and (16), are specified in each experiment.
This system is studied in Fri:12 , being the discretization and randomization of the continuous-time system, , with sampling period, , and with added random processes, and , representing observational and modelling (i.e. state) uncertainties, respectively.
4.3 Alternative multivariate inference algorithms
The key distinguishing attribute of our FPD-BTL algorithm is its multiple modeller approach with incomplete modelling of the interaction between the tasks. Its defining characteristic—the transfer of source sufficient statistics and not raw data—for processing at the target, distinguishes it from methods that adopt a complete model of the networked tasks, often involving joint processing—at the target or other fusion centre—of the multiple raw data channels. We will reserve the term transfer learning (TL) for the former (FPD-BTL in the case of our FPD-optimal Bayesian TL scheme), and refer to the latter as multivariate inference schemes. We will compare FPD-BTL against two approaches to the latter: (i) Bayesian multivariate inference (Section 4.3.1) consistent with a complete analysis model (i.e. V-shaped network graph in Figure 3a); and (ii) distributed set-membership fusion (DSMF) (Section 4.3.2), a state-of-the-art, non-probabilistic, fusion-based state estimation algorithm WanSheXiaZhu:19 .