# Deterministic Bayesian Information Fusion and the Analysis of its Performance

This paper develops a mathematical and computational framework for analyzing the expected performance of Bayesian data fusion, or joint statistical inference, within a sensor network. We use variational techniques to obtain the posterior expectation as the optimal fusion rule under a deterministic constraint and a quadratic cost, and study the smoothness and other properties of its classification performance. For a certain class of fusion problems, we prove that this fusion rule is also optimal in a much wider sense and satisfies strong asymptotic convergence results. We show how these results apply to a variety of examples with Gaussian, exponential and other statistics, and discuss computational methods for determining the fusion system's performance in more general, large-scale problems. These results are motivated by studying the performance of fusing multi-modal radar and acoustic sensors for detecting explosive substances, but have broad applicability to other Bayesian decision problems.


## 1 Introduction

Sensor networks are ubiquitous across many different domains, including wireless communications, temperature and process control, area surveillance, object tracking and numerous other fields [2, 6]. Large performance gains can be achieved in such networks by performing data fusion between the sensors, or combining information from the individual sensors to reach system-level decisions [9, 16, 24, 26]. The sensors are typically connected by wireless links to either a separate information collector (centralized fusion) or to each other (distributed fusion). Elementary fusion rules based on Boolean logic are used in many contexts due to their simplicity and ease of implementation. On the other hand, in most situations we have some knowledge of the statistical properties of the sensors’ outputs, and designing fusion rules that take this into account can provide much better performance [17, 24]. The fusion rule can be built to satisfy any of various statistical optimality criteria, such as achieving the maximum likelihood or the minimum Bayes risk, under any other constraints of the problem [17]. Sensor information fusion can also be understood as a special case of the more general problem of statistical data reduction, where the goal is to reduce information from a high-dimensional space into a low-dimensional one in some optimal manner.

In many sensor fusion applications, it is important for the fusion rule to be a deterministic function of the sensor outputs. Fusion techniques that incorporate randomness are widely used in the sensing literature and can be more easily optimized to achieve given performance targets, such as false or correct classification probabilities [19]. However, in certain applications such as the detection of explosive compounds, it is common for the number of positive targets to be several orders of magnitude smaller than the number of negative ones, which means that randomized fusion rules have a large variance and rarely achieve their expected theoretical performance on realistic sample sizes. For simple, binary decision-level fusion problems, deterministic rules are easy to find [24], but in general, requiring the fusion rule to be deterministic effectively introduces a nonconvex constraint that is difficult to incorporate into a numerical optimization framework. This type of constraint also complicates the calculation of a fusion rule's expected classification performance, which is a key component of modeling and simulation efforts to design sensor layouts and to perform trade studies between different sensor configurations [23].

The goals of this paper are to study the fusion rule that is Bayes optimal among all deterministic fusion rules, to investigate its mathematical properties under different cost criteria and other problem constraints, and to develop a computational framework for finding its expected performance. These results are motivated by the standoff detection of threat substances using multi-modal sensors in a centralized fusion network. However, we formulate the problem in an abstract setting that makes minimal assumptions on the details of the sensors or the type of information they produce, in contrast to the relatively well-defined situations in much of the sensing literature (e.g., [5, 9, 21]). We first use variational techniques to derive the standard Bayesian posterior mean as the optimal deterministic fusion rule under a quadratic cost. We show that the resulting system’s classification performance is a smooth function under some regularity constraints on the problem, and describe extensions to more general settings where the posterior mean no longer applies. For a certain class of fusion scenarios with sub-Gaussian priors, we extend these results beyond the classical setting of a quadratic cost, showing that the same fusion rule remains optimal under higher order cost functions and that its classification performance exhibits stronger, pointwise-like asymptotic behavior. This mathematical theory is developed in Section 2, with the proofs of the theorems deferred to Appendix A. In Section 3, we apply these results to several illustrative examples with Gaussian, exponential and other statistics for the sensors and examine the different types of behavior that they exhibit. In Section 4, we finally discuss efficient computational methods for finding the performance of the fusion rule in practice, based on Monte Carlo integration techniques commonly used in machine learning and other fields.

## 2 Mathematical framework for data fusion

Suppose we have M sensors (random variables) A_1, …, A_M, each observing an object (hypothesis) H and producing outputs that are sent into a fusion center (see Figure 1). The sensor outputs could represent simple true/false decisions, choices between several distinct classes, or collections of multiple continuous-valued physical features of an object. We want to design a fusion rule at the fusion center that uses the information from the A_m to minimize the Bayes risk, or in other words, the expected cost of making a wrong decision [24]. In real-world problems, the sensor outputs are typically deterministic functions of the random object and any background noise. We thus make the standard assumption that the A_m are all conditionally independent given H, which is valid as long as the noise is white and each sensor observes the object at a slightly different time.

In what follows, we will denote the density and distribution of any random variable X by d_X and D_X respectively, and use the vector notation A = (A_1, …, A_M) (where each A_m may itself be a vector). For any closed set S, we consider the space of continuous, bounded functions on S, the space of continuous functions with compact support, and the space of functions that are smooth (infinitely differentiable with bounded derivatives on S). In order to incorporate deterministic fusion, we view a density as a first-order generalized function [12, 22] (i.e., a Schwartz distribution, but we use the former terminology to avoid confusion with the meaning of the term in probability theory). This formulation allows us to consider delta functions and other such function-like objects. It is equivalent to treating the distribution as a measure with singular components, but the generalized function framework is easier to work with in developing the theory below. The space of first-order generalized functions can be identified with the dual space of the continuously differentiable, compactly supported test functions.

The fusion problem can now be described as follows. We are given the densities d_{A_m|H} for m = 1, …, M, the prior d_H and a cost function W. We want to minimize the Bayes risk E(W(C, H)), or the expected cost of making a decision C = f(A). The sensor densities represent either generic statistical models trained from experimental lab data or mathematical descriptions of the underlying sensor physics, and the prior comes from domain-specific operational knowledge. The fusion rule can be expressed as the density d_{C|A}. We usually want the fusion center to produce the same type of information that H represents, so it is reasonable to take the cost function to be a simple metric between C and H; this motivates the choice W(c, h) = w(c − h) for an appropriate function w increasing in |c − h|.

We now proceed to establish several results on the optimal deterministic fusion rule and its performance characteristics. To maintain the clarity of exposition, we defer the proofs of these theorems to Appendix A. We first address the classical setting of a quadratic cost W(c, h) = (c − h)^2. The optimal fusion rule in this case reduces to the usual posterior expectation E(H | A), corresponding to a naive Bayes classifier, but we establish it under the abstract formulation discussed above and use a variational argument in the proof that will be the foundation for later results. For the rest of the section, we also assume that all the conditions of Proposition 1 are satisfied unless stated otherwise.

###### Proposition 1.

Suppose we have a given object space I, a collection of feature spaces J_m with J = J_1 × ⋯ × J_M, and a decision space K. Suppose I is a single interval with I ⊆ K. Let the fusion problem be set up as described above, with the random variables H, A_m and C respectively taking on values in I, J_m and K, and let W(c, h) = (c − h)^2. Assume suitable integrability and smoothness conditions on d_H and the d_{A_m|H}. Then there is an almost everywhere unique, deterministic, Bayes optimal fusion rule that combines the A_m and produces outputs in K. It is given by the posterior expectation

$$ f(A) = \frac{\int_I h \left( \prod_{m=1}^{M} d_{A_m|H}(A_m, h) \right) d_H(h) \, dh}{\int_I \left( \prod_{m=1}^{M} d_{A_m|H}(A_m, h) \right) d_H(h) \, dh}. \tag{1} $$

The object, feature and decision spaces respectively represent the type of hidden information we want to estimate or classify between, the type of information we observe from the sensors, and the type of decision we can make from those observations. We typically think of the objects and decisions as scalar quantities that may be discrete or continuous, while the feature spaces are potentially vector quantities (for example, a camera sensor that outputs images). Since we allow d_H to be a generalized function, the results of Proposition 1 cover the case where the object space is discrete and finite, by taking d_H to be a weighted sum of delta functions. This is the case in most sensor fusion applications, where there are a finite and known number of anomaly classes. Note that the deterministic constraint ((8) in the proof) is what allows the fusion rule to become the posterior expectation (1), and there can generally be other, random fusion rules (where d_{C|A} is a density) that have better performance without this constraint imposed. Under some additional constraints on the problem, we can also study the properties of the classification performance of this fusion rule, given by the density d_{C|H}(c, h).
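As a concrete illustration, when the prior is a weighted sum of delta functions the integrals in (1) collapse to finite sums. The following sketch (a hypothetical two-class Gaussian setup; the class means, common variance and prior weights are our assumptions, not values from the paper) computes the posterior-expectation fusion rule numerically:

```python
import numpy as np

# Hypothetical two-class setup: H in {0, 1} with prior weights,
# and M conditionally independent Gaussian sensors A_m | H = h ~ N(mu[h], sig2).
# For a discrete prior, the integrals in (1) reduce to weighted sums over h.
mu = {0: 0.0, 1: 1.0}    # class-conditional means (assumed for illustration)
sig2 = 0.5               # common sensor variance (assumed)
prior = {0: 0.7, 1: 0.3}

def gauss(a, m, v):
    return np.exp(-(a - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def fuse(a):
    """Posterior expectation E(H | A = a), the optimal deterministic rule (1)."""
    num = den = 0.0
    for h, p in prior.items():
        lik = p * np.prod([gauss(am, mu[h], sig2) for am in a])
        num += h * lik
        den += lik
    return num / den

# Sensor outputs near 1 pull the fused (soft) decision toward class 1.
print(fuse([0.9, 1.1, 0.8]))
```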

###### Theorem 2.

Assume that the object and joint feature spaces are compact, that d_{A_m|H} is smooth for all m, and that for each point a ∈ J there is at least one sensor m′ and feature component (a_{m′})_n such that

$$ \int_I \int_I \left( d_{A_{m'}|H}(a_{m'}, \tilde{h}) \, \frac{\partial d_{A_{m'}|H}(a_{m'}, h)}{\partial (a_{m'})_n} - d_{A_{m'}|H}(a_{m'}, h) \, \frac{\partial d_{A_{m'}|H}(a_{m'}, \tilde{h})}{\partial (a_{m'})_n} \right) h \, d_H(h) \, d_H(\tilde{h}) \, dh \, d\tilde{h} \neq 0. \tag{2} $$

Then the classification performance of the fusion rule (1), given by d_{C|H}(c, h), is a smooth function of c for each h.

The performance function d_{C|H} of the fusion rule is a generalization of the classical "confusion matrix" that describes the probabilities of correct and false classification in a binary, decision-level setting [24]. For every object h, the values of d_{C|H} along the diagonal c = h are the likelihoods of the fusion rule giving the correct result, with a delta function corresponding to an ideal classifier (unattainable in a real situation). The condition (2) holds for simple examples such as those involving Gaussian statistics (see Section 3), but its exact form is not central to the result, as it can be replaced by a variety of weaker but more complicated conditions under which the stationary phase argument in the proof still holds. Under these conditions, Theorem 2 says that d_{C|H} is actually a function, instead of merely a generalized function, and is meaningful at every point. The proof of Theorem 2 also provides a way to compute the performance function, by finding

$$ d_{C|H}(c, h) = \int_J \delta(c - f(a)) \left( \prod_{m=1}^{M} d_{A_m|H}(a_m, h) \right) da. \tag{3} $$

This is effectively an integral over only the level sets of the fusion rule in the joint feature space. In practice, these level sets can be numerically approximated from the fusion rule as long as either the fusion rule does not have "flat regions" where the gradient is identically zero, or the decision space is discrete. This is usually the simplest and most efficient approach to computing d_{C|H}, although various "smoothed out" versions of (3) can be used instead, such as taking the Fourier transform of (3) and inverting it, as is done in the proof in Appendix A.

The computation of the fusion rule (1) itself is simple and involves an integration over only the (small) object space I. On the other hand, increasing the number of sensors will rapidly increase the dimension of the integral (3) for the performance function. In practice, each sensor generates only a moderate number of statistics and the individual feature spaces are relatively small (with low dimension). This results in the joint density having a "block diagonal" dependence structure that allows the calculation of d_{C|H} to remain computationally tractable. We will describe a Monte Carlo-based approach to perform this calculation efficiently in Section 4.

We next extend Proposition 1 to more general decision spaces that are not necessarily a single interval. This case is important for many applications in which the object and decision spaces are both finite sets, the so-called binary decision and N-ary classification fusion problems. The case of a binary decision space with a larger object space is also of interest in situations where there are several threat substances of interest, but we ultimately want to make a "true" or "false" decision at the fusion center. The posterior expectation (1) is no longer a feasible solution and cannot be applied to this situation, but we still have the following result.

###### Theorem 3.

If the decision space is a closed set but otherwise unconstrained, then the Bayes optimal fusion rule is Q ∘ f, where f is the posterior expectation (1) and Q is the quantization function defined for each x by choosing any nearest point of the decision space to x. This fusion rule is unique almost everywhere.

The quantized fusion rule of Theorem 3 is a generalization of well known formulas for binary (two element) decision spaces [15, 24], and can be computed easily in practice. For example, if the decision space consists of integers, then the fusion rule given by (1) may generally take on any real value in between and is not feasible, but we can simply round it to the nearest integer to obtain the actual, optimal fusion rule for our scenario. Note that in the proof of Theorem 3, the quadratic cost function is crucial and allows for the cancellation of the third term in (17). The result of Theorem 3 will usually not hold for other costs.
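A minimal sketch of the quantization step of Theorem 3, assuming a small integer decision space (the specific decision set is illustrative only):

```python
import numpy as np

# Hedged sketch of Theorem 3: with a discrete decision space, the optimal rule
# quantizes the unconstrained posterior mean f(A) to the nearest allowed decision.
decisions = np.array([0, 1, 2])  # e.g., a hypothetical 3-ary classification problem

def quantize(x):
    """Q(x): a nearest point of the decision space (ties broken arbitrarily)."""
    return decisions[np.argmin(np.abs(decisions - x))]

def fuse_discrete(posterior_mean):
    # posterior_mean would come from evaluating (1); here it is just a number.
    return quantize(posterior_mean)

print(fuse_discrete(1.4))  # rounds the soft decision to the nearest hard one
```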

We now consider cost functions more general than the quadratic one. Optimization problems in Bayesian statistics under arbitrary cost functions usually have no closed-form solutions, and the fusion rule would in most cases have to be determined numerically (except for some very specific densities, such as in [27]). However, we identify a class of prior and sensor densities for which the same fusion rule (1) turns out to be Bayes optimal for a much larger class of costs.

###### Theorem 4.

Suppose that the prior d_H is sub-Gaussian (i.e., its tails are dominated by those of a Gaussian) and that the sensor densities satisfy a "symmetric zeros" condition: for each m, whenever d_{A_m|H} vanishes at some point, it vanishes at the reflected point as well. Let the decision space be a single interval (possibly unbounded). Then the fusion rule (1) is Bayes optimal for the cost W(c, h) = w(c − h), where w is any entire, even, nonnegative and convex function whose growth is bounded by an exponential, for some constants.

The conditions of Theorem 4 are satisfied, for example, when the object and sensor statistics are all Gaussian, a case that is discussed in more detail in Section 3. Many other scenarios where Theorem 4 holds can also be found, including cases with finitely supported priors and other types of sensor distributions. For example, a suitable finitely supported prior together with Lévy distributed sensor observations can be shown to satisfy the "symmetric zeros" condition of Theorem 4.

The class of cost functions addressed by Theorem 4 covers a wide variety of interesting cases. It includes functions of the form w(x) = x^p with even exponents p, as well as functions with polynomial asymptotic behavior for any real exponent. For such cost functions, larger exponents give greater weight to cases where the object is something particularly hard to detect (such as an explosive compound hidden inside an everyday object such as a mobile phone). Theorem 4 essentially says that when detecting such objects is a high priority, (1) is still the best way to reach a decision about them. Note that the case w(x) = |x| is not covered, and the optimal decision rule in this case is the posterior median instead of the posterior mean (1). The statements on the performance from Theorem 2 continue to hold. However, the Bayes optimal fusion rule is usually no longer unique for non-quadratic costs, and (1) is one of several possible choices.

We finally examine the asymptotic properties of the fusion rule (1) as the number of sensors M increases, with every sensor identically distributed to simplify the notation. For fusion scenarios that are in the class covered by Theorem 4, we can prove much stronger asymptotic statements that cover not only the Bayes risk but also the fused classification performance for every possible object.

###### Theorem 5.

Let the sensors be identically distributed, with densities satisfying suitable moment and positivity conditions, and suppose the decision space is an interval large enough to contain the relevant posterior means. Then as M → ∞, the Bayes risk of the fusion rule goes to 0. If in addition the object and feature spaces are compact and, for each m, d_H and d_{A_m|H} satisfy the conditions of Theorem 4 with the sensor density a bounded, even function, then for every h, d_{C|H}(·, h) → δ(· − h) in the weak-∗ sense as M → ∞.

The proof of Theorem 5 shows that for a large number of sensors M, the Bayes risk of the optimal fusion rule is bounded above by that of the sample mean of the sensor outputs. The quadratic Bayes risk of the optimal fusion rule decays at least as fast as (and possibly faster than) the O(1/M) rate that the sample mean achieves, but it is usually quite different for fixed and realistic values of M. The conditions in Theorem 4 are the main things that enable the pointwise-like convergence result d_{C|H}(·, h) → δ(· − h), and the additional constraints of Theorem 5 (such as compact object and feature spaces) simplify the proof but can be relaxed. If the object and feature distributions are not in the class covered by Theorem 4, the performance of the fusion rule may not necessarily approach "perfect classification" (a delta function) and is only guaranteed to do so on average, in the sense of the Bayes risk under a quadratic cost going to zero.

## 3 Example data fusion scenarios

In this section, we consider a series of examples applying the theory from Section 2 to concrete data fusion and performance analysis problems. The simplest situation is when the sensor and object statistics are all Gaussian, which is one of only a few cases where the performance function of the fusion rule can be calculated symbolically. We study this case in detail.

###### Proposition 6.

Let the object and decision spaces be the real line and let M be even. Suppose the object H is Gaussian with mean 0 and variance 1. Suppose the sensors A and B observe H and respectively output collections of features (A_1, …, A_{M/2}) and (B_1, …, B_{M/2}), where each A_m and B_m given H = h is Gaussian with mean uh and variance v for some fixed parameters u and v. Then the optimal fusion rule under a quadratic cost (1) is

$$ f((A, B)) = \frac{u}{Mu^2 + v} \left( \sum_{m=1}^{M/2} A_m + \sum_{m=1}^{M/2} B_m \right), \tag{4} $$

and its performance and Bayes risk are given by

$$ d_{C|H}(c, h) = \frac{Mu^2 + v}{u\sqrt{2\pi M v}} \, e^{-\frac{\left(Mu^2(c - h) + cv\right)^2}{2Mu^2 v}}, \qquad E\big((C - H)^2\big) = \frac{v}{Mu^2 + v}. $$

Furthermore, (4) is optimal for all costs satisfying the conditions of Theorem 4.

The situation described by Proposition 6 can be interpreted in the following way. The object space represents a degree of belief between the certain presence and certain absence of an object of interest. Each of the features picked up by the sensors A and B provides some information on H, but with a known level of uncertainty v. The fusion rule (4) combines the features in a way that best matches the resulting (soft) decision with the original belief, using the available information from the sensors. The division of the features into two sensors A and B is arbitrary and they can equivalently be combined into a single sensor, but this formulation is useful in establishing Corollary 7 below.

Proposition 6 shows that the average performance of the fusion rule (as measured by the quadratic Bayes risk) is roughly inversely proportional to the number of features being fused. Note that the optimal fusion rule (4) is similar to but different from the sample mean of all the features A_m and B_m, and that its performance is skewed by the prior on H (see Figure 2). For a large number of features, it is easy to verify that as M → ∞, the fusion rule (4) approaches the actual sample mean of the A_m and B_m and the performance satisfies d_{C|H}(·, h) → δ(· − h) in the weak-∗ sense for each h, in line with Theorem 5.
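The closed forms in Proposition 6 are easy to sanity-check by simulation. The following sketch (the parameter values are arbitrary choices, and the Gaussian model is as reconstructed above) draws objects from the prior, propagates them through the sensors, applies the fusion rule (4), and compares the empirical quadratic risk with v/(Mu² + v):

```python
import numpy as np

# Monte Carlo check of Proposition 6 under the stated Gaussian model:
# H ~ N(0, 1) and each of M features ~ N(u*h, v) given H = h.
rng = np.random.default_rng(0)
u, v, M, L = 1.0, 0.5, 8, 200_000

h = rng.normal(0.0, 1.0, size=L)                        # objects from the prior
feats = rng.normal(u * h[:, None], np.sqrt(v), (L, M))  # conditionally iid features
c = u / (M * u**2 + v) * feats.sum(axis=1)              # fusion rule (4)

risk_mc = np.mean((c - h) ** 2)
risk_theory = v / (M * u**2 + v)
print(risk_mc, risk_theory)  # the two agree to within Monte Carlo error
```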

Proposition 6 also allows us to compare the performance of directly fusing all features for the entire system, as opposed to having each sensor combine its own features internally according to (4) and then fusing the resulting sensor outputs using (4) again (see Figure 3 (a-b)). The latter corresponds to a typical approach taken in many real-world sensor fusion systems and is a version of distributed fusion with person-by-person optimization (PBPO), where the decision rule at each fusion center is chosen using only the properties of its own inputs and outputs, without regard to the rest of the graphical model (see [4, 9, 24, 25]). Distributed fusion networks are common in many applications with large numbers of sensors due to bandwidth or other communication constraints between the sensors. This typically incurs a loss in the overall system performance, but in the special case when the statistics are all Gaussian, the performance is unaffected.

###### Corollary 7.

Let M, u, v, A and B be as given in Proposition 6, with M even, so that the Bayes risk of the fusion configuration in Proposition 6 is v/(Mu^2 + v). Now suppose A and B each have internal fusion centers that respectively reduce (A_1, …, A_{M/2}) to a decision C_A and (B_1, …, B_{M/2}) to a decision C_B using locally Bayes optimal decisions. Let C_A and C_B be combined at the system fusion center to produce a locally optimal output, as in Figure 3 (b). Then the Bayes risk of the entire fusion system is still v/(Mu^2 + v).

The result of Corollary 7 is possible only because of the simple form of the fusion rule (4). The Bayes risk in a distributed fusion model like this will generally be higher in other cases, due to a loss of information at the internal fusion centers. An example of this is discussed below.

Another example similar to Proposition 6 can be considered, using other types of sensor statistics.

###### Proposition 8.

Let the object and decision spaces be the positive real axis. Suppose the object H is exponentially distributed with rate parameter 1 (i.e., d_H(h) = e^{−h}), and there are M sensors whose observations, given H = h, are exponentially distributed with rate parameter h. Then the optimal fusion rule is f(A) = (M + 1)/(1 + Σ_{m=1}^M A_m). Its performance and Bayes risk are given by

$$ d_{C|H}(c, h) = \frac{(M+1)\, h^{M+1}}{M!\, c^2} \left( \frac{M+1}{c} - 1 \right)^{M} e^{-h \left( \frac{M+1}{c} - 1 \right)}, \quad c \in (0, M+1], \qquad E\big((C - H)^2\big) = \frac{2}{M+2}. $$

Proposition 8 illustrates a variety of behavior different from the Gaussian cases. The fusion rule is now quite different from the sample mean and only takes on values between 0 and M + 1, even though the decision space is the entire positive real axis. This indicates that any fusion rule producing decisions above M + 1 would compromise the good performance of the decisions below it, to the extent that the overall Bayes risk of the system increases. The performance function is only meaningful for such values of c, but if we define it to be identically zero for c > M + 1, then it still converges to a delta function in the weak-∗ sense as M → ∞.
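Under the exponential model as described above (prior e^{−h} and Exp(h) observations; this parameterization is our reading of the setup), the posterior is a Gamma distribution, so the posterior mean can be checked numerically against the closed form (M + 1)/(1 + Σ A_m), making the (0, M + 1] range of the fusion rule apparent:

```python
import numpy as np

# Hedged check of the exponential scenario: with prior d_H(h) = exp(-h) and
# observations A_m | H = h ~ Exp(h), the posterior over h is proportional to
# h^M * exp(-h * (1 + sum(a))), i.e. Gamma(M+1, 1 + sum(a)), whose mean is
# (M+1)/(1 + sum(a)) <= M+1.
def posterior_mean(a):
    grid = np.linspace(1e-6, 50.0, 200_000)              # quadrature grid over h
    w = grid ** len(a) * np.exp(-grid * (1.0 + sum(a)))  # likelihood * prior
    return float((grid * w).sum() / w.sum())             # normalization cancels

a = [0.3, 1.2, 0.7]
print(posterior_mean(a), (len(a) + 1) / (1 + sum(a)))  # both approximately 1.25
```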

We next look at a simple example with two sensors, each producing one feature, and a discrete, finite object space. It is generally no longer possible to study the fusion performance symbolically, but we can compute it numerically using (3). Let the object space be a finite set of classes and, for the sensors A and B, let the features be Gaussian variables whose means and variances depend on the object class. We consider both a discrete decision space and an interval, corresponding to hard or soft decisions at the fusion center. The resulting fusion rules and their performance functions are shown in Figure 4. The hard decision fusion rule contains small "islands" surrounded by larger regions where different decisions are made, and this behavior is typical of problems where the object and decision spaces are discrete but the features are continuous. The soft decision scenario has an improved Bayes risk (0.35536) over the hard decision case (0.43775), reflecting the fact that more information is preserved about the object at the fusion center. However, some of the classification performance on individual object classes was traded off in exchange.

These scenarios can also be placed in the context of Corollary 7 and the distributed fusion model in Figure 3 (b). If we keep the object and sensor statistics as above and let the internal decision spaces equal the sensors' output spaces, then the fusion centers at the two sensors are simply identity mappings that pass the sensor outputs to the system fusion center, so the Bayes risk there is the same as in the soft decision scenario above. On the other hand, if the outputs of the two sensors are first reduced to hard decisions before the final fusion, then the Bayes risk at the system fusion center turns out to be 0.57862, even though the final decision space is unchanged. This shows how distributed fusion can hurt performance when enough information is lost at the individual sensors, as opposed to fusing their features directly.

Another, similar type of scenario can be considered with discrete features and shows how random fusion rules can enter the picture. Let the object and decision spaces both be {0, 1}, and let the two features be Poisson variables whose rate parameter depends on the object class. By Theorem 3, the deterministic fusion rule for a quadratic cost is given by taking f(a) = 0 on a small set of feature pairs and f(a) = 1 otherwise; in other words, there are only six pairs a with f(a) = 0. The classification performance of this rule can be found explicitly, giving specific false positive and miss rates. Now the only way to improve the miss rate is by mapping one of the six values to 1 instead, but there are only a countable number of fusion rules that do this and their performances can only take on specific values. If we change the cost to be such that the false positive rate can be at most ε for some small ε, then there are random fusion rules with lower miss rates than any of the deterministic rules. For example, a random fusion rule that takes f(a) = 1 with some probability at one of these six points and f(a) = 0 otherwise, while leaving all other a unchanged, is a (non-unique) optimal choice that meets the false positive constraint with a lower miss rate. This situation is typical when the feature and/or decision spaces are discrete, with a random fusion rule having more "wiggle room" to achieve specific classification probabilities, and is essentially a special case of the classical Neyman-Pearson lemma for likelihood ratio tests ([19], p. 23).

We finally consider one more numerical example with more complex features that have mixture distributions. Let the object space be an interval and consider two sensors, each producing one feature. Suppose that H has a Gaussian prior as in Proposition 6, but the first feature is an exponential-uniform mixture (a weighted combination of an exponential density and an indicator function) and the second follows a Gaussian mixture distribution. The optimal fusion rule and its performance function are shown in Figure 5. Note that the optimal fusion rule never outputs decisions greater than a certain value, which is reflected in its performance and is similar to the scenario in Proposition 8.

## 4 Computational methods

We discuss some approaches to efficiently compute the performance and Bayes risk of the fusion rule (1) in realistic scenarios with a large number of sensors or features. We assume that the densities d_{A_m|H} are available in a symbolic (but not necessarily closed) form, coming from either a physics-based model or a kernel density estimate or other statistical model fitted to experimental data. The A_m can be discrete or continuous variables and have different dimensions N_m, depending on what kinds of information each sensor puts out, but they are assumed to be well localized in their respective feature spaces. Having a symbolic expression for d_{A_m|H}, as opposed to a tabulation on a discrete grid (for continuous variables), allows us to sample it at points anywhere in the feature space, which is essential for the use of randomized integration methods that scale efficiently in the total number of dimensions. For sensors that produce several different features (N_m > 1), d_{A_m|H} is often not specified explicitly but is given in terms of a probabilistic graphical model [14] that describes the dependencies among the individual features and any intermediate (nuisance) variables.

In typical fusion scenarios in practice, it is common to combine several hundred features at once, and determining the performance leads to a high dimensional integral in (3) that is intractable by conventional lattice-based approaches. However, one key property of statistical inference problems such as this is that the sensor densities are all nonnegative, which means that the integral (3) involves no cancellation and the largest contribution comes from around the local maxima of the d_{A_m|H}. We outline a Monte Carlo importance sampling approach that is motivated by these observations. We collect samples drawn from a proposal distribution that prioritizes points around the maxima of (3) and reduces the variance in the resulting estimates [11]: objects are drawn from a variable H′ with either the same distribution as H or the uniform distribution on the object space, and sensor outputs are then drawn conditionally on them. This reflects an accuracy tradeoff between finding the Bayes risk and the performance, with the former choice more efficient for calculating the Bayes risk and the latter preferable for finding the performance. For a given number of samples, the former choice will pick the points in the feature space that contribute the most to the Bayes risk, but may leave a large portion of the domain uncovered by the samples and result in an inaccurate performance function. On the other hand, sampling according to a uniform distribution on the object space will distribute the points evenly, even though only a few may add significantly to the Bayes risk. Note that finding the performance corresponds to a "downstream" calculation in the probabilistic graphical model in Figure 1, where samples are generated at H and propagate downward through the sensors into C. This is in contrast to the more conventional task of doing posterior inference on data, which amounts to finding d_{H|A} and is an "upstream" calculation that involves the Bayes formula.
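The downstream sampling scheme under the uniform-proposal choice can be sketched as follows (the helper names and the Gaussian test model are our own illustrations, not the paper's):

```python
import numpy as np

# Sketch of self-normalized importance sampling for the Bayes risk: draw objects
# from a uniform proposal on the object space, propagate them through the sensor
# model, and reweight by the prior density to estimate E(W(f(A), H)).
rng = np.random.default_rng(1)

def estimate_bayes_risk(fuse, sample_sensors, prior_pdf, lo, hi, L=100_000):
    h = rng.uniform(lo, hi, size=L)      # proposal: uniform on [lo, hi]
    a = sample_sensors(h)                # sensor outputs given each sampled h
    w = prior_pdf(h)                     # importance weights d_H / q (q constant)
    cost = (fuse(a) - h) ** 2            # quadratic cost W
    return np.sum(w * cost) / np.sum(w)  # self-normalized estimate

# Illustrative model: the Gaussian setup of Proposition 6 with M = 4, u = 1, v = 0.5.
M, u, v = 4, 1.0, 0.5
sample = lambda h: rng.normal(u * h[:, None], np.sqrt(v), (h.size, M))
fuse = lambda a: u / (M * u**2 + v) * a.sum(axis=1)
prior = lambda h: np.exp(-h**2 / 2) / np.sqrt(2 * np.pi)
print(estimate_bayes_risk(fuse, sample, prior, -6, 6))  # near v/(M u^2 + v)
```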

The sensor densities can be sampled from using a variety of approaches, depending on how each one is specified. For graphical models, standard Markov chain Monte Carlo (MCMC) methods such as Gibbs sampling and its variants (see [3] for details) allow us to obtain samples from a high-dimensional joint density that would be impossible to sample from directly, unless it has some special structure. One such special case arises frequently in practice, namely when each sensor’s features are jointly Gaussian. Many standard types of radar and acoustic sensors for explosive detection collect measurements such as the signal’s return time or its power at specific frequencies. These measurements are typically formed by averaging a large number of consecutive “looks” to smooth out the effects of noise, which results in each sensor’s feature vector being approximately a vector-valued Gaussian variable, and the resulting joint distribution is easy to sample from directly. The individual components of the feature vector are usually not independent, and may represent measurements such as the signal intensity at different frequencies, which are all influenced by an explosive substance with a given spectral profile. However, we can simply take independent Gaussian samples (using the Gaussian quantile function) with the appropriate means and variances, and “color” them by applying the Cholesky factor of the covariance matrix.
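The coloring step can be sketched as follows; the mean vector and covariance matrix here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative mean and covariance for three correlated features of one sensor.
mu = np.array([1.0, 0.5, -0.2])
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])

# Independent standard Gaussian draws; in the text these come from applying
# the Gaussian quantile function to uniform samples.
L = 200_000
z = rng.standard_normal((L, 3))

# "Color" the samples: if Sigma = G G^T with G lower triangular (the Cholesky
# decomposition), then x = mu + z G^T has mean mu and covariance Sigma.
G = np.linalg.cholesky(Sigma)
x = mu + z @ G.T

emp_mean = x.mean(axis=0)
emp_cov = np.cov(x, rowvar=False)
```

The empirical mean and covariance of `x` recover `mu` and `Sigma` up to Monte Carlo error, confirming the coloring transformation.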

Once a sequence of samples has been obtained, we can use the fact that each sample of the features maps to exactly one decision under the fusion rule, so the delta function in the performance integral (3) never has to be computed or approximated explicitly. Instead, we discretize the decision space, and for each sample we find the closest point in the discretized space and add the sample’s weight to the sum corresponding to that pair, effectively producing a weighted histogram that determines the performance. The Bayes risk may be found from the same samples directly, or from the performance by computing the corresponding cost integral. Standard confidence bounds on the estimated Bayes risk and performance function follow from the central limit theorem, which also holds for dependent variables with sufficiently good ergodicity or mixing properties (as is the case with some MCMC sampling schemes [13]). For example, let $B_L(W)$ denote the Monte Carlo estimate of the Bayes risk $B(W)$ under the cost function $W$, computed from $L$ samples drawn from the proposal distribution. Then for any confidence level $R$, a central-limit confidence bound holds as $L \to \infty$, and for fusion problems and costs covered by Theorem 4, this implies that for sufficiently large $L$,

$$P\left(\,|B_L(W) - B(W)| < \sqrt{\tfrac{2}{L}}\left(B_L(W^2)^{1/2} + B_L(W)\right)\operatorname{erf}^{-1}(R)\right) \ge R.$$
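As a small sketch of evaluating this bound numerically: the bisection-based `erfinv` and all sample values below are our own illustrative choices, and the half-width formula follows our reading of the bound above:

```python
import math

def erfinv(y):
    """Invert erf on (-1, 1) by bisection; erf is strictly increasing."""
    lo, hi = -6.0, 6.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.erf(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def conf_radius(bl_w, bl_w2, num_samples, conf):
    """Half-width sqrt(2/L) * (B_L(W^2)^(1/2) + B_L(W)) * erfinv(R)."""
    return math.sqrt(2.0 / num_samples) * (math.sqrt(bl_w2) + bl_w) * erfinv(conf)

# Hypothetical values: B_L(W) = 0.2, B_L(W^2) = 0.09, L = 100000, R = 0.95.
radius = conf_radius(0.2, 0.09, 100_000, 0.95)
```

The estimated Bayes risk then lies within `radius` of the true value with probability at least `R`, under the stated assumptions.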

We use the approach discussed here to study a larger version of the fusion scenario considered in Proposition 6. In Figure 6, we take a larger total number of features in the setting of Proposition 6, with i.i.d. points sampled from the proposal distribution with a uniform hypothesis variable. The performance can be compared with Figure 2 and, as expected, is much more concentrated along the main diagonal, with a correspondingly lower Bayes risk. Note that as we increase the number of features $M$, the quadratic Bayes risk decays at least at the rate given by Theorem 5, so the number of sampling points must grow correspondingly to achieve a fixed relative error in the calculation.
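The weighted-histogram performance estimate described earlier can be sketched as follows, again in an illustrative one-dimensional Gaussian setting of our own choosing (not the scenario of Proposition 6):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: h ~ N(0, 1), one Gaussian sensor a | h ~ N(h, 0.25),
# posterior-mean fusion rule f(a) = a / (1 + sigma^2) = 0.8 a.
sigma = 0.5
L = 500_000
h = rng.standard_normal(L)
a = h + sigma * rng.standard_normal(L)
c = a / (1.0 + sigma**2)  # each sample a maps to exactly one decision c = f(a)

# Discretize the decision space and bin each f(a) into the nearest cell:
# this weighted histogram approximates the marginal density of the fused
# decision without ever evaluating the delta function in (3).
edges = np.linspace(-4.0, 4.0, 81)
counts, _ = np.histogram(c, bins=edges)
density = counts / (L * np.diff(edges))

# Here C = f(A) is exactly Gaussian with mean 0 and variance 0.8, so the
# histogram should track that curve; the Bayes risk can also be read off
# the same samples directly.
risk = np.mean((c - h) ** 2)
```

Conditioning on each hypothesis value instead of marginalizing gives the full performance function; the marginal case is shown here only to keep the sketch short.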

## 5 Conclusion

We have described a mathematical and computational framework for analyzing the expected performance of deterministically combining statistical information under a specified optimality criterion. These results apply to many diverse situations, both in the sensors field and in other domains, and can be extended in a number of directions. In particular, many applications involve online formulations of this problem, in which the sensor statistics are not known in advance and must be estimated from real-time data streams; these formulations will be explored in future work.

## References

• [1] L. Ahlfors. Complex Analysis. McGraw-Hill, 1979.
• [2] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. Wireless sensor networks: a survey. Computer Networks, 38(4):393–422, 2002.
• [3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
• [4] R. S. Blum, S. A. Kassam, and H. V. Poor. Distributed detection with multiple sensors II: Advanced topics. Proceedings of the IEEE, 85(1):64–79, 1997.
• [5] B. Chen, R. Jiang, T. Kasetkasem, and P. K. Varshney. Channel aware decision fusion in wireless sensor networks. IEEE Transactions on Signal Processing, 52(12):3454–3458, 2004.
• [6] C.-Y. Chong and S. P. Kumar. Sensor networks: evolution, opportunities, and challenges. Proceedings of the IEEE, 91(8), 2003.
• [7] E. Cinlar. Probability and Stochastics. Springer, 2011.
• [8] B. Dacorogna. Direct Methods in the Calculus of Variations. Springer, 1989.
• [9] H. Durrant-Whyte. Data Fusion in Sensor Networks. IPAM Sensor Networks, 2007.
• [10] L. C. Evans. Partial Differential Equations. American Mathematical Society, 1998.
• [11] A.E. Gelfand and A.F.M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410), 1990.
• [12] L. Hormander. The Analysis of Linear Partial Differential Operators I: Distribution Theory and Fourier Analysis. Springer, 1990.
• [13] L. T. Johnson and C. J. Geyer. Variable transformation to obtain geometric ergodicity in the random-walk Metropolis algorithm. Annals of Statistics, 40(6):3050–3076, 2012.
• [14] M. I. Jordan. Graphical Models. Statistical Science, 39(1):140–155, 2004.
• [15] M. Kam, Q. Zhu, and W.S. Gray. Optimal data fusion of correlated local decisions in multiple sensor detection systems. IEEE Transactions on Aerospace and Electronic Systems, 28(3), 1992.
• [16] H. B. Mitchell. Multi-sensor Data Fusion: An Introduction. Springer, 2007.
• [17] E. F. Nakamura, A. A. F. Loureiro, and A. C. Frery. Information fusion for wireless sensor networks: Methods, models, and classifications. ACM Computing Surveys, 39(3), 2007.
• [18] D. Pappas. Minimization of constrained quadratic forms in Hilbert spaces. Annals of Functional Analysis, 2(1):1–12, 2011.
• [19] H. V. Poor. An Introduction to Signal Detection and Estimation. Springer, 1994.
• [20] W. Rudin. Functional Analysis, Second Ed. Int’l Series in Pure and Applied Math. McGraw-Hill, 1991.
• [21] D. Smith and S. Singh. Approaches to Multisensor Data Fusion in Target Tracking: A Survey. IEEE Transactions on Knowledge and Data Engineering, 18(12):1696–1710, 2006.
• [22] E. M. Stein and G. Weiss. Introduction to Fourier Analysis on Euclidean Spaces. Princeton University Press, 1971.
• [23] G. Thakur. Sequential testing over multiple stages and performance analysis of data fusion. Proc. SPIE Signal Processing, Sensor Fusion and Target Recognition XXII, (87540S), 2013.
• [24] P. K. Varshney. Distributed Detection and Data Fusion. Springer, 1996.
• [25] R. Viswanathan and P. K. Varshney. Distributed detection with multiple sensors I: Fundamentals. Proceedings of the IEEE, 85(1):54–63, 1997.
• [26] G. Xing, R. Tan, B. Liu, J. Wang, X. Jia, and C. W. Yi. Data Fusion Improves the Coverage of Wireless Sensor Networks. Proceedings of ACM MobiCom ’09, pages 157–168, 2009.
• [27] L. Yeh. Bayesian Variable Sampling Plans for the Exponential Distribution with Type I Censoring. Annals of Statistics, 22(2):696–711, 1994.

## Appendix A: Proofs of Theorems

###### Proof of Proposition 1.

Let $J$ be the joint feature space. The minimum Bayes risk is given by the constrained minimization problem (5), subject to the requirement that for every $a$, $d_{C|A}(\cdot,a)$ is a probability density, or in other words,

$$\int_K d_{C|A}(c,a)\,dc = 1, \qquad (6)$$

$$d_{C|A}(c,a) \ge 0. \qquad (7)$$

In addition, we want the fusion rule to be deterministic, so for every $a$, there is a single finite point $f(a) \in K$ such that

$$\operatorname{supp}\left(d_{C|A}(\cdot,a)\right) = \{f(a)\}. \qquad (8)$$

The condition (8) makes the problem nonconvex, but it can be simplified as follows. Let $a$ be fixed. We want to show that $d_{C|A}(\cdot,a) = \delta(\cdot - f(a))$. The condition (8) says that there is a point $f(a)$ such that $d_{C|A}(\cdot,a)$ annihilates every test function $\phi \in C_c^\infty$ with $\phi(f(a)) = 0$. Since $d_{C|A}(\cdot,a)$ is compactly supported, this also holds for the larger class of smooth $\phi$ with $\phi(f(a)) = 0$. Any $\phi$ can be put into this form by writing $\phi = \phi(f(a)) + (\phi - \phi(f(a)))$, and using (6) along with linearity implies that $d_{C|A}(\cdot,a)$ maps $\phi$ to $\phi(f(a))$. Since any generalized function is uniquely determined by its action on test functions [22], it follows that $d_{C|A}(\cdot,a) = \delta(\cdot - f(a))$, which implies (7) as well. This means that the minimization problem (5) is equivalent to

$$\inf_{f}\ \int_J \int_I \tfrac{1}{2}\,(f(a)-h)^2 \left(\prod_{m=1}^M d_{A_m|H}(a_m,h)\right) d_H(h)\,dh\,da, \qquad (9)$$

where there are no additional constraints.

We first consider the optimization problem (9) over the larger space of square-integrable fusion rules, without any other constraint. In this case, (9) becomes a standard problem of minimizing a positive quadratic form over a Hilbert space, with at least one feasible solution. There exists a unique solution to this problem [18, 22], which is thus unique pointwise up to sets of Lebesgue measure zero. To find this solution, we set up the Euler–Lagrange equation [10]

$$\frac{\partial}{\partial f}\int_I \tfrac{1}{2}\,(f-h)^2 \left(\prod_{m=1}^M d_{A_m|H}(a_m,h)\right) d_H(h)\,dh = 0, \qquad (11)$$

which can be solved to obtain

$$f(a) = \frac{\int_I h \left(\prod_{m=1}^M d_{A_m|H}(a_m,h)\right) d_H(h)\,dh}{\int_I \left(\prod_{m=1}^M d_{A_m|H}(a_m,h)\right) d_H(h)\,dh}. \qquad (12)$$

This is the only stationary point of the functional in (9) and the only candidate for a minimum. The denominator in (12) is simply the joint density of the features, and since the sensor densities are positive, the denominator is positive as well. If either endpoint of the hypothesis interval is finite, we can check that

$$\sup_{a\in J} f(a) \;\le\; \sup_{a\in J} \frac{\int_I \sup\{h : h\in I\} \left(\prod_{m=1}^M d_{A_m|H}(a_m,h)\right) d_H(h)\,dh}{\int_I \left(\prod_{m=1}^M d_{A_m|H}(a_m,h)\right) d_H(h)\,dh} \;=\; \sup\{h : h\in I\} \;\le\; \sup\{h : h\in K\}, \qquad (13)$$

and in the same manner for the infimum, so $f$ is a feasible solution for (9) and is thus in fact the optimal fusion rule. ∎
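As a numerical sanity check of the fusion rule (12), the following sketch evaluates the ratio of integrals by quadrature in a conjugate Gaussian case where the posterior mean is known in closed form; the prior, sensor density, and parameter values are illustrative assumptions:

```python
import numpy as np

# Illustrative conjugate case: prior h ~ N(0, 1), a single sensor with
# density d_{A|H}(a, h) = N(a; h, sigma^2), for which (12) reduces to
# the closed form f(a) = a / (1 + sigma^2).
sigma = 0.5
hgrid = np.linspace(-8.0, 8.0, 4001)
prior = np.exp(-hgrid**2 / 2) / np.sqrt(2 * np.pi)

def f(a):
    # Riemann-sum quadrature of the numerator and denominator of (12);
    # constant factors in the Gaussian likelihood cancel in the ratio.
    lik = np.exp(-(a - hgrid)**2 / (2 * sigma**2))
    return np.sum(hgrid * lik * prior) / np.sum(lik * prior)

# f(1.3) should agree closely with the closed form 1.3 / 1.25 = 1.04.
```

Since the integrands are smooth and rapidly decaying, the quadrature agrees with the closed-form posterior mean to high precision.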

###### Proof of Theorem 2.

First, note that if for every $c$, the level sets $f^{-1}(c)$ have measure zero, then the performance is given by

$$d_{C|H}(c,h) = \int_J \delta(c - f(a)) \left(\prod_{m=1}^M d_{A_m|H}(a_m,h)\right) da. \qquad (14)$$

The integral of (14) over $K$ is always $1$, since the integrand in $a$ is a probability density. If the level sets have positive measure, (14) is no longer well defined, but we can still examine a “smoothed out” version of (14) by considering the Fourier transform (or characteristic function) of $d_{C|H}$, given by $E(e^{-2\pi i C z}\,|\,H)$. Since $C$ takes values in the compact set $K$,

$$\sup_{z\in\mathbb{R}}\left|\frac{\partial^l}{\partial z^l} E\left(e^{-2\pi i C z}\,\middle|\,H\right)\right| = \sup_{z\in\mathbb{R}}\left|E\left((-2\pi i C)^l e^{-2\pi i C z}\,\middle|\,H\right)\right| = \sup_{z\in\mathbb{R}}\left|\int_J (-2\pi i f(a))^l\, e^{-2\pi i z f(a)} \left(\prod_{m=1}^M d_{A_m|H}(a_m,H)\right) da\right| \le \left(2\pi \|f\|_{L^\infty(J)}\right)^l \le \left(2\pi \max\{|c| : c \in K\}\right)^l$$

for every $l$. Now for the $m'$-th sensor and $n$-th feature satisfying (2), the fact that the sensor densities are all positive on $I$ implies

$$\left|\frac{\partial f(a)}{\partial (a_{m'})_n}\right| = \frac{1}{d_A(a)^2}\left|\int_I \int_I \left(d_{A_{m'}|H}(a_{m'},\tilde h)\,\frac{\partial d_{A_{m'}|H}(a_{m'},h)}{\partial (a_{m'})_n} - d_{A_{m'}|H}(a_{m'},h)\,\frac{\partial d_{A_{m'}|H}(a_{m'},\tilde h)}{\partial (a_{m'})_n}\right) \left(\prod_{m\ne m'} d_{A_m|H}(a_m,h)\, d_{A_m|H}(a_m,\tilde h)\right) h\, d_H(h)\, d_H(\tilde h)\, dh\, d\tilde h\right| > 0.$$

This means that $\nabla f$ has a nonzero component for every $a$. We form a stationary phase approximation of the Fourier transform (see [10, 12]) by integrating by parts $l$ times,

$$\left|E\left(e^{-2\pi i C z}\,\middle|\,H\right)\right| = \left|\int_J \frac{1}{(-2\pi i z)^l}\left((T_f)^l\, e^{-2\pi i z f(a)}\right) \left(\prod_{m=1}^M d_{A_m|H}(a_m,H)\right) da\right| = \frac{1}{(2\pi z)^l}\left|\int_J e^{-2\pi i z f(a)}\, (T_f^*)^l \left(\prod_{m=1}^M d_{A_m|H}(a_m,H)\right) da\right|, \qquad (16)$$

where $T_f$ is the linear differential operator arising in the integration by parts and $T_f^*$ is its adjoint. The integral in (16) is finite for each $l$ due to the conditions on $f$ and the sensor densities. This means that for each $h$, the characteristic function is a smooth function of $z$ that decays faster than any polynomial. The inverse Fourier transform $d_{C|H}(\cdot,h)$ is thus also a smooth function with the same decay [22]. ∎

###### Proof of Theorem 3.

To simplify the notation, we let

$$G(a,h) = \left(\prod_{m=1}^M d_{A_m|H}(a_m,h)\right) d_H(h).$$

For any fusion rule $f$, the quadratic Bayes risk can be expanded by writing