Sensor networks are ubiquitous across many different domains, including
wireless communications, temperature and process control, area surveillance,
object tracking and numerous other fields [2, 6]. Large
performance gains can be achieved in such networks by performing data
fusion between the sensors, that is, by combining information from the
individual sensors to reach system-level decisions [9, 16, 24, 26].
The sensors are typically connected by wireless links to either a
separate information collector (centralized fusion) or to each other
(distributed fusion). Elementary fusion rules based on Boolean logic
are used in many contexts due to their simplicity and ease of implementation.
On the other hand, in most situations we have some knowledge of the
statistical properties of the sensors’ outputs, and designing fusion
rules that take this into account can provide much better performance
[17, 24]. The fusion rule can be built to satisfy any of
various statistical optimality criteria, such as achieving the maximum
likelihood or the minimum Bayes risk, subject to any other constraints
of the problem. Sensor information fusion can also be
understood as a special case of the more general problem of statistical
data reduction, where the goal is to reduce information from a high-dimensional
space into a low-dimensional one in some optimal manner.
In many sensor fusion applications, it is important for the fusion rule to be a deterministic function of the sensor outputs. Fusion techniques that incorporate randomness are widely used in the sensing literature and can be more easily optimized to achieve given performance targets, such as false or correct classification probabilities. However, in certain applications such as the detection of explosive compounds, it is common for the number of positive targets to be several orders of magnitude smaller than the number of negative ones, which means that randomized fusion rules have a large variance and rarely achieve their expected theoretical performance on realistic sample sizes. For simple, binary decision-level fusion problems, deterministic rules are easy to find, but in general, requiring the fusion rule to be deterministic effectively introduces a nonconvex constraint that is difficult to incorporate into a numerical optimization framework. This type of constraint also complicates the calculation of a fusion rule's expected classification performance, which is a key component of modeling and simulation efforts to design sensor layouts and to perform trade studies between different sensor configurations.
The goals of this paper are to study the fusion rule that is Bayes optimal among all deterministic fusion rules, to investigate its mathematical properties under different cost criteria and other problem constraints, and to develop a computational framework for finding its expected performance. These results are motivated by the standoff detection of threat substances using multi-modal sensors in a centralized fusion network. However, we formulate the problem in an abstract setting that makes minimal assumptions on the details of the sensors or the type of information they produce, in contrast to the relatively well-defined situations in much of the sensing literature (e.g., [5, 9, 21]). We first use variational techniques to derive the standard Bayesian posterior mean as the optimal deterministic fusion rule under a quadratic cost. We show that the resulting system’s classification performance is a smooth function under some regularity constraints on the problem, and describe extensions to more general settings where the posterior mean no longer applies. For a certain class of fusion scenarios with sub-Gaussian priors, we extend these results beyond the classical setting of a quadratic cost, showing that the same fusion rule remains optimal under higher order cost functions and that its classification performance exhibits stronger, pointwise-like asymptotic behavior. This mathematical theory is developed in Section 2, with the proofs of the theorems deferred to Appendix A. In Section 3, we apply these results to several illustrative examples with Gaussian, exponential and other statistics for the sensors and examine the different types of behavior that they exhibit. In Section 4, we finally discuss efficient computational methods for finding the performance of the fusion rule in practice, based on Monte Carlo integration techniques commonly used in machine learning and other fields.
2 Mathematical framework for data fusion
Suppose we have a collection of sensors (random variables), each observing an object (hypothesis) and producing outputs that are sent into a fusion center (see Figure 1). The sensor outputs could represent simple true/false decisions, choices between several distinct classes, or collections of multiple continuous-valued physical features of an object. We want to design a fusion rule at the fusion center that uses the information from the sensors to minimize the Bayes risk, or in other words, the expected cost of making a wrong decision. In real world problems, the sensor outputs are typically deterministic functions of the random object and any background noise. We thus make the standard assumption that the sensor outputs are all conditionally independent given the object, which is valid as long as the noise is white and each sensor observes the object at a slightly different time.
In what follows, we will denote the density and distribution of any random variable in the standard way, and use vector notation for the collection of sensor outputs (where each output may itself be a vector). For any closed set, we consider the space of continuous, bounded functions on it, the space of continuous functions with compact support, and the space of functions that are smooth (infinitely differentiable with bounded derivatives). In order to incorporate deterministic fusion, we view a density as a first-order generalized function [12, 22] (i.e., a Schwartz distribution, but we use the former terminology to avoid confusion with the meaning of the term in probability theory). This formulation allows us to consider delta functions and other such function-like objects. It is equivalent to treating the distribution as a measure with singular components, but the generalized function framework is easier to work with in developing the theory below. The space of first-order generalized functions can be identified with the dual space of the corresponding space of test functions.
The fusion problem can now be described as follows. We are given the
sensor densities, the prior on the object and a cost function. We
want to minimize the Bayes risk, or the expected cost of making a
decision. The sensor densities
represent either generic statistical models trained from experimental
lab data or mathematical descriptions of the underlying sensor physics,
and the prior comes from domain-specific operational
knowledge. The fusion rule itself can be expressed as a (possibly generalized) conditional density of the decision given the sensor outputs.
We usually want the fusion center to produce the same type of
information that the object represents, so it is reasonable to take the
cost function to be a simple metric between the decision and the object;
this motivates choosing the cost as an appropriate increasing function
of the distance between them.
We now proceed to establish several results on the optimal deterministic fusion rule and its performance characteristics. To maintain the clarity of exposition, we defer the proofs of these theorems to Appendix A. We first address the classical setting of a quadratic cost. The optimal fusion rule in this case reduces to the usual posterior expectation corresponding to a naive Bayes classifier, but we establish it under the abstract formulation discussed above and use a variational argument in the proof that will be the foundation for later results. For the rest of the section, we also assume that all the conditions of Proposition 1 are satisfied unless stated otherwise.
Suppose we have a given object space, a collection of feature spaces, and a decision space, where the decision space is a single interval. Let the fusion problem be set up as described above, with the random variables respectively taking on values in these spaces. Then, under suitable integrability and regularity conditions on the densities, there is an almost everywhere unique, deterministic, Bayes optimal fusion rule that combines the sensor outputs and produces values in the decision space. It is given by the posterior expectation
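For concreteness, with object density $p_X$, conditionally independent sensor densities $p_{S_i \mid X}$, and joint output $s = (s_1, \dots, s_n)$ (this notation is assumed here, since the paper's symbols are not reproduced in this extraction), the posterior expectation presumably reads:

```latex
% Posterior expectation (1), reconstructed from the surrounding
% discussion; x is the object, s = (s_1, ..., s_n) the sensor outputs.
\hat{f}(s) \;=\; \mathbb{E}\!\left[ X \mid S = s \right]
 \;=\; \frac{\displaystyle\int_{\Omega} x \, p_X(x)
        \prod_{i=1}^{n} p_{S_i \mid X}(s_i \mid x)\, dx}
      {\displaystyle\int_{\Omega} p_X(x)
        \prod_{i=1}^{n} p_{S_i \mid X}(s_i \mid x)\, dx}.
```

The product in both integrals uses the conditional independence assumption stated at the start of this section.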
The object, feature and decision spaces
respectively represent the type of hidden information we want to estimate or classify between, the type of information we observe from the sensors and the type of decision we can make from those observations. We typically think of the objects and decisions as scalar quantities that may be discrete or continuous, while the feature spaces are potentially vector-valued (for example, a camera sensor that outputs images). Since we allow the prior to be a generalized function, the results of Proposition 1 cover the case where the object space is discrete and finite, by taking the prior to be a weighted sum of delta functions. This is the case in most sensor fusion applications, where there are a finite and known number of anomaly classes. Note that the deterministic constraint ((8) in the proof) is what allows the fusion rule to become the posterior expectation (1), and there can generally be other, random fusion rules (whose conditional densities are not delta functions) that have better performance without this constraint imposed. Under some additional constraints on the problem, we can also study the properties of the classification performance of this fusion rule, given by the conditional density of the decision given the object.
Assume that the object and feature spaces are compact, the densities are sufficiently regular, and for each point there is at least one sensor and feature satisfying the nondegeneracy condition (2).
Then the classification performance of the fusion rule (1) is a smooth function of the decision for each object.
The performance function
of the fusion rule is a generalization of the classical "confusion matrix" that describes the probabilities of correct and false classification in a binary, decision-level setting. For every object, the values of the performance function along the diagonal are the likelihoods of the fusion rule giving the correct result, with a delta function corresponding to an ideal classifier (unattainable in a real situation). The condition (2) holds for simple examples such as those involving Gaussian statistics (see Section 3) but its exact form is not central to the result, as it can be replaced by a variety of weaker but more complicated conditions under which the stationary phase argument in the proof still holds. Under these conditions, Theorem 2 says that the performance is actually a function, instead of merely a generalized function, and is meaningful at every point. The proof of Theorem 2 also provides a way to compute the performance function, by finding
This is effectively an integral over only the level sets of the fusion rule in the joint feature space. In practice, these level sets can be numerically approximated from the fusion rule as long as either the fusion rule does not have "flat regions" where the gradient is identically zero, or the decision space is discrete. This is usually the simplest and most efficient approach to computing the performance, although various "smoothed out" versions of (3) can be used instead, such as taking the Fourier transform of (3) and inverting it, as is done in the proof in Appendix A.
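Although the expression for (3) is not reproduced in this extraction, the level-set description above and the delta function referenced in Section 4 suggest it has the following form (the notation for the joint feature density is assumed):

```latex
% Conjectured form of the performance integral (3): the density of the
% decision y produced by the fusion rule \hat{f}, conditioned on the
% object x, integrated over the joint feature space.
P(y \mid x) \;=\; \int \delta\bigl( y - \hat{f}(s) \bigr)\,
   p_{S \mid X}(s \mid x)\, ds .
```

Restricting the integral to the level set $\hat{f}(s) = y$ via the coarea formula recovers the level-set interpretation described above.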
The computation of the fusion rule (1) itself is simple
and involves an integration over only the (small) object space.
On the other hand, increasing the number of sensors will rapidly increase
the dimension of the integral (3) for the performance
function. In practice, each sensor generates only a moderate number
of statistics and the individual feature spaces are relatively
small (typically of low dimension).
This results in the joint feature density
having a "block diagonal" dependence structure that allows the
calculation of the performance to remain computationally tractable.
We will describe a Monte Carlo-based approach to perform this calculation
efficiently in Section 4.
We next extend Proposition 1 to more general decision spaces that are not necessarily a single interval. This case is important for many applications in which the object and decision spaces are both finite sets, the so-called binary decision and M-ary classification fusion problems. The case where the decision space is smaller than the object space is also of interest in situations where there are several threat substances of interest, but we ultimately want to make a "true" or "false" decision at the fusion center. The posterior expectation (1) is no longer a feasible solution and cannot be applied to this situation, but we still have the following result.
If the decision space is a closed set but otherwise unconstrained, then the Bayes optimal fusion rule is the quantization of the posterior expectation (1), where the quantization function maps each value of (1) to any nearest point of the decision space. This fusion rule is unique almost everywhere.
The modified fusion rule is a generalization of well known
formulas for binary (two element) decision spaces [15, 24],
and can be computed easily in practice. For example, if the decision
space consists of integers, then the fusion rule given by (1) may
generally take on any real value and is not feasible, but we can simply
round it to the nearest integer to obtain the actual, optimal fusion
rule for our scenario. Note that in the proof of Theorem
3, the quadratic cost function is crucial and allows
for the cancellation of the third term in (17). The
result of Theorem 3 will usually not hold for other cost functions.
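As a sketch of this quantization step, the following hypothetical example (binary object, one Gaussian sensor output, hard decision space; all densities and parameters are assumptions for illustration, not values from the paper) rounds the posterior mean to the nearest allowed decision:

```python
import math

# Hypothetical illustration of Theorem 3: quantize the posterior mean (1)
# to the nearest point of a discrete decision space.

def gauss_pdf(s, mean, sigma):
    # Density of a Gaussian with the given mean and standard deviation.
    return math.exp(-0.5 * ((s - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_mean(s, prior={0: 0.5, 1: 0.5}, sigma=1.0):
    # Object x in {0, 1}; the sensor output s is Gaussian with mean x.
    weights = {x: p * gauss_pdf(s, x, sigma) for x, p in prior.items()}
    z = sum(weights.values())
    return sum(x * w for x, w in weights.items()) / z

def quantized_rule(s, decisions=(0, 1)):
    # Round the (soft) posterior mean to the nearest allowed decision.
    m = posterior_mean(s)
    return min(decisions, key=lambda d: abs(d - m))
```

Here `quantized_rule(2.0)` returns the decision 1, since the posterior mean exceeds 1/2 for a large observed value, while `quantized_rule(-1.0)` returns 0.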
We now consider cost functions more general than the quadratic one. Optimization problems in Bayesian statistics under arbitrary cost functions usually have no closed-form solutions, and the fusion rule would in most cases have to be determined numerically (except for some very specific densities). However, we identify a class of prior and sensor densities for which the same fusion rule (1) turns out to be Bayes optimal for a much larger class of costs.
Suppose that the prior is sub-Gaussian (i.e., its tails are dominated by those of a Gaussian) and the densities satisfy a "symmetric zeros" condition: whenever the relevant function vanishes at a point, it also vanishes at the symmetric point. Let the decision space be a single interval (possibly unbounded). Then the fusion rule (1) is Bayes optimal for any cost that is an entire, even, nonnegative and convex function of the error satisfying a suitable growth bound.
The conditions of Theorem 4 are satisfied, for example,
when the object and sensor statistics are all Gaussian, a case that
is discussed in more detail in Section 3. Many other
scenarios where Theorem 4 holds can also be found,
including cases with finitely supported priors and other types
of sensor distributions. For example, a finitely supported prior
together with Lévy distributed sensor observations
can be shown to satisfy the "symmetric
zeros" condition of Theorem 4.
The class of cost functions addressed by Theorem 4
covers a wide variety of interesting cases. It includes even integer
powers of the error, as well as functions with similar asymptotic
growth. For such cost functions, a larger exponent gives greater weight
to cases where the object is something particularly hard to detect
(such as an explosive compound hidden inside an everyday object such
as a mobile phone). Theorem 4 essentially says that
when detecting such objects is a high priority, (1)
is still the best way to reach a decision about them. Note that the
absolute error cost is not covered, and the optimal decision
rule in that case is the posterior median instead of the posterior
mean (1). The statements on the performance
from Theorem 2 continue to hold. However, the
Bayes optimal fusion rule is usually no longer unique for non-quadratic
costs, and (1) is one of several possible choices.
We finally examine the asymptotic properties of the fusion rule (1) as the number of sensors increases, with identically distributed sensors to simplify the notation. For fusion scenarios that are in the class covered by Theorem 4, we can prove much stronger asymptotic statements that cover not only the Bayes risk but also the fused classification performance for every possible object.
Let the sensors be identically distributed and let the cost function satisfy the growth conditions of Theorem 4. Suppose the prior is positive on the object space and the decision space is an interval containing it. Then as the number of sensors grows, the Bayes risk of the fusion rule goes to zero. If in addition the object and feature spaces are compact and the densities and cost satisfy the conditions of Theorem 4 with a bounded, even cost function, then for every object, the performance converges to a delta function in the weak-* sense.
The proof of Theorem 5 shows that for a large number of sensors, the Bayes risk of the optimal fusion rule is bounded above by that of the sample mean of the sensor outputs. The quadratic Bayes risk of the optimal fusion rule thus decays at least as fast as (and possibly faster than) the rate that the sample mean achieves, but it is usually quite different for fixed and realistic numbers of sensors. The conditions in Theorem 4 are the main ingredients that enable the pointwise-like convergence result, and the additional constraints of Theorem 5 (such as compactness) simplify the proof but can be relaxed. If the object and feature distributions are not in the class covered by Theorem 4, the performance of the fusion rule may not necessarily approach "perfect classification" (a delta function) and is only guaranteed to do so on average, in the sense of the Bayes risk under a quadratic cost going to zero.
3 Example data fusion scenarios
In this section, we consider a series of examples applying the theory from Section 2 to concrete data fusion and performance analysis problems. The simplest situation is when the sensor and object statistics are all Gaussian, which is one of only a few cases where the performance function of the fusion rule can be calculated symbolically. We study this case in detail.
Suppose the object and decision spaces are the real line, and the object is Gaussian with a given mean and variance. Suppose two sensors observe the object and respectively output collections of features, where each feature is Gaussian with mean equal to the object and a fixed, known variance. Then the optimal fusion rule under a quadratic cost (1) is the precision-weighted posterior mean (4), and its performance and Bayes risk can be written in closed form.
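For reference, the standard Gaussian conjugacy formulas, which the elided expressions presumably reduce to (with assumed notation: prior mean $\mu$ and variance $\sigma^2$, and features $s_j$ with noise variances $\sigma_j^2$), are:

```latex
% Precision-weighted posterior mean for a Gaussian prior N(mu, sigma^2)
% and conditionally Gaussian features s_j with variances sigma_j^2:
\hat{f}(s) \;=\; \frac{\mu/\sigma^2 + \sum_j s_j/\sigma_j^2}
                      {1/\sigma^2 + \sum_j 1/\sigma_j^2},
\qquad
\text{quadratic Bayes risk}
 \;=\; \Bigl( 1/\sigma^2 + \sum_j 1/\sigma_j^2 \Bigr)^{-1}.
```

Both formulas are standard consequences of Gaussian conjugacy; with equal noise variances, the risk expression makes the inverse proportionality to the number of features noted below explicit.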
The situation described by Proposition 6 can be interpreted
in the following way. The object space represents a degree of belief
between the certain presence and certain absence
of an object of interest. Each of the features picked up by the sensors
provides some information on the object, but with a known level
of uncertainty. The fusion rule (4) combines
the features in a way that best matches the resulting (soft) decision
with the original belief, using the available information from the
sensors. The division of the features into two sensors
is arbitrary and they can equivalently be combined into a single
sensor, but this formulation is useful in establishing Corollary 7.
Proposition 6 shows that the average performance of the fusion rule (as measured by the quadratic Bayes risk) is roughly inversely proportional to the number of features being fused. Note that the optimal fusion rule (4) is similar to but different from the sample mean of all the features, and that its performance is skewed by the prior on the object (see Figure 2). For a large number of features, it is easy to verify that the fusion rule (4) approaches the actual sample mean of the features, and that the performance converges to a delta function in the weak-* sense for each object, in line with Theorem 5.
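This limiting behavior can be checked numerically. The sketch below assumes the fusion rule is the standard precision-weighted posterior mean (a presumed form of (4)); the prior parameters and object value are hypothetical:

```python
import random

# Numerical check that the Gaussian fusion rule, taken here as the
# standard precision-weighted posterior mean, approaches the sample
# mean of the features as their number grows.

def gaussian_fusion(features, mu=0.5, sigma2=1.0, noise_var=1.0):
    # Posterior mean for a N(mu, sigma2) prior and conditionally
    # Gaussian features with common noise variance noise_var.
    prec = 1.0 / sigma2 + len(features) / noise_var
    return (mu / sigma2 + sum(features) / noise_var) / prec

random.seed(0)
x = 0.7                                    # hypothetical true object value
feats = [x + random.gauss(0.0, 1.0) for _ in range(5000)]
fused = gaussian_fusion(feats)
sample_mean = sum(feats) / len(feats)
# The prior's influence (fused - sample_mean) vanishes at rate O(1/m).
```

With no features at all, the rule falls back to the prior mean, while with 5000 features the gap to the sample mean is on the order of 1/5000.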
Proposition 6 also allows us to compare the performance of directly fusing all features for the entire system, as opposed to having each sensor combine its own features internally according to (4) and then fusing the resulting sensor outputs using (4) again (see Figure 3 (a-b)). The latter corresponds to a typical approach taken in many real-world sensor fusion systems and is a version of distributed fusion with person-by-person optimization (PBPO), where the decision rule at each fusion center is chosen using only the properties of its own inputs and outputs, without regard to the rest of the graphical model (see [4, 9, 24, 25]). Distributed fusion networks are common in many applications with large numbers of sensors due to bandwidth or other communication constraints between the sensors. This typically incurs a loss in the overall system performance, but in the special case when the statistics are all Gaussian, the performance is unaffected.
Let the setup be as given in Proposition 6, with the features divided between the two sensors. Now suppose the two sensors each have internal fusion centers that reduce their own features to local decisions using locally Bayes optimal rules, and let these local decisions be combined at the system fusion center to produce a locally optimal output, as in Figure 3 (b). Then the Bayes risk of the entire fusion system is still the same as in Proposition 6.
The result of Corollary 7 is possible only because
of the simple form of the fusion rule (4). The Bayes
risk in a distributed fusion model like this will generally be higher
in other cases, due to a loss of information at the internal fusion
centers. An example of this is discussed below.
Another example similar to Proposition 6 can be considered, using other types of sensor statistics.
Suppose the object is exponentially distributed with a given rate parameter, and there are several sensors with exponentially distributed observations whose rate parameter depends on the object. Then the optimal fusion rule has a closed form, and its performance and Bayes risk can be computed explicitly.
Proposition 8 illustrates a variety of behavior different from the Gaussian case. The fusion rule is now quite different from the sample mean and only takes on values in a bounded interval, even though the decision space is the entire positive real axis. This indicates that any fusion rule producing decisions outside this interval would compromise the good performance of the decisions inside it, to the extent that the overall Bayes risk of the system increases. The performance function is only meaningful for decisions in this interval, but if we define it to be identically zero elsewhere, then it still converges to a delta function in the weak-* sense as the number of sensors grows.
We next look at a simple example with two sensors, each producing
one feature, and a discrete, finite object space. It is generally
no longer possible to study the fusion performance symbolically, but
we can compute it numerically using (3). Let the object space be a
discrete, finite set with a given prior, and let each sensor output
be a Gaussian variable whose mean and variance depend on the object.
We consider both a discrete and a continuous decision space, corresponding
to hard or soft decisions at the fusion center. The resulting fusion
rules and their performance functions are shown in Figure 4.
The hard decision fusion rule contains small "islands" surrounded
by larger regions where different decisions are made, and this behavior
is typical of problems where the object and decision spaces are discrete
but the features are continuous. The soft decision scenario has an
improved Bayes risk (0.35536) over the hard decision case (0.43775),
reflecting the fact that more information is preserved about the object
at the fusion center. However, some of the classification performance
on individual objects was traded off for the improved overall Bayes risk.
These scenarios can also be placed in the context of Corollary 7
and the distributed fusion model in Figure 3 (b).
If we take the same setup as above and let the internal decision
spaces be continuous, then the internal fusion centers are simply
identity mappings that pass the sensor outputs through to the system
fusion center, so the Bayes risk there is the same as in the soft
decision scenario above. On the other hand, if the sensor outputs are
first reduced to hard decisions before the final fusion, then the Bayes
risk turns out to be 0.57862, even though the final decision space is
unchanged. This shows how distributed fusion can hurt performance when
enough information is lost at the internal fusion centers, as opposed
to fusing the sensor outputs directly at the system fusion center.
Another, similar type of scenario can be considered with discrete
features and shows how random fusion rules can enter the picture.
Let the object and decision spaces both be the binary set {0, 1},
and let the two sensor outputs be Poisson variables whose rate parameter
depends on the object. By Theorem 3, the deterministic fusion rule for
a quadratic cost is given by taking the decision 1 when the posterior
mean (1) exceeds 1/2 and the decision 0 otherwise. In this example,
there are only six sensor output pairs that map to the decision 1.
The classification performance of this rule can be found explicitly
by summing the corresponding Poisson probabilities, giving the false
positive rate and the miss rate. Now the only way to improve
the miss rate is by mapping additional output pairs to the decision 1
instead, but there are a countable number of fusion rules that do
this and their performances can only take on specific values. If we
change the cost to be such that the false positive rate can be at
most some small threshold, then there
are random fusion rules with lower miss rates than any of the deterministic
rules. For example, the random fusion rule that maps one additional
output pair to the decision 1 with an appropriately chosen probability,
and agrees with the deterministic rule everywhere else, is a (non-unique)
optimal choice that meets the false positive constraint exactly. This
situation is typical when the feature and/or decision spaces are
discrete, with a random fusion rule having more "wiggle room" to achieve
specific classification probabilities, and is essentially a special
case of the classical Neyman-Pearson lemma for likelihood ratio tests.
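The following sketch reproduces the structure of this example under hypothetical rates and prior (so the set of positive output pairs and the resulting error rates differ from the paper's figures). It evaluates the deterministic rule from Theorem 3 and computes its error rates by summing Poisson probabilities:

```python
import math

# Hypothetical Poisson fusion example: the object X is 0 or 1 with equal
# prior, and two sensor counts are independent Poisson variables with
# rate LAM[X]. All parameter values here are illustrative assumptions.

LAM = {0: 1.0, 1: 4.0}
PRIOR = {0: 0.5, 1: 0.5}
NMAX = 60  # truncation of the (infinite) count space for the sums below

def pois(k, lam):
    # Poisson probability mass function.
    return math.exp(-lam) * lam ** k / math.factorial(k)

def rule(s1, s2):
    # Deterministic rule from Theorem 3: round the posterior mean to {0, 1}.
    w = {x: PRIOR[x] * pois(s1, LAM[x]) * pois(s2, LAM[x]) for x in (0, 1)}
    return 1 if w[1] / (w[0] + w[1]) >= 0.5 else 0

def error_rates():
    # False positive rate P(rule = 1 | X = 0) and miss rate P(rule = 0 | X = 1).
    fp = sum(pois(a, LAM[0]) * pois(b, LAM[0])
             for a in range(NMAX) for b in range(NMAX) if rule(a, b) == 1)
    miss = sum(pois(a, LAM[1]) * pois(b, LAM[1])
               for a in range(NMAX) for b in range(NMAX) if rule(a, b) == 0)
    return fp, miss
```

A randomized variant would flip one additional boundary pair to the decision 1 with a tuned probability, interpolating between the discrete performance values that the deterministic rules can reach.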
We finally consider one more numerical example with more complex features that have mixture distributions. Suppose that the object has a Gaussian prior as in Proposition 6, but one sensor output is an exponential-uniform mixture and the other follows a Gaussian mixture distribution. The optimal fusion rule and its performance function are shown in Figure 5. Note that the optimal fusion rule never outputs decisions greater than a certain value, which is reflected in its performance and is similar to the scenario in Proposition 8.
4 Computational methods
We discuss some approaches to efficiently compute the performance and Bayes risk of the fusion rule in realistic scenarios with a large number of sensors or features. We assume that the sensor densities are available in a symbolic (but not necessarily closed) form, coming from either a physics-based model or a kernel density estimate or other statistical model fitted to experimental data. The sensor outputs can be discrete or continuous variables of different dimensions, depending on what kinds of information each sensor puts out, but they are assumed to be well localized in their respective feature spaces. Having a symbolic expression for the densities, as opposed to a tabulation on a discrete grid (for continuous variables), allows us to sample at points anywhere in the joint feature space, which is essential for the use of randomized integration methods that scale efficiently in the total number of dimensions. For sensors that produce several different features, the joint density is often not specified explicitly but is given in terms of a probabilistic graphical model that describes the dependencies among the individual features and any intermediate (nuisance) variables.
In typical fusion scenarios in practice, it is common to combine several hundred features at once, and determining the performance leads to a high dimensional integral in (3) that is intractable by conventional lattice-based approaches. However, one key property of statistical inference problems such as this is that the sensor densities are all nonnegative, which means that the integral (3) involves no cancellation and the largest contribution comes from around the local maxima of the densities. We outline a Monte Carlo importance sampling approach that is motivated by these observations. We collect samples drawn from a proposal distribution that prioritizes points around the maxima of (3) and reduces the variance in the resulting estimates. The object component of the proposal has either the same distribution as the object itself or the uniform distribution on the object space. This reflects an accuracy tradeoff between finding the Bayes risk and the performance, with the former choice more efficient for calculating the Bayes risk and the latter preferable for finding the performance. For a given number of samples, the former choice will pick the points that contribute the most to the Bayes risk, but may leave a large portion of the domain uncovered by the samples and result in an inaccurate performance function. On the other hand, sampling according to a uniform distribution on the object space will distribute the points evenly, even though only a few may add significantly to the Bayes risk. Note that finding the performance corresponds to a "downstream" calculation in the probabilistic graphical model in Figure 1, where samples are generated at the object and propagate downward through the sensors to the fusion center. This is in contrast to the more conventional task of doing posterior inference on data, which amounts to finding the posterior distribution of the object and is an "upstream" calculation that involves the Bayes formula.
The sensor densities can be sampled from using a variety of approaches, depending on how each one is specified. For graphical models, standard Markov Chain Monte Carlo (MCMC) methods such as Gibbs sampling and its variants allow us to obtain samples from a high-dimensional joint density that would be impossible to sample from directly, unless it has some special structure. However, one such special case arises frequently in practice, where each sensor's output is jointly Gaussian. Many standard types of radar and acoustic sensors for explosive detection collect measurements such as the signal's return time or its power at specific frequencies. These measurements are typically formed by averaging a large number of consecutive "looks" to smooth out the effects of noise, which results in each output being approximately a vector-valued Gaussian variable, and the resulting joint distribution is easy to sample from directly. The individual components of the output are usually not independent, and may represent measurements such as the signal intensity at different frequencies, which are all influenced by an explosive substance with a given spectral profile. However, we can simply draw independent Gaussian samples (using the Gaussian quantile function) and "color" them appropriately using the Cholesky decomposition of the covariance matrix.
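A minimal sketch of this coloring step, using a hypothetical two-feature mean and covariance:

```python
import numpy as np

# Coloring independent Gaussian samples with a Cholesky factor, as
# described above. The mean and covariance below are a hypothetical
# two-feature example, not values from the paper.

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])          # must be positive definite
L = np.linalg.cholesky(cov)           # cov = L @ L.T

z = rng.standard_normal((100000, 2))  # independent N(0, 1) samples
s = mu + z @ L.T                      # colored samples with Cov(s) = cov

emp_cov = np.cov(s, rowvar=False)     # empirical check of the covariance
```

The empirical covariance of `s` matches `cov` up to Monte Carlo error, so the colored samples can be fed directly into the performance calculation.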
Once a sequence of samples has been obtained, we can use the fact that each feature sample determines exactly one decision through the fusion rule, so the delta function in the performance integral (3) never has to be computed or approximated explicitly. Instead, we discretize the decision space, and for each sample we find the closest point in the discretized space and add the sample's weight to the sum corresponding to that pair, effectively producing a weighted histogram that determines the performance. The Bayes risk may be found from the same samples directly, or from the performance by integrating it against the cost function. Standard confidence bounds on the estimated Bayes risk and performance function can be found from the central limit theorem, which also holds for dependent variables with sufficiently good ergodicity or mixing properties (as is the case with some MCMC sampling patterns). For example, consider the Monte Carlo estimate of the Bayes risk formed from samples taken from the proposal distribution. Then for any confidence level, the central limit theorem gives an asymptotic confidence interval as the number of samples grows,
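A sketch of this weighted-histogram computation for a hypothetical one-feature Gaussian scenario (the conjugate-Gaussian fusion rule and all parameter values are illustrative assumptions):

```python
import random

# Monte Carlo estimate of the performance P(y | x) as a weighted
# histogram over a discretized decision space, following the
# description above.

random.seed(1)
MU, SIG, NOISE = 0.0, 1.0, 0.5        # hypothetical prior and sensor noise

def fusion(s):
    # Posterior mean for a conjugate Gaussian model (assumed fusion rule).
    prec = 1.0 / SIG**2 + 1.0 / NOISE**2
    return (MU / SIG**2 + s / NOISE**2) / prec

def performance_histogram(x, n_samples=200000, bins=40, lo=-3.0, hi=3.0):
    # Draw features given the object x, push them through the fusion
    # rule, and bin the resulting decisions into a normalized density.
    width = (hi - lo) / bins
    hist = [0.0] * bins
    for _ in range(n_samples):
        y = fusion(x + random.gauss(0.0, NOISE))
        k = min(bins - 1, max(0, int((y - lo) / width)))
        hist[k] += 1.0 / (n_samples * width)
    return hist
```

For an object value near 1, the histogram peaks near the corresponding shrunk posterior-mean decision, illustrating how the performance concentrates around the diagonal.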
and for fusion problems and costs covered by Theorem 4, this implies a corresponding high-probability error bound for sufficiently large sample sizes.
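The standard central limit theorem interval alluded to here, with $\hat R_N$ the estimated Bayes risk from $N$ samples, $R$ its true value, and $\hat\sigma_N$ the sample standard deviation of the cost values (notation assumed), would read:

```latex
% Asymptotic confidence interval for the Monte Carlo Bayes risk estimate,
% at confidence level 1 - alpha with standard normal quantile z_{1-alpha/2}.
\mathbb{P}\!\left( \bigl| \hat{R}_N - R \bigr|
  \le z_{1-\alpha/2}\, \frac{\hat{\sigma}_N}{\sqrt{N}} \right)
  \;\longrightarrow\; 1 - \alpha ,
\qquad N \to \infty .
```

For correlated MCMC samples, $\hat\sigma_N$ would be replaced by an estimate that accounts for the chain's autocorrelation.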
We use the approach discussed here to study a larger version of the fusion scenario considered in Proposition 6. In Figure 6, we take a larger total number of features in Proposition 6, with i.i.d. points sampled from the proposal distribution using a uniform object component. The performance can be compared with Figure 2 and, as expected, is much more concentrated along the main diagonal, with a correspondingly lower Bayes risk. Note that as we increase the number of features, the quadratic Bayes risk decays at least as fast as the sample mean's rate by Theorem 5, so we need a correspondingly larger number of sampling points to achieve a fixed relative error in the calculation.
We have described a mathematical and computational framework for analyzing the expected performance of deterministically combining statistical information under a specified optimality criterion. These results apply to many diverse situations, both in the sensors field and in other domains, and can be extended in a number of directions. In particular, many applications involve online formulations of this problem, where the sensor statistics are not known in advance and need to be estimated from real-time data streams; these will be explored in future work.
-  L. Ahlfors. Complex Analysis. McGraw-Hill, 1979.
-  I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. Wireless sensor networks: a survey. Computer Networks, 38(4):393–422, 2002.
-  C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
-  R. S. Blum, S. A. Kassam, and H. V. Poor. Distributed detection with multiple sensors II: Advanced topics. Proceedings of the IEEE, 85(1):64–79, 1997.
-  B. Chen, R. Jiang, T. Kasetkasem, and P. K. Varshney. Channel aware decision fusion in wireless sensor networks. IEEE Transactions on Signal Processing, 52(12):3454–3458, 2004.
-  C.-Y. Chong and S. P. Kumar. Sensor networks: evolution, opportunities, and challenges. Proceedings of the IEEE, 91(8), 2003.
-  E. Cinlar. Probability and Stochastics. Springer, 2011.
-  B. Dacorogna. Direct Methods in the Calculus of Variations. Springer, 1989.
-  H. Durrant-Whyte. Data Fusion in Sensor Networks. IPAM Sensor Networks, 2007.
-  L. C. Evans. Partial Differential Equations. American Mathematical Society, 1998.
-  A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410), 1990.
-  L. Hörmander. The Analysis of Linear Partial Differential Operators I: Distribution Theory and Fourier Analysis. Springer, 1990.
-  L. T. Johnson and C. J. Geyer. Variable transformation to obtain geometric ergodicity in the random-walk Metropolis algorithm. Annals of Statistics, 40(6):3050–3076, 2012.
-  M. I. Jordan. Graphical Models. Statistical Science, 39(1):140–155, 2004.
-  M. Kam, Q. Zhu, and W. S. Gray. Optimal data fusion of correlated local decisions in multiple sensor detection systems. IEEE Transactions on Aerospace and Electronic Systems, 28(3), 1992.
-  H. B. Mitchell. Multi-sensor Data Fusion: An Introduction. Springer, 2007.
-  E. F. Nakamura, A. A. F. Loureiro, and A. C. Frery. Information fusion for wireless sensor networks: Methods, models, and classifications. ACM Computing Surveys, 39(3), 2007.
-  D. Pappas. Minimization of constrained quadratic forms in Hilbert spaces. Annals of Functional Analysis, 2(1):1–12, 2011.
-  H. V. Poor. An Introduction to Signal Detection and Estimation. Springer, 1994.
-  W. Rudin. Functional Analysis, Second Ed. Int’l Series in Pure and Applied Math. McGraw-Hill, 1991.
-  D. Smith and S. Singh. Approaches to Multisensor Data Fusion in Target Tracking: A Survey. IEEE Transactions on Knowledge and Data Engineering, 18(12):1696–1710, 2006.
-  E. M. Stein and G. Weiss. Introduction to Fourier Analysis on Euclidean Spaces. Princeton University Press, 1971.
-  G. Thakur. Sequential testing over multiple stages and performance analysis of data fusion. Proc. SPIE Signal Processing, Sensor Fusion and Target Recognition XXII, (87540S), 2013.
-  P. K. Varshney. Distributed Detection and Data Fusion. Springer, 1996.
-  R. Viswanathan and P. K. Varshney. Distributed detection with multiple sensors I: Fundamentals. Proceedings of the IEEE, 85(1):54–63, 1997.
-  G. Xing, R. Tan, B. Liu, J. Wang, X. Jia, and C. W. Yi. Data Fusion Improves the Coverage of Wireless Sensor Networks. Proceedings of ACM MobiCom ’09, pages 157–168, 2009.
-  L. Yeh. Bayesian Variable Sampling Plans for the Exponential Distribution with Type I Censoring. Annals of Statistics, 22(2):696–711, 1994.
Appendix A: Proofs of Theorems
Proof of Proposition 1.
Let be the joint feature space with dimension . The minimum Bayes risk is
subject to the constraint that for every , is a probability density, or in other words,
In addition, we want the fusion rule to be deterministic, so for all , there is a single finite point in such that
The condition (8) makes the problem nonconvex, but it can be simplified as follows. Let be fixed. We want to show that . The condition (8) states that there is a point such that for all test functions such that . Since is compactly supported, this also holds for the larger class with . Any can be put into this form by writing , and using (6) along with linearity implies that . Since any generalized function in is uniquely determined by its action on , it follows that , which implies (7) as well. This means that the minimization problem (5) is equivalent to
where there are no additional constraints.
We first consider the optimization problem (9) over the larger space defined by
and without any other constraint. In this case, (9) becomes a standard problem of minimizing a positive quadratic form over a Hilbert space, with at least one feasible solution since . There exists a unique solution to this problem in [18, 22], which is thus unique pointwise up to sets with zero -dimensional Lebesgue measure. To find this solution, we set up the Euler-Lagrange equation 
which can be solved to obtain
This is the only stationary point of the functional in (9) and the only candidate for a minimum. The denominator in (12) is simply the joint density , and since , it follows that as well. If either or is finite, we can check that
and in the same manner, , so is a feasible solution for (9) and is thus in fact the optimal fusion rule. ∎
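In generic notation (a hedged illustration with our own symbols, since the proposition's notation is not reproduced here), the Euler-Lagrange step for a quadratic Bayes cost takes the form:

```latex
J[f] \;=\; \iint \bigl(f(y)-\theta\bigr)^{2}\,p(y,\theta)\,d\theta\,dy,
\qquad
\frac{\delta J}{\delta f(y)} \;=\; 2\int \bigl(f^{*}(y)-\theta\bigr)\,p(y,\theta)\,d\theta \;=\; 0,
```

which gives the pointwise minimizer

```latex
f^{*}(y) \;=\; \frac{\int \theta\,p(y,\theta)\,d\theta}{\int p(y,\theta)\,d\theta},
```

whose denominator is the density of the joint feature vector, mirroring the structure described in the proof.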
Proof of Theorem 2.
First, note that if for all , the level sets have zero -dimensional measure, then the performance is given by
We establish this by considering the Fourier transform (or characteristic function) of , given by . Since ,
for every . Now for the th sensor and th feature that satisfies (2), the fact that are all positive on implies
where is the linear differential operator and is its adjoint. The integral in (16) is finite for each due to the conditions on and . This means that for each , is a smooth function that decays faster than any polynomial. The inverse Fourier transform is thus also a smooth function with the same decay . ∎