The literature on methods to estimate the exact value of a spatially varying phenomenon of interest across a certain area based on spatially sparse measurements is vast and comprehensive, see, e.g., [Arias2018]. While the numerical value of a specific quantity may itself be of interest, such as air temperature, precipitation or wind speed and direction when monitoring the weather, many problems require performing inferences and providing interpretations on the underlying phenomenon. Practical examples include applications in environmental monitoring (e.g. identifying regions with intolerably high emission levels), spectrum sensing (e.g. detecting regions where certain frequency bands are densely used or underutilized), smart buildings (e.g. determining areas with low oxygen levels), integrated sensing and communication (ISAC, one of the drivers of 6G wireless) or acoustics (e.g. describing the region with high ambient noise levels). This fundamental problem [Nowak2003] is illustrated in Fig. 1.
Over the past two decades, the rapid development of ever cheaper and smaller sensors, as well as the rise of faster, lower latency and more reliable wireless connectivity standards have facilitated the deployment of large-scale sensor networks, a key technology in the iot. In this work, we propose a fully data-driven framework to directly infer spatial regions of interesting, different or anomalous
behavior with quantitative statistical performance guarantees in terms of Type I errors using large-scale sensor networks. To this end, we model the spatial area of interest as a regular spatial grid. We formulate a mht problem such that a decision on the state of the observed phenomenon is made at each grid point (also referred to as alocation
). We discriminate between the nominal state of the phenomenon, represented by the null hypothesis, and any state that deviates from the nominal, represented by alternate hypotheses. While in many problems, one could distinguish various classes of anomalous, interesting or different behavior, we summarize everything that is not conform with the null hypothesis under the alternative . Sensors (also referred to as nodes), capable of communicating with a cloud/fc, are placed sparsely in distinct locations. No particular sensor distribution geometry or configuration is assumed.
The proposed method does not presume a specific parametric spatial signal propagation model. Neither does it require raw measurements to be exchanged among neighboring nodes, nor to be transmitted to the fc. Instead, each node communicates local information to the fc via a scalar quantity [Wang2018]. The communication of all raw measurements [Arias2018] is infeasible in the context of large-scale sensor networks due to constraints on bandwidth and energy consumption [Wang2018, Nitzan2020]. While the former is particularly critical when a network is composed of many agents operating in a congested spectrum, the latter is a crucial requirement for a long sensor life span.
It may not be feasible to specify an accurate probability model for each node and its signal environment. The underlying probability models are learned from the data, which makes our method robust against modeling errors and suitable to a wide range of applications. However, incorporating specific models describing the underlying physics may be beneficial, as will be pointed towards the end of this paper.
We propose that each node condenses its measurements in a single local sufficient statistic, likelihood ratio or -value. Large-scale sensor networks may be composed of heterogeneous types of devices. Hence, the statistic is chosen to allow for fusing the results among potentially very different sensors. The principal purpose of dedicated emission sensors, for example, is to record data and perform inference or learning. Contrarily, smartphones equipped with a good number of different sensors collect data as a by-product of normal operation. In addition, a suitable local summary statistic is simple to compute. This limits energy consumption and reduces the hardware requirements for individual nodes. Also, it refrains from strict assumptions on the local signal model. Useful summary statistics in the mht framework include -values, -scores and (generalized) likelihood ratios. These sensor-level quantities are forwarded to the fc, which fuses the information to identify the local regions associated with different hypotheses. Our work differs from distributed inference [Veeravalli2012], where a single binary decision on the state of the phenomenon across the entire area is made.
In dependence on the size of the monitored area and the desired spatial resolution, the number of grid points and hence decisions might easily reach the order of tens of thousands. To prevent a flood of false positives by testing for a large number of times [Tukey1991], we follow the principles of mht, where choosing the alternative is called a discovery [Soric1989]. Performance guarantees are commonly provided w.r.t. the fdr, which is the expected proportion of false discoveries among all discoveries, [Benjamini1995]. Controlling the fdr is similarly popular to constraining the false alarm probability in binary hypothesis testing. When the number of true positives increases, procedures that control the fdr allow more false positives than classic mht methods applied, e.g., in [Zoubir1995]. Thus, controlling the fdr leads to a larger detection power.
Attempting to control the fdr in our spatial inference setup poses the following problem-specific challenges. First, local summary statistics are only available at the subset of grid points where the sensors are located, but decisions between and need to be made at every grid point. Second, the deployed method must scale well to large-scale sensor networks, i.e., its computational complexity remains moderate even for a large number of nodes. Finally, the method should allow for incorporating available side information into the decision process to increase the detection power.
The past work on fdr control in the context of spatial data has focused on testing a priori [Benjamini2007, Barber2015a] or a posteriori [Chouldechova2014, Chumbley2009, Chumbley2010, Schwartzman2019] formed groups of data. While these procedures typically rely on assumptions that may not be realistic [Eklund2016], they do also not provide guarantees w.r.t. to the localization accuracy of the identified alternative area. Random field-based approaches like [Sun2014, Shu2015] suffer from excessive computational cost if the number of nodes is large. We find the concept of the lfdr [Efron2001, Efron2005, Efron2008, Efron2010] particularly suitable to deal with the aforementioned challenges. The lfdr is computed for each sensor and quantifies the risk of making a false discovery when accepting at its location. Previous work [Efron2010, Tansey2018, Halme2019, Golz2019, Scott2015, Chen2019] demonstrates its capability to incorporate available side information like spatial proximity into decision making.
The lfdr’s are computed from the local summary statistics. In addition, the lfdr’s rely on the joint pdf of the local summary statistics and the overall proportion of alternatives that are unknown in practice. These quantities need to be learned or estimated accurately from the data. This is often referred to as lfdr estimation in the mht literature [Efron2010]. A variety of estimators exist, often assuming that the joint pdf belongs to the exponential family [Efron2010, Muralidharan2010, Martin2011, Scott2015]. Our novel contributions in this work are as follows.
We formulate an lfdr-based framework as a flexible data-driven approach to accurately determine spatial regions of interest, difference or anomaly of a physical phenomenon.
We propose a novel, highly computationally efficient method for computing lfdr’s. It bases upon an innovative mixture distribution model and the method of moments.
We demonstrate that proper interpolation of the lfdr’s between sensors allows for inferring the state of the phenomenon even at locations between the sensors, where measurements are unavailable. Our proposed method scales well to spatial inference with large-scale sensor networks under strict statistical performance guarantees.
Notation: Throughout the paper, regular lowercase letters denote scalars, whereas bold lowercase letters , bold uppercase letters and underlined bold uppercase lettersand denote matrix square root and Moore-Penrose inverse. Calligraphic letters denote sets and sets containing all positive integers . is the cardinality of set and denotes the indicator function. The Hadamard product operator is , while represents the outer product. denotes the pdf of rv and its pdf conditioned on event . stands for the false discovery rate, for the local false discovery rate.
Ii The spatial inference problem
The spatial inference problem is illustrated in Fig. 1. and denote the regions of nominal and anomalous, different or interesting behavior, respectively. The continuous observation area is discretized by a regular grid of elements, to each of which we refer by its index . Their position on the grid is denoted by . The state of the observed phenomenon at grid point and time instant is described by the unknown true binary local hypothesis , with the null hypothesis and the alternative . If , we say that the observed phenomenon is in its nominal state at , . If , an interesting phenomenon or anomaly is present. holds under any deviation from and is thus in general composite. In environmental monitoring, could represent clean air, whereas could indicate contamination above a tolerable level. We consider phenomena that vary smoothly in space and slowly in time. Due to the latter, we assume that the true local hypotheses are constant over the observation period and write . For now, we assume that a sensor is placed at each grid point.
and are mutually exclusive sets that comprise the grid points at which or hold. Formally, the objective of spatial inference is to identify the set of all grid points where is in place, and the set containing all locations where holds. The models for the measured field levels differ under and , i.e.,
where is the non-zero level of the phenomenon at location and the temporally i.i.d. measurement noise. This noise process is spatially independent but not necessarily identically distributed in different nodes. varies with , but takes on similar values at close-by locations due to the spatial smoothness assumption. This is well-justified by the underlying mechanisms of many physical phenomena. Radio waves, for example, are subject to path-loss and shadow fading [Molisch2012] that vary slowly. We cannot directly observe . Eq. (1) suggests to utilize the measurements to obtain the empirical models for the hypotheses or and form the estimated regions and associated with null hypothesis and alternative. The individual sensors condense their raw observations over observation period into soft local decisions statistics . These are more informative than local binary decisions, used, e.g., in [Marano2019]. A procedure that controls the fdr [Benjamini1995], the expected ratio between the number of false discoveries and all discoveries
at a nominal level guarantees that on average, a proportion of the locations in are actual members of the true alternative region .
The type of deployed sensor and the distribution of the measurement noise may differ from sensor to sensor [Rovatsos2020]. Thus, the cannot be directly fused with each other. Instead, one defines local summary statistics that are normalized such that they are i.i.d. , but not necessarily for . The null and alternative regions for a given set of observations of the local summary statistics can be inferred by standard methods, such as the bh [Benjamini1995]. denotes the domain of . Common choices for are -values
, where is the pdf of under , or -scores , where is the standard normal cdf [Efron2010]. For -values, the domain is and for -scores, . has to be known to compute -values or -scores. If is unknown, it can be estimated from the data using for example the bootstrap [Zoubir2001, Golz2017]. Small -values indicate very little support for the null hypothesis.
We propose to solve the spatial inference problem with the help of a suitable mht soft decision statistic. Define the random variablethat represents the mixture of all local summary statistics. The pdf of is
with the relative size of the null region, the pdf for and the pdf for if . The model in Eq. (4) exploits that the local summary statistics are i.i.d. across locations . Finally, the local false discovery rate is [Efron2005, Efron2010]
Appealingly, is the posterior empirical Bayes probability that . Thus, it allows us to assess the probability of making a false discovery. To solve the spatial inference problem while controlling , we form the region associated with the alternative hypothesis
This approach guarantees fdr control at level while maximizing detection power, since the so-called bfdr is an upper bound of the Frequentist fdr from Eq. (2) [Efron2010]. Note that the bfdr is the average false discovery probability across the alternative region , while the lfdr asserts each location with the individual risk of being a false discovery.
In what follows, we develop a novel method to compute the lfdr’s. The concept of the local false discovery rate is described in Sec. III. The proposed method is based on a sophisticated probability model for the local summary statistics. We introduce the novel model and a novel algorithm to compute lfdr’s from observed data in Sec. IV. To determine decision statistics in between the sparse sensor locations, we propose to interpolate the sensor lfdr’s in Sec. V. We conclude by simulation results in Sec. VI.
Iii Local False Discovery Rate Estimation
Inference based on local false discovery rates contains estimation and hypothesis testing components, since the theoretical lfdr’s defined in Eq. (5) are unavailable in practice. Thus, the null and alternative regions are determined based on lfdr estimates. The accuracy of the deployed estimators has immediate consequences on fdr control: underestimation of the lfdr’s leads to violations of the nominal fdr level, whereas overestimation reduces power.
The general structure of lfdr estimators follows from Eq. (5). The pdf of for is most often assumed to be known [Efron2010]. The relative size of the null region is unknown. Finally, the mixture distribution from Eq. (4) is not available. The common approach to lfdr estimation relies on separate estimation of and by estimators and , which are then plugged into Eq. (5) [Efron2010], i.e.,
The underlying physical effects drive the statistical behavior of the lfdr via . Thus, a generally optimal estimator does not exist.
The increasing interest in the incorporation of covariate information into mht, e.g. [Scott2015, Tansey2018, Chen2019a, Halme2019, Golz2019], has lead to a number of sophisticated lfdr estimators that treat as a two component mixture, the so-called two groups model [Efron2001, Pounds2003]
We also adopt the two-groups model. In spatial inference, is an estimator for the mixture of local summary statistics in the alternative region , see Eq. (4).
Iii-a The choice of a local summary statistic
We infer the alternative region for a given set of realizations of the local summary statistics by Eq. (6) with . We follow the traditional approach to lfdr estimation. is found from Eq. (7) using plug-in estimates and . The latter obeys the two groups model from Eq. (8). is assumed to be known, while the behavior in the alternative region is captured by . If the lfdr’s are underestimated at locations in , the nominal fdr level is violated. Thus, on average more than a fraction of the members in are false positives.
The suitability of different density estimation techniques to determine depends on the local summary statistic. The two groups model from Eq. (8) applies to -values and -scores alike. As illustrated in Fig. 2, the shapes of , and differ significantly. For -values, follows an . For -scores,
is the standard normal distribution,.
Much of the existing literature on lfdr estimation focuses on -scores. is often modeled as a parametric pdf with infinite support .
is found by estimating the parameters of a finite Gaussian mixture model[Le2003] or an exponential family model [Muralidharan2010, Efron2010]. The decomposition of into and for Eq. (8) is to be performed in a subsequent, non-trivial step. Non-parametric methods, which directly decompose into the components of the two-groups model have been studied using kernel estimates [Robin2007] and, more recently, predictive recursion [Martin2011, Scott2015, Martin2018, Newton2002].
To maintain fdr control, the estimator of is required to be highly accurate for values of that are more likely under the alternative, that is, in the left tail for left-sided -scores. High tail accuracy of the estimate is difficult to achieve, especially at small sample sizes. An overall well-fitting does not necessarily imply a good fit in the tails, due to their small probability mass. Tail behavior is difficult to capture. Finally, the domain of , , covers the entire real axis. Hence, the support of and its tails is infinite.
The lfdr estimators based on -values are not subject to the previously discussed tail-accuracy-related issues of
-score-based lfdr estimators. To this end, observe that the assumption of uniformly distributed-values under , i.e., follows , spreads statistically insignificant observations uniformly across the bounded domain instead of producing a bulk in the center of the domain. Consequently, most of the mass of is located in the subregion of that contains statistically significant (small) -values, i.e., where is large. Additionally, -value-based lfdr estimation allows for a simple way [Pounds2003] to decompose estimate into the components of the two-groups model and , ,
Since the domain of the -values is bounded, -value density estimators are commonly parametric. The true analytical form of is unknown. Consequently, one has to identify a parametric probability model and a parameter vector value such that . The quality of the pdf estimate can be evaluated by difference measures , such as the kld divergence or the wsd distance.
Since our proposed method relies on modeling -value mixture distributions, we consider the popular bum model [Pounds2003],
Superscript BUM indicates the dependency on the bum parameters and . denotes the bum model with the respective mle for and for . The bum model exploits two known properties of , which are also apparent in Fig. 2. First, the uniform distribution under is captured by the constant . Second,
is known to be monotonically decreasing. A single-parameter beta distribution decreases monotonically. The bum model is simple and has been applied successfully in a number of applications. However, it lacks flexibility due to its limited number of tuning parameters. Estimating by leads to overly pessimistic lfdr estimates, as the simulations in Sec. VI underline. We introduce a more flexible model in Sec. IV.
Iv The Proposed lfdr Estimator
In this section, we introduce a novel lfdr estimator. Our approach estimates lfdr’s from -values. We propose the parametric probability model
a finite single-parameter bm with shape parameters and mixture weights such that and . For , and , the -th component is monotonically decreasing, constant and monotonically increasing in , respectively. Due to the increased number of components, is more flexible than . Estimating its parameters is more involved than for the bum model, since model parameters are to be determined from the observations . Closed-form mles for the parameters of mixture distributions are difficult to obtain. Instead, mles are commonly found iteratively by em [Dempster1977], which is computationally expensive for larger model orders. Also, the parameter estimates are only locally optimal, which may result in poorly fitting models for non-convex likelihood functions. In this work, we target computationally light-weighted procedures suitable for large-scale sensor networks. Our approach bases upon the mom that estimates model parameters by solving pre-defined equation systems.
Iv-a The method of moments
The principle of moment-based parameter estimation [Pearson1936, Iskander1999]
is to match population and empirical moments. To this end, multivariate systems of moment equations are solved. The mom is conceptionally simple, but also entails challenges. Its analytic complexity rapidly increases with the number of model parameters. In addition, empirical higher-order moments are prone to large variance[Bowman2006]. Therefore, the sample size required to provide meaningful estimates grows exponentially in the number of model parameters [Hsu2012]. As a consequence, the standard mom is not well-suited to determining the parameters for Eq. (11).
The smom, a recent approach [Anandkumar2012, Hsu2012], allows to determine the parameters of multivariate Gaussian mixtures from only the first three moments. Thus, smom avoids higher-order moments. In contrast to other work on the field of low-order moment-based parameter estimation, the method in [Hsu2012] does not require a minimum distance between the locations of the mixture components to guarantee identifiability. This suits particularly well to this work, since we are dealing with -values on the domain and need to discriminate between mixture components that are located closely to one another. Combining the bm model and the smom provides a base for a computationally efficient -value density estimator that we introduce in what follows.
Iv-B The multivariate -value model
Traditional statistical techniques would treat the input data as a single observation of a -dimensional random vector with elements . The identification of the regions associated with null hypothesis and alternative and based on a single observation of a high-dimensional random vector is fairly challenging. To perform spatial inference, we adopt the idea of learning from the experience of others [Efron2010] and treat as realizations of the same scalar random variable . We first model the -values as a -dimensional random vector instead of estimating directly from . Then, we estimate its joint pdf . Finally, we average over the marginals to obtain the univariate estimate . Estimating a joint pdf appears intuitively more challenging. However, our multivariate -value model enables fast and reliable estimation of , since it facilitates the application of the smom. To the best of our knowledge, this concept is entirely new to lfdr estimation.
In what follows, assume . We divide the set of observations for random variable into distinct subsets of equal size , . Next, arrange the elements of each in no particular order into -dimensional -value vectors . are observations of the random vector , whose -th entry be random variable . Since , the marginal distribution of each can be described without loss of generality by a -component mixture
with mixture proportion vector such that and shape parameters . The partitioning of into and the ordering of the entries within each , , is not to be based , . Then, the are mutually independent random variables. Thus, is fully characterized by its marginals, which relate to through
With [Johnson1995, Chapter 24], the first two cumulants, mean and variance , of the -th marginal’s -th component are found , as
The third-order cumulant, is
Additionally, denote the -th mixture component mean vector by and the -th component vector of third-order cumulants by for all . We also define the averages across mixture components, , and . Since the are independently distributed, the -th component’s covariance matrix is diagonal with the -th entry . The mixture covariance matrix is .
To conclude this section, we formulate the following assumption on the -values.
Assumption 1 (Similar component variances).
The marginal variances of the -th mixture component are similar, such that they can be treated as approximately equivalent across the marginals, i.e., , .
In other words, for a certain , the entries of can be treated as observations of random variables with approximately equivalent variances.
There exist a subset size and a number of mixture components for which the -value subsets and vectors , , can be formed such that Assumption 1 holds for each mixture component .
Thus, the , can be divided into groups such that joint pdf of the -value vectors in each group is described by mixture component . For illustration purposes, consider that the subsets are formed based on spatial proximity, i.e., each is composed of -values from neighboring locations. For those subsets containing exclusively -values from locations , the statistical properties of each element are similar by design. Due to the assumed spatial smoothness of the physical phenomena of interest, also the -values obtained at close-by locations have similar statistical properties. Our simulation results in Sec. VI confirm that Assumption 1 is fairly mild.
Iv-C The spectral method of moments
The spectral method of moments was formulated for multivariate spherically Gaussian distributed data in[Hsu2012]. Their approach builds on the relation between the population moments and the model parameters, namely, the mixture weights, means and variances. In this section, we formulate similar relations for the -value vectors, given that they follow the model from Eq. (12) and fulfill Assumption 1. We first extend [Hsu2012, Theorem 2] such that it fits to our proposed data model.
Theorem 1 (Relation of mixture model parameters to spectral quantities).
if , and the -th mixture component mean vectors are linearly independent. Here, is a vector chosen uniformly at random from the unit sphere in , matrix . The projection matrices and are based on the matrix of left singular vectors of the thin svd of . In addition, and such that
with a vector of zeros and as its -th entry. is the outer product.
follows from Eq. (14). The relations for the mixture component means , and the mixture proportion vector result directly from the proof of [Hsu2012, Theorem 2]. Note, that and , which implies that is diagonalizable along the lines of in [Hsu2012, Theorem 2] by a diagonal matrix with diagonal entries . ∎
The relations established in Theorem 1 allow to estimate the mixture model by means of its parameters and , since , , and enable a one-to-one mapping between the model parameters and the observable population moments. In particular, is the first moment of , whereas and are related to the second and third moments and of . The exact relationships are derived in Theorem 2.
Theorem 2 (Relation of spectral to observable quantities).
Under the assumptions in Theorem 1, the average variance over mixture components is the smallest eigenvalue of the population covariance matrix . With any unit-norm eigenvector of eigenvalue , we find
where the observable and the difference to the non-observable are
with third-order tensors , vector ,
denotes the Hadamard product and
is the vector composed of the marginals’ mean third cumulants, i.e., Eq. (15) averaged over the mixture components.
See Appendix A. ∎
The first and second population moments and , the population covariance matrix and consequently also , and can be found using consistent sample estimates of the moments and covariance. cannot be estimated directly, since only depends exclusively on sample moments, but not . However, we show in Appendix B that is a sufficiently good approximation of for estimating the -value mixture density. is observable. In the following section, we exploit these relations to estimate the component first central moments and the model parameters , for Eq. (13).
Iv-D The lfdr-smom estimator
The proposed method is summarized in Alg. 1. It partitions the -values into subsets, estimates the parameters for the joint pdf of the resulting -value vectors and determines from Eq. (13). The density fit is repeated several times for different -value vectors, increasing subset sizes and increasing model orders . If the distinct subsets are immovable, because they are formed based on fixed covariate information like spatial proximity, we randomly rearrange the elements within each -value vector for different runs. The lfdr’s are estimated using , Eq. (9) and Eq. (7). The goodness of fit is assessed by the value of a difference measure between and the data.
is initialized with a large value, to ensure that the algorithm finds a solution. The best solution has been found, if additional degrees of freedom do not lead to a better fit.
The parameters of the -value vector mixture density are estimated by Alg. 2, which determines the right side of Eq. (16) from the sample moments. In Line 1, the data is split into two distinct sets of equal size. If
is odd, we drop one. The sample moment-based estimates for and for are multiplied during the estimation process. Hence, they must be computed from different data to guarantee their independence. The sample covariance matrix estimates in Line 3 are full rank, but Line 5 reduces the rank to its assumed value . is determined in Lines 6 and 7. Lines 8 to 13 are dedicated to the estimation of the eigenvector, eigenvalue pairs for Eq. (16). We generate different vectors and select the best run in Alg. 1 to fulfill Lemma 1. We found a value as low as to provide satisfying results.