Controlling and monitoring physical processes via sensed data are integral parts of internet-of-things (IOT) and cyber-physical systems, and also have applications in industrial process monitoring and control, localization, tracking of mobile objects, environmental monitoring, system identification and disaster management. In such applications, sensors are simultaneously resource constrained (power and/or bandwdith) and tasked to achieve high performance sensing, control, communication, and tracking. Wireless sensor networks must further contend with interference and fading. One strategy for balancing resource use with performance is to activate a subset of the total possible number of sensors to limit both computation as well as bandwidth use.
Herein, we address the fundamental problem of optimal dynamic sensor subset selection for tracking a time-varying stochastic process. We first examine the centralized tracking of an i.i.d. process with a known distribution, which is a precursor to the centralized tracking of an i.i.d. process with an unknown, parametric distribution. For the known prior distribution case, optimality of the proposed algorithm is proven. For the proposed algorithm for centralized tracking of an i.i.d. process with parameter learning, results on almost sure convergence to local optima are proven. The algorithms are numerically validated to demonstrate their efficacy against competetive algorithms and natural heuristics.
Optimal sensor subset selection problems can be broadly classified into two categories: (i) optimal sensor subset selection for static data with known prior distribution, but unknown realization, and (ii) dynamic sensor subset selection to track a time-varying stochastic process. There have been several recent attempts to solve the first problem; see for sensor network applications and  for mobile crowdsensing applications. This problem poses two major challenges: (i) computing the estimation error given the observations from a subset of sensors, and (ii) finding the optimal sensor subset from exponentially many number of subsets. In , a tractable lower bound on performance addressed the first challenge and a greedy algorithm addressed the second. In our current paper, we use Gibbs sampling to solve the problem of tracking an i.i.d. time varying process via active sensing. While estimation of static data and tracking i.i.d. time-varying process are the same problems mathematically, herein we provide a provably optimal alternative approach to that of ; in case the distribution is unknown and learnt over time, Gibbs sampling also yields a low-complexity sensor subset selection scheme, thereby eliminating the need for running a greedy algorithm whose complexity scales with the number of sensors.
, the problem of selecting a single sensor node to track a Markov chain is addressed; assumes the availability of a centralized controller which has knowledge of the latest observation made by the selected sensor. This problem was extended to sensor subset selection (by a centralized controller) in , and energy efficiency issues were incorporated in . These two papers considered sequential decision making over a finite time horizon. The existence of an optimal policy for the centralized optimal dynamic sensor subset selection problem for infinite time horizon is proved in ; the structure of the optimal policy for the special case of linear quadratic Gaussian (LQG) problem is also provided. The paper 
addresses the problem of selecting a single sensor at each time, with the assumption that the observation of the sensor is shared among all sensors. Thompson sampling, in, solved the problem of centralized tracking of a linear Gaussian process (with unknown noise statistics) via active sensing.
Herein, we consider the problem of dynamically choosing the optimal sensor subset for centralized tracking of an i.i.d. time-varying process with an unknown parametric distribution, using tools from Gibbs sampling (see ) and stochastic approximation (see ). To the best of our knowledge, this problem has not been solved in prior work. Our work accommodates energy constraint in the network by imposing a constraint on the number of active sensors.
In this paper, we make the following contributions:
In Section III, a centralized tracking and learning algorithm for an i.i.d. process with a known distribution is developed, in order to minimize time-average estimation error subject to a constraint on the mean number of active sensors. In particular, Gibbs sampling minimizes computational complexity for a relaxed version of the problem, along with stochastic approximation that is employed to iteratively update a Lagrange multiplier to achieve the mean number of activated sensors constraint. Desired almost sure convergence to the optimal solution is proved. A challenge we overcome in the analysis, is handling updates at different time scales that given rise to several technical issues that need to be addressed.
In Section IV, a centralized tracking and learning algorithm for an i.i.d. process with an unknown, but parametric distribution is developed. In addition to Gibbs sampling and stochastic approximation as used in Section III
, simultaneous perturbation stochastic approximation (SPSA) is employed for parameter estimation obviating the need for expectation-maximization.
Numerical results show that the proposed algorithms outperform simple greedy algorithms. Numerical results also demonstrate a tradeoff between performance and computational cost for learning. Furthermore, the numerical results show that sometimes global optima are achieved in tracking i.i.d. process with unknown parametric distribution.
The rest of the paper is organized as follows. The system model is described in Section II. Tracking of an i.i.d. process with known distribution is described in Section III. Section IV deals with the tracking of an i.i.d. process with unknown, parametric distribution. Numerical results are presented in Section V, followed by the conclusion in Section VI. All mathematical proofs are provided in the appendices.
Ii System Model
We consider a connected single-hop wireless sensor network (see Figure 1) where sensor nodes communicate directly with the fusion center; the fusion center is responsible for all control or estimation operations in the network. The sensor nodes are denoted by the index set . While our methods can be adapted to consider multihopped communication via relays, we do not treat this case herein.
The physical process under measurement is denoted by , where is a discrete time index and . is an i.i.d. process. The distribution of may be known, or might have a parametric distribution , where the unknown parameter vector needs to be be learnt via the measurements. The parameter vector lies inside the interior of a compact subset .
At time , if a sensor is used to sense the process, then the observation at sensor is provided by a -dimensional column vector
where is a Gaussian random vector (observation noise) which is independent across and i.i.d. across .
Let be a vector where the -th entry if the th sensor is activated at time, and , if it is inactive. The decision to activate any sensor for sensing and communicating the observation is taken by the fusion center. We denote by the set of all possible configurations (i.e., sensor activation vectors) in the network, and by a generic configuration. Clearly, . Each configuration represents a unique set of activated sensors. The notation is used to represent the configuration with its -th entry removed. We denote by another configuration which agrees with at all coordinates other than the -th coordinate, where has a value as the -th entry (i.e., the -th sensor is not activated); a similar definition holds for .
The observation made by sensor at time is . We define .
Ii-a Problem framework
Our sensor network seeks to achieve two goals: develop a sensing strategy, and compute an estimate of at the fusion center which is denoted by (see Figure 1). For the case of unknown distribution paramter, the fusion center also computes an estimate of those parameters, . To compute these three functions we define two distinct information structures:
The corresponding functions are then given as follows:
We observe the sequential nature in applying the functions , that is, the activation vector determines the observations which in turn are used to compute the tracked process, . For unknown , we compute . For an i.i.d. time varying process, is sufficient to estimate . However, in order to optimally decide when is unknown, the fusion center needs knowledge about the performance of all past configurations. Hence, and have two different information structures. However, we will see that, our Gibbs sampling algorithm determines by using only a sufficient statistic (which captures the past history) calculated iteratively in each slot.
We define a policy as a tuple of mappings, where , and as discussed earlier. The policy may be randomized, where the quantities , and are chosen according to random distributions defined by ; that is, , and
are three probability distributions for, and , respectively. In the sequel, we will investigate Gibbs sampling strategies for sensor selection; therein, will be random.
Our goal is to solve the following centralized problem of minimizing the time-average mean squared error (MSE) subject to a constraint on the mean number of active sensors per unit time:
where is the expectation under policy , and the expectation is taken over the randomness in the process as well as any possible randomness in the policy .
Iii IID proces with known distribution
In this section, we provide an algorithm for solving the centralized problem (P1) when is i.i.d. with known distribution. This algorithm is developed as a precursor to the algorithms for tracking an i.i.d. process having a parametric distribution with an unknown parameter .
Iii-a Relaxing the constrained problem
The multiplier can be viewed as the cost incurred for activating a sensor at any time instant. We will see later that solution of the unconstrained problem (P2) will be used to solve the constrained problem (P1).
We observe that, at time , for the chosen sensor subset and the corresponding collected observations , the minimum mean squared error (MMSE) estimate of is given by ; hence, we fix the estimation policy and solve (P2) only over the sensor subset selection policy (since the distribution of is known, has no relevance here). Since (P2) is an unconstrained problem and is i.i.d. across , there exists at least one optimizer (not necessarily unique) for the problem (P2); if the configuration is chosen at each , the minimum cost of (P2
) can be achieved (follows from the law of large numbers, since the cost incurred over time for a givenconstitutes an i.i.d. sequence whose mean is the optimal cost for (P2)). Hence, (P2) can be written as:
Here (for any ) is the MSE under estimation policy when the sensor activation vector is ; becomes equal to the MMSE under configuration if . Our results in this paper will hold for MMSE or any other general estimator.
The following result tells us how to choose the optimal to solve (P1).
Consider problem (P1) and its relaxed version (P3). If there exists a Lagrange multiplier and a , such that an optimal configuration for (P3) under is , and the constraint in (P1) is satisfied with equality under the pair , then is an optimal configuration for (P1).
See Appendix A. ∎
Finding the optimal solution of (P2) and (P3) requires us to search over possible configurations and to compute the MSE for each configuration. Hence, we propose Gibbs sampling based algorithms to avoid this computation.
Let us define a probability distribution over as (with a parameter ):
Following the terminology in statistical physics, we call the inverse temperature, and the partition function. The quantity is viewed as the energy under configuration . It is straightforward to see that . Hence, if a configuration is selected at each time with probability distribution for sufficiently large , then will belong to the set of minimizers of (P3) with high probability; if , then, for a unique minimizer , (which goes to as ). However, computing requires addition operations; hence, we use a sequential subset selection algorithm based on Gibbs sampling (see [11, Chapter ]) in order to avoid explicit computation of while picking .
Below we introduce the Basic Gibbs (BG) algorithm.
BG algorithm: Start with an initial configuration . At time , pick a random sensor uniformly from the set of all sensors. Choose with probability and choose with probability . For , choose . Activate the sensors according to .
Note that, in this algorithm, it is sufficient to maintain due to the i.i.d. nature of .
Under the BG algorithm, is a reversible, ergodic, time-homogeneous Markov chain with stationary distribution .
Follows from the theory in [11, Chapter ]). The proof can be done by verifying the detailed balance equation for the Markov chain . ∎
Theorem 2 tells us that if the fusion center runs BG and reaches the steady state distribution of the Markov chain , then the configuration chosen by the algorithm will have distribution . Also, by the ergodicity of , the time-average occurence rates of all configurations match the distribution almost surely.
For very large , if one runs for a sufficiently long, finite time , then the terminal state will belong to with high probability. We have already shown that, for a unique minimizer , (which goes to as ). In Section III-D, we will provide an upper bound on which is the total variation distance between the distribution of under BG algorithm and the distribution ; the upper bound is , where . Hence, for a large but finite time , we can derive the following bound:
Iii-C The exact solution
BG is operated with a fixed , but the optimal solution of the unconstrained problem (P2) can only be obtained with ; this is done by updating at a slower time-scale than the iterates of BG algorithm. The quantity is increased logarithmically with time in order to maintain the necessary timescale difference between Gibbs sampling and update. We call this new algorithm Adaptive Basic Gibbs or ABG.
ABG algorithm: This algorithm is same as BG except that at time , we use to compute the update probabilities, where , , and .
Under the ABG algorithm, the Markov chain is strongly ergodic, and the limiting probability distribution satisfies .
For i.i.d. time varying
with known joint distribution, we can either: (i) find the optimal configurationusing ABG off-line and use for ever, or (ii) run ABG at the same timescale as , and use the current configuration for sensor activation; both schemes will minimize the cost in (P2). By the strong ergodicity of , optimal cost will be achieved for (P2) under ABG.
Iii-D Convergence rate of BG and ABG
Let denote the probability distribution of under BG. Let us consider the transition probability matrix of the Markov chain with , under BG. Let us recall the definition of the Dobrushin’s ergodic coefficient from [11, Chapter , Section ] for the matrix ; using a method similar to that of the proof of Theorem 3, we can show that . Then, by [11, Chapter , Theorem ], we can say that under BG, we have . We can prove similar bounds for any , where .
Clearly, under the BG algorithm, the convergence rate decreases as increases. Hence, there is a trade-off between convergence rate and accuracy of the solution in this case. Also, the rate of convergence decreases with .
Such a closed-form convergence rate bound for ABG is not easily available. For the ABG algorithm, the convergence rate is expected to decrease with time, since the value of increases to .
Iii-E Gibbs sampling and stochastic approximation for (P1)
In Section III-B and Section III-C, we presented Gibbs sampling based algorithms for the unconstrained problem (P2). Now we provide an algorithm that updates with time in order to meet the constraint in (P1) with equality, and thereby solves (P1) (see Theorem 1) by solving the unconstrained problem.
Let us denote the optimal configuration for (P3) under a given estimation strategy (which could be the MMSE estimator) by .
For the unconstrained problem (P3), the optimal mean number of active sensors, , decreases with . Similarly, the optimal error, , increases with .
See Appendix D. ∎
Lemma 1 provides intuition as to how to update in BG or in ABG in order to solve (P1). We seek to provide one algorithm which updates at each time instant, based on the number of active sensors in the previous time instant. In order to maintain the necessary timescale difference between the process and the update process, we use stochastic approximation () based update rules for .
The optimal mean number of active sensors, , for the unconstrained problem (P3) is a decreasing staircase function of , where each point of discontinuity is associated with a change in the optimizer . Hence, the optimal solution of the constrained problem (P1) requires us to randomize between two values of (and therefore between two configurations) in case the optimal as in Theorem 1 belongs to the set of such discontinuities. However, this randomization will require us to update a randomization probability at another timescale; having stochastic approximations running in multiple timescales leads to slow convergence. Hence, instead of using a varying , we use a fixed, but large and update in an iterative fashion using stochastic approximation; BG itself is a randomized subset selection algorithm and hence no further randomization is required to meet the constraint in (P1) with equality. This observation is formalized in the following lemma. This lemma will be crucial in the convergence proof of the Gibbs Learn (GL) algorithm proposed later.
Under BG, is a Lipschitz continuous and decreasing function of .
See Appendix E. ∎
We make the following feasibility assumption for (P1), under BG with the chosen .
There exists such that the constraint in (P1) under and BG is met with equality.
Note that, by Lemma 2, continuously decreases in . Hence, if is feasible, then such a must exist by the intermediate value theorem. Our proposed Gibbs Learn (GL) algorithm updates iteratively in order to solve (P1). Let us define: (recall the notation from (P3)). Now, we formally describe the GL algorithm to solve the constrained problem (P1).
Choose any initial and .
At each discrete time instant , pick a random sensor independently and uniformly. For sensor , choose with probability
and choose with probability
. For , we choose .
Update at each node as follows:
We next make two observations on the GL algorithm:
If is more than , then is increased with the hope that this will reduce the number of active sensors in subsequent time slots, as suggested by Lemma 2.
The and processes run on two different timescales; runs in the faster timescale whereas runs in a slower timescale. This can be understood from the fact that the stepsize in the update process decreases with time . Here the faster timescale iterate will view the slower timescale iterate as quasi-static, while the slower timescale iterate will view the faster timescale as almost equilibriated. This is reminiscent of two-timescale stochastic approximation (see [12, Chapter ]).
Let denote under .
Under GL and Assumption 1, we have almost surely, and the limiting distribution of is .
See Appendix F. ∎
Theorem 4 says that GL produces a configuration from the distribution under steady state. Hence, GL meets the sensor activation constraint in (P1) with equality and offers a near-optimal time-average mean squared error for the constrained problem; the gap from the optimal MSE can be made arbitrarily small by choosing large enough.
Iii-F A hard constraint on the number of activated sensors
Let us consider the following modified constrained problem (recall notation from (P3)):
It is easy to see that (P4) can be easily solved using similar Gibbs sampling algorithms as in Section III, where the Gibbs sampling algorithm runs only on the set of configurations which activate number of sensors. Thus, as a by-product, we have also proposed a methodology for the problem in , though our framework is more general than .
Iv IID process with parametric distribution: Unknown
In Section III, we described algorithms for centralized tracking of an i.i.d. process with known distributions. In this section, we will deal with the centralized tracking of an i.i.d. process where with an unknown parameter ; in this case, has to be learnt over time through observations.
The algorithm described in this section will be an adaptation of the GL algorithm discussed in Section III-E. However, when is unknown, we have to update its estimate over time using the sensor observations. In order to solve the constrained problem (P1), we still need to update over time so as to attain the optimal of Theorem 1 iteratively. However, (MSE under configuration ) in (P3) is unknown since is unknown, and its estimate has to be learnt over time using the sensor observations; as a result is also iteratively updated for all . Hence, we combine the Gibbs sampling algorithm with update schemes for , and using multi-timescale stochastic approximation (see ). The Gibbs sampling runs in the fastest timescale and the update runs in the slowest timescale.
Since the algorithm has several steps (such as Gibbs sampling, update, update and update) and each of these steps needs detailed explanation, we first describe in details some key features and steps of the algorithm in Section IV-A, Section IV-B, Section IV-C, Section IV-D and Section IV-E, and then provide a brief summary of the algorithm in Section IV-F.
The proposed Gibbs Parameter Learning (GPL) algorithm also requires a sufficiently large positive number and a large integer as input. The need of these parameters will be clear as we proceed through the algorithm description.
Let denote the indicator that time is an integer multiple of . Define to be the number of time slots till time when all sensors are activated.
Iv-a Step size sequences
For the stochastic approximation updates of , and , the algorithm uses four nonnegative sequences , , , . Let be a generic sequence where . Then we require the following two conditions: (i) and (ii) .
In addition, we have the following specific assumptions:
Let us recall from the GL algorithm that we had one stochastic approximation update for and a single stepsize sequence . However, in the GPL algorithm, we will have three stochastic approximation updates and hence there are three step size sequences , and in the sequel; the stepsize is used in estimate update .
Conditions (i) and (ii) are standard requirements for stochastic approximation step sizes. Conditions (iii) and (iv) are additional requirements for the update scheme described later. Condition (v) ensures that the three update equations run on separate timescales. The stepsize corresponds to the slowest timescale and corresponds to the fastest timescale among these step size sequences; this can be understood from the fact that any iteration involving as the stepsize will vary very slowly, and the iteration involving will vary fast due to the large step sizes. In condition (v), we have instead of because in our proposed GPL algorithm will be updated only once in every time slots, and hence a step size will be used to update by using observations from all sensors whenever .
Iv-B Gibbs sampling step for sensor subset selection
The algorithm also maintains a running estimate of for all . At time , it selects a random sensor uniformly and independently, and sets with probability and with probability . For , it sets . 111This randomization operation can even be repeated multiple times in each time slots to achieve faster convergence results. The sensors are activated according to , and the observations are collected. Then the algorithm declares .
Iv-C Parameter estimate update
If , the fusion center reads all sensors and obtains . This is required primarily because we seek to update iteratively and reach a local maximum of the function:
is the expected log-likelihood of given , when all sensors are active and . Note that, maximizing minimizes the KL divergence . If we use only the sensor observations corresponding to the activation vector obtained from Gibbs sampling, then the estimate will be biased by the reading of the sensors which are chosen more frequently by Gibbs sampling. However, we will later see that the additional amount of sensing and communication can be made arbitrarily small by choosing large enough.
Since we seek to reach a local maximum of , a gradient ascent scheme is employed. The gradient of along any coordinate can be computed by perturbing in two opposite directions along that coordinate and evaluating the difference of at those two perturbed values. However, if is high-dimensional, then estimating this gradient along all coordinates is computationally intensive. Moreover, evaluating for any requires us to compute an expectation over the distribution and the distribution of the observation noise at all sensors, which might also be expensive. Hence, we perform a noisy gradient estimation for by simultaneous perturbation stochastic approximation (SPSA) as in . Our algorithm generates uniformly over all sequences, and perturbs the current estimate by a random vector (recall that ) in two opposite directions to obtain and , and estimates each component of the gradient from the following difference:
This estimate is noisy because (i) and are random, and (ii) while ideally it should be infinitesimally small.
The -th component of is updated as follows:
The iterates are projected onto the compact set to ensure boundedness.
Note that, (7) is a stochastic gradient ascent iteration performed once in every slots with step size . On the other hand, the sequence is the perturbation sequence used in gradient estimation, and hence it need not behave like a standard stochastic approximation step size sequence. However, should converge to as , in order to ensure that the gradient estimate is asymptotically unbiased. The conditions (iii) and (iv) in Section IV-A are technical conditions required for the convergence of (7).
Iv-D Weight update
is updated as follows:
The intuition here is that, if , the sensor activation cost needs to be increased to prohibit activating large number of sensors in future; this is motivated by Lemma 2 and is similar to the update equation in GL algorithm. The goal is to converge to as defined in Theorem 1.
Iv-E MSE estimate update
Since is not known initially, the true value of (i.e., the MSE under configuration and known distribution of ) is not known; hence, the proposed GPL algorithm updates an estimate using the sensor observations. If , the fusion center obtains by reading all sensors. The goal is to obtain a random sample of the MSE under a configuration , by using these observations, and update using .
However, since is unknown and only is available, as an alternative to the MSE under configuration , the fusion center uses the trace of the conditional covariance matrix of given , assuming that
. Hence, we define a random variable:
for each , where is the MMSE estimate declared by for configuration and given the observation made by active sensors determined by , under the assumption that . Clearly, is a random variable with the randomness coming from two sources: (i) randomness of , and (ii) randomness of which has a distribution since the original process that yields has a distribution . Computation of is simple for Gaussian and the MMSE estimator, since closed form expressions are available for . In case the computation of is expensive (since it requires us to evaluate an expectation over the conditional distribution of given and
), one can obtain an unbiased estimate ofby drawing a random sample of from the distribution , and scaling the sample squared error by the ratio of the two distributions and ; this technique is basically importance sampling (see ) with only one sample.
Using , the following update is made for all :
The iterates are projected onto a compact interval to ensure boundedness. The goal here is that, if , then will converge to , which is equal to under .
We will later argue that this occasional computation for all can be avoided, but convergence will be slow.
Iv-F The GPL algorithm
A summary of all the steps of the GPL algorithm is provided below. We will show in Theorem 5 the almost sure convergence of this algorithm.
GPL algorithm: Initialize all iterates arbitrarily. For any time , do the following: Sensor activation and process estimation: Perform the Gibbs sampling step as in Section IV-B and obtain the activation vector . Activate the sensors and obtain the corresponding observations at the fusion center. Estimate using the observations and the current parameter estimate . If , read all sensors and obtain . Compute or estimate (defined in (9)) for all . update: If , update for all using (10) and also update for all . update: If , update using (7). If , then . update: Update according to (8).
Note that, in the GPL algorithm, it is sufficient to consider , and . The Gibbs sampling step tries to minimize a running estimate of the unconstrained cost function over the space of configurations .
Multiple timescales: GPL has multiple iterations running in multiple timescales (see [12, Chapter ]). The process runs ar the fastest timescale, whereas the update scheme runs at the slowest timescale. The basic idea is that a faster timescale iterate views a slower timescale iterate as quasi-static, whereas a slower timescale iterate views a faster timescale iterate as almost equilibriated. For example, since , the iterates will vary very slowly compared to iterates; as a result, iterates will view quasi-static . In other words, the iterates will behave as if a slower timescale iterate varies in a slow outer loop, and a faster timescale iterate varies in an inner loop.
Iv-G Complexity of GPL and reducing the complexity
Sampling and communication complexity: Since all sensors are activated when , the mean number of additional active sensors per unit time is ; these additional observations need to be communicated to the fusion center. However, the additional sensing can be made large enough by choosing a very large .
Computational complexity: The computation of in (10) for all requires expectation computations whenever . However, if one chooses large (e.g., ), then this additional computation per unit time will be small. However, if one wants to avoid that computation also, then, when , one can simply compute and update