# Optimal Dynamic Sensor Subset Selection for Tracking a Time-Varying Stochastic Process

Motivated by the Internet-of-Things and sensor networks for cyber-physical systems, the problem of dynamic sensor activation for the tracking of a time-varying process is examined. The tradeoff is between energy efficiency, which decreases with the number of active sensors, and fidelity, which increases with the number of active sensors. The problem of minimizing the time-averaged mean-squared error over an infinite horizon is examined under a constraint on the mean number of active sensors. The proposed methods artfully combine three key ingredients: Gibbs sampling, stochastic approximation for learning, and modifications to consensus algorithms, to create a high-performance, energy-efficient tracking mechanism with active sensor selection. The following progression of scenarios is considered: centralized tracking of an i.i.d. process; distributed tracking of an i.i.d. process; and finally, distributed tracking of a Markov chain. The challenge of the i.i.d. case is that the process has a distribution parameterized by a known or unknown parameter which must be learned. The key theoretical results prove that the proposed algorithms converge to local optima for the two i.i.d. process cases; numerical results suggest that global optimality is in fact achieved. The proposed distributed tracking algorithm for a Markov chain, based on Kalman-consensus filtering and stochastic approximation, is seen to offer error performance comparable to that of a competitive centralized Kalman filter.


## I Introduction

Controlling and monitoring physical processes via sensed data are integral parts of the Internet-of-Things (IoT), cyber-physical systems, and defense, with applications to industrial process monitoring and control, localization, tracking of mobile objects, environmental monitoring, system identification, and disaster management. In such applications, sensors are simultaneously resource constrained (power and/or bandwidth) and tasked to achieve high-performance sensing, control, communication, and tracking. Wireless sensor networks must further contend with interference and fading. One strategy for balancing resource use with performance is to activate only a subset of the available sensors, limiting both computation and bandwidth use.

Herein, we address the fundamental problem of optimal dynamic sensor subset selection for tracking a time-varying stochastic process. We first examine the centralized tracking of an i.i.d. process with an unknown, parametric distribution, which serves as a benchmark for the first extension: decentralized tracking of this process. For both proposed algorithms, almost sure convergence to local optima can be proven. Next, a distributed algorithm for tracking a Markov process with a known probability transition matrix is developed. All algorithms are numerically validated.

### I-A Related Literature

Optimal sensor subset selection problems can be broadly classified into two categories: (i) optimal sensor subset selection for static data with a known prior distribution but unknown realization, and (ii) dynamic sensor subset selection to track a time-varying stochastic process. There have been several recent attempts to solve the first problem; see [3, 1]. This problem poses two major challenges: (i) computing the estimation error given the observations from a subset of sensors, and (ii) finding the optimal sensor subset from among exponentially many subsets. In [3], a tractable lower bound on performance addressed the first challenge and a greedy algorithm addressed the second. In [1], only the second challenge is addressed, via a Gibbs sampling approach.

Dynamic sensor subset selection for a time-varying stochastic process is considered in, e.g., [4, 5, 6, 7, 8, 9]. Markov process tracking with centralized controllers for single ([6]) or multiple sensor selection with energy constraints and sequential decision making ([4, 5]) has been previously studied. The optimal policy and its structural properties for a special case of dynamic sensor selection over an infinite horizon were examined in [7]. Single sensor selection with the broadcast of sensor data is studied in [8]. To our knowledge, the combination of Gibbs sampling (e.g., [10]) and stochastic approximation (e.g., [11]) has not been previously applied to i.i.d. process tracking as we do herein. The paper [12], using Thompson sampling, has solved the problem of centralized tracking of a linear Gaussian process (with unknown noise statistics) via active sensing.

Given our consideration of dynamic sensor subset selection for distributed tracking of a Markov process, traditional Kalman filtering is not applicable; however, there is much recent work on the development of consensus-based Kalman filtering [13, 14, 15, 16, 17], which we further adapt to work with our Gibbs sampling/stochastic approximation approach for the i.i.d. case. This distributed tracking problem does not appear to be heavily studied. In [8], perfect information sharing between sensors is assumed, which is impractical, and [18] assumes that the estimation is done in a centralized fashion, while sensors make decentralized decisions about whether to sense or communicate to the fusion center. In [18], the sparsity of the Markov process is also exploited.

### I-B Our Contribution

In this paper, we make the following contributions:

1. A centralized tracking and learning algorithm for an i.i.d. process with an unknown but parametric distribution is developed. In particular, Gibbs sampling minimizes computational complexity, while stochastic approximation enforces the constraint on the mean number of activated sensors. Furthermore, simultaneous perturbation stochastic approximation (SPSA) is employed for parameter estimation, obviating the need for the expectation-maximization algorithm. A challenge we overcome in the analysis is handling updates at different timescales. As a precursor to the algorithm for unknown parameters, algorithms are developed for a simpler version of the problem in which the parameter of the distribution is known.

2. The centralized algorithm, which serves as a benchmark, is adapted to the distributed case by exploiting partial consensus. Our partial consensus is novel in that SPSA is again employed to learn the optimal consensus gains adaptively.

3. A trick for ensuring that all sensors employ similar sampling strategies is to have each sensor use the same seed in a random number generator.

4. For both the centralized and distributed algorithms, we can prove almost sure convergence to local optima. Furthermore, we can prove that the resources needed for communication and learning can be made arbitrarily small by exploiting properties of the multi-scale updates.

5. Our final generalization is to develop an algorithm for sensor subset selection for the decentralized tracking of a Markov chain with known probability transition matrix. We adapt methods for Kalman consensus filtering to our framework with Gibbs sampling and stochastic approximation.

6. Numerical results show that the decentralized scheme for a Markov chain performs close to the centralized scheme. Numerical results also show a tradeoff between performance and message exchange for learning. Furthermore, the numerical results show that global (not merely local) optima are achieved in tracking the i.i.d. process.

### I-C Organization

The rest of the paper is organized as follows. The system model is described in Section II. Centralized tracking of an i.i.d. process with known distribution is described in Section III. Section IV deals with centralized tracking of an i.i.d. process having a parametric distribution with unknown parameters. Distributed tracking of the i.i.d. process is discussed in Section V. Distributed tracking of a Markov chain is described in Section VI. Numerical results are presented in Section VII, followed by the conclusion in Section VIII.

## II System Model

### II-A Network, data and sensing model

We consider a connected single- or multi-hop wireless sensor network. The sensor nodes are denoted by the set . It might also be possible that connectivity in the network is maintained via a few relay nodes, but we ignore this possibility for ease of analysis. There can be a fusion center (connected to all sensors via single-hop wireless links) responsible for all control or estimation operations in the network, or, alternatively, the sensors can operate autonomously in a multihop mesh network.

The physical process under measurement is denoted by , where is a discrete time index and . We consider two models for the evolution of :

• IID model: is an i.i.d. process with a parametric distribution , where the unknown parameter needs to be learnt via the measurements. The parameter lies inside the interior of a compact subset .

• is a finite-state ergodic Markov chain with known transition probability matrix.

At time , if a sensor is used to sense the process, then the observation at that sensor is given by a -dimensional column vector, where is an observation matrix of appropriate dimension and is a Gaussian noise vector; the observation noise is assumed to be independent across sensors and i.i.d. across time.

Let be a vector where is the indicator that the sensor  is activated at time ; if sensor  is active at time , or else . The decision to activate any sensor for sensing and communicating the observation is taken either by the fusion center or by the sensor itself in the absence of a fusion center. We denote by the set of all possible configurations in the network, and by a generic configuration. Clearly, . Each configuration represents a set of activated sensors. is used to represent the configuration with its -th entry removed.

The observation made by sensor  at time is . We define .
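As a concrete illustration of the linear-Gaussian sensing model above, the following sketch simulates per-sensor observations gated by the activation vector; the dimensions, the observation matrices, and the noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper):
N, q, m = 5, 3, 2          # sensors, process dimension, per-sensor observation dimension
H = [rng.standard_normal((m, q)) for _ in range(N)]  # per-sensor observation matrices
sigma = 0.1                # observation-noise standard deviation (assumed)

def observe(x, B):
    """Observations {k: y_k(t)} from the sensors activated by the 0/1 vector B(t)."""
    return {k: H[k] @ x + sigma * rng.standard_normal(m)
            for k in range(N) if B[k] == 1}

x = rng.standard_normal(q)            # process realization X(t)
B = np.array([1, 0, 1, 0, 1])         # activation vector B(t): sensors 0, 2, 4 active
y = observe(x, B)
```

Only the active sensors contribute observations, which is the source of the fidelity/energy tradeoff in (P1).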

### II-B Centralized estimation problem

The estimate of at the fusion center (connected to all sensors via direct wireless links) is denoted by . We denote by the information available at time at the fusion center about the history of observations, activations and estimates up to time , before the fusion center determines . On the other hand, we define where is the current estimate of ; denotes the information used by the fusion center at time to estimate . For an i.i.d. time-varying process, is sufficient to estimate and obtain , and will be available only after deciding the activation vector and collecting all observations. In order to optimally decide , the fusion center needs knowledge about the performance of all configurations in the past. Hence, and have two different information structures. However, we will see that our Gibbs sampling algorithm determines by using only a sufficient statistic calculated iteratively in each slot.

The information structure used to track a Markov chain will be different, which we will see in Section VI.

We define a policy as a pair of mappings, where and . Our first goal is to solve the following centralized problem of minimizing the time-average mean squared error (MSE) subject to a constraint on the mean number of sensors active per unit time:

$$\mu^* = \arg\min_{\mu}\ \limsup_{t\to\infty}\frac{1}{t}\sum_{\tau=1}^{t}\mathbb{E}_{\mu}\,\|X(\tau)-\hat{X}(\tau)\|^2 \quad \text{s.t.}\quad \limsup_{t\to\infty}\frac{1}{t}\sum_{\tau=1}^{t}\mathbb{E}_{\mu}\,\|B(\tau)\|_1\le\bar{N} \tag{P1}$$

### II-C Distributed estimation problem

In the absence of a fusion center in a multi-hop network, the estimate of at sensor  is denoted by . We denote by the information available at time at sensor  about the history before the sensor determines , and by the information available at sensor  at time just before is estimated.

We define a policy , where and .

We seek to solve the following distributed problem:

$$\mu^* = \arg\min_{\mu}\ \limsup_{t\to\infty}\frac{1}{t}\sum_{\tau=1}^{t}\frac{1}{N}\sum_{k=1}^{N}\mathbb{E}_{\mu}\,\|X(\tau)-\hat{X}^{(k)}(\tau)\|^2 \quad \text{s.t.}\quad \limsup_{t\to\infty}\frac{1}{t}\sum_{\tau=1}^{t}\mathbb{E}_{\mu}\,\|B(\tau)\|_1\le\bar{N} \tag{P2}$$

## III Centralized Tracking of IID Process: Known θ0

In this section, we provide an algorithm for solving the centralized problem (P1) when the process is i.i.d. with known parameter. This is done by relaxing (P1) via a Lagrange multiplier. Though our final goal is to track an i.i.d. process with unknown parameter, we discuss algorithms for the known-parameter case in this section as a precursor to the algorithms developed in subsequent sections for the unknown-parameter case and for tracking a Markov chain. Also, the extension to distributed tracking will be discussed in Section V for the unknown-parameter case, and hence is omitted here.

### III-A The relaxed version of the constrained problem

We relax (P1) by using a Lagrange multiplier and obtain the following unconstrained problem:

$$\mu^* = \arg\min_{\mu}\ \limsup_{t\to\infty}\frac{1}{t}\sum_{\tau=1}^{t}\mathbb{E}_{\mu}\big(\|X(\tau)-\hat{X}(\tau)\|^2+\lambda\,\|B(\tau)\|_1\big) \tag{P3}$$

The multiplier can be viewed as the cost incurred for activating a sensor at any time instant. Now, since (P3) is an unconstrained problem and the process is i.i.d. across time, there exists one optimizer (not necessarily unique) for (P3); if the optimizing configuration is chosen at each time, the minimum cost of (P3) can be achieved by the law of large numbers. Hence, for known parameter, the problem (P3) can be equivalently written as:

$$\arg\min_{B\in\mathcal{B}}\ \underbrace{\mathbb{E}_{B}\,\|X(\tau)-\hat{X}(\tau)\|^2}_{=:f(B)}+\lambda\,\|B\|_1\ =:\ \arg\min_{B\in\mathcal{B}}\, h(B) \tag{P4}$$

The following result tells us how to choose the optimal $\lambda$ to solve (P1).

###### Theorem 1
Consider problem (P1) and its relaxed version (P4). If there exists a Lagrange multiplier and a , such that an optimal configuration for (P4) under is , and the constraint in (P1) is satisfied with equality under the pair , then is an optimal configuration for (P1). In case there exist multiple configurations , a multiplier , and a probability mass function such that (i) each of is optimal for problem (P4) under , and (ii) , then an optimal solution for (P1) is to choose one configuration from with probability mass function .
See Appendix A.

###### Remark 1

Theorem 1 allows us to obtain a solution for (P1) from the solution of (P4) by choosing an appropriate $\lambda$; this will be elaborated upon in Section III-D.

### III-B Basics of Gibbs sampling for known pθ0(⋅)

Finding the optimal solution of (P4) requires a search over all possible configurations and computation of the MMSE for each of these configurations. Hence, we propose Gibbs sampling based algorithms to avoid this computation.

Let us define the probability distribution $\pi_\beta$ over $\mathcal{B}$ as:

$$\pi_\beta(B):=\frac{e^{-\beta h(B)}}{\sum_{B'\in\mathcal{B}}e^{-\beta h(B')}}=:\frac{e^{-\beta h(B)}}{Z_\beta}. \tag{1}$$

Following the terminology of statistical physics, we call $\beta$ the inverse temperature and $Z_\beta$ the partition function; $h(B)$ is viewed as the energy under configuration $B$. As $\beta\to\infty$, $\pi_\beta$ concentrates on the set of global minimizers of $h(\cdot)$. Hence, if a configuration is selected at each time with probability distribution $\pi_\beta$ for sufficiently large $\beta$, then it will belong to the set of minimizers of (P3) with high probability. However, computing $Z_\beta$ requires summation over all configurations; hence, we use a sequential subset selection algorithm based on Gibbs sampling (see [10, Chapter ]) in order to avoid explicit computation of $Z_\beta$ while picking $B(t)$.

###### Algorithm 1
Start with an initial configuration . At time , pick a random sensor  uniformly from the set of all sensors. Choose with probability and choose with probability . For , choose . Activate the sensors according to .
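The single-site update of Algorithm 1 can be sketched as follows, using a toy surrogate for the MSE term $f(B)$ (the true $f(B)$ comes from the MMSE estimator and is not modeled here); the constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam, beta = 5, 0.5, 2.0          # sensors, activation price λ, inverse temperature

def f(B):
    """Toy stand-in for the MSE f(B): decreases as more sensors are active (assumption)."""
    return 1.0 / (1.0 + B.sum())

def h(B):
    """Energy h(B) = f(B) + λ ||B||_1, as in (P4)."""
    return f(B) + lam * B.sum()

def gibbs_step(B, beta):
    """One step of Algorithm 1: resample one uniformly chosen coordinate of B
    from the conditional distribution π_β(B_k | B_{-k})."""
    k = rng.integers(N)
    B1, B0 = B.copy(), B.copy()
    B1[k], B0[k] = 1, 0
    p1 = 1.0 / (1.0 + np.exp(-beta * (h(B0) - h(B1))))  # π_β(B_k = 1 | rest)
    B[k] = 1 if rng.random() < p1 else 0
    return B

B = rng.integers(0, 2, N)
for _ in range(1000):
    B = gibbs_step(B, beta)
```

For large $\beta$ the chain spends most of its time near minimizers of $h$, which is exactly the role Theorem 2 formalizes via the stationary distribution $\pi_\beta$.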

###### Theorem 2
Under Algorithm 1, is a reversible, ergodic, time-homogeneous Markov chain with stationary distribution .
Follows from the theory in [10, Chapter ].

###### Remark 2

Theorem 2 tells us that if the fusion center runs Algorithm 1 and reaches the steady-state distribution of the Markov chain, then the configuration chosen by the algorithm will have distribution . For very large , if one runs the chain for a sufficiently long, finite time , then the terminal state will belong to with high probability. Also, by the ergodicity of the chain, the time-average occurrence rates of all configurations match the distribution almost surely.

### III-C The exact solution

Algorithm 1 is operated with a fixed $\beta$; but the optimal solution of the unconstrained problem (P3) can only be obtained as $\beta\to\infty$; this is done by updating $\beta$ at a slower timescale than the iterates of Algorithm 1.

###### Algorithm 2
This algorithm is same as Algorithm 1 except that at time , we use to compute the update probabilities, where , , and .
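The exact schedule for the increasing inverse temperature is elided in the extracted text above; a standard choice in simulated-annealing-type arguments is a slow logarithmic increase, sketched below purely as an assumption, not necessarily the paper's exact schedule.

```python
import math

def beta_schedule(t, beta0=1.0):
    """Logarithmically increasing inverse temperature β(t) → ∞ as t → ∞
    (an assumed, standard annealing choice)."""
    return beta0 * math.log(2 + t)
```

Slow (logarithmic) growth is what typically preserves the weak/strong ergodicity of the resulting time-inhomogeneous chain invoked in Theorem 3.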

###### Theorem 3
Under Algorithm 2, the Markov chain is strongly ergodic, and the limiting probability distribution satisfies .
See Appendix C. We have used the notion of weak and strong ergodicity of time-inhomogeneous Markov chains from [10, Chapter , Section ], which is provided in Appendix B. The proof is similar to the proof of one theorem in [19], but is given here for completeness.

###### Remark 3

Theorem 3 shows that we can solve (P3) exactly if we run Algorithm 2 for infinite time, in contrast with Algorithm 1 which provides an approximate solution.

###### Remark 4

For an i.i.d. time-varying process with known joint distribution, we can either: (i) find the optimal configuration using Algorithm 2 and use it forever, or (ii) run Algorithm 2 at the same timescale as the process, and use the running configuration for sensor activation; both schemes will minimize the cost in (P3). By the strong ergodicity of the chain, the optimal cost will be achieved for (P3) under Algorithm 2.

#### III-C1 Convergence rate of Algorithm 1

Let denote the probability distribution of under Algorithm 1. Let us consider the transition probability matrix of the Markov chain with , under Algorithm 1. Let us recall the definition of Dobrushin's ergodic coefficient from [10, Chapter , Section ] for the matrix ; using a method similar to that of the proof of Theorem 3, we can show that . Then, by [10, Chapter , Theorem ], we can say that under Algorithm 1, we have . We can prove similar bounds for any , where .

Unfortunately, we are not aware of such a closed-form bound for Algorithm 2.

###### Remark 5

Clearly, under Algorithm 1, the convergence rate decreases as increases. Hence, there is a trade-off between convergence rate and accuracy of the solution in this case. Also, the rate of convergence decreases with . For Algorithm 2, the convergence rate is expected to decrease with time.

### III-D Gibbs sampling and stochastic approximation based approach to solve the constrained problem

In Section III-B and Section III-C, we presented Gibbs sampling based algorithms for (P3). Now we provide an algorithm that updates with time in order to meet the constraint in (P1) with equality, and thereby solves (P1) (via Theorem 1).

###### Lemma 1
For the unconstrained problem (P4), the optimal mean number of active sensors, , decreases with . Similarly, the optimal error, , increases with .
See Appendix D.

###### Remark 6

The optimal mean number of active sensors, , for the unconstrained problem (P4) is a decreasing staircase function of , where each point of discontinuity is associated with a change in the optimizer .

Lemma 1 provides intuition about how to update $\lambda$ in Algorithm 1 or Algorithm 2 in order to solve (P1). We seek a single algorithm that updates $\lambda$ at each time instant, based on the number of active sensors in the previous time instant. In order to maintain the necessary timescale difference between the configuration process and the $\lambda$ update process, we use stochastic approximation ([11]) based update rules for $\lambda$. But the above remark tells us that the optimal solution of the constrained problem (P1) may require randomizing between two values of $\lambda$ in case the optimal $\lambda^*$ as in Theorem 1 belongs to the set of such discontinuities. However, this randomization would require us to update a randomization probability at yet another timescale, and stochastic approximations running at multiple timescales converge slowly. Hence, instead of using a varying $\beta$, we use a fixed but large $\beta$ and update $\lambda$ iteratively using stochastic approximation. Our proposed Algorithm 3 updates $\lambda(t)$ iteratively in order to solve (P1).

###### Algorithm 3
Choose any initial and . At each discrete time instant , pick a random sensor independently and uniformly. For sensor , choose with probability and choose with probability . For , we choose . After this operation, before the decision instant, update at each node as follows.
The stepsize constitutes a positive sequence such that and . The nonnegative projection boundaries and for the iterates are such that where is defined in Assumption 1.
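The coupled iteration of Algorithm 3 can be sketched as follows, again with a toy surrogate for $f(B)$; the stepsize $b(t)=1/t$ and the projection boundaries are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, Nbar, beta = 5, 2.0, 5.0          # sensors, target mean number of active sensors, fixed β
lam_lo, lam_hi = 0.0, 10.0           # projection boundaries for λ (assumed)

def f(B):
    return 1.0 / (1.0 + B.sum())     # toy stand-in for the MSE (assumption)

def gibbs_step(B, lam):
    """Faster-timescale update: one Gibbs resampling of a random coordinate."""
    k = rng.integers(N)
    B1, B0 = B.copy(), B.copy()
    B1[k], B0[k] = 1, 0
    h = lambda C: f(C) + lam * C.sum()
    p1 = 1.0 / (1.0 + np.exp(-beta * (h(B0) - h(B1))))
    B[k] = 1 if rng.random() < p1 else 0
    return B

B, lam, avg = rng.integers(0, 2, N), 1.0, 0.0
for t in range(1, 5001):
    B = gibbs_step(B, lam)
    b_t = 1.0 / t                                    # Σ b(t) = ∞, Σ b(t)² < ∞
    lam = float(np.clip(lam + b_t * (B.sum() - Nbar), lam_lo, lam_hi))
    avg += (B.sum() - avg) / t                       # running mean of ||B(t)||_1
```

The activation price rises when too many sensors are active and falls otherwise, driving the time-average activation toward the budget, which is the behavior Lemma 2 and Theorem 4 make rigorous.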

The update of in Algorithm 3 is inspired by the following result which is crucial in the convergence proof.

###### Lemma 2
Under Algorithm 1, is a Lipschitz continuous and decreasing function of .
See Appendix E.

Discussion of Algorithm 3:

• If is more than , then is increased with the hope that this will reduce the number of active sensors in subsequent iterations, as suggested by Lemma 2.

• The configuration and $\lambda$ processes run on two different timescales; the configuration process runs on the faster timescale whereas $\lambda(t)$ runs on the slower timescale. This can be understood from the fact that the stepsize in the $\lambda$ update process decreases with time. Here the faster timescale iterate views the slower timescale iterate as quasi-static, while the slower timescale iterate views the faster timescale iterate as almost equilibrated. This is reminiscent of two-timescale stochastic approximation (see [11, Chapter ]).

We make the following feasibility assumption for (P1), under the chosen .

###### Assumption 1

There exists such that the constraint in (P1) under and Algorithm 1 is met with equality.

###### Remark 7

By Lemma 2, the mean number of active sensors continuously decreases in $\lambda$. Hence, if the constraint is feasible, then such a $\lambda^*$ must exist by the intermediate value theorem.

Let us define: . Let denote under .

###### Theorem 4
Under Algorithm 3 and Assumption 1, we have almost surely, and the limiting distribution of is .
See Appendix F. This theorem says that Algorithm 3 produces a configuration from the distribution under steady state.

### III-E A hard constraint on the number of activated sensors

Let us consider the following modified constrained problem:

$$\min_{B\in\mathcal{B}}\ f(B)\quad \text{s.t.}\quad \|B\|_1\le\bar{N} \tag{MCP}$$

(MCP) can be solved using Gibbs sampling algorithms similar to those in Section III, where the Gibbs sampler runs only on the set of configurations which activate at most $\bar{N}$ sensors. Thus, as a by-product, we have also proposed a methodology for the problem in [3], though our framework is more general than [3].
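A sketch of a sampler confined to the hard-constraint feasible set, here taken for simplicity as configurations with exactly $\bar{N}$ active sensors and a Metropolis acceptance rule; the per-sensor weights in the toy $f(B)$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, Nbar, beta = 6, 3, 5.0
w = rng.random(N)                     # per-sensor "usefulness" weights (assumption)

def f(B):
    return 1.0 / (1.0 + w @ B)        # toy stand-in for the MSE (assumption)

def swap_step(B):
    """Move that never leaves {B : ||B||_1 = Nbar}: propose swapping one active
    and one inactive sensor, accept with the Metropolis rule for target ∝ e^{-βf}."""
    i = rng.choice(np.flatnonzero(B == 1))
    j = rng.choice(np.flatnonzero(B == 0))
    Bp = B.copy()
    Bp[i], Bp[j] = 0, 1
    if rng.random() < min(1.0, np.exp(-beta * (f(Bp) - f(B)))):
        return Bp
    return B

B = np.zeros(N, dtype=int)
B[:Nbar] = 1                          # feasible initial configuration
for _ in range(500):
    B = swap_step(B)
```

Because every proposal swaps one active sensor for one inactive sensor, the hard constraint holds at every step rather than only on average as in (P1).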

###### Remark 8

The constraint in (P1) is weaker than that in (MCP).

###### Remark 9

If we choose $\beta$ very large, then the number of sensors activated by GIBBSLEARNING will have very small variance. This allows us to solve (MCP) with high probability.

## IV Centralized Tracking of IID Process: Unknown θ0

In Section III, we described algorithms for centralized tracking of an i.i.d. process where the parameter is known. In this section, we deal with centralized tracking of an i.i.d. process whose parameter is unknown; in this case, the parameter has to be learnt over time through observations, which creates many nontrivial issues that need to be addressed before Gibbs sampling can be used for sensor subset selection.

### IV-A The proposed algorithm for unknown θ0

Since the parameter is unknown, its estimate has to be updated over time using the sensor observations. On the other hand, to solve the constrained problem (P1), we need to update $\lambda(t)$ over time so as to attain the optimal $\lambda^*$ of Theorem 1 iteratively. Further, the MSE under configuration $B$ in (P4) is unknown since the parameter is unknown, and has to be learnt over time using the sensor observations. Hence, we combine the Gibbs sampling algorithm with stochastic approximation based update schemes (see [11]) for the parameter estimate, $\lambda(t)$, and the MSE estimates.

The algorithm also requires a sufficiently large positive number and a large integer $T$ as input. Let $J(t)$ denote the indicator that time $t$ is an integer multiple of $T$. Define $\nu(t)$ to be the number of such instants up to time $t$.

We first describe some key features and steps of the algorithm and then provide a brief summary of the algorithm.

#### IV-A1 Step size

For the stochastic approximation updates, the algorithm uses step sizes which are nonnegative sequences , , , such that:

(i) (ii) (iii) , (iv) , (v) .

#### IV-A2 Gibbs sampling step

The algorithm also maintains a running estimate of the MSE for each configuration. At time $t$, it selects a random sensor uniformly and independently, and resamples its activation bit with the Gibbs update probabilities (similar to Algorithm 1), retaining the activation bits of all other sensors. This operation can even be repeated multiple times. The sensors are activated according to the resulting configuration, the observations are collected, and then the algorithm declares the estimate.

#### IV-A3 Occasional reading of all the sensors

If $J(t)=1$, the fusion center reads all sensors and obtains $Z(t)$. This is required primarily because we seek to update the parameter estimate iteratively and reach a local maximum of the likelihood function.

#### IV-A4 θ(t) update when J(t)=1

Since we seek to reach a local maximum of the likelihood, a gradient ascent scheme needs to be used. The gradient along any coordinate can be computed by perturbing $\theta(t)$ in two opposite directions along that coordinate and evaluating the difference of the objective at those two perturbed values. However, if $\theta$ is high-dimensional, then estimating the gradient along each coordinate separately is computationally intensive. Moreover, evaluating the objective for any $\theta$ requires us to compute an expectation, which might also be expensive. Hence, we perform noisy gradient estimation by simultaneous perturbation stochastic approximation (SPSA) as in [20]. Our algorithm generates $\Delta(t)$ uniformly over all $\pm 1$-valued sequences, perturbs the current estimate $\theta(t)$ by this random vector in two opposite directions, and estimates each component of the gradient from the difference of the log-likelihoods at the two perturbed values; this estimate is noisy because (i) the observations and the perturbations are random, and (ii) the perturbation size is finite.

The $k$-th component of $\theta(t)$ is updated as follows:

$$\theta_k(t+1)=\Big[\theta_k(t)+c(\nu(t))\,J(t)\,\frac{\log p\big(Z(t)\,|\,\theta(t)+d(\nu(t))\Delta(t)\big)-\log p\big(Z(t)\,|\,\theta(t)-d(\nu(t))\Delta(t)\big)}{2\,d(\nu(t))\,\Delta_k(t)}\Big]_{\Theta} \tag{2}$$

The iterates are projected onto the compact set $\Theta$ to ensure boundedness. The diminishing perturbation sequence $d(\nu(t))$ ensures that the gradient estimate becomes more accurate with time.
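The SPSA update (2) can be sketched on a toy Gaussian log-likelihood as follows; the likelihood, step sizes, and dimensions are illustrative assumptions, and the indicator $J(t)$ and the projection onto $\Theta$ are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_lik(z, theta):
    """Toy log-likelihood log p(z | θ): Gaussian with mean θ (assumption)."""
    return -0.5 * np.sum((z - theta) ** 2)

def spsa_step(theta, z, c, d):
    """One SPSA ascent step: a single ±1 perturbation vector Δ yields an
    estimate of every gradient component from just two likelihood evaluations."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    g = (log_lik(z, theta + d * delta) - log_lik(z, theta - d * delta)) / (2 * d * delta)
    return theta + c * g

theta0 = np.array([1.0, -2.0, 0.5])   # "true" parameter generating the data
theta = np.zeros(3)
for t in range(1, 3001):
    z = theta0 + 0.1 * rng.standard_normal(3)       # fresh observation Z(t)
    theta = spsa_step(theta, z, c=1.0 / t, d=0.1 / t ** 0.25)
```

The two likelihood evaluations per step replace the 2·dim evaluations of coordinate-wise finite differences, which is why SPSA is attractive for high-dimensional parameters.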

#### IV-A5 λ(t) update

$\lambda(t)$ is updated as follows:

$$\lambda(t+1)=\big[\lambda(t)+b(t)\,\big(\|B(t)\|_1-\bar{N}\big)\big]_{0}^{A_0}. \tag{3}$$

The intuition here is that, if $\|B(t)\|_1>\bar{N}$, the sensor activation cost needs to be increased to discourage activating a large number of sensors in the future; this is motivated by Lemma 2. The goal is to converge to the $\lambda^*$ defined in Theorem 1.

#### IV-A6 f(t)(B) update

Since the parameter is not known initially, the true value of $f(B)$ is not known; hence, the algorithm updates an estimate $f^{(t)}(B)$ using the sensor observations. If $J(t)=1$, the fusion center obtains observations from all sensors. The goal is to obtain an estimate of the MSE under a configuration $B$ by using these observations, and to update $f^{(t)}(B)$ accordingly. However, since the true parameter is unknown and only its estimate is available, as an alternative to the MSE under configuration $B$, the fusion center uses the trace of the conditional covariance matrix of the process given the observations selected by $B$, computed as if the current parameter estimate were the true parameter. Hence, we define a random variable $Y_B(t)$ for all $B$, where the estimate is the MMSE estimate under configuration $B$ and the observation is that made by the active sensors determined by $B$. Clearly, $Y_B(t)$ is a random variable, with randomness coming from two sources: (i) the randomness of the observations, and (ii) the randomness of the underlying process, which is distributed according to the current parameter estimate. Computation of $Y_B(t)$ is simple for a Gaussian process and the MMSE estimator, since closed-form expressions are available.

Using $Y_B(t)$, the following update is made for all $B$:

$$f^{(t+1)}(B)=\big[f^{(t)}(B)+J(t)\,a(\nu(t))\,\big(Y_B(t)-f^{(t)}(B)\big)\big]_{0}^{A_0} \tag{4}$$

The iterates are projected onto $[0,A_0]$ to ensure boundedness. The goal here is that, if the parameter estimate converges to the true parameter, then $f^{(t)}(B)$ will converge to the true MSE under configuration $B$.
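The update (4) is a projected stochastic-approximation average; a minimal sketch with a single configuration key and an assumed projection bound:

```python
import numpy as np

rng = np.random.default_rng(5)
A0 = 100.0                              # projection upper bound (assumed)

def update_f_estimate(f_est, B_key, y_B, a_t):
    """Move the stored estimate f(B) toward the new sample Y_B(t), then
    project onto [0, A0] to keep the iterate bounded."""
    f_est[B_key] = float(np.clip(f_est[B_key] + a_t * (y_B - f_est[B_key]), 0.0, A0))
    return f_est

f_est = {(1, 0, 1): 0.0}                # estimate for one configuration B = (1, 0, 1)
for t in range(1, 2001):
    y = 2.0 + 0.1 * rng.standard_normal()           # noisy MSE sample Y_B(t) (assumption)
    f_est = update_f_estimate(f_est, (1, 0, 1), y, a_t=1.0 / t)
```

With stepsize $1/t$ the iterate is exactly the running average of the samples (when the projection does not bind), so it converges to the mean of $Y_B(t)$.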

We will later argue that this occasional computation for all can be avoided, but convergence will be slow.

#### IV-A7 The algorithm

A summary of all the steps of our scheme is provided in Algorithm 4. We will show in Theorem 5 that this algorithm almost surely converges to the set of locally optimum solutions for (P1).

###### Algorithm 4
Initialize all iterates arbitrarily. For any time $t$:

1. Perform the Gibbs sampling step as in Section IV-A2, obtain the observations, and estimate the process. Update $\lambda(t)$ according to (3).

2. If $J(t)=1$, compute $Y_B(t)$ for all $B$. Read all sensors and obtain $Z(t)$. Update $f^{(t)}(B)$ for all $B$ using (4) and $\theta(t)$ using (2).

#### IV-A8 Multiple timescales in Algorithm 4

Algorithm 4 has multiple iterations running on multiple timescales (see [11, Chapter ]). The configuration process runs at the fastest timescale, whereas the $\theta(t)$ update scheme runs at the slowest timescale. The basic idea is that a faster timescale iterate views a slower timescale iterate as quasi-static, whereas a slower timescale iterate views a faster timescale iterate as almost equilibrated. For example, since the $\theta(t)$ stepsizes decay faster, the $\theta(t)$ iterates vary very slowly compared to the $\lambda(t)$ iterates; as a result, the $\lambda(t)$ iterates view $\theta(t)$ as quasi-static.

### IV-B Complexity of Algorithm 4

#### IV-B1 Sampling and communication complexity

Since all sensors are activated when $J(t)=1$, the mean number of additional active sensors per unit time is at most $N/T$; these observations need to be communicated to the fusion center. This overhead can be made arbitrarily small by choosing $T$ large enough.

#### IV-B2 Computational complexity

The computation in (4) for all configurations is performed whenever $J(t)=1$. If one chooses $T$ large, then this additional computation per unit time will be small. If one wants to avoid that computation altogether, one can simply compute and update the estimate for the currently chosen configuration instead of doing it for all configurations. However, the stepsize sequence cannot then be used as-is; instead, a stepsize indexed by the number of times configuration $B$ has been chosen up to time $t$ has to be used when that configuration is updated via (4). In this case, the convergence result (Theorem 5) on Algorithm 4 will still hold; however, the proof will require a technical condition to hold almost surely for all configurations, which will be satisfied by the Gibbs sampler for finite $\beta$ and bounded $\lambda(t)$. We discuss only the synchronous update (4) in this paper for the sake of simplicity in the convergence proof, since the technical details of the asynchronous stochastic approximation required in this variant are not the main theme of this paper.

When , one can avoid computation of for all in Step  of Algorithm 4. Instead, the fusion center can update only , and at time , since only these iterates are required in the Gibbs sampling.

### IV-C Convergence of Algorithm 4

#### IV-C1 List of assumptions

###### Assumption 2

The distribution and the mapping (or for distributed case) as defined before are Lipschitz continuous in .

###### Assumption 3

is known to the fusion center (centralized case), and are known to all sensors (distributed).

###### Remark 10

Assumption 3 allows us to focus only on the sensor subset selection problem rather than the problem of estimating the process given the sensor observations. For optimal MMSE estimators, . Computation of the estimate will depend on the exact functional form of the distribution, and it can be done by using Bayes' theorem.

###### Assumption 4

Let us consider with fixed in Algorithm 4. Suppose that, one uses Algorithm 1 to solve (P3) for a given but with the MSE replaced by in the objective function of (P3), and then finds the as in Theorem 1 to meet the constraint . We assume that, for the given and , and for each , there exists such that, the optimal Lagrange multiplier to relax this new unconstrained problem is (Theorem 1). Also, is Lipschitz continuous in .

###### Remark 11

Assumption 4 makes sure that the iteration (3) converges, and the constraint is met with equality.

Let us define the function .

###### Assumption 5

Consider the function ; this is the expected conditional log-likelihood function of conditioned on , given that and . We assume that the ordinary differential equation has a globally asymptotically stable solution in the interior of . Also, is Lipschitz continuous in .

###### Remark 12

One can show that the iteration (2) asymptotically tracks the ordinary differential equation inside the interior of . In fact, when lies inside the interior of . The condition on is required to make sure that the iteration does not converge to some unwanted point on the boundary of due to the forced projection. The assumption on makes sure that the iteration converges to .

#### IV-C2 The main result

Now we present the convergence result for Algorithm 4. This result tells us that the iterates of Algorithm 4 almost surely converge to the set of local optima for (P1).

###### Theorem 5
Under Assumptions 2, 3, 4, 5 and Algorithm 4, we have almost surely. Correspondingly, almost surely. Also, almost surely for all . The process reaches the steady-state distribution