In many applications, one has to trade off accuracy against cost. For example, when detecting an event, not only the accuracy of a sensor matters but also the associated sensing cost. In addition, one may have to predict labels for instances whose ground truth cannot be obtained. In such scenarios, feedback about the correctness of the sensors' predictions is never available. Problems with this structure arise naturally in healthcare, security, and crowd-sourcing applications. In healthcare, patients may not reveal the outcome of a treatment due to privacy concerns; hence the effectiveness of the treatment is unknown. In crowd-sourcing systems, the expertise of self-listed agents (workers) may not be known; therefore their quality cannot be identified. In security applications, specific threats may not have been seen before, and thus their in-situ ground truth may not be available.
In this work, we study sensor selection problems in which the ground truth is not available and hence the error rates of the sensors cannot be measured. Here, sensors could correspond to medical tests (healthcare), detectors/scanners (security), or workers (crowd-sourcing). In these unsupervised sensor selection (USS) problems, the goal is still to find the 'best' sensor, i.e., the one that gives the best trade-off between error and cost.
In the USS setup, it is assumed that the sensors form a cascade, i.e., they are ordered by their prediction efficiency and cost: the average prediction error decreases (hence prediction efficiency increases) with every stage of the cascade, while the cost of acquiring a prediction increases. Even though the sensor ordering is assumed to be known and better sensors are associated with higher costs, the exact values of the sensor errors are still unknown. The learner's goal is to find a sensor with a small total prediction cost for a given task, which includes both the cost of acquiring the sensor's outputs and the cost due to incorrect predictions.
Clearly, without the knowledge of the ground-truth, one cannot find the optimal sensor as the sensor accuracies cannot be computed. In the USS setup, the structure of the problem is exploited, and it is shown that under certain conditions, namely strong dominance (SD) and weak dominance (WD), learning is possible. The SD property requires the prediction accuracy of a sensor to stochastically dominate prediction accuracy of other sensors with lower costs in the cascade. Specifically, it assumes that if a sensor’s prediction is correct, then all the sensors that follow this sensor in the cascade also have correct predictions.
Under the SD property, Hanawal et al. established that the USS problem is equivalent to a multi-armed bandit with side observations, and exploited this equivalence to give an algorithm with sub-linear regret. The SD property is quite strong: it posits that the disagreement probability of the predictions of two sensors is equal to the difference in their error rates. This implies that accuracy can be measured by measuring disagreement probabilities, which leads to a direct multi-armed bandit (MAB) reduction and analysis.
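To illustrate the last point, the following is a minimal simulation sketch (ours, not taken from the paper). It assumes binary labels and a hypothetical SD-satisfying construction in which a single uniform draw decides which sensors err, so that whenever a sensor is correct all later sensors are correct too; the error rates in `gammas` are made-up values. Under these assumptions, the empirical disagreement frequency of two sensors, which is computed without looking at the labels, should be close to the difference of their error rates.

```python
import random


def simulate_sd(gammas, n_rounds, seed=0):
    """Sample (label, sensor outputs) under a hypothetical SD construction.

    gammas: decreasing error rates, one per sensor (illustrative values).
    Sensor i errs iff u < gammas[i], so errors are nested: whenever a sensor
    is correct, every later (more accurate) sensor is correct as well.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_rounds):
        label = rng.randint(0, 1)
        u = rng.random()
        outputs = [1 - label if u < g else label for g in gammas]
        samples.append((label, outputs))
    return samples


def disagreement_rate(samples, i, j):
    """Fraction of rounds in which sensors i and j output different values."""
    return sum(out[i] != out[j] for _, out in samples) / len(samples)


gammas = [0.4, 0.2, 0.1]  # assumed error rates, least to most accurate
samples = simulate_sd(gammas, n_rounds=200_000)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    # Print the label-free disagreement rate next to the error-rate difference.
    print(i, j, round(disagreement_rate(samples, i, j), 3), round(gammas[i] - gammas[j], 3))
```

Without the nesting of errors, the two quantities generally differ, which is why this shortcut is specific to SD.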
The WD property relaxes the strict stochastic ordering on predictions and allows better sensors to err on some instances. It is argued that the set of instances satisfying the WD property is maximally learnable, and that any further relaxation of this property renders the problem unlearnable. The reduction techniques used under the SD property do not extend to the WD property. For this case, only a heuristic algorithm without any performance guarantee has been given. Our work bridges this gap. Our contributions are summarized as follows:
We develop an algorithm named USS-UCB that has sublinear regret under WD property. We characterize regret in terms of how ‘well’ the problem instances satisfy the WD property and then provide a bound that holds uniformly for all WD instances.
1.1 Related Work
Several works consider the problem of sensor selection in either batch or online settings (e.g., Trapeznikov and Saligrama, Seldin et al.). However, they all require that the label of each data point is available or that the reward is obtained for each action. Zolghadr et al. consider the setting where labels are available on payment. Greiner et al. and Póczos et al. consider costs associated with tests. However, they assume that the loss/reward associated with the player's action is revealed. In contrast, in our setting the labels are not revealed at any point, so the problem is completely unsupervised, and the cost in our setup is the sensing cost and not the cost of acquiring a label.
Platanios et al. consider the problem of estimating the accuracies of multiple binary classifiers from unlabeled data. Most of these works make strong assumptions, such as independence given the labels or knowledge of the true distribution of the labels. Platanios et al. (2014) proposed logistic-regression-based methods that use the classifiers' agreement rates over unlabeled data, Platanios et al. (2016) extend this work using graphical models, and Platanios et al. (2017) propose a method using probabilistic logic. Further, Platanios et al. also use a weighted majority vote for label prediction. All of this work is in the batch setting and differs from our online setup.
In crowd-sourcing problems, various methods have been proposed to estimate the unknown skill levels of crowd workers from the noisy labels they provide (Bonald and Combes, Kleindessner and Awasthi). These methods assume that all workers have the same cost and aggregate the predictions on a given dataset to estimate the accuracy of each worker. Unlike ours, these methods are not online.
2 USS Problem
We cast the unsupervised, stochastic, cascaded sensor selection problem as an instance of the stochastic partial monitoring problem (SPM). We use sensor and arm interchangeably in the following. Formally, a problem instance in our setting is specified by a pair $\theta = (Q, c)$, where $Q$ is a distribution over the $(K+1)$-dimensional hypercube $\{0,1\}^{K+1}$ and $c = (c_1, \dots, c_K)$ is a $K$-dimensional, non-negative valued vector of costs. While $c$ is known to the learner from the start, $Q$ is unknown. Henceforth, we identify a problem instance by $\theta$. The instance parameters specify the learner-environment interaction as follows: In each round $t = 1, 2, \dots$, the environment generates a $(K+1)$-dimensional binary vector $(Y_t, Y_t^1, \dots, Y_t^K)$ chosen at random from $Q$. Here, $Y_t^i$ is the output of sensor $i$, while $Y_t$ is the (hidden) label to be guessed by the learner. Simultaneously, the learner chooses an index $I_t \in [K]$, where $[K] = \{1, \dots, K\}$, and observes the sensor outputs $Y_t^1, \dots, Y_t^{I_t}$, i.e., the learner goes through the first $I_t$ sensors and observes their outputs. Dropping the subindex $t$, write $S = (Y^1, \dots, Y^K)$. Then $Q$, the joint probability distribution of $Y$ and $S$, can be expressed as the product of the marginal distribution of $S$ and the conditional distribution of $Y$ given $S$, where the former is (essentially) observable while the latter is not.

Hanawal et al. in addition assume that the sensors are known to be ordered from least accurate to most accurate, i.e., the error rate $\gamma_i = P(Y^i \neq Y)$ is decreasing in $i$. We relax this assumption later in Section 4. The cost associated with sensor $i$ is denoted by $c_i$, and the cost of choosing action $i$ is $C_i = c_1 + \dots + c_i$, as the selection has to be done sequentially. The total cost incurred by the learner in round $t$ is thus $\mathbb{1}\{Y_t^{I_t} \neq Y_t\} + \lambda C_{I_t}$, where $\lambda$ is a trade-off parameter between the error rate and the cost of using sensors (the parameter $\lambda$ makes the associated cost unit-less: for example, if the cost is expressed in cents instead of dollars, so that its numerical value is multiplied by 100, the corresponding $\lambda$ is divided by 100, and vice versa). Without loss of generality, we set $\lambda = 1$. The goal of the learner is to compete with the best choice knowing $Q$. Let $i^\star \in \arg\min_{i \in [K]} (\gamma_i + \lambda C_i)$ be the optimal sensor. The cumulative (pseudo-)regret of the learner running an algorithm up to the end of round $T$ is

$$R_T = \sum_{t=1}^{T} \left( \gamma_{I_t} + \lambda C_{I_t} \right) - T \left( \gamma_{i^\star} + \lambda C_{i^\star} \right). \tag{1}$$

We say that the (expected) regret is sublinear if $\mathbb{E}[R_T]/T \to 0$ as $T \to \infty$, where the expectation is over $I_t$, which is random as it depends on past random data. When the regret is sublinear, the learner collects almost as much reward in expectation in the long run as an oracle that knew the optimal action from the beginning. Let $\Theta_{SA}$ be the set of all stochastic, cascaded sensor selection problems. Thus, $\theta \in \Theta_{SA}$ is such that $(Y, Y^1, \dots, Y^K) \sim Q$ and $\gamma_i$ is decreasing in $i$. Given a subset $\Theta \subseteq \Theta_{SA}$, we say that $\Theta$ is learnable if there exists a learning algorithm such that for any $\theta \in \Theta$, the expected regret of the algorithm on instance $\theta$ is sublinear. A subset $\Theta$ is said to be a maximal learnable problem class if it is learnable and any subset $\Theta' \subseteq \Theta_{SA}$ that strictly contains $\Theta$ is not learnable.
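As a concrete illustration of the objective described above, the sketch below computes each sensor's expected total cost as its error rate plus the trade-off parameter times the cumulative cost of running the cascade up to that sensor, picks an optimal sensor (ties broken toward the larger index, i.e., the more accurate sensor), and evaluates the pseudo-regret of a sequence of choices. All numbers and function names are illustrative assumptions rather than values from the paper. In the unsupervised setting the error rates are not available to the learner, so this is only an oracle-side computation.

```python
from itertools import accumulate


def expected_costs(error_rates, sensor_costs, lam=1.0):
    """Expected total cost of stopping at each sensor of the cascade."""
    cumulative = list(accumulate(sensor_costs))  # cost of using sensors 1..i
    return [g + lam * c for g, c in zip(error_rates, cumulative)]


def optimal_sensor(error_rates, sensor_costs, lam=1.0, tol=1e-12):
    """0-based index of an optimal sensor; ties go to the largest index."""
    totals = expected_costs(error_rates, sensor_costs, lam)
    best = min(totals)
    return max(i for i, v in enumerate(totals) if v <= best + tol)


def pseudo_regret(choices, error_rates, sensor_costs, lam=1.0):
    """Sum over rounds of (expected cost of chosen sensor - optimal cost)."""
    totals = expected_costs(error_rates, sensor_costs, lam)
    best = min(totals)
    return sum(totals[i] - best for i in choices)


gammas = [0.4, 0.2, 0.1]   # assumed error rates (unknown to the learner)
costs = [0.0, 0.1, 0.3]    # assumed per-sensor costs
print(expected_costs(gammas, costs))
print(optimal_sensor(gammas, costs))
print(pseudo_regret([2, 2, 1, 1, 1], gammas, costs))
```

In this toy instance the middle sensor is optimal: the last sensor is more accurate, but its extra cost outweighs the reduction in error.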
2.1 Strong and Weak Dominance
The purpose of this section is to introduce the notions of strong and weak dominance from the work of Hanawal et al. While Hanawal et al. studied learning under strong dominance, here we focus on weak dominance. We also modify their definition of weak dominance to correct an oversight.
The strong dominance (SD) property is defined as follows:
Definition 1 (Strong Dominance (SD)).
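An instance is said to satisfy the strong dominance property if, for the hidden label $Y$ and the outputs $Y^1, \dots, Y^K$ of the $K$ sensors generated by the instance, it holds almost surely (a.s.) that

$$Y^i = Y \;\Longrightarrow\; Y^{i+1} = \dots = Y^K = Y \quad \text{for all } i \in \{1, \dots, K-1\}. \tag{2}$$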
The property implies that if a sensor predicts correctly then, a.s., all the sensors in the subsequent stages of the cascade also predict correctly. The set of all instances satisfying property, i.e., is learnable [1, Theorem 2]. The weaker version of the property is defined as follows:
Definition 2 (Weak Dominance (WD)).
An instance is said to satisfy weak dominance property if
Let denote the set of instances satisfying the property. The WD property holds for all problem instances where sensor is an optimal sensor.
Hanawal et al.  claimed that is learnable. However, their definition allowed . As it turns out, permitting can prevent from being learnable:
The set is not learnable.
Let . Theorem 19 of Hanawal et al.  constructs instances such that the optimal decision for is sensor , for is sensor . The suboptimality gap on instance is , while on instance is , where is a tunable parameter. At the same time in and in . Theorem 17 of Hanawal et al.  implies that a sound algorithm must check . However, no finite amount of data is sufficient to decide this: In particular, one can show that if an algorithm on achieves sublinear regret, then it must suffer linear regret on for small enough. Hence, all algorithms will suffer linear regret on some instance in . ∎
The following theorem is obtained directly from Theorem and Theorem in  after excluding the case in their proofs.
The set is a maximal learnable set.
In the following, we use an alternate characterization of the property given as
Notice that if and only if . Larger the value of ‘stronger’ is the property and easier it is to identify an optimal action. We later characterize the regret bounds in terms of .
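A minimal sketch of how this margin could be computed, assuming the WD condition is read as follows: for the optimal sensor $i^\star$ and every later sensor $j > i^\star$, the extra cumulative cost $C_j - C_{i^\star}$ strictly exceeds the disagreement probability between sensors $i^\star$ and $j$, and the margin is the smallest such gap. This reading, the function names, and the numbers are our assumptions for illustration.

```python
def wd_margin(cum_costs, disagreement, i_star):
    """Smallest gap C_j - C_{i*} - P(Y^{i*} != Y^j) over sensors j after i*.

    cum_costs: cumulative costs C_1..C_K as a 0-based list.
    disagreement: dict mapping (i, j) with i < j to a disagreement probability.
    Returns None when i* is the last sensor (there is nothing to check).
    """
    later = range(i_star + 1, len(cum_costs))
    gaps = [cum_costs[j] - cum_costs[i_star] - disagreement[(i_star, j)] for j in later]
    return min(gaps) if gaps else None


def satisfies_wd(cum_costs, disagreement, i_star):
    """True when the assumed WD condition holds with a strictly positive margin."""
    margin = wd_margin(cum_costs, disagreement, i_star)
    return margin is None or margin > 0


# Made-up disagreement probabilities for a three-sensor cascade.
disagreement = {(0, 1): 0.2, (0, 2): 0.35, (1, 2): 0.15}
print(wd_margin([0.0, 0.3, 0.6], disagreement, i_star=0))
print(satisfies_wd([0.0, 0.15, 0.6], disagreement, i_star=0))
```

The second call lowers the cost of the middle sensor below its disagreement probability with the first sensor, so the assumed condition fails.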
3 Algorithm Under WD Property
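In the following, we let $i^\star$ denote the optimal arm with the largest index, i.e., the largest minimizer of the expected total cost $\gamma_i + C_i$, where $\gamma_i$ is the error rate of sensor $i$ and $C_i$ is the cumulative cost of using sensors $1, \dots, i$ (with the trade-off parameter absorbed into the costs). The optimal sensor satisfies the following inequalities:

$$\forall j < i^\star: \quad C_{i^\star} - C_j \le \gamma_j - \gamma_{i^\star}, \tag{5a}$$

$$\forall j > i^\star: \quad C_j - C_{i^\star} > \gamma_{i^\star} - \gamma_j. \tag{5b}$$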
Note that the above decision criterion is risk-averse, i.e., if two sensors have the same optimal cost, the sensor with the smaller error rate is chosen.
A natural candidate for a decision criteria is to replace error rates () by their estimates and look for an index that satisfies (5a) and (5b). However, error rates () cannot be estimated, implying that (5a) and (5b) can not lead to a sound algorithm. Recall the following result from :
Proposition 2 ([1, Proposition 3]).
Let for any , not necessarily in . Then, for any , and hence .
where forms a proxy for . For the case , we can appeal to the property and can replace (5b) by
Let . Let . Then contains the optimal sensor.
The proof is in Appendix A.
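The relation between error-rate differences and disagreement probabilities that underlies this construction can be checked numerically. The sketch below (ours, not from the paper) draws an arbitrary joint distribution over a binary label and two binary sensor outputs and verifies the identity $\gamma_i - \gamma_j = P(Y^i \neq Y^j) - 2P(Y^i = Y, Y^j \neq Y)$, which holds for binary labels and implies that the disagreement probability upper-bounds the error-rate difference. The symbols and the helper name are our own notation.

```python
import itertools
import random


def check_identity(seed=0):
    """Return (gamma_i - gamma_j, p_ij - 2 * P(i correct, j wrong), p_ij)."""
    rng = random.Random(seed)
    outcomes = list(itertools.product([0, 1], repeat=3))  # (y, y_i, y_j)
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    probs = {o: w / total for o, w in zip(outcomes, weights)}

    def prob(event):
        # Probability of an event given as a predicate on (y, y_i, y_j).
        return sum(p for o, p in probs.items() if event(*o))

    gamma_i = prob(lambda y, yi, yj: yi != y)
    gamma_j = prob(lambda y, yi, yj: yj != y)
    p_ij = prob(lambda y, yi, yj: yi != yj)
    correction = prob(lambda y, yi, yj: yi == y and yj != y)
    return gamma_i - gamma_j, p_ij - 2 * correction, p_ij


lhs, rhs, p_ij = check_identity(seed=3)
print(abs(lhs - rhs) < 1e-12, lhs <= p_ij)
```

The correction term vanishes exactly when sensor $j$ never errs on an instance that sensor $i$ gets right, which is the SD situation.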
In bandit problems, the upper confidence bound (UCB) [16, 17] is highly effective for dealing with the trade-off between exploration and exploitation. Using the UCB idea, we develop an algorithm, named USS-UCB, that utilizes the sets (8) and (9) and looks for an index that belongs to both. Since the disagreement probabilities $p_{ij} = P(Y^i \neq Y^j)$ are unknown (but fixed), they are replaced by their optimistic empirical estimates at round $t$, denoted by $\tilde{p}_{ij}(t) = \hat{p}_{ij}(t) + \Psi_{ij}(t)$, where $\hat{p}_{ij}(t)$ is the empirical estimate of $p_{ij}$ and $\Psi_{ij}(t)$ is the associated confidence term, as in the UCB algorithm. The new sets for the selection criteria are defined as follows:
From the definition, it is easy to verify that and for any pair. Therefore, it is enough for algorithm to only keep track of and for .
It might be tempting to use lower confidence, i.e., term instead of the upper confidence term in (10b). However, such a change can make the algorithm converge to a sub-optimal sensor. A detailed discussion is given in the supplementary material.
The pseudo-code of USS-UCB is given in Algorithm USS-UCB, and it works as follows. It takes as input a parameter that trades off exploration and exploitation. In the first round, it selects the last sensor of the cascade, so that the outputs of all sensors are observed, and initializes the number of comparisons and the counter of disagreements for each pair of sensors (Line 3). In each subsequent round, the algorithm computes the estimate of the disagreement probability (Line 5) and the associated confidence term (Line 6). These are then used to compute the two selection sets (Line 7), which in turn are used to select the sensor. Specifically, the algorithm selects a sensor that satisfies (10a) and (10b) (Line 9).
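The bookkeeping described above can be sketched as follows. This is a simplified illustration, not the paper's Algorithm USS-UCB: it only maintains the pairwise comparison and disagreement counters and returns optimistic disagreement estimates. The class and method names are hypothetical, and the confidence radius $\sqrt{\alpha \log t / N}$ is an assumed, standard UCB-style choice rather than necessarily the exact term used by the authors.

```python
import math
from itertools import combinations


class DisagreementTracker:
    """Pairwise disagreement statistics for a cascade of sensors (0-based)."""

    def __init__(self, num_sensors, alpha=1.0):
        self.alpha = alpha  # assumed exploration parameter
        pairs = list(combinations(range(num_sensors), 2))
        self.comparisons = {p: 0 for p in pairs}    # times both outputs were seen
        self.disagreements = {p: 0 for p in pairs}  # times the outputs differed

    def update(self, outputs):
        """outputs: observed outputs of the first len(outputs) sensors."""
        for i, j in combinations(range(len(outputs)), 2):
            self.comparisons[(i, j)] += 1
            self.disagreements[(i, j)] += int(outputs[i] != outputs[j])

    def estimate(self, i, j):
        """Empirical disagreement frequency; symmetric in (i, j)."""
        key = (min(i, j), max(i, j))
        return self.disagreements[key] / self.comparisons[key]

    def optimistic_estimate(self, i, j, t):
        """Empirical estimate plus an assumed UCB-style confidence radius."""
        key = (min(i, j), max(i, j))
        radius = math.sqrt(self.alpha * math.log(t) / self.comparisons[key])
        return self.estimate(i, j) + radius


tracker = DisagreementTracker(num_sensors=3)
tracker.update([1, 0, 1])  # first round: last sensor chosen, all outputs seen
tracker.update([1, 1])     # later round: only the first two sensors used
print(tracker.estimate(0, 1), tracker.optimistic_estimate(0, 1, t=3))
```

Only pairs with the smaller index first are stored, since the counts for a pair do not depend on the order of its two sensors. Selecting the last sensor in the first round guarantees that every counter is positive before any estimate is requested.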
3.2 Regret Analysis
Following notations and definition are useful in subsequent proofs. For the optimal sensor and each , let
Notice that the values of and for all are positive under the property. Let denote the number of times sensor is selected until round . The following proposition gives the mean number of times a sub-optimal sensor is selected.
Let be a positive valued increasing function such that in USS-UCB. For any , the mean number of times a sensor is selected, is bounded as follows:
and for any
Notice that the mean number of times a sensor is selected, is finite. The regret bounds follows by noting that . Formally, we have the following regret bound.
Since for , . Rest follows from Corollary 1. ∎
We next present problem independent bounds on the expected regret of USS-UCB.
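The expected regret of USS-UCB is $\tilde{O}(T^{1/2})$ on the set of instances satisfying the SD property and $\tilde{O}(T^{2/3})$ on the set of instances satisfying the WD property, where $\tilde{O}$ hides logarithmic terms.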
The proof of Theorem 3 can be found in the supplementary material. We note that the above uniform bounds do not contradict Theorem in  which claimed non-existence of uniform bounds. The condition considered in  incorrectly includes the class of instances satisfying which renders not learnable, whereas in our definition of these instances are excluded and is learnable.
Discussion on optimality of USS-UCB: Any partial monitoring problem can be classified as an ‘easy’, ‘hard’ or ‘hopeless’ problem if it has expected regret bounds of the order or , respectively, and there exists no other class in between . The class is regret equivalent to a stochastic multi-armed bandit with side observations , for which regret scales as , hence resides in the easy class and our bound on it is optimal. Since , is not easy, and also is learnable, it cannot be hopeless. Therefore, the class is hard. We thus conclude that the regret bound of USS-UCB is optimal in . However, optimality concerning other leading constants (in terms of ) is to be explored further.
4 Unknown Ordering of Sensors
The sensor error rates are unknown in our setup and cannot be estimated due to unavailability of ground-truth. Thus, it may happen that we do not know whether error rate of the sensors in the cascade is decreasing or not. In this section, we remove the requirement that sensors are arranged in the decreasing order of their error rates and allow them to be arranged in an arbitrary order that is unknown. We denote the set of USS instances with unknown ordering of sensors by their error-rates as . The rest of the setup is same as in Section 2. We show that even with this relaxation, WD property defined earlier continues to characterize the learnability of the problem.
We begin with the following observation.
Let be an optimal sensor. Then, error rate of any sensor is higher than that of .
We have for all . For , as costs are increasing with sensors. Hence . ∎
The following corollary directly follows from Prop. 2.
For any , .
The following two propositions provide the conditions on sensor costs that allows comparison of their total costs based on disagreement probabilities.
Let . Assume
Then, iff .
Let . Assume
Then, iff .
From Lemma (2), for any we have . Propositions (4) and (5) then suggests that the value of are sufficient to select the optimal sensor if the sensors costs satisfy (Eq. 14) for all and Eqn. (15) for all . Since the values of can be estimated for all we can establish the following result.
Let be an optimal sensor. Any problem instance is learnable if
Notice that for , and . Hence, the learnability condition reduces to , i.e., same as the WD condition. Hence, we have the following result.
The set is learnable.
5 Experiments
In this section, we evaluate the performance of USS-UCB on problem instances derived from a synthetic dataset and two 'real' datasets: PIMA Indians Diabetes [18] and Heart Disease (Cleveland) [19, 20]. In our experiments, each sensor is represented by a classifier, and the classifiers are arranged in order of decreasing misclassification error (i.e., error rate) for each dataset. The cost of using a classifier is assigned based on its error rate: the smaller the error rate, the higher the cost. The case where the sensors' error rates need not decrease along the cascade is also considered.
Synthetic Dataset: We generate synthetic Bernoulli Symmetric Channel (BSC) dataset  as follows: The input, , is generated from i.i.d. Bernoullirandom variable. The problem instance used in experiment has three sensors with error rates . To ensure strong dominance, we impose the condition given in Eq. 2 during data generation. When sensor predicts correctly, we introduce error up to 10% to the outputs of sensor and . We use five problem instances by varying the associated cost of each sensor as given in Table 1.
| Values/Classifiers | Clf. 1 | Clf. 2 | Clf. 3 | WD Prop. |
|---|---|---|---|---|
| Case 1 Costs | 0 | 0.6 | 0.8 | ✓ |
| Case 2 Costs | 0 | 0.15 | 0.35 | ✓ |
| Case 3 Costs | 0 | 0.65 | 0.9 | ✓ |
| Case 4 Costs | 0.2 | 0.36 | 0.4 | ✓ |
| Case 5 Costs | 0 | 0.11 | 0.22 | ✕ |
Real Datasets: Both real datasets specify the costs of acquiring individual features. We split these features into three subsets based on their costs and train three linear classifiers on these subsets using logistic regression. For the PIMA-Diabetes dataset (768 samples), the first classifier is associated with patient history/profile at a cost of $6, the 2nd classifier additionally utilizes the glucose tolerance test (cost $29), and the 3rd classifier uses all attributes, including the insulin test (cost $46). For the Heart dataset (297 samples), we associate the 1st classifier with the first 7 attributes, which include cholesterol readings, blood sugar, and rest-ECG (cost $32); the 2nd classifier additionally utilizes the thalach, exang, and oldpeak attributes, which cost $397; and the 3rd classifier utilizes more extensive tests at a total cost of $601. We scale costs using a tuning parameter (since the costs of features are all greater than one) and consider minimizing a combined objective as stated in Section 2. In our setup, high (low) values of the tuning parameter correspond to a low (high) budget constraint. For example, a fixed budget of $50 corresponds to a high budget (small tuning parameter) for PIMA Diabetes (3rd classifier optimal) and to a low budget (large tuning parameter) for Heart Disease (1st classifier optimal). For performance evaluation, different values of the tuning parameter are used in five problem instances for both real datasets, as given in Table 2.
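| | PIMA-Diabetes Clf. 1 | PIMA-Diabetes Clf. 2 | PIMA-Diabetes Clf. 3 | Heart Disease Clf. 1 | Heart Disease Clf. 2 | Heart Disease Clf. 3 | WD Prop. |
|---|---|---|---|---|---|---|---|
| Cost (in $) | 4 | 29 | 46 | 32 | 397 | 601 | |
| Tuning parameter in Case 1 | 0.01 | 0.0106 | 0.015 | 0.0001 | 0.0008 | 0.001 | ✓ |
| Tuning parameter in Case 2 | 0.01 | 0.004 | 0.0038 | 0.0001 | 0.0001 | 0.00035 | ✓ |
| Tuning parameter in Case 3 | 0.01 | 0.0113 | 0.015 | 0.0001 | 0.0009 | 0.001 | ✓ |
| Tuning parameter in Case 4 | 0.0001 | 0.0001 | 0.0001 | 0.00001 | 0.00004 | 0.0001 | ✓ |
| Tuning parameter in Case 5 | 0.01 | 0.002 | 0.0055 | 0.0042 | 0.0001 | 0.00027 | ✕ |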
Verifying WD property: As we know the error-rate associated with each sensor, we can find an optimal sensor for a given problem instance. Once the optimal sensor is known, WD property is verified by using estimates of disagreement probability after rounds.
Expected Cumulative Regret v/s Time Horizon: Plots of the expected cumulative regret of USS-UCB versus the time horizon for the synthetic BSC dataset and the two real datasets are shown in Figure 2. These plots verify that any instance that satisfies the WD property has sub-linear regret. In each round, the online USS-UCB selects an instance at random from the dataset (with replacement), for a fixed time horizon. Further, we compare Algorithm 2 of Hanawal et al. and USS-UCB for different values of the algorithm's input parameter, as shown in Figure 3. With the same parameter value, Algorithm 2 of Hanawal et al. and USS-UCB give the same regret, whereas USS-UCB gives its best result for another value of the parameter. We also verify that if WD holds in a problem instance with an arbitrary ordering of sensors by error rates, then the problem is learnable, as shown in Figure 3(c). We fix the time horizon to 10000 for our experiments. We repeat each experiment 100 times and present the average regret with a 95% confidence bound.
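The averaging over repeated runs can be sketched as follows, assuming each run yields a cumulative-regret curve of the same length and that the 95% confidence band is computed with a normal approximation. The interval construction is our assumption, since the text does not specify it, and the function name and data are illustrative.

```python
import math
import statistics


def mean_with_ci(runs, z=1.96):
    """Per-round mean and half-width of a normal-approximation 95% interval.

    runs: list of equally long sequences, one cumulative-regret curve per run.
    """
    means, half_widths = [], []
    for values in zip(*runs):  # values of all runs at the same round
        means.append(statistics.fmean(values))
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        half_widths.append(z * spread / math.sqrt(len(values)))
    return means, half_widths


# Three made-up runs of three rounds each.
runs = [[1.0, 2.0, 2.5], [1.0, 1.5, 2.0], [0.0, 1.0, 1.5]]
means, half_widths = mean_with_ci(runs)
print(means)
print(half_widths)
```

With 100 repetitions, as in the experiments, the normal approximation is a common choice; for very few runs a Student-t quantile would give a wider and more appropriate band.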
Supervised v/s Unsupervised Learning: We compare USS-UCB against an algorithm in which the learner receives feedback. In particular, for each action in each round, as in the bandit setting, the learner knows whether or not the corresponding sensor output is correct. We implement this “supervised bandit” setting by replacing Step 5 in USS-UCB with estimated marginal error rates. We notice that in both high-cost and low-cost scenarios, while the supervised algorithm does have lower regret, the cumulative regret of USS-UCB is also sublinear, as shown in Figure 3(a). This is qualitatively interesting because these plots demonstrate that, in typical cases, our unsupervised algorithm learns as well as in the supervised setting.
Learnability v/s WD Property: To verify the relationship between learnability and WD property, we experiment with different problem instances of synthetic BSC dataset that are parameterized by varying costs. We test the hypothesis that set of problem instances satisfying the WD property is a maximal learnable set. We fixed an optimal sensor and vary the costs in such a way that we continuously pass from the situation where WD holds and to the case where WD does not hold or for any . If WD does not hold for any problem instance then USS-UCB converges to sub-optimal sensor instead of optimal sensor . In such problem instances, as increase, the cumulative regret (1) will also increase due to selection of sub-optimal sensor by USS-UCB until WD does not hold for that problem instance i.e., . The difference is lower bounded by in such cases, therefore, cannot be less than . We start experiments with the minimum possible value of for which problem instance does satisfy WD property and then increase the value of . Figure 3(b) depicts cumulative regret USS-UCB v/s plots for Synthetic BSC Dataset. It can be seen clearly that there is indeed a transition at .
6 Conclusion
We studied the problem of selecting the best sensor in a cascade of sensors ordered according to their prediction accuracies. The best sensor optimally trades off sensor cost against prediction accuracy. The challenge in this setup is that the ground truth is not revealed at any time, and hence the setup is completely unsupervised. We modeled it as a stochastic partial monitoring problem and proposed an algorithm that achieves sub-linear regret under the Weak Dominance (WD) property. We showed that our algorithm enjoys regret of order $T^{2/3}$ (hiding logarithmic terms), and that when the problem instance satisfies the more stringent Strong Dominance property, the regret bound improves to $T^{1/2}$. We also showed that our algorithm enjoys the same performance under the WD property even if the sensors are not necessarily ordered according to their prediction accuracies.
In the current work, we did not exploit any side information (contexts) available with the tasks. It would be interesting to study the contextual version of this problem where the optimal sensor could be job dependent.
Arun Verma is partially supported by MHRD Fellowship, Govt. of India. M.K. Hanawal is supported by IIT Bombay IRCC SEED grant (16IRCCSG010) and INSPIRE faculty fellowship (IFA-14/ENG-73) from DST, Govt. of India. V. Saligrama acknowledges the support of the NSF through grant 1527618. AV and MKH would like to thank Prof. N. Hemachandra, IEOR, IIT Bombay for many useful discussions. This work was done when Csaba Szepesvári was on leave from the University of Alberta.
- Hanawal et al.  Manjesh Hanawal, Csaba Szepesvari, and Venkatesh Saligrama. Unsupervised sequential sensor acquisition. In Artificial Intelligence and Statistics, pages 803–811, 2017.
- Trapeznikov and Saligrama  Kirill Trapeznikov and Venkatesh Saligrama. Supervised sequential classification under budget constraints. In Artificial Intelligence and Statistics, pages 581–589, 2013.
- Seldin et al.  Yevgeny Seldin, Peter L Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori. Prediction with limited advice and multiarmed bandits with paid observations. In ICML, pages 280–287, 2014.
- Zolghadr et al.  Navid Zolghadr, Gábor Bartók, Russell Greiner, András György, and Csaba Szepesvári. Online learning with costly features and labels. In Advances in Neural Information Processing Systems, pages 1241–1249, 2013.
- Greiner et al.  Russell Greiner, Adam J Grove, and Dan Roth. Learning cost-sensitive active classifiers. Artificial Intelligence, 139(2):137–174, 2002.
- Póczos et al.  Barnabás Póczos, Yasin Abbasi-Yadkori, Csaba Szepesvári, Russell Greiner, and Nathan Sturtevant. Learning when to stop thinking and do something! In Proceedings of the 26th Annual International Conference on Machine Learning, pages 825–832. ACM, 2009.
- Platanios et al.  Emmanouil Antonios Platanios, Avrim Blum, and Tom M Mitchell. Estimating accuracy from unlabeled data. In UAI, pages 682–691, 2014.
- Platanios et al.  Emmanouil Antonios Platanios, Avinava Dubey, and Tom Mitchell. Estimating accuracy from unlabeled data: A bayesian approach. In International Conference on Machine Learning, pages 1416–1425, 2016.
- Platanios et al.  Emmanouil Platanios, Hoifung Poon, Tom M Mitchell, and Eric J Horvitz. Estimating accuracy from unlabeled data: A probabilistic logic approach. In Advances in Neural Information Processing Systems, pages 4361–4370, 2017.
- Bonald and Combes  Thomas Bonald and Richard Combes. A minimax optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems, pages 4352–4360, 2017.
- Kleindessner and Awasthi  Matthäus Kleindessner and Pranjal Awasthi. Crowdsourcing with arbitrary adversaries. In International Conference on Machine Learning, pages 2713–2722, 2018.
- Cesa-Bianchi et al.  Nicolo Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, 31(3):562–580, 2006.
- Bartók and Szepesvári  Gábor Bartók and Csaba Szepesvári. Partial monitoring with side information. In International Conference on Algorithmic Learning Theory, pages 305–319. Springer, 2012.
- Bartók et al.  Gábor Bartók, Dean P Foster, Dávid Pál, Alexander Rakhlin, and Csaba Szepesvári. Partial monitoring—classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967–997, 2014.
- Wu et al.  Yifan Wu, András György, and Csaba Szepesvári. Online learning with gaussian payoffs and side observations. In Advances in Neural Information Processing Systems, pages 1360–1368, 2015.
- Auer et al.  Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- Garivier and Cappé  Aurélien Garivier and Olivier Cappé. The kl-ucb algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th annual Conference On Learning Theory, pages 359–376, 2011.
- Kaggle  UCI Machine Learning, Kaggle. Pima Indians Diabetes Database. 2016. URL https://www.kaggle.com/uciml/pima-indians-diabetes-database.
- Detrano  Robert Detrano. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, MD, Ph.D., Donor: David W. Aha. 1998. URL https://archive.ics.uci.edu/ml/datasets/Heart+Disease.
- Dheeru and Karra Taniskidou  Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Hoeffding  Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.
Appendix A Proof of Lemma 1
Let be an optimal sensor. Define
Consider following three cases:
As is an optimal sensor therefore . If any sensor then i.e.,
Similarly, . If any sensor then i.e.,
The following definition is convenient for the proof arguments.
Definition 3 (Action Preference ()).
The sensor is optimistically preferred over sensor in round if:
Appendix B Discussion of Remark 1
The algorithm can converge to a sub-optimal sensor when we replace the term in (10b) by . To verify this claim, assume algorithm selects sub-optimal sensor in around and then,
|Since sensor is not used then there is no update in but by definition therefore,|
Hence sub-optimal sensor will always be preferred over the optimal sensor in the subsequent rounds. This can be avoided by using UC term in (10b) because,
|The sub-optimal sensor will not be preferred after sufficient rounds,|
As using LC term can make the decisions stuck to sub-optimal sensor, UC term is used in (10b).
Appendix C Proof of Proposition 3
We first recall the standard Hoeffding’s inequality [21, Theorem 2] that we use in the proof.
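Let $X_1, \dots, X_n$ be independent random variables with common range $[0, 1]$, $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$, and $\mu = \mathbb{E}[\hat{\mu}_n]$. Then for all $\epsilon \ge 0$,

$$P\left(\hat{\mu}_n - \mu \ge \epsilon\right) \le \exp\left(-2 n \epsilon^2\right),$$

$$P\left(\hat{\mu}_n - \mu \le -\epsilon\right) \le \exp\left(-2 n \epsilon^2\right).$$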
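As a sanity check, the sketch below compares the standard Hoeffding bound $\exp(-2 n \epsilon^2)$ for $[0,1]$-valued variables with the empirical upper-tail frequency of the sample mean of Bernoulli draws. The parameter values and function names are illustrative choices of ours, not taken from the paper.

```python
import math
import random


def hoeffding_bound(n, eps):
    """Hoeffding upper bound on P(sample mean - true mean >= eps), range [0, 1]."""
    return math.exp(-2 * n * eps ** 2)


def empirical_upper_tail(n, eps, p=0.5, trials=20_000, seed=0):
    """Monte Carlo estimate of P(mean of n Bernoulli(p) draws - p >= eps)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        # Small tolerance so that values equal to eps up to rounding are counted.
        hits += mean - p >= eps - 1e-12
    return hits / trials


n, eps = 50, 0.1
print(empirical_upper_tail(n, eps), hoeffding_bound(n, eps))
```

The bound is distribution-free, so it is typically loose for any particular distribution; the point of the check is only that the empirical tail stays below it.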
We need the following lemmas to prove the Proposition 3.
Leibniz’s rule for ,
Leibniz’s integral rule without any common variable,
Using Leibniz’s rule in (31), we get
Integrating both side,
Let and . Then