1 Introduction
In many applications, one has to trade off accuracy against cost. For example, when detecting an event, it is not only the accuracy of a sensor that matters but also the associated sensing cost. Moreover, one may have to predict labels of instances for which ground-truth cannot be obtained; in such scenarios, feedback about the correctness of the sensors' predictions is unavailable. Problems with this structure arise naturally in healthcare, security, and crowdsourcing applications. In healthcare, patients may not reveal the outcome of a treatment due to privacy concerns, so the effectiveness of the treatment remains unknown. In crowdsourcing systems, the expertise of agents (workers) may not be known, so their quality cannot be assessed. In security applications, specific threats may not have been seen before, and thus their in-situ ground-truth may not be available.
In this work, we focus on the study of sensor selection problems where we do not have the advantage of knowing the ground-truth and hence cannot measure the error rates of the sensors. Here, sensors could correspond to medical tests (healthcare), detectors/scanners (security), or workers (crowdsourcing). In these unsupervised sensor selection (USS) problems, the goal is still to find the sensor that gives the best trade-off between error and cost [1].
In the USS setup, it is assumed that the sensors form a cascade, i.e., they are ordered by their prediction efficiency and cost: the average prediction error decreases (hence the prediction efficiency increases) with every stage of the cascade, while the cost of acquiring the output increases. Even though the sensor ordering is assumed known and better sensors are associated with higher costs, the exact values of the sensor errors are unknown. The learner's goal is to find a sensor with a small total prediction cost for a given task, where the total cost includes both the cost of acquiring the sensor's outputs and the cost due to incorrect predictions.
Clearly, without knowledge of the ground-truth, one cannot find the optimal sensor, as the sensor accuracies cannot be computed. The USS setup exploits the structure of the problem, and it is shown that learning is possible under certain conditions, namely strong dominance (SD) and weak dominance (WD). The SD property requires the prediction accuracy of a sensor to stochastically dominate the prediction accuracies of the sensors with lower costs in the cascade. Specifically, it assumes that if a sensor's prediction is correct, then all the sensors that follow this sensor in the cascade also predict correctly.
Under the SD property, Hanawal et al. [1] established that the USS problem is equivalent to a multi-armed bandit with side observations and exploited this equivalence to give an algorithm with sublinear regret. The SD property is quite strong: it posits that the disagreement probability between the predictions of two sensors equals the difference in their error rates. This property implies that we can measure accuracy differences by measuring disagreement probabilities, leading to a direct multi-armed bandit (MAB) reduction and analysis.
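The identity behind this reduction can be checked numerically. Below is a minimal simulation sketch assuming a hypothetical SD instance built by a common-uniform-variable coupling (the error rates and the coupling are our assumptions, not from the paper): under SD, the disagreement probability of two sensors equals the difference of their error rates.

```python
import random

random.seed(0)
n = 200_000
gammas = [0.3, 0.2, 0.1]   # hypothetical error rates, decreasing along the cascade

# Build an SD instance: draw one uniform U per round; sensor i errs iff U < gamma_i,
# so a correct sensor forces every later (more accurate) sensor to be correct too.
disagree_13 = 0
for _ in range(n):
    y = random.randint(0, 1)
    u = random.random()
    y1 = 1 - y if u < gammas[0] else y
    y3 = 1 - y if u < gammas[2] else y
    disagree_13 += y1 != y3

# Under SD the disagreement probability equals the difference of the error rates.
print(round(disagree_13 / n, 2))   # close to gamma_1 - gamma_3 = 0.2
```

The disagreement frequency is observable without labels, which is exactly why the MAB reduction works under SD.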
The WD property relaxes the strict stochastic ordering on predictions and allows better sensors to err on some instances. It is argued that the set of instances satisfying the WD property is maximally learnable, and any further relaxation of this property renders the problem unlearnable. The reduction techniques used under the SD property do not apply or extend to the WD property; for this case, only a heuristic algorithm without performance guarantees is given in [1]. Our work bridges this gap. Our contributions are summarized as follows:
We develop an algorithm named USS-UCB that has sublinear regret under the WD property. We characterize the regret in terms of how 'well' a problem instance satisfies the WD property and then provide a bound that holds uniformly over all WD instances.

Hanawal et al. assume that the sensors are ordered, i.e., their accuracy improves with their index, and use this fact in their algorithms. We relax this assumption in Section 4, where the sensors can be in an arbitrary order. For this setup, we show that the same WD property determines learnability.
1.1 Related Work
Several works consider the problem of sensor selection in either batch or online settings (e.g., Trapeznikov and Saligrama [2], Seldin et al. [3]). However, they all require that the label of each data point is available or that a reward is obtained for each action. Zolghadr et al. [4] consider a setting where labels are available for a payment. Greiner et al. [5] and Póczos et al. [6] consider costs associated with tests, but they assume that the loss/reward associated with the player's action is revealed. In contrast, in our setting the labels are never revealed, so the problem is completely unsupervised, and the cost in our setup is a sensing cost, not the cost of acquiring a label.
Several works consider the problem of estimating the accuracies of multiple binary classifiers from unlabeled data. Most of them make strong assumptions, such as conditional independence given the labels or knowledge of the true distribution of the labels.
Platanios et al. [7] proposed logistic-regression-based methods using the classifiers' agreement rates over unlabeled data, Platanios et al. [8] extended this work to graphical models, and Platanios et al. [9] proposed a method based on probabilistic logic; the latter also uses a weighted majority vote for label prediction. All of this is in the batch setting and differs from our online setup. In crowdsourcing problems, various methods have been proposed to estimate the unknown skill levels of crowd-workers from the noisy labels they provide (Bonald and Combes [10], Kleindessner and Awasthi [11]). These methods assume that all workers have the same cost and aggregate the predictions on a given dataset to estimate each worker's accuracy. Unlike ours, these methods are not online.
2 USS Problem
We cast the unsupervised, stochastic, cascaded sensor selection problem as an instance of the stochastic partial monitoring (SPM) problem. We use sensor and arm interchangeably in the following. Formally, a problem instance in our setting is specified by a pair $\theta = (P, c)$, where $P$ is a distribution over the $(K+1)$-dimensional hypercube $\{0,1\}^{K+1}$, and $c$ is a $K$-dimensional, non-negative valued vector of costs. While $c$ is known to the learner from the start, $P$ is unknown. Henceforth, we identify a problem instance by $\theta$. The instance parameters specify the learner-environment interaction as follows: In each round $t$, the environment generates a $(K+1)$-dimensional binary vector $(Y_t^1, \dots, Y_t^K, Y_t)$ chosen at random from $P$. Here, $Y_t^i$ is the output of sensor $i$, while $Y_t$ is the (hidden) label to be guessed by the learner. Simultaneously, the learner chooses an index $I_t \in [K]$, where $[K] := \{1, \dots, K\}$, and observes the sensor outputs $Y_t^1, \dots, Y_t^{I_t}$, i.e., the learner goes through the first $I_t$ sensors and observes their outputs. Dropping the subindex $t$, the joint distribution $P$ of $(Y^1, \dots, Y^K)$ and $Y$ determines both the distribution of the sensor outputs, which is (essentially) observable, and their joint distribution with the hidden label, which is not. Hanawal et al. in addition assume that the sensors are known to be ordered from least accurate to most accurate, i.e., the error rate $\gamma_i := P(Y^i \neq Y)$ is decreasing in $i$. We relax this assumption later in Section 4. The cost of using sensor $i$ is denoted by $c_i$, and, as the selection has to be done sequentially, the cost of choosing action $k$ is $C_k := \sum_{i=1}^{k} c_i$. The total cost incurred by the learner in round $t$ is thus $\lambda \mathbb{1}\{Y_t^{I_t} \neq Y_t\} + C_{I_t}$, where $\lambda > 0$ is a trade-off parameter between error rate and the cost of using sensors ($\lambda$ makes the associated cost unitless: e.g., if costs in dollars are converted to cents, i.e., multiplied by $100$, the corresponding parameter becomes $100\lambda$, and vice versa). Without loss of generality, we set $\lambda = 1$. The goal of the learner is to compete with the best choice made with knowledge of $P$. Let $i^* \in \arg\min_{i \in [K]} (\gamma_i + C_i)$ be an optimal sensor. The cumulative (pseudo-)regret of the learner running an algorithm up to the end of round $T$ is
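For intuition, the oracle's choice can be computed directly once the error rates are known; the values below are hypothetical, and the learner, of course, cannot do this, since the $\gamma_i$ are unobservable:

```python
from itertools import accumulate

gamma = [0.30, 0.20, 0.10]    # hypothetical error rates, decreasing along the cascade
c = [0.05, 0.08, 0.25]        # hypothetical per-sensor costs c_i
C = list(accumulate(c))       # cumulative costs C_k = c_1 + ... + c_k

# Total expected cost of stopping at sensor i (with lambda = 1): gamma_i + C_i.
total = [g + Ck for g, Ck in zip(gamma, C)]
i_star = min(range(len(total)), key=total.__getitem__)
print(i_star + 1)             # here sensor 2 balances error rate and sensing cost
```

The middle sensor wins because the accuracy gained by the last sensor does not cover its extra acquisition cost.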
$\mathfrak{R}_T = \sum_{t=1}^{T} \left( \gamma_{I_t} + C_{I_t} \right) - T \left( \gamma_{i^*} + C_{i^*} \right).$  (1)
We say that the (expected) regret is sublinear if $\mathbb{E}[\mathfrak{R}_T]/T \to 0$ as $T \to \infty$, where the expectation is taken over the sensor choices $I_t$, which are random as they depend on past random data. When the regret is sublinear, the learner collects in the long run almost as much reward in expectation as an oracle that knew the optimal action from the beginning. Let $\Theta$ be the set of all stochastic, cascaded sensor selection problems; thus, every $\theta = (P, c) \in \Theta$ has non-negative costs and error rates $\gamma_i = P(Y^i \neq Y)$ decreasing in $i$. Given a subset $\Theta' \subseteq \Theta$, we say that $\Theta'$ is learnable if there exists a learning algorithm such that, for any instance in $\Theta'$, its expected regret is sublinear. A subset is said to be a maximal learnable problem class if it is learnable and no subset of $\Theta$ strictly containing it is learnable.
2.1 Strong and Weak Dominance
The purpose of this section is to introduce the notions of strong and weak dominance from the work of Hanawal et al. [1]. While Hanawal et al. studied learning under strong dominance, here we focus on weak dominance. We also modify the definition of weak dominance of Hanawal et al. to correct an oversight of theirs.
The strong dominance (SD) property is defined as follows:
Definition 1 (Strong Dominance (SD)).
An instance $\theta \in \Theta$ is said to satisfy the strong dominance property if, for every $i \in [K]$, it holds almost surely (a.s.) that
$Y^i = Y \implies Y^j = Y \ \text{for all } j > i.$  (2)
The SD property implies that if a sensor predicts correctly then, a.s., all the sensors in the subsequent stages of the cascade also predict correctly. The set $\Theta_{\mathrm{SD}}$ of all instances satisfying the SD property is learnable [1, Theorem 2]. The weaker version of the property is defined as follows:
Definition 2 (Weak Dominance (WD)).
An instance $\theta \in \Theta$ is said to satisfy the weak dominance property if
$C_j - C_{i^*} > P(Y^{i^*} \neq Y^j) \ \text{for all } j > i^*.$  (3)
Let $\Theta_{\mathrm{WD}}$ denote the set of instances satisfying the WD property. The WD property trivially holds for all problem instances where sensor $K$ is an optimal sensor, as the condition is then vacuous.
Hanawal et al. [1] claimed that the set of weakly dominant instances is learnable. However, their definition allowed equality in the defining inequality (3). As it turns out, permitting equality can prevent the set from being learnable:
Proposition 1.
The set of instances satisfying (3) with equality permitted is not learnable.
Proof.
Consider the instances constructed in Theorem 19 of Hanawal et al. [1]: in one instance the optimal decision is sensor 1 and in the other it is sensor 2, the suboptimality gaps are controlled by a tunable parameter, and one instance satisfies (3) with equality while the other does not. Theorem 17 of Hanawal et al. [1] implies that a sound algorithm must distinguish between the two cases. However, no finite amount of data is sufficient to decide this: in particular, one can show that if an algorithm achieves sublinear regret on one instance, then it must suffer linear regret on the other for a small enough value of the parameter. Hence, every algorithm suffers linear regret on some instance in this set. ∎
The following theorem is obtained directly from the corresponding theorems in [1] after excluding the equality case in their proofs.
Theorem 1.
The set $\Theta_{\mathrm{WD}}$ is a maximal learnable set.
In the following, we use an alternate characterization of the WD property, given as
$\rho(\theta) := \min_{j > i^*} \frac{C_j - C_{i^*}}{P(Y^{i^*} \neq Y^j)}.$  (4)
Notice that $\theta \in \Theta_{\mathrm{WD}}$ if and only if $\rho(\theta) > 1$. The larger the value of $\rho(\theta)$, the 'stronger' the WD property and the easier it is to identify an optimal action. We later characterize the regret bounds in terms of $\rho(\theta)$.
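As a concrete illustration, the characterization can be computed directly from the cumulative costs and the disagreement probabilities with the optimal sensor; all numbers below are hypothetical:

```python
# rho(theta) = min over j > i* of (C_j - C_{i*}) / P(Y^{i*} != Y^j);
# WD holds iff rho(theta) > 1.  All values below are illustrative.
C = [0.05, 0.13, 0.38]        # cumulative sensor costs
i_star = 1                    # optimal sensor (0-indexed)
p_disagree = {2: 0.12}        # P(Y^{i*} != Y^j) for each j > i*

rho = min((C[j] - C[i_star]) / p for j, p in p_disagree.items())
print(rho > 1)                # (0.38 - 0.13) / 0.12 > 1, so WD holds
```

The larger this ratio, the larger the margin by which the extra cost of worse-positioned sensors covers their disagreement with the optimal one, and the easier the identification problem.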
3 Algorithm Under WD Property
In the following, we let $i^*$ denote the optimal arm with the largest index, i.e., $i^* := \max \arg\min_{i \in [K]} (\gamma_i + C_i)$. The optimal sensor satisfies the following inequalities:
$C_{i^*} - C_j \leq \gamma_j - \gamma_{i^*} \quad \text{for all } j < i^*,$  (5a)
$C_j - C_{i^*} > \gamma_{i^*} - \gamma_j \quad \text{for all } j > i^*.$  (5b)
Note that the above decision criterion is risk-averse: if two sensors have the same total cost, the sensor with the smaller error-rate is chosen.
A natural candidate for a decision criterion is to replace the error rates $\gamma_j$ by their estimates and look for an index that satisfies (5a) and (5b). However, the error rates cannot be estimated, as the labels are unobserved, so (5a) and (5b) cannot directly lead to a sound algorithm. Recall the following result from [1]:
Proposition 2 ([1, Proposition 3]).
Let $p_{ij} := P(Y^i \neq Y^j)$ for $i < j$, for any $\theta \in \Theta$, not necessarily in $\Theta_{\mathrm{SD}}$. Then $\gamma_i - \gamma_j \leq p_{ij}$ for any $i < j$, and hence the disagreement probability upper-bounds the error-rate difference.
Using Proposition 2, criterion (5a) implies
$C_{i^*} - C_j \leq p_{j i^*} \quad \text{for all } j < i^*,$  (6)
where the disagreement probability $p_{j i^*}$ forms a proxy for the error-rate difference $\gamma_j - \gamma_{i^*}$. For the case $j > i^*$, we can appeal to the WD property and replace (5b) by
$C_j - C_{i^*} > p_{i^* j} \quad \text{for all } j > i^*.$  (7)
The disagreement probabilities $p_{ij}$ can be estimated, as the joint distribution of the sensor outputs is observable. Motivated by (6) and (7), we define the selection criteria based on the following sets:
$\mathcal{B}_1 := \{ i \in [K] : C_i - C_j \leq p_{ji} \ \text{for all } j < i \},$  (8)
$\mathcal{B}_2 := \{ i \in [K] : C_j - C_i > p_{ij} \ \text{for all } j > i \}.$  (9)
Lemma 1.
Let $\theta \in \Theta_{\mathrm{WD}}$. Then the intersection of the sets (8) and (9) contains the optimal sensor.
The proof is in Appendix A.
3.1 USS-UCB
In bandit problems, the upper confidence bound (UCB) strategy [16, 17] is highly effective for dealing with the trade-off between exploration and exploitation. Using the UCB idea, we develop an algorithm, named USS-UCB, that utilizes the sets (8) and (9) and looks for an index that belongs to both. Since the disagreement probabilities $p_{ij}$ are unknown (but fixed), they are replaced by their optimistic empirical estimates at round $t$, $\hat{p}_{ij}(t) + \psi_{ij}(t)$, where $\hat{p}_{ij}(t)$ is the empirical estimate of $p_{ij}$ and $\psi_{ij}(t)$ is the associated confidence term, as in the UCB algorithm. The new sets for the selection criteria are defined as follows:
$\hat{\mathcal{B}}_1(t) := \{ i \in [K] : C_i - C_j \leq \hat{p}_{ji}(t) + \psi_{ji}(t) \ \text{for all } j < i \},$  (10a)
$\hat{\mathcal{B}}_2(t) := \{ i \in [K] : C_j - C_i > \hat{p}_{ij}(t) + \psi_{ij}(t) \ \text{for all } j > i \}.$  (10b)
From the definitions, it is easy to verify that these membership conditions involve only the pairwise quantities $\hat{p}_{ij}(t)$ and $\psi_{ij}(t)$. Therefore, it is enough for the algorithm to keep track of the comparison and disagreement counts for each pair $i < j$.
Remark 1.
It might be tempting to use the lower confidence term $\hat{p}_{ij}(t) - \psi_{ij}(t)$ instead of the upper confidence term in (10b). However, such a change can make the algorithm converge to a suboptimal sensor. A detailed discussion is given in the supplementary material.
The pseudo-code of USS-UCB is given as Algorithm USS-UCB, and it works as follows. The algorithm takes as input a parameter that trades off exploration and exploitation. In the first round, it selects sensor $K$ and initializes, for each pair $(i, j)$, the number of comparisons and the counter of disagreements (Line 3). In each subsequent round, the algorithm computes the estimate of each disagreement probability (Line 5) and the associated confidence term (Line 6). These are then used to compute the sets (10a) and (10b) (Line 7), which in turn determine the selection: the algorithm selects a sensor that satisfies both (10a) and (10b) (Line 9).
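The selection loop can be sketched in code. The following is a minimal, illustrative implementation for a three-sensor cascade; the costs, the form of the confidence bonus, and the simplified membership test (a stand-in for the intersection of (10a) and (10b)) are all our assumptions, not the paper's exact expressions.

```python
import math

K = 3
C = [0.05, 0.13, 0.38]      # hypothetical cumulative sensor costs C_1..C_K
alpha = 0.5                 # exploration parameter (assumed bonus form)

n = [[0] * K for _ in range(K)]   # n[i][j]: times pair (i, j) was compared
d = [[0] * K for _ in range(K)]   # d[i][j]: observed disagreements of pair (i, j)

def ucb_disagreement(i, j, t):
    """Optimistic (upper-confidence) estimate of p_ij = P(Y^i != Y^j)."""
    if n[i][j] == 0:
        return 1.0
    return d[i][j] / n[i][j] + math.sqrt(alpha * math.log(t) / n[i][j])

def select_sensor(t):
    """Pick the cheapest index whose extra cost to every later sensor covers
    the optimistic disagreement estimate (simplified selection criterion)."""
    for i in range(K):
        if all(C[j] - C[i] >= ucb_disagreement(i, j, t) for j in range(i + 1, K)):
            return i
    return K - 1            # fall back to the full cascade

def update(chosen, outputs):
    """Outputs of sensors 1..chosen are observed; update all observed pairs."""
    for i in range(chosen + 1):
        for j in range(i + 1, chosen + 1):
            n[i][j] += 1
            d[i][j] += int(outputs[i] != outputs[j])

# Tiny demo with identical sensor outputs (zero disagreement):
picks = [0] * K
for t in range(1, 2001):
    chosen = select_sensor(t)
    picks[chosen] += 1
    update(chosen, [0, 0, 0])
print(picks.index(max(picks)))   # the cheapest sensor dominates once estimates concentrate
```

Note that observing sensors $1, \dots, I_t$ yields updates for every pair among them, which is the side-observation structure the analysis exploits.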
3.2 Regret Analysis
The following notation is useful in the subsequent proofs. For the optimal sensor $i^*$ and each $j \neq i^*$, let
(11) 
(12a)  
(12b) 
(13a)  
(13b) 
Notice that the quantities defined above are positive for all $j$ under the WD property. Let $N_i(T)$ denote the number of times sensor $i$ is selected until round $T$. The following proposition bounds the mean number of times a suboptimal sensor is selected.
Proposition 3.
Let the confidence term in USS-UCB be given by a positive-valued increasing function. For any $\theta \in \Theta_{\mathrm{WD}}$, the mean number of times a suboptimal sensor $i$ is selected is bounded as follows:

for any

and for any
Notice that the bound on the mean number of times a suboptimal sensor is selected is finite. The regret bound follows by noting that the expected regret equals the sum, over suboptimal sensors, of the expected number of selections weighted by the corresponding suboptimality gaps. Formally, we have the following regret bound.
Theorem 2.
Let the confidence term be set as in Proposition 3. Then, for any $\theta \in \Theta_{\mathrm{WD}}$, the expected regret of USS-UCB in $T$ rounds is bounded as below:
Corollary 1.
Let the parameters be set as in Theorem 2. Then, the expected regret of USS-UCB for any $\theta \in \Theta_{\mathrm{WD}}$ in $T$ rounds is logarithmic in $T$.
Corollary 2.
Let the technical conditions stated in Corollary 1 hold. Then the expected regret of USS-UCB for any $\theta \in \Theta_{\mathrm{SD}}$ in $T$ rounds is logarithmic in $T$.
Proof.
Since, for $\theta \in \Theta_{\mathrm{SD}}$, the disagreement probabilities equal the corresponding error-rate differences, the rest follows from Corollary 1. ∎
We next present problem-independent bounds on the expected regret of USS-UCB.
Theorem 3.
Let the confidence term be set as in Proposition 3. The expected regret of USS-UCB in $T$ rounds

for any instance in $\Theta_{\mathrm{SD}}$ is bounded as

for any instance in $\Theta_{\mathrm{WD}}$ is bounded as
Corollary 3.
The expected regret of USS-UCB on $\Theta_{\mathrm{SD}}$ is $\tilde{O}(\sqrt{T})$ and on $\Theta_{\mathrm{WD}}$ it is $\tilde{O}(T^{2/3})$, where $\tilde{O}$ hides logarithmic terms.
The proof of Theorem 3 can be found in the supplementary material. We note that the above uniform bounds do not contradict the claim in [1] of the non-existence of uniform bounds: the condition considered in [1] incorrectly includes the class of instances satisfying (3) with equality, which renders the class not learnable, whereas our definition of $\Theta_{\mathrm{WD}}$ excludes these instances, and $\Theta_{\mathrm{WD}}$ is learnable.
Discussion on the optimality of USS-UCB: Any partial monitoring problem can be classified as an 'easy', 'hard', or 'hopeless' problem according to whether its expected regret is of order $\tilde{O}(\sqrt{T})$, $\tilde{O}(T^{2/3})$, or $\Omega(T)$, respectively; there exists no other class in between [14]. The class $\Theta_{\mathrm{SD}}$ is regret-equivalent to a stochastic multi-armed bandit with side observations [1], for which the regret scales as $\tilde{O}(\sqrt{T})$; hence $\Theta_{\mathrm{SD}}$ resides in the easy class, and our bound on it is optimal. Since $\Theta_{\mathrm{SD}} \subsetneq \Theta_{\mathrm{WD}}$, $\Theta_{\mathrm{WD}}$ is not easy, and since it is learnable, it cannot be hopeless. Therefore, the class $\Theta_{\mathrm{WD}}$ is hard, and we conclude that the regret bound of USS-UCB is optimal in its dependence on $T$. However, optimality with respect to the leading constants (e.g., in terms of $\rho(\theta)$) remains to be explored.
4 Unknown Ordering of Sensors
The sensor error rates are unknown in our setup and cannot be estimated due to the unavailability of ground-truth. Thus, we may not even know whether the error rates of the sensors in the cascade are decreasing. In this section, we remove the requirement that the sensors are arranged in decreasing order of their error rates and allow them to be arranged in an arbitrary, unknown order. The rest of the setup is the same as in Section 2. We show that, even with this relaxation, the WD property defined earlier continues to characterize the learnability of the problem.
We begin with the following observation.
Lemma 2.
Let $i^*$ be an optimal sensor. Then, the error rate of any sensor $j < i^*$ is higher than that of $i^*$.
Proof.
We have $\gamma_{i^*} + C_{i^*} \leq \gamma_j + C_j$ for all $j$. For $j < i^*$, $C_j < C_{i^*}$, as the costs are increasing along the cascade. Hence $\gamma_j > \gamma_{i^*}$. ∎
The following corollary directly follows from Prop. 2.
Corollary 4.
For any $j < i^*$, $\gamma_j - \gamma_{i^*} \leq P(Y^j \neq Y^{i^*})$.
The following two propositions provide the conditions on sensor costs that allows comparison of their total costs based on disagreement probabilities.
Proposition 4.
Let . Assume
(14) 
Then, iff .
Proposition 5.
Let . Assume
(15) 
Then, iff .
From Lemma 2, for any $j < i^*$ we have $\gamma_j > \gamma_{i^*}$. Propositions 4 and 5 then suggest that the disagreement probabilities with respect to $i^*$ are sufficient to select the optimal sensor if the sensor costs satisfy Eq. (14) for all $j < i^*$ and Eq. (15) for all $j > i^*$. Since the disagreement probabilities can be estimated for all pairs of sensors, we can establish the following result.
Proposition 6.
Let be an optimal sensor. Any problem instance is learnable if
Notice that when the sensors are ordered by decreasing error rates, conditions (14) and (15) reduce to the WD condition. Hence, we have the following result.
Theorem 4.
The set of problem instances with arbitrarily ordered sensors that satisfy the WD property is learnable.
5 Experiments
In this section, we evaluate the performance of USS-UCB on problem instances derived from a synthetic dataset and two 'real' datasets: PIMA Indians Diabetes [18] and Heart Disease (Cleveland) [19, 20]. In our experiments, each sensor is represented by a classifier, and the classifiers are arranged in order of decreasing misclassification (error) rate for each dataset. The cost of using a classifier is assigned based on its error-rate: the smaller the error-rate, the higher the cost. The case where the classifiers' error-rates need not decrease along the cascade is also considered.
Synthetic Dataset: We generate a synthetic Bernoulli Symmetric Channel (BSC) dataset [1] as follows: the labels are generated as i.i.d. Bernoulli random variables. The problem instance used in the experiments has three sensors with fixed error rates. To ensure strong dominance, we impose the condition given in Eq. 2 during data generation. To relax strong dominance, when sensor 1 predicts correctly, we introduce errors in up to 10% of the outputs of sensors 2 and 3. We use five problem instances obtained by varying the cost of each sensor, as given in Table 1.
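The generation procedure above can be sketched as follows; the error rates are hypothetical placeholders (the extraction lost the paper's exact values), and the coupling through a shared uniform variable is one way, under our assumptions, to enforce Eq. 2 before the error injection:

```python
import random

random.seed(7)
n = 100_000
gammas = [0.40, 0.30, 0.25]   # hypothetical error rates for sensors 1..3

labels, outputs = [], []
for _ in range(n):
    y = random.randint(0, 1)
    u = random.random()
    # SD coupling: sensor i errs exactly when u < gamma_i (cf. Eq. 2).
    outs = [1 - y if u < g else y for g in gammas]
    # Relax SD: when sensor 1 is correct, flip sensors 2 and 3 with prob. 0.1.
    if outs[0] == y:
        for i in (1, 2):
            if random.random() < 0.10:
                outs[i] = 1 - outs[i]
    labels.append(y)
    outputs.append(outs)

# SD now fails on some rounds: sensor 1 correct but sensor 3 wrong.
violations = sum(o[0] == y and o[2] != y for y, o in zip(labels, outputs))
print(violations > 0)   # True
```

The injected errors break strong dominance while leaving sensor 1's error rate unchanged, which is what makes these instances a test bed for the WD regime.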
Values/Classifiers  Clf. 1  Clf. 2  Clf. 3  WD Prop. 

Case 1 Costs  0  0.6  0.8  ✓ 
Case 2 Costs  0  0.15  0.35  ✓ 
Case 3 Costs  0  0.65  0.9  ✓ 
Case 4 Costs  0.2  0.36  0.4  ✓ 
Case 5 Costs  0  0.11  0.22  ✕ 
Real Datasets: Both real datasets specify the costs of acquiring individual features. We split these features into three subsets based on their costs and train three linear classifiers on the subsets using logistic regression. For the PIMA-Diabetes dataset (number of samples: 768), the first classifier is associated with patient history/profile at a cost of $6; the 2nd classifier, in addition, utilizes the glucose tolerance test (cost $29); and the 3rd classifier uses all attributes, including the insulin test (cost $46). For the Heart dataset (number of samples: 297), we associate the 1st classifier with the first 7 attributes, which include cholesterol readings, blood sugar, and rest-ECG (cost $32); the 2nd classifier utilizes, in addition, the thalach, exang, and oldpeak attributes, which cost $397; and the 3rd classifier utilizes more extensive tests at a total cost of $601. We scale the costs using a tuning parameter (since the feature costs are all greater than one) and consider minimizing the combined objective stated in Section 2. In our setup, high (low) values of the tuning parameter correspond to a low (high) budget constraint. For example, a fixed budget of $50 corresponds to a high budget (small parameter) for PIMA-Diabetes (3rd classifier optimal) and a low budget (large parameter) for Heart Disease (1st classifier optimal). For performance evaluation, different values of the tuning parameter are used in five problem instances for both real datasets, as given in Table 2.
Values/
Classifiers 
PIMADiabetes  Heart Disease  WD Pro.  

Clf. 1  Clf. 2  Clf. 3  Clf. 1  Clf. 2  Clf. 3  
Errorrate  0.3125  0.2331  0.2279  0.29292  0.20202  0.14815  
Cost (in $)  4  29  46  32  397  601  
in Case 1  0.01  0.0106  0.015  0.0001  0.0008  0.001  ✓ 
in Case 2  0.01  0.004  0.0038  0.0001  0.0001  0.00035  ✓ 
in Case 3  0.01  0.0113  0.015  0.0001  0.0009  0.001  ✓ 
in Case 4  0.0001  0.0001  0.0001  0.00001  0.00004  0.0001  ✓ 
in Case 5  0.01  0.002  0.0055  0.0042  0.0001  0.00027  ✕ 
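The effect of the tuning parameter can be checked directly with the PIMA-Diabetes error-rates and costs from Table 2; the objective form $\gamma_i + \lambda \cdot C_i$ follows our reading of Section 2 and is an assumption here:

```python
# Error-rates and feature costs of the three PIMA-Diabetes classifiers (Table 2).
gamma = [0.3125, 0.2331, 0.2279]
cost = [4, 29, 46]

def optimal_clf(lam):
    """Index (1-based) minimizing the total cost gamma_i + lam * C_i."""
    totals = [g + lam * c for g, c in zip(gamma, cost)]
    return totals.index(min(totals)) + 1

print(optimal_clf(0.01))      # large parameter (low budget): cheap classifier 1
print(optimal_clf(0.0001))    # small parameter (high budget): accurate classifier 3
```

This reproduces the budget intuition stated above: scaling the costs up makes the cheap classifier optimal, and scaling them down makes the accurate one optimal.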
Verifying the WD property: As we know the error-rate of each classifier, we can find an optimal sensor for a given problem instance. Once the optimal sensor is known, the WD property is verified using the estimates of the disagreement probabilities after a large number of rounds.
Expected Cumulative Regret vs. Time Horizon: Figure 2 shows the expected cumulative regret of USS-UCB versus the time horizon for the synthetic BSC dataset and the two real datasets. These plots verify that any instance satisfying the WD property has sublinear regret. In each round, USS-UCB draws an instance uniformly at random from the dataset (with replacement), for a fixed time horizon. Further, we compare Algorithm 2 of [1] and USS-UCB for different values of the exploration parameter. With the same parameter value, Algorithm 2 of [1] and USS-UCB give the same regret, whereas USS-UCB with a tuned parameter gives the best results, as shown in Figure 3. We also verify that if WD holds in a problem instance with an arbitrary ordering of sensors by error rates, then the problem is learnable, as shown in Figure 3(c). We fix the time horizon to 10000 in our experiments. Each experiment is repeated 100 times, and the average regret with a 95% confidence bound is reported.
Supervised vs. Unsupervised Learning: We compare USS-UCB against an algorithm in which the learner receives feedback: for each action in each round, as in the bandit setting, the learner knows whether or not the corresponding sensor output is correct. We implement this 'supervised bandit' setting by replacing Step 5 of USS-UCB with estimated marginal error rates. We notice that, for both high- and low-cost scenarios, while the supervised algorithm does have lower regret, the cumulative regret of USS-UCB is also sublinear, as shown in Figure 3(a). This is qualitatively interesting because the plots demonstrate that, in typical cases, our unsupervised algorithm learns nearly as well as its supervised counterpart.

Learnability vs. WD Property: To verify the relationship between learnability and the WD property, we experiment with different problem instances of the synthetic BSC dataset, parameterized by varying the costs. We test the hypothesis that the set of problem instances satisfying the WD property is a maximal learnable set. We fix an optimal sensor and vary the costs so as to pass continuously from the situation where WD holds to the one where it does not. If WD does not hold for a problem instance, then USS-UCB converges to a suboptimal sensor instead of the optimal one, and the cumulative regret (1) grows due to the repeated selection of the suboptimal sensor. We start the experiments with the smallest cost value for which the problem instance satisfies the WD property and then vary this value across the WD boundary. Figure 3(b) depicts the cumulative regret of USS-UCB versus the varied cost for the synthetic BSC dataset. It can be seen clearly that there is indeed a transition at the point where the WD property starts to hold, i.e., at $\rho(\theta) = 1$.
6 Conclusion
We studied the problem of selecting the best sensor in a cascade of sensors ordered according to their prediction accuracies. The best sensor optimally trades off between sensing cost and prediction accuracy. The challenge in this setup is that the ground-truth is never revealed, and hence the setup is completely unsupervised. We modeled it as a stochastic partial monitoring problem and proposed an algorithm with sublinear regret under the Weak Dominance (WD) property. We showed that our algorithm enjoys regret of order $\tilde{O}(T^{2/3})$ (hiding logarithmic terms) and that, when the problem instance satisfies the more stringent Strong Dominance (SD) property, the regret bound improves to $\tilde{O}(\sqrt{T})$. We also showed that our algorithm retains the same performance under the WD property even when the sensors are not necessarily ordered by decreasing prediction accuracy.
In the current work, we did not exploit any side information (contexts) available with the tasks. It would be interesting to study the contextual version of this problem, where the optimal sensor could be task-dependent.
Acknowledgment
Arun Verma is partially supported by an MHRD Fellowship, Govt. of India. M.K. Hanawal is supported by IIT Bombay IRCC SEED grant (16IRCCSG010) and an INSPIRE faculty fellowship (IFA14/ENG73) from DST, Govt. of India. V. Saligrama acknowledges the support of the NSF through grant 1527618. AV and MKH would like to thank Prof. N. Hemachandra, IEOR, IIT Bombay for many useful discussions. This work was done while Csaba Szepesvári was on leave from the University of Alberta.
References
 Hanawal et al. [2017] Manjesh Hanawal, Csaba Szepesvári, and Venkatesh Saligrama. Unsupervised sequential sensor acquisition. In Artificial Intelligence and Statistics, pages 803–811, 2017.
 Trapeznikov and Saligrama [2013] Kirill Trapeznikov and Venkatesh Saligrama. Supervised sequential classification under budget constraints. In Artificial Intelligence and Statistics, pages 581–589, 2013.
 Seldin et al. [2014] Yevgeny Seldin, Peter L Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori. Prediction with limited advice and multi-armed bandits with paid observations. In ICML, pages 280–287, 2014.
 Zolghadr et al. [2013] Navid Zolghadr, Gábor Bartók, Russell Greiner, András György, and Csaba Szepesvári. Online learning with costly features and labels. In Advances in Neural Information Processing Systems, pages 1241–1249, 2013.
 Greiner et al. [2002] Russell Greiner, Adam J Grove, and Dan Roth. Learning costsensitive active classifiers. Artificial Intelligence, 139(2):137–174, 2002.

 Póczos et al. [2009] Barnabás Póczos, Yasin Abbasi-Yadkori, Csaba Szepesvári, Russell Greiner, and Nathan Sturtevant. Learning when to stop thinking and do something! In Proceedings of the 26th Annual International Conference on Machine Learning, pages 825–832. ACM, 2009.
 Platanios et al. [2014] Emmanouil Antonios Platanios, Avrim Blum, and Tom M Mitchell. Estimating accuracy from unlabeled data. In UAI, pages 682–691, 2014.
 Platanios et al. [2016] Emmanouil Antonios Platanios, Avinava Dubey, and Tom Mitchell. Estimating accuracy from unlabeled data: A bayesian approach. In International Conference on Machine Learning, pages 1416–1425, 2016.
 Platanios et al. [2017] Emmanouil Platanios, Hoifung Poon, Tom M Mitchell, and Eric J Horvitz. Estimating accuracy from unlabeled data: A probabilistic logic approach. In Advances in Neural Information Processing Systems, pages 4361–4370, 2017.
 Bonald and Combes [2017] Thomas Bonald and Richard Combes. A minimax optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems, pages 4352–4360, 2017.
 Kleindessner and Awasthi [2018] Matthäus Kleindessner and Pranjal Awasthi. Crowdsourcing with arbitrary adversaries. In International Conference on Machine Learning, pages 2713–2722, 2018.
 Cesa-Bianchi et al. [2006] Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, 31(3):562–580, 2006.
 Bartók and Szepesvári [2012] Gábor Bartók and Csaba Szepesvári. Partial monitoring with side information. In International Conference on Algorithmic Learning Theory, pages 305–319. Springer, 2012.
 Bartók et al. [2014] Gábor Bartók, Dean P Foster, Dávid Pál, Alexander Rakhlin, and Csaba Szepesvári. Partial monitoring—classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967–997, 2014.
 Wu et al. [2015] Yifan Wu, András György, and Csaba Szepesvári. Online learning with gaussian payoffs and side observations. In Advances in Neural Information Processing Systems, pages 1360–1368, 2015.
 Auer et al. [2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
 Garivier and Cappé [2011] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference On Learning Theory, pages 359–376, 2011.
 Kaggle [2016] UCI Machine Learning, Kaggle. Pima Indians Diabetes Database. 2016. URL https://www.kaggle.com/uciml/pima-indians-diabetes-database.
 Detrano [1998] Robert Detrano. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, MD, Ph.D., Donor: David W. Aha. 1998. URL https://archive.ics.uci.edu/ml/datasets/Heart+Disease.
 Dheeru and Karra Taniskidou [2017] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Hoeffding [1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
Appendix A Proof of Lemma 1
Lemma 1 (restated).
Proof.
Let $i^*$ be an optimal sensor. Define
(16)  
(17)  
(18)  
(19)  
(20) 
Consider the following three cases:
Case I:
As is an optimal sensor therefore . If any sensor then i.e.,
(21)  
(22) 
Similarly, . If any sensor then i.e.,
(23) 
(24) 
The following definition is convenient for the proof arguments.
Definition 3 (Action Preference).
The sensor is optimistically preferred over sensor in round if:
(28a)  
(28b) 
Appendix B Discussion of Remark 1
The algorithm can converge to a suboptimal sensor if we replace the upper confidence term in (10b) by the lower confidence term. To verify this claim, assume that the algorithm selects a suboptimal sensor in a round; then,
Since the sensor is not used, there is no update in the number of comparisons, but by definition the confidence term grows with the round index; therefore,
Hence, the suboptimal sensor will always be preferred over the optimal sensor in subsequent rounds. This can be avoided by using the upper confidence term in (10b), because
the suboptimal sensor will then not be preferred after sufficiently many rounds,
(29) 
As using the lower confidence term can leave the decisions stuck at a suboptimal sensor, the upper confidence term is used in (10b).
Appendix C Proof of Proposition 3
We first recall the standard Hoeffding’s inequality [21, Theorem 2] that we use in the proof.
Theorem 5.
Let $X_1, \ldots, X_n$ be independent random variables with common range $[0,1]$, let $S_n = X_1 + \cdots + X_n$, and let $\mu = \mathbb{E}[S_n]/n$. Then, for all $\epsilon > 0$,
$P(S_n/n \geq \mu + \epsilon) \leq e^{-2n\epsilon^2},$  (30a)
$P(S_n/n \leq \mu - \epsilon) \leq e^{-2n\epsilon^2}.$  (30b)
We need the following lemmas to prove Proposition 3.
Lemma 3.
(31) 
Proof.
Leibniz’s rule for ,
Leibniz’s integral rule without any common variable,
Using Leibniz’s rule in (31), we get
Integrating both side,
Lemma 4.
Let and . Then