Scheduling to Learn In An Unsupervised Online Streaming Model

12/02/2021
by   R. Vaze, et al.
0

An unsupervised online streaming model is considered where samples arrive in an online fashion over T slots. There are M classifiers, whose confusion matrices are unknown a priori. In each slot, at most one sample can be labeled by any classifier. The accuracy of a sample is a function of the set of labels obtained for it from various classifiers. The utility of a sample is a scalar multiple of its accuracy minus the response time (difference of the departure slot and the arrival slot), where the departure slot is also decided by the algorithm. Since each classifier can label at most one sample per slot, there is a tradeoff between obtaining a larger set of labels for a particular sample to improve its accuracy, and its response time. The problem of maximizing the sum of the utilities of all samples is considered, where learning the confusion matrices, sample-classifier matching assignment, and sample departure slot decisions depend on each other. The proposed algorithm first learns the confusion matrices, and then uses a greedy algorithm for sample-classifier matching. A sample departs once its incremental utility turns non-positive. We show that the competitive ratio of the proposed algorithm is 1/2-𝒪(log T/T).

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

12/20/2018

Not Just Age but Age and Quality of Information

A versatile scheduling problem to model a three-way tradeoff between del...
11/14/2017

Robust Online Speed Scaling With Deadline Uncertainty

A speed scaling problem is considered, where time is divided into slots,...
04/02/2022

Inverse is Better! Fast and Accurate Prompt for Few-shot Slot Tagging

Prompting methods recently achieve impressive success in few-shot learni...
07/17/2019

SUMBT: Slot-Utterance Matching for Universal and Scalable Belief Tracking

In goal-oriented dialog systems, belief trackers estimate the probabilit...
07/30/2019

Optimal Dynamic Multi-Resource Management in Earth Observation Oriented Space Information Networks

Space information network (SIN) is an innovative networking architecture...
06/06/2014

Small Sample Learning of Superpixel Classifiers for EM Segmentation- Extended Version

Pixel and superpixel classifiers have become essential tools for EM segm...
07/18/2020

Fast Learning for Renewal Optimization in Online Task Scheduling

This paper considers online optimization of a renewal-reward system. A c...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Dawid and Skene in what is now regarded as seminal work Dawid and Skene (1979)

, considered the basic problem of discovering ground truth or true labels of samples from multiple but possibly erroneous/noisy responses/measurements. The more modern paradigm considers the online variant of Dawid and Skene’s model for unsupervised learning (USL), where samples arrive over time and the problem is to discover their true labels using a large number of classifiers, typically ‘non-expert’ crowdsourcing workers. With the advancement in cloud based services, this model has been successfully implemented in variety of platforms, such as Amazon Mechanical Turk.

Such systems, however, have limitations resulting from the lack of knowledge of ground truth and the precise accuracy information of each of the classifiers. Thus, there is a need for aggregating algorithms that can infer the ground truth using a large ensemble of classifiers. There is a large body of work in this direction, where the main idea is to find a small set of good classifiers and/or estimating the accuracy of all classifiers from the obtained labels for different samples

Li et al. (2013); Raykar et al. (2010); Shaham et al. (2016); Karger et al. (2011); Ho and Vaughan (2012); Natarajan et al. (2013); Zhang et al. (2016); Whitehill et al. (2009); Liu and Liu (2015); Gong and Shroff (2018); Nordio et al. (2018); Karger et al. (2014); Sheng et al. (2008).

To make the model formal, a confusion matrix

is defined for each classifier, whose

entry captures the probability that the classifier labels a sample as belonging to class

when the true class label is . Learning the confusion matrix for each classifier is part of the problem, using which the true label of a sample can be discovered with high probability.

In this paper, we consider a competition model for the online USL in the Dawid-Skene model, where in each time slot, a set of samples arrive into the system. There are a total of classifiers, where classifier is defined by its true confusion matrix , that is unknown a priori. Without loss of generality, we consider that each classifier can label at most one sample in each time slot. See Remark 2 for more details. The accuracy of a sample in a time slot is defined as the probability of classifying it correctly using the optimal combining rule given its label set obtained till that time slot. To consider a general model, we let that a sample can remain in system for more than one time slot to increase its accuracy in hope of getting more labels, however, at the cost of accruing delay. Thus, this model captures the tradeoff between sample accuracy and throughput of the system, where throughput is defined as the rate of sample departures.

The studied model is well-suited for real-time learning paradigms, such as image classification in social networks e.g. instagram, facebook, automated crowd controlling system, medical diagnostics, automated quality management in factories, large dataset handling applications in bioinformatics, etc., where not only good classification is needed but the speed/delay incurred matters. In the interest of space, we refer the reader to Massoulié and Xu (2016); Shah et al. (2020) for more discussion on applications.

1.1 Related Work

For USL models, the competition between accuracy and throughput has been studied in prior work Massoulié and Xu (2016); Shah et al. (2020); Basu et al. (2019) by defining a metric called capacity. Capacity is defined to be the largest stochastic rate of arrival of samples per time slot, such that each sample can be guaranteed to have an accuracy above a certain threshold with high probability, while maintaining queue stability. Queue stability states that the long term average of the expected number of samples in the system (yet to depart the system) remains bounded. Even though the prior work Massoulié and Xu (2016); Shah et al. (2020); Basu et al. (2019) captures the tension between accuracy of samples and their arrival rates, it only considers the metric of stability that is a long term throughput metric, and cannot account for more refined utilities such as per sample delay, etc.

To model a more fine grained system, in this work, we define a per-sample utility, that is a function of the accuracy of that sample, and the time it takes to accrue that accuracy. Accuracy is directly related to the set of classifiers that label that sample. Thus, each sample would ideally like to get labels from the set of good or all classifiers, however, that requires waiting for multiple time slots because of the presence of other samples and each classifier labelling at most one sample per unit time, introducing a delay cost. In particular, we choose the sample utility to be a linear combination of its accuracy at its departure time, and the response time (departure time -arrival time), and the total utility is the sum of the utilities of all samples.

Thus, in each time slot, the problem is to find a matching between the outstanding samples and the classifiers, given the current estimates on the entries of the confusion matrices of each classifier, such that the overall utility is maximized. Even when the confusion matrices are perfectly known a priori, this problem is combinatorial and hard to solve. The problem becomes even more involved in the practical setting, since the quality of estimates of the confusion matrices depend on the prior matching decisions, while future matching decisions depend on the current estimates of the confusion matrices.

To formulate the problem in the online setting, we use the metric of competitive ratio that is defined as the ratio of the utility of any algorithm and the utility of the optimal offline algorithm, minimized over all input sequences of samples. The optimal offline algorithm is assumed to have genie knowledge (perfect knowledge of the confusion matrices) and knows the entire input sequence non-causally. Thus, in contrast to Massoulié and Xu (2016); Shah et al. (2020); Basu et al. (2019), in this paper, we consider that the sample arrival sequence is arbitrary and not necessarily stochastic (e.g. Poisson).

1.2 Our contributions

  • [leftmargin=*]

  • With genie knowledge of the confusion matrices, we define a greedy algorithm and show that its competitive ratio is at least . The main tool in deriving this result is to show that the utility of each sample is submodular, for which the greedy algorithm’s competitive ratio is shown to be at least using results from recent work Rajaraman et al. (2021) compared to classical work Fisher et al. (1978).

  • Next, we define a regret metric between the greedy algorithm with genie knowledge and the greedy algorithm that has to learn the confusion matrices. We show that if the error between the estimate and the true value of the entries of the confusion matrices is small enough, then the matching decisions made by the greedy algorithm in the genie setting and the greedy algorithm after learning are identical. This allows us to show that the expected regret of the greedy algorithm that learns the confusion matrices is at most , where is the time horizon.

  • Combining the above two results, we get our final result that the competitive ratio of the proposed greedy algorithm that also learns the confusion matrices is at least .

  • We also provide experimental results for both the synthetic and the real-world data sets to confirm our theoretical results.

2 System Model

We consider the online analogue of the generalized Dawid-Skene model Dawid and Skene (1979) for unsupervised classification, similar to Zhang et al. (2016); Liu and Liu (2015); Basu et al. (2019). We consider slotted time, where in each time slot/time , a (new) set of samples arrive. is the complete set of samples that arrives over the total time horizon of slots. For each sample, the true label can take two values , i.e., a sample can belong to one of two classes and . The two class problem is considered for simplicity. All the results of the paper can be extended to multi-class problems with classes.

Let . There are a total of classifiers, where the confusion matrix of the classifier is , where is the probability that the sample’s true label is but is labeled as by classifier . Following Berend and Kontorovich (2014), we assume that classifier has a competence level , which is the probability of making a correct prediction, regardless of the original class. This implies that the confusion matrix for all , i.e., is the probability that classifier labels correctly. The assumption that is primarily considered since the error probability bounds in closed form Berend and Kontorovich (2014) are available only for this case. We also consider that , i.e., each classifier is better than making a random guess while no classifier is perfect (similar to Zhang et al. (2016)).

Remark 1

The considered model is equivalent to the one-coin model Zhang et al. (2016) that is quite popular in the literature, where and for classes, . In this paper, we restrict to .

We let that the number of samples arriving in each slot , where is a constant independent of and . Without loss of generality, we assume that each classifier can label at most one sample per time slot similar to Basu et al. (2019). Thus, in each time slot, at most samples can be labelled.

Remark 2

If classifier can process samples per time slot, then we make copies of classifier , each with the same confusion matrix . Since all these copies have the same confusion matrix , the rate at which is learnt is accelerated by a factor of , since the error bounds on learning the confusion matrix depend on the number of samples seen Zhang et al. (2016). Thus with , the system gets a constant speed up both in terms of learning the confusion matrix and processing larger number of samples, however, does not change the order wise results we derive in the paper.

We follow the model Basu et al. (2019), that the true label and the individual classifiers’ labels for each sample are generated once and fixed thereafter. Thus, repeatedly assigning a sample to the same classifier yields no benefit.

Let the time slot at which a sample exits the system be . Sample exits the system on account of either being labelled by all the classifiers or if an algorithm decides do so. We let new samples arrive at the beginning of the slot, while all exits happen at the end of any slot. Let the set of samples that exits the system at the end of slot be .

Let at the end of slot , the set of outstanding samples that have arrived till the beginning of slot but have not yet exited the system be . Thus, . For each sample , let the set of classifiers from which it has already got a label by the end of slot be . In slot , for any sample , a label can be obtained from any one of the previously unused classifiers. In particular, in slot , for from any classifier , while for from any classifier .

For , let be the label obtained from classifier for sample . Given the (possibly partial) label set (), the optimal decision rule (called the weighted majority) Liu and Liu (2015) declares the label of to be if

(1)

and otherwise. Let be the probability that the final label obtained using (1) with the label set is in fact the true label. With the weighted majority rule (1), from Berend and Kontorovich (2014),

(2)

where

(3)

Since the gap between the lower and the upper bound (2) on the success probability is a constant, for the purposes of this paper, we let

(4)

for a constant . We suppress the constant for the rest of the theoretical analysis for notational ease.

From (4), since , each sample can improve its accuracy by getting labels from as many classifiers as possible. However, since at most samples can be labelled in each time slot, each sample can improve its accuracy only at the cost of delay by staying in the system for a longer time.

Definition 3

For each time slot , any classifier can be assigned to any sample . Thus, we consider the pair as the resource block . Thus in each time slot , there are resource blocks. Let the collection of all resource block across time horizon be denoted as

Definition 4

For any , the utility of sample is defined as

(5)

where

where is the probability that the label obtained using (1) with label set from classifiers belonging to is in fact the true label, while is the index of the latest time slot of any resource block in , that counts the delay experienced by sample if all classifiers of are used for it, where is the arrival time of the sample , and is the sample weight that trades off the accuracy versus the delay cost.

Note that expression is the ‘real’ utility for sample , however, it has the property that adding a new resource block , can remain the same or decrease. This is on account of the classifier belonging to having being already used in , thus providing no improvement in , or increase in delay because of inclusion of which increases , thereby potentially decreasing . Since any reasonable algorithm will not assign a new resource block to sample even when it reduces its utility, the considered definition of utility is natural,

Note that with this notation, if the final resource block subset assigned to sample is , then the exit time when the sample exits the system, at which time, possibly , i.e., sample is not labeled by all the classifiers.

Since each classifier can label at most one sample per unit time, we can represent the decision about which sample should be labelled by which classifier at any time, as a bi-partite matching. In particular, let be a bi-partite matching between the set of outstanding samples and the set of classifiers at time slot , where an edge exists between and if , i.e., sample has not been labeled by classifier already. If an edge is part of matching , then a label is obtained for sample from classifier at time slot .

For any algorithm alg, the objective is to maximize the sum of the utilities (5) across the samples, where the decision variables at the beginning of time slot are i) matching between the samples and the classifiers, and ii) the exit time decision for each sample . Note that since are unknown, past matchings of the algorithm controls the quality of the estimate of at time , which consequently impacts the matching decision at time , i.e., . We consider this problem in the online setup, where any algorithm alg has only causal information, i.e., at time slot , the algorithm does not know anything about the samples arriving in future (slots ). Formally, the optimization problem is

(6)

where decisions are allowed to use only causal information.

The performance metric for an online algorithm is called the competitive ratio, that is defined for an algorithm alg as

(7)

where is the input sequence of samples, and is the optimal offline algorithm (unknown) that knows the input sequence in advance, and the true value of a priori. The goal is to design an algorithm with as large a competitive ratio as possible.

We next begin with some preliminaries, that will help us in analysis.

Definition 5

Let be a finite set, and let be the power set of . A real-valued set function is said to be monotone if for , and submodular if for every and every .

We first show that the accuracy (4) is a sub-modular function.

Proposition 6

(4) is a monotone and sub-modular set function.

Proof: Checking monotonicity of is straightforward since for all . Next, we check the sub-modularity of . With slight abuse of notation, from (4), the accuracy for sample with label set is denoted as

(8)

where , and . To check that is a sub-modular function in , we use definition (5), and show that where and . From (8),

(9)

Therefore, where and , is equivalent to showing , which is true since .

Proposition 7

is a sub-modular set function.

Since is sub-modular (Proposition 7), and is a linear function and hence sub-modular, the proposition follows since linear combination of sub-modular functions is sub-modular.

Lemma 8

Utility function is a monotone and sub-modular set function.

Proof: For any , by definition, , thus monotonicity is immediate. For sub-modularity, let and , then consider

Case I: . In this case, we also get that , since is sub-modular as shown in Proposition 6, and is additive. Hence,

Since is sub-modular, we also have

Thus, , proving sub-modularity of .

Case II . In this case, , while by definition . Thus, again we get that proving sub-modularity of .

Using these preliminaries, we next propose algorithms and bound their competitive ratios.

3 Algorithms

3.1 Genie Setting

Definition 9

Let if sample is matched to resource block , classifier in slot (called assignment), and zero otherwise. Any classifier cannot be matched to more than one sample in each slot, i.e., for each , . Let the complete sample-classifier-slot assignment be defined as and its restriction for sample be .

To define the algorithm, it is useful to write the increments of as obtained via incrementing the current solution with an additional assignment (matching sample with classifier in slot ) to completely describe the utility function in a compact form.

We propose a simple greedy algorithm (Algorithm 1), that on arrival of each new sample , creates (equal to the number of classifiers) copies of that sample . To model the restriction that a sample can be labeled by a classifier at most once, we enforce a constraint that any copies of any sample have to be assigned to distinct classifiers, while the copies themselves are indistinguishable. For example, if with classifiers and , the two copies of sample are . Then or are the only two valid sample-classifier assignments possible, which could be done over different slots.

For sample , at time , let the set be the set of classifiers for which at least one copy of has already been assigned to some at of before time . Thus, is the set of ineligible classifiers for any copy of sample in future. A classifier is defined to be eligible for sample at time if .

A classifier is defined to be free at time slot , if there is no sample that has been assigned to it at time slot so far. Let the set of free classifiers at time be , where an element is .

With the knowledge of , the proposed greedy algorithm (Algorithm 1) orders the classifiers in decreasing order of their accuracies . For each time slot , the algorithm picks a free classifier (in order) and assigns that sample copy to it, among the outstanding ones that are eligible, that maximizes the incremental utility given prior assignment , as long as the increment is positive.

A sample exits the system as soon as its incremental valuation over all its outstanding copies turns non-positive over all possible eligible classifiers in the current and the future time slots. Note that if the incremental valuation of assigning a sample copy to classifier is non-positive at time slot , then it remains non-positive for any time . So the exit decision can be computed efficiently.

1:Input Confusion Matrices
2:Initialize: , ,
3:for  do
4:     Set of free classifiers for time slot
5:     New samples arrive, make copies for each
6:       % is the set of outstanding sample copies.
7:     For sample , set of classifiers for which at least one copy of has been assigned to it
8:     for  in decreasing order of  do
9:         Set of eligible sample copies assignable to classifier ,
10:         if  then
11:              Break;
12:         else
13:              Allocate copy to classifier where
14:              Tie: broken arbitrarily
15:              , where for some is chosen in Step 16.
16:              
17:              ,
18:         end if
19:     end for
20:     All sample copies exit for which
21:end for
22:Return
Algorithm 1 Greedy Algorithm
Theorem 10

For Problem LABEL:defn:probstagen, let the online assignment generated by Algorithm 1 (with true value of known to it) be and the assignment be , then we have that: for any input .

The above description is complete for Algorithm 1 as long as it knows the exact values of the entries of the confusion matrices . In practice, they are unknown and need to be learnt. Suppose we use Algorithm 1 in the realistic setting when a learning module is used to estimate . In this case, the estimate of at time depends on the history of the matching decisions, which in turn depends on the prior estimate of . This joint learning and matching aspect makes the problem interesting.

Next, to handle this joint learning and matching aspect, we define a regret metric, that will compare the performance of Algorithm 1 with genie access (true knowledge of ), and another matching algorithm , where the value of is estimated from prior matching information, and then used for matching.

For an algorithm , let be the estimated value of the true at time . For the genie setting, for all . Let the output (assignments made) at time with Algorithm 1 be that uses true value of , while the output of any Algorithm at time be that uses . Then the regret for algorithm is defined as

(10)

where the total time horizon is . Our goal is to show that the for a certain learning algorithm that is obtained by prefixing a learning component with Algorithm 1.

Towards that end, next, we obtain a structural result about Algorithm 1 in Lemma 11. For a sample , let be its copies which can be assigned to distinct classifiers. Since the copies of a sample are indistinguishable, without loss of generality, we let the sample copies are assigned in order of their index, i.e., is assigned before if . Let at time , be the copy of sample with the largest index that has already been assigned to some classifier at time , and the set of classifiers that have been matched to the copies of sample , , be .

Using (3), the incremental utility of assigning next sample copy to a free classifier that is also eligible at time is Consider the difference of incremental utility of assigning a copy of sample , and of sample , to some free and eligible classifier at time , and call it . Let , where set , all pairs of contending sample copies. is the minimum gap in the difference of incremental utility of any two copies belonging to two distinct samples. This definition is similar to the difference of the expected values of the top two arms in the multi-arm bandit settings Auer et al. (2002), that controls the regret.

Let the output (assignments made) of Algorithm 1 with input and be and , respectively. In the next Lemma, we show that if ((3) with ), is close enough to for each classifier , in terms of , then the assignments and are identical,

Lemma 11

Let for the estimated value of ((3) with ), the error bound be , where , an upper bound on the weight of any sample. Then , and consequently the utilities of and are the same.

The way to think about this result is that Algorithm 1 is indifferent to errors in the input as long as they are ‘sufficiently’ small. Lemma 11 is a non-trivial result, since intuitively an algorithm need not satisfy this property. The proof crucially depends on the greedy nature (on incremental utility) of assignment decisions by Algorithm 1. We are going to use Lemma 11 to connect the decisions and utility of (genie) Algorithm 1 and (realistic) Algorithm 2 (proposed next) since effectively they only differ in their input about the values of , similar to Lemma 11.

3.2 Greedy Algorithm with Learning

Consider the following greedy matching algorithm (Algorithm 2), whose operation is divided in two phases. The first phase is dedicated for pure learning (used to learn ), where procedure Online Learn is executed for time interval . In the learning phase, for each one sample from the arriving set is randomly chosen for learning, and the remaining samples exit without being labelled. The second phase is dedicated for greedy matching (follows greedy algorithm (Algorithm 1)) assuming the the learnt values of in the first phase are in fact the true values.

1:Phase I - % Learning
2:Set of samples to be sent for learning
3:for  do
4:     set of samples arrive
5:     Randomly select one sample
6:     
7:end for
8:{ Online Learn ()
9:Phase II - % Matching
10:while  do
11:     Follow Algorithm 1 with to get
12:end while
13:Return
Algorithm 2 Greedy Algorithm With Learning

We next describe the learning algorithm Online Learn for our setup (also called the one coin model) that is the same as Algorithm 2 Zhang et al. (2016) using the following definitions. Let the set of all arriving samples be and the number of classes be . In this paper, we consider only . For any sample , be the label obtained from classifier . For any two classifiers , let and for any classifier , and , . Let if the classifier’s label for sample is . The main Theorem of this paper is as follows.

1:Input : Observed labels , number of classes
2:Initialize
3:If , then for all
4:Iteratively execute the following three steps
5:
6:   
7: Normalize such that
8:
9:Output
Algorithm 3 Online Learn
Theorem 12

Choosing , the competitive ratio of Algorithm 2 is at least .

4 Experiments

In this section, we provide the empirical study comparing the competitive ratio and the regret between Algorithm 1 and 2. Ideally, we would like to include the optimal algorithm as well, however, since it is unknown, we have to preclude that. We perform three sets of experiments, first on synthetic dataset, and next two on real datasets, the binary Bird dataset Welinder et al. (2010) which has labels of bird species, and the multi-class DOG dataset Deng et al. (2009) which has labels for the dog breeds.

For all the experiments, the utility with Algorithm 1 is computed assuming the true values of confusion matrices are available at the first slot, while for Algorithm 2, the first slots are used for learning the confusions matrices, the utility is computed over the remaining slots using the learnt confusion matrices.

Synthetic Data: We consider classifiers/workers, and samples arrive over the time horizon . We consider that each sample can belong to either of two classes , and the true label of each sample is

with equal probability. The number of incoming samples in a slot follows a Poisson distribution with rate

, and the sample weights are chosen uniformly randomly from the set . For each classifier, the confusion matrix is generated as follows: the diagonal entries is given by , , and the off-diagonal entries of are obtained as .

In Fig. 4, we plot the regret (normalized with the average number of samples arriving per time slot, and average weight of any sample to identify the scaling with ) and the competitive ratio between Algorithm 1 and Algorithm 2, as a function of time horizon , where for each value of , the best (numerically optimized) learning interval is found. Fig. 4 shows that the simulated performance is better than the theoretical results; regret is and the competitive ratio is at least .

Figure 1: Regret and competitive ratio for Algorithm 2 and Algorithm 1 with synthetic data

Real Data Sets: Next, we use real two datasets, the Bird dataset Welinder et al. (2010) that has classifiers/workers, and samples in total, and the DOG dataset Deng et al. (2009) for which we take classifiers and samples in total. We modified the multi-class DOG dataset to a binary dataset by clubbing classes as class and classes as class . In Figs. 7 and 9, we plot the regret (normalized) and the competitive ratio between Algorithm 1 and Algorithm 2, as a function of time horizon , where for each value of , the best learning interval is found. More detailed results on best etc. can be found in the Appendix.

Figure 2: Regret and competitive ratio for Algorithm 2 and Algorithm 1 with Bird data set
Figure 3: Regret and competitive ratio for Algorithm 2 and Algorithm 1 with DOG data set

5 Conclusions

In this paper, we have considered a novel online competition model between samples in a unsupervised learning setup. Each arriving sample would like to get labelled by as many classifiers as possible to increase its accuracy while incurring the smallest delay. Each classifier, however, can label at most one sample per unit time slot, and this creates a tension between accuracy of a sample and the delay it incurs to achieve that accuracy. Each classifier is characterized by its confusion matrix, which has to be learnt as part of the problem. The problem is challenging since the matching decisions that assign samples to classifiers depends on the knowledge of the confusion matrix, and the quality of estimated confusion matrix depends on the past matching decisions. We present a two phased algorithm, where in the first phase confusion matrices are learnt, which are then used by a greedy algorithm to assign samples to classifers that maximize the incremental utility. Using the submodularity of the utility function, we show that the competitive ratio of the proposed algorithm is close to .

References

  • Dawid and Skene [1979] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28, 1979.
  • Li et al. [2013] Hongwei Li, Bin Yu, and Dengyong Zhou. Error rate bounds in crowdsourcing models. arXiv preprint arXiv:1307.2674, 2013.
  • Raykar et al. [2010] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds.

    Journal of Machine Learning Research

    , 11(4), 2010.
  • Shaham et al. [2016] Uri Shaham, Xiuyuan Cheng, Omer Dror, Ariel Jaffe, Boaz Nadler, Joseph Chang, and Yuval Kluger.

    A deep learning approach to unsupervised ensemble learning.

    In International conference on machine learning, pages 30–39, 2016.
  • Karger et al. [2011] David R Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In Advances in neural information processing systems, pages 1953–1961, 2011.
  • Ho and Vaughan [2012] Chien-Ju Ho and Jennifer Wortman Vaughan. Online task assignment in crowdsourcing markets. In

    Twenty-sixth AAAI conference on artificial intelligence

    , 2012.
  • Natarajan et al. [2013] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013.
  • Zhang et al. [2016] Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I Jordan. Spectral methods meet em: A provably optimal algorithm for crowdsourcing. The Journal of Machine Learning Research, 17(1):3537–3580, 2016.
  • Whitehill et al. [2009] Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035–2043, 2009.
  • Liu and Liu [2015] Yang Liu and Mingyan Liu. An online learning approach to improving the quality of crowd-sourcing. ACM SIGMETRICS Performance Evaluation Review, 43(1):217–230, 2015.
  • Gong and Shroff [2018] Xiaowen Gong and Ness Shroff. Incentivizing truthful data quality for quality-aware mobile data crowdsourcing. In Proceedings of the Eighteenth ACM International Symposium on Mobile Ad Hoc Networking and Computing, pages 161–170, 2018.
  • Nordio et al. [2018] Alessandro Nordio, Alberto Tarable, Emilio Leonardi, and Marco Ajmone Marsan. Selecting the top-quality item through crowd scoring. ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), 3(1):1–26, 2018.
  • Karger et al. [2014] David R Karger, Sewoong Oh, and Devavrat Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.
  • Sheng et al. [2008] Victor S Sheng, Foster Provost, and Panagiotis G Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 614–622, 2008.
  • Massoulié and Xu [2016] Laurent Massoulié and Kuang Xu. On the capacity of information processing systems. In Conference on Learning Theory, pages 1292–1297, 2016.
  • Shah et al. [2020] Virag Shah, Lennart Gulikers, Laurent Massoulié, and Milan Vojnović. Adaptive matching for expert systems with uncertain task types. Operations Research, 2020.
  • Basu et al. [2019] Soumya Basu, Steven Gutstein, Brent Lance, and Sanjay Shakkottai. Pareto optimal streaming unsupervised classification. In International Conference on Machine Learning, pages 505–514, 2019.
  • Rajaraman et al. [2021] Nived Rajaraman, Rahul Vaze, and Goonwanth Reddy. Not just age but age and quality of information. IEEE Journal on Selected Areas in Communications, 39(5):1325–1338, 2021.
  • Fisher et al. [1978] Marshall L Fisher, George L Nemhauser, and Laurence A Wolsey. An analysis of approximations for maximizing submodular set functions?ii. In Polyhedral combinatorics, pages 73–87. Springer, 1978.
  • Berend and Kontorovich [2014] Daniel Berend and Aryeh Kontorovich. Consistency of weighted majority votes. In Advances in Neural Information Processing Systems, pages 3446–3454, 2014.
  • Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • Welinder et al. [2010] Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pages 2424–2432, 2010.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In

    2009 IEEE conference on computer vision and pattern recognition

    , pages 248–255. Ieee, 2009.

6 Appendix

6.1 Proof of Theorem 10

One can interpret the greedy algorithm (Algorithm 1) as where both the classifiers and the samples are arriving over time slots, and they are getting assigned sequentially in a greedy manner. This aspect precludes the use of classical results on greedy algorithms for standard online submodular maximization problems, to show that it is -competitive for Problem LABEL:defn:probstagen. However, this case has been dealt in Rajaraman et al. [2021], where a more general problem (Problem 3 Rajaraman et al. [2021]) has been considered for which a greedy algorithm (Algorithm 1 is just a special case of that) has been shown to be at least -competitive. Proof of Theorem 10 is omitted since it directly follows from Rajaraman et al. [2021].

6.2 Proof of Lemma 11

Recall that the assignments made by Algorithm 1, with input and are denoted as , and , respectively.

Let at the end of time slot , let be the set of copies of samples that have arrived till time slot , but not assigned to any classifier by the end of slot , and have not exited. Then the set of copies of samples that are outstanding at the beginning of slot is , where , where is the set of samples that arrives at the beginning of time slot . The set is updated after each iteration in which a sample copy is assigned to classifier in time slot , i.e., if the copy of sample , is assigned to some classifier, then .

Note that the copies of any sample are indistinguishable. Thus we let that they are assigned to classifiers in increasing order of their index, for sample .

We consider some iteration of the Algorithm 1 that happens at time slot , such that the assignments and are identical until the previous iteration. Thus, the following holds. For a sample , let be the smallest index for which its copy has not been assigned to any classifier, and , . Thus, the copy of sample is part of the outstanding set of copies in the current iteration and all the previous copies of have already been assigned to some distinct classifiers in the past.

Let be the (distinct) set of classifiers that Algorithm 1 or 2 (it is the same since they are have made identical assignment till now) has assigned the first copies, of sample . Since we are interested in specific time slot , for brevity, we drop the time index from , for the rest of the proof. Let be the time slot in which the latest sample copy for sample is assigned to some classifier.

Recall that the incremental gain in matching sample copy with classifier at time slot , is

(11)

On any iteration, Algorithm 1 matches classifiers in decreasing order of (accuracy) to outstanding sample copies that maximizes the incremental gain . Note that for , true value of and is used, while for the estimated accuracy is used to define .

Recall the definition of , i.e., , where set , all pairs of contending sample copies.

Property 1: Note that if , where , then since , i.e. the correct order of accuracy of classifiers is discovered with . Since we assume that , clearly, . Thus, in any time slot , the order in which classifiers (decreasing order of accuracy) are assigned by and is the same.

Case I : In the considered iteration of the Algorithm 1, let be the classifier that is going to be assigned to some sample copy. If (the set of outstanding samples in this iteration) consists of copies only belonging to the same sample , then trivially, in both and , sample copy is assigned to classifier . Thus, and remain the same, if they were identical before this iteration.

Case II: Thus, the non-trivial case is when consists of copies belonging to different samples. For any two distinct samples and , let be two copies belonging to different samples which are contending for classifier . Next, we show that if classifier is assigned to in , so is the case in . Since Algorithm 1 assigns classifier at time slot to , it means that

(12)

Using (11), (12) we have that for

(13)

To claim the result, we intend to show that for