Crowdsourcing can be a scalable approach to collecting data for tasks that require human knowledge such as image recognition and natural language processing. Through crowdsourcing platforms such as Amazon Mechanical Turk, a large number of data tasks can be assigned to workers who are asked to give binary or multi-class labels. The goal of much of crowdsourcing research is to estimate the unknown ground truth, given that the quality of the workers can be variable. Indeed, due to the high variability of worker skills, aggregating true labels becomes a challenging problem.
One straightforward approach is to directly estimate the unknown labels by majority voting over the information provided by the workers. An implicit assumption in this approach is that all workers have identical skills on each task; one might instead expect that answers from reliable workers are more likely to be accurate. In practice, the crowd is often highly heterogeneous in terms of skill levels, and downweighting unskilled workers while upweighting skilled workers can have a significant impact on performance. Many aggregation methods, ranging from the weighted majority vote to more complex schemes that incorporate worker quality and accuracy, have been proposed. Theoretically, recent works (Berend and Kontorovich, 2014; Szepesvári, 2015) have investigated the importance of having precise knowledge of skill quality for accurate prediction of ground-truth labels. Moreover, accurate skill estimation can also be useful for other purposes, such as worker training, task assignment, or worker-compensation schemes.
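As a quick numerical illustration of why weighting matters, the following sketch (with assumed skill values; the single-coin generative model is described formally later) compares plain majority voting with the log-odds weighted majority vote on simulated binary tasks:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_labels(skills, truth, rng):
    # Single-coin model: worker a answers correctly with probability (1 + s_a) / 2.
    correct = rng.random((len(skills), len(truth))) < (1 + skills[:, None]) / 2
    return np.where(correct, truth[None, :], -truth[None, :])

skills = np.array([0.9, 0.2, 0.1, 0.1, 0.1])    # assumed: one expert, four near-spammers
truth = rng.choice([-1, 1], size=2000)
Y = simulate_labels(skills, truth, rng)

majority = np.sign(Y.sum(axis=0))               # unweighted majority vote
weights = np.log((1 + skills) / (1 - skills))   # log-odds of each worker's skill
weighted = np.sign(weights @ Y)                 # log-odds weighted majority vote

acc_mv = (majority == truth).mean()
acc_wmv = (weighted == truth).mean()
```

With these skills, the weighted vote is dominated by the single accurate worker and clearly beats the unweighted vote, illustrating the value of (estimates of) worker skills.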
There are two challenges in estimating the skills of workers, given that the problem setup is unsupervised. The first challenge is to construct a skill model for each worker. Many papers achieve empirical success by applying the Dawid-Skene (DS) model (Dawid and Skene, 1979), a simple model parameterized by the probability that a worker answers with the true label. In this paper, the basis of our work is the homogeneous DS model, where each worker is assumed to have the same skill level on each class. More specifically, we focus on the single-coin (DS) model for the binary crowdsourcing problem (though in Section 4, we extend our algorithm to multiclass problems).
The second challenge is that, in practice, workers are often available only for a short period of time, which means that only a small subset of the data is labeled by each worker. This introduces a sparse worker-task assignment (Karger et al., 2013; Dalvi et al., 2013). An additional subtle issue is the lack of diversity in the interactions between workers: a worker is often grouped with a limited subset of workers across all tasks. This situation is remarkably evident on benchmark datasets: the ’Web’ dataset has workers, with to workers/task, and each worker on average interacts with only about other workers, while the standard deviation of how many workers a worker interacts with is . The ’RTE’ dataset has workers, with only workers/task on average, and each worker interacts with fewer than other workers, while the standard deviation of the interaction degree is . This is in contrast with most existing crowdsourcing research, which has only considered estimating skills from nearly complete data. We are therefore motivated by the need to make spectral methods suitable for the non-regular worker-task data often seen in practice.
In this paper, we suppose that the input comes in the form of a sparsely filled worker-task label matrix. The workers possess unique unknown skills, and tasks assume unique unknown labels. The worker-task label matrix collects the random labels provided by the workers for individual tasks. The skill level of a worker is the (scaled) probability of the worker’s label matching the true unknown label for any of the tasks. The observed labels are independent of each other.
The optimal way to reconstruct the unknown labels is to use weighted majority voting, where the weight assigned to the label provided by a worker is equal to the log-odds underlying the worker’s skill. Since skill levels are unknown, we follow prior works (Dalvi et al., 2013; Berend and Kontorovich, 2014; Szepesvári, 2015; Bonald and Combes, 2016) and adopt a two-step approach, whereby worker skills are first estimated and then these skills are used with the optimal weighting method to recover labels. Our main contributions are as follows:
We construct a skill estimator under the single-coin model by formulating estimation as a weighted least-squares rank-one matrix completion/factorization problem. The matrix being factored is the correlation matrix among the workers, with the weights compensating for the varying accuracy of the inter-worker correlation estimates.
We show that skills can be recovered from the observed data matrix whenever the worker-worker interaction graph does not contain a bipartite connected component. In particular, for any crowdsourcing problem with a non-bipartite worker-worker interaction graph, there always exists a method to estimate the true skills.
To minimize the objective function, we propose to use projected gradient descent. We give natural and mild conditions on the weighting matrix under which we prove that projected gradient descent, despite the objective being non-convex, is guaranteed to find the rank-one decomposition of the true moment matrix and hence to converge to the true skills.
We extend our algorithm to the multiclass case by applying the homogeneous DS model. Under this model, we prove that any multiclass problem can be formulated as a weighted least-squares rank-one problem in which the unknown variable is a linear function of the true skills.
Our approach is also of independent interest, as we derive a fundamental result about symmetric rank-one matrix completion: the unobserved entries can be recovered by gradient descent in polynomial time whenever the sampling matrix is irreducible and non-bipartite. Our results for convergence of the proposed gradient descent scheme should be somewhat surprising given that the related weighted low-rank factorization problem is known to be NP-hard even for the rank-one case (Gillis and Glineur, 2011). In contrast to our approach, existing results in low-rank matrix completion require strong assumptions on the weighting matrix, typically some form of incoherence, e.g., (Ge et al., 2016).
2 Related Work
Discriminative Approach: In contrast to our two-step approach, several works adopt a discriminative method for label prediction. Specifically, Li and Yu (2014); Tian and Zhu (2015) directly identify true labels by various aggregation rules that incorporate worker reliability.
Skill Estimation: As mentioned earlier, we study the problem of estimating skills under the single-coin model. Past approaches to skill estimation are based on maximum likelihood/maximum a posteriori (ML/MAP) estimation, moment matching, or a combination of the two. In particular, various versions of the EM algorithm have been proposed to implement ML/MAP estimation, starting with the work of Dawid and Skene (1979). Variants and extensions of this method, tested in various problems, include Hui and Walter (1980); Smyth et al. (1995); Albert and Dodd (2004); Raykar et al. (2010); Liu et al. (2012). A number of recent works were concerned with performance guarantees for Expectation Maximization (EM) and some of its variants (Gao and Zhou, 2013; Zhang et al., 2014; Gao et al., 2016). Another popular direction is to add priors over worker skills, labels, or worker-task assignments. To properly deal with the extra information, various Bayesian methods (belief propagation, mean-field and variational methods) have been considered (Raykar et al., 2010; Karger et al., 2011; Liu et al., 2012; Karger et al., 2013, 2014). Moment matching is also widely used (Ghosh et al., 2011; Dalvi et al., 2013; Zhang et al., 2014; Gao et al., 2016; Bonald and Combes, 2016; Zhang et al., 2016). With the exception of Bonald and Combes (2016)
, who propose an ad-hoc method, the algorithms in these works use matrix or tensor factorization.111While Ghosh et al. (2011) pioneered the matrix factorization approach, their work is less relevant to this discussion as they estimate the labels directly.
In theory, an ML/MAP method that is guaranteed to maximize the likelihood/posterior is the ideal method to accommodate irregular worker-task assignments. However, as far as we know, none of the existing algorithms, unless initialized with a moment-matching-based spectral method, is proven to find a satisfactory approximate maximizer of the objective it maximizes (Zhang et al., 2016). At the same time, moment matching methods that use spectral (and, in general, algebraic) algorithms implicitly assume the regularity of worker-task assignments, too. Indeed, the approach of Ghosh et al. (2011) crucially relies on the regularity of the worker-task assignment (as the method proposed uses unnormalized statistics). In particular, this method is not expected to work at all on non-regular data. Other spectral methods, being purely algebraic, implicitly treat all entries in the estimated matrices and tensors as if they had the same accuracy, which, in the case of irregular worker-task assignments, is far from the truth. The need to explicitly deal with data of unequal accuracy is a widely recognized issue with a long history in the low-rank factorization community, going back to the work of Gabriel and Zamir (1979). Starting with this work, the standard recommendation has been to reformulate the low-rank estimation problem as a weighted least-squares problem (Gabriel and Zamir, 1979; Srebro and Jaakkola, 2003). In this paper, we will also follow this recommendation.
While Dalvi et al. (2013)
also use a weighted least-squares objective, this is not by choice: their weighting arises from the need to normalize the data rather than from an attempt to correct for its varying accuracy. Furthermore, rather than considering the direct minimization of the resulting objective, they use two heuristic approaches that also rely on an unweighted spectral method.
In this light, our goal is to make spectral methods suitable for non-regular worker-task data often seen in practice.
Matrix Factorization/Completion: Unlike the general matrix factorization problem arising in recommender systems (Koren et al., 2009), we are primarily concerned with a rank-one estimation of square symmetric matrices. Existing results on matrix completion (Ge et al., 2016) for square symmetric matrices are more general but require stronger assumptions on the matrix such as incoherence and random sampling.
Notation and conventions: The set of reals is denoted by , and the set of natural numbers, which does not include zero, is denoted by . For , . Empty sums are defined as zero. We will use to denote the probability measure over the measure space holding our random variables, while will be used to denote the corresponding expectation operator. For , we use to denote the -norm of vectors. Further, stands for the -norm, and is the Frobenius norm. The cardinality of a set is denoted by . For a real-valued vector , denotes the vector whose th component is . Proofs of new results missing from the main text are given in the appendix.
3 Formal problem statement
We first consider binary crowdsourcing tasks where a set of workers provide labels for a large number of items. Let be a positive integer denoting the number of workers. A problem instance is given by: a skill vector associating the skill level with worker ; the worker-task assignment set , which captures which workers provide labels on which tasks; and the vector of “ground truth labels” , which are unknown and which we would like to estimate.
When for some , we say that is a finite instance with tasks; otherwise we say that is an infinite instance. We allow infinitely many tasks so that we can discuss asymptotic identifiability.
It will be convenient to use to denote the set of all possible problem instances, defined as above. For any instance , the worker-task assignment set provides important information about worker interaction structure. Indeed, we can think of two workers as “interacting” if they provide a label for the same task. Formally, we define the interaction graph as follows.
[Interaction Graph] Let be a worker-task assignment set. The (worker) interaction graph underlying is an undirected graph with vertex set such that with if there exists some task such that both and are elements of . In the case of an infinite instance, the interaction graph is an unweighted graph, where an edge does not have any weight associated with it. For finite instances, it will make sense to assign to the edge a weight equal to the number of tasks shared by workers and .
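For concreteness, the weighted interaction graph can be built directly from a worker-task assignment set; the sketch below (with a hypothetical dictionary representation mapping each task to the workers who label it) counts, for each worker pair, the number of shared tasks:

```python
import numpy as np

def interaction_counts(assignments):
    """Build the weighted interaction graph: entry [a, b] is the number of
    tasks labeled by both workers a and b (a sketch; the {task: [workers]}
    dict representation is an assumption made for illustration)."""
    W = 1 + max(w for ws in assignments.values() for w in ws)
    N = np.zeros((W, W), dtype=int)
    for workers in assignments.values():
        for a in workers:
            for b in workers:
                if a != b:
                    N[a, b] += 1
    return N

# Tasks 0 and 3 are shared by workers 0 and 1, so they get edge weight 2.
A = interaction_counts({0: [0, 1], 1: [1, 2], 2: [0, 2], 3: [0, 1]})
```

An edge exists between two workers exactly when the corresponding count is positive, matching the definition above.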
Our goal is to recover the ground truth labels given observations , where is a random variable associated with worker and task . According to the single-coin model, the observations are generated as , where is a collection of mutually independent random variables that satisfy . Note that this is the same as assuming that worker returns with probability and with probability .
Thus a worker with always returns the ground truth, while a worker with always returns the opposite of the ground-truth label; and a worker with always provides a random variable with zero expectation regardless of the ground truth, i.e., a uniformly random label.
Remark: As will be discussed in detail later, some additional assumptions on the skill vector will be needed for accurate estimation of ground truth labels; obviously, little can be done if is the zero vector, i.e., if every worker returns uniformly random labels irrespective of the ground truth. Another obvious observation is that, in the event that all workers agree, we cannot distinguish the possibility that is proportional to the all-ones vector (and every worker provides the right label) from the possibility that is proportional to the negative of the all-ones vector (and every worker provides the wrong label). One way to get around this problem is to assume that , i.e., all workers have at least some skill; we will make this assumption when analyzing our projected gradient descent method. A weaker approach is to assume that , i.e., that, on the net, the workers are collectively more prone to return correct rather than incorrect labels; as we discuss later, this is sufficient for identifiability.
A (deterministic) inference method underlying an assignment set takes the observations and returns a real-valued score for each task in ; the signs of the scores give the label-estimates. Formally, we define an inference method as a map , where given , , the component of is the score inferred for task . Inference methods are aimed at working with finite assignment sets. To process an infinite assignment set, we define the notion of inference schema. In particular, an inference schema underlying an infinite assignment set is defined as the infinite sequence of inference methods such that is an inference method for .
When important, we will use the subindex in to denote the probability when the problem instance is . We will use to denote the corresponding expectation operator. With this notation, the expected loss suffered by an inference schema on the first tasks of an instance is
The optimal inference schema for an assignment set given the knowledge of the skill vector is denoted by . The next section gives a simple explicit form for this optimal schema. The average regret of an inference schema for an instance is its excess loss on the instance as compared to the loss of the optimal schema:
If the average regret converges to zero, then the loss suffered by asymptotically converges to the loss of the optimal inference. Based on this, we define asymptotic consistency and learnability: [Consistency and Learnability] An inference schema is said to be (asymptotically) consistent for an instance set if, for any , . An instance set is (asymptotically) learnable if there is a consistent inference schema for it.
3.1 Two-Step Plug-in Approach
In this work we will pursue a two-step approach based on first estimating the skill vector
and then utilizing a plug-in classifier to predict the ground-truth labels. The motivation for a two-step approach stems from existing results that characterize accuracy in terms of skill estimation errors. For the sake of exposition, we recall some of these results now.
It has been shown in Li and Yu (2014) that the optimal classifier is log-odds weighted majority voting, given by the MAP rule. Suppose the prior distribution of the true labels is the uniform distribution over; then the Bayes classifier is well known to be the optimal classifier (Duda et al., 2001), i.e.,
where the third equation follows from the assumption that , and the fourth equation, after some algebra, is a compact way to write the MAP estimator. Notice that is a function of only one parameter, namely the skill vector .
Regarding the loss of the optimal schema, we start with a result for the case when the skills are known in advance. In this case, Berend and Kontorovich (2014) provide an upper error bound, as well as an asymptotically matching lower bound, stated as follows: For any task , the optimal decision rule satisfies
where is called the committee potential.
In reality, however, we do not know the skills of the workers, and thus the true optimal inference classifier is unknown to us. One natural way forward is to construct a true-label inference rule that approximates the optimal Bayes classifier by estimating the workers’ skills. Fortunately, in addition to the case of known skills, Szepesvári (2015); Berend and Kontorovich (2014) also provide an error bound when the skills are only estimated: For any , the loss with estimated weights satisfies
In turn, the error can be bounded in terms of the multiplicative norm-differences in the skill estimates (see Berend and Kontorovich (2014)): Suppose then .
These results together imply that a plug-in estimator with a guaranteed accuracy on the skill levels, in turn, leads to a bound on the error probability of predicting ground-truth labels. This motivates the skill estimation problem, which we consider in the remainder of this paper.
4 Weighted Least-Squares Estimation
In this section, we propose an asymptotically consistent skill estimator for potentially sparse worker-task assignments. We are motivated by the scenario when, for most workers, only a very small portion of tasks are assigned to them. This induces not only an extremely sparse worker-task assignment graph, but more importantly a sparse worker-worker interaction graph.
Recall that given a problem instance , the data of the learner is given by the matrix which is a collection of independent binary random variables such that and . When is finite, we define to be the matrix whose th entry with gives the number of times the workers and labeled the same task:
and we also let . Note that there is an edge between workers and in the interaction graph exactly when .
When is infinite, may be infinite. In this case, for we also define to denote the number of times workers and provide a label for the same task in the first tasks, and similarly we let for all .
The starting point of our approach is the following observation about the single coin model: the expected correlation between each pair of workers is ground-truth independent. Indeed,
where the second equation used that .
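This ground-truth independence of the pairwise correlations is easy to check numerically; the following sketch (with assumed skill values, including an adversarial worker with negative skill) estimates the worker-worker correlation matrix from simulated labels:

```python
import numpy as np

rng = np.random.default_rng(1)
skills = np.array([0.8, -0.4, 0.3])   # assumed skills; worker 1 is adversarial
truth = rng.choice([-1, 1], size=200_000)

# Single-coin model: worker a is correct with probability (1 + s_a) / 2.
correct = rng.random((3, truth.size)) < (1 + skills[:, None]) / 2
Y = np.where(correct, truth[None, :], -truth[None, :])

C_hat = (Y @ Y.T) / truth.size        # empirical worker-worker correlations
# Off-diagonal entries approach s_a * s_b, whatever the ground truth was.
```

The off-diagonal entries concentrate around the products of the corresponding skills, which is exactly the rank-one structure exploited below.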
This observation motivates estimating the skills using
Note that the number of terms containing the skill estimate of a particular worker in this objective scales with the number of other workers this worker works with. Intuitively, this should feel “right”: the more a worker works with others, the more information we should have about that worker’s skill level.
As it turns out, there is an alternative form of this objective, which is also very instrumental and which will form the basis of our algorithm and also of our analysis. To introduce this form, define and let its empirical estimate be
An alternative form of the objective in Eq. (2) is given by the following result: Let be defined by
The optimization problem of Eq. (2) is equivalent to the optimization problem
. The proof, which is simple algebra showing that the two objective functions are equal up to a constant shift, is given in Appendix A.
The objective function from Section 4 can be seen as a weighted low-rank objective, first proposed by Gabriel and Zamir (1979). Clearly, the objective prescribes approximating using , with the error in the th entry scaled by . Note that this weighting is reasonable, as the variance of is proportional to , and from the theory of least squares we expect that, in an objective combining multiple terms with heteroscedastic data (unequal variances), the terms should be weighted by the inverses of the data variances. The weighting matrix can in general be full-rank, and in this case the general weighted rank-one approximation problem is known to be NP-hard (Gillis and Glineur, 2011).
However, our data has special structure, which will allow one to avoid the existing hardness results. Indeed, on the one hand, as the number of data points increases, will be near rank-one itself; and, on the other hand, we will put natural restrictions on the weighting matrix which are in fact necessary for identifiability. These conditions will allow us to avoid the NP-hardness results of Gillis and Glineur (2011).
4.1 Plug-in Projected Gradient Descent
To solve the weighted least-squares objective, the simplest algorithm is the gradient descent algorithm. In addition, we must ensure our estimated skills are always in the feasible set, i.e., . To address this, we propose a Projected Gradient Descent (PGD) algorithm (cf. Algorithm 1) that sequentially updates the skill level based on following the (negative) gradient of the loss at each time step:
where is a projection function (i.e., it truncates its input so that it belongs to the interval onto which it projects), is the step size, is the number of tasks labeled by worker , and is a tuning parameter.
The purpose of the projection is to stay away from the boundary of the hypercube, where the log-odds function changes very rapidly. The justification is that skills close to one have an overwhelming impact on the plug-in rule of Eq. (1). By Hoeffding’s inequality, the skill estimates are expected to have an uncertainty proportional to with probability . There is thus little loss in accuracy in confining the parameter estimates to the appropriately reduced hypercube. While in principle one could tune this parameter, we use in this paper.
As noted earlier, we will be making the assumption that the unknown skill vector satisfies . Thus, line 11 of the algorithm reverses the sign of the skill vector estimate found if necessary to ensure that the estimate also has the property that the total skill level is positive. Note that our theoretical results for this method will be proved under the weaker assumption that .
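A minimal sketch of the PGD update on the weighted rank-one objective might look as follows (the constant step size, iteration count, and initialization are heuristic illustration choices, not the paper's tuning; the final sign flip implements the positive-total-skill convention):

```python
import numpy as np

def estimate_skills(C_hat, N, iters=2000, delta=0.05):
    """Projected gradient descent sketch for the weighted rank-one objective
    f(s) = sum_{a<b} N[a,b] * (C_hat[a,b] - s_a * s_b)^2."""
    W = C_hat.shape[0]
    eta = 1.0 / (2.0 * N.sum())               # conservative heuristic step size
    s = np.full(W, 0.5)                       # initialize in the positive orthant
    off = ~np.eye(W, dtype=bool)              # diagonal entries carry no information
    for _ in range(iters):
        resid = N * (np.outer(s, s) - C_hat)  # weighted residuals
        grad = 2.0 * (resid * off) @ s
        s = np.clip(s - eta * grad, -1 + delta, 1 - delta)  # projection step
    return s if s.sum() > 0 else -s           # enforce positive total skill

# Noiseless sanity check: off the diagonal, C_hat is exactly rank one.
s_true = np.array([0.8, 0.6, 0.5, 0.4])
C = np.outer(s_true, s_true)
np.fill_diagonal(C, 0.0)
N = 100.0 * (np.ones((4, 4)) - np.eye(4))     # complete, hence non-bipartite, graph
s_est = estimate_skills(C, N)
```

In this noiseless example the iterates converge to the true skill vector, consistent with the convergence guarantees discussed in Section 5.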
4.2 An Extension to Multi-class Classification
We now briefly describe how our approach may be extended to the case when the labels are not binary. Above, we have shown how the binary case may be reduced to a noisy rank-one matrix completion problem as in Lemma 4. Here we show how the same approach can be used for the multiclass case.
As before, we suppose that workers are asked to provide labels to a series of -class classification tasks whose ground truths
are unknown. We will use a one-hot encoding of the ground truths, i.e., will be expressed as .
We will associate a skill level with every worker using a homogeneous Dawid-Skene model, where each worker is assumed to have the same accuracy and error probabilities on each class. Formally, worker provides label with probability
Similarly to the binary case, Li and Yu (2014) showed that the optimal prediction method under the homogeneous Dawid-Skene model is weighted majority voting. More specifically, when are known, the oracle MAP rule is
where . The proof of this can be obtained by following the same line as in Section 3.1.
In order to construct the weighted majority voting model, we extend the PGD algorithm to handle multi-class tasks by showing that the skill estimation problem is still a rank-one matrix completion problem, as follows. Let us define skill levels
and noisy covariances
Since the random vectors and are independent, we can write the expectation of the inner product of and as
This follows because the inner product is one only if , for some label , and the probability of this is either or depending on whether or .
A simple algebraic manipulation gives the following
We thus have
As a consequence of this lemma, if we define
then can be estimated by solving a rank-one matrix completion problem with the objective function
As previously, in the limit as , the rank-one problem is an exact match for the problem of recovering the skills. When is finite, we are in the “noisy” regime, where can be thought of as a noise-corrupted version of the true rank-one matrix , with the amount of noise decaying to zero as .
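The rank-one structure of the agreement probabilities can be checked numerically. Under the standard homogeneous DS parameterization (an assumption for this illustration: worker a is correct with probability pi_a and each wrong label is equally likely), the agreement probability of two workers satisfies E = 1/k + ((k-1)/k) * mu_a * mu_b, with the scaled skill mu_a = (k*pi_a - 1)/(k - 1):

```python
import numpy as np

rng = np.random.default_rng(2)
k, T = 3, 200_000
pi = np.array([0.8, 0.6])              # assumed per-worker accuracies
truth = rng.integers(k, size=T)

def answers(p):
    # Correct with probability p; otherwise uniform over the k - 1 wrong labels.
    correct = rng.random(T) < p
    wrong = (truth + rng.integers(1, k, size=T)) % k
    return np.where(correct, truth, wrong)

agree = (answers(pi[0]) == answers(pi[1])).mean()  # empirical agreement rate
mu = (k * pi - 1) / (k - 1)                        # scaled skill levels
predicted = 1 / k + (k - 1) / k * mu[0] * mu[1]    # rank-one identity
```

After the affine correction (subtracting 1/k and rescaling), the pairwise statistics are again products of per-worker quantities, which is what makes the multiclass problem a rank-one completion problem.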
5 Theoretical Results
Up to now, we have shown how label inference can be reduced to the problem of skill estimation, and addressed the skill estimation problem as a sparse rank-one matrix factorization problem (with noise). In this section, we analyze which properties of the interaction graph ensure learnability as the number of tasks approaches infinity. Subsequently, we analyze the convergence properties of the PGD algorithm for finite tasks.
We start with the analysis for the infinite instance where (see Eq. (3) for a definition). There are different ways to let the number of tasks approach infinity while keeping an interaction graph fixed.
Case A: For a fixed interaction graph we can consider assignment sets such that the minimum number of shared tasks, approaches infinity. Learnability in this context is a property of the interaction graph.
Case B: We can also consider an infinite assignment set and define as the graph where two workers are connected by an edge if . In other words, we define connectivity based on whether two workers interact finitely or infinitely many times.
We will follow the second approach as it is slightly more general than the first (the second approach allows assignment sets where some workers interact only finitely many times, while the first approach does not allow such assignment sets). Thus, we fix an assignment set , we let be the set of instances sharing assignment set , and we will consider the learnability of subsets .
To express complete ignorance towards the true unknown labels assigned to tasks, we will consider which are truth-complete: informally, this means that places no constraints on what the ground truth could be. Formally, truth completeness means that, for any , we require where . Truth-completeness expresses that there is no prior information about the unknown labels.
As discussed before, the inference problem is inherently symmetric: the likelihood assigned to some observed data under an instance is the same as under the instance . Thus, an instance set cannot be learnable unless somehow these symmetric solutions are ruled out.
Expressing the condition this forces us to adopt requires a few more definitions. In particular, given we let be the set of skill vectors that are present in at least one instance in . For a skill vector we let be the set of workers whose skills are positive and we let be the (incomplete) partitioning of workers into workers with positive and negative skills; note that workers with zero skill are left out.
With these definitions in place, we will say that is rich if there exists and such that (in other words, there must exist and such that we can scale each component of by either or and remain in ). This is a fairly mild condition; it is satisfied if, for instance, there is some point in such that a small open set around that point is fully contained in .
Richness is required so that there is sufficient ambiguity about the skills. Indeed, if richness is not satisfied, then either every skill vector in consists only of spammers and hammers (i.e., for all ) or has a specific structure. This structure could potentially be exploited by an algorithm. One might say that assuming richness requires an algorithm to be agnostic to any specific structural knowledge of the skill vectors.
We are now ready to state our first main result, which characterizes when rich, truth-complete sets are learnable.
[Characterization of learnability] Fix an infinite assignment set and assume that is connected. Then, a rich, truth-complete set of instances over is learnable if and only if the following hold:
For any such that and , it follows that ;
The graph is non-bipartite, i.e., it has an odd cycle.
Condition (i) requires that any should be uniquely identified by and knowing which components of have the same sign and which components are zero. For example, this condition will be met if is restricted so that it only contains skill vectors that have a positive sum. For an explanation of why such an assumption is needed, see the (boldfaced) remark in Section 3.
We remark that if the graph is not connected, we can simply apply this theorem to each of its connected components. For example, in the situation where none of the workers have shared a task with any of the workers , one could try to simply recover the skills of workers from their common tasks and then the skills of from their common tasks. This allows us to drop the condition in the theorem that be connected, at the expense of changing (ii) to the assertion that none of the connected components of should be bipartite.
The forward direction of the theorem statement hinges upon the following result, which is proved in the appendix: For any and assignment set with a connected, non-bipartite interaction graph , there exists a method to recover and . The reverse implication in the theorem statement follows from the following result: Assume that the lengths of all cycles in are even. Then there exists , such that . Learnability for Finite Tasks: We mention in passing that asymptotic learnability is a fundamental requirement, which, if not met, precludes any reasonable finite-time result. In consequence, no inference schema achieves zero regret in this case.
5.2 Convergence of the PGD Algorithm
The previous section established that for learnability the limiting interaction graph must be a non-bipartite connected graph. We will now show that PGD under these assumptions converges to a unique minimum for both the noisy and noiseless cases.
By the noiseless case, we mean that in the loss of Section 4 we set for . That is, we have an infinite number of common tasks with which to estimate , which will then equal the expected value . In reality, however, we always suffer from estimation error (i.e., ), which leads to a more troublesome problem than exact rank-one matrix completion. We also provide an analysis of our PGD algorithm in this “noisy” case.
Note that the (non-bipartite) odd-cycle condition, together with the connectedness of , gives that the worker-interaction count matrix is irreducible and aperiodic. The signless Laplacian matrix is then defined as
By contrast, we will use to denote the matrix whose ’th entry is ; the matrix will thus have zero diagonal.
The matrix contrasts with the usual Laplacian because the off-diagonal elements have positive signs. It can be shown that if the graph is not bipartite, the matrix is positive definite (Desai and Rao, 1994a). In fact, the following stronger assertion is true. We will use
to denote the smallest eigenvalue of the signless Laplacian matrix of a non-bipartite graph with unit weights; we remark that it is a consequence of the results of Desai and Rao (1994a) that (where, recall, is the number of workers, so that the matrices and are ). Finally, we let be the smallest positive weight among .
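The role of bipartiteness is easy to see numerically. The sketch below (with unit edge weights) computes the smallest eigenvalue of the signless Laplacian Q = D + A for a triangle (an odd cycle, hence non-bipartite) and for a 4-cycle (bipartite); the eigenvalue is strictly positive in the first case and exactly zero in the second, as the cited results of Desai and Rao predict:

```python
import numpy as np

def signless_laplacian_min_eig(A):
    # Q = D + A: unlike the usual Laplacian, off-diagonal entries keep a positive sign.
    Q = np.diag(A.sum(axis=1)) + A
    return float(np.linalg.eigvalsh(Q)[0])  # eigvalsh returns ascending eigenvalues

triangle = np.array([[0, 1, 1],
                     [1, 0, 1],
                     [1, 1, 0]], dtype=float)   # odd cycle: non-bipartite
square = np.array([[0, 1, 0, 1],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 1, 0]], dtype=float)  # 4-cycle: bipartite
```

For the triangle the smallest eigenvalue is 1, while for the bipartite 4-cycle it vanishes, which is exactly the degeneracy that makes bipartite components unlearnable.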
We now show that under the condition that is non-bipartite, the loss has a unique minimum and the PGD algorithm recovers the skill vector. For technical convenience, our theorem below considers recovering the absolute values of the skills .
Post-Processing for Sign Recovery: Note that recovering the absolute values also lets us recover the skill values together with their sign pattern. This is because one of our assumptions is that . Without loss of generality, we can arbitrarily assign positive skill () to the first worker and then identify all workers that agree with the first worker in sign. Once we have partitioned the workers into two groups, namely, those with the same sign as the first worker and those with the opposite sign, the true signs can be assigned based on which group is in the majority.
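The sign-recovery step above can be sketched as follows (a toy illustration: `recover_signs` and its majority-flip convention are our own naming, and we assume the off-diagonal products are observed exactly):

```python
import numpy as np
from collections import deque

def recover_signs(C, adj):
    """Propagate signs over the interaction graph using sign(C[i,j]) = sign(s_i)*sign(s_j).
    Worker 0 is arbitrarily set positive; BFS assigns the rest; finally flip globally
    so that the positive group is the majority (the assumption on the skill vector)."""
    m = C.shape[0]
    sign = np.zeros(m)
    sign[0] = 1.0
    q = deque([0])
    while q:
        i = q.popleft()
        for j in range(m):
            if adj[i, j] and sign[j] == 0:
                sign[j] = sign[i] * np.sign(C[i, j])
                q.append(j)
    if (sign > 0).sum() < m / 2:  # majority rule resolves the global sign ambiguity
        sign = -sign
    return sign

# Toy example on a triangle graph with true skills s = (0.8, -0.5, 0.6).
s = np.array([0.8, -0.5, 0.6])
C = np.outer(s, s)
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
signs = recover_signs(C, adj)
print(signs)  # [ 1. -1.  1.]: matches the true sign pattern
```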
The PGD algorithm of Sec. 1 for , when initialized in the positive orthant and combined with the post-processing step described above, converges in polynomial time to the global minimum under condition (ii) of Theorem 5.1 in the noiseless case, and the skill vectors can be uniquely recovered under conditions (i) and (ii) of Theorem 5.1.
If the underlying graph is not connected, we can, as remarked earlier, simply apply this theorem to each of the connected components. The key requirement of Theorem 5.1 – that the underlying graph is not bipartite – then becomes the requirement that none of the connected components of is bipartite.
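Checking this per-component requirement is straightforward; the sketch below (our own helper, assuming the graph is given as a 0/1 adjacency matrix) labels components by BFS and tests each one for bipartiteness via 2-coloring:

```python
import numpy as np
from collections import deque

def components_bipartite(adj):
    """Label connected components via BFS with 2-coloring.
    A component is bipartite iff the coloring never conflicts along an edge."""
    m = adj.shape[0]
    comp = -np.ones(m, dtype=int)
    color = np.zeros(m, dtype=int)
    bipartite = []
    c = 0
    for s0 in range(m):
        if comp[s0] != -1:
            continue
        comp[s0], bip = c, True
        q = deque([s0])
        while q:
            i = q.popleft()
            for j in np.nonzero(adj[i])[0]:
                if comp[j] == -1:
                    comp[j] = c
                    color[j] = 1 - color[i]
                    q.append(j)
                elif color[j] == color[i]:
                    bip = False  # odd cycle found within this component
        bipartite.append(bip)
        c += 1
    return comp, bipartite

# Disjoint union of a triangle (odd cycle: learnable) and a 4-cycle (bipartite: not).
G = np.zeros((7, 7), dtype=int)
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (5, 6), (3, 6)]:
    G[a, b] = G[b, a] = 1
comp, flags = components_bipartite(G)
print(comp, flags)  # component labels; only the second component is bipartite
```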
Theorem 5.2 follows from the following lemma.
Suppose is a non-bipartite connected graph and the true skill levels lie in the cube , where . Consider the PGD algorithm with the step-size
and initial condition . Then, after iterations (where the -notation hides factors depending on ), we have that .
Lemma 5.2 shows that our PGD algorithm converges in polynomial time (recall that from Desai and Rao (1994a) as long as the worker interaction graph is connected and non-bipartite). While we expect that our bound is quite loose, and better scalings with the graph and with the number of workers are possible, in this work we are content to obtain a polynomial-time convergence rate for this problem.
Note that our approach requires all skills to be bounded away from zero: this is part of the assumption that belongs to the cube with being a strictly positive number. This assumption is necessary, and seems to be a limitation of using (plain) gradient descent to approach this problem: , so that is a fixed point of gradient descent regardless of the data. As a consequence, if the algorithm starts off at a point that is very close to the origin, it will take a long time to “escape”; this is why our bounds depend inversely on .
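The following minimal sketch illustrates the projected gradient descent dynamics on a toy instance, assuming the noiseless single-coin setup where the observed pairwise statistics equal the products of skills (the loss, step size, and cube bounds here are illustrative choices, not the paper's exact constants):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
s_true = rng.uniform(0.3, 0.9, size=m)  # true skills, bounded away from zero
adj = 1.0 - np.eye(m)                   # complete graph: connected and non-bipartite
C = np.outer(s_true, s_true)            # noiseless pairwise statistics C_ij = s_i * s_j
low, high = 0.1, 1.0                    # projection cube [low, high]^m

def grad(v):
    # gradient of f(v) = (1/2) * sum_{i,j} adj_ij * (v_i v_j - C_ij)^2
    R = adj * (np.outer(v, v) - C)
    return 2.0 * R @ v

v = np.full(m, 0.5)  # initialize in the positive orthant, away from the origin
eta = 0.01
for _ in range(20000):
    v = np.clip(v - eta * grad(v), low, high)  # gradient step, then project onto the cube

err = float(np.max(np.abs(v - s_true)))
print(err)  # small: PGD recovers the skill vector on this toy instance
```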
We now turn to the proof of Lemma 5.2, which relies on a series of propositions; their proofs are postponed to the appendix. The first of these identifies a natural Lyapunov function for the gradient descent dynamics of the PGD algorithm.
and suppose that and are positive, and that the positive step-size is small enough so that
Then, it holds that .
The above proposition implies that is a stable point and that any sequence starting in the region stays in that region for all . The following corollary provides the value of and uses this proposition to conclude that the gradient descent iterates of the PGD algorithm stay bounded and remain bounded away from the origin.
Suppose and belong to where . If the step-size of PGD algorithm satisfies
then and for all .
Clearly, given that both and belong to . Then satisfies the step-size condition of Proposition 5.2 at time . Using Proposition 5.2, we conclude that , which implies as well. Applying the same argument iteratively, we obtain . Since , this implies .
We next derive some basic properties of the loss function under the noiseless assumption that . Recall that in Lemma 4, we defined
In the noiseless case, we have
Although is not a convex function, it satisfies several inequalities over bounded regions similar to those satisfied by convex functions. The next two propositions obtain upper bounds on how and grow as a function of the distance between two points.
Suppose and . Then, we have that
For any , we have that
Finally, the last proposition shows that the gradient of at the point is lower bounded by as follows.
Suppose . If is the skill vector sequence obtained by PGD algorithm, and the step size satisfies
then, for all , the gradient of the loss function at point is bounded as
where and is defined in Proposition 5.2.
Now we use these propositions to prove Lemma 5.2.
[Proof of Lemma 5.2] Observe that the step-size satisfies , where is from Proposition 5.2; moreover, it is less than the bound of Corollary 5.2 (this follows because the infinity norm of a matrix is at most times its Frobenius norm). It thus follows from Corollary 5.2 and the descent lemma (see Lemma 4.24 in Beck (2014)) that
Writing this out in terms of all the variables, we have that
Using the standard inequality , this implies that every iterations (where the -notation suppresses dependence on ), shrinks by a factor of . Since, by Proposition 5.2, is at most polynomial in , after iterations we have that .
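For concreteness, the halving argument rests on the standard bound below (written with generic placeholders $\eta$ for the step size and $\mu$ for the curvature constant, which stand in for the lemma's constants):

```latex
(1 - \eta\mu)^t \;\le\; e^{-\eta\mu t} \;\le\; \tfrac{1}{2}
\qquad \text{whenever} \qquad t \;\ge\; \frac{\ln 2}{\eta\mu},
```

so repeatedly halving the error until it falls below a target accuracy $\epsilon$ takes $O\!\big(\tfrac{1}{\eta\mu}\log\tfrac{1}{\epsilon}\big)$ iterations in total.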
Noisy Observations: We next consider the noisy case, namely, . In particular, let be a matrix with . Alternatively, we may think of our loss function as
The next theorem, whose proof is in the appendix, bounds the distance between any critical point of this function in and . Suppose the worker-interaction matrix satisfies the assumptions in Theorem 2. First, for small enough , will have a critical point in . Second, if is a critical point of in , then
Note that the term in the numerator and (the smallest eigenvalue of ) in the denominator both scale linearly with the number of tasks. Thus, the theorem states that for small enough perturbations, the error of stationary solutions scales proportionally to , with the proportionality constant governed by the inverse squared minimum skill level and the minimum eigenvalue of , which is known to characterize how far the weighted graph with weights is from being "bipartite" (Desai and Rao, 1994b).
Finite-Task Bound: Note that we can directly apply this result to obtain a finite task characterization as well. In particular consider a connected and non-bipartite interaction graph. Define as the maximum degree and as the sum of the degrees. It follows by standard Hoeffding bounds that with probability greater than we have
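As an empirical sanity check of this concentration behavior, the simulation below (under the single-coin model; the skill values, sample sizes, and helper name are illustrative) shows the entrywise estimation error shrinking roughly like the Hoeffding rate as the number of shared tasks grows:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
s = np.array([0.8, 0.6, 0.7, 0.5])  # skills s_i = 2 p_i - 1 under the single-coin model

def empirical_C(n):
    """Estimate C_ij = s_i * s_j from n shared binary (+/-1) tasks: worker i reports
    the true label with probability (1 + s_i)/2, and the empirical average of the
    products of answers concentrates around s_i * s_j."""
    truth = rng.choice([-1, 1], size=n)
    correct = rng.random((m, n)) < (1 + s[:, None]) / 2
    answers = np.where(correct, truth, -truth)
    return (answers @ answers.T) / n

C = np.outer(s, s)
off = ~np.eye(m, dtype=bool)  # compare off-diagonal entries only
err_small_n = np.max(np.abs(empirical_C(100) - C)[off])
err_large_n = np.max(np.abs(empirical_C(100_000) - C)[off])
print(err_small_n, err_large_n)  # error shrinks roughly like 1/sqrt(n)
```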