There has been a lot of work in machine learning on learning multiple tasks simultaneously, ranging from multi-task learning [3, 4], to domain adaptation [10, 11], to distributed learning [2, 7, 14]. Another area in a similar spirit to this work is meta-learning, where one leverages samples from many different tasks to train a single algorithm that adapts well to all tasks (see e.g. ).
In this work, we focus on a model of collaborative PAC learning, proposed by . In the classic PAC learning setting introduced by , where PAC stands for probably approximately correct, the goal is to learn a task by drawing from a distribution of samples. The optimal classifier that achieves the lowest error on the task with respect to the given distribution is assumed to come from a concept class of VC dimension . The VC theorem  states that for any instance
labeled samples suffice to learn a classifier that achieves low error with probability at least, where the error depends on .
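For reference, the standard form of this bound in the realizable case can be written as follows. The symbols are assumed notation: d for the VC dimension of the concept class, ε for the target error, δ for the failure probability, and m for the number of labeled samples.

```latex
m_{\epsilon,\delta} \;=\; O\!\left(\frac{1}{\epsilon}\left(d \ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right)\right)
```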
In the collaborative model, there are players attempting to learn their own tasks, each task involving a different distribution of samples. The goal is to learn a single classifier that performs well on all the tasks. One motivating example from  is that of hospitals with different patient demographics that want to predict the overall occurrence of a disease. In this case, it would be more fitting, as well as more cost-efficient, to develop and distribute a single classifier to all the hospitals. A single classifier is also imperative in settings with fairness concerns. For example, suppose a bank wants a classifier that predicts loan defaults, using information gathered from branches located in neighborhoods with diverse socioeconomic characteristics. The samples provided by each branch come from different distributions, yet the bank wants to guarantee low error rates for all the neighborhoods. Here too, the bank should employ a single classifier across all the neighborhoods.
If each player were to learn a classifier for their task without collaboration, they would each have to draw a sufficient number of samples from their distribution to train their classifier. Therefore, solving tasks independently would require samples in the worst case. Thus, we are interested in algorithms that utilize samples from all players and solve all tasks with sample complexity .
Blum et al.  give an algorithm with sample complexity for the realizable setting, that is, assuming the existence of a single classifier with zero error on all the tasks. They also extend this result by proving that a slightly modified algorithm returns a classifier with error , under the relaxed assumption that there exists a classifier with error on all the tasks. In addition, they prove a lower bound showing that there is a concept class with where samples are necessary.
In this work, we give two new algorithms based on multiplicative weight updates which have sample complexities and for the realizable setting. Our first algorithm matches the sample complexity of  for the variant of the problem in which the algorithm is allowed to return different classifiers to the players, and our second algorithm has sample complexity that almost matches the lower bound of  when and for typical values of . Both are presented in Section 3. Independently of our work,  use the multiplicative weight update approach and achieve the same bounds as we do in that section.
Moreover, in Section 4, we extend our results to the non-realizable setting, presenting two algorithms that generalize the algorithms for the realizable setting. These algorithms learn a classifier with error at most on all the tasks, where is set to a constant value, and have sample complexities and . With constant , these sample complexities are the same as in the realizable case. Finally, we give two algorithms with randomized classifiers whose error probability over the random choice of the example and the classifier’s randomness is at most for all tasks. The sample complexities of these algorithms are and .
In the traditional PAC learning model, there is a space of instances and a set of possible labels for the elements of . A classifier , which matches each element of to a label, is called a hypothesis. The error of a hypothesis with respect to a distribution on is defined as . Let , where is a class of hypotheses. In the realizable setting we assume that there exists a target classifier with zero error, that is, there exists with for all . Given parameters , the goal is to learn a classifier that has error at most , with probability at least . In the non-realizable setting, the optimal classifier is defined to have for any . Given parameters and a new parameter , which can be considered to be a constant, the goal is to learn a classifier that has error at most , with probability at least .
By the VC theorem and its known extension, the desired guarantee can be achieved in both settings by drawing a set of samples of size and returning the classifier with minimum error on that sample. More precisely, in the non-realizable setting, , where is also a constant. We consider an algorithm , where is a set of samples drawn from an arbitrary distribution over the domain , that returns a hypothesis whose error on the sample set satisfies for any , if such a hypothesis exists. The VC theorem guarantees that if , then .
In the collaborative model, there are players with distributions . Similarly, and the goal is to learn a single good classifier for all distributions. In , the authors consider two variants of the model for the realizable setting, the personalized and the centralized. In the former the algorithm can return a different classifier to each player, while in the latter it must return a single good classifier. For the personalized variant, Blum et al. give an algorithm with almost the same sample complexity as the lower bound they provide. We focus on the more restrictive centralized variant of the model, for which the algorithm that Blum et al. give does not match the lower bound. We note that the algorithms we present are improper, meaning that the classifier they return is not necessarily in the concept class .
3 Sample complexity upper bounds for the realizable setting
In this section, we present two algorithms and prove bounds on their sample complexity.
Both algorithms employ multiplicative weight updates, meaning that in each round they find a classifier with low error on the weighted mixture of the distributions and double the weights of the players for whom the classifier did not perform well. In this way, the next sample set drawn will include more samples from these players’ distributions so that the next classifier will perform better on them. To identify the players for whom the round's classifier did not perform well, the algorithms test the classifier on a small number of samples drawn from each player’s distribution. If the error of the classifier on the sample is low, then the error on the player’s distribution cannot be too high, and vice versa. In the end, both algorithms return the majority function over all the classifiers of the rounds, that is, for each point , the label assigned to is the label that the majority of the classifiers assign to .
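The loop described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the algorithms' specification: the hypothesis class is one-dimensional thresholds, each player's distribution is uniform over an integer range, and the number of rounds, the sample sizes, and the test threshold are arbitrary placeholders rather than the constants used in the analysis. All function names are hypothetical.

```python
import random

def erm_threshold(sample):
    """Return a threshold consistent with a non-empty, realizable labeled sample."""
    positives = [x for x, label in sample if label]
    # Smallest positive point if any; otherwise just above every negative point.
    return min(positives) if positives else max(x for x, _ in sample) + 1

def collaborative_mw(players, target, rounds, m_round, m_test, max_test_error, rng):
    """Hypothetical sketch: players are (lo, hi) integer ranges sampled uniformly."""
    k = len(players)
    weights = [1.0] * k
    thresholds = []
    for _ in range(rounds):
        # Draw the round's sample from the weighted mixture of distributions.
        owners = rng.choices(range(k), weights=weights, k=m_round)
        xs = [rng.randrange(*players[i]) for i in owners]
        theta = erm_threshold([(x, target(x)) for x in xs])
        thresholds.append(theta)
        # Test the round's classifier on fresh samples from every player.
        for i, (lo, hi) in enumerate(players):
            test_xs = [rng.randrange(lo, hi) for _ in range(m_test)]
            mistakes = sum((x >= theta) != target(x) for x in test_xs)
            if mistakes / m_test > max_test_error:
                weights[i] *= 2  # focus later rounds on this player

    def majority(x):
        # Label x positive when more than half of the round classifiers do.
        votes = sum(x >= theta for theta in thresholds)
        return 2 * votes > len(thresholds)

    return majority

def true_error(classifier, player, target):
    """Exact error of `classifier` under the uniform distribution on a range."""
    lo, hi = player
    return sum(classifier(x) != target(x) for x in range(lo, hi)) / (hi - lo)

rng = random.Random(0)
target = lambda x: x >= 50                       # assumed target concept
players = [(0, 100), (40, 60), (45, 55), (49, 51)]
f = collaborative_mw(players, target, rounds=5, m_round=200, m_test=40,
                     max_test_error=0.1, rng=rng)
errors = [true_error(f, p, target) for p in players]
```

Note that the returned classifier is a majority vote over thresholds, which here happens to be a threshold again; in general the majority of hypotheses need not belong to the hypothesis class.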
We note that for typical values of , Algorithm R2 is better than Algorithm R1. However, Algorithm R1 is always better than the algorithm of  for the centralized variant of the problem and matches their number of samples in the personalized variant, so we present both algorithms in this section. In the algorithms of , the players are divided into classes based on the number of rounds for which that player’s task is not solved with low error. The number of classes could be as large as the number of rounds, which is , and their algorithm uses roughly samples from each class. On the other hand, Algorithm R1 uses only samples across all classes and saves a factor of in the sample complexity. This requires analyzing the change in all classes together as opposed to class by class.
Algorithm R1 runs for rounds and learns a classifier in each round that has low error on the weighted mixture of the distributions . For each player at least of the learned classifiers are “good”, meaning that they have error at most on the player’s distribution. Since the algorithm returns the majority of the classifiers, in order for an instance to be mislabeled, at least of the total number of classifiers should mislabel it. This implies that at least of the “good” classifiers of that player should mislabel it, which amounts to of the “good” classifiers. Therefore, the error of the majority of the functions for that player is at most .
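The counting argument in this paragraph can be turned into a small helper. The sketch below assumes only that a fraction of the classifiers strictly larger than one half is "good" for a player, each with error at most `good_error` on that player's distribution; the specific values passed in are hypothetical and are not claimed to be the constants of Algorithm R1.

```python
def majority_error_bound(good_fraction, good_error):
    """Upper bound on the majority vote's error for one player.

    Assumes more than half of the classifiers are "good" for the player,
    each with error at most `good_error` on that player's distribution.
    """
    if not 0.5 < good_fraction <= 1.0:
        raise ValueError("need a strict majority of good classifiers")
    # A mislabeled point fools at least half of all classifiers, so at least
    # a (good_fraction - 1/2) share of all classifiers are good ones that err
    # on it; averaging the good classifiers' errors bounds how often this
    # can happen (a Markov-type argument).
    return good_fraction * good_error / (good_fraction - 0.5)

# Hypothetical values: 60% good classifiers, each with error at most 0.05.
bound = majority_error_bound(good_fraction=0.6, good_error=0.05)
```

With 60% good classifiers the bound is six times the per-classifier error, which illustrates why the good classifiers must be trained to a constant factor below the final target error.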
To identify the players for whom the classifier of the round does not perform well, Algorithm R1 uses a procedure called Test. This procedure draws samples from each player’s distribution and tests the classifier on these samples. If the error for a player’s sample set is at most then Test concludes that the classifier is good for that player and adds them to the returned set . The samples that the Test requires from each player suffice to make it capable of distinguishing between the players with error more than and players with error at most with respect to their distributions, with high probability.
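A minimal simulation of such a test is sketched below. It is an illustration only: instead of evaluating a real classifier, each player is represented by a hypothetical true error rate, and the sample count and acceptance threshold are arbitrary values, not the ones prescribed for Test.

```python
import random

def run_test(error_rates, n_samples, threshold, rng):
    """Return the set of players whose empirical error is at most `threshold`.

    `error_rates[i]` is the (hypothetical) true error of the round's
    classifier on player i; each drawn sample is misclassified independently
    with that probability.
    """
    passed = set()
    for i, p in enumerate(error_rates):
        mistakes = sum(rng.random() < p for _ in range(n_samples))
        if mistakes / n_samples <= threshold:
            passed.add(i)
    return passed

rng = random.Random(0)
rates = [0.0, 0.02, 0.30, 0.50]   # assumed true errors on four players
good = run_test(rates, n_samples=400, threshold=0.1, rng=rng)
```

Because the assumed error rates are well separated from the threshold, the players with rates 0.0 and 0.02 pass and the other two fail except with negligible probability. Rates close to the threshold on either side are the cases the required sample size has to handle.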
Theorem 1. For any , and hypothesis class of VC dimension , Algorithm R1 returns a classifier with with probability at least using samples, where
To prove the correctness and sample complexity of Algorithm R1, we need to prove Lemma 1.2, which describes the set that the Test returns. This proof uses the following multiplicative forms of the Chernoff bounds (proved as in Theorems 4.4 and 4.5 of ).
Lemma 1.1 (Chernoff Bounds).
If is the average of independent random variables taking values in , then
where the latter inequality holds for and the first two hold for .
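For orientation, the standard multiplicative Chernoff bounds of this shape are written out below. The notation is assumed: X is the average of n independent random variables taking values in {0, 1}, μ = E[X], and s is the deviation parameter. The first two inequalities are stated for 0 < s < 1 and the third for s ≥ 1; the exact form used in the lemma may differ in presentation.

```latex
\Pr\bigl[X \ge (1+s)\mu\bigr] \le \exp\!\left(-\frac{s^{2}\mu n}{3}\right),
\qquad
\Pr\bigl[X \le (1-s)\mu\bigr] \le \exp\!\left(-\frac{s^{2}\mu n}{2}\right),
\qquad
\Pr\bigl[X \ge (1+s)\mu\bigr] \le \exp\!\left(-\frac{s\,\mu n}{3}\right).
```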
Lemma 1.2. The set returned by Test is such that the following two properties hold, each with probability at least , for all and for a given round .
If , then .
If , then .
Proof of Lemma 1.2.
For this proof, we assume that the number of samples for each is at least . For a given round :
Assume for some . Then
Hence, by the union bound, holds for all with probability at least .
Assume for some . We consider two cases and we apply the Chernoff bounds with . Note that if then and the property holds. So we only need to consider . First, we need to prove that
which is true.
If , which implies , then
If , which implies , then:
Hence, by the union bound, holds for all with probability at least .
Proof of Theorem 1.
First, we prove that Algorithm R1 indeed learns a good classifier, meaning that for every player the returned classifier has error with probability at least .
Let denote the number of rounds, up until and including round , that did not pass the Test. More formally, .
Claim 1.1. With probability at least , .
From Lemma 1.2() and the union bound, with probability at least , the number of functions that have error more than on is the same as the number of rounds that did not pass the Test, for all . So, if the claim holds, with probability at least , fewer than functions have error more than on , for all . Equivalently, with probability at least , more than functions have error at most on , for all . As a result, with probability at least , the error of the majority of the functions is for all .
Let us now prove the claim.
Proof of Claim 1.1.
Recall that is the potential function in round . By linearity of expectation, the following holds for the error on the mixture of distributions:
From the VC theorem, it holds that, since and , with probability at least , . From Lemma 1.2(), with probability at least , for all . So with probability at least the two hold simultaneously. Combining these inequalities with (4), we get that with probability at least ,
Since only the weights of players are doubled, it holds that for a given round
Therefore, by the union bound, the inequality holds for all rounds with probability at least . By induction:
Also, for every it holds that , as each weight is only doubled every time does not pass the Test. Since the potential function is the sum of all weights, the following inequality is true.
So with probability at least , . ∎
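The overall shape of this potential argument can be summarized generically. The sketch below uses assumed notation: k players with initial weights 1, t rounds, Φ^{(r)} for the potential after round r, w_i^{(t)} for the final weight of player i, e_i for the number of rounds in which player i did not pass the Test, and an unspecified per-round growth factor c > 1 standing in for the constant obtained in the proof. The growth inequality is assumed to hold in every round, which the proof guarantees only with the stated probability.

```latex
\Phi^{(0)} = k, \qquad \Phi^{(r)} \le c\,\Phi^{(r-1)}
\;\Longrightarrow\; \Phi^{(t)} \le k\,c^{t},
\qquad
2^{e_i} = w_i^{(t)} \le \Phi^{(t)} \le k\,c^{t}
\;\Longrightarrow\;
e_i \le \log_2 k + t \log_2 c .
```

The bound on e_i is a useful fraction of t only when c is small enough and t is a sufficiently large multiple of log_2 k, which is why the number of rounds scales logarithmically with the number of players.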
As for the total number of samples, it is the sum of Test’s samples and the samples for each round. Since Test is called times and each time requests samples from each of the players, the total number of samples that it requests is . Substituting and , this yields
samples in total.
In addition, the sum of the samples drawn in each round to learn the classifier for the mixture for rounds is . Again, substituting and , we get:
samples in total.
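The bookkeeping behind these two totals is simple enough to state as code. The helper below is a sketch with hypothetical parameter names and made-up numbers; it only mirrors the accounting described above (Test is called once per round and draws from every player, and each round draws one sample set to learn its classifier) and does not encode the actual asymptotic expressions.

```python
def total_samples(rounds, players, test_samples_per_player, samples_per_round):
    """Samples used by Test plus samples used to learn each round's classifier."""
    test_total = rounds * players * test_samples_per_player
    learning_total = rounds * samples_per_round
    return test_total + learning_total

# Hypothetical numbers, chosen only to show the bookkeeping.
budget = total_samples(rounds=10, players=8, test_samples_per_player=50,
                       samples_per_round=1000)
```

The first term grows with the number of players while the second does not, so the cost of testing is the term to reduce when there are many players.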
Hence, the overall bound is:
Algorithm R1 is the natural boosting alternative to the algorithm of  for the centralized variant of the model. Although it is discussed in  and said to have the same sample complexity as their algorithm, it turns out to be more efficient. Its sample complexity is slightly better than (or the same as, depending on the parameter regime) that of the algorithm for the personalized setting presented in , which is .
However, in the setting of the lower bound in  where , there is a multiplicative gap of between the sample complexity of Algorithm R1 and the lower bound. This difference stems from the fact that in every round, the algorithm uses roughly samples to find a classifier but roughly samples to test the classifier for tasks. Motivated by this discrepancy, we develop Algorithm R2, which is similar to Algorithm R1 but uses fewer samples to test the performance of each classifier on the players’ distributions. To achieve high success probability, Algorithm R2 uses a higher number of rounds.