Weakly Supervised End-to-End Learning
Aggregating multiple sources of weak supervision (WS) can ease the data-labeling bottleneck prevalent in many machine learning applications by replacing the tedious manual collection of ground truth labels. Current state-of-the-art approaches that do not use any labeled training data, however, require two separate modeling steps: learning a probabilistic latent variable model based on the WS sources, which makes assumptions that rarely hold in practice, followed by downstream model training. Importantly, the first modeling step does not consider the performance of the downstream model. To address these caveats, we propose an end-to-end approach for directly learning the downstream model by maximizing its agreement with probabilistic labels generated by reparameterizing previous probabilistic posteriors with a neural network. Our results show improved performance over prior work in terms of end model performance on downstream test sets, as well as improved robustness to dependencies among weak supervision sources.
The success of supervised machine learning methods relies on the availability of large amounts of labeled data. The common process of manual data annotation by humans, especially when domain experts need to be involved, is expensive, both in terms of effort and cost, and as such presents a major bottleneck for deploying supervised learning methods to new domains and applications.
Recently, data programming, a paradigm that makes use of multiple sources of noisy labels, has emerged as a promising alternative to manual data annotation Ratner et al. (2016). It encompasses previous paradigms such as distant supervision from external knowledge bases Mintz et al. (2009); Riedel et al. (2010), crowdsourcing Dawid and Skene (1979); Karger et al. (2011); Dalvi et al. (2013); Zhang et al. (2016), and general heuristic and rule-based labeling of data Gupta and Manning (2014); Goh et al. (2018). In the data programming framework, users encode domain knowledge into so-called labeling functions (LFs), which are functions (e.g. domain heuristics or knowledge base derived rules) that noisily label subsets of data. The main task for learning from multiple sources of weak supervision is then to recover the sources’ accuracies in order to estimate the latent true label, without access to ground truth data. In previous work Ratner et al. (2016, 2019b); Fu et al. (2020), this is achieved by first learning a generative probabilistic graphical model (PGM) over the weak supervision sources and the latent true label to estimate probabilistic labels, which are then used in the second step to train a downstream model via a noise-aware loss function.
Data programming has led to a wide variety of success stories in domains such as healthcare Fries et al. (2019); Dunnmon et al. (2020) and e-commerce Bach et al. (2019), but the existing PGM-based frameworks still come with a number of drawbacks. The separate PGM does not take the predictions of the downstream model into account, and indeed this model is trained independently of the PGM. In addition, current approaches for estimating the unknown class label via a PGM need to rely on computationally expensive approximate sampling methods Ratner et al. (2016), estimation of the full inverse of the LFs’ covariance matrix Ratner et al. (2019b), or they need to make strong independence assumptions Fu et al. (2020). Furthermore, existing prior work and the associated theoretical analyses make assumptions that may not hold in practice Ratner et al. (2016, 2019b); Fu et al. (2020), such as availability of a well-specified generative model structure (i.e. that the dependencies and correlations between the weak sources have been correctly specified by the user), that LF errors are randomly distributed across samples, and that the latent label is independent of the features given the weak labels (i.e. only the joint distribution between the sources and labels needs to be modeled).
We introduce WeaSEL, our Weakly Supervised End-to-end Learner model for training neural networks with, exclusively, multiple sources of weak supervision as noisy signals for the latent labels. WeaSEL is based on 1) reparameterizing previous PGM based posteriors with a neural encoder network that produces attention scores for each weak supervision source; and 2) training the encoder and downstream model on the same target loss, using the other model’s predictions as constant targets, to maximize the agreement between both models. The proposed method needs no labeled training data, and neither assumes sample-independent source accuracies nor redundant features for latent label modeling. We show empirically that it is not susceptible to highly correlated LFs. In addition, the proposed approach can learn from multiple probabilistic sources of weak supervision.
Our contributions include:
We introduce a flexible, end-to-end method for learning models from multiple sources of weak supervision.
We empirically demonstrate that the method is naturally robust to adversarial sources as well as highly correlated weak supervision sources.
We show that our method outperforms, by as much as 6.1 F1 points, state-of-the-art latent label modeling approaches on 4 out of 5 relevant benchmark datasets, and achieves state-of-the-art performance on a crowdsourcing dataset against methods specifically designed for this setting.
Multi-source Weak Supervision. The data programming paradigm Ratner et al. (2016) allows users to programmatically label data through multiple noisy sources of labels, by treating the true label as a latent variable of a generative PGM. Several approaches for learning the parameters of the generative model have been introduced Ratner et al. (2019b); Fu et al. (2020); Chen et al. (2021) to address computational complexity issues. Existing methods are susceptible to misspecification of the dependencies and correlations between the LFs, which can lead to substantial losses in performance Cachay et al. (2021). Indeed, it is common practice to assume a conditionally independent model, without any dependencies between the sources, in popular libraries Bach et al. (2019); Ratner et al. (2019a) and related research Dawid and Skene (1979); Anandkumar et al. (2014); Varma and Ré (2018); Boecking et al. (2021), even though methods to learn the intra-LF structure have been proposed Bach et al. (2017); Varma et al. (2019, 2017). As in the approach proposed in this paper, the aforementioned methods do not assume any labeled training data, i.e. the downstream model is learned based solely on outputs of multiple LFs on unlabeled data. The traditional co-training paradigm Blum and Mitchell (1998), on the other hand, is similar in spirit but requires some labeled data to be available. Recent methods that study the co-training setup where labeled training data supplements multiple WS sources include Awasthi et al. (2020); Karamanolakis et al. (2021). Note that the experiments in Awasthi et al. (2020); Karamanolakis et al. (2021) rely on large pre-trained language models, making the applicability of the approach without such models or to non-text domains unclear.
Crowdsourcing. Aggregating multiple noisy labels is also a core problem studied in the crowdsourcing literature. Common approaches model worker performance and the unknown label jointly Dawid and Skene (1979); Dalvi et al. (2013); Zhang et al. (2016) using expectation maximization (EM) or similar approaches. Some core differences to learning from weak supervision sources are that errors by crowdworkers are usually assumed to be random, and that task assignment is not always fixed but can be optimized for. Multiple end-to-end methods have been proposed in the crowdsourcing problem setting Raykar et al. (2010); Guan et al. (2018); Welinder et al. (2010); Khetan et al. (2018); Rodrigues and Pereira (2018); Cao et al. (2019), often focusing on image labeling and EM-like algorithms for modeling the workers. Importantly, our proposed approach can be used in general applications with weak supervision from multiple sources without any restrictive assumptions specific to crowdsourcing, and we show that our approach outperforms the aforementioned methods on a crowdsourcing benchmark task.
In this section we present our flexible base algorithm that we call WeaSEL, which can be extended to probabilistic sources and other network architectures (Section 7). See Algorithm 1 for its pseudocode.
Let $(x, y) \sim \mathcal{D}$ be the data generating distribution, where the unknown labels belong to one of $C$ classes: $y \in \mathcal{Y} = \{1, \dots, C\}$. As in Ratner et al. (2016), users provide an unlabeled training set $X = \{x_i\}_{i=1}^{n}$, and $m$ labeling functions $\lambda = \lambda(x) \in \{0, 1, \dots, C\}^m$, where $0$ means that the LF abstained from labeling for any class. We write $\bar{\lambda} = (\mathbb{1}\{\lambda = 1\}, \dots, \mathbb{1}\{\lambda = C\}) \in \{0, 1\}^{m \times C}$ for the one-hot representation of the LF votes provided by the $m$ LFs for $C$ classes. Our goal is to train a downstream model $f$ on a noise-aware loss $L(y_e, y_f)$ that operates on the model’s predictions $y_f = f(x)$ and probabilistic labels $y_e$ generated by an encoder model $e$ that has access to LF votes and features. Note that prior work restricts the probabilistic labels to only being estimated from the LFs $\lambda$.
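The one-hot representation of the LF votes can be illustrated with a small sketch. It is not the authors' code: it assumes votes are stored as integers, with 0 standing for an abstain and 1..C for the classes, and the helper name `one_hot_votes` is hypothetical.

```python
from typing import List


def one_hot_votes(votes: List[int], num_classes: int) -> List[List[int]]:
    """Turn m integer LF votes into an m x C matrix of 0/1 entries.

    Assumed encoding: 0 = abstain, 1..num_classes = class labels.
    An abstaining LF yields an all-zero row.
    """
    matrix = []
    for vote in votes:
        if not 0 <= vote <= num_classes:
            raise ValueError(f"vote {vote} outside 0..{num_classes}")
        matrix.append([1 if vote == c else 0 for c in range(1, num_classes + 1)])
    return matrix


# Three LFs on one sample: class 1, abstain, class 2.
example = one_hot_votes([1, 0, 2], num_classes=2)
print(example)  # [[1, 0], [0, 0], [0, 1]]
```

An abstaining LF contributes nothing to any class, which is why its row is all zeros.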
Previous PGM based approaches assume that the joint distribution of the LFs and the latent true label can be modeled as a Markov Random Field (MRF) with pairwise dependencies between weak supervision sources Ratner et al. (2016, 2019a, 2019b); Fu et al. (2020); Chen et al. (2021). These models are parameterized by a set of LF accuracy and intra-LF correlation parameters and in some cases by additional parameters to model LF and class label propensity. Note however, that the aforementioned models ignore features when modeling the latent labels and therefore disregard that LFs may differ in their accuracy across samples and data slices.
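We relax these assumptions, and instead view the latent label as an aggregation of the LF votes that is a function of the entire set of LF votes and features, on a sample-by-sample basis. That is, we model the probability of a particular sample $x$ having the class label $c$ as
$$P_\theta(y = c \mid \lambda) = \mathrm{softmax}(s)_c \, P(y = c), \qquad (1)$$
$$s = \theta(\lambda, x)^\top \bar{\lambda} \in \mathbb{R}^{C}, \qquad (2)$$
where $\bar{\lambda}$ is the one-hot representation of the LF votes, $\theta(\lambda, x) \in \mathbb{R}^{m}$ weighs the LF votes on a sample-by-sample basis, and the softmax for class $c$ on $s$ is defined as
$$\mathrm{softmax}(s)_c = \frac{\exp(s_c)}{\sum_{j=1}^{C} \exp(s_j)}.$$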
While we do not use the class balance $P(y)$ in our experiments for our own model, WeaSEL, it is frequently assumed to be known Ratner et al. (2019b); Fu et al. (2020); Chen et al. (2021), and can be estimated from a small validation set, or from unlabeled data as described in Ratner et al. (2019b). Our formulation can be seen as a reparameterization of the posterior of the pairwise MRFs in Ratner et al. (2019a, b); Fu et al. (2020), where $\theta$ corresponds to the LF accuracies that are fixed across the dataset and are solely learned via LF agreement and disagreement signals, ignoring the informative features. We further motivate this formulation and expand upon this connection in Appendix A.
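The aggregation described above (a softmax over accuracy-weighted LF votes) can be sketched in a few lines. This is an illustration under stated assumptions, not the reference implementation: the class balance is left out, as the text says it is in the WeaSEL experiments, the accuracy scores are passed in as plain numbers, and the names `softmax` and `aggregate` are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    return [v / total for v in exps]


def aggregate(theta: List[float], one_hot: List[List[float]]) -> List[float]:
    """Return class probabilities from m accuracy scores and an m x C vote matrix.

    The per-class score is the accuracy-weighted sum of the votes for that
    class; the class balance term is omitted (assumed uniform).
    """
    num_classes = len(one_hot[0])
    s = [sum(t * row[c] for t, row in zip(theta, one_hot)) for c in range(num_classes)]
    return softmax(s)


# Two LFs vote for class 1, one LF votes for class 2, all equally weighted.
probs = aggregate([1.0, 1.0, 1.0], [[1, 0], [1, 0], [0, 1]])
print(probs)
```

With equal weights the result follows the vote counts; raising the weight of the third LF would shift probability mass toward class 2.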
Based on the setup introduced in the previous section and captured in Eq. (1), our goal is to estimate latent labels by means of learning sample-dependent accuracy scores $\theta(\lambda, x)$, which we propose to parameterize by a neural encoder $e$. This network takes as input the features $x$ and the corresponding LF outputs $\lambda(x)$ for a data point, and outputs unnormalized scores $e(\lambda, x) \in \mathbb{R}^{m}$. Specifically, we define
$$\theta(\lambda, x) = \tau_2 \cdot \mathrm{softmax}\big(e(\lambda, x)\, \tau_1\big) \in \mathbb{R}^{m}, \qquad (3)$$
where $\tau_2 \in \mathbb{R}^{+}$ is a constant factor that scales the final softmax transformation in relation to the number of LFs $m$, and is equivalent to an inverse temperature for the output softmax in Eq. 1. It is motivated by the fact that most LFs are sparse in practice, and especially when the number of LFs is large this leads to small accuracy magnitudes without scaling (since, without scaling, the accuracies after the softmax sum up to one). In our main experiments we set $\tau_2 = \sqrt{m}$.
$\tau_1 \in \mathbb{R}^{+}$ is an inverse temperature hyperparameter that controls the smoothness of the predicted accuracy scores: the lower $\tau_1$ is, the less emphasis is given to a small number of LFs; as $\tau_1 \to 0$, the model aggregates according to the equal weighted vote. The softmax transformation naturally encodes our understanding of wanting to aggregate the weak sources to generate the latent label.
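The effect of the two temperature-like factors on the accuracy scores can be seen in a short sketch. It assumes the scores are obtained by scaling the encoder outputs by an inverse temperature, applying a softmax, and multiplying by a constant factor; the names `accuracy_scores`, `tau1`, and `tau2` are illustrative, and the encoder outputs are made-up numbers rather than the output of a trained network.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    return [v / total for v in exps]


def accuracy_scores(encoder_out: List[float], tau1: float, tau2: float) -> List[float]:
    """Scale the softmax of the temperature-scaled encoder outputs by tau2."""
    return [tau2 * w for w in softmax([tau1 * v for v in encoder_out])]


raw = [2.0, 0.5, -1.0, 0.0]   # hypothetical unnormalized encoder outputs, m = 4
scale = math.sqrt(len(raw))   # one possible choice that grows with the number of LFs

print(accuracy_scores(raw, tau1=0.0, tau2=scale))   # equal weights for all LFs
print(accuracy_scores(raw, tau1=1.0, tau2=scale))   # moderately peaked
print(accuracy_scores(raw, tau1=50.0, tau2=scale))  # almost all weight on one LF
```

A small inverse temperature keeps the aggregation close to an equal-weight vote, while a large one lets a single LF dominate.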
The key question now is how to train the encoder $e$, i.e. how can we learn an accurate mapping of the sample-by-sample accuracies, given that we do not observe any labels?
We hypothesize that in most practical cases, features, latent label, and labeling function aggregations are intrinsically correlated due to the design decisions made by the users defining the features and LFs. Thus, we can jointly optimize the encoder $e$ and the downstream model $f$ by maximizing their agreement with respect to the target downstream loss in an end-to-end manner.
See Algorithm 1 for pseudocode of the resulting WeaSEL algorithm. The natural classification loss is the cross-entropy, which we use in our experiments, but in order to encode our desire of maximizing the agreement of the two separate models that predict based on different views of the data, we adapt it in the following form (this holds for any asymmetric loss, while for symmetric losses this is not needed):
$$L\big(y_e, \text{stop-grad}(y_f)\big) + L\big(y_f, \text{stop-grad}(y_e)\big),$$
where $y_e$ denotes the probabilistic labels generated by the encoder and $y_f$ the predictions of the downstream model. The loss is symmetrized in order to compute the gradient of both models using the other model’s predictions as targets. To that end, it is crucial to use the stop-grad operation on the targets (the second argument of $L$), i.e. to treat them as though they were ground truth labels. This choice is supported by our synthetic experiment and ablations. This operation has also been shown to be crucial in siamese, non-contrastive, self-supervised learning, both empirically Grill et al. (2020); Chen and He (2021) and theoretically Tian et al. (2021). By simultaneously minimizing both terms to jointly learn the network parameters of the encoder $e$ and the downstream model $f$, respectively, we learn the accuracies of the noisy sources that best explain the patterns observed in the data, and vice versa the feature-based predictions that are best explained by aggregations of LF voting patterns.
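To make the role of the stop-grad operation concrete, the sketch below computes the symmetrized cross-entropy for one sample and the gradient each model would receive when the other model's output is held constant. It is a hand-written illustration without an autodiff framework, so the "stop-grad" is expressed simply by treating the target as a fixed list of numbers; all function names are hypothetical and the class balance is ignored.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(pred: List[float], target: List[float]) -> float:
    """Cross-entropy of predicted probabilities against (soft) targets."""
    return -sum(t * math.log(max(p, 1e-12)) for p, t in zip(pred, target))


def symmetric_loss(y_e: List[float], y_f: List[float]) -> float:
    """Sum of both directions; stop-grad changes gradients, not this value."""
    return cross_entropy(y_e, y_f) + cross_entropy(y_f, y_e)


def grad_wrt_logits(pred: List[float], target: List[float]) -> List[float]:
    """Gradient of cross_entropy(softmax(z), target) w.r.t. z.

    Valid when the target is a constant (stop-grad) that sums to one.
    """
    return [p - t for p, t in zip(pred, target)]


y_e = softmax([1.5, 0.2])   # encoder-side label for one sample (illustrative scores)
y_f = softmax([0.3, 0.1])   # downstream prediction for the same sample
print(symmetric_loss(y_e, y_f))
print(grad_wrt_logits(y_f, y_e))  # pushes the downstream logits toward y_e
print(grad_wrt_logits(y_e, y_f))  # pushes the encoder-side scores toward y_f
```

Each model is updated as if the other model's output were a ground truth label, and both gradients vanish once the two outputs agree.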
Note that it is necessary to encode the inductive bias that the unobserved ground truth label is a (normalized) linear combination of LF votes, weighted by sample- and feature-dependent accuracy scores. Otherwise, if the encoder network directly predicts the label distribution instead of the accuracies $\theta(\lambda, x)$, the pair of networks has no incentive to output the desired latent label, without observed labels. We do acknowledge that this two-player cooperation game with strong inductive biases could still allow for degenerate solutions. However, we empirically show that our simple WeaSEL model, which goes beyond multiple earlier WS assumptions, 1) is competitive with and frequently outperforms state-of-the-art PGM-based and crowdsourcing models (see Tables 1 and 2); and 2) is robust against massive LF correlations and able to recover the performance of a fully supervised model on a synthetic example, while all other models break in this setting (see Section 4.3 and Appendix F).
|Model||Spouses (9 LFs)||ProfTeacher (99 LFs)||IMDB (136 LFs)||IMDB (12 LFs)||Amazon (175 LFs)|
|Sup. (Val. set)|
|Snorkel||82.22 ± 0.18||74.45 ± 0.58|
|Majority vote||85.44 ± 0.37||84.20 ± 0.52|
|WeaSEL||51.98 ± 1.60||86.98 ± 0.45||82.10 ± 0.45||77.22 ± 1.02||86.60 ± 0.71|
As in related work on label models for weak supervision Ratner et al. (2016, 2019b); Fu et al. (2020); Chen et al. (2021), we focus for simplicity on the binary classification case with unobserved ground truth labels $y$.
See Table 3 for details about dataset sizes and the number of LFs used.
We also run an experiment on a multi-class, crowdsourcing dataset (see subsection 4.2).
We evaluate the proposed end-to-end system for learning a downstream model from multiple weak supervision sources on previously used benchmark datasets in weak supervision work Ratner et al. (2019a); Boecking et al. (2021); Chen et al. (2021). Specifically, we evaluate test set performance on the following classification datasets:
The IMDB movie review dataset Maas et al. (2011) contains movie reviews to be classified into positive and negative sentiment. We run two separate experiments, where in one we use the same 12 labeling functions as in Chen et al. (2021), and for the other we choose 136 text-pattern based LFs. More details on the LFs can be found in Appendix C.
To evaluate the proposed system, we benchmark it against state-of-the-art systems that aggregate multiple weak supervision sources for classification problems, without any labeled training data. We compare our proposed approach with the following systems: 1) Snorkel, a popular system proposed in Ratner et al. (2019a, b); 2) Triplet, which exploits a closed-form solution for binary classification under certain assumptions Fu et al. (2020); and 3) Triplet-mean and Triplet-median Chen et al. (2021), which are follow-up methods based on Triplet that aim to make the method more robust.
We report the held-out test set performance of WeaSEL’s downstream model $f$. Note that in many settings it is not possible to apply the encoder model $e$ to make predictions at test time, since the LFs usually do not cover all data points (e.g. in Spouses only 25.8% of training samples get at least one LF vote), and can be difficult to apply to new samples (e.g. when the LFs are crowdsourced annotations). In contrast, the downstream model is expected to generalize to arbitrary unseen data points.
We observe strong results for our model, with 4 out of 5 top scores, and a lift of 6.1 F1 points over the next best label model on the Amazon dataset. Our results are summarized in Table 1. Since our model is based on a neural network, we hypothesize that the large relative lift in performance on the Amazon review dataset is due to it being the largest dataset on which we evaluate; we expect this lift to hold or become larger as the training set size increases. To obtain the comparisons shown in Table 1, we run Snorkel over six different label model hyperparameter configurations, and report the best score based on the validation set AUC. We do not report Triplet-median in the main table, since it only converged for the two tasks with very small numbers of labeling functions. Interestingly, we observed that training the downstream model on the hard labels induced by majority vote leads to competitive performance, better than triplet methods in four out of five datasets. This baseline is not reported in previous papers (only the raw majority vote is usually reported, without training a classifier). Our own model, WeaSEL, on the other hand consistently improves over the majority vote baseline (which in Table 4, in the appendix, can be seen to lead to similar performance as an untrained encoder network $e$ that is left at its random initialization).
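The majority vote baseline mentioned above derives one hard label per sample from the LF votes before a classifier is trained on those labels. A minimal sketch of the label-generation step follows; the integer encoding (0 for abstain) and the choice to leave ties and fully abstained samples unlabeled are assumptions made here for illustration, not details taken from the paper.

```python
from collections import Counter
from typing import List


def majority_vote(votes: List[int]) -> int:
    """Return the most frequent non-abstain vote, or 0 if there is none.

    Assumed encoding: 0 = abstain, positive integers = class labels.
    Ties between the top classes also return 0 (sample left unlabeled).
    """
    counts = Counter(v for v in votes if v != 0)
    if not counts:
        return 0
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return 0
    return ranked[0][0]


lf_votes = [
    [1, 1, 2, 0],  # two votes for class 1, one for class 2
    [0, 0, 0, 0],  # no LF fired
    [1, 2, 0, 0],  # tie
]
hard_labels = [majority_vote(row) for row in lf_votes]
print(hard_labels)  # [1, 0, 0]
```

Samples that end up with label 0 would simply be dropped before training the downstream classifier in this sketch.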
|MBEM Khetan et al. (2018)|
|DoctorNet Guan et al. (2018)|
|CrowdLayer Rodrigues and Pereira (2018)|
|AggNet Albarqouni et al. (2016)|
|MaxMIG Cao et al. (2019)||85.45 ± 1.0|
Data programming and crowdsourcing methods have rarely been compared against each other, even though the problem setup is quite similar. Indeed, end-to-end systems specifically for crowdsourcing have been proposed Raykar et al. (2010); Khetan et al. (2018); Rodrigues and Pereira (2018); Cao et al. (2019). These methods follow crowdsourcing-specific assumptions and modeling choices (e.g. independent crowdworkers and a confusion matrix model for each worker) and in general build upon Dawid and Skene (1979). Still, since crowdworkers can be seen as a specific type of labeling function, the performance of general WS methods on crowdsourcing datasets is of interest, but has so far not been studied. We therefore choose to also evaluate our method on the multi-class LabelMe image classification dataset that was previously used in the core related crowdsourcing literature Rodrigues and Pereira (2018); Cao et al. (2019). The results are reported in Table 2, and more details on this experiment can be found in Appendix E. Note that the evaluation procedure in Cao et al. (2019) reports the best test set performance for all models, while we follow the more standard practice of reporting results obtained by tuning based on a small validation set, as in our main experiments. We find that our model, WeaSEL, is able to outperform Snorkel as well as multiple state-of-the-art methods that were specifically designed for crowdsourcing (including several end-to-end approaches). Interestingly, this is achieved by using the mutual information gain (MIG) loss function introduced in Cao et al. (2019), which significantly boosts the performance of both Snorkel (the end-model $f$ trained on the MIG loss with respect to soft labels generated by the first Snorkel label model step) and WeaSEL over their counterparts that use the cross-entropy (CE) loss. This suggests that the MIG loss is a good choice for the special case of crowdsourcing, because it relies on strong assumptions that are common in crowdsourcing but much less likely to hold for general LFs. This is reflected in our ablations too, where using the MIG loss leads to consistently worse performance on our main multi-source weak supervision datasets.
Figure 1: WeaSEL is significantly more robust against correlated adversarial (left) or random (right) LFs than prior work whose assumptions make them equivalent to a Naive Bayes model. For subfigure (a), we duplicate a fake adversarial LF up to 10 times, and observe that our end-to-end system is robust against the adversarial LF, while other systems quickly degrade in performance (over ten random seeds). In (b), we let one LF be the true labels $y$ and then duplicate an LF that votes according to a coin flip 2, 5, …, 2000 times. We plot the test AUC performance curve as a function of the epochs, averaged over the different numbers of duplications (and five random seeds). WeaSEL consistently recovers the test performance of the supervised end-model trained directly on the true labels $y$, whose end performance is shown in red.
Users will sometimes generate sources they mistakenly think are accurate. This also encompasses the ’Spammer’ crowdworker-type studied in the crowdsourcing literature.
Therefore, it is desirable to build models that are robust against such sources.
We argue that our system that is trained by maximizing the agreement between an aggregation of the sources and the downstream model’s predictions should be able to distinguish the adversarial sources.
In Fig. 1(a) we show that our system does not degrade in its initial performance, even after duplicating an adversarial LF ten times.
Prior latent label models, on the other hand, rapidly degrade, given that they often assume the weak label sources to be conditionally independent given the latent label, equivalent to a Naive Bayes generative model.
Note that the popular open-source implementation of Ratner et al. (2019a, b) does not support modeling of user-provided LF dependencies, while Fu et al. (2020); Chen et al. (2021) did not converge in our experiments when modeling dependencies, and as such we were not able to test their performance when the correlation dependencies between the duplicates are provided (which in practice, of course, are not known).
We also run a synthetic experiment inspired by Cao et al. (2019), where one LF is set to the true labels of the ProfTeacher dataset, while the other LF simply votes according to a coin flip, and we then duplicate this latter LF. Under this setting, our WeaSEL model is able to consistently recover the fully supervised performance of the same downstream model directly trained on the true labels, even when we duplicate the random LF up to 2000 times. Snorkel and triplet methods, on the other hand, were unable to recover the true label. Importantly, we find that the design choices for WeaSEL are to a large extent key in order to recover the true labels in a stable manner as in Fig. 1(b). Various other choices either collapse similarly to the baselines, are not able to fully recover the supervised performance, or lead to unstable test performance curves, see Fig. 5 in the appendix. More details on the experimental design and an extensive discussion, ablation, and figures based on the synthetic experiment can be found in Appendix F.
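The construction of the LF votes for this synthetic experiment can be sketched as follows. The snippet only builds the vote matrix (one LF copying the true label, one coin-flip LF duplicated several times); the binary class encoding {1, 2}, the function name `synthetic_votes`, and the toy labels are assumptions for illustration and do not reproduce the ProfTeacher data.

```python
import random
from typing import List


def synthetic_votes(labels: List[int], num_duplicates: int, seed: int = 0) -> List[List[int]]:
    """Build one row of LF votes per sample.

    Column 0 copies the true label; the remaining num_duplicates columns
    are identical copies of a single coin-flip vote for that sample.
    """
    rng = random.Random(seed)
    rows = []
    for y in labels:
        coin = rng.choice([1, 2])
        rows.append([y] + [coin] * num_duplicates)
    return rows


true_labels = [1, 2, 2, 1]  # toy labels, assumed encoding {1, 2}
votes = synthetic_votes(true_labels, num_duplicates=5)
for row in votes:
    print(row)
```

Under an unweighted vote the duplicated coin flip outvotes the perfect LF whenever the two disagree, which is what makes this setting hard for models that assume independent sources.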
Here we provide a high-level overview of the encoder architecture, the LF sets, and the features used. More details, especially hyperparameter and architecture details, are provided in Appendix C. All downstream models are trained with the (binary) cross-entropy loss, and our model with the symmetric version of it that uses stop-grad on the targets.
Encoder network. The encoder network $e$ does not need to follow a specific neural network architecture and we therefore use a simple multi-layer perceptron (MLP) in our benchmark experiments.
respectively. The remaining three LF sets were selected by us prior to running experiments. These LFs are all pattern- and regex-based heuristics, while the Spouses experiments also contain LFs that are distant supervision sources based on DBPedia.
|Dataset||#LFs||Cov. (in %)|
In this section we demonstrate the strength of the WeaSEL model design decisions. We perform extensive ablations on all four main datasets but Spouses for twenty configurations of WeaSEL with different encoder architectures, hyperparameters, and loss functions.
The tabular results and a more detailed discussion than in the following can be found in Appendix D.
We observe that ignoring the features when modeling the sample-dependent accuracies, i.e. using $\theta(\lambda)$ in place of $\theta(\lambda, x)$, usually underperforms by up to 1.2 F1 points. A more drastic drop in performance, up to 4.9 points, occurs when the encoder network is linear, i.e. without hidden layers, as in Cao et al. (2019). It also proves helpful to scale the softmax in Eq. 3 by $\sqrt{m}$ via the inverse temperature parameter $\tau_2$. Further, while the MIG loss proved important for WeaSEL to achieve state-of-the-art performance on the crowdsourcing dataset (with a similar lift in performance observable for Snorkel using MIG for downstream model training), this does not hold for the main datasets. This indicates that the MIG loss is a good choice for crowdsourcing, but not for more general WS settings.
Our ablations also show that it is important to restrict the accuracies to a positive interval (e.g. (0, 1), with the sigmoid function being a good alternative to the softmax we use). On the one hand, this encodes the inductive bias that LFs are not adversarial, i.e. cannot have negative accuracies (using tanh to output accuracy scores does not perform well). On the other hand, it does not give the encoder network too much freedom in the scale of the scores (using ReLU underperforms significantly as well).
Overall, WeaSEL avoids trivial overfitting and degenerate solutions by hard-coding the encoder generated labels as a (normalized) linear combination of the LF outputs, weighted by sample-dependent accuracy scores.
This design choice also ensures that a randomly initialized encoder $e$ will lead the downstream model $f$, trained on soft labels generated by the random encoder, to obtain performance similar to when $f$ is trained on majority vote labels. In fact, the random-encoder-WeaSEL variant itself often outperforms other baselines, and triplet methods in particular (see Appendix B).
Empirically, we only observed degenerate solutions when training for too many epochs. Early stopping on a small validation set ensures that a strong final solution is returned, and should be done whenever such a set exists or is easy to create. When no validation set is available, we find that choosing a sufficiently low temperature hyperparameter $\tau_1$ in Eq. 3 avoids collapsed solutions on all our datasets. This can be explained by the fact that a lower inverse temperature forces the encoder-predicted label to always depend on multiple LF votes when available, rather than a single one (which happens when the softmax in Eq. 3 becomes a max as $\tau_1 \to \infty$). This makes it harder for the encoder to overfit to individual LFs. Our ablations indicate that this temperature parameter setting comes at a small cost in downstream performance, compared to using a validation set for early stopping. Thus, when no validation set is available, we advise lowering $\tau_1$.
We have shown that WeaSEL achieves competitive or state-of-the-art performance on all datasets we tried it on, for a given set of LFs. In practice, however, this LF set needs to first be defined by users. This can be done via an iterative process, where the feedback is sourced from the quality of the probabilistic labels generated by the label model. A limitation of our model is that each such iteration would require training the downstream model $f$. When $f$ is slow to train, this may slow down the LF development cycle and lead to unnecessary energy consumption. A practical solution to this can be to a) do the iteration cycle with a less complex downstream model; or b) use the fast-to-train PGM-based label models to choose a good LF set, and then move to WeaSEL in order to achieve better downstream performance.
Our learning method can easily support labeling functions that output continuous scores instead of discrete labels as in Chatterjee et al. (2020). In particular, this includes probabilistic sources that output a distribution over the potential class labels. This can be encoded in our model by changing the one-hot representation $\bar{\lambda}$ of our base model to a continuous representation $\bar{\lambda} \in [0, 1]^{m \times C}$.
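A small sketch shows why this extension needs no change to the aggregation itself: the same accuracy-weighted sum works when each LF row holds a distribution over classes instead of a one-hot vote. The function name `aggregate_soft` and the example numbers are hypothetical, and the class balance is again left out.

```python
import math
from typing import List


def aggregate_soft(theta: List[float], lf_probs: List[List[float]]) -> List[float]:
    """Aggregate m per-LF class distributions into one label distribution.

    lf_probs is an m x C matrix with entries in [0, 1]; a one-hot row is
    the special case of a discrete vote and an all-zero row is an abstain.
    """
    num_classes = len(lf_probs[0])
    s = [sum(t * row[c] for t, row in zip(theta, lf_probs)) for c in range(num_classes)]
    top = max(s)
    exps = [math.exp(v - top) for v in s]
    total = sum(exps)
    return [v / total for v in exps]


# One confident probabilistic source, one uncertain source, one discrete vote.
sources = [[0.9, 0.1], [0.45, 0.55], [0.0, 1.0]]
print(aggregate_soft([1.0, 1.0, 1.0], sources))
```

Uncertain sources contribute almost equally to every class, so they move the aggregated label less than confident ones.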
While we use a simple multi-layer perceptron (MLP) as our encoder in our benchmark experiments, our formulation is flexible to support arbitrarily complex networks. In particular, we can naturally model dependencies amongst weak sources via edges in a Graph Neural Network (GNN), where each LF is represented by a node that is given the LF outputs as features. Furthermore, while we only explicitly reparameterized the accuracy parameters of the sources in our base model, it is straightforward to augment with additional sufficient statistics, e.g. the fixing or priority dependencies from Ratner et al. (2016); Cachay et al. (2021) that encode that one source fixes (i.e. should be given priority over) the other whenever both vote.
We proposed WeaSEL, a new approach for end-to-end learning of neural network models for classification from, exclusively, multiple sources of weak supervision that streamlines prior latent variable models. We evaluated the proposed approach on benchmark datasets and observe that the downstream models outperform state-of-the-art data programming approaches in 4 out of 5 cases while remaining highly competitive on the remaining task, and outperforming several state-of-the-art crowdsourcing methods on a crowdsourcing task. We also demonstrated that our integrated approach can be more robust to dependencies between the labeling functions as well as to adversarial labeling scenarios. The proposed method works with discrete and probabilistic labeling functions and can utilize various neural network designs for probabilistic label generation. This end-to-end approach can simplify the process of developing effective machine learning models using weak supervision as the primary source of training signal, and help adoption of this form of learning in a wide range of practical applications.
AggNet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Transactions on Medical Imaging 35 (5), pp. 1313–1321.
Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT ’98, New York, NY, USA, pp. 92–100.
Comparing the value of labeled and unlabeled data in method-of-moments latent variable estimation. AISTATS.
Proceedings of the AAAI Conference on Artificial Intelligence.
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 1003–1011.
In this section we motivate the design choices and inductive biases that we encode into our neural encoder network , which is the network that is used to model the relative accuracies of the weak supervision sources . Recall that we model the probability of a particular sample having the class label as
where weighs the LF votes on a sample-by-sample basis and the softmax for class on is defined as
Connection to prior PGM models. We now motivate this choice by deriving a less expressive variant of it from the standard Markov Random Field (MRF) used in the related work. If we view the attention scores , that assign sample-dependent accuracies to each labeling function, as sample-independent parameters and, by that, drop the features from the equation – as is done in the related work Ratner et al. [2016, 2019b], Fu et al. , Chen et al.  – we can rewrite Eq. 4 as
Let , and, for clarity of writing, we drop the class balance, then this becomes
where in the second step we multiplied the denominator and numerator with the same quantity , and now parameterizes the joint distribution of the latent label and weak sources as
We can recognize as a distribution from the exponential family, and more specifically as a pairwise MRF, or factor graph, with canonical parameters and corresponding sufficient statistics, or factors, , as well as the log partition function . The accuracy factors and parameters are the core component of this model and sometimes take the form in binary models as in Ratner et al. , Fu et al. , Chen et al. . The label-independent factors have, as can be seen from the derivation above, no direct influence on the latent label posterior, but are often used to model labeling propensities and correlation dependencies , which can be important for PGM parameter learning, but are susceptible to misspecification Varma et al. , Chen et al. , Cachay et al. . Our own parameterization therefore is a more expressive variant of these latent-variable PGM models, where we are able to assign LF accuracies on a sample-by-sample basis. Furthermore, our neural encoder network outputs them as a function of the LF outputs and features, and is expected to learn the easy-to-misspecify dependencies and label-independent statistics implicitly. Indeed, our empirical findings and subsection 4.3 support this.
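To make the aggregation step concrete, the following is a minimal sketch in plain Python rather than the authors' implementation. It assumes that abstentions are encoded as -1, that the per-LF scores are given as plain numbers (in WeaSEL they would be produced by the encoder network from the LF outputs and features), and that `scale` stands in for the constant multiplying the softmax output. The names `softmax`, `encoder_label`, and `ABSTAIN` are introduced here for illustration only.

```python
import math

ABSTAIN = -1  # assumed encoding of an abstaining labeling function


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encoder_label(votes, raw_scores, num_classes, scale=1.0):
    """Soft label for one sample from LF votes and per-LF scores.

    votes:      class index voted by each LF, or ABSTAIN
    raw_scores: unnormalized per-LF scores for this sample (hypothetical
                stand-in for the encoder network output)
    """
    # Positive, normalized accuracy weights, one per LF.
    accuracies = [scale * a for a in softmax(raw_scores)]
    # Each non-abstaining LF adds its accuracy weight to the class it voted for.
    logits = [0.0] * num_classes
    for vote, accuracy in zip(votes, accuracies):
        if vote != ABSTAIN:
            logits[vote] += accuracy
    return softmax(logits)


votes = [1, 1, 0, ABSTAIN]
uniform = encoder_label(votes, [0.0, 0.0, 0.0, 0.0], num_classes=2)
skewed = encoder_label(votes, [0.0, 0.0, 3.0, 0.0], num_classes=2)
```

With equal scores the accuracy weights are uniform, so the soft label follows the majority of the non-abstaining votes. A larger score on the third LF lets that single LF outweigh the other two for this sample, which is the kind of sample-dependent weighting that a fixed set of accuracy parameters cannot express.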
|Model||Spouses (9 LFs)||ProfTeacher (99 LFs)||IMDB (136 LFs)||IMDB (12 LFs)||Amazon (175 LFs)|
|Sup. (Val. set)|
|Snorkel||82.22 ± 0.18||74.45 ± 0.58|
|Majority vote||85.44 ± 0.37||84.20 ± 0.52|
|WeaSEL||51.98 ± 1.60||86.98 ± 0.45||82.10 ± 0.45||77.22 ± 1.02||86.60 ± 0.71|
We provide more detailed results in Table 4. Here, we include WeaSEL-random, which corresponds to WeaSEL with a randomly initialized encoder network that is not trained or updated. As expected, this setting often performs similarly to an end model trained on the hard majority vote labels. This is due to the strong inductive bias in our encoder model, which constrains the encoder labels to be a normalized linear combination of the LF votes, weighted by positive accuracy scores. In fact, WeaSEL-random itself is often able to outperform the PGM-based baselines, in particular the triplet methods. Our results show that WeaSEL consistently and significantly improves upon these baselines by training the encoder network to maximize its agreement with the downstream model.
For the Spouses dataset, and the IMDB variant with 12 LFs, we use the same LFs as in Fu et al.  and Chen et al. , respectively (all necessary label matrices are available in our research source code; the Spouses LFs and data are also available at https://github.com/snorkel-team/snorkel-tutorials/blob/master/spouse/spouse_demo.ipynb).
The set of 12 IMDB LFs was specifically chosen to have a large coverage, see Table 3.
These LFs and the larger set of LFs that we introduce for the second IMDB experiment are all pattern- and regex-based heuristics, i.e. LFs that label whenever a certain word or bi-gram appears in a text document. For instance, ’excellent’ would label for the positive movie review sentiment (and would do so with accuracy on the samples where it does not abstain). This holds for the other text datasets as well, while the Spouses experiments also contain LFs that are distant supervision sources based on DBPedia.
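As an illustration of what such a pattern-based LF looks like, here is a small sketch using Python's standard `re` module. The label encoding (-1 for abstain, 0 for negative, 1 for positive), the helper `make_keyword_lf`, and the negative bi-gram pattern are assumptions made for this example and are not taken from the LF sets used in the experiments.

```python
import re

# Assumed label encoding for this sketch.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def make_keyword_lf(pattern, label):
    """Build an LF that votes `label` when `pattern` matches, else abstains."""
    regex = re.compile(pattern, flags=re.IGNORECASE)

    def labeling_function(text):
        return label if regex.search(text) else ABSTAIN

    return labeling_function


lf_excellent = make_keyword_lf(r"\bexcellent\b", POSITIVE)
# Hypothetical bi-gram LF, included only to show a second column.
lf_not_good = make_keyword_lf(r"\bnot good\b", NEGATIVE)

reviews = [
    "An excellent film with a great cast.",
    "The plot was not good at all.",
    "I watched it on a plane.",
]
# One row per document, one column per LF.
label_matrix = [[lf(text) for lf in (lf_excellent, lf_not_good)] for text in reviews]
```

The third review matches neither pattern, so both LFs abstain on it; samples on which all LFs abstain carry no weak signal.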
For the remaining datasets (IMDB with 136 LFs, Bias in Bios, and Amazon), we created the respective LF sets ourselves, prior to running experiments.
In all experiments, we use a simple multi-layer perceptron (MLP) as the encoder, with two hidden layers, batch normalization, and ReLU activation functions. For the Spouses dataset, we use a bottleneck-structured network of sizes 50, 5. This is motivated by the small size of the set of samples labeled by at least one LF. For all other datasets we use hidden dimensions of 70, 70. We show in the ablations (Table 5) that our end-to-end model also succeeds for different encoder architecture choices.
As the downstream model, for all datasets besides Spouses, we use a three-layer MLP with hidden dimensions 50, 50, 25. For Spouses, we use a single-layer bidirectional LSTM with a hidden dimension of 150, followed by two fully-connected readout layers with dimensions 64, 32. All fully-connected layers use ReLU activation functions. We choose simple downstream architectures as we are interested in the relative improvements over other label models. More sophisticated architectures are, however, expected to further improve performance.
Unless explicitly mentioned, all reported experiments are averaged out over seven random seeds.
We use an L2 weight decay of and dropout of for both encoder and downstream model for all datasets but Spouses (where the LSTM does not use dropout).
All models are optimized with Adam, with early-stopping based on AUC performance on the small validation set, and a maximum number of epochs ( for Spouses). The batch size is set to .
The loss function is set to the (binary) cross-entropy.
For each dataset and each model/baseline, we run the same experiment for learning rates of and , and then report the model chosen according to the best ROC-AUC performance on the small validation set.
For Spouses we additionally run experiments with an L2 weight decay of which, due to the risk of overfitting to the small number of LF-covered data points, boosts performance for all models.
For our own model, WeaSEL, we also run additional experiments for Spouses with different configurations of the temperature hyperparameter, and again report the test performance as measured by the best validation ROC-AUC.
The probabilistic labels from Snorkel used for downstream model training are chosen over six different configurations of the learning rate and number of epochs for Snorkel's label model (again with respect to validation set ROC-AUC). For all binary classification datasets (i.e. all except for LabelMe), we tune the downstream model's decision threshold based on the resulting F1 validation score for all models. We believe that this, as an alternative to reporting test ROC-AUC scores, makes the comparison fairer, since F1 is a threshold-dependent metric. All label model baselines are provided with the class balance, which WeaSEL does not use (but which is expected to be helpful for unbalanced classes, where no validation set is available).
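The threshold tuning described above can be sketched as a simple grid search on the validation set. This is an illustrative version under stated assumptions: the candidate grid of 0.01 to 0.99, the tie-breaking rule (keep the first best threshold), and the function names are choices made for this example, not details reported in the text.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(val_probs, val_labels, candidates=None):
    """Return (threshold, F1) maximizing validation F1 over a candidate grid."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed grid
    best_threshold, best_f1 = 0.5, -1.0
    for threshold in candidates:
        preds = [1 if p >= threshold else 0 for p in val_probs]
        score = f1_score(val_labels, preds)
        if score > best_f1:  # strict comparison keeps the first best threshold
            best_threshold, best_f1 = threshold, score
    return best_threshold, best_f1


val_probs = [0.10, 0.35, 0.40, 0.80]
val_labels = [0, 1, 1, 1]
threshold, val_f1 = tune_threshold(val_probs, val_labels)
```

The selected threshold is then held fixed when computing the test F1, so that every model is compared at its own best validation operating point.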
|Change||ProfTeacher||IMDB- LFs||IMDB- LFs||Amazon|
|1 hidden layer||87.1 ± 0.7|
|1e-4 weight decay||87.4 ± 0.4||77.9 ± 0.6|
|Squared Hellinger loss||87.4 ± 0.3||82.2 ± 0.6|
|CE asymm. loss||77.3 ± 3.7||77.7 ± 1.1||71.7 ± 0.3||78.7 ± 1.2|
|CE asymm. loss||73.1 ± 6.8||71.9 ± 1.9||69.7 ± 0.7||70.1 ± 1.1|
|No stop-grad||80.4 ± 2.1||76.2 ± 0.5||71.0 ± 0.6|
|78.0 ± 0.7||86.9 ± 0.3|
|1e-5||69.1 ± 2.1||74.2 ± 2.7|
|71.9 ± 4.0||67.0 ± 0.8||67.0 ± 1.1||67.3 ± 1.1|
The full ablations are reported in Table 5, where in each row we change or remove exactly one component of our proposed model, WeaSEL.
We find that the design choices of WeaSEL, which were inspired by sensible inductive biases for an encoder label model, are hard to beat by various changes to the architecture, loss function, or hyperparameters.
Indeed, most changes consistently underperform WeaSEL, and the occasional positive changes – 1e-4 weight decay, and the Squared Hellinger loss instead of the symmetric cross-entropy – only beat the base WeaSEL performance in at most two datasets, and never significantly. In practice, we advise exploring these strongest configurations if a small validation set is available.
We find that letting the accuracy scores depend on the input features (first row) usually boosts performance, but not by much (1.2 F1 points at most). On the other hand, it proves very important to allow these accuracy scores to depend non-linearly on the LF votes and the features: a linear encoder network, as in Cao et al. , significantly underperforms WeaSEL with at least one hidden layer by up to 4.9 F1 score points. Conversely, a deeper encoder network (of hidden dimensionalities , see fourth row) does not improve results. This may be because the sample-dependent accuracies are not too complex a function to learn.
While the effect of the inverse temperature parameter – which controls the softness of the encoder-predicted accuracy scores – on downstream performance is not large, it can have significant effects on the learning dynamics and robustness, see Fig. 3 for such learning curves as a function of epoch number. In particular, a lower inverse temperature makes the dynamics more robust, since the accuracy score weights are more evenly distributed across LFs, which appears to help avoid overfitting. When overfitting is not easily detectable due to a lack of a validation set, it is therefore advisable to use a lower inverse temperature. It also proves helpful to scale the softmax in Eq. 3 by , rather than not scaling it ( row) or scaling by .
Changing the loss function from the symmetric cross-entropy to the MIG function Cao et al.  or the L1 loss consistently leads to worse performance. The former is interesting, since using the MIG loss for the crowdsourcing dataset LabelMe, see subsection 4.2, was important in order to achieve state-of-the-art crowdsourcing performance (with a similar lift in performance observable for Snorkel using MIG for downstream model training). The result provides some evidence that the MIG loss may be inappropriate for weak supervision settings other than crowdsourcing, while its use may be recommended for that specific setting.
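The effect of the inverse temperature on how evenly the accuracy weights are spread can be checked numerically. The sketch below is illustrative only: the example scores and the two inverse temperature values are arbitrary, and `tempered_softmax` and `entropy` are helper names introduced here.

```python
import math


def tempered_softmax(scores, inv_temperature):
    """Softmax of inv_temperature * scores, computed stably."""
    scaled = [inv_temperature * s for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means more evenly spread weights."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


scores = [2.0, 0.5, -1.0]                 # arbitrary per-LF scores
soft = tempered_softmax(scores, 0.1)      # low inverse temperature
sharp = tempered_softmax(scores, 3.0)     # high inverse temperature
```

Here `soft` stays close to uniform while `sharp` concentrates almost all weight on the first LF, so `entropy(soft)` exceeds `entropy(sharp)`. This matches the intuition in the text that a lower inverse temperature keeps any single LF from dominating the aggregated label.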
We find that it is important to constrain the accuracy score space to a positive interval, either by viewing them as an aggregation of the LFs via the scaled softmax in Eq. 3, or by replacing the softmax with a sigmoid function. Indeed, using a less constrained activation function for the estimated accuracies (last two rows, where the 1e-5 in the ReLU row avoids accuracy scores equal to zero) significantly underperforms: allowing the accuracies to be negative (last row) leads to collapse and bad downstream performance. This is likely due to the removal of the inductive bias that LFs are better-than-random, which makes the joint optimization more likely to find trivial solutions. Additionally, we find that our choice of using the symmetric cross-entropy loss with stop-grad applied to the targets is crucial for the strong performance of WeaSEL. Removing the stop-grad operation, or using the standard cross-entropy (without stop-grad on the target), leads to significantly worse scores and a very brittle model. This is somewhat expected, since conceptually our goal is to have an objective that maximizes the agreement between a pair of models that predict based on two different views of the latent label, the features and the LF votes. The cross-entropy with stop-grad on the target (or, due to the stop-grad operation, equivalently the KL divergence) naturally encodes this understanding, since each model uses the other model's predictions as a reference distribution. Losses that already are symmetric (e.g. L1 or Squared Hellinger loss) neither need to be symmetrized nor use stop-grad. While the L1 loss consistently underperforms, we find that the Squared Hellinger loss can lead to better performance on two out of four datasets.
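For reference, the forward values of the two best-performing objectives can be written down in a few lines. This sketch makes several assumptions: the two cross-entropy terms are summed with equal weight, the Squared Hellinger distance uses the common 1/2 normalization, and a small `eps` guards the logarithm. Plain Python has no automatic differentiation, so the stop-grad is only indicated in comments; in an autodiff framework the target of each term would be wrapped in that framework's stop-gradient or detach operation.

```python
import math


def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_c target_c * log(pred_c)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def symmetric_cross_entropy(p_end, p_enc):
    """Each model is scored against the other model's prediction as target.

    With autodiff, the target (first argument) of each term would be treated
    as a constant (stop-grad), so gradients flow only through the second.
    """
    return cross_entropy(p_enc, p_end) + cross_entropy(p_end, p_enc)


def squared_hellinger(p, q):
    """Squared Hellinger distance with 1/2 normalization (assumed here)."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


p_end = [0.7, 0.3]   # hypothetical downstream model prediction
p_enc = [0.6, 0.4]   # hypothetical encoder soft label
loss_value = symmetric_cross_entropy(p_end, p_enc)
```

Stop-grad does not change the loss value, only which arguments receive gradients, which is why the two variants compared in the ablation can have identical forward passes yet very different training behavior.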
However, only the symmetric cross-entropy loss with stop-grad on the targets is shown to be robust and able to recover the true labels in our synthetic experiments in appendix F, see Fig. 5 in particular. The synthetic ablation in appendix F gives interesting insights, and strongly supports the proposed design of WeaSEL. Indeed, many choices for WeaSEL that perform well enough on the real datasets, such as no features for the encoder, , sigmoid parameterized accuracies, and all other objectives that we evaluated, lead to significantly worse performance and less robust learning on the synthetic adversarial setups.
As the crowdsourcing dataset, we choose the multi-class LabelMe image classification dataset that was previously used in the most related crowdsourcing literature Rodrigues and Pereira , Cao et al. . Note that this dataset consists of samples, of which only are unique, in the sense that the rest are augmented versions of the . They were annotated by crowdworkers, with a mean overlap of annotations per image. The downstream model is identical to the previously reported one Rodrigues and Pereira , Cao et al. . That is, a VGG-16 neural network is used as feature extractor, and a single fully-connected layer (with 128 units and ReLU activation) and one output layer are put on top, using 50% dropout.
Experiments were conducted over seven random seeds with a learning rate of 1e-4 and 50 epochs. The reported scores are those with the best validation set accuracy for an L2 weight decay in {7e-7, 1e-4}. The validation set is of size 200, and was split at random from the training set prior to running the experiments.
As is usual in the related work for multi-class settings Ratner et al. [2019a], we employ class-conditional accuracies instead of only class-independent accuracies. Recall the LF outputs indicator matrix,
. To compute the resulting output softmax logits, we set and , where is the element-wise matrix product and we sum up the resulting matrix across the LF votes dimension.
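The element-wise product followed by a sum over the LF dimension can be sketched as follows. The notation is assumed for this example: `accuracies` is an m-by-C matrix of class-conditional accuracy scores for one sample, the indicator matrix has a 1 in entry (i, c) when LF i votes for class c, and abstentions are encoded as -1.

```python
ABSTAIN = -1  # assumed encoding of an abstaining labeling function


def indicator_matrix(votes, num_classes):
    """m x C one-hot matrix of LF votes; abstaining LFs give an all-zero row."""
    return [[1.0 if vote == c else 0.0 for c in range(num_classes)] for vote in votes]


def class_conditional_logits(votes, accuracies, num_classes):
    """Sum over LFs of (accuracies * indicator), taken element-wise."""
    indicator = indicator_matrix(votes, num_classes)
    return [
        sum(accuracies[i][c] * indicator[i][c] for i in range(len(votes)))
        for c in range(num_classes)
    ]


votes = [0, 2, 2, ABSTAIN]
accuracies = [            # hypothetical class-conditional accuracy scores
    [0.9, 0.1, 0.1],
    [0.2, 0.2, 0.5],
    [0.3, 0.3, 0.4],
    [0.7, 0.7, 0.7],
]
logits = class_conditional_logits(votes, accuracies, num_classes=3)
```

Only the accuracy entry for the class an LF actually voted for contributes to the logits, so an LF can be trusted more for some classes than for others. A softmax over `logits` then gives the soft label.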
In this section we give more details on the experiments that validate the robustness of our approach against (strongly) correlated LFs that are not better than a random coin flip. In addition, we present one further experiment where the random LFs are independent of each other – a more difficult setup for learning (but which does not violate any assumptions of the PGM-based methods) – and our model, WeaSEL, again is shown to be robust to a large extent.
In contrast to WeaSEL, prior PGM-based work Ratner et al. [2019a], Fu et al. , Chen et al.  attains significantly worse performance under these settings, due to assuming a Naive Bayes generative model where the weak label sources are conditionally independent given the latent label.
For this experiment we use our set of 12 LFs for the IMDB dataset and generate a fake adversarial source by flipping the abstain votes of the 80%-accurate LF that labels for the positive sentiment on 'excellent' to negative ones.
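Constructing such an adversarial source from an existing LF column amounts to a one-line transformation. The sketch below assumes the label encoding -1 (abstain), 0 (negative), 1 (positive); the function name and the toy vote column are hypothetical.

```python
# Assumed label encoding for this sketch.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def make_adversarial_votes(lf_votes):
    """Copy an LF's votes, replacing every abstention with a negative vote."""
    return [NEGATIVE if vote == ABSTAIN else vote for vote in lf_votes]


# Toy votes of a keyword LF that only ever votes positive or abstains.
excellent_votes = [POSITIVE, ABSTAIN, ABSTAIN, POSITIVE, ABSTAIN]
adversarial_votes = make_adversarial_votes(excellent_votes)
```

The resulting source never abstains and agrees with the original LF wherever that LF voted, so it is strongly correlated with it while adding uninformed negative votes everywhere else.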
In this set of synthetic experiments we again validate the robustness of our approach. We focus on the Bias in Bios dataset, and use the features and true labels, , therein. We let our initial LF set consist of 1) a 100% accurate LF, that is we set , and 2) a LF that votes according to the class balance (i.e. a coin flip with probabilities for tail/head set according to the class balance), i.e. . In the first experiment we then add the same random LF multiple times into the LF set (i.e. we duplicate it), see F.2.1, while in the second one, we incrementally add random LFs independently of (and independently of any other LF already in the LF set), see F.2.2. For both setups, our model, WeaSEL, is able to recover the performance of the same downstream model, , that is directly trained on the true labels, (F1 , ROC-AUC , see Table 4). In contrast, the PGM-based baselines quickly collapse.
This experiment is inspired by the theoretical comparison in Appendix E of Cao et al.  between the authors’ end-to-end system and maximum likelihood estimation (MLE) approaches that assume mutually independent LFs.
The authors show that such MLE methods are not robust against the following simple example with correlated LFs.
Based on the setup described above in F.2, we duplicate the random LF multiple times, i.e. .
We run experiments for varying number of duplicates .
With this synthetic set of LFs, where one LF is accurate while the other LFs are just as good as a random guess, we train WeaSEL in the usual way on the features from the Bias in Bios dataset as well as the corresponding, just created, LF votes.
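A possible way to generate this synthetic LF set for binary labels is sketched below. It assumes that `num_copies` counts the total number of identical random-LF columns, that labels are 0/1, and that the random LF draws the positive class with probability equal to the empirical class balance; the function names are illustrative.

```python
import random


def build_duplicated_lf_matrix(true_labels, num_copies, seed=0):
    """Rows: samples. Column 0: perfect LF. Columns 1..: copies of one random LF."""
    rng = random.Random(seed)
    positive_rate = sum(true_labels) / len(true_labels)  # class balance
    # A single coin-flip LF, drawn once and then duplicated.
    random_lf = [1 if rng.random() < positive_rate else 0 for _ in true_labels]
    return [[y] + [r] * num_copies for y, r in zip(true_labels, random_lf)]


def majority_vote(row):
    """Most frequent vote in a row (ties broken arbitrarily)."""
    return max(set(row), key=row.count)


true_labels = [1, 0, 0, 1, 0, 0, 0, 1]
matrix = build_duplicated_lf_matrix(true_labels, num_copies=3)
```

With two or more copies, an unweighted majority vote always returns the random LF's vote, because the duplicated column outnumbers the single accurate one. This is the failure mode that an aggregation rule has to overcome by learning to down-weight the duplicated source.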
WeaSEL is able to consistently and almost completely recover this fully supervised performance, even when the number of duplicates is very high (). Snorkel and triplets methods, on the other hand, fare far worse (AUC ) for all numbers of duplicates. This behavior is similar to the one observed in F.1 (see Fig. 2 for the performance of the baselines and WeaSEL averaged out over the varying number of duplicates, and Fig. 5a-c for the separate performance of WeaSEL for each number of duplicates).
We also run an additional ablation study on this synthetic experiment that shows that the observed robustness does not hold for all configurations of WeaSEL.
In Fig. 5 we plot the test performance curves over the training epochs for each number of LF duplications.
Our proposed model, WeaSEL, enjoys a stable and robust test curve (Fig. 4(c)) and quickly recovers the fully supervised performance, even with 2000 LF duplicates (although convergence becomes slower as the LF set contains more duplicates). On the other hand, we find that many other configurations and designs of WeaSEL lead to less robust and worse converging curves, collapses or bad performances. Indeed, for this experiment it is key to use as the loss function the proposed symmetric cross-entropy with stop-grad applied to the targets (see Fig. 4(e), 4(f)), accuracies parameterized by a scaled (Fig. 4(h)) softmax (Fig. 4(g)), and, to a lesser extent, using the features as input to the encoder (Fig. 4(d)).
While the impact of not using stop-grad, or using an asymmetric cross-entropy loss, is similarly bad in the main ablations on our real datasets, other configurations, in particular sigmoid-parameterized accuracies (the choice in Karamanolakis et al. ), an unscaled softmax, and no features for the encoder, often perform well there. This additional ablation, however, provides support for why, the good performances on the real datasets notwithstanding, our proposed design choices are the most appropriate in order to attain strong test performances as well as stable and robust learning.
We start with the same setup as above in F.2, but instead of duplicating the same LF multiple times as in F.2.1, we now draw a new, independent random LF at each iteration. That is, we start with as our initial LFs, and then incrementally add new LFs that have no better skill than a coin flip. Note that this is arguably a harder setup than the one in the previous experiments, since there the LF set was corrupted by a single LF voting pattern. In this experiment, multiple equally bad, but independent, LFs corrupt the 100% accurate signal of . Notably, since these are independent, we are not violating the independence assumptions of PGM-based methods. Nonetheless, we find that these PGM-based baselines break with only three () of such random, but independent LFs, while WeaSEL is shown to be fully robust and able to recover the ground truth LF for up to 10 random LFs (). For more LFs, WeaSEL starts deteriorating in performance, but is still able to consistently outperform the trivial solution of voting randomly according to the class balance (i.e. based on ) and the baselines, see Fig. 4.
Large labeled datasets are important to many machine learning applications. Reducing the expensive human effort required to annotate such datasets is an important step towards making machine learning more accessible, more manageable, more beneficial, and therefore used more broadly. Our proposed end-to-end approach to learning from weak supervision provides another step towards the practical utility of learning from multiple sources of weak labels on large datasets. Methods such as the one presented in our paper must be applied with care. One of the risks to consider and mitigate in a particular application is the possibility of incorporating biases from the subjective humans who choose the weak labeling sources. This is particularly the case when heuristics might apply differently to different subgroups in the data, as may be the case in scenarios highlighted in recent research on fairness in machine learning.