A Survey on Programmatic Weak Supervision

Labeling training data has become one of the major roadblocks to using machine learning. Among various weak supervision paradigms, programmatic weak supervision (PWS) has achieved remarkable success in easing the manual labeling bottleneck by programmatically synthesizing training labels from multiple potentially noisy supervision sources. This paper presents a comprehensive survey of recent advances in PWS. In particular, we give a brief introduction of the PWS learning paradigm, and review representative approaches for each component within PWS's learning workflow. In addition, we discuss complementary learning paradigms for tackling limited labeled data scenarios and how these related approaches can be used in conjunction with PWS. Finally, we identify several critical challenges that remain under-explored in the area to hopefully inspire future research directions in the field.




1 Introduction

During the last decade, deep learning and other representation learning approaches have achieved remarkable success, largely obviating the need for manual feature engineering and achieving new state-of-the-art scores across a broad range of data types, tasks, and domains. However, they have largely done so via complex architectures that require massive labeled training data sets. Unfortunately, manually collecting, curating, and labeling these training sets is often prohibitively time-consuming and labor-intensive. The data-hungry nature of these models has thus led to increased demand for innovative ways of collecting substantial labeled training data sets cheaply, and in particular, of labeling them.

To tackle the label scarcity bottleneck, a variety of classical approaches have seen a resurgence of interest. For instance, active learning (AL) [47, 44] aims to select the most informative samples to train the model with a limited labeling budget. Semi-supervised learning (SSL) [50, 57] leverages a set of unlabeled data to improve the model's performance. Transfer learning approaches [37, 56] pre-train a model or a set of representations on a source domain to enhance performance on a different target domain. However, these approaches still require a set of clean labeled data to achieve satisfactory performance, and thus do not fully address the label scarcity bottleneck.

To truly reduce the burden of training data annotation, practitioners have resorted to cheaper sources of labels. One classic approach is distant supervision, where external knowledge bases are leveraged to obtain noisy labels [19]. There are also other options, including crowdsourced labels [62], heuristic rules [2], feature annotation [31], and others. A natural question is: could we combine these approaches, and an even broader range of potential weak supervision inputs, in a principled and abstracted way?

The recently proposed programmatic weak supervision (PWS) frameworks provide an affirmative answer to this question [42, 41]. Specifically, in PWS, users encode weak supervision sources, e.g., heuristics, knowledge bases, and pre-trained models, in the form of labeling functions (LFs), which are user-defined programs that each provide labels for some subset of the data, collectively generating a large set of training labels.

The labeling functions are usually noisy, with varying error rates, and may generate conflicting labels on certain data points. To address these issues, researchers have developed label models [42, 40, 15, 54], which aggregate the noisy votes of the labeling functions to produce training labels. The training labels are in turn used to train an end model for downstream tasks. These two-stage methods mainly focus on the efficiency and effectiveness of the label model, while maintaining maximal flexibility in the end model. In addition to the two-stage methods, later researchers also explored the possibility of coupling the label model and the end model in an end-to-end manner [45, 22]. We refer to these one-stage methods as joint models. An overview of the weak supervision pipeline is shown in Fig. 1.

In addition, these LFs often exhibit clear dependencies among them [42], and it is therefore crucial to specify and take into consideration the appropriate dependency structure [8]. However, manually specifying the dependency structure places an extra burden on practitioners; to reduce human effort, researchers have attempted to learn the dependency structure automatically [3, 51, 52]. Very recently, researchers have also explored the possibility of generating these LFs automatically [53] or interactively [7].

In this paper, we present the first survey on PWS to introduce its recent advances, with special focus on its formulations, methodology, applications, and future research directions. We organize this survey as follows: after a brief introduction to PWS in Sec. 2, we review approaches for each component within a standard PWS workflow, namely, the labeling functions (Sec. 3), the label model (Sec. 4), the end model (Sec. 5), and the joint model (Sec. 6). Then, we briefly address complementary approaches for the limited-label scenario and how they interact with PWS (Sec. 7). Finally, we discuss the challenges and future directions (Sec. 8). We hope that this survey can provide a comprehensive review for interested researchers, and inspire more research in this and related areas.

2 Preliminary

| Module      | Target Task                                                           | Method                | Additional Information       |
|-------------|-----------------------------------------------------------------------|-----------------------|------------------------------|
| Label Model | Classification                                                        | Data Programming [42] |                              |
|             |                                                                       | MeTaL [40]            |                              |
|             |                                                                       | FlyingSquid [15]      |                              |
|             |                                                                       | CAGE [10]             | user-provided quality of LFs |
|             |                                                                       | NPLM* [60]            |                              |
|             |                                                                       | PLRM* [63]            |                              |
|             | Sequence Tagging                                                      | Dugong [54]           |                              |
|             |                                                                       | HMM [26]              |                              |
|             |                                                                       | Linked HMM [46]       | linking functions            |
|             |                                                                       | CHMM [24]             |                              |
|             | Classification, Ranking, Regression, Learning in Hyperbolic Manifolds | UWS [48]              |                              |
| End Model   | Classification                                                        | COSINE [61]           |                              |
| Joint Model | Classification                                                        | Denoise [45]          |                              |
|             |                                                                       | WeaSEL [9]            |                              |
|             |                                                                       | ALL [1]               | error rate of LFs            |
|             |                                                                       | AMCL [32]             | set of labeled data          |
|             |                                                                       | ImplyLoss [2]         | exemplar data of LFs         |
|             |                                                                       | ASTRA [20]            | set of labeled data          |
|             |                                                                       | SPEAR [29]            | set of labeled data          |
|             | Sequence Tagging                                                      | ConNet [22]           |                              |
|             |                                                                       | DWS [38]              |                              |
Table 1: Comparisons among existing methods for each component of the PWS pipeline. *: NPLM and PLRM are able to utilize new types of LFs, as described in Sec. 4.

Now, we formally define the setting of PWS. We are given a dataset $X$ with $n$ data points, where the $i$-th data point is denoted by $x_i$. For each $x_i$, there is an unobserved true label denoted by $y_i$. Let $m$ be the number of sources $\{\lambda_j\}_{j=1}^{m}$, each assigning a label $\lambda_j(x_i)$ to some $x_i$ to vote on its respective $y_i$, or abstaining ($\lambda_j(x_i) = \emptyset$). In addition, some methods can handle the dependencies among sources by taking as input a dependency graph $G$ over the sources. For concreteness, we follow the general convention of PWS [42] and refer to these sources as labeling functions (LFs) throughout the paper. The goal is to apply the LFs to the unlabeled dataset to create a label matrix $L \in (\mathcal{Y} \cup \{\emptyset\})^{n \times m}$, and to then use $X$ and $L$ to produce an end machine learning model $f: \mathcal{X} \rightarrow \mathcal{Y}$.
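As an illustration of these objects, the following minimal sketch (with hypothetical keyword LFs and toy data, not tied to any particular PWS framework) applies $m = 2$ LFs to $n = 3$ data points to build the label matrix $L$, encoding the abstain symbol as -1:

```python
ABSTAIN = -1  # stands in for the abstain symbol in the formal setting

def lf_keyword_good(x):
    # hypothetical LF: votes positive (1) if "good" appears, otherwise abstains
    return 1 if "good" in x else ABSTAIN

def lf_keyword_awful(x):
    # hypothetical LF: votes negative (0) if "awful" appears, otherwise abstains
    return 0 if "awful" in x else ABSTAIN

X = ["a good movie", "an awful plot", "a plain story"]
lfs = [lf_keyword_good, lf_keyword_awful]

# n x m label matrix L: rows are data points, columns are LF votes
L = [[lf(x) for lf in lfs] for x in X]
print(L)  # [[1, -1], [-1, 0], [-1, -1]]
```

Note that the last data point is covered by no LF at all; handling such uncovered points is one of the issues discussed in Sec. 5.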

Figure 1: An overview of the PWS pipeline [64].

In general, PWS methods can be classified into two categories, as shown in Fig. 1.


Two-stage Method.

A two-stage method works as follows. In the first stage, a label model is used to aggregate the label matrix into either probabilistic (soft) training labels or one-hot (hard) training labels, which are in turn used to train the desired end model in the second stage. We review the label models and end models in the literature separately.

One-stage Method.

The one-stage methods attempt to train the label model and end model simultaneously. Specifically, they usually design a neural network for label aggregation while utilizing another neural network for final prediction. These approaches offer a more straightforward way of tackling weak labels. We refer to the model designed for a one-stage method as a joint model.

3 Labeling Functions

At the core of PWS are the labeling functions (LFs), which provide the potentially noisy weak labels that fuel the entire learning pipeline. In this section, we provide an overview of the popular types of LFs, how they are generally developed, and the potential dependency structure among them.

3.1 Labeling Function Types

In PWS, users encode different weak supervision sources into LFs, each of which noisily annotates a subset of data points. While an LF can be as general as any function that takes a data point as input and either outputs a corresponding label or abstains, we introduce the most common types of LFs used in practice.

3.1.1 User-written Heuristics

In practical applications, users generally have domain knowledge about the target learning task of interest. One common type of LF expresses this domain knowledge as heuristic labeling rules that associate corresponding labels with the data points. For example, in text applications, users write keyword- or regex-based LFs that assign corresponding labels to the data points that contain the keyword or match the specified regular expression [41, 34, 2]. In image applications, users write LFs that provide labels to image inputs containing specific objects, or possessing some user-specified visual/spatial properties [51, 11, 15].
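To make this concrete, a keyword- or regex-based LF is simply a small function over a data point; the label space, patterns, and function names below are hypothetical illustrations rather than LFs from any cited system:

```python
import re

SPAM, HAM, ABSTAIN = 1, 0, -1  # hypothetical label space; -1 means "no vote"

def lf_contains_prize(text):
    # heuristic: messages mentioning a prize or winner are likely spam
    return SPAM if re.search(r"\b(prize|winner)\b", text, re.I) else ABSTAIN

def lf_reply_greeting(text):
    # heuristic: messages opening with a personal greeting are likely ham
    return HAM if re.match(r"^(hi|hello|dear)\b", text, re.I) else ABSTAIN

print(lf_contains_prize("You are a WINNER of a big prize!"))  # 1
print(lf_reply_greeting("Hi Bob, see you at noon"))           # 0
```

Each such LF covers only the subset of data its pattern fires on and abstains elsewhere, which is why many noisy, partially overlapping LFs are combined downstream.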

3.1.2 Existing Knowledge

Knowledge Bases.

Oftentimes, external knowledge bases can be used to provide weak supervision for the learning task of interest, an approach commonly known as distant supervision [19, 25]. For example, in a relation extraction task to identify mentions of spouse relationships in news articles, [41] writes LFs that match the text inputs against the knowledge base DBpedia (https://www.dbpedia.org/) to search for known spouse relationships.

Pre-trained Models.

Existing pre-trained models from related tasks can be used as LFs to provide weak labels. For example, in a product classification task at Google, [4] leverages an existing semantic topic model to identify content irrelevant to the product categories of interest. In [63], pre-trained image classification models whose output label spaces differ from that of the target classification task are used as LFs to provide indirect weak supervision for the learning task of interest.

Third-party Tools.

To collect weak labels cheaply, users can turn to several existing third-party tools that can serve as LFs. For example, for review sentiment analysis, users can simply use TextBlob to assign a label to each review. Taking named entity recognition (NER) as another example, there are several tagging tools, such as spaCy (https://spacy.io/) and NLTK (https://www.nltk.org/), and [26] adopts them as LFs for the weakly supervised NER task. Note that these tools are not perfect: the weak labels generated from their outputs contain considerable noise.

3.1.3 Crowd-sourced Labels

Crowd-sourcing is a classic and well-studied approach for obtaining less accurate label annotations from non-expert contributors at lower annotation cost [14, 62]. In the PWS setting, each crowd-sourcing contributor can be represented as an LF that noisily annotates the data points [41, 22]. For example, in a weather sentiment classification task, each crowd-source contributor, who grades the sentiment of weather-related tweets into five categories, is considered an LF.

3.2 Labeling Function Generation

In the PWS learning paradigm, the first and foremost step is to create a set of LFs that generate the weak labels for learning the subsequent models. In practice, the LFs are typically developed by subject matter experts (SMEs) who have adequate knowledge about the task of interest. When developing LFs, in addition to leveraging existing domain knowledge, SMEs usually refer to a small subset of data points sampled from the unlabeled set, called the development set, to extract further task- or dataset-specific labeling heuristics that complement the pre-existing domain knowledge [41]. This process of LF development can be challenging and time-consuming even for domain experts; for example, it often requires SMEs to explore a considerable amount of development data to generate ideas for LFs [53, 16, 7]. As a result, researchers have recently aimed to reduce the effort spent in designing weak supervision sources along three main directions, namely, automatic generation, interactive generation, and guided generation of LFs.

Automatic Generation.

One direction for alleviating the burden of designing LFs in the PWS paradigm is to automate the process of LF development. [53] proposes a system, Snuba, that generates LFs automatically by learning weak classification models on a small labeled dataset. TALLOR [23] takes as input an initial set of seed LFs that are generally simpler, and automatically learns more accurate compound LFs from multiple simple labeling rules. Similarly, GLaRA [65] learns to augment a set of seed LFs automatically by exploiting the semantic relationships between candidate and seed LFs through a graph-based model. Notably, while we refer to this line of methods as "automatic generation" approaches, they do require a minimal amount of initial supervision, either in the form of a small labeled set or seed LFs.

Interactive Generation.

In contrast to fully automating the generation of LFs given a seed supervision set, interactive generation approaches cast LF development as an interactive process in which users are iteratively queried for feedback used to discover useful LFs from a large set of candidates [16, 7]. Specifically, in Darwin [16] and IWS [7], a set of candidate LFs is first generated based on $n$-grams or context-free grammar information. Then, in each iteration, the user is asked to annotate whether an LF proposed by the system is useful or not (i.e., better than random accuracy). Based on the feedback provided in each iteration, the systems learn to adapt and identify a set of high-precision LFs from the candidate set, which is used as the final set of LFs in the PWS learning pipeline. Compared to standard active learning approaches, which rely on instance-level annotations, the interactive generation approaches have been shown to achieve better performance with lower annotation costs.

Guided Generation.

Building on the current workflow of LF development, where SMEs write LFs by examining a small development set of data, guided-generation approaches aim to assist users in developing LFs by intelligently curating the development set, so as to efficiently guide SMEs in exploring the data and developing informative LFs that lead to strong resultant models [12]. The idea resembles traditional active learning [47] in that the goal is to strategically select data points from the unlabeled set and solicit informative supervision from the users, except that the supervision is provided at the function level (i.e., LFs) instead of the individual-label level.

4 Label Model

The multiple LFs we have for a given dataset often overlap and conflict with each other. In PWS, a label model is used to integrate the LFs' output predictions into probabilistic labels, aiming to accurately recover the unobserved ground truth labels. To date, various label models have been proposed, most of them based on probabilistic graphical models. It is worth noting that LFs developed in practice often exhibit statistical dependencies among each other [42, 8]. Incorporating this dependency information into the label model has been shown to be critical to the model's ability to correctly estimate the latent ground truths [42, 3, 51, 8]. However, not all label models take the LF dependency structure into account when aggregating the LFs' votes; some approaches simply assume conditional independence between the LFs.

In this section, we first discuss general approaches for incorporating LF dependency into the label model. Then, we introduce existing label models in more detail, categorized by their target learning tasks, with discussion of how the LF dependency is handled in some of the approaches.

4.1 LF Dependency Structure

Earlier work on PWS relies on users to manually specify the dependency structure among the LFs [42]. For example, users could specify two LFs to be similar, one LF to be fixing or reinforcing another, or two LFs to be mutually exclusive. Nevertheless, as manually specifying such dependency structure is generally hard for users, researchers have recently turned to learning or inferring the dependency structure automatically, without user supervision. To automatically learn the dependency structure, [3] proposes to maximize the $\ell_1$-regularized marginal pseudo-likelihood of a factor graph with higher-order dependencies and select the dependencies that have non-zero weights, while [52] exploits the sparsity of the label model and leverages a robust PCA technique to capture the underlying dependency structure. On the other hand, instead of learning the structure from the observed labels, [51] proposes an alternative approach that infers the relations between different LFs by statically analyzing the LFs' source code.

With the dependency structure at hand, whether manually specified or automatically learned/inferred, the prevailing approach is to embed the dependency relationships into the label model, which is typically a graphical model, through factor functions [42, 48] or the graph structure itself [40, 15, 54]. In the following subsections, we introduce the label models for different learning tasks in more detail, and provide an overview of these methods in Table 1.

4.2 Label Model for Classification

For classification problems, majority voting (MV) is the most straightforward approach for aggregating different LFs: it simply uses the consensus of the multiple LFs to obtain more reliable labels, without introducing any trainable parameters. Crowdsourcing models [14, 13, 43, 21] usually leverage the expectation-maximization (EM) algorithm to estimate the accuracy of each worker as well as infer the latent ground truth labels; these can also be applied here by regarding each LF as a worker. Apart from these approaches, we review several label models tailored to PWS problems. These label models are all based on probabilistic graphical models and aim to maximize the probability of observing the outputs of the LFs. Specifically, they share an optimization problem of the following form, maximizing the marginal likelihood of the observed label matrix with the latent true labels marginalized out:

$$\max_\theta \; \log P_\theta(L) \;=\; \max_\theta \; \sum_{i=1}^{n} \log \sum_{y \in \mathcal{Y}} P_\theta(L_i, y)$$
The key differences among existing label models lie in how they parameterize the joint distribution $P_\theta(L, Y)$ and how the parameters are estimated. In particular, Data Programming (DP) [42] models the distribution as a factor graph. It describes the distribution in terms of pre-defined factor functions, which reflect the dependency of any subset of random variables and are also used to encode the dependency structure of the LFs. The log-likelihood is optimized by SGD, where the gradient is estimated by Gibbs sampling. MeTaL [40], instead, models the distribution via a Markov network and recovers the parameters via a matrix completion-style approach. Later on, FlyingSquid [15] was proposed to accelerate the learning process for binary classification problems. It models the distribution as a binary Ising model, where each LF is represented by two random variables, and a triplet method is used to recover the parameters, so no iterative learning is needed. Notably, the latter two methods encode the dependency structure of the LFs into the structure of the graphical model and require the label prior as input.

Additionally, researchers have attempted to extend the scope of usable LFs. CAGE [10] extends existing label models to support continuous LFs. In addition, it leverages user-provided quality guides for the LFs to increase training stability and make the model less sensitive to initialization. Moreover, NPLM [60] enables users to utilize partial LFs that output a subset of the possible class labels, and PLRM [63] allows the usage of indirect LFs that predict unseen but related classes; both works are built on probabilistic graphical models similar to [42] and greatly expand the scope of usable LFs in PWS.
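To build intuition for the triplet-style parameter recovery mentioned above, consider the simplified case of three conditionally independent binary LFs with votes in {-1, +1} and no abstentions. Then the accuracy parameters $a_i = E[\lambda_i y]$ satisfy $E[\lambda_i \lambda_j] = a_i a_j$, so each $a_i$ can be recovered from pairwise agreement statistics alone, with no labels and no iterative learning. The following is a minimal sketch on synthetic data (an illustration of the idea, not FlyingSquid's actual implementation):

```python
import random

random.seed(0)
n = 50_000
flip_probs = [0.1, 0.2, 0.3]                 # hypothetical LF error rates
true_acc = [1 - 2 * p for p in flip_probs]   # a_i = E[lam_i * y] = 1 - 2 p_i

# synthetic ground truth and conditionally independent LF votes in {-1, +1}
y = [random.choice([-1, 1]) for _ in range(n)]
votes = [[yi if random.random() > p else -yi for yi in y] for p in flip_probs]

def mean_prod(a, b):
    # empirical estimate of E[lam_i * lam_j]
    return sum(ai * bi for ai, bi in zip(a, b)) / n

# triplet recovery: a_i = sqrt(E[l_i l_j] * E[l_i l_k] / E[l_j l_k])
m01 = mean_prod(votes[0], votes[1])
m02 = mean_prod(votes[0], votes[2])
m12 = mean_prod(votes[1], votes[2])
est = [abs(m01 * m02 / m12) ** 0.5,
       abs(m01 * m12 / m02) ** 0.5,
       abs(m02 * m12 / m01) ** 0.5]
print([round(a, 2) for a in est])  # close to true_acc = [0.8, 0.6, 0.4]
```

The closed-form recovery is what makes such methods fast: estimating three empirical moments replaces gradient-based likelihood optimization.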

4.3 Label Model for Sequence Tagging

Sequence tagging problems are more complex, since there are dependencies among consecutive tokens. To model such properties, hidden Markov models (HMMs) [5] have been adopted: they represent the true labels as latent variables and infer them from the independently observed noisy labels through the expectation-maximization algorithm [55]. [26] directly applies an HMM to the named entity recognition task, and [46] proposes the Linked-HMM to incorporate unique linking rules as a supervision source additional to general weak labels on tokens. Moreover, the conditional hidden Markov model (CHMM) [24] substitutes the constant transition and emission matrices with token-wise counterparts predicted from BERT embeddings, modeling the evolution of the true labels in a context-dependent manner. Another characteristic of sequence tagging problems is that supervision can be provided at different resolutions (e.g., frame-, window-, and scene-level for videos). To integrate these, Dugong [54] was proposed to assign probabilistic labels to data with graphical models; it also accelerates inference with SGD-based optimization techniques. Finally, as shown in [64], label models for classification tasks can also be applied to sequence tagging problems with certain adaptations.

4.4 Label Model for General Learning Tasks

Very recently, UWS [48] goes beyond the traditional tasks and generalizes PWS frameworks to handle further kinds of tasks, including ranking, regression, and learning in hyperbolic manifolds, via an efficient method-of-moments approach in the embedding space.

5 End Model

After obtaining the probabilistic labels, the end model is a discriminative model trained on them for the downstream task. Since the probabilistic training labels derived from the label model may still contain noise, [41] suggests using a noise-aware loss as the training objective for the end model. However, one drawback of such end models is that they are usually trained only on the data covered by weak supervision, while there may exist a non-negligible portion of data that is not covered by any LF. Motivated by this, COSINE [61] designs a better end model by leveraging the data uncovered by the LFs. Specifically, it utilizes these uncovered data in a self-training manner and generates pseudo-labels for the unlabeled data. Apart from the above methods, other approaches designed for learning with noisy labels [49] can also be utilized as end models.

6 Joint Model

The traditional PWS pipeline trains the label model and end model separately; in contrast, the joint model aims to train the label model and the end model in an end-to-end manner, allowing the two to enhance each other mutually. In addition, the joint model usually leverages a neural network as the label model instead of the aforementioned statistical label models; this design choice not only facilitates the co-training of the label model and end model, but also reflects the motivation of taking data features into account when training the label model, leading to an instance-dependent label model, i.e., $P(y \mid \lambda(x), x)$.

As opposed to statistical label models (Sec. 4), which explicitly incorporate LF dependencies through the graph structure of the underlying graphical models, neural-network-based joint models have been observed to implicitly capture the dependencies among the LFs during learning [9]. However, existing joint models generally cannot incorporate a pre-specified dependency structure.

6.1 Joint Model for Classification

Denoise [45] and WeaSEL [9] first reparameterize prior probabilistic posteriors with a neural network, then assign scores to each PWS source for aggregation. The posterior network and the end model are then trained simultaneously to maximize the agreement between them. [1, 32] both formulate weakly supervised classification as a constrained min-max optimization problem: ALL [1] learns a prediction model that has the highest expected accuracy with respect to an adversarial labeling of the unlabeled dataset, where this labeling must satisfy error constraints on the weak supervision sources, while AMCL [32] constructs the constraints based on the expected loss on a small set of clean data.

To denoise LFs more effectively, several methods propose to use a small amount of labeled data in training. ImplyLoss [2] jointly trains a rule denoising network, based on labeled exemplars of the rules, together with a classification model via a soft implication loss. SPEAR [29] extends ImplyLoss by designing additional loss functions on both labeled and unlabeled data and encouraging consistency between the two models. In addition, ASTRA [20] adopts self-training for PWS with a teacher-student framework: the student model is initialized with a small amount of labeled data and generates pseudo-labels for instances not covered by LFs, while the teacher model combines the LFs with the output of the student model for the final prediction.

6.2 Joint Model for Sequence Tagging

For the sequence tagging problem, the Consensus Network (ConNet) [22] trains a BiLSTM-CRF [28] with a separate CRF layer for each labeling source. It then aggregates the CRF transitions with attention scores conditioned on the quality of the LFs and outputs a unified label sequence. DWS [38] uses a CRF layer to capture statistical dependencies among tokens, weak labels, and latent true labels. Moreover, it adopts a hard EM algorithm for model training: in the E-step, it finds the most probable labels for the given sequence, and in the M-step, it maximizes the probability of those labels.

7 Complementary Approaches

In this section, we briefly describe how PWS can be connected to or combined with complementary machine learning approaches that also aim to deal with the label scarcity issue.

Active Learning.

Active learning (AL) attempts to handle the label scarcity issue by interactively annotating the most informative samples to achieve good performance with few labels. As a complementary approach, PWS can be utilized to improve AL. For example, [30] expands the initial labeled set in AL by querying labels for the samples most relevant to existing labeled ones according to the LFs, and [35] applies PWS to generate initial noisy training labels to improve the efficiency of a subsequent active learning process. Conversely, AL can in turn help PWS: [6] asks experts to provide labels where the label model is most likely to be mistaken, and Asterisk [36] employs AL to enhance the label model, proposing a selection policy based on the estimated accuracy of the LFs and the output of the label model.

Transfer Learning.

Transfer learning (TL), which adapts a trained model to new tasks and consequently tends to require less labeled data than training from scratch, has recently attracted increasing attention, especially given the great success of fine-tuning large pretrained models with few labels. We note that TL and PWS are orthogonal to each other and can be combined to achieve the best performance: TL can reduce, but not eliminate, the demand for labeled data, which PWS can then supply. Indeed, current state-of-the-art PWS methods usually rely on fine-tuning pretrained models with labels produced by the label model [64].

Semi-Supervised Learning.

Semi-supervised learning (SSL) aims to train a model with a small amount of labeled data together with a large amount of unlabeled data. The idea of leveraging unlabeled data to improve training has also been applied to PWS methods: [20, 61] use self-training to bootstrap over unlabeled data. Moreover, [58] improves SSL by leveraging the idea of PWS; specifically, it uses the labeled data to generate LFs that are in turn used to annotate the unlabeled data, and finally the model is trained on the whole dataset with the provided or synthesized labels. To sum up, SSL and PWS are also complementary, and future work includes developing more advanced methods to combine clean labels and weak labels to further boost performance.

8 Challenges and Future Directions

Extend to More Complex Tasks.

The majority of PWS methods only support classification or sequence tagging tasks, while a variety of tasks require high-level reasoning over concepts, such as question answering [39], navigation [18], and scene graph generation [59], and curating labeled data for these tasks requires even more human effort. Moreover, in these tasks the input data may come from multiple modalities, including text, images, and tables, whereas current PWS methods only consider LFs over one specific modality. Hence, it is crucial yet challenging to develop multi-modal PWS methods to improve data efficiency on these tasks.

Extend the Scope of Usable LFs.

Although researchers have made attempts to extend the scope of usable LFs [63, 60], there are other sources that could potentially be used as LFs, e.g., physical rules, for more complex tasks. The ultimate goal of PWS is to leverage as many existing sources as possible to minimize human effort in the curation of training data.

Ethical and Trustworthy AI.

One of the most pressing concerns in the AI community right now is ensuring that AI techniques and models are applied ethically. Within this area of focus, one of the most important and challenging topics is ensuring that the training data which informs models is ethically labeled and managed, transparent, auditable, and bias-free. PWS approaches offer a step-change opportunity in this regard, since they result in training labels generated by code which can be inspected, audited, governed, and edited to reduce bias. However, by the same token, PWS methods can also lead to more direct bias in training data sets if used and modeled improperly [17, 27]. Overall, further systematic study in this area is highly critical, and has great opportunity for improving the state of data in AI from an ethics and governance perspective.

9 Conclusion

Manual annotations are of great importance for training machine learning models, but they are usually expensive and time-consuming to obtain. Programmatic weak supervision (PWS) offers a promising direction for achieving large-scale annotation with minimal human effort. In this article, we review the PWS area by introducing existing approaches for each component of a PWS workflow. We also describe how PWS can interact with methods from related fields for better performance on downstream applications. We then list existing datasets and recent applications of PWS in the literature. Finally, we discuss current challenges and future directions in the PWS area, hoping to inspire future research advances in PWS.


  • [1] C. Arachie and B. Huang (2019) Adversarial label learning. In AAAI, Cited by: Table 1, §6.1.
  • [2] A. Awasthi, S. Ghosh, R. Goyal, and S. Sarawagi (2020) Learning from rules generalizing labeled exemplars. In ICLR, External Links: Link Cited by: §1, Table 1, §3.1.1, §6.1.
  • [3] S. H. Bach, B. He, A. Ratner, and C. Ré (2017) Learning the structure of generative models without labeled data. In ICML, Cited by: §1, §4.1, §4.
  • [4] S. H. Bach, D. Rodriguez, Y. Liu, C. Luo, H. Shao, C. Xia, S. Sen, A. Ratner, B. Hancock, H. Alborzi, et al. (2019) Snorkel drybell: a case study in deploying weak supervision at industrial scale. In SIGMOD, Cited by: §3.1.2.
  • [5] L. E. Baum and T. Petrie (1966) Statistical inference for probabilistic functions of finite state markov chains. Ann. Math. Stat. 37 (6). Cited by: §4.3.
  • [6] S. Biegel, R. El-Khatib, L. O. V. B. Oliveira, M. Baak, and N. Aben (2021) Active weasul: improving weak supervision with active learning. arXiv preprint arXiv:2104.14847. Cited by: §7.
  • [7] B. Boecking, W. Neiswanger, E. Xing, and A. Dubrawski (2021) Interactive weak supervision: learning useful heuristics for data labeling. In ICLR, Cited by: §1, §3.2, §3.2.
  • [8] S. R. Cachay, B. Boecking, and A. Dubrawski (2021) Dependency structure misspecification in multi-source weak supervision models. ICLR Workshop on Weakly Supervised Learning. Cited by: §1, §4.
  • [9] S. R. Cachay, B. Boecking, and A. Dubrawski (2021) End-to-end weak supervision. NeurIPS. Cited by: Table 1, §6.1, §6.
  • [10] O. Chatterjee, G. Ramakrishnan, and S. Sarawagi (2020) Robust data programming with precision-guided labeling functions. In AAAI, Cited by: Table 1, §4.2.
  • [11] V. S. Chen, P. Varma, R. Krishna, M. Bernstein, C. Re, and L. Fei-Fei (2019) Scene graph prediction with limited labels. In ICCV, Cited by: §3.1.1.
  • [12] B. Cohen-Wang, S. Mussmann, A. Ratner, and C. Ré (2019) Interactive programmatic labeling for weak supervision. KDD DCCL Workshop. Cited by: §3.2.
  • [13] N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi (2013) Aggregating crowdsourced binary ratings. In WWW, Cited by: §4.2.
  • [14] A. P. Dawid and A. M. Skene (1979) Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of the Royal Statistical Society. External Links: ISSN 00359254, 14679876, Link Cited by: §3.1.3, §4.2.
  • [15] D. Y. Fu, M. F. Chen, F. Sala, S. M. Hooper, K. Fatahalian, and C. Ré (2020) Fast and three-rious: speeding up weak supervision with triplet methods. In ICML, Cited by: §1, Table 1, §3.1.1, §4.1, §4.2.
  • [16] S. Galhotra, B. Golshan, and W. Tan (2021) Adaptive rule discovery for labeling text data. In SIGMOD, External Links: ISBN 9781450383431 Cited by: §3.2, §3.2.
  • [17] M. Geva, Y. Goldberg, and J. Berant (2019) Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. In EMNLP, Cited by: §8.
  • [18] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. In CVPR, Cited by: §8.
  • [19] R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld (2011) Knowledge-based weak supervision for information extraction of overlapping relations. In ACL, Cited by: §1, §3.1.2.
  • [20] G. Karamanolakis, S. Mukherjee, G. Zheng, and A. H. Awadallah (2021) Self-training with weak supervision. In NAACL, External Links: Link, Document Cited by: Table 1, §6.1, §7.
  • [21] A. Khetan, Z. C. Lipton, and A. Anandkumar (2018) Learning from noisy singly-labeled data. In ICLR, External Links: Link Cited by: §4.2.
  • [22] O. Lan, X. Huang, B. Y. Lin, H. Jiang, L. Liu, and X. Ren (2020) Learning to contextually aggregate multi-source supervision for sequence labeling. In ACL, External Links: Link, Document Cited by: §1, Table 1, §3.1.3, §6.2.
  • [23] J. Li, H. Ding, J. Shang, J. McAuley, and Z. Feng (2021) Weakly supervised named entity tagging with learnable logical rules. In ACL, External Links: Link, Document Cited by: §3.2.
  • [24] Y. Li, P. Shetty, L. Liu, C. Zhang, and L. Song (2021) BERTifying the hidden markov model for multi-source weakly supervised named entity recognition. In ACL, Cited by: Table 1, §4.3.
  • [25] C. Liang, Y. Yu, H. Jiang, S. Er, R. Wang, T. Zhao, and C. Zhang (2020) Bond: bert-assisted open-domain named entity recognition with distant supervision. In KDD, Cited by: §3.1.2.
  • [26] P. Lison, J. Barnes, A. Hubin, and S. Touileb (2020) Named entity recognition without labelled data: a weak supervision approach. In ACL, Cited by: Table 1, §3.1.2, §4.3.
  • [27] L. Lucy and D. Bamman (2021) Gender and representation bias in gpt-3 generated stories. In NAACL Workshop on Narrative Understanding, Cited by: §8.
  • [28] X. Ma and E. Hovy (2016) End-to-end sequence labeling via bi-directional lstm-cnns-crf. In ACL, Cited by: §6.2.
  • [29] A. Maheshwari, O. Chatterjee, K. Killamsetty, G. Ramakrishnan, and R. Iyer (2021) Semi-supervised data programming with subset selection. In Findings of ACL, Cited by: Table 1, §6.1.
  • [30] N. R. Mallinar, A. Shah, T. Ho, R. Ugrani, and A. Gupta (2020) Iterative data programming for expanding text classification corpora. ArXiv abs/2002.01412. Cited by: §7.
  • [31] G. S. Mann and A. McCallum (2010) Generalized expectation criteria for semi-supervised learning with weakly labeled data.. JMLR. Cited by: §1.
  • [32] A. Mazzetto, C. Cousins, D. Sam, S. H. Bach, and E. Upfal (2021) Adversarial multiclass learning under weak supervision with performance guarantees. In ICML, Cited by: Table 1, §6.1.
  • [33] A. Mazzetto, D. Sam, A. Park, E. Upfal, and S. H. Bach (2021) Semi-supervised aggregation of dependent weak supervision sources with performance guarantees. In AISTATS, Cited by: §6.1.
  • [34] Y. Meng, J. Shen, C. Zhang, and J. Han (2018) Weakly-supervised neural text classification. In CIKM, Cited by: §3.1.1.
  • [35] M. Nashaat, A. Ghosh, J. Miller, S. Quader, C. Marston, and J. Puget (2018) Hybridization of active learning and data programming for labeling large industrial datasets. IEEE Big Data. Cited by: §7.
  • [36] M. Nashaat, A. Ghosh, J. Miller, and S. Quader (2020) Asterisk: generating large training datasets with automatic active supervision. ACM Transactions on Data Science. Cited by: §7.
  • [37] S. J. Pan and Q. Yang (2009) A survey on transfer learning. TKDE. Cited by: §1.
  • [38] J. Parker and S. Yu (2021) Named entity recognition through deep representation learning and weak supervision. In Findings of ACL, External Links: Link, Document Cited by: Table 1, §6.2.
  • [39] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100, 000+ questions for machine comprehension of text. In EMNLP, Cited by: §8.
  • [40] A. J. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, and C. Ré (2019) Training complex models with multi-task weak supervision. In AAAI, Cited by: §1, Table 1, §4.1, §4.2.
  • [41] A. J. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré (2017) Snorkel: rapid training data creation with weak supervision. In VLDB, Cited by: §1, §3.1.1, §3.1.2, §3.1.3, §3.2, §5.
  • [42] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré (2016) Data programming: creating large training sets, quickly. In NeurIPS, Cited by: §1, §1, §1, Table 1, §2, §4.1, §4.1, §4.2, §4.2, §4.
  • [43] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, et al. (2010) Learning from crowds. JMLR. Cited by: §4.2.
  • [44] P. Ren, Y. Xiao, X. Chang, P. Huang, Z. Li, B. B. Gupta, X. Chen, and X. Wang (2021) A survey of deep active learning. CSUR. Cited by: §1.
  • [45] W. Ren, Y. Li, H. Su, D. Kartchner, C. Mitchell, and C. Zhang (2020) Denoising multi-source weak supervision for neural text classification. In Findings of EMNLP, Cited by: §1, Table 1, §6.1.
  • [46] E. Safranchik, S. Luo, and S. Bach (2020) Weakly supervised sequence tagging from noisy rules. In AAAI, Vol. 34. Cited by: Table 1, §4.3.
  • [47] B. Settles (2009) Active learning literature survey. Cited by: §1, §3.2.
  • [48] C. Shin, W. Li, H. Vishwakarma, N. Roberts, and F. Sala (2022) Universalizing weak supervision. In ICLR, Cited by: Table 1, §4.1, §4.4.
  • [49] H. Song, M. Kim, D. Park, Y. Shin, and J. Lee (2020) Learning from noisy labels with deep neural networks: a survey. arXiv preprint arXiv:2007.08199. Cited by: §5.
  • [50] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, Cited by: §1.
  • [51] P. Varma, B. D. He, P. Bajaj, N. Khandwala, I. Banerjee, D. Rubin, and C. Ré (2017) Inferring generative model structure with static analysis. NeurIPS 30. Cited by: §1, §3.1.1, §4.1, §4.
  • [52] P. Varma, F. Sala, A. He, A. J. Ratner, and C. Ré (2019) Learning dependency structures for weak supervision models. In ICML, Cited by: §1, §4.1.
  • [53] P. Varma and C. Ré (2018) Snuba: automating weak supervision to label training data. In VLDB, Vol. 12. Cited by: §1, §3.2, §3.2.
  • [54] P. Varma, F. Sala, S. Sagawa, J. Fries, D. Fu, S. Khattar, A. Ramamoorthy, K. Xiao, K. Fatahalian, J. Priest, and C. Ré (2019) Multi-resolution weak supervision for sequential data. In NeurIPS, Vol. 32. External Links: Link Cited by: §1, Table 1, §4.1, §4.3.
  • [55] L. R. Welch (2003) Hidden markov models and the baum-welch algorithm. IEEE ITS Newsletter. Cited by: §4.3.
  • [56] G. Wilson and D. J. Cook (2020) A survey of unsupervised deep domain adaptation. TIST. Cited by: §1.
  • [57] Q. Xie, Z. Dai, E. Hovy, T. Luong, and Q. Le (2020) Unsupervised data augmentation for consistency training. NeurIPS. Cited by: §1.
  • [58] Y. Xu, J. Ding, L. Zhang, and S. Zhou (2021) DP-ssl: towards robust semi-supervised learning with a few labeled samples. NeurIPS. Cited by: §7.
  • [59] K. Ye and A. Kovashka (2021) Linguistic structures as weak supervision for visual scene graph generation. In CVPR, Cited by: §8.
  • [60] P. Yu, T. Ding, and S. H. Bach (2022) Learning from multiple noisy partial labelers. AISTATS. Cited by: Table 1, §4.2, §8.
  • [61] Y. Yu, S. Zuo, H. Jiang, W. Ren, T. Zhao, and C. Zhang (2021) Fine-tuning pre-trained language model with weak supervision: a contrastive-regularized self-training approach. In NAACL, External Links: Link Cited by: Table 1, §5, §7.
  • [62] M. Yuen, I. King, and K. Leung (2011) A survey of crowdsourcing systems. In SocialCom, Cited by: §1, §3.1.3.
  • [63] J. Zhang, B. Wang, X. Song, Y. Wang, Y. Yang, J. Bai, and A. Ratner (2022) Creating training sets via weak indirect supervision. In ICLR, Cited by: Table 1, §3.1.2, §4.2, §8.
  • [64] J. Zhang, Y. Yu, Y. Li, Y. Wang, Y. Yang, M. Yang, and A. Ratner (2021) WRENCH: a comprehensive benchmark for weak supervision. In NeurIPS, External Links: Link Cited by: Figure 1, §4.3, §7.
  • [65] X. Zhao, H. Ding, and Z. Feng (2021) GLaRA: graph-based labeling rule augmentation for weakly supervised named entity recognition. In EACL, External Links: Link Cited by: §3.2.