
Label Augmentation with Reinforced Labeling for Weak Supervision

by   Gurkan Solmaz, et al.

Weak supervision (WS) is an alternative to traditional supervised learning that addresses the need for ground truth. Data programming is a practical WS approach that allows programmatic labeling of data samples using labeling functions (LFs) instead of hand-labeling each data point. However, the existing approach fails to fully exploit the domain knowledge encoded into LFs, especially when the LFs' coverage is low. This is due to the common data programming pipeline, which neglects to utilize data features during the generative process. This paper proposes a new approach called reinforced labeling (RL). Given an unlabeled dataset and a set of LFs, RL augments the LFs' outputs to cases not covered by LFs based on similarities among samples. Thus, RL can lead to higher labeling coverage for training an end classifier. The experiments on several domains (classification of YouTube comments, wine quality, and weather prediction) result in considerable gains. The new approach produces significant performance improvement, leading up to +21 points in accuracy and +61 points in F1 scores compared to the state-of-the-art data programming approach.



1 Introduction

Supervised machine learning has proven to be very powerful and effective for solving various classification problems. However, training fully-supervised models can be costly since many applications require large amounts of labeled data. Manually annotating each data point of a large dataset may take up to weeks or even months. Furthermore, only domain experts can label the data in highly specialized scenarios such as healthcare and industrial production. Thus, the costs of data labeling might become very high.

In the past few years, a new weak supervision (WS) approach, namely data programming [ratner2016data; ratner2017snorkel], has been proposed to significantly reduce the time for dataset preparation. In this approach, a domain expert writes heuristic functions named labeling functions (LFs) instead of labeling each data point. Each function annotates a subset of the dataset with an accuracy expected to be better than a random prediction. Data programming has been successfully applied to various classification tasks. However, writing LFs might not always be trivial, for instance, when data points are huge vectors of numbers or when they are not intuitively understandable. Developers can quickly code a few simple functions, but having heuristics to cover many corner cases is still a burden. Further, simple heuristics might cover only a tiny portion of the unlabeled dataset (small coverage problem).

The existing data programming framework Snorkel [ratner2016data; ratner2017snorkel] implements a machine learning pipeline as follows. The LFs are applied to the unlabeled data points, and the outcomes of LFs produce a labeling matrix where each data point might be annotated by multiple, even conflicting, labels. A generative model processes the labeling matrix to make single label predictions for a subset of data points, based on the agreements and disagreements on the LF outputs for a given data point, using techniques such as majority voting (MV) or minimizing the marginalized log-likelihood estimates. Later, the label predictions are used to train a supervised model that serves as the end classifier (discriminative model). This approach has two major limitations: 1) only coarse information about the dataset (the LF outputs) is fed to the generative model, and 2) generalization is lacking due to the sparsity of the labeling matrix and the reliance on the end classifier alone to generalize. The current data programming does not take the data points’ features into account during the generative process, even though they are available throughout the pipeline. It utilizes the data features of only the data points with label predictions to train the end classifier. Various additional techniques are considered to complement the existing approach [varma2018snuba; chatterjee2020robust; varma2019learning; nashaat2018hybridization; varma2016socratic] and improve the learning process. However, these approaches do not address this major problem in the design of existing data programming.

This article proposes a label augmentation approach for weak supervision that takes the data features and the LF outputs into account earlier in the generative process. The proposed approach utilizes the features for augmenting labels. The augmentation algorithm, namely reinforced labeling (RL), checks similarities between data features to project existing LF outcomes to non-existing labels in the matrix (i.e., to unknown cases or abstains). Moreover, it uses a heuristic that considers that unknown cases “gravitate” towards the known cases with LF output labels. In such a way, RL enables generalization early on and creates a “reinforced” label set to train an end classifier.

Label augmentation extends data programming to new scenarios, such as when LFs have low coverage, when domain experts can implement only a limited number of LFs, or when the LFs’ outcomes result in a sparse labeling matrix. Label augmentation can provide satisfactory performance in these cases, where data programming was previously not applicable. The proposed approach can reduce the time spent by the domain experts to train a classifier, as they need to implement fewer LFs. One advantage compared to the existing complementary approaches is that RL does not require any additional effort for labeling data, annotating data, or implementing additional LFs. In other words, the label augmentation enhances classification without any further development burden or assumption of available labeled datasets (e.g., the so-called “gold data”). Furthermore, one can easily combine this approach with the existing solutions.

The RL method is implemented and compared against Snorkel (Sn) using different fully-supervised models as end classifiers. The experimental results span classification tasks from several domains (YouTube comments, white/red wine datasets, weather prediction). The new approach outperforms the existing model in terms of accuracy and F1 scores and comes closer to the outcomes of fully-supervised learning, thanks to the improved coverage that enables end classifier convergence.

2 Method description

2.1 Background on data programming

Figure 1: The label augmentation pipeline that brings data features and weakly-supervised labels together in the generative process.

In data programming [bach2019snorkel; ratner2017snorkel], a set of LFs annotates a portion (subset) of the original unlabeled dataset; the fraction of data points annotated is the total labeling coverage. Given a data point, an LF takes it as input and annotates it with a label. LFs are considered weak supervisors implemented by application developers, and they can programmatically annotate many data points at once, as opposed to hand-labeling data points one by one. On the other hand, LFs may have lower accuracies than ground truth for the data points.

For a binary classification task, an LF may return one of the two classes or abstain from making a prediction. For simplicity, let us consider an LF returning 1 or 0 as the class labels and -1 (abstain) when it refrains from class prediction. LFs’ outputs form a labeling matrix where rows represent the data points and columns represent the LFs. A generative model takes the labeling matrix as input, filters out the data points with no label (all LFs abstained), and tries to predict a label for the remaining data points. An example of a generative model might be a majority voter (MV) based on LF outputs or a model that minimizes the negative log marginal likelihood [ratner2017snorkel] (likelihood over latent variables, i.e., LF outputs). Dependency structures between the LFs are learned as shown in Alg. 1 in [bach2017learning]. In both MV and marginal likelihood approaches, the generative model takes the labeling matrix as input and tries to make a label decision based on the agreements or disagreements of LFs. The generative model may fail to make a decision in certain cases, such as when equal numbers of LFs disagree on a data point. Lastly, the features of the weakly-labeled data points and the labels from the generative model are used to supervise an end classifier (discriminative) model. The design of the data programming model is agnostic to the end classifier (discriminative) model, so various supervised machine learning models can be candidates as the end model.
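The labeling matrix and the majority voter described above can be sketched in a few lines. This is a minimal illustration, not Snorkel's implementation: the three toy LFs, the example comments, and the names `apply_lfs` and `majority_vote` are hypothetical, and a tie between the classes is treated as "no decision".

```python
from collections import Counter

ABSTAIN = -1  # value an LF returns when it refrains from predicting


def apply_lfs(lfs, data):
    """Build the labeling matrix: one row per data point, one column per LF."""
    return [[lf(x) for lf in lfs] for x in data]


def majority_vote(row):
    """Return the majority label of a row, or ABSTAIN if no LF voted or votes tie."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # equal numbers of LFs disagree
    return ranked[0][0]


# Hypothetical LFs for spam (1) vs. legitimate (0) comments.
def lf_contains_link(text):
    return 1 if "http" in text else ABSTAIN


def lf_says_subscribe(text):
    return 1 if "subscribe" in text.lower() else ABSTAIN


def lf_short_comment(text):
    return 0 if len(text.split()) < 4 else ABSTAIN


comments = [
    "Subscribe to my channel http://example.com",
    "great song",
    "free http://example.com",
    "this video brings back so many memories",
]
lfs = [lf_contains_link, lf_says_subscribe, lf_short_comment]

labeling_matrix = apply_lfs(lfs, comments)
labels = [majority_vote(row) for row in labeling_matrix]
```

In this toy run the third comment receives conflicting votes and the fourth receives none, so neither would reach the end classifier; these are the cases that label augmentation targets.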

2.2 Label augmentation and reinforced labels

In the described design of data programming, the two limitations mentioned above (in Sec. 1), namely the coarse information in the labeling matrix and sparsity, lead to a failure to generalize to new and unseen data points. Therefore, these limitations may lead to reduced performance. In such scenarios, the outputs of the existing generative model may not be satisfactory to train the end classifier, and the end classifier model may not converge or generalize well enough to cover different cases. Implementing many LFs that cover different cases is costly and not very straightforward in most scenarios. Although various additional techniques focus on the weak supervision problem [varma2018snuba; chatterjee2020robust; varma2019learning; nashaat2018hybridization; varma2016socratic], they rather extend the existing pipeline with additional features. On the other hand, label augmentation targets these major limitations by reducing the sparsity using data features. Thus, it can lead to higher accuracy.

Figure 2: Illustration of the reinforced labeling method (see Alg. 1 in Supp. material A.3).

Fig. 1 illustrates the new pipeline for label augmentation. The new generative process brings together the outputs of the LFs and the data features of data points in the unlabeled dataset early on. Different methods can utilize data features in the generative process for augmenting the labels in the labeling matrix. This label augmentation approach differs from the existing “data augmentation” approaches [wang2019implicit; tran2017bayesian; cubuk2019autoaugment] that create new (synthetic) data points, as the new goal is to project LF outcomes to the existing data points as opposed to creating new data points.

The outcome of the new generative process can be more representative of the data and weak supervisors than the outputs of the previously existing generative process due to additional coverage and accuracy gains without any additional LF implementation or data annotation. For instance, abstain values (-1s) from the LFs’ outputs representing the unknown cases (left side of the matrix in Fig. 1) can be predicted by the new generative process (as outlined in Sec. 2.3). The abstain values in the labeling matrix can be augmented with classification prediction values.

Data programming applies statistics only to data points that are already covered by LFs, resulting in a single predicted label for a subset of the dataset. Similarly, by using likelihood estimation over the augmented matrix, the generative process predicts “reinforced labels”. The details of the generative model implementation can be seen in [bach2017learning]. After the generative process, the reinforced labels are used to train the end classifier. The end classifier can still be the same supervised machine learning model. As the augmented matrix has more density compared to the labeling matrix, it may lead to a larger training set. As a simple example, a disagreement between two LFs for a data point can be eliminated by a new prediction for the abstain case of a third LF. As a result, more data points can be used to train the end model. The benefits of using the end classifier include further generalization for the data points that are still not labeled. Furthermore, the reinforced labels can fine-tune pretrained machine learning models.

2.3 Implementing reinforced labeling

Let us describe the RL method for label augmentation in the generative process. The intuition behind this method is that a data point that an LF does not label might have very similar features compared to another data point that is labeled by the same LF. The point might not be labeled due to the conditions or boundaries in the LF or because the LF's heuristic implementation misses a subset of the data features. Reinforcing the labels means predicting a label for the data points previously unlabeled by certain LFs and therefore creating a denser version of the labeling matrix during the generative process. Furthermore, RL could lead to a higher coverage for training the end classifier.

Fig. 2 illustrates the RL method. The method iterates over the LFs and, for each LF on the left part of the matrix, it finds the data points that the LF abstains from (-1s in the matrix). The abstain points are compared with the points that are labeled by the same LF, based on the distances between their data features. The similarity between a labeled point and an unlabeled point is represented by the “effect function”. As inputs, the effect function takes the data features of the labeled point, its output label, and the features of the unlabeled point. Based on the label, the effect may take a positive value or a negative value. For any other data point on which the LF also abstains, it takes the value zero. These values can be aggregated for any pair of an abstain point and an LF.


Figure 3: Illustration of the gravitation-based RL method


Figure 4: The IQR factor adjustment to dynamically select for gravitation.

We propose a heuristic algorithm to implement RL based on gravitation. Fig. 3 illustrates the heuristic method considering a simple case of having only two data features. In the gravitation-based RL method, for an abstain point that is not labeled by a given LF, each other point that the LF labels (shown as colored particles in Fig. 3) is considered as a particle that attracts the abstain point towards a positive or negative aggregated effect. Fig. 3 shows positive or negative attractions by different-colored lines between the data points. The attraction is inversely proportional to the pairwise distance of every labeled point to the abstain point. For optimizing the calculations, we consider a maximum distance threshold above which attraction does not take place. Hence, the effect function is nonzero only within this threshold and decays with the distance.

Two constants allow optional adjustment of the relationship between effect and distance. As a distance function, RL can use different metrics based on the data types (e.g., sensor readings, text, image), such as Euclidean, Levenshtein, or Mahalanobis distances. The effects of all labeled points are aggregated based on their positive or negative attractions. The decision of updating an abstain value depends upon a parameter called the aggregated effect threshold: the abstain point is labeled with a class for the given LF when the aggregated effect towards that class exceeds this threshold.

The aggregated effect threshold adjusts the degree of reinforcement and the density of the resulting augmented matrix. In the evaluation section, we consider various empirical parameters and a heuristic for automatic configuration.
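A minimal sketch of the gravitation heuristic follows. It assumes an effect of the form sign × alpha / distance^beta, with sign +1 for label 1 and -1 for label 0, and a symmetric decision rule around zero. The functional form, the names `alpha`, `beta`, `max_dist`, and `threshold`, and the helper functions are all assumptions for illustration, not the authors' exact definitions.

```python
import math

ABSTAIN = -1


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def effect(label, dist, alpha=1.0, beta=1.0, max_dist=math.inf):
    """Signed attraction of one labeled point on an abstain point (assumed form)."""
    if label == ABSTAIN or dist > max_dist:
        return 0.0  # abstaining or too-distant points exert no attraction
    sign = 1.0 if label == 1 else -1.0
    return sign * alpha / max(dist, 1e-12) ** beta


def reinforce(features, matrix, threshold, max_dist=math.inf, distance=euclidean):
    """Return a copy of the labeling matrix in which an abstain entry becomes
    1 if its aggregated effect is >= threshold, 0 if it is <= -threshold."""
    augmented = [row[:] for row in matrix]
    for j in range(len(matrix[0])):          # iterate over LFs (columns)
        for i, row in enumerate(matrix):     # iterate over data points (rows)
            if row[j] != ABSTAIN:
                continue
            # Effects are computed from the original LF outputs only.
            total = sum(
                effect(other[j], distance(features[i], features[k]), max_dist=max_dist)
                for k, other in enumerate(matrix)
                if k != i
            )
            if total >= threshold:
                augmented[i][j] = 1
            elif total <= -threshold:
                augmented[i][j] = 0
    return augmented


# Two data features and a single LF that labels only two of five points.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (2.5, 2.5)]
matrix = [[1], [ABSTAIN], [0], [ABSTAIN], [ABSTAIN]]
augmented = reinforce(features, matrix, threshold=2.0)
```

Here the second and fourth points sit next to a labeled point and inherit its label, while the fifth point is equidistant from both labeled points, so its aggregated effect is zero and it stays an abstain.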

3 Experimental evaluation

3.1 Benchmark framework

The experimental study includes four metrics: classification accuracy (# correct/# tested), precision, recall, and F1 score (F1). In particular, accuracy and F1 score are important for understanding the performance on the given classification task for non-biased and biased tasks, respectively. Other than these performance metrics, the experimental evaluation includes insights on the labeling itself (labeling metrics) such as: Number of LFs: The number of LFs used as weak supervisors; Labeled samples: The number of samples labeled for training the end models; LF coverage: The ratio of the number of data points labeled by LFs to the number of samples; LF overlap: The ratio of LFs’ overlapping outputs with each other to the number of samples; LF conflicts: The ratio of LFs’ conflicting outputs with each other to the number of samples. The last three metrics represent the mean values, averaged based on the number of LFs.
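The labeling metrics can be computed directly from a labeling matrix. The sketch below assumes per-LF definitions (an LF "overlaps" on a data point if it and at least one other LF both label it, and "conflicts" if another LF assigns a different label) that are then averaged over the LFs; the function name `lf_metrics` and these exact definitions are assumptions made for illustration.

```python
ABSTAIN = -1


def lf_metrics(matrix):
    """Compute labeled samples and mean LF coverage, overlap, and conflict."""
    n, m = len(matrix), len(matrix[0])
    coverage = [0] * m
    overlap = [0] * m
    conflict = [0] * m
    for row in matrix:
        for j, v in enumerate(row):
            if v == ABSTAIN:
                continue
            coverage[j] += 1
            others = [u for k, u in enumerate(row) if k != j and u != ABSTAIN]
            if others:
                overlap[j] += 1            # another LF also labels this point
            if any(u != v for u in others):
                conflict[j] += 1           # another LF labels it differently
    labeled = sum(1 for row in matrix if any(v != ABSTAIN for v in row))

    def mean_ratio(counts):
        # Per-LF ratio to the number of samples, averaged over the LFs.
        return sum(counts) / (n * m)

    return {
        "labeled_samples": labeled,
        "lf_coverage": mean_ratio(coverage),
        "lf_overlap": mean_ratio(overlap),
        "lf_conflict": mean_ratio(conflict),
    }


matrix = [[1, 1, -1], [-1, -1, 0], [1, -1, 0], [-1, -1, -1]]
metrics = lf_metrics(matrix)
```

For this 4 × 3 matrix, three samples carry at least one label, and the mean coverage, overlap, and conflict are 5/12, 4/12, and 2/12, respectively.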

We evaluate the RL approach with four datasets from different domains: YouTube comments [TubeSpam], red and white wine quality [cortez2009modeling], and Australia Weather (rain) datasets. The YouTube comments dataset consists of texts that may be legitimate user comments or spam. This text-based dataset is used for benchmarking various data programming approaches [chen2020train; evensen2020ruler; ren2020denoising; karamanolakis2021self; sedova2021knodle; awasthi2020learning] and also as Snorkel’s tutorial for LFs. The YouTube dataset is largely unlabeled except for a small testing dataset. Only the text comment is used as a feature for the YouTube dataset, whereas we removed the others (such as user ID and timestamp). The end models, in turn, get a sparse matrix of token counts computed with the Scikit-learn CountVectorizer. To classify the wine quality, there exist 12 real number features (e.g., acidity, residual sugar) for the two wine datasets. For training and testing, a wine is considered good (labeled with 1) when the wine quality feature is more than 5 out of 10; otherwise, it is considered a bad wine (0). The Australia Weather dataset is widely used for next-day rain predictions. It consists of 62 features based on daily weather observations spanning a ten-year period. The wine and Australia Weather datasets are fully labeled with ground truth. For all datasets, the Euclidean distance metric computes distances between data point pairs. For the YouTube dataset, Euclidean distance is applied to the one-hot encoding of the tokenized texts.
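Two of the preprocessing choices above can be illustrated briefly: the binary wine label and the Euclidean distance over a one-hot encoding of tokenized text. This sketch assumes whitespace tokenization with lowercasing, which may differ from the tokenizer used in the experiments; `wine_label` and `one_hot_tokens` are hypothetical helper names.

```python
import math


def wine_label(quality):
    """Binary target for the wine datasets: 1 (good) if quality > 5, else 0 (bad)."""
    return 1 if quality > 5 else 0


def one_hot_tokens(texts):
    """Encode each text as a 0/1 vector over the shared, sorted vocabulary."""
    tokenized = [set(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*tokenized))
    vectors = [[1 if tok in toks else 0 for tok in vocab] for toks in tokenized]
    return vocab, vectors


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


texts = ["nice song", "nice video", "subscribe now"]
vocab, vectors = one_hot_tokens(texts)
d_similar = euclidean(vectors[0], vectors[1])    # the texts share one token
d_different = euclidean(vectors[0], vectors[2])  # the texts share no token
```

With a binary encoding, the squared distance equals the number of tokens that appear in exactly one of the two texts, so comments with more shared vocabulary end up closer.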


Snorkel (Sn) and fully-supervised learning (Sup) are the two main approaches for comparison. In addition to those, the majority voting (MV) approach is tested as the simple generative model. For each data point, MV labels the point based on the majority of the LF outputs (i.e., 0 or 1) from the LFs that do not abstain. We also experiment with MV combined with RL (MVRL). For all results, the existing data programming approaches Sn and MV use the same set of LFs as RL. The LFs are listed in Supp. material A.5. We implement Sn and MV as well as the RL approach using the Snorkel library (version 0.9.5) with the existing features in the tutorial. The framework uses absolute latent labels (0, 1) for training the end model, and RL adopts the same scheme for the generative model and for training end classifiers. Supervised learning leverages ground truth data (30% of the dataset) for training. For the experimental scenario, supervised learning is considered the optimal result of machine learning, whereas the other two models are based on only programmatic labels. For supervised learning, the small testing dataset of YouTube comments is used for both training and testing because the training set lacks ground truth.

For the end classifier, different machine learning models are used for testing purposes. The models include two logistic regression models: the first one, namely “logit”, is the model that Snorkel uses by default (inverse of regularization strength and liblinear solver); the second one, “LogR”, uses the lbfgs solver for optimization. In addition, we test the random forest (RF), naive Bayes (NaivB), decision tree (DT), k-nearest neighbor (knn), support vector machine (svm), and multi-layer perceptron (mlp) end models. In each experiment, all approaches use the same end classifier. Each experiment consists of 5 runs, and the results are averaged. We do not observe any notable difference between experiment runs. Our assumption is that the learning behavior of the tested end models is rather deterministic, as they use the same data points for their training. There are no additional hyperparameters used other than those stated in this section.

3.2 Experimental results

Benefits of RL

| Dataset | RL Acc | RL Prec | RL Rec | RL F1 | F1-Gain | Sn Acc | Sn Prec | Sn Rec | Sn F1 | Sup Acc | Sup Prec | Sup Rec | Sup F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YouTube | 0.75 | 0.98 | 0.47 | 0.64 | +61 | 0.54 | 1.00 | 0.02 | 0.03 | 0.91 | 0.96 | 0.84 | 0.90 |
| Red Wine | 0.71 | 0.80 | 0.66 | 0.72 | +7 | 0.61 | 0.66 | 0.77 | 0.65 | 0.75 | 0.81 | 0.74 | 0.76 |
| White Wine | 0.63 | 0.64 | 0.98 | 0.78 | +34 | 0.50 | 0.82 | 0.32 | 0.44 | 0.54 | 0.71 | 0.48 | 0.57 |
| Weather | 0.59 | 0.29 | 0.78 | 0.42 | +34 | 0.54 | 0.06 | 0.10 | 0.08 | 0.90 | 0.86 | 0.59 | 0.70 |

Figure 5: RL (reinforced labeling), Sn (Snorkel), and Sup ((fully-)supervised learning) results: Accuracy, precision, recall, and F1. F1-Gain shows the F1 score advantage of RL compared to Snorkel.
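| Dataset | RL labels | RL LF cov. | RL LF ov. | RL LF con. | RL parameter (header missing) | RL parameter (header missing) | Sn labels | Sn LF cov. | Sn LF ov. | Sn LF con. | End model | LFs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YouTube | 1273 | 0.22 | 0.11 | 0.04 | 75 | 5.0 | 916 | 0.16 | 0.08 | 0.03 | svm | 5 |
| Red Wine | 375 | 0.12 | 0.02 | 0.02 | 125 | 0.5 | 247 | 0.08 | 0.02 | 0.01 | RF | 3 |
| White Wine | 3269 | 0.39 | 0.15 | 0.07 | 350 | 0.5 | 1995 | 0.21 | 0.04 | 0.01 | NaivB | 3 |
| Weather | 3415 | 0.58 | 0.56 | 0.49 | 200 | 5.0 | 2384 | 0.19 | 0.13 | 0.10 | RF | 6 |

Table 1: Labeling metrics for the approaches and dataset statistics.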

The first set of results shows the advantage of using RL compared to the existing data programming approach Sn in terms of accuracy and F1 score gains. Table 5 includes results from the four datasets in terms of the four performance metrics. In all datasets, we observe substantial gains in accuracy and F1 scores compared to the benchmark Sn approach. Moreover, RL performance is closer to fully-supervised learning, although it does not use any ground truth labels, even when LFs are relatively few. For the YouTube dataset, we observe that although the Snorkel and RL approaches have the same set of LFs, RL provides up to 64% F1, whereas Sn provides less than 3% F1. In RL, the end model can converge thanks to the additional coverage by the reinforced labels. Table 1 reports the number of computed labels after the application of the generative model with reinforcement (RL) and without it (Snorkel). The gains for accuracy and F1 require no additional human effort or involvement. Only two parameters, the aggregated effect threshold and the maximum distance threshold, configure RL (see Table 1 for the settings of the Table 5 experiments). Later in this section, we present how these parameters could be configured automatically for any given scenario based on labeling metrics (i.e., LF coverage, LF overlaps, and LF conflicts). Furthermore, our approach consistently outperforms Sn for training any of the tested end models. An extended table in Supp. material A.1 reports the experimental results varying end models for each dataset. Lastly, we observe that the aggregated effect threshold affects the performance, as it adjusts the trade-off between augmenting similar labels and the bias (noise) of the augmentation.

Auto-adjustment of the RL method

We observe that the aggregated effect threshold is dependent on the dataset and the set of LFs. A simple heuristic method to choose it is as follows: first, calculate the distribution of the aggregated effects of unlabeled data points for each LF (through a boxplot as in Fig. 4), then set the threshold so that its lower and upper boundaries are far enough from the first and third quartiles. When a single threshold value is used, the two boundaries are symmetric around an aggregated effect equal to 0, and they are chosen by the data scientist, as we did for the results shown in Table 5. Another way to calculate such boundaries is to have them symmetrical to the aggregated effect distribution. We can achieve this by placing each boundary beyond the corresponding quartile at a distance proportional to the interquartile range (IQR), where the proportionality constant is a parameter, namely the IQR factor. With a large enough IQR factor, the gravitation method labels only outliers of the aggregated effect distribution. At this point, there is no need to set the aggregated effect threshold manually, and the boundaries automatically adapt to the aggregated effects, which depend on the feature ranges, the distance metric, and the sparsity of the initial labeling matrix. However, varying the IQR factor (as tested over a range of values) affects the performance. Thus, the problem of finding an optimal IQR factor still persists.
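The IQR-based boundaries can be sketched as follows, assuming the usual boxplot convention: lower boundary = Q1 - f × IQR and upper boundary = Q3 + f × IQR, where f is the IQR factor. This specific formula, the quartile method, and the names `iqr_boundaries` and `decide` are assumptions made for illustration.

```python
import statistics

ABSTAIN = -1


def iqr_boundaries(aggregated_effects, iqr_factor):
    """Return (lower, upper) boundaries placed iqr_factor * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(aggregated_effects, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - iqr_factor * iqr, q3 + iqr_factor * iqr


def decide(total_effect, lower, upper):
    """Label an abstain entry only if its aggregated effect lies outside the boundaries."""
    if total_effect > upper:
        return 1
    if total_effect < lower:
        return 0
    return ABSTAIN


# Hypothetical aggregated effects of the abstain points of one LF.
effects = [-2.0, -1.0, 0.0, 1.0, 2.0]
wide = iqr_boundaries(effects, 1.5)    # conservative: few entries get labeled
narrow = iqr_boundaries(effects, 0.5)  # aggressive: more entries get labeled
```

A smaller IQR factor pulls the boundaries towards the quartiles, so the same aggregated effect of 3.0 stays an abstain under the wide boundaries but becomes a label under the narrow ones.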

Fig. 5(a)-top row shows the comparison with Snorkel generative model, while Fig. 5(a)-middle row shows the comparison with MV. As in Fig. 5(a)-top, cause a substantial F1 gain for fewer LFs (e.g., 5 to 8 LFs), whereas it may cause detrimental bias for the case of a higher number of LFs as the sum of LFs’ coverage increases. In this case, the reinforcement may not need to be applied as extensively as when the sum of LFs’ coverage is low. In the latter scenario, using would be the most conservative approach (no reinforcement) that makes sure to add no additional noise. Similar behavior is observed with majority voter as label aggregation algorithm (Fig. 5(a)-middle row).

Fig. 5(a)-bottom row shows the effect of RL with different IQR factors in terms of labeled samples, LFs’ coverage, mean overlaps, and mean conflicts. The smaller the IQR factor is, the higher the values for those metrics are, because the boundaries are closer to the IQR. One can infer that for a higher number of LFs, a smaller IQR factor results in excessive noise, confusing the generative model (Sn or MV) and degrading the end model. Thus, we define a simple heuristic to automatically configure the gravitation method (shown in Alg. 2 in Supp. material A.3) by calculating the IQR factor from the LF statistics for the given dataset. The formula uses these statistics and an empirical constant, which we set through empirical experiments. With this approach, the denser the labeling matrix is (e.g., more LFs, higher LF coverage), the fewer abstain labels RL updates.

We test the auto-adjustment approach and show the results in Fig. 5(b) for both Sn (top row) and MV (middle row). We have also set the second parameter, the maximum distance threshold, virtually to infinity, so that the method relies on no hyper-parameter. When fewer LFs exist (e.g., 5, 6, or 7), RL provides significant performance gains, especially for F1 scores. On the other hand, when more LFs exist, RL adjusts itself by becoming more conservative on the reinforcement. Therefore, auto-adjustment preserves the good performance when having sufficient LFs, leading to a larger area under the curve for both Sn and MV. Our proposed RL approach, together with this heuristic, does not require any additional parameter setting or human effort, since it automatically adjusts its IQR factor. However, further studies are needed for more enhanced methods of auto-adjustment.

Fig. 5(b) shows the effect of the automatic IQR adaptation in terms of labeled samples and LF metrics. As expected, when fewer LFs exist, RL provides gains for the number of labeled data points and mean LF coverage, overlaps, and conflicts. Furthermore, it adapts itself and gradually provides lower numbers of additional labels for higher numbers of LFs, thus reducing the noise.

(a) Top: YouTube comments classification results by increasing LFs (from 5 to 12) with different IQR factors.
Bottom: RL effects on the total number of labeled samples, and mean LF coverage, overlaps, and conflicts.
(b) Top: YouTube comments classification results by increasing LFs (from 5 to 12) with auto-adjusted IQR factor.
Bottom: effects of RL with auto-adjusted IQR factor on the total number of labeled samples and mean LFs coverage, overlaps, and conflicts.
Figure 6: (a) Effects of the IQR factor by the number of LFs. (b) RL with the auto-adjusted IQR factor.

Varying distance metrics

Our gravitation method relies heavily on similarities between data points. In the experiments described above, we always use the Euclidean distance for the real number features. Therefore, we investigate how different distance metrics affect the performance. Table 2 shows the results on the white wine dataset with RF as the end classifier and the previous heuristic to calculate the aggregated effect boundaries. RL consistently provides improved performance even when the distance metric changes.

| Distance metric | RL Acc | RL Prec | RL Rec | RL F1 | F1-Gain | Sn Acc | Sn Prec | Sn Rec | Sn F1 |
|---|---|---|---|---|---|---|---|---|---|
| Chebyshev | 0.63 | 0.65 | 0.94 | 0.77 | +20 | 0.53 | 0.71 | 0.48 | 0.57 |
| Cosine | 0.66 | 0.67 | 0.96 | 0.79 | +25 | 0.51 | 0.70 | 0.44 | 0.54 |
| Euclidean | 0.66 | 0.68 | 0.92 | 0.78 | +23 | 0.51 | 0.70 | 0.46 | 0.55 |
| Hamming | 0.62 | 0.66 | 0.88 | 0.75 | +16 | 0.52 | 0.68 | 0.52 | 0.59 |
| Jaccard | 0.60 | 0.67 | 0.79 | 0.71 | +14 | 0.51 | 0.69 | 0.49 | 0.57 |
| Mahalanobis | 0.64 | 0.66 | 0.94 | 0.78 | +24 | 0.51 | 0.70 | 0.44 | 0.54 |
| Minkowski | 0.57 | 0.68 | 0.67 | 0.66 | +9 | 0.52 | 0.69 | 0.49 | 0.57 |

Table 2: Testing RL (reinforced LFs) against Sn (Snorkel) with various distance metrics for WhiteWine, with RF as end model.


As label augmentation relies on the data features, using a poor embedding for distance computation might lead to detrimental noise in the augmented labeling matrix. In addition, a more advanced augmentation adjustment would avoid performance degradation due to the tradeoff between the coverage and noise.

4 Related Work

Data programming [ratner2016data] enables programmatically labeling data and training using WS sources. In particular, the Snorkel framework [ratner2017snorkel; bach2019snorkel] provides an interface for data programming where users can write LFs and apply them to generate a training dataset for their end models. The generation of the training dataset relies on the generative model, and several studies focus on this aspect [ratner2017snorkel; bach2017learning; varma2017inferring; varma2019learning].

Various recent works focus on extending the existing data programming approach. The extensions include multi-task classification [ratner2019training], using small labeled gold data for augmenting WS [varma2018snuba], learning tasks (or sub-tasks) in slices of dataset [chen2019slice], user guidance for LF precision [chatterjee2020robust], making the training process faster [fu2020fast], generative adversarial data programming [pal2020generative; pal2018adversarial], user supervision for LF error estimates [arachie2019adversarial], learning LF dependency structures [varma2019learning], user annotation of LFs [boecking2020interactive], language description of LFs [hancock2018training], active learning [nashaat2018hybridization], and so on. varma2016socratic aims to learn common features by using a “difference model” and feeding these features back to the generative model. Mallinar2019 takes advantage of the natural language processing query engine to expand gold labels and generate a label matrix as input for the generative model.

Zhou2020Nero adopts a soft LFs matcher approach based on the distances between LFs’ conditions and data points. chen2020train uses pre-trained machine learning models to estimate distances for natural language processing. The last two studies focus on the semantic similarity of texts to improve the labeling.

Although the studies mentioned above may improve the existing generative model (e.g., through additional human supervision), they do not focus on the problem of LF abstraction with coarse information. Solving this problem would improve the validity of data programming in various scenarios especially since the human supervision is limited in its nature. In this paper, we identify this problem and propose the reinforced labeling that takes the data features into account early on in the generative process. Using this approach, one can leverage the data features and augment the matrix for further generalization and producing satisfactory performance without additional human supervision.

Weak supervision approaches outside the context of data programming consider learning from a set of rules, experts, or workers as in crowdsourcing. platanios2017estimating infer accuracy from multiple classifier outputs based on the confidences. safranchik2020weakly study the usage of Hidden Markov Models for tagging data sequences.

dehghani2018fidelity train deep NNs using weakly-labeled data. Their approach is semi-supervised, where a teacher network based on rules adjusts the predictions of a student network trained iteratively by given data samples. In another study, dehghani2017neural propose WS to train neural ranking models in natural language processing and computer vision tasks.

takeoka2020learning consider leveraging unsure responses of human annotators as WS sources to improve the traditional supervised approach. kulkarni2018interactive study labeling based on consensus and interactive learning based on active labeling for multi-label image classification tasks. khetan2018learning propose an expectation-maximization algorithm for learning workers’ quality, where each worker represents a WS source for image classification tasks.

qian2020learning propose WS with active learning for learning structured representations of entity names. guan2018said learn individual weights for combining noisy labelers (experts) in a weighted sum. Das2020Goggles propose a domain-agnostic approach that replaces the need for LFs by applying affinity functions that relate samples to each other. This approach uses a small gold dataset with probabilistic distributions to infer probabilistic labels.

Similar approaches other than WS include domain-specific machine learning applications such as ontology matching. Doan2004 use a relaxation method to label a node in a graph dataset by analyzing features of the node’s neighborhood in the graph. The relaxation process is based on constraints and knowledge that lead to the final labeling. In a similar approach, li-srikumar-2019-augmenting describe a methodology framework to augment labels guided by external knowledge. In both approaches, label augmentation is in the final phase. These approaches do not involve any generative process. Lastly, literature related to semi-supervised learning (e.g., [zhu2003semi]) or other hybrid approaches (e.g., [awasthi2020learning]) considers using a mix of clean and noisy labels, whereas this paper focuses on using only the labels from LFs and improving the validity of the existing approach.

5 Conclusion

This paper proposes a novel method for label augmentation in weak supervision. In the new machine learning pipeline, the proposed RL method in the generative process leverages existing LF outputs and data features to augment the weakly-supervised labels. The experimental evaluation shows the benefits of RL for four classification tasks compared to the existing data programming approach in terms of substantial accuracy and F1 gains. Furthermore, the new method enables the convergence of the end classifier even when there exist few LFs. We consider applying RL for matching problems (e.g., entity matching) and active learning as future work. We consider RL as an initial approach for the identified limitation of the generative process, whereas the pipeline opens up the possibility for more advanced (e.g., machine learning) models to leverage data features during the generative process.


Appendix A Appendix

a.1 Reinforced labeling results on different datasets varying end models

Table 3 extends Table 5 by reporting the results of 8 end models and 4 datasets. The hyperparameter configurations of and are listed in Table 1. In terms of F1 score, RL matches or outperforms Snorkel for every dataset and end model, and in most cases it stays closer to the performance of fully-supervised machine learning.

| Dataset | RL Acc | RL Prec | RL Rec | RL F1 | F1-Gain | Snorkel Acc | Snorkel Prec | Snorkel Rec | Snorkel F1 | Supervised Acc | Supervised Prec | Supervised Rec | Supervised F1 | End model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YouTube | 0.75 | 0.98 | 0.47 | 0.64 | +61 | 0.54 | 1.00 | 0.02 | 0.03 | 0.91 | 0.96 | 0.84 | 0.90 | svm |
| YouTube | 0.72 | 1.00 | 0.40 | 0.57 | +55 | 0.53 | 1.00 | 0.01 | 0.02 | 0.81 | 1.00 | 0.61 | 0.76 | LogR |
| YouTube | 0.68 | 0.98 | 0.34 | 0.50 | +50 | 0.53 | 0.00 | 0.00 | 0.00 | 0.92 | 0.92 | 0.90 | 0.91 | DT |
| YouTube | 0.72 | 1.00 | 0.41 | 0.58 | +56 | 0.53 | 1.00 | 0.01 | 0.02 | 0.91 | 1.00 | 0.82 | 0.90 | logit |
| YouTube | 0.67 | 0.93 | 0.33 | 0.49 | +44 | 0.53 | 0.50 | 0.03 | 0.05 | 0.76 | 1.00 | 0.53 | 0.69 | knn |
| YouTube | 0.69 | 1.00 | 0.34 | 0.51 | +51 | 0.53 | 0.00 | 0.00 | 0.00 | 0.86 | 1.00 | 0.70 | 0.82 | RF |
| Red Wine | 0.71 | 0.82 | 0.62 | 0.70 | +1 | 0.70 | 0.80 | 0.64 | 0.69 | 0.74 | 0.82 | 0.70 | 0.74 | svm |
| Red Wine | 0.68 | 0.85 | 0.53 | 0.64 | +3 | 0.68 | 0.85 | 0.55 | 0.61 | 0.73 | 0.80 | 0.71 | 0.74 | LogR |
| Red Wine | 0.69 | 0.78 | 0.63 | 0.69 | +4 | 0.60 | 0.65 | 0.78 | 0.65 | 0.65 | 0.70 | 0.65 | 0.67 | DT |
| Red Wine | 0.71 | 0.79 | 0.68 | 0.72 | +2 | 0.68 | 0.73 | 0.74 | 0.70 | 0.73 | 0.82 | 0.69 | 0.73 | logit |
| Red Wine | 0.71 | 0.82 | 0.61 | 0.70 | +5 | 0.68 | 0.80 | 0.59 | 0.65 | 0.75 | 0.80 | 0.75 | 0.77 | knn |
| Red Wine | 0.69 | 0.69 | 0.81 | 0.74 | 0 | 0.66 | 0.63 | 0.89 | 0.74 | 0.71 | 0.77 | 0.67 | 0.71 | NaivB |
| Red Wine | 0.71 | 0.80 | 0.66 | 0.72 | +7 | 0.61 | 0.66 | 0.77 | 0.65 | 0.75 | 0.81 | 0.74 | 0.76 | RF |
| Red Wine | 0.70 | 0.83 | 0.60 | 0.69 | +1 | 0.70 | 0.83 | 0.61 | 0.68 | 0.74 | 0.82 | 0.71 | 0.74 | mlp |
| White Wine | 0.65 | 0.65 | 1.00 | 0.79 | +23 | 0.51 | 0.67 | 0.48 | 0.56 | 0.72 | 0.73 | 0.91 | 0.81 | svm |
| White Wine | 0.65 | 0.65 | 1.00 | 0.79 | +22 | 0.52 | 0.68 | 0.51 | 0.57 | 0.72 | 0.74 | 0.90 | 0.81 | LogR |
| White Wine | 0.63 | 0.64 | 0.97 | 0.77 | +26 | 0.50 | 0.69 | 0.41 | 0.51 | 0.56 | 0.70 | 0.57 | 0.63 | DT |
| White Wine | 0.65 | 0.66 | 0.99 | 0.79 | +23 | 0.50 | 0.67 | 0.49 | 0.56 | 0.67 | 0.86 | 0.62 | 0.69 | logit |
| White Wine | 0.64 | 0.65 | 0.99 | 0.78 | +8 | 0.59 | 0.67 | 0.73 | 0.70 | 0.65 | 0.75 | 0.70 | 0.72 | knn |
| White Wine | 0.63 | 0.64 | 0.98 | 0.78 | +34 | 0.50 | 0.82 | 0.32 | 0.44 | 0.54 | 0.71 | 0.48 | 0.57 | NaivB |
| White Wine | 0.64 | 0.65 | 0.99 | 0.78 | +27 | 0.50 | 0.69 | 0.41 | 0.51 | 0.71 | 0.81 | 0.73 | 0.76 | RF |
| White Wine | 0.66 | 0.66 | 1.00 | 0.79 | +18 | 0.54 | 0.68 | 0.56 | 0.61 | 0.71 | 0.82 | 0.71 | 0.75 | mlp |
| Australia Rain | 0.58 | 0.28 | 0.77 | 0.41 | +31 | 0.46 | 0.07 | 0.15 | 0.10 | 0.86 | 0.75 | 0.40 | 0.52 | svm |
| Australia Rain | 0.58 | 0.28 | 0.76 | 0.41 | +32 | 0.49 | 0.07 | 0.13 | 0.09 | 0.86 | 0.73 | 0.40 | 0.52 | LogR |
| Australia Rain | 0.55 | 0.28 | 0.83 | 0.42 | +30 | 0.37 | 0.08 | 0.21 | 0.12 | 0.85 | 0.61 | 0.69 | 0.65 | DT |
| Australia Rain | 0.59 | 0.29 | 0.76 | 0.42 | +32 | 0.44 | 0.07 | 0.16 | 0.10 | 0.87 | 0.74 | 0.53 | 0.61 | logit |
| Australia Rain | 0.49 | 0.25 | 0.81 | 0.38 | +29 | 0.53 | 0.07 | 0.12 | 0.09 | 0.82 | 0.54 | 0.55 | 0.55 | knn |
| Australia Rain | 0.54 | 0.26 | 0.75 | 0.39 | +27 | 0.50 | 0.09 | 0.18 | 0.12 | 0.69 | 0.34 | 0.65 | 0.45 | NaivB |
| Australia Rain | 0.59 | 0.29 | 0.78 | 0.42 | +34 | 0.54 | 0.06 | 0.10 | 0.08 | 0.90 | 0.86 | 0.59 | 0.70 | RF |
| Australia Rain | 0.57 | 0.27 | 0.82 | 0.41 | +33 | 0.48 | 0.06 | 0.12 | 0.08 | 0.87 | 0.66 | 0.55 | 0.60 | mlp |
Table 3: RL, Snorkel, and (fully-)supervised model results: accuracy, precision, recall, and F1 scores. F1-Gain shows the F1 score advantage of RL over Snorkel, in points.
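As a worked check of how the table columns relate, the short sketch below recomputes one F1 score from precision and recall and derives the corresponding F1-Gain. It assumes that F1 is the usual harmonic mean and that F1-Gain is the difference of the rounded F1 scores expressed in points; the function names are illustrative and not taken from the paper. Because the table values are rounded to two decimals, recomputed scores for other rows may differ by 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_gain_points(f1_rl, f1_snorkel):
    """Difference between two F1 scores, expressed in points (hundredths)."""
    return round((f1_rl - f1_snorkel) * 100)


# YouTube / svm row of Table 3: RL precision 0.98, RL recall 0.47,
# Snorkel F1 0.03.
rl_f1 = round(f1_score(0.98, 0.47), 2)
print(rl_f1, f1_gain_points(rl_f1, 0.03))
```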

a.2 Symbols and Notations in the paper

Table 4 lists and describes the frequently used symbols throughout the paper. Some listed parameters are not normalized (e.g., aggregated effects) or adjusted for simplicity, whereas they can be easily normalized to the range , and so based on the observed values, without causing any change in the outcomes.

Symbol Description
Data point composed of features
Data feature
Dataset coverage
Number of data points
Labeling function with index
Output of on data point . Possible outcomes are classes 0 or 1, or abstain -1
Output of labeling function applied on the whole dataset
Effect function of data points and , and output of on data point
Aggregated effect on point for
Distance (e.g., Euclidean) between two data points
Cut off distance for an effect to be considered
= 1, = 1, = 0.35 Constants
Thresholds on the aggregated effect to augment a label as negative or positive
= = Symmetric aggregated effect threshold
, , Quartiles. 25th, 50th and 75th percentile respectively.
InterQuartile Range
IQR factor. is used to calculate the outliers range
, ,
Statistics of the also in relations with all the other
Table 4: Frequently used symbols
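The quartile, interquartile range, and IQR factor entries can be illustrated with the conventional outlier-range computation. The sketch below is a generic version under stated assumptions: the function name `iqr_bounds` and the default factor of 1.5 are hypothetical, and the way the paper combines these bounds with the LF statistics is not reproduced here.

```python
import statistics


def iqr_bounds(values, iqr_factor=1.5):
    """Return (lower, upper) outlier bounds from the quartiles of `values`.

    The bounds are Q1 - factor * IQR and Q3 + factor * IQR, where
    IQR = Q3 - Q1. The default factor of 1.5 is an assumption.
    """
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - iqr_factor * iqr, q3 + iqr_factor * iqr


# Example: aggregated effects observed for one LF (made-up numbers).
lower, upper = iqr_bounds([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(lower, upper)
```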

a.3 Reinforced labeling pseudocode for label augmentation

Input: LFs and unlabeled data points , where has features . Gravity parameters . Distance threshold . IQR adjusting parameter
Output: Label for a subset of the data points (augmented labels) , where
1 for  do
2       for  do
4       end for
6 end for
7 for  do
8       for  do
9             if  then
10                   for  do
11                         if   then
12                               if  is undefined then
14                               if  then
15                                     if  then
17                                     else
19                                     end if
23                   end for
25       end for
27 end for
28 for  do
29       for  do
30             if  then
31                   if  then
33                   if  then
37       end for
39 end for
Algorithm 1 Reinforced labeling algorithm
Input: where and similarly and ; : Array of aggregated effects,
1 for  do
3 end for
Algorithm 2

Alg. 1 shows the pseudocode of a similarity-based heuristic algorithm for reinforced labeling. Given LFs and the unlabeled dataset of size , the algorithm outputs the augmented labeling matrix. The listed gravity parameters are constants used for all datasets in the experimental study. The distance threshold is an optional parameter to optimize the computation or to remove outliers, whereas it is not always used in the experimental study ( is set to a high number). are 2D arrays, where rows represent the index of the data points () and columns represent the index of the LFs (). is a 2D array representing the distance between any two data points. For instance, represents the distance between and .

The values that represent abstains of LFs are updated based on their similarity given by the function. This function can be implemented using various distance metrics such as Euclidean, Cosine, or Mahalanobis distances. Lastly, the updated array represents the augmented labeling matrix that can be given to a generative process such as Snorkel’s generative model.
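The idea described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Algorithm 1: it assumes an inverse-square "gravity" effect between points, Euclidean distance, and a single symmetric threshold, and the names `augment_labels`, `threshold`, and `max_dist` are hypothetical. The constants and the exact effect function used in the paper are not reproduced here.

```python
import math

ABSTAIN = -1


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def augment_labels(X, L, threshold, max_dist=float("inf"), dist=euclidean):
    """Fill in abstains of each LF from nearby points that the LF labeled.

    X: list of feature vectors.
    L: label matrix (rows = data points, columns = LFs) with entries
       0, 1, or ABSTAIN.
    Returns a new matrix; L itself is not modified.
    """
    n = len(X)
    # Pairwise distances between all data points.
    D = [[dist(X[i], X[k]) for k in range(n)] for i in range(n)]
    augmented = [row[:] for row in L]
    for j in range(len(L[0])):
        for i in range(n):
            if L[i][j] != ABSTAIN:
                continue
            effect = 0.0
            for k in range(n):
                if k == i or L[k][j] == ABSTAIN or D[i][k] > max_dist:
                    continue
                # Assumed effect: inverse-square pull, positive for class 1
                # and negative for class 0. The small constant avoids a
                # division by zero for duplicate points.
                pull = 1.0 / (D[i][k] ** 2 + 1e-9)
                effect += pull if L[k][j] == 1 else -pull
            # Augment only when the aggregated effect is strong enough.
            if effect >= threshold:
                augmented[i][j] = 1
            elif effect <= -threshold:
                augmented[i][j] = 0
    return augmented


X = [[0.0], [0.1], [1.0], [0.9], [5.0]]
L = [[1], [ABSTAIN], [0], [ABSTAIN], [ABSTAIN]]
print(augment_labels(X, L, threshold=10.0))
```

In this toy example the two abstaining points that sit close to a labeled point inherit its label, while the distant point at 5.0 stays an abstain because its aggregated effect is too weak. The returned matrix could then be passed to a generative model in place of the original labeling matrix.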

Alg. 2 shows the pseudocode of the used for the RL implementation. The heuristic in Alg. 2 leverages three LF statistics, that are LF coverage, overlaps, and conflicts, to calculate the boundaries that are effectively the aggregated effect threshold of RL.

a.4 Additional experimental insights

In addition to the experimental study, we implement and test a hybrid approach of leveraging labels of LFs as well as strong supervision through a gold dataset. We call this approach the generative neural network for data programming (GNN). In GNN, the label outputs of LFs and data features are fed to a simple neural network (NN) along with the labels. The NN model contains two hidden layers (# nodes 12, 8) with ReLU activation function, and it uses Adam optimizer. The output layer has the sigmoid activation function. The NN model is used in different stages of the pipeline. First, GNN replaces the generative part using labels of Snorkel (Sn+GNN+) or RL (RL+GNN+). Then, the outputs of GNN are fed to an end classifier as usual. Second, the GNN itself serves as the end classifier as well as the generative model (Sn+GNN or RL+GNN). We applied the GNN model to both red wine and white wine datasets as these datasets have the available ground truth data to create gold datasets.
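The network shape described above can be sketched in plain Python. This is only an illustrative forward pass under stated assumptions: the weights are random rather than trained, training with the Adam optimizer is omitted, and both the helper names (`build_network`, `forward`) and the choice to concatenate LF outputs with the data features into one input vector are hypothetical.

```python
import math
import random


def relu(vector):
    return [max(0.0, v) for v in vector]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def build_network(n_inputs, rng):
    """Random weights for layers of size 12, 8, and 1 (untrained)."""
    sizes = [n_inputs, 12, 8, 1]
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(network, lf_outputs, features):
    """Probability of the positive class for one data point."""
    x = list(lf_outputs) + list(features)  # assumed input layout
    for weights, biases in network[:-1]:
        x = relu(dense(x, weights, biases))  # hidden layers use ReLU
    weights, biases = network[-1]
    return sigmoid(dense(x, weights, biases)[0])  # sigmoid output layer


rng = random.Random(0)
network = build_network(n_inputs=5, rng=rng)
print(forward(network, lf_outputs=[1, -1, 0], features=[0.4, 0.7]))
```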

Figure 7: Experimental results of the red wine and white wine datasets for different approaches: Sn, RL, Sup., Sn+GNN, RL+GNN, Sn+GNN+<end_model> and RL+GNN+<end_model>.

Fig. 7 shows results of the red and white wine datasets in more detail, including the hybrid GNN approach using RF and NaivB end models, respectively. The results of the following 7 approaches are listed in order: Sn+<endmodel>, RL+<endmodel>, Sup+<endmodel>, Sn+GNN+<endmodel>, RL+GNN+<endmodel>, Sn+GNN, RL+GNN. We observe that RL+NaivB outperforms the Sn+NaivB benchmark for the white wine dataset by +13 points accuracy and +34 points F1 and even provides better results than Sup (NaivB). For the red wine dataset, RL+RF outperforms Sn+RF by +10 points in accuracy and +7 points in F1 score. Moreover, although approaches such as Sup or GNN leverage ground truth labels in their training, outcomes of RL are competitive for the red wine dataset, and RL outperforms Sup (NaivB), Sn+GNN+NaivB, and RL+GNN+NaivB for the white wine dataset (see Fig. 7-right).

Figure 8: Experimental results for the YouTube comments and Australia rain datasets testing different approaches: Snorkel (Sn), Reinforced Labeling (RL), Supervised learning (Sup.).

Fig. 8 shows the bar graph for the results of the YouTube comments and Australia Rain datasets (also listed in Table 5) in terms of the four metrics: Accuracy, precision, recall, and F1. The results of the following approaches are listed in order: Sn+<endmodel>, RL+<endmodel>, Sup+<endmodel>. We observe that RL+svm outperforms the Sn+svm benchmark for the YouTube dataset by +21 points accuracy and +61 points F1. For the Australia Rain dataset, RL+RF outperforms Sn+RF by +5 points in accuracy and +34 points in F1 score.

a.5 Labeling functions

We use the Snorkel library to implement LFs and encode domain knowledge programmatically. The LFs are implemented using the interactive and user-friendly features of the Snorkel framework, such as providing LF statistics and allowing high-level definitions of LFs.

As in almost all the previous data programming studies, we do not follow a certain scheme or strict guidance on implementing LFs but rather rely on the best effort based on some understanding of the datasets by visualizing the features and interactively checking the LF coverages, overlaps, and conflicts. In general, the assumption is that the developers would make the best effort to write LFs.
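The LF coverages, overlaps, and conflicts mentioned above can be computed directly from a labeling matrix. The sketch below uses the common definitions (coverage: fraction of points an LF labels; overlaps: fraction it labels together with at least one other LF; conflicts: fraction where another LF assigns a different label). It is a plain-Python illustration with a hypothetical function name, not the Snorkel implementation.

```python
ABSTAIN = -1


def lf_statistics(L):
    """Per-LF coverage, overlaps, and conflicts for label matrix L.

    L: rows are data points, columns are LFs, entries are 0, 1, or ABSTAIN.
    """
    n_points = len(L)
    stats = []
    for j in range(len(L[0])):
        covered = overlapped = conflicted = 0
        for row in L:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Labels given to the same point by the other LFs.
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            if others:
                overlapped += 1
            if any(v != row[j] for v in others):
                conflicted += 1
        stats.append({
            "coverage": covered / n_points,
            "overlaps": overlapped / n_points,
            "conflicts": conflicted / n_points,
        })
    return stats


L = [[1, ABSTAIN], [1, 1], [0, 1], [ABSTAIN, ABSTAIN]]
for j, s in enumerate(lf_statistics(L)):
    print(j, s)
```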

The Australia Weather dataset is used to build a model that predicts, for any given day, whether it will rain the day after. The LFs in Listing 1 use weather data features to label data points as RAIN (1) or NO_RAIN (0) based on features such as humidity at 3pm, rain today, temperature at 9am, rainfall, humidity at 9am, and pressure at 3pm.

from snorkel.labeling import labeling_function

# Label values; -1 means the LF abstains.
ABSTAIN = -1
NO_RAIN = 0
RAIN = 1

@labeling_function()
def check_humidity3pm(x):
    if x.Humidity3pm is None:
        return ABSTAIN
    elif x.Humidity3pm > 0.75:
        return RAIN
    elif x.Humidity3pm < 0.15:
        return NO_RAIN
    else:
        return ABSTAIN

@labeling_function()
def check_rain_today(x):
    if x.RainToday is None:
        return ABSTAIN
    elif x.RainToday == 1:
        return RAIN
    else:
        return ABSTAIN

@labeling_function()
def check_temp9am(x):
    if x.Temp9am is None:
        return ABSTAIN
    elif x.Temp9am > 0.60:
        return RAIN
    else:
        return ABSTAIN

@labeling_function()
def check_rainfall(x):
    if x.Rainfall is None:
        return ABSTAIN
    elif x.Rainfall > 0.60:
        return RAIN
    else:
        return ABSTAIN

@labeling_function()
def check_humidity9am(x):
    if x.Humidity9am is None:
        return ABSTAIN
    elif x.Humidity9am > 0.90:
        return NO_RAIN
    elif x.Humidity9am < 0.20:
        return RAIN
    else:
        return ABSTAIN

@labeling_function()
def check_pressure3pm(x):
    if x.Pressure3pm is None:
        return ABSTAIN
    elif x.Pressure3pm < 0.05:
        return RAIN
    elif x.Pressure3pm > 0.70:
        return NO_RAIN
    else:
        return ABSTAIN
Listing 1: Australia rain labeling functions
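The thresholds in Listing 1 all lie between 0 and 1, which suggests that the features were rescaled before the LFs were applied. The text does not state how, so the following min-max scaling is only an assumption, shown to make the thresholds easier to interpret; `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(values):
    """Rescale numeric values to [0, 1], leaving missing values as None.

    Assumes at least one value is present. If all present values are equal,
    they are mapped to 0.0.
    """
    present = [v for v in values if v is not None]
    low, high = min(present), max(present)
    span = high - low
    scaled = []
    for v in values:
        if v is None:
            scaled.append(None)  # LFs abstain on missing values
        else:
            scaled.append((v - low) / span if span else 0.0)
    return scaled


# Raw humidity readings in percent (made-up numbers); after scaling,
# a check such as "> 0.75" refers to the upper quarter of the observed range.
print(min_max_scale([10.0, None, 55.0, 100.0]))
```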

The wine datasets have various data features such as alcohol, sulphates, and citric acid levels. Listings 2 and 3 include the LFs implemented for wine quality classification for the red wine and white wine datasets, respectively. The LFs label data points as GOOD (1) or BAD (0) quality wine.

from snorkel.labeling import labeling_function

# Label values; -1 means the LF abstains.
ABSTAIN = -1
BAD = 0
GOOD = 1

@labeling_function()
def check_alcohol(x):
    if x.alcohol is None:
        return ABSTAIN
    elif x.alcohol > 0.75:
        return GOOD
    elif x.alcohol < 0.15:
        return BAD
    else:
        return ABSTAIN

@labeling_function()
def check_sulphate(x):
    if x.sulphates is None:
        return ABSTAIN
    elif x.sulphates > 0.3:
        return GOOD
    else:
        return ABSTAIN

@labeling_function()
def check_citric(x):
    if x.acidity_citric is None:
        return ABSTAIN
    elif x.acidity_citric > 0.7:
        return GOOD
    else:
        return ABSTAIN
Listing 2: Red wine labeling functions
from snorkel.labeling import labeling_function

# Label values; -1 means the LF abstains.
ABSTAIN = -1
BAD = 0
GOOD = 1

@labeling_function()
def check_alcohol(x):
    if x.alcohol is None:
        return ABSTAIN
    elif x.alcohol > 0.75:
        return GOOD
    elif x.alcohol < 0.15:
        return BAD
    else:
        return ABSTAIN

@labeling_function()
def check_sulphate(x):
    if x.sulphates is None:
        return ABSTAIN
    elif x.sulphates > 0.3:
        return GOOD
    else:
        return ABSTAIN

@labeling_function()
def check_citric(x):
    if x.acidity_citric is None:
        return ABSTAIN
    elif x.acidity_citric > 0.7:
        return GOOD
    else:
        return ABSTAIN
Listing 3: White wine labeling functions

For the YouTube dataset, we implemented two LFs that search for the exact string "check out" or "check" in the text. Other than these two additional LFs, the LFs in the Snorkel tutorial [Snorkel] named “textblob_subjectivity”, “keyword_subscribe”, “has_person_nlp” are used in the experiments of Table 5. The rest of LFs available from the Snorkel tutorial are used for the experiments in Fig. 6.

from snorkel.labeling import labeling_function

ABSTAIN = -1
HAM = 0
SPAM = 1

@labeling_function()
def check(x):
    return SPAM if "check" in x.text.lower() else ABSTAIN

@labeling_function()
def check_out(x):
    return SPAM if "check out" in x.text.lower() else ABSTAIN
Listing 4: YouTube labeling functions
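To illustrate how the two keyword LFs behave on individual comments, the sketch below re-implements their logic as plain functions (without the Snorkel decorator) and applies them to a few made-up comments. The example comments and the use of `SimpleNamespace` as a stand-in for a data point are assumptions for illustration only.

```python
from types import SimpleNamespace

ABSTAIN, HAM, SPAM = -1, 0, 1


def check(x):
    return SPAM if "check" in x.text.lower() else ABSTAIN


def check_out(x):
    return SPAM if "check out" in x.text.lower() else ABSTAIN


comments = [
    SimpleNamespace(text="Check out my channel!"),
    SimpleNamespace(text="Please check the lyrics"),
    SimpleNamespace(text="Great song"),
]

# One (check, check_out) pair of votes per comment.
votes = [(check(c), check_out(c)) for c in comments]
print(votes)  # [(1, 1), (1, -1), (-1, -1)]
```

Since every text containing "check out" also contains "check", the second LF can only fire where the first one does: the two LFs overlap but never conflict, and `check` has the larger coverage. Note that the substring test also matches words such as "paycheck".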