Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming

A critical bottleneck in supervised machine learning is the need for large amounts of labeled data, which is expensive and time-consuming to obtain. However, it has been shown that a small amount of labeled data, while insufficient to re-train a model, can be effectively used to generate human-interpretable labeling functions (LFs). These LFs, in turn, have been used to generate a large amount of additional noisy labeled data, in a paradigm that is now commonly referred to as data programming. However, previous approaches to automatically generate LFs make no attempt to further use the given labeled data for model training, thus giving up opportunities for improved performance. Moreover, since the LFs are generated from a relatively small labeled dataset, they are prone to being noisy, and naively aggregating these LFs can lead to very poor performance in practice. In this work, we propose an LF based reweighting framework to solve these two critical limitations. Our algorithm learns a joint model on the (same) labeled dataset used for LF induction along with any unlabeled data in a semi-supervised manner, and more critically, reweighs each LF according to its goodness, influencing its contribution to the semi-supervised loss using a robust bi-level optimization algorithm. We show that our algorithm significantly outperforms prior approaches on several text classification datasets.




1 Introduction

Supervised machine learning approaches require large amounts of labeled data to train robust machine learning models Sharir et al. (2021). Modern machine learning systems rely excessively on human-annotated gold labels for text classification tasks such as spam detection, (movie) genre classification, sequence labeling, etc. Creating labeled data is a time-intensive process, requiring sizeable human effort and cost. Combined with the heavy dependence of model training on large amounts of labeled data, this serves as a deterrent to achieving comparable performance on new tasks. Therefore, various techniques such as semi-supervision, distant supervision, and crowdsourcing have been proposed to reduce dependency on human-annotated labels.

In particular, several recent data programming approaches Bach et al. (2019); Maheshwari et al. (2021); Chatterjee et al. (2020); Awasthi et al. (2020) have proposed the use of human-crafted labeling functions to weakly associate labels with the training data. Users encode supervision as rules/guides/heuristics in the form of labeling functions (LFs) that assign noisy labels to the unlabeled data, thus reducing dependence on human labeled data. We refer to such approaches as label aggregators since they aggregate noisy labels to assign a label to the data instance, often using generative models. Examples of label aggregators are Snorkel Ratner et al. (2016) and Cage Chatterjee et al. (2020). These models provide consensus on the noisy and conflicting labels assigned by the discrete LFs to help determine the correct labels probabilistically. The obtained labels could be used to train any supervised model/classifier and evaluate on a test set.

In the cascaded approach described above, a learned label aggregator first assigns noisy labels to the unlabeled data and then learns a supervised classifier/model using these noisy labels. In contrast, a number of recent approaches have studied semi-supervised data programming Awasthi et al. (2020); Maheshwari et al. (2021). The critical observation here is that if we have a minimal set of labeled examples, we can jointly learn the label aggregators and the model parameters in a semi-supervised manner. Such approaches have been shown to outperform the completely unsupervised data programming described above.

Data programming (unsupervised or semi-supervised) requires carefully curated LFs, in the form of either regular expressions or conditional statements, to assign multiple labels to the data. Even though developing LFs could potentially require less time than creating a large amount of supervised data, the task still needs domain experts to spend a significant amount of time determining the right patterns to create LFs. In this paper, we circumvent the requirement of human-curated LFs by instead automatically generating human-interpretable LFs as compositions of simple propositions on the data set. We leverage Snuba Varma and Ré (2018), which uses a small labeled-set to induce LFs. However, as we will show, Snuba suffers from two critical limitations, which keep it from outperforming even a simple supervised baseline that is trained on the same labeled-set. First, Snuba only uses the labeled-set to generate the LFs but does not make effective use of it in the final model training. Second, because it naively aggregates these LFs, it is unable to differentiate between very noisy LFs and more useful ones. We address both of these limitations in this work.

Label      Generated LFs    Weighting
NUMERIC    how long
ENTITY     what does
Table 1: Illustration of induced LFs, including examples of the issue of conflicting LFs, on the TREC dataset. Learning the importance (weights) of LFs can be used to reduce the conflicts among LFs.

Table 1 presents a sample set of induced LFs and assigned labels from the TREC dataset Li and Roth (2002). These induced LFs are likely to be less precise than human-crafted LFs and to have more conflicts among different LFs. Owing to incompleteness and noise in the induced LFs, we observe that existing label aggregators that merely consume the outputs of the LFs are not well suited for such noisy LFs (c.f. Table 1). For instance, the sentence "How long does a dog sleep ?" will be assigned both the DESCRIPTION and NUMERIC labels due to the LFs "how" and "how long". As a solution, "how" should be given less importance due to its noisy and conflicting nature, whereas "how long", and therefore the NUMERIC label, should be given higher importance. In this paper, we propose a bi-level optimization framework for reweighting the induced LFs, to effectively down-weight the noisy LFs while also up-weighting the more useful ones.
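The effect of such reweighting on the conflicting example above can be sketched with a toy weighted vote. The weights and class assignments below are purely illustrative, not values learned by our method:

```python
def weighted_vote(firings, lf_classes, lf_weights):
    """Aggregate LF outputs into per-class scores, scaled by each LF's weight."""
    scores = {}
    for j, fired in enumerate(firings):
        if fired:
            c = lf_classes[j]
            scores[c] = scores.get(c, 0.0) + lf_weights[j]
    # return the highest-scoring class, or None if no LF fired
    return max(scores, key=scores.get) if scores else None

# "How long does a dog sleep ?" triggers both LFs below.
lf_classes = ["DESCRIPTION", "NUMERIC"]  # LF "how" -> DESCRIPTION, LF "how long" -> NUMERIC
lf_weights = [0.2, 0.9]                  # the noisy LF "how" is down-weighted
print(weighted_vote([1, 1], lf_classes, lf_weights))  # NUMERIC
```

With equal weights the two LFs would tie; down-weighting the noisy "how" LF resolves the conflict in favor of NUMERIC.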

Figure 1: Pictorial depiction of our Wisdom workflow. A small labeled-set is used to automatically induce LFs. This labeled set is split equally into supervised set and validation set to be used by our re-weighted semi-supervised data programming algorithm along with the unlabeled set.

In Figure 1, we present an overview of our approach. We leverage semi-supervision in the feature space for more effective data programming using the induced (automatically generated) labeling functions. To enable this, we split the same labeled-set (which was used to generate the LFs) into a supervised set and a validation set. The supervised set is used for semi-supervised data programming, and the validation set is used to tune the hyper-parameters. In this work, we leverage Spear Maheshwari et al. (2021) for semi-supervised data programming, which has achieved state-of-the-art performance. While the semi-supervised data programming approach helps in using the labeled data more effectively, it does not solve the problem of noise associated with the LFs. To address this, we propose an LF reweighting framework, Wisdom (reWeIghting based Semi-supervised Data prOgraMming), which learns to reweight the labeling functions, thereby helping differentiate the noisy LFs from the cleaner and more effective ones.

The reweighting is achieved by framing the problem in terms of bi-level optimization. We argue that using a small labeled-set can help improve label prediction over hitherto unseen test instances when the labeled-set is bootstrapped for (i) inducing LFs, (ii) semi-supervision, and (iii) bi-level optimization to reweight the LFs. For most of this work, the LFs are induced automatically by leveraging part of the approach described in Varma and Ré (2018). The LFs are induced on the entire labeled-set, whereas the semi-supervision and reweighting are performed on the supervised set and validation set respectively (which are disjoint partitions of labeled-set).

Figure 2: A summary plot contrasting the performance gains obtained using Wisdom over previous state-of-the-art approaches on YouTube and TREC (both with Lemma features). Wisdom outperforms other learning approaches with auto-generated LFs and comes close to the performance of Spear with human-crafted LFs.

1.1 Our Contributions and Roadmap

Our contributions are as follows: We leverage Snuba Varma and Ré (2018) to automatically generate LFs and employ them for semi-supervised learning over both features and LFs. We address the limitations of Snuba by effectively using the labeled set in a semi-supervised manner using Spear Maheshwari et al. (2021) and, critically, by making the labeling function aggregation more robust via a reweighting framework. We do the reweighting by proposing a bi-level optimization algorithm that weighs each LF separately, giving low importance to noisy LFs and high importance to relevant LFs. We evaluate on six text classification datasets and demonstrate better performance than current label aggregation approaches with automatically generated labeling functions.

A summary of the results is shown in Figure 2. As mentioned, Snuba performs worse than a simple supervised baseline that just trains on the labeled data component. Furthermore, Wisdom outperforms Spear (a state-of-the-art semi-supervised data programming algorithm, here run with auto-generated LFs) and Vat (a state-of-the-art semi-supervised learning algorithm), demonstrating the benefit of having both semi-supervision and robust LF reweighting with the auto-generated LFs. Finally, Wisdom gets to within 2 - 4% of Hum-Spear (Spear using human-crafted LFs), without the cost of generating labeling functions manually, which can require significant domain knowledge.

Paper Overview: The paper is structured as follows. The first step in our proposed approach, as referred to in Figure 1, is the generation of labeling functions (LFs), either manually or automatically. After setting up the basic notation in Section 2.1, in Section 2.2 we elaborate on the basic process we leverage to automatically generate LFs. As outlined in Figure 1, the LFs are employed in learning a joint model over the feature-based model and the labeling function aggregators on the labeled and unlabeled datasets. We describe the state-of-the-art label aggregation model (Cage) employed by us in Section 2.3 and the state-of-the-art semi-supervised data programming objective (Spear) in Section 2.4, both of which form starting points for Wisdom. However, these are not sufficiently robust to noisy LFs, especially those induced automatically. To address this problem, Section 3 presents our bi-level optimization-based LF reweighting framework Wisdom, which reweighs LFs so that effective ones are assigned a higher weight and the noisy ones are down-weighted. Therein, we present in detail (i) the adaptation of the automatic LF generator, (ii) reweighting modifications to the LF label aggregator, and (iii) reweighting modifications to the semi-supervised objective. Finally, in Section 4, we demonstrate the utility of our algorithms empirically on six text classification datasets.

2 Background

Notation                         Description
$\tau_i$                         Firings of all the LFs on an instance $x_i$
$l_{ij}$                         Class associated by LF $\lambda_j$, when triggered ($\tau_{ij} = 1$) on $x_i$
$P^f_\phi$                       The feature-based model with parameters $\phi$ operating on feature space $\mathcal{X}$ and on label space $\mathcal{Y}$
$P_\theta$                       The label probabilities as per the LF-based aggregation model with parameters $\theta$
labeled-set ($\mathcal{L}$)      The entire labeled dataset: $\{(x_i, y_i)\}_{i=1}^{N}$ where $y_i \in \mathcal{Y}$. This is used to induce the LFs
supervised set ($\mathcal{S}$)   Subset of $\mathcal{L}$ that is used for semi-supervision
validation set ($\mathcal{V}$)   Subset of $\mathcal{L}$ that is used for reweighting the LFs using a bi-level optimization formulation
unlabeled-set ($\mathcal{U}$)    Unlabeled set: $\{x_i\}_{i=1}^{M}$. It is labeled using the induced LFs
$L_{CE}$                         Cross entropy loss
$H$                              Entropy function
$g$                              Label prediction from the LF-based graphical model
$LL_s$                           Supervised negative log likelihood over the parameters of the LF aggregation model
$LL_u$                           Unsupervised negative log likelihood summed over labels
$KL$                             KL divergence between two probability models
$R$                              Quality guide based loss
$L(\theta, \phi, w)$             The semi-supervised bi-level optimization objective with additional weight parameters $w$ over the LFs

Table 2: Summary of notation used in this paper.

2.1 Notations

Let the feature space be denoted by $\mathcal{X}$ and the label space by $\mathcal{Y} = \{1, \dots, K\}$, where $K$ is the number of classes. Let the (automatically or manually) generated labeling functions be denoted by $\lambda_1$ to $\lambda_m$, where $m$ is the number of labeling functions generated. Let the vector $\tau_i = (\tau_{i1}, \dots, \tau_{im})$ denote the firings of all the LFs on an instance $x_i$. Each $\tau_{ij}$ can be either 1 or 0; $\tau_{ij} = 1$ indicates that the LF $\lambda_j$ has fired on the instance $x_i$, and $\tau_{ij} = 0$ indicates it has not. Furthermore, each labeling function $\lambda_j$ is associated with some class $k_j$ and, for input $x_i$, outputs a discrete label $l_{ij} = k_j$ when triggered (i.e., $\tau_{ij} = 1$) and $l_{ij} = 0$ otherwise.

Let the labeled-set be denoted as $\mathcal{L} = \{(x_1, y_1), \dots, (x_N, y_N)\}$, where $y_i \in \mathcal{Y}$ and $N$ is the number of points in labeled-set. Similarly, we have an unlabeled dataset denoted as $\mathcal{U} = \{x_1, \dots, x_M\}$, where $M$ is the number of unlabeled points. The labeled-set is further split into two disjoint sets called supervised set and validation set. Let the supervised set be denoted as $\mathcal{S}$, where $\mathcal{S} \subset \mathcal{L}$. Let the validation set be denoted as $\mathcal{V}$, where $\mathcal{V} = \mathcal{L} \setminus \mathcal{S}$. In Table 2, we summarize most notations that we use in the paper.
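The trigger matrix $\tau$ and label matrix $l$ can be illustrated with a small sketch, using hypothetical keyword predicates as stand-ins for arbitrary labeling functions:

```python
def firing_matrices(instances, lfs):
    """lfs: list of (predicate, class_id) pairs. Returns the 0/1 trigger
    matrix tau and the label matrix l (class id when triggered, else 0)."""
    tau = [[1 if pred(x) else 0 for pred, _ in lfs] for x in instances]
    l = [[cls if fired else 0 for fired, (_, cls) in zip(row, lfs)]
         for row in tau]
    return tau, l

# Two illustrative keyword LFs: "how long" -> class 2, "what" -> class 1
lfs = [(lambda x: "how long" in x, 2), (lambda x: "what" in x, 1)]
tau, l = firing_matrices(["how long does a dog sleep ?", "who is he ?"], lfs)
print(tau)  # [[1, 0], [0, 0]]
print(l)    # [[2, 0], [0, 0]]
```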

2.2 Snuba: Automatic LF Generation

Varma and Ré (2018) present Snuba, a three-step approach that (i) automatically generates candidate LFs (referred to as heuristics) using the labeled-set, (ii) filters heuristics based on diversity and accuracy metrics to select only relevant heuristics, and (iii) uses the final set of filtered LFs (heuristics) and a label aggregator to compute class probabilities for each point in the unlabeled set. Steps (i) and (ii) are repeated until the labeled set is exhausted or the iteration limit is reached. Each LF is a basic composition of propositions on the labeled set. A proposition could be a word, a phrase, or a lemma (c.f., the second column of Table 1), or an abstraction such as a part-of-speech tag. The composition is in the form of a classifier such as a decision stump (1-depth decision tree) or logistic regression.
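A minimal sketch of this candidate-generation step, assuming unigram propositions and a decision-stump-style composition scored by precision on the labeled set (the threshold and examples below are illustrative, not Snuba's actual filtering criteria):

```python
def induce_stump_lfs(labeled, vocab, min_precision=0.8):
    """Return (token, label) stumps whose precision on `labeled` passes the bar."""
    lfs = []
    for token in vocab:
        covered = [(x, y) for x, y in labeled if token in x.split()]
        if not covered:
            continue
        # assign the majority label among the covered examples
        labels = [y for _, y in covered]
        label = max(set(labels), key=labels.count)
        precision = labels.count(label) / len(labels)
        if precision >= min_precision:
            lfs.append((token, label))
    return lfs

labeled = [("how long does it take", "NUMERIC"),
           ("how long is the nile", "NUMERIC"),
           ("what does caliente mean", "ENTITY")]
print(induce_stump_lfs(labeled, ["long", "what"]))
# [('long', 'NUMERIC'), ('what', 'ENTITY')]
```

Snuba additionally filters candidates for diversity (coverage of so-far-uncovered points), which this sketch omits.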

Our Wisdom framework utilizes steps (i) and (ii) from Snuba and thereafter employs a more robust aggregator Cage. More importantly, instead of discarding labeled-set, after steps (i) and (ii), we bootstrap subsets of labeled-set for semi-supervised learning in conjunction with while also reweighting the LFs for robustness.

2.3 Cage: Label Aggregation Model

In this work, we consider the Cage Chatterjee et al. (2020) model as a label aggregator, since it provides more stability during training compared to other label aggregators such as Snorkel. Cage aggregates the LFs by imposing a joint distribution between the true label $y$ and the values returned by each LF:

$$P_\theta(l_i, y) = \frac{1}{Z_\theta} \prod_{j=1}^{m} \psi_\theta(\tau_{ij}, y) \qquad (1)$$

There are $K$ parameters $\theta_{j1}, \dots, \theta_{jK}$ for each LF $\lambda_j$, where $K$ is the number of classes. The potential $\psi_\theta$ used in the Cage model is defined as:

$$\psi_\theta(\tau_{ij}, y) = \begin{cases} \exp(\theta_{jy}) & \text{if } \tau_{ij} \neq 0 \\ 1 & \text{otherwise} \end{cases} \qquad (2)$$

As shown in Equation (2), the Cage model assumes all LFs are equally good. This assumption proves to be severely limiting for Cage whenever LFs are noisy, which is often the case with automatically generated LFs.
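A minimal sketch of how such potentials turn LF firings into class probabilities. The theta values in the usage are illustrative; the actual Cage model learns them by maximizing likelihood over the unlabeled data:

```python
import math

def class_posterior(firings, theta, n_classes):
    """Product of per-LF potentials exp(theta[j][y]) over fired LFs,
    normalized over classes (non-fired LFs contribute a potential of 1)."""
    scores = []
    for y in range(n_classes):
        log_s = sum(theta[j][y] for j, fired in enumerate(firings) if fired)
        scores.append(math.exp(log_s))
    z = sum(scores)
    return [s / z for s in scores]

# Two LFs with illustrative parameters; only the first one fires.
probs = class_posterior([1, 0], theta=[[0.5, 1.5], [1.0, 0.2]], n_classes=2)
```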

2.4 Spear: Joint SSL Data Programming

Maheshwari et al. (2021) propose a joint learning framework called Spear that learns the parameters of a feature-based classification model and of the label aggregation model (the LF model) in a semi-supervised manner. Spear's feature-based classification model takes the features as input and predicts the class label; Spear employs two kinds of models: a logistic regression model and a two-layer neural network. For the LF aggregation model, Spear uses an LF-based graphical model inspired by Cage (as described in Section 2.3). The loss function of Spear has six terms: the cross entropy on the labeled set, an entropy SSL term on the unlabeled dataset, a cross-entropy term to ensure consistency between the feature model and the LF model, the LF graphical-model terms on the labeled and unlabeled datasets, a KL divergence (again for consistency between the two models), and finally a regularizer. Schematically, the objective function is:

$$\min_{\theta, \phi} \; L_{CE}(\phi; \mathcal{L}) + H(P^f_\phi; \mathcal{U}) + L_{CE}(P^f_\phi, g_\theta; \mathcal{U}) + LL_s(\theta; \mathcal{L}) + LL_u(\theta; \mathcal{U}) + KL(P^f_\phi, P_\theta) + R(\theta)$$

In the objective function above, the LF model parameters are $\theta$ and the feature model parameters are $\phi$. The learning problem of Spear is simply to optimize the objective jointly over $\theta$ and $\phi$.

3 The Wisdom Workflow

In this section, we present our robust aggregation framework for automatically generated LFs. We present the LF generation approach followed by our reweighting algorithm, which solves a bi-level optimization problem. In the bi-level optimization, we learn the LF weights in the outer level, and in the inner level, we learn the feature-based classifier's and labeling function aggregator's parameters jointly. We describe the main components of the Wisdom workflow below (see also Figure 1); a detailed pseudocode of Wisdom is provided in Algorithm 1.

Automatic LF Generation using SNUBA: Our Wisdom framework utilizes steps (i) and (ii) from Snuba (c.f., Section 2.2) for automatically inducing LFs. That is, it initially iterates between i) candidate LF generation on labeled-set and ii) filtering them based on diversity and accuracy based criteria, until a limit on the number of iterations is reached (or until the labeled set is completely covered). We refer to these steps as SnubaLFGen.

Re-Weighting Cage: To deal with noisy labels effectively, we associate each LF $\lambda_j$ with an additional weight parameter $w_j$ that acts as its reliability measure. The discrete potential in Cage (c.f., eq. (2)) can be modified to include the weight parameters as follows:

$$\psi^w_\theta(\tau_{ij}, y) = \begin{cases} \exp(w_j \theta_{jy}) & \text{if } \tau_{ij} \neq 0 \\ 1 & \text{otherwise} \end{cases} \qquad (3)$$

We observe that if the weight of an LF is zero (i.e., $w_j = 0$), the corresponding weighted potential in eq. (3) becomes one, which in turn implies that the LF is ignored while maximizing the log-likelihood during label aggregation. Similarly, suppose all the LFs are associated with a weight value of one (i.e., $w_j = 1$ for all $j$); in that case, the above weighted potential is the same as the discrete potential used in Cage. The re-weighted Cage is invoked implicitly in Algorithm 1 wherever the weighted potentials are used.
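The effect of the weights can be sketched by extending the posterior computation with the weighted potential; this is an illustrative re-implementation of the idea, not library code, and the theta values below are made up:

```python
import math

def weighted_posterior(firings, theta, w, n_classes):
    """Weighted variant: a fired LF j contributes exp(w[j] * theta[j][y]).
    With w[j] = 0 the LF is ignored; with w[j] = 1 this reduces to the
    unweighted Cage potential."""
    scores = []
    for y in range(n_classes):
        log_s = sum(w[j] * theta[j][y] for j, fired in enumerate(firings) if fired)
        scores.append(math.exp(log_s))
    z = sum(scores)
    return [s / z for s in scores]
```

Setting the weight of a noisy LF to zero makes its contribution vanish from the aggregation, exactly as the potential in eq. (3) prescribes.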

Input: labeled-set (split into supervised set and validation set), unlabeled-set; learning rates
  **** Automatic LF generation using SNUBA ****
  Induce LFs on the labeled-set via SnubaLFGen
  Compute the LF trigger matrix and the LF output label matrix for all sets using the induced LFs
  **** The Reweighted Joint SSL ****
  Randomly initialize model parameters $\theta$, $\phi$ and LF weights $w$
  repeat
      Sample a mini-batch of the given batch size from the supervised and unlabeled sets
      **** Bi-level optimization ****
      **** Inner level **** Take one gradient step on ($\theta$, $\phi$) for the reweighted joint SSL objective, using the current weights $w$
      **** Outer level **** Take one gradient step on $w$ to reduce the validation loss of the one-step-updated feature model
      **** Update net parameters **** Update ($\theta$, $\phi$) using the new weights $w$
  until convergence
  return $\theta$, $\phi$, $w$
Algorithm 1 Wisdom

The Reweighted Joint SSL: Since the label aggregator graphical model now depends on the additional LF weight parameters $w$, the joint semi-supervised learning objective is modified accordingly: the Spear objective of Section 2.4 is computed with the weighted potentials $\psi^w_\theta$ in place of $\psi_\theta$. We denote the resulting objective by $L(\theta, \phi, w)$.
Bi-Level Objective:

We need to learn the weights $w$ before training the labeling-function and feature-classifier parameters with the reweighted objective. Wisdom learns the optimal LF weights by posing this as a bi-level optimization problem, using the validation set $\mathcal{V}$ (a subset of labeled-set):

$$w^* = \operatorname*{argmin}_{w} \; L_{CE}\big(\phi^*(w); \mathcal{V}\big) \quad \text{s.t.} \quad \big(\theta^*(w), \phi^*(w)\big) = \operatorname*{argmin}_{\theta, \phi} \; L(\theta, \phi, w) \qquad (4)$$

where $L(\theta, \phi, w)$ is the reweighted joint SSL objective above. In essence, Wisdom tries to learn LF weights that result in minimum validation loss for the feature model that is jointly trained with the weighted label aggregator.
However, finding the optimal solution to the above bi-level objective is computationally intractable. Hence, inspired by MAML Finn et al. (2017), Wisdom adopts an iterative alternating minimization framework, wherein we optimize the objective at each level using a single gradient-descent step. As shown in Algorithm 1, the inner-level step updates the model parameters $\theta$, $\phi$ using the current choice of weight parameters for one gradient step; the outer-level step then updates the weight parameters $w$ using these one-step updates; finally, the net parameters $\theta$, $\phi$ are updated with the new weights. This procedure is continued until convergence (e.g., a fixed number of epochs or no improvement in the outer-level loss).
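The alternating one-step scheme can be sketched on a toy one-parameter model. The quadratic losses and the finite-difference hypergradient below are illustrative stand-ins for the actual objectives and back-propagated hypergradients:

```python
def bilevel_step(theta, w, eta=0.1, gamma=0.1, eps=1e-4):
    """One MAML-style bi-level step on toy losses.

    Inner (training) loss:   w * (theta - 1)^2   -- depends on the weight w
    Outer (validation) loss: (theta - 2)^2
    """
    val = lambda th: (th - 2.0) ** 2

    def one_step(wt):
        # one inner gradient step: d/dtheta [wt * (theta - 1)^2] = 2*wt*(theta - 1)
        return theta - eta * 2.0 * wt * (theta - 1.0)

    # hypergradient of the validation loss w.r.t. w, via finite differences
    hyper = (val(one_step(w + eps)) - val(one_step(w - eps))) / (2 * eps)
    w_new = w - gamma * hyper        # outer-level update of the LF weight
    theta_new = one_step(w_new)      # net parameter update with the new weight
    return theta_new, w_new
```

Each call performs one inner step, one outer step, and one net-parameter update, mirroring the loop body of Algorithm 1.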

4 Experiments

Dataset    |Supervised|  |Validation|  |Unlabeled|  #LFs  #Labels
IMDB            71            71           1278      18      2
YouTube         55            55            977      11      2
SMS            463           463           8335      21      2
TREC           273           273           4918      13      6
Twitter        707           707          12019      25      3
SST-5          568           568           9651      25      5
Table 3: Summary statistics of the datasets and the automatically generated LFs using Snuba. The size of the validation set is equal to that of the supervised set, and the two are disjoint; the labeled-set is their union. The test set contains 500 instances for each dataset.

In this section, we evaluate Wisdom against state-of-the-art semi-supervised approaches using a common automatic LF induction process inherited from Snuba Varma and Ré (2018). We also present comparisons against a purely supervised baseline. Our experimental evaluations are on six datasets, covering tasks such as question, sentiment, spam, and genre classification.

4.1 Experimental Setting

As depicted in Figure 1, we use Snuba to induce LFs based on the labeled-set, which is set to 10% of the complete dataset. The labeled-set is split equally into supervised set and validation set, to be used by the semi-supervised models and the reweighting approach, respectively. To train our model on the supervised set, we use a neural network architecture with two hidden layers (512 units) and the ReLU activation function as our feature-based model. We choose our classification network to be the same as Spear Maheshwari et al. (2021). We consider two types of features: a) raw words and b) lemmatizations, as input to our supervised model (lemmatization is a technique to reduce a word, e.g., 'walking', to its root form, 'walk'). These features are also used as the basic propositions over which the composite LFs are induced.

In each experimental run, we train Wisdom for 100 epochs with early stopping performed based on the validation set. Our model is optimized using stochastic gradient descent with the Adam optimizer. The dropout probability is set to 0.8 and the batch size to 32. The learning rates for the classifier and the LF aggregation model are set to 0.0003 and 0.01, respectively. In each experiment, the performance numbers are obtained by averaging over five runs, each with a different random initialisation. The model with the best performance on the validation set was chosen for evaluation on the test set. We employ macro-F1 as the evaluation measure on all the datasets. We implement all our models in PyTorch (https://pytorch.org/) Paszke et al. (2019); source code is attached as supplementary material. We run all our experiments on Nvidia RTX 2080 Ti GPUs with 12 GB RAM, within an Intel Xeon Gold 5120 CPU host having 56 cores and 256 GB RAM. Model training times range from 5 mins (YouTube) to 60 mins (TREC).
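Macro-F1, the evaluation measure used above, is the unweighted mean of per-class F1 scores; a small self-contained sketch (in practice one would use a library implementation such as scikit-learn's `f1_score` with `average='macro'`):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Unlike micro-F1 or accuracy, macro-F1 weighs rare and frequent classes equally, which matters for imbalanced datasets such as SMS spam.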

4.2 Datasets

We present evaluations across six datasets, viz., TREC, SMS, IMDB, YouTube, Twitter, and SST-5. In Table 3, we present statistics summarising the datasets, including the sizes of the supervised set and the validation set (with labeled-set being the union of these disjoint sets) and the LFs used in experiments. The dataset summaries are presented next: TREC-6 Li and Roth (2002): A question classification dataset with six categories: Description, Entity, Human, Abbreviation, Numeric, Location. YouTube Spam Classification Alberto et al. (2015): A spam classification task over comments on YouTube videos. IMDB Genre Classification: A plot-summary-based movie genre binary classification dataset. SMS Spam Classification Almeida et al. (2011): A binary spam classification dataset to detect spam in SMS messages. Twitter Sentiment Wan and Gao (2015): A 3-class sentiment classification problem extracted from the Twitter feeds of popular airline handles; each tweet is labeled as negative, neutral, or positive. Stanford Sentiment Treebank (SST-5) Socher et al. (2013): A single-sentence movie review dataset; each sentence is labeled as negative, somewhat negative, neutral, somewhat positive, or positive.


Dataset Sup Snuba L2R VAT PR IL Auto-Spear Wisdom Hum-Spear
IMDB Raw 68.8 (0.2) -5.9 (2) -6.6 (1.6) -12.3 (1) +2.7 (15.6) +2.4 (1.7) +2.4 (1.6) +3.4 (0.1) NA
Lemma 72.4 (1.3) -14.4 (5.7) -3.7 (14.7) -19.3 (0.1) -11.7 (4.1) -6.4 (8.2) -2.4 (1.6) +3.6 (1.4) NA
YouTube Raw 90.8 (0.3) -33.2 (1.8) +0.5 (0.5) +0.5 (0) -4.7 (0.4) +0.2 (0.3) +0.8 (0.5) +1.4 (0.0) +3.8 (0.2)
Lemma 86 (0.3) -28.7 (2.9) -2.2 (0.7) -3.8 (0.2) -7.5 (0.5) -2.6 (0.3) -7.9 (3.7) +4.4 (0.2) +6.9(0.7)
SMS Raw 92.3 (0.5) -16.7 (9.8) -5.6 (0.4) +1.1 (0.1) +0.3 (0.1) 0 (0.3) 0.4 (0.8) +1.5 (0.1) +0.1 (0.5)
Lemma 91.4 (0.5) -16.1 (5.3) -5.9 (0.5) +1.6 (0.5) +0.6 (0.3) +1.5 (0.3) -1.5 (1.8) +2 (0.5) 0 (0.1)
TREC Raw 58.3 (3.1) -6.8 (4.1) -11.8 (0.8) +3.7 (0.5) -2.2 (0.6) -0.3 (0.8) -0.9 (0.5) +3.4 (0.5) -3.5 (0.5)
Lemma 56.3 (0.3) -5.8 (5.1) -5.5 (0.6) +3.0 (0.5) +0.4 (0.4) +0.8 (0.8) +2.7 (0.1) +3.9 (0.5) -0.1(0.3)
Twitter Raw 52.61 (0.12) -7 (4.1) -5 (2.3) +0.41 (3.5) -4.49 (3.6) -0.85 (0.6) -4.24 (0.4) +1.04 (0.8) NA
Lemma 61.24 (0.52) -9.28 (5.1) -18.03 (1.5) -10.8 (5.3) -8.12 (2.1) -3.79 (0.1) +1.9 (0.1) +3.97 (0.7) NA
SST-5 Raw 27.54 (0.12) -9 (2.2) -7.98 (0.2) -6.12 (0.12) -5.59 (0.2) -2.11 (0.1) -4.12 (0.1) +0.97 (0.3) NA
Lemma 27.52 (0.52) -8.31 (3.1) -8.1 (8.1) -7.89 (1.6) -7 (4.7) -3.4 (0.16) -3.13 (2.1) +0.79 (0.3) NA
Table 4: Performance of different approaches on six datasets: IMDB, YouTube, SMS, TREC, Twitter, and SST-5. Results are shown for both 'Raw' and 'Lemmatized' features. The numbers are macro-F1 scores over the test set averaged over 5 runs; for all methods after the double line they are reported as gains over the baseline (Supervised). L2R, PR, IL, Auto-Spear, and Wisdom all use the automatically generated LFs; Supervised and Vat do not use LFs; and the skyline Hum-Spear uses the human-generated LFs. 'NA' in the Hum-Spear column indicates that human LFs are not available. Numbers in brackets '()' represent the standard deviation of the original scores, not of the gains. Wisdom consistently outperforms all other approaches, including Auto-Spear and Vat.

4.3 Baselines

In Table 4, we compare our approach against the following baselines:
Snuba Varma and Ré (2018): Recall from Section 2.2 that Snuba iteratively induces LFs from the count-based raw features of the dataset in steps (i) and (ii). For step (iii), as in Varma and Ré (2018), we employ a generative model to assign probabilistic labels to the unlabeled set. These probabilistic labels are then used to train the 2-layer NN model.

Supervised (SUP): This is the model obtained by training the classifier only on labeled-set. This baseline does not use the unlabeled set.
Learning to Reweight (L2R) Ren et al. (2018): This method trains the classifier using a meta-learning algorithm over the noisy labels in the unlabeled set, obtained using the automatically generated labeling functions and aggregated using Snorkel. It runs an online algorithm that assigns importance to examples based on the gradient.
Posterior Regularization (PR) Hu et al. (2016): This is a method for joint learning of a rule network and a feature network in a teacher-student setup. Similarly to L2R, it uses the noisy labels in the unlabeled set obtained using the automatically generated labeling functions.
Imply Loss (IL) Awasthi et al. (2020): This method leverages both rules and labeled data by associating each rule with exemplars of correct firings (i.e., instantiations) of that rule. Its joint training algorithm de-noises over-generalized rules and trains a classification model. This is also run on the automatically generated LFs.
SPEAR Maheshwari et al. (2021): This method employs a semi-supervised framework combined with graphical model for consensus amongst the LFs to train the model. We compare against two versions of Spear. One which uses auto-generated LFs, like L2R, PR, IL, and VAT (which we call Auto-Spear), and a skyline Hum-Spear, which uses the human LFs.
VAT: Virtual Adversarial Training Miyato et al. (2018) is a semi-supervised approach that uses the virtual adversarial loss on the unlabeled points, thereby ensuring robustness of the conditional label distribution around the unlabeled points.

4.4 Results

In Table 4, we compare the performance of Wisdom with different baselines (all using auto-generated labeling functions), using the raw and lemmatized count features/propositions (Section 2.2) across all datasets. We observe that Snuba performs worse than the Supervised baseline over all datasets, with high variance over different runs (surprisingly, Varma and Ré (2018) did not compare the performance of Snuba against the Supervised baseline). Learning to Reweight (L2R) performs worse than Supervised on all datasets except YouTube. Posterior regularization, imply loss, and Spear show gains over Supervised on a few datasets, but not consistently across all datasets and settings. Finally, Vat obtains competitive results in some settings (e.g., the TREC dataset), but performs much worse in others (e.g., IMDB and SST-5). In contrast, Wisdom achieves consistent gains over Supervised and the other baselines on almost all datasets (except TREC with raw features, where Vat does slightly better than Wisdom). Additionally, Wisdom yields smaller variance over different runs compared to other semi-supervised approaches. Since the main difference between Wisdom and Auto-Spear is that the former reweighs the LFs in both the label aggregator and the semi-supervised loss, the gains show the robustness of the bi-level optimisation algorithm. Note that these numbers are all reported using only 10% labeled data; hence, results for some datasets (starting with Sup) might appear lower than those reported in the literature.

Note that, as a skyline, we compare Wisdom (using automatically induced LFs) to Hum-Spear, which uses human-crafted LFs in conjunction with the state-of-the-art Spear approach Maheshwari et al. (2021). We observe that despite inducing LFs automatically, the performance of Wisdom is only 2 - 4% worse than that of Spear with human-generated LFs (and it is comparable in the case of SMS). Note that crafting LFs by hand is often a time-consuming process and requires a lot of domain knowledge, so automatic LF generation in conjunction with Wisdom is definitely a promising avenue to consider. Furthermore, the automatically generated LFs could also serve as a starting point for humans to create better LFs to improve performance. An ablation test in Figure 3 reveals that Wisdom yields larger gains for smaller labeled-set sizes.

Figure 3: Ablation study with different labeled-set sizes on the YouTube dataset.

5 Related Work

In this section, we describe additional related work that was not covered in Section 2.

Automatic Rule Generation: The programming-by-examples paradigm produces a program from a given set of input-output pairs Gulwani (2012); Singh and Gulwani (2012). It synthesizes programs that satisfy all of the given input-output pairs. RuleNN Sen et al. (2020) learns interpretable first-order logic rules as compositions of semantic role attributes. Many of these approaches, however, learn more involved rules (using, e.g., a neural network), which may not work in the realistic setting of very little labeled data. In contrast, Snuba and Wisdom use more explainable models, such as logistic regression and decision trees, for rule induction.

Semi-supervised Learning (SSL): The goal of SSL is to effectively use unlabeled data during training. Early SSL algorithms used regularization-based approaches such as margin regularization and Laplacian regularization Chapelle et al. (2010). However, with the advent of advanced data augmentation, another form of regularization called consistency regularization has become popular. Most recent SSL approaches, such as Mean Teacher Tarvainen and Valpola (2018), VAT Miyato et al. (2018), UDA Xie et al. (2020), MixMatch Berthelot et al. (2019), and FixMatch Sohn et al. (2020), introduce various kinds of perturbations and augmentations that can be used with a consistency loss. Even though current SSL approaches perform well with minimal labels, they are computationally intensive and cannot be easily deployed in low-resource scenarios. Furthermore, it is difficult to explain the discriminative behavior of the resulting semi-supervised models.
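The consistency-regularization idea shared by these methods can be sketched in a few lines: penalize the divergence between the model's predictive distributions on an input and on a perturbed or augmented version of it. This is a generic illustration of the idea, not the exact loss of any one of the cited methods (VAT, for instance, uses a KL term with an adversarially chosen perturbation).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_clean, logits_aug):
    """Mean KL(p_clean || p_aug) over a batch: penalizes the model when
    its prediction on an augmented input drifts from its prediction on
    the original input. Zero when the two distributions coincide."""
    p = softmax(logits_clean)
    q = softmax(logits_aug)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())
```

Because the loss needs no labels, it can be computed on all unlabeled examples, which is what lets these methods exploit unlabeled data at scale.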

Bi-level Optimization: The concept of bi-level optimization was first discussed in von Stackelberg et al. (1952); Bracken and McGill (1973); Bard (2006). Since then, the framework of bi-level optimization has been used in various machine learning applications such as hyperparameter tuning Mackay et al. (2019); Franceschi et al. (2018); Sinha et al. (2020), robust learning Ren et al. (2018); Guo et al. (2020), meta-learning Finn et al. (2017), efficient learning Killamsetty et al. (2021), and continual learning Borsos et al. (2020). Previous applications of the bi-level optimization framework for robust learning have been limited to supervised and semi-supervised learning settings. To the best of our knowledge, Wisdom is the first framework that uses a bi-level optimization approach for robust aggregation of labeling functions.
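As a toy illustration of the bi-level structure (not Wisdom's actual formulation), consider an inner problem with a closed-form solution, a weighted mean of source values, and an outer loop that tunes the per-source weights by gradient descent on a validation objective. All names and values here are hypothetical.

```python
import numpy as np

def inner_solution(w, x):
    # Inner problem: theta*(w) = argmin_theta sum_i w_i (theta - x_i)^2,
    # whose closed form is the weighted mean of the x_i.
    return np.dot(w, x) / w.sum()

def outer_step(w, x, x_val, lr=0.1):
    # Outer problem: adjust w so that theta*(w) fits a validation target.
    # d theta*/d w_i = (x_i - theta*) / sum(w) from the closed form above.
    theta = inner_solution(w, x)
    dtheta_dw = (x - theta) / w.sum()
    grad = 2.0 * (theta - x_val) * dtheta_dw   # chain rule on (theta - x_val)^2
    return np.clip(w - lr * grad, 1e-6, None)  # keep weights non-negative

# One noisy source (x=10) among clean sources (x=1): the outer loop
# learns to down-weight the noisy source so theta*(w) approaches 1.0.
x = np.array([1.0, 1.0, 10.0])
w = np.ones(3)
for _ in range(200):
    w = outer_step(w, x, x_val=1.0)
theta = inner_solution(w, x)
```

The same two-level pattern, an inner model fit under weights and an outer gradient step on the weights against held-out data, underlies the reweighting methods cited above; in realistic settings the inner problem has no closed form and the hyper-gradient is approximated, e.g., through one inner SGD step as in Ren et al. (2018).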

6 Conclusion

While induction of labeling functions (LFs) for data programming has been attempted in the past by Varma and Ré (2018), we observe in our experiments that the resulting model by itself does not perform well on text classification tasks, and turns out to be even worse than the supervised baseline. A more recent semi-supervised data programming approach, Spear Maheshwari et al. (2021), performs better when used in conjunction with the induced LFs, though it fails to consistently outperform the supervised baseline. In this paper, we introduce Wisdom, a bi-level optimization formulation for reweighting the LFs, which injects robustness into the semi-supervised data programming approach, thus allowing it to perform well in the presence of noisy LFs. On a reasonably wide variety of text classification datasets, we show that Wisdom consistently outperforms all other approaches, while also coming close to the skyline of Spear with human-generated LFs.


  • T. C. Alberto, J. V. Lochter, and T. A. Almeida (2015) Tubespam: comment spam filtering on youtube. In 2015 IEEE 14th international conference on machine learning and applications (ICMLA), pp. 138–143. Cited by: §4.2.
  • T. A. Almeida, J. M. G. Hidalgo, and A. Yamakami (2011) Contributions to the study of sms spam filtering: new collection and results. In Proceedings of the 11th ACM symposium on Document engineering, pp. 259–262. Cited by: §4.2.
  • A. Awasthi, S. Ghosh, R. Goyal, and S. Sarawagi (2020) Learning from rules generalizing labeled exemplars. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §1, §1, §4.3.
  • S. H. Bach, D. Rodriguez, Y. Liu, C. Luo, H. Shao, C. Xia, S. Sen, A. Ratner, B. Hancock, H. Alborzi, et al. (2019) Snorkel drybell: a case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, pp. 362–375. Cited by: §1.
  • J. F. Bard (2006) Practical bilevel optimization: algorithms and applications (nonconvex optimization and its applications). Springer-Verlag, Berlin, Heidelberg. External Links: ISBN 0792354583 Cited by: §5.
  • D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel (2019) MixMatch: a holistic approach to semi-supervised learning. External Links: 1905.02249 Cited by: §5.
  • Z. Borsos, M. Mutný, and A. Krause (2020) Coresets via bilevel optimization for continual learning and streaming. External Links: 2006.03875 Cited by: §5.
  • J. Bracken and J. T. McGill (1973) Mathematical programs with optimization problems in the constraints. Operations Research 21 (1), pp. 37–44. External Links: ISSN 0030364X, 15265463, Link Cited by: §5.
  • O. Chapelle, B. Schlkopf, and A. Zien (2010) Semi-supervised learning. 1st edition, The MIT Press. External Links: ISBN 0262514125 Cited by: §5.
  • O. Chatterjee, G. Ramakrishnan, and S. Sarawagi (2020) Robust data programming with precision-guided labeling functions. In AAAI, Cited by: §1, §2.3.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. External Links: 1703.03400 Cited by: §3, §5.
  • L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil (2018) Bilevel programming for hyperparameter optimization and meta-learning. External Links: 1806.04910 Cited by: §5.
  • S. Gulwani (2012) Synthesis from examples: interaction models and algorithms. In 2012 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pp. 8–14. Cited by: §5.
  • L. Guo, Z. Zhang, Y. Jiang, Y. Li, and Z. Zhou (2020) Safe deep semi-supervised learning for unseen-class unlabeled data. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 3897–3906. External Links: Link Cited by: §5.
  • Z. Hu, X. Ma, Z. Liu, E. H. Hovy, and E. P. Xing (2016) Harnessing deep neural networks with logic rules. CoRR abs/1603.06318. Cited by: §4.3.
  • K. Killamsetty, D. Sivasubramanian, G. Ramakrishnan, and R. Iyer (2021) GLISTER: generalization based data subset selection for efficient and robust learning. In AAAI 2021. Cited by: §5.
  • X. Li and D. Roth (2002) Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, Cited by: §1, §4.2.
  • M. Mackay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse (2019) Self-tuning networks: bilevel optimization of hyperparameters using structured best-response functions. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • A. Maheshwari, O. Chatterjee, K. Killamsetty, R. K. Iyer, and G. Ramakrishnan (2021) Data programming using semi-supervision and subset selection. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, External Links: Link, 2008.09887 Cited by: §1.1, §1, §1, §1, §2.4, §4.1, §4.3, §4.4, §6.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1979–1993. Cited by: §4.3, §5.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. External Links: 1912.01703 Cited by: §4.1.
  • A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré (2016) Data programming: creating large training sets, quickly. In Advances in neural information processing systems, pp. 3567–3575. Cited by: §1.
  • M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334–4343. Cited by: §4.3, §5.
  • P. Sen, M. Danilevsky, Y. Li, S. Brahma, M. Boehm, L. Chiticariu, and R. Krishnamurthy (2020) Learning explainable linguistic expressions with neural inductive logic programming for sentence classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4211–4221. Cited by: §5.
  • O. Sharir, A. Shashua, and G. Carleo (2021) Neural tensor contractions and the expressive power of deep neural quantum states. CoRR abs/2103.10293. External Links: Link, 2103.10293 Cited by: §1.
  • R. Singh and S. Gulwani (2012) Synthesizing number transformations from input-output examples. In International Conference on Computer Aided Verification, pp. 634–651. Cited by: §5.
  • A. Sinha, T. Khandait, and R. Mohanty (2020) A gradient-based bilevel optimization approach for tuning hyperparameters in machine learning. External Links: 2007.11022 Cited by: §5.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §4.2.
  • K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel (2020) FixMatch: simplifying semi-supervised learning with consistency and confidence. External Links: 2001.07685 Cited by: §5.
  • A. Tarvainen and H. Valpola (2018) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. External Links: 1703.01780 Cited by: §5.
  • P. Varma and C. Ré (2018) Snuba: automating weak supervision to label training data. Proc. VLDB Endow. 12 (3), pp. 223–236. External Links: ISSN 2150-8097, Link, Document Cited by: §1, §1.1, §2.2, §4, §4.3, §4.4, §6.
  • H. von Stackelberg, S.H. Von, and A.T. Peacock (1952) The theory of the market economy. Oxford University Press. External Links: LCCN 52004949, Link Cited by: §5.
  • Y. Wan and Q. Gao (2015) An ensemble sentiment classification system of twitter data for airline services analysis. In 2015 IEEE international conference on data mining workshop (ICDMW), pp. 1318–1325. Cited by: §4.2.
  • Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2020) Unsupervised data augmentation for consistency training. External Links: 1904.12848 Cited by: §5.