1 Introduction

Currently, an important portion of research in natural language processing is devoted to reducing or eliminating the need for large labeled datasets. Recent examples include language model fine-tuning (Devlin et al., 2019), transfer learning (Zoph et al., 2016) and few-shot learning (Brown et al., 2020). Another common approach is weakly supervised learning. The idea is to make use of human intuitions or already acquired human knowledge to create weak labels. Examples of such sources are keyword lists, regular expressions, heuristics or independently existing curated data sources, e.g. a movie database if the task is concerned with TV shows. While the resulting labels are noisy, they provide a quick and easy way to create large labeled datasets. In the following, we use the term labeling functions, introduced in Ratner et al. (2017), to describe functions which create weak labels based on the notions above.
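To make the notion concrete, here is a minimal sketch of keyword-based labeling functions; the helper names and keyword lists are illustrative, not taken from the paper:

```python
import re

# ABSTAIN = -1 is a common convention for "labeling function does not match".
ABSTAIN, NEG, POS = -1, 0, 1

def lf_keyword(keywords, label):
    """Build a labeling function that fires when any keyword matches."""
    pattern = re.compile(r"\b(" + "|".join(keywords) + r")\b", re.IGNORECASE)
    def lf(text):
        return label if pattern.search(text) else ABSTAIN
    return lf

# Two hypothetical sentiment labeling functions.
lf_pos = lf_keyword(["fascinating", "hilarious"], POS)
lf_neg = lf_keyword(["poor", "dreadful"], NEG)

weak_labels = [[lf(t) for lf in (lf_pos, lf_neg)]
               for t in ["The movie was fascinating.", "A poor script."]]
```

Each row of `weak_labels` holds one weak (possibly abstaining) vote per labeling function for one sample.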
Probably the most popular example of a system using generative modeling in weak supervision is the data programming paradigm of Snorkel (Ratner et al., 2017). It uses correlations within labeling functions to learn a graph capturing dependencies between labeling functions and true labels.
However, such an approach does not directly model biases of weak supervision reflected in the feature space. In order to directly model the relevant aspects in the feature space of a weakly supervised dataset, we investigate the use of density estimation using normalizing flows. More specifically, in this work, we model probability distributions over the input space induced by labeling functions, and combine those distributions for better weakly supervised prediction.
We propose and examine four novel models for weakly supervised learning based on normalizing flows (WeaNF-*): Firstly, we introduce a standard model, WeaNF-S, where each labeling function is represented by a multivariate normal distribution, and its iterative variant WeaNF-I. Furthermore, WeaNF-N additionally learns the negative space, i.e. a density for the space where the labeling function does not match, and a mixed model, WeaNF-M, represents correlations of sets of labeling functions within the normalizing flow. As a consequence, the classification task is a two-step procedure: the first step estimates the densities, and the second step aggregates them for label prediction. Multiple alternatives are discussed and analyzed.
We benchmark our approach on several commonly used weak supervision datasets. The results highlight that our proposed generative approach is competitive with standard weak supervision methods. Additionally, the results show that smart aggregation schemes prove beneficial.
In summary, our contributions are i) the development of multiple models based on normalizing flows for weak supervision combined with density aggregation schemes, ii) a quantitative and qualitative analysis highlighting opportunities and problems, and iii) an implementation of the method (https://github.com/AndSt/wea_nf). To the best of our knowledge, we are the first to use normalizing flows to generatively model labeling functions.
2 Background and Related Work
We split this analysis into a weak supervision and a normalizing flow section as we build upon these two areas.
Weak supervision. A fundamental problem in machine learning is the need for massive amounts of manually labeled data. Weak supervision provides a way to counter the problem. The idea is to use human knowledge to produce noisy, so-called weak labels. Typically, keywords, heuristics or knowledge from external data sources are used. The latter is called distant supervision (Craven and Kumlien, 1999; Mintz et al., 2009). In Ratner et al. (2017), data programming is introduced, a paradigm to create and work with weak supervision sources programmatically. The goal is to learn the relation between weak labels and the true unknown labels (Ratner et al., 2017; Varma et al., 2019; Bach et al., 2017; Chatterjee et al., 2019). In Ren et al. (2020) the authors use iterative modeling for weak supervision. Software packages such as SPEAR (Abhishek et al., 2021), WRENCH (Zhang et al., 2021) and Knodle (Sedova et al., 2021) allow a modular use and comparison of weak supervision methods. A recent trend is to use additional information to support the learning process. Chatterjee et al. (2019) allow labeling functions to assign a score to the weak label. In Ratner et al. (2018) the human-provided class balance is used. Additionally, Awasthi et al. (2020) and Karamanolakis et al. (2021) use semi-supervised methods for weak supervision, where the idea is to use a small amount of labeled data to steer the learning process.
Normalizing flows. While the concept of normalizing flows is much older, Rezende and Mohamed (2016) popularized their use in deep learning. Like Generative Adversarial Networks (Goodfellow et al., 2014) or Variational Autoencoders (Kingma and Welling, 2014), normalizing flows provide a tractable way to model high-dimensional distributions. So far, normalizing flows have received rather little attention in the natural language processing community. Still, Tran et al. (2019) and Ziegler and Rush (2019) applied them successfully to language modeling. An excellent overview of recent normalizing flow research is given in Papamakarios et al. (2021). Normalizing flows are based on the change of variable formula, which uses a bijective function $f$ to transform a base distribution $p_Z$ into a target distribution $p_X$:

$$p_X(x) = p_Z\big(f(x)\big) \left| \det \frac{\partial f(x)}{\partial x} \right|,$$
where $p_Z$ is typically a simple distribution, e.g. a multivariate normal distribution, and $p_X$ is a complicated data generating distribution. Typically, a neural network learns the function $f$ by minimizing the KL-divergence between the data generating distribution and the simple base distribution. As described in Papamakarios et al. (2021), this is achieved by minimizing the negative log likelihood

$$\mathcal{L} = -\,\mathbb{E}_{x \sim p_X}\left[ \log p_Z\big(f(x)\big) + \log \left| \det \frac{\partial f(x)}{\partial x} \right| \right].$$
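As a minimal numeric sketch of this change-of-variables objective, consider a single fixed affine bijection mapping data to a standard-normal base distribution; all values are illustrative:

```python
import numpy as np

# f(x) = (x - mu) / sigma maps data to a standard-normal base distribution.
# The model density then follows from the change-of-variables formula:
# log p_X(x) = log p_Z(f(x)) + log |det df/dx|.
def log_prob(x, mu, sigma):
    z = (x - mu) / sigma                          # f(x)
    log_base = -0.5 * (z**2 + np.log(2 * np.pi))  # log p_Z(f(x))
    log_det = -np.log(sigma)                      # log |det df/dx| = log(1/sigma)
    return log_base + log_det

x = np.linspace(-10, 10, 20001)
p = np.exp(log_prob(x, mu=1.5, sigma=2.0))
# By construction p equals the N(1.5, 2^2) density, so it integrates to 1.
```

Minimizing the negative mean of `log_prob` over data samples is exactly the maximum-likelihood training objective above.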
The tricky part is to design efficient architectures which are invertible and provide an easy and efficient way to compute the determinant. The composition of bijective functions is again bijective, which enables deep architectures $f = f_K \circ \dots \circ f_1$. Recent research focuses on the creation of more expressive transformation modules (Lu et al., 2021). In this work, we make use of an early but well established model called RealNVP (Dinh et al., 2017). In each layer, the input $x \in \mathbb{R}^D$ is split in half and transformed according to

$$y_{1:d} = x_{1:d}, \qquad y_{d+1:D} = x_{d+1:D} \odot \exp\big(s(x_{1:d})\big) + t(x_{1:d}),$$
where $\odot$ is the pointwise multiplication and $s$ and $t$ are neural networks. Using this formulation to realize a layer $f_i$, it is easy and efficient to compute the inverse and the determinant.
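A sketch of one such affine coupling layer in plain NumPy, with toy stand-ins for the $s$ and $t$ networks (a real implementation trains them as neural networks):

```python
import numpy as np

def s(x_half):  # stand-in "scale" network
    return np.tanh(x_half)

def t(x_half):  # stand-in "translation" network
    return 0.5 * x_half

def coupling_forward(x):
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    y2 = x2 * np.exp(s(x1)) + t(x1)      # transform the second half only
    log_det = s(x1).sum(-1)              # log|det| is simply the sum of s(x1)
    return np.concatenate([x1, y2], -1), log_det

def coupling_inverse(y):
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))   # cheap inverse: s and t are never inverted
    return np.concatenate([y1, x2], -1)

x = np.random.default_rng(0).normal(size=(4, 6))
y, log_det = coupling_forward(x)
```

The design point of RealNVP is visible here: the inverse and the log-determinant come essentially for free, regardless of how complex $s$ and $t$ are.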
Figure 1: Each axis represents the learned density related to a labeling function. In this example, the task is sentiment analysis with keyword search as labeling functions. Blue denotes a negative sentiment and red a positive sentiment.
3 Model Description
In this section the models are introduced. The following example motivates the idea. Consider the sentence "The movie was fascinating, even though the graphics were poor, maybe due to a low budget.", the task sentiment analysis, and labeling functions given by the keywords "fascinating" and "poor". Furthermore, "fascinating" is associated with the class POS, and "poor" with the class NEG. We aim to learn a neural network which translates the complex object, text and a possible labeling function match, to a density, in the current example $p(x \mid \lambda_{\text{fascinating}})$ and $p(x \mid \lambda_{\text{poor}})$. We combine this information using basic probability calculus to make a classification prediction.
Multiple models are introduced. The standard model WeaNF-S naively learns to represent each labeling function as a multivariate normal distribution.
In order to make use of unlabeled data, i.e. data where no labeling function matches, we iteratively apply the standard model (WeaNF-I).
Based on the observation that labeling functions overlap, we derive WeaNF-N, modeling the negative space, i.e. the space where the labeling function does not match, and the mixed model, WeaNF-M, using a common space for single labeling functions and the intersection of these.
Furthermore, multiple aggregation schemes are used to combine the learned labeling function densities. See table 1 for an overview.
Before we dive into details, we introduce some notation. From the set of all possible inputs $X$, e.g. texts, we denote an input sample by $x$ and its corresponding vector representation by $\mathbf{x}$. The set of labeling functions is $\Lambda$ and the set of classes is $Y$. Each labeling function $\lambda \in \Lambda$ maps the input to a specific class or abstains from labeling. In some of our models, we also associate an embedding with each labeling function, which we denote by $e_\lambda$. The set of labeling functions corresponding to label $y$ is $\Lambda_y$.
WeaNF-S/I. The goal of the standard model is to learn a distribution $p(x \mid \lambda)$ for each labeling function $\lambda$. Similarly to Atanov et al. (2020) in semi-supervised learning, we use a randomly initialized embedding $e_\lambda$ to create a representation for each labeling function in the input space. We concatenate input and labeling function vector and provide it as input to the normalizing flow, thus learning $p(\mathbf{x} \oplus e_\lambda)$, where $\oplus$ describes the concatenation operation. A standard RealNVP (Dinh et al., 2017), as described in section 2, is used. See appendix B.1 for implementational details. In order to use the learned probabilities to perform label prediction, an aggregation scheme is needed. For the sake of simplicity, the model predicts the label corresponding to the labeling function with the highest likelihood, i.e. the class of $\arg\max_{\lambda \in \Lambda} p(x \mid \lambda)$.
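The maximum-likelihood aggregation step above can be sketched as follows; the log-density values and the labeling-function-to-class mapping are made up for illustration:

```python
import numpy as np

def predict_max(log_densities, lf_to_class):
    """log_densities: (n_samples, n_lfs) of log p(x | lf);
    lf_to_class: (n_lfs,) class id of each labeling function."""
    best_lf = np.argmax(log_densities, axis=1)  # most likely labeling function
    return lf_to_class[best_lf]                 # map it to its class

lf_to_class = np.array([1, 0, 1])               # e.g. POS, NEG, POS
log_densities = np.array([[-3.2, -1.1, -4.0],
                          [-0.7, -2.5, -1.9]])
preds = predict_max(log_densities, lf_to_class)
```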
Additionally, to make use of the unlabeled data, i.e. the data points where no labeling function matches, an iterative version WeaNF-I is tested. For this, we use an EM-like (Dempster et al., 1977) iterative scheme where the predictions of the model trained in the previous iteration are used as labels for the unlabeled data. The corresponding pseudo-code is found in algorithm 1.
Negative Model. In typical classification scenarios it is enough to learn $p(x \mid y)$ to compute a posterior by applying Bayes' formula, resulting in

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y' \in Y} p(x \mid y')\, p(y')},$$

where the class prior $p(y)$ is typically approximated on the training data or passed as a parameter. This is not possible in the current setting, as often two labeling functions match simultaneously. In order to learn $p(\lambda \mid x)$, we explore a novel variant that additionally learns $p(x \mid \neg\lambda)$. The learning process is similar to that for $p(x \mid \lambda)$, so a second embedding $e_{\neg\lambda}$ is introduced to represent $\neg\lambda$. We optimize $p(x \mid \lambda)$ and $p(x \mid \neg\lambda)$ simultaneously. In each batch, positive sample pairs $(x, \lambda)$, where $\lambda$ matches $x$, and negative pairs $(x', \lambda)$, sampled such that $\lambda$ does not match $x'$, are used to train the network. The number of negative samples per positive sample is an additional hyperparameter. Now Bayes' formula can be used analogously to obtain

$$p(\lambda \mid x) = \frac{p(x \mid \lambda)\, p(\lambda)}{p(x \mid \lambda)\, p(\lambda) + p(x \mid \neg\lambda)\, p(\neg\lambda)}.$$
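The resulting per-labeling-function posterior reduces to a small function; the prior $p(\lambda)$ would in practice be estimated, e.g. from labeling function coverage, and the values below are purely illustrative:

```python
def lf_posterior(p_x_given_lf, p_x_given_not_lf, p_lf):
    """Bayes' formula for a single labeling function:
    p(lf | x) from the positive density, the negative density and a prior."""
    num = p_x_given_lf * p_lf
    denom = num + p_x_given_not_lf * (1.0 - p_lf)
    return num / denom

# 0.08 * 0.2 / (0.08 * 0.2 + 0.02 * 0.8) = 0.016 / 0.032 = 0.5
post = lf_posterior(p_x_given_lf=0.08, p_x_given_not_lf=0.02, p_lf=0.2)
```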
The access to the posterior probability $p(\lambda \mid x)$ provides additional opportunities to model $p(y \mid x)$. After initial experimentation we settled on two options: a simple addition of probabilities neglecting the intersection probability, $p(y \mid x) \approx \sum_{\lambda \in \Lambda_y} p(\lambda \mid x)$, which we call Union, and the NoisyOr formula, $p(y \mid x) \approx 1 - \prod_{\lambda \in \Lambda_y} \big(1 - p(\lambda \mid x)\big)$, which has previously been shown to be effective in weakly supervised learning (Keith et al., 2017).
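Both aggregation schemes reduce to one-liners over the posteriors of a class' labeling functions (values below are illustrative):

```python
import numpy as np

def union(posteriors):
    """Sum of p(lf | x) over a class' labeling functions (ignores overlap)."""
    return float(np.sum(posteriors))

def noisy_or(posteriors):
    """NoisyOr: 1 - prod(1 - p(lf | x))."""
    return float(1.0 - np.prod(1.0 - np.asarray(posteriors)))

p = [0.5, 0.2]
# union(p) = 0.7; noisy_or(p) = 1 - 0.5 * 0.8 = 0.6
```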
Table 2: Dataset statistics.

| Dataset | #Classes | #Train / #Test samples | #LFs | Coverage (%) | Class Balance |
|---|---|---|---|---|---|
| IMDb | 2 | 39741 / 4993 | 20 | 0.60 | 1:1 |
| Spouse | 2 | 8530 / 1187 | 9 | 0.30 | 1:5 |
| YouTube | 2 | 1440 / 229 | 10 | 1.66 | 1:1 |
| SMS | 2 | 4208 / 494 | 73 | 0.51 | 1:6 |
| Trec | 6 | 4903 / 500 | 68 | 1.73 | 1:13:14:14:9:10 |
Mixed Model. It was already mentioned that two or more labeling functions commonly match simultaneously. While WeaNF-N provides access to a posterior distribution which allows modeling these interactions, the goal of the mixed model WeaNF-M is to model these intersections explicitly, already in the density of the normalizing flow. More specifically, we aim to learn $p(x \mid \lambda_{i_1}, \dots, \lambda_{i_k})$ for arbitrary index families $\{i_1, \dots, i_k\}$. Once again, the embedding space is used to achieve this goal. For a given sample and a family of matching labeling functions, we uniformly sample weights from the simplex of all possible convex combinations and obtain a mixed embedding. Afterwards we concatenate this weighted sum of the labeling function embeddings with the input and learn the resulting density. Now that the density is able to access the intersections of labeling functions, we derive a new direct aggregation scheme. By $S(\Lambda_y)$ we denote the simplex generated by the set of boundary points $\{e_\lambda \mid \lambda \in \Lambda_y\}$. It is important to think about this simplex, as it theoretically describes the input space where the model learns the density related to class $y$. We use the naive but efficient variant which just computes the center of the simplex.
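The simplex-center aggregation can be sketched as follows; `flow_log_prob` stands in for the trained normalizing flow and is replaced here by a toy Gaussian log-density:

```python
import numpy as np

def simplex_center_score(x_vec, class_lf_embeddings, flow_log_prob):
    """Score a class by evaluating the flow density at the input concatenated
    with the center (uniform mean) of the class' labeling function embeddings."""
    center = class_lf_embeddings.mean(axis=0)      # center of the simplex
    return flow_log_prob(np.concatenate([x_vec, center]))

emb = np.array([[1.0, 0.0], [0.0, 1.0]])           # two LF embeddings of one class
score = simplex_center_score(np.zeros(3), emb,
                             flow_log_prob=lambda v: -0.5 * np.sum(v**2))
# center = [0.5, 0.5], so score = -0.5 * (0.25 + 0.25) = -0.25
```

The class with the highest such score is predicted.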
Implementation. In practice, the sampling of data points has to be handled on multiple occasions. Empirically, and during the inspection of related implementations, e.g. the GitHub repository accompanying Atanov et al. (2020), we found that it is beneficial if every labeling function is seen equally often during training. This helps prevent a density biased towards specific labeling functions. When training WeaNF-N, the negative space is much larger than the actual space, so an additional hyperparameter controlling the amount of negative samples is needed. WeaNF-M aims to model intersecting probabilities directly. Most intersections occur too rarely to model a reasonable density. Thus we decided to only take co-occurrences into account which occur more often than a certain threshold. See appendix A.3 for an overview of the correlations in the used datasets.
| Model | IMDb | Spouse | YouTube | SMS | Trec |
|---|---|---|---|---|---|
| MV + MLP | 73.20 | 29.96 | 92.58 | 92.41 | 53.27 |
| DP + MLP | 67.79 | 57.05 | 88.79 | 84.40 | 43.00 |
4 Experiments

In order to analyze the proposed models, experiments on multiple standard weakly supervised classification problems are performed. In the following, we introduce datasets, baselines and training details.
4.1 Datasets

Within our experiments, we use five classification tasks. Table 2 gives an overview of some key statistics. Note that these might differ slightly compared to other papers due to the removal of duplicates. For a more detailed overview of our preprocessing steps, see appendix A.1.
The first dataset is IMDb (Internet Movie Database) and the accompanying sentiment analysis task (Maas et al., 2011). The goal is to classify whether a movie review describes a positive or a negative sentiment. We use positive and negative keywords as labeling functions. See appendix A.2 for a detailed description.
The second dataset is the Spouse dataset (Corney et al., 2016). The task is to classify whether a text contains a spouse relation, e.g. "Mary is married to Tom". Here, most of the samples belong to the no-relation class (cf. the class balance in table 2), so we use the macro-F1 score to evaluate performance. The third dataset, another binary classification problem, is the YouTube Spam dataset (Alberto et al., 2015). The model has to decide whether a YouTube comment is spam or not. For both the Spouse and the YouTube dataset, the labeling functions are provided by the Snorkel framework (Ratner et al., 2017).
The SMS Spam detection dataset (Almeida et al., 2011), which we abbreviate by SMS, also asks for spam, but in the private messaging domain. The dataset is quite skewed, so once again the macro-F1 score is used. Lastly, a multi-class dataset, namely TREC-6 (Li and Roth, 2002), is used. The task is to classify questions into six categories, namely Abbreviation, Entity, Description, Human, Location and Numeric. The labeling functions provided by Awasthi et al. (2020) are used for the SMS and the TREC dataset. We took the preprocessed versions of the data available within the Knodle weak supervision framework (Sedova et al., 2021).
4.2 Baselines

Three baselines are used. While there are many weak supervision systems, most use additional knowledge to improve performance. Examples are the class balance (Chatterjee et al., 2019), semi-supervised learning with very few labels (Awasthi et al., 2020; Karamanolakis et al., 2021) or multi-task learning (Ratner et al., 2018). To ensure a fair comparison, only baselines are used that solely take input data and labeling function matches into account. First, we use majority voting (MV), which predicts the label for which the most rules match. For instances where multiple classes have an equal vote or where no labeling function matches, a random vote is taken. Secondly, a multi-layer perceptron (MLP) is trained on top of the labels provided by majority vote. The third baseline uses the data programming (DP) paradigm. More explicitly, we use the model introduced by Ratner et al. (2018) implemented in the Snorkel (Ratner et al., 2017) framework. It performs a two-step approach to learning. Firstly, a generative model is trained to learn the most likely correlation between labeling functions and unknown true labels. Secondly, a discriminative model uses the labels of the generative model to train a final model. The same MLP as for the second baseline is used as the final model.
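The majority-vote baseline can be sketched as follows; `-1` denotes an abstaining labeling function, and the random vote mirrors the tie-breaking described above:

```python
import numpy as np

def majority_vote(weak_labels, n_classes, rng):
    """weak_labels: (n_samples, n_lfs) with -1 = abstain."""
    preds = []
    for row in weak_labels:
        counts = np.bincount(row[row >= 0], minlength=n_classes)
        winners = np.flatnonzero(counts == counts.max())
        if counts.max() == 0 or len(winners) > 1:  # all abstain, or tie
            preds.append(rng.integers(n_classes))  # random vote
        else:
            preds.append(winners[0])
    return np.array(preds)

votes = np.array([[1, 1, -1],      # clear majority for class 1
                  [-1, -1, -1]])   # no labeling function matched
preds = majority_vote(votes, n_classes=2, rng=np.random.default_rng(0))
```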
4.3 Training Details
Text input embeddings are created with the SentenceTransformers library (Reimers and Gurevych, 2019) using the bert-base-nli-mean-tokens model. They serve as input to the baselines and the normalizing flows. Hyperparameter search is performed via grid search over the learning rate, the weight decay, the number of epochs, and the label embedding dimension, which is set to a multiple of the number of classes. Additionally, the number of layers and, for WeaNF-N, the negative sampling value are tuned. The full setup ran on a single GPU of a DGX server.
5 Analysis

The analysis is divided into three parts. Firstly, a general discussion of the results is given. Secondly, an analysis of the densities predicted by WeaNF-N is shown, and lastly, a qualitative analysis is performed.
Table 4: Qualitative examples.

| LF | Sample | Dataset | LF label | Gold label | Prediction |
|---|---|---|---|---|---|
| won .* claim | …won … call … | SMS | Spam | Spam | Spam |
| .* I'll .* | sorry, I'll call later | SMS | No Spam | No Spam | No Spam |
| .* i .* | i just saw ron burgundy captaining a party boat so yeah | SMS | No Spam | No Spam | No Spam |
| (explain\|what) .* mean .* | What does the abbreviation SOS mean ? | Trec | DESCR | ABBR | DESCR |
| (explain\|what) .* mean .* | What are Quaaludes ? | Trec | DESCR | DESCR | DESCR |
| who.* | Who was the first man to … Pacific Ocean ? | Trec | HUMAN | HUMAN | HUMAN |
| check .* out .* | Check out this video on YouTube: | YouTube | Spam | Spam | Spam |
| #words < 5 | subscribe my | YouTube | Spam | Spam | No Spam |
| .* song .* | This Song will never get old | YouTube | No Spam | No Spam | No Spam |
| .* dreadful .* | …horrible performance …. annoying | IMDb | NEG | NEG | NEG |
| .* hilarious .* | …liked the movie…funny catchphrase…WORST…low grade… | IMDb | POS | NEG | POS |
| .* disappointing .* | don't understand stereotype … goofy .. | IMDb | NEG | NEG | POS |
| .* (husband\|wife) .* | …Jill.. she and her husband… | Spouse | Spouses | Spouses | Spouses |
| .* married .* | … asked me to marry him and I said yes! | Spouse | Spouses | No Spouses | Spouses |
| family word | Clearly excited, Coleen said: 'It's my eldest son Shane and Emma. | Spouse | No Spouses | No Spouses | No Spouses |
5.1 Overall Findings
Table 3 presents the main evaluation. The horizontal line separates the baselines from our models. For WeaNF-N and WeaNF-M, no iterative schemes were trained; this enables a direct comparison to the standard model WeaNF-S.
Interestingly, the combination of Snorkel and MLPs often does not perform competitively. In the IMDb dataset there is barely any correlation between labeling functions, complicating Snorkel's approach. The large number of labeling functions (e.g. in Trec and SMS) could also complicate correlation-based approaches. Appendix A.3 shows correlation graphs.
As indicated by the bold numbers, WeaNF-I is the best performing model. Only on the YouTube dataset the iterative scheme could not improve the results. Related to this observation, Ren et al. (2020) achieve promising results using iterative discriminative modeling for semi-supervised weak supervision.
WeaNF-N outperforms the standard model on three out of five datasets. We observe that these are the datasets with a large number of labeling functions. Possibly, this biases the model towards a high value of $p(x \mid \neg\lambda)$, which confuses the prediction.
The simplex aggregation scheme only outperforms the maximum aggregation on two out of five datasets. We infer that the probability density over the labeling function input space is not smooth enough. Ideally, the simplex method should always be highly confident in the prediction of a labeling function if it is confident on the non-mixed embedding, which is what Max does.
| Dataset | LF | Coverage (%) | Precision | Recall |
|---|---|---|---|---|
| Spouse | family word | 9.0 | 16.53 | 35.96 |
| SMS | I .* miss | 0.6 | 0 | 0 |
| Trec | what is .* name | 2.2 | 2.26 | 100 |
5.2 Density Analysis
We divide the analysis into a global and a local, i.e. per-labeling-function, analysis. Table 5 provides some global statistics; tables 6 and 7 subsequently show statistics related to the best and worst performing labeling function estimations. In the local analysis, a labeling function is predicted if its posterior $p(\lambda \mid x)$ exceeds 0.5. The WeaNF-N model is used because it is the only model with direct access to $p(\lambda \mid x)$.
It is important to mention that in the local analysis, a perfect prediction of the matching labeling function is not desired, as this would mean that there is no generalization. Thus, a low precision might be necessary for generalization, and the recall indicates how much of the original semantic or syntactic meaning of a labeling function is retained.
Interestingly, while the overall performance of WeaNF-N is competitive on the IMDb and the Spouse datasets, it fails to predict the correct labeling function. One explanation might be that these are the datasets with substantially longer texts, which might be complicated to model for normalizing flows. In table 7, the worst performing approximations of labeling function matches typically seem to be due to low coverage. An exception is the Spouse labeling function "family word".
5.3 Qualitative Analysis
In table 4 a number of examples are shown. We manually inspected samples with a very high or low density value. Note that the densities $p(x \mid \lambda)$ are functions taking arbitrary non-negative values which only have to satisfy $\int p(x \mid \lambda)\, dx = 1$.
We observed the phenomenon that either the same labeling functions take the highest density values, or that a single sample often has a high likelihood for multiple labeling functions. In table 4 one can find examples where the learned flows were able to generalize from the original labeling functions. For example, for the IMDb dataset, the model detects the meaning "funny" even though the exact keyword is "hilarious".
6 Conclusion

This work explores the novel use of normalizing flows for weak supervision. The approach is divided into two logical steps. In the first step, normalizing flows are employed to learn a probability distribution over the input space related to a labeling function. Secondly, principles from basic probability calculus are used to aggregate the learned densities and make them usable for classification tasks. Motivated by aspects of weakly supervised learning, such as labeling function overlap or coverage, multiple models are derived, each of which uses the information present in the latent space differently. We show competitive results on five weakly supervised classification tasks. Our analysis shows that the flow-based representations of labeling functions successfully generalize to samples otherwise not covered by labeling functions.
This research was funded by the WWTF through the project ”Knowledge-infused Deep Learning for Natural Language Processing” (WWTF Vienna Research Group VRG19-008), and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - RO 5127/2-1.
References

- Abhishek et al. (2021). SPEAR: Semi-supervised Data Programming in Python.
- Alberto et al. (2015). TubeSpam: Comment Spam Filtering on YouTube. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 138–143.
- Alfonseca et al. (2012). Pattern Learning for Relation Extraction with a Hierarchical Topic Model. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Jeju Island, Korea, pp. 54–59.
- Almeida et al. (2011). Contributions to the study of SMS spam filtering: new collection and results. In DocEng '11.
- Atanov et al. (2020). Semi-Conditional Normalizing Flows for Semi-Supervised Learning.
- Awasthi et al. (2020). Learning from Rules Generalizing Labeled Exemplars.
- Bach et al. (2017). Learning the Structure of Generative Models without Labeled Data. CoRR abs/1703.0.
- Brown et al. (2020). Language Models are Few-Shot Learners.
- Chatterjee et al. (2019). Data Programming using Continuous and Quality-Guided Labeling Functions. CoRR abs/1911.0.
- Corney et al. (2016). What do a Million News Articles Look like? In NewsIR@ECIR.
- Craven and Kumlien (1999). Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pp. 77–86.
- Dempster et al. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1), pp. 1–38.
- Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- Dinh et al. (2017). Density estimation using Real NVP.
- Goodfellow et al. (2014). Generative Adversarial Networks.
- Izmailov et al. (2020). Semi-Supervised Learning with Normalizing Flows.
- Karamanolakis et al. (2021). Self-Training with Weak Supervision.
- Keith et al. (2017). Identifying civilians killed by police with distantly supervised entity-event extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1547–1557.
- Kingma and Welling (2014). Auto-Encoding Variational Bayes.
- Li and Roth (2002). Learning Question Classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.
- Lu et al. (2021). Implicit Normalizing Flows. In International Conference on Learning Representations.
- Maas et al. (2011). Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150.
- Mintz et al. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pp. 1003–1011.
- Papamakarios et al. (2021). Normalizing Flows for Probabilistic Modeling and Inference.
- Ratner et al. (2017). Snorkel: Rapid Training Data Creation with Weak Supervision. CoRR abs/1711.1.
- Ratner et al. (2018). Training Complex Models with Multi-Task Weak Supervision.
- Reimers and Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
- Ren et al. (2020). Denoising Multi-Source Weak Supervision for Neural Text Classification. In Findings of the Association for Computational Linguistics: EMNLP 2020.
- Rezende and Mohamed (2016). Variational Inference with Normalizing Flows.
- Sedova et al. (2021). Knodle: Modular Weakly Supervised Learning with PyTorch.
- Takamatsu et al. (2012). Reducing Wrong Labels in Distant Supervision for Relation Extraction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp. 721–729.
- Tran et al. (2019). Discrete Flows: Invertible Generative Models of Discrete Data.
- Varma et al. (2019). Learning Dependency Structures for Weak Supervision Models.
- Zhang et al. (2021). WRENCH: A Comprehensive Benchmark for Weak Supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Ziegler and Rush (2019). Latent Normalizing Flows for Discrete Sequences.
- Zoph et al. (2016). Transfer Learning for Low-Resource Neural Machine Translation.
Appendix A Additional Data Description
A.1 Preprocessing

A few steps were performed to create a unified data format. The crucial difference to other papers is that we removed duplicated samples. There were two cases: either there were very few duplicates, or the duplication occurred because of the programmatic data generation, thus not resembling the real data generating process. Most notably, in the Spouse dataset a large share of all data points are duplicates. Furthermore, we only used rules which occurred more often than a certain threshold, as it is impossible to learn densities from only a handful of examples. In order to have unbiased baselines, we ran the baseline experiments on the full set of rules and the reduced set of rules and took the best performing number.
A.2 IMDb rules
The labeling functions for the IMDb dataset are defined by keywords. We manually chose the keywords and defined them in such a way that their meanings have rather little semantic overlap. The keywords are shown in table 8.
A.3 Labeling Function Correlations
In order to use labeling functions for weakly supervised learning, it is important to know the correlation of labeling functions to i) derive methods to combine them and ii) help understand phenomena of the model predictions. Thus we decided to add correlation plots; more specifically, we use the Pearson correlation coefficient.
Appendix B Additional Implementation Details
B.1 Architecture

As mentioned in section 3, the backbone of our flow is the RealNVP architecture, which we introduced in section 2. Sticking to the notation of equation 2, the network layers used to approximate the functions $s$ and $t$ are shown below.
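A hedged sketch of such $s$ and $t$ networks as small ReLU MLPs; the layer sizes and the tanh clamp on the scale output are illustrative choices, not necessarily those of the released implementation:

```python
import numpy as np

class MLP:
    """Minimal two-layer perceptron used as a stand-in for s or t."""
    def __init__(self, d_in, d_hidden, d_out, rng):
        self.w1 = rng.normal(scale=0.1, size=(d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.normal(scale=0.1, size=(d_hidden, d_out))
        self.b2 = np.zeros(d_out)

    def __call__(self, x):
        h = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        return h @ self.w2 + self.b2

rng = np.random.default_rng(0)
s_net = MLP(3, 16, 3, rng)            # "scale" network
t_net = MLP(3, 16, 3, rng)            # "translation" network
scale = np.tanh(s_net(np.ones(3)))    # tanh keeps exp(scale) numerically stable
shift = t_net(np.ones(3))
```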
Hyperparameters are the depth, i.e. the number of stacked layers, and the hidden dimension.
B.2 WeaNF-M Sampling
For the mixed model WeaNF-M, the sampling process becomes rather involved. Next, the code to produce the convex combination of labeling function embeddings is sketched. The input tensor takes values in $\{0, 1\}$ and has shape $(b, |\Lambda|)$, where $b$ is the batch size and $|\Lambda|$ the number of labeling functions. Note that some mass is put on every labeling function. We realized that this bias improves performance.
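A sketch of this sampling step, assuming Dirichlet weights over the matched labeling functions plus a small uniform mass `eps` on all of them; the helper name and the `eps` value are illustrative:

```python
import numpy as np

def sample_mix_weights(matches, rng, eps=0.05):
    """matches: (batch, n_lfs) binary match matrix; returns convex weights.
    Gamma(1) draws restricted to the matched LFs yield a uniform Dirichlet
    sample over the simplex of matching labeling functions."""
    raw = rng.gamma(1.0, size=matches.shape) * matches
    raw = raw / raw.sum(axis=1, keepdims=True)
    n_lfs = matches.shape[1]
    return (1.0 - eps) * raw + eps / n_lfs      # small mass on every LF

matches = np.array([[1, 0, 1], [0, 1, 0]], dtype=float)
w = sample_mix_weights(matches, np.random.default_rng(0))
# each row of w sums to 1 and is strictly positive everywhere
```

The resulting weights are then used to mix the labeling function embeddings before concatenation with the input.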