1 Introduction
Currently, an important portion of research in natural language processing is devoted to reducing or eliminating the need for large labeled datasets. Recent examples include language model fine-tuning (Devlin et al., 2019; Zoph et al., 2016) or few-shot learning (Brown et al., 2020). Another common approach is weakly supervised learning. The idea is to make use of human intuitions or already acquired human knowledge to create weak labels. Examples of such sources are keyword lists, regular expressions, heuristics, or independently existing curated data sources, e.g. a movie database if the task is concerned with TV shows. While the resulting labels are noisy, they provide a quick and easy way to create large labeled datasets. In the following, we use the term labeling functions, introduced in Ratner et al. (2017), to describe functions which create weak labels based on the notions above. Throughout the weak supervision literature, generative modeling ideas are found (Takamatsu et al., 2012; Alfonseca et al., 2012; Ratner et al., 2017). Probably the most popular example of a system using generative modeling in weak supervision is the data programming paradigm of Snorkel
(Ratner et al., 2017). It uses correlations within labeling functions to learn a graph capturing dependencies between labeling functions and true labels. However, such an approach does not directly model biases of weak supervision reflected in the feature space. In order to directly model the relevant aspects in the feature space of a weakly supervised dataset, we investigate the use of density estimation using normalizing flows. More specifically, in this work, we model probability distributions over the input space induced by labeling functions, and combine those distributions for better weakly supervised prediction. We propose and examine four novel models for weakly supervised learning based on normalizing flows (WeaNF*): Firstly, we introduce a standard model WeaNFS, where each labeling function is represented by a multivariate normal distribution, and its iterative variant WeaNFI. Furthermore, WeaNFN additionally learns the negative space, i.e. a density for the space where the labeling function does not match, and a mixed model, WeaNFM, represents correlations of sets of labeling functions within the normalizing flow. As a consequence, the classification task is a two-step procedure. The first step estimates the densities, and the second step aggregates them to model label prediction. Multiple alternatives are discussed and analyzed. We benchmark our approach on several commonly used weak supervision datasets. The results highlight that our proposed generative approach is competitive with standard weak supervision methods. Additionally, the results show that smart aggregation schemes prove beneficial.
In summary, our contributions are i) the development of multiple models based on normalizing flows for weak supervision, combined with density aggregation schemes, ii) a quantitative and qualitative analysis highlighting opportunities and problems, and iii) an implementation of the method (https://github.com/AndSt/wea_nf). To the best of our knowledge, we are the first to use normalizing flows to generatively model labeling functions.
2 Background and Related Work
We split this analysis into a weak supervision section and a normalizing flow section, as we build upon these two areas.
Weak supervision.
A fundamental problem in machine learning is the need for massive amounts of manually labeled data. Weak supervision provides a way to counter this problem. The idea is to use human knowledge to produce noisy, so-called weak labels. Typically, keywords, heuristics or knowledge from external data sources are used. The latter is called distant supervision (Craven and Kumlien, 1999; Mintz et al., 2009). In Ratner et al. (2017), data programming is introduced, a paradigm to create and work with weak supervision sources programmatically. The goal is to learn the relation between weak labels and the true unknown labels (Ratner et al., 2017; Varma et al., 2019; Bach et al., 2017; Chatterjee et al., 2019). In Ren et al. (2020) the authors use iterative modeling for weak supervision. Software packages such as SPEAR (Abhishek et al., 2021), WRENCH (Zhang et al., 2021) and Knodle (Sedova et al., 2021) allow a modular use and comparison of weak supervision methods. A recent trend is to use additional information to support the learning process. Chatterjee et al. (2019) allow labeling functions to assign a score to the weak label. In Ratner et al. (2018) the human-provided class balance is used. Additionally, Awasthi et al. (2020); Karamanolakis et al. (2021) use semi-supervised methods for weak supervision, where the idea is to use a small amount of labeled data to steer the learning process.
Normalizing flows.
While the concept of normalizing flows is much older, Rezende and Mohamed (2016)
introduced the concept to deep learning. In comparison to other generative neural networks, such as Generative Adversarial networks
(Goodfellow et al., 2014) or Variational Autoencoders (Kingma and Welling, 2014), normalizing flows provide a tractable way to model high-dimensional distributions. So far, normalizing flows have received rather little attention in the natural language processing community. Still, Tran et al. (2019) and Ziegler and Rush (2019) applied them successfully to language modeling. An excellent overview of recent normalizing flow research is given in Papamakarios et al. (2021). Normalizing flows are based on the change of variables formula, which uses a bijective function f to transform a base distribution p_Z into a target distribution p_X:

p_X(x) = p_Z(f(x)) |det Df(x)|,

where p_Z is typically a simple distribution, e.g. a multivariate normal distribution, and p_X is a complicated data generating distribution. Typically, a neural network learns the function f by minimizing the KL-divergence between the data generating distribution and the simple base distribution. As described in Papamakarios et al. (2021), this is achieved by minimizing the negative log-likelihood

L = - Σ_i ( log p_Z(f(x_i)) + log |det Df(x_i)| ).
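As a concrete illustration of this objective (our own sketch, not the paper's implementation), consider a one-dimensional affine bijection f(x) = (x − μ)/σ with a standard normal base distribution; the change of variables then recovers exactly the density of N(μ, σ²), and the negative log-likelihood can be evaluated in closed form. All function names here are our own.

```python
import numpy as np

def base_log_prob(z):
    # log density of the standard normal base distribution p_Z
    return -0.5 * (z ** 2 + np.log(2 * np.pi))

def flow_log_prob(x, mu, sigma):
    # change of variables: log p_X(x) = log p_Z(f(x)) + log |det Df(x)|
    z = (x - mu) / sigma       # f maps data to the base space
    log_det = -np.log(sigma)   # df/dx = 1/sigma, so log|det Df| = -log sigma
    return base_log_prob(z) + log_det

def nll(xs, mu, sigma):
    # training objective: average negative log-likelihood of the data
    return -np.mean(flow_log_prob(xs, mu, sigma))
```

Minimizing `nll` over μ and σ fits the Gaussian; a learned flow replaces the affine map by a deep invertible network.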
The tricky part is to design efficient architectures which are invertible and provide an easy and efficient way to compute the determinant. The composition of bijective functions is again bijective, which enables deep architectures f = f_K ∘ … ∘ f_1. Recent research focuses on the creation of more expressive transformation modules (Lu et al., 2021). In this work, we make use of an early, but well established model, called RealNVP (Dinh et al., 2017). In each layer, the input x ∈ R^D is split in half and transformed according to

(1)  y_{1:d} = x_{1:d}
(2)  y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d})

where ⊙ is the pointwise multiplication and s and t are neural networks. Using this formulation to realize a layer f_i, it is easy and efficient to compute the inverse and the determinant.
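A minimal sketch of one such coupling layer (with s and t as fixed toy functions rather than trained networks — an illustration of equations (1) and (2), not the paper's implementation):

```python
import numpy as np

def s(x1): return np.tanh(x1)   # toy stand-in for the scale network
def t(x1): return 0.5 * x1      # toy stand-in for the translation network

def coupling_forward(x):
    # split the input in half; the first part passes through unchanged (eq. 1)
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    y2 = x2 * np.exp(s(x1)) + t(x1)  # eq. (2): pointwise scale and shift
    # the Jacobian is triangular, so log|det| is just the sum of s(x1)
    log_det = np.sum(s(x1))
    return np.concatenate([x1, y2]), log_det

def coupling_inverse(y):
    # the inverse is equally cheap because y1 = x1 is available directly
    d = len(y) // 2
    y1, y2 = y[:d], y[d:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))
    return np.concatenate([y1, x2])
```

Stacking such layers (with the split alternated between halves) yields a deep invertible flow whose log-determinant is the sum over layers.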
Figure 1: Each axis represents the learned density related to a labeling function. In this example, the task is sentiment analysis and keyword search provides the labeling functions. Blue denotes a negative sentiment and red a positive sentiment.
3 Model Description
In this section the models are introduced. The following example motivates the idea. Consider the sentence x = "The movie was fascinating, even though the graphics were poor, maybe due to a low budget.", the task sentiment analysis, and labeling functions given by the keywords "fascinating" and "poor". Furthermore, "fascinating" is associated with the class POS, and "poor" with the class NEG. We aim to learn a neural network which translates the complex object, i.e. the text together with a possible labeling function match, to a density, in the current example p(x | λ_fascinating) and p(x | λ_poor). We combine this information using basic probability calculus to make a classification prediction.
Multiple models are introduced. The standard model WeaNFS naively learns to represent each labeling function as a multivariate normal distribution.
In order to make use of unlabeled data, i.e. data where no labeling function matches, we iteratively apply the standard model (WeaNFI).
Based on the observation that labeling functions overlap, we derive WeaNFN, which models the negative space, i.e. the space where the labeling function does not match, and the mixed model, WeaNFM, which uses a common space for single labeling functions and their intersections.
Furthermore, multiple aggregation schemes are used to combine the learned labeling function densities. See table 1 for an overview.
Before we dive into details, we introduce some notation.
From the set of all possible inputs X, e.g. texts, we denote an input sample by s and its corresponding vector representation by x ∈ R^D. The set of labeling functions is L = {λ_1, …, λ_m} and the classes are Y = {y_1, …, y_c}. Each labeling function maps the input to a specific class or abstains from labeling. In some of our models, we also associate an embedding with each labeling function, which we denote by e_λ. The set of labeling functions corresponding to label y is L_y.

Table 1: Overview of the aggregation schemes (Maximum, Union, NoisyOr, Simplex) available for each model (WeaNFS/I, WeaNFN, WeaNFM).
WeaNFS/I. The goal of the standard model is to learn a density p(x | λ) for each labeling function λ ∈ L. Similarly to Atanov et al. (2020) in semi-supervised learning, we use a randomly initialized embedding e_λ to create a representation for each labeling function in the input space. We concatenate input and labeling function vector and provide it as input to the normalizing flow, thus learning p([x; e_λ]), where [·; ·] describes the concatenation operation. A standard RealNVP (Dinh et al., 2017), as described in section 2, is used. See appendix B.1 for implementational details. In order to use the learned probabilities to perform label prediction, an aggregation scheme is needed. For the sake of simplicity, the model predicts the label corresponding to the labeling function with the highest likelihood, ŷ = argmax_{y ∈ Y} max_{λ ∈ L_y} p(x | λ). Additionally, to make use of the unlabeled data, i.e. the data points where no labeling function matches, an iterative version WeaNFI is tested. For this, we use an EM-like (Dempster et al., 1977) iterative scheme where the predictions of the model trained in the previous iteration are used as labels for the unlabeled data. The corresponding pseudocode is found in algorithm 1.
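The maximum-likelihood aggregation can be sketched as follows; `densities` is a hypothetical stand-in for the flow outputs p(x | λ) on one sample, and the label map is illustrative:

```python
import numpy as np

# hypothetical densities p(x | lambda_j) for one sample, one per labeling function
densities = np.array([0.02, 0.35, 0.11])
# class assigned by each labeling function (e.g. 0 = NEG, 1 = POS)
lf_to_label = np.array([0, 1, 1])

def predict_max(densities, lf_to_label):
    # WeaNF_S aggregation: predict the label of the most likely labeling function
    return lf_to_label[np.argmax(densities)]
```

In the iterative variant, this prediction would be re-applied to uncovered samples between training rounds.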
Negative Model. In typical classification scenarios it is enough to learn class-conditional densities p(x | y); a posterior is then computed by applying Bayes' formula, resulting in

(3)  p(y | x) = p(x | y) p(y) / Σ_{y' ∈ Y} p(x | y') p(y'),

where the class prior p(y) is typically approximated on the training data or passed as a parameter. This is not possible in the current setting, as often two labeling functions match simultaneously. In order to learn p(λ | x), we explore a novel variant that additionally learns the density of the negative space, p(x | ¬λ). The learning process is similar to the one of WeaNFS, so a second embedding e_{¬λ} is introduced to represent ¬λ. We optimize p(x | λ) and p(x | ¬λ) simultaneously. In each batch, positive sample pairs (x, λ) where λ matches x, and negative pairs (x, λ'), sampled such that λ' does not match x, are used to train the network. The number of negative samples per positive sample is an additional hyperparameter. Now Bayes' formula can be used as in equation 3 to obtain

(4)  p(λ | x) = p(x | λ) p(λ) / (p(x | λ) p(λ) + p(x | ¬λ) p(¬λ)).
The access to the posterior probability p(λ | x) provides additional opportunities to model p(y | x). After initial experimentation we settled on two options: a simple addition of probabilities neglecting the intersection probability, equation 5, which we call Union, and the NoisyOr formula, equation 7, which has previously been shown to be effective in weakly supervised learning (Keith et al., 2017):

(5)  q(y | x) = Σ_{λ ∈ L_y} p(λ | x)
(6)  p_Union(y | x) = q(y | x) / Σ_{y' ∈ Y} q(y' | x)
(7)  p_NoisyOr(y | x) ∝ 1 − Π_{λ ∈ L_y} (1 − p(λ | x))
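Assuming per-labeling-function posteriors p(λ | x) are available (e.g. from WeaNFN via equation 4), the two aggregation schemes can be sketched as follows; this is our own reconstruction, and the final normalization over classes is an assumption:

```python
import numpy as np

def union(posteriors, lf_to_label, n_classes):
    # Union: sum the posteriors of all labeling functions per class,
    # then normalize over classes (intersections are neglected)
    scores = np.zeros(n_classes)
    for p, y in zip(posteriors, lf_to_label):
        scores[y] += p
    return scores / scores.sum()

def noisy_or(posteriors, lf_to_label, n_classes):
    # NoisyOr: a class fires if at least one of its labeling functions fires,
    # assuming independence of the individual labeling functions
    scores = np.ones(n_classes)
    for p, y in zip(posteriors, lf_to_label):
        scores[y] *= (1.0 - p)
    scores = 1.0 - scores
    return scores / scores.sum()
```

Both take a vector of posteriors, a map from labeling function index to class, and the number of classes, and return a normalized class distribution.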
Dataset  #Classes  #Train / #Test samples  #LF’s  Coverage(%)  Class Balance 

IMDb  2  39741 / 4993  20  0.60  1:1 
Spouse  2  8530 / 1187  9  0.30  1:5 
YouTube  2  1440 / 229  10  1.66  1:1 
SMS  2  4208 / 494  73  0.51  1:6 
Trec  6  4903 / 500  68  1.73  1:13:14:14:9:10 
Mixed Model. It was already mentioned that it is common that two or more labeling functions hit simultaneously. While WeaNFN provides access to a posterior distribution which allows modeling these interactions, the goal of the mixed model WeaNFM is to model these intersections explicitly, already in the density of the normalizing flow. More specifically, we aim to learn p(x | λ_{i_1}, …, λ_{i_k}) for arbitrary index families {i_1, …, i_k}. Once again, the embedding space is used to achieve this goal. For a given sample x and a family of matching labeling functions, we uniformly sample weights α_1, …, α_k from the simplex of all possible combinations and obtain e = Σ_j α_j e_{λ_{i_j}}. Afterwards we concatenate the weighted sum of the labeling function embeddings with the input and learn p([x; e]). Now that the density is able to access the intersections of labeling functions, we derive a new direct aggregation scheme. By Δ(S) we denote the simplex generated by the set of boundary points S. It is important to think about this simplex, as it theoretically describes the input space where the model learns the density related to class y. We use the naive but efficient variant which just computes the center of the simplex Δ({e_λ : λ ∈ L_y}):

(8)  p(y | x) ∝ p([x; (1/|L_y|) Σ_{λ ∈ L_y} e_λ])
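Assuming the trained flow can be queried for an arbitrary embedding, the simplex-center aggregation might look like the following sketch; `density_fn` is a hypothetical handle to the trained flow density p([x; e]):

```python
import numpy as np

def simplex_center_scores(x, lf_embeddings, lfs_per_class, density_fn):
    # For each class, average the embeddings of its labeling functions
    # (the center of the simplex they span) and query the flow once.
    scores = []
    for lf_ids in lfs_per_class:
        center = np.mean(lf_embeddings[lf_ids], axis=0)
        scores.append(density_fn(np.concatenate([x, center])))
    return np.array(scores)  # unnormalized per-class scores
```

The predicted label is the argmax of the returned scores; normalizing them yields a class distribution.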
Implementation. In practice, the sampling of data points has to be handled on multiple occasions. Empirically, and during the inspection of related implementations, e.g. the GitHub repository accompanying Atanov et al. (2020), we found that it is beneficial if every labeling function is seen equally often during training. This helps prevent the density from being biased towards specific labeling functions. When training WeaNFN, the negative space is much larger than the actual space, so an additional hyperparameter controlling the amount of negative samples is needed. WeaNFM aims to model intersection probabilities directly. Most intersections occur too rarely to model a reasonable density. Thus, we decided to only take co-occurrences into account which appear more often than a certain threshold. See appendix A.3 to get a feeling for the correlations in the used datasets.
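The balanced sampling described here (every labeling function seen equally often) can be sketched as follows; this is a minimal version under the assumption that every labeling function matches at least one sample, and the batch construction details are our own:

```python
import numpy as np

def balanced_lf_batch(matches, batch_size, rng):
    # matches: binary matrix of shape (n_samples, n_lfs), 1 where an LF fires.
    # Draw (sample, lf) pairs so that labeling functions are picked uniformly,
    # preventing densities biased towards frequently matching LFs.
    n_lfs = matches.shape[1]
    pairs = []
    for _ in range(batch_size):
        lf = int(rng.integers(n_lfs))                # uniform over LFs, not samples
        candidates = np.flatnonzero(matches[:, lf])  # samples where this LF fires
        pairs.append((int(rng.choice(candidates)), lf))
    return pairs
```

Sampling uniformly over labeling functions first, and only then over their matching samples, is what equalizes how often each labeling function is seen.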
IMDb  Spouse  YouTube  SMS  Trec  

MV  56.84  49.87  81.66  56.1  61.2 
MV + MLP  73.20  29.96  92.58  92.41  53.27 
DP + MLP  67.79  57.05  88.79  84.40  43.00 
WeaNFS  73.06  52.28  89.08  86.71  67.4 
WeaNFI  74.08  57.96  89.08  93.54  67.8 
WeaNFN (NoisyOr)  72.96  54.60  90.83  79.63  54.8 
WeaNFN (Union)  71.98  50.83  91.70  83.48  60.2 
WeaNFM (Max)  70.16  55.16  85.15  88.23  49.8 
WeaNFM (Simplex)  63.53  56.91  86.03  76.29  25.4 
4 Experiments
In order to analyze the proposed models, experiments on multiple standard weakly supervised classification problems are performed. In the following, we introduce datasets, baselines and training details.
4.1 Datasets
Within our experiments, we use five classification tasks. Table 2 gives an overview of some key statistics. Note that these might differ slightly compared to other papers due to the removal of duplicates. For a more detailed overview of our preprocessing steps, see appendix A.1.
The first dataset is IMDb (Internet Movie Database) and the accompanying sentiment analysis task (Maas et al., 2011). The goal is to classify whether a movie review describes a positive or a negative sentiment. We use ten positive and ten negative keywords as labeling functions; see appendix A.2 for a detailed description. The second dataset is the Spouse dataset (Corney et al., 2016). The task is to classify whether a text holds a spouse relation, e.g. "Mary is married to Tom". Here, the large majority of the samples belong to the no-relation class, so we use the macro F1 score to evaluate the performance. As the third dataset, another binary classification problem is given by the YouTube Spam dataset (Alberto et al., 2015). The model has to decide whether a YouTube comment is spam or not. For both the Spouse and the YouTube dataset, the labeling functions are provided by the Snorkel framework (Ratner et al., 2017).
The SMS Spam detection dataset (Almeida et al., 2011), which we abbreviate as SMS, also concerns spam detection, but in the private messaging domain. The dataset is quite skewed, so once again the macro F1 score is used. Lastly, a multiclass dataset, namely TREC-6 (Li and Roth, 2002), is used. The task is to classify questions into six categories, namely Abbreviation, Entity, Description, Human, Location and Numeric value. The labeling functions provided by Awasthi et al. (2020) are used for the SMS and the TREC dataset. We took the preprocessed versions of the data available within the Knodle weak supervision programming framework (Sedova et al., 2021).
4.2 Baselines
Three baselines are used. While there are many weak supervision systems, most use additional knowledge to improve performance. Examples are the class balance (Chatterjee et al., 2019), semi-supervised learning with very few labels (Awasthi et al., 2020; Karamanolakis et al., 2021) or multi-task learning (Ratner et al., 2018). To ensure a fair comparison, only baselines are used that solely take input data and labeling function matches into account. First, we use majority voting (MV), which predicts the label for which the most rules match. For instances where multiple classes have an equal vote or where no labeling function matches, a random vote is taken. Secondly, a multilayer perceptron (MLP) is trained on top of the labels provided by majority vote. The third baseline uses the data programming (DP) paradigm. More explicitly, we use the model introduced by Ratner et al. (2018), implemented in the Snorkel (Ratner et al., 2017) programming framework. It performs a two-step approach to learning. Firstly, a generative model is trained to learn the most likely correlation between labeling functions and unknown true labels. Secondly, a discriminative model uses the labels of the generative model to train a final model. The same MLP as for the second baseline is used as the final model.
4.3 Training Details
Text input embeddings are created with the SentenceTransformers library (Reimers and Gurevych, 2019) using the bert-base-nli-mean-tokens model. They serve as input to the baselines and the normalizing flows. Hyperparameter search is performed via grid search over the learning rate, the weight decay, the number of epochs, and the label embedding dimension (a multiple of the number of classes). Additionally, we search over the number of layers and the negative sampling value for WeaNFN. The full setup ran for several hours on a single GPU on a DGX server.
5 Analysis
The analysis is divided into three parts. Firstly, a general discussion of the results is given. Secondly, an analysis of the densities predicted by WeaNFN is shown and lastly, a qualitative analysis is performed.
Labeling Function  Example  Dataset  Label  Gold  Prediction  

won .* claim  …won … call …  SMS  Spam  Spam  Spam  
.* I’ll .*  sorry, I’ll call later  SMS  No Spam  No Spam  No Spam  
.* i .*  i just saw ron burgundy captaining a party boat so yeah  SMS  No Spam  No Spam  No Spam  
(explain|what) .* mean .*  What does the abbreviation SOS mean ?  Trec  DESCR  ABBR  DESCR  
(explain|what) .* mean .*  What are Quaaludes ?  Trec  DESCR  DESCR  DESCR  
who.*  Who was the first man to … Pacific Ocean ?  Trec  HUMAN  HUMAN  HUMAN  
check .* out .*  Check out this video on YouTube:  YouTube  Spam  Spam  Spam  
#words < 5  subscribe my  YouTube  Spam  Spam  No Spam  
.* song .*  This Song will never get old  YouTube  No Spam  No Spam  No Spam  
.* dreadful .*  …horrible performance …. annoying  IMDb  NEG  NEG  NEG  
.* hilarious .*  …liked the movie…funny catchphrase…WORST…low grade…  IMDb  POS  NEG  POS  
.* disappointing .*  don’t understand stereotype … goofy ..  IMDb  NEG  NEG  POS  
.* (husband|wife) .*  …Jill.. she and her husband…  Spouse  Spouses  Spouses  Spouses  
.* married .*  … asked me to marry him and I said yes!  Spouse  Spouses  No Spouses  Spouses  
family word  Clearly excited, Coleen said: ’It’s my eldest son Shane and Emma.  Spouse  No Spouses  No Spouses  No Spouses 
5.1 Overall Findings
Table 3 presents the main evaluation. The horizontal line separates the baselines from our models. For WeaNFN and WeaNFM, no iterative schemes were trained. This enables a direct comparison to the standard model WeaNFI.
Interestingly, the combination of Snorkel and an MLP often does not perform competitively. In the IMDb dataset there is barely any correlation between labeling functions, which complicates Snorkel's approach. The large number of labeling functions, e.g. for Trec and SMS, could also complicate correlation-based approaches. Appendix A.3 shows correlation graphs.
As indicated by the bold numbers, WeaNFI is the best performing model. Only on the YouTube dataset an iterative scheme could not improve the results. Related to this observation, in Ren et al. (2020) the authors achieve promising results using iterative discriminative modeling for semi-supervised weak supervision.
WeaNFN outperforms the standard model on three out of five datasets. We observe that these are the datasets with a large number of labeling functions. Possibly, this biases the model towards a high value of p(x | ¬λ), which confuses the prediction.
The simplex aggregation scheme only outperforms the maximum aggregation on two out of five datasets. We infer that the probability density over the labeling function input space is not smooth enough. Ideally, the simplex method should always have high confidence in the prediction of a labeling function if it is confident on the non-mixed embedding, which is what Max uses.
IMDb  Spouse  YouTube  SMS  Trec  

Acc  72.38  74.04  78.17  88.71  72.63 
5.93  5.1  38.95  23.3  13.65  
37.53  39.31  55.01  44.34  61.07  
10.25  9.02  45.61  30.55  22.31  
Cov  4.31  5.74  19.31  3.01  4.39 
Dataset  Labeling Fct.  Cov(%)  Prec  Recall 

IMDb  *boring*  5.8  13.12  26.87 
Spouse  family word  9.0  16.53  35.96 
YouTube  *song*  23.58  56.72  70.73 
SMS  won *call*  0.81  66.67  1.0 
Trec  how.*much  2.4  60.0  75.0 
Dataset  Labeling Fct.  Cov(%)  Prec  Recall 

IMDb  *imaginative*  0.42  0.77  52.38 
Spouse  spouse keyword  14.5  0  0 
YouTube  person entity  2.62  6.45  33.33 
SMS  I .* miss  0.6  0  0 
Trec  what is .* name  2.2  2.26  100 
5.2 Density Analysis
We divide the analysis into a global analysis and a local, i.e. per-labeling-function, analysis. Table 5 provides some global statistics; tables 6 and 7 subsequently show statistics related to the best and worst performing labeling function estimations. In the local analysis, a labeling function is predicted if its posterior p(λ | x) exceeds a fixed threshold. The WeaNFN model is used because it is the only model with direct access to p(λ | x).
It is important to mention that in the local analysis, a perfect prediction of the matching labeling function is not wanted, as this would mean that there is no generalization. Thus, a low precision might be necessary for generalization, and the recall indicates how much of the original semantic or syntactic meaning of a labeling function is retained.
Interestingly, while the overall performance of WeaNFN is competitive on the IMDb and the Spouse datasets, it fails to predict the correct labeling function. One explanation might be that these are the datasets with substantially longer texts, which might be complicated to model for normalizing flows. In table 7, the worst performing approximations of labeling function matches typically seem to be due to low coverage. An exception is the Spouse labeling function.
5.3 Qualitative Analysis
In table 4 a number of examples are shown. We manually inspected samples with very high or low density values. Note that the density values p(x | λ) can take arbitrary non-negative values; they only have to satisfy ∫ p(x | λ) dx = 1.
We observed the phenomenon that either the same labeling functions take the highest density values, or that a single sample often has a high likelihood for multiple labeling functions. In table 4, one can find examples where the learned flows were able to generalize from the original labeling functions. For example, for the IMDb dataset, the model detects the meaning "funny" even though the exact keyword is "hilarious".
6 Conclusion
This work explores the novel use of normalizing flows for weak supervision. The approach is divided into two logical steps. In the first step, normalizing flows are employed to learn a probability distribution over the input space related to a labeling function. Secondly, principles from basic probability calculus are used to aggregate the learned densities and make them usable for classification tasks. Motivated by aspects of weakly supervised learning, such as labeling function overlap or coverage, multiple models are derived each of which uses the information present in the latent space differently. We show competitive results on five weakly supervised classification tasks. Our analysis shows that the flowbased representations of labeling functions successfully generalize to samples otherwise not covered by labeling functions.
Acknowledgements
This research was funded by the WWTF through the project "Knowledge-infused Deep Learning for Natural Language Processing" (WWTF Vienna Research Group VRG19008), and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), RO 5127/21.
References
Abhishek et al. (2021). SPEAR: Semi-supervised Data Programming in Python. arXiv:2108.00373.
Alberto et al. (2015). TubeSpam: Comment Spam Filtering on YouTube. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 138–143.
Alfonseca et al. (2012). Pattern Learning for Relation Extraction with a Hierarchical Topic Model. In Proceedings of the 50th Annual Meeting of the ACL (Volume 2: Short Papers), pp. 54–59.
Almeida et al. (2011). Contributions to the study of SMS spam filtering: new collection and results. In DocEng '11.
Atanov et al. (2020). Semi-Conditional Normalizing Flows for Semi-Supervised Learning. arXiv:1905.00505.
Awasthi et al. (2020). Learning from Rules Generalizing Labeled Exemplars. arXiv:2004.06025.
Bach et al. (2017). Learning the Structure of Generative Models without Labeled Data. arXiv:1703.00854.
Brown et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165.
Chatterjee et al. (2019). Data Programming using Continuous and Quality-Guided Labeling Functions. arXiv:1911.09860.
Corney et al. (2016). What do a Million News Articles Look like? In NewsIR@ECIR.
Craven and Kumlien (1999). Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pp. 77–86.
Dempster et al. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1), pp. 1–38.
Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
Dinh et al. (2017). Density estimation using Real NVP. arXiv:1605.08803.
Goodfellow et al. (2014). Generative Adversarial Networks. arXiv:1406.2661.
Izmailov et al. (2020). Semi-Supervised Learning with Normalizing Flows. arXiv:1912.13025.
Karamanolakis et al. (2021). Self-Training with Weak Supervision. arXiv:2104.05514.
Keith et al. (2017). Identifying civilians killed by police with distantly supervised entity-event extraction. In Proceedings of EMNLP 2017, pp. 1547–1557.
Kingma and Welling (2014). Auto-Encoding Variational Bayes. arXiv:1312.6114.
Li and Roth (2002). Learning Question Classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.
Lu et al. (2021). Implicit Normalizing Flows. In International Conference on Learning Representations.
Maas et al. (2011). Learning Word Vectors for Sentiment Analysis. In Proceedings of ACL-HLT 2011, pp. 142–150.
Mintz et al. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP 2009, pp. 1003–1011.
Papamakarios et al. (2021). Normalizing Flows for Probabilistic Modeling and Inference. arXiv:1912.02762.
Ratner et al. (2017). Snorkel: Rapid Training Data Creation with Weak Supervision. arXiv:1711.10160.
Ratner et al. (2018). Training Complex Models with Multi-Task Weak Supervision. arXiv:1810.02840.
Reimers and Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of EMNLP 2019.
Ren et al. (2020). Denoising Multi-Source Weak Supervision for Neural Text Classification. In Findings of EMNLP 2020.
Rezende and Mohamed (2016). Variational Inference with Normalizing Flows. arXiv:1505.05770.
Sedova et al. (2021). Knodle: Modular Weakly Supervised Learning with PyTorch. arXiv:2104.11557.
Takamatsu et al. (2012). Reducing Wrong Labels in Distant Supervision for Relation Extraction. In Proceedings of the 50th Annual Meeting of the ACL (Volume 1: Long Papers), pp. 721–729.
Tran et al. (2019). Discrete Flows: Invertible Generative Models of Discrete Data. arXiv:1905.10347.
Varma et al. (2019). Learning Dependency Structures for Weak Supervision Models. arXiv:1903.05844.
Zhang et al. (2021). WRENCH: A Comprehensive Benchmark for Weak Supervision. In NeurIPS Datasets and Benchmarks Track.
Ziegler and Rush (2019). Latent Normalizing Flows for Discrete Sequences. arXiv:1901.10548.
Zoph et al. (2016). Transfer Learning for Low-Resource Neural Machine Translation. arXiv:1604.02201.
Appendix A Additional Data Description
a.1 Preprocessing
A few preprocessing steps were performed to create a unified data format. The crucial difference to other papers is that we removed duplicated samples. There were two cases: either there were very few duplicates, or the duplication occurred because of the programmatic data generation, thus not resembling the real data generating process. Most notably, in the Spouse dataset a substantial fraction of the data points are duplicates. Furthermore, we only used rules which occurred more often than a certain threshold, as it is impossible to learn densities from only a handful of examples. In order to have unbiased baselines, we ran the baseline experiments on both the full and the reduced set of rules and took the better performing number.
a.2 IMDb rules
The labeling functions for the IMDb dataset are defined by keywords. We manually chose the keywords such that their meanings have rather little semantic overlap. The keywords are shown in table 8.
positive  negative 

beautiful  poor 
pleasure  disappointing 
recommendation  senseless 
dazzling  secondrate 
fascinating  silly 
hilarious  boring 
surprising  tiresome 
interesting  uninteresting 
imaginative  dreadful 
original  outdated 
a.3 Labeling Function Correlations
In order to use labeling functions for weakly supervised learning, it is important to know the correlation between labeling functions to i) derive methods to combine them and ii) help understand phenomena of the model predictions. Thus, we decided to add correlation plots. More specifically, we use the Pearson correlation coefficient.
Appendix B Additional Implementational Details
b.1 Architecture
As mentioned in section 3, the backbone of our flow is the RealNVP architecture, which we introduced in section 2. Sticking to the notation of formula 2, the network layers approximating the functions s and t are shown below. Hyperparameters are the depth, i.e. the number of stacked layers, and the hidden dimension.
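The original listing is not preserved in this extraction; the following is a plausible sketch of such s/t networks (a two-layer MLP in NumPy — the actual layer sizes and activations used in the paper are unknown and assumed here):

```python
import numpy as np

class CouplingNet:
    """Small MLP approximating s (bounded tanh output) or t (linear output)."""
    def __init__(self, in_dim, hidden_dim, out_dim, rng):
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, (hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, x, scale=False):
        h = np.maximum(0.0, x @ self.w1 + self.b1)  # ReLU hidden layer
        out = h @ self.w2 + self.b2
        # a bounded scale output keeps exp(s(x)) numerically stable
        return np.tanh(out) if scale else out
```

One such network per coupling layer would be instantiated for s (with `scale=True`) and one for t.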
b.2 WeaNFM Sampling
For the mixed model WeaNFM, the sampling process becomes rather complicated. Next, the code producing the convex combination of labeling function embeddings is shown. The input tensor takes values in {0, 1} and has shape (B, |L|), where B is the batch size and |L| the number of labeling functions. Note that some mass is put on every labeling function; we realized that this bias improves performance.
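The code shown in the original is not preserved in this extraction; a plausible reconstruction of the described behavior (uniform weights over the simplex of matching labeling functions, plus a small mass on all others) is sketched below. The mixing weight `eps` and the exact mixing scheme are assumptions.

```python
import numpy as np

def convex_combination(match, eps=0.05, rng=None):
    # match: binary matrix of shape (B, n_lfs), 1 where the LF fires;
    # every row is assumed to contain at least one match.
    if rng is None:
        rng = np.random.default_rng()
    # normalized iid exponentials are uniform on the simplex (Dirichlet(1,...,1)),
    # restricted here to the matching labeling functions
    g = rng.exponential(size=match.shape) * match
    w = g / np.maximum(g.sum(axis=1, keepdims=True), 1e-12)
    # put some mass on every labeling function (the bias mentioned above)
    n_lfs = match.shape[1]
    return (1.0 - eps) * w + eps / n_lfs  # rows still sum to 1
```

The returned weights are then used to mix the labeling function embeddings before concatenation with the input.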