Pairwise Feedback for Data Programming

Benedikt Boecking, et al. (Carnegie Mellon University), December 16, 2019

The scalability of the labeling process and the attainable quality of labels have become limiting factors for many applications of machine learning. The programmatic creation of labeled datasets via the synthesis of noisy heuristics provides a promising avenue to address this problem. We propose to improve modeling of latent class variables in the programmatic creation of labeled datasets by incorporating pairwise feedback into the process. We discuss the ease with which such pairwise feedback can be obtained or generated in many application domains. Our experiments show that even a small number of sources of pairwise feedback can substantially improve the quality of the posterior estimate of the latent class variable.


1 Introduction

The acquisition of labeled data to train machine learning models is often an expensive and time-consuming process. The scalability of the labeling process, as well as the attainable quality of collected annotations, have become the limiting factors for many practical applications of machine learning. Consequently, various research directions aim to improve the ways we obtain labels to train models. One traditional approach is active learning, a sequential learning process which queries a teacher (sometimes referred to as an oracle) for feedback on data instances; for classification, these are often membership queries. More recently, the machine learning community has developed new ideas and approaches to reduce the required labeling effort even further and to scale the labeling process, e.g. by incorporating richer forms of feedback such as information regarding feature importance (Druck et al., 2008; Raghavan et al., 2006; Settles, 2011; Poulis and Dasgupta, 2017; Dasgupta et al., 2018).

Another avenue that has been proposed is to synthesize labels from imperfect sources by using generative models. A promising direction is to create labeled datasets programmatically, via user-generated functions that annotate samples and provide noisy labels. This process is known as data programming (Ratner et al., 2016), and it has been shown to produce state-of-the-art performance on several benchmark challenges and to provide an intuitive feedback mechanism for users.

We propose the use of weak pairwise feedback, such as same-class or different-class membership, to improve the estimation of true class labels in the programmatic creation of labeled datasets. The appeal of incorporating pairwise feedback into data programming is that it ties evidence from labeling functions together across samples. As a result, even samples that receive few or no weak heuristic labels can benefit from information acquired for their noisily associated pairs. While pairwise feedback is often abundant, it is used infrequently in machine learning because it is a weaker form of supervision than direct class labels. However, pairwise feedback has been shown to be a useful source of information in areas such as semi-supervised learning, metric learning, and active learning.

Pairwise information to enhance learning is easy to obtain in many applications. In some, meta-data can inform pairwise labels. For example, spatial or temporal proximity of samples can be used to induce pairwise linkage constraints, as in e.g. the analysis of spectral information from planetary observations (Wagstaff, 2002), or in video segmentation and speaker identification (Bar-Hillel et al., 2003). Another example is to create pairwise labels from knowledge about functional links between proteins for protein function prediction tasks (Eisenberg et al., 2000).

A different approach to creating pairwise feedback is to use locally accurate similarity functions, which experts in many domains can supply with little effort. These functions make it possible to find small numbers of similar samples of the same class with good accuracy. In text classification, for example, cosine similarity on term frequency-inverse document frequency (tf-idf) vectors routinely works well for finding nearest neighbors of the same class. Such metrics can be used to gather high-quality pairwise information with little noise, e.g. by constructing Mutual k-Nearest Neighbor (MKNN) graphs using the user-supplied functions. MKNN approaches have been successfully used in various domains, including single-cell RNA sequencing (Haghverdi et al., 2018).
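As an illustration of this idea, the short sketch below constructs mutual k-nearest-neighbor pairs from an arbitrary, user-supplied similarity matrix; the function name and the default choice of k are ours, not part of the original paper.

```python
import numpy as np

def mutual_knn_pairs(sim, k=10):
    """Return index pairs (i, l) that are mutual k-nearest neighbors under a
    user-supplied similarity matrix `sim` (n x n, larger values = more similar)."""
    s = np.asarray(sim, dtype=float).copy()
    np.fill_diagonal(s, -np.inf)                 # never pair a sample with itself
    knn = np.argsort(-s, axis=1)[:, :k]          # k most similar neighbors per row
    neighbor_sets = [set(row) for row in knn]
    return [(i, l) for i in range(len(s)) for l in neighbor_sets[i]
            if i < l and i in neighbor_sets[l]]  # keep only mutual neighbors
```

Pairs found this way can then be treated as noisy same-class feedback.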

In this paper, we show how to use noisy pairwise structures in combination with labeling functions in the data programming framework to improve the performance of latent class structure modeling.

2 Related Work

2.1 Data programming

A recent trend in machine learning is to use generative models to synthesize training data from weak supervision sources. The idea is to model training set creation as a process, such that the true label is a latent variable which generates the observed, noisy labels. Ratner et al. (2016) introduce data programming, where a set of labeling functions programmatically label data. These labeling functions are created by annotators, usually with domain expertise. The acquired labels are noisy, but generative models can be used to learn parameters of the labeling process, such as the accuracies of labeling functions as well as their correlation structure. Consequently, these generative models can synthesize labels from such weak supervision sources at scale.

In data programming, users can induce dependencies between labeling functions, such as that one ‘reinforces’ another. In this context, Bach et al. (2017) propose a structure estimation method to identify the generative model’s dependency structure. Varma et al. (2019) introduce a robust PCA-based algorithm for dependency structure estimation.

Further extensions to data programming have been studied to address latent subsets in data as well as to handle multi-task problems and multi-resolution settings. Varma et al. (2016) explore automatic identification of latent subsets in the training data in which the supervision sources perform differently than on average. The discovery of the subsets is used to augment the structure of the generative model in data programming. In an extension of data programming to the multi-task setting, Ratner et al. (2018a, b) assume that each labeling function returns a vector containing feedback on some subset of all possible tasks. For sequential data such as videos, Sala et al. (2019) propose a framework for modeling weak multi-resolution sources. The approach handles weak supervision at different scales such as per-frame or per-scene labels on videos, as well as sequential correlation amongst weak label sources.

2.2 Learning with Pairwise Information

Supervision via pairwise information has been used to improve clustering performance in semi-supervised clustering. The pairwise information is usually available in the form of constraints (one set of must-link pairs and one set of cannot-link pairs), and the problem setting is generally referred to as constrained clustering or semi-supervised clustering (Wagstaff et al., 2001; Basu et al., 2002; Klein et al., 2002). A number of semi-supervised clustering algorithms have been developed which simultaneously adapt the underlying notion of similarity or distance (Basu et al., 2004; Bilenko et al., 2004; Yan and Domeniconi, 2006).

Learning with pairwise supervision has also been explored in active learning settings, often with pairwise comparisons/rankings. For example, Parikh and Grauman (2011) study how to learn via relative feedback. The authors learn ranking functions for each attribute and subsequently build a generative model over the joint space of ranking outputs. Xu et al. (2017) consider an active dual supervision classification problem with algorithms that query oracles for noisy labels and pairwise comparisons. For the latter, the feedback is provided in terms of a pairwise ranking of the likelihood of being positive instead of an absolute label assignment. The algorithm can leverage both types of oracles, direct label assignment and pairwise comparisons. This active learning scheme can be useful in application areas where pairwise comparisons are easier to obtain. The comparison oracle is used to rank data points in order to create sets within which to obtain absolute labels. Xu et al. (2018) study active learning in a regression scenario with ordinal (or comparison) information. The authors provide theoretical guarantees and introduce an algorithm for this scenario.

3 Problem Setting

In data programming (Ratner et al., 2016), labeling functions are heuristics which provide noisy labels for samples. A generative model is used to model the latent class variable that generates the noisy observations. The model estimates the accuracies of the labeling functions, as well as their dependencies in some cases. The distribution over the latent class variable can be used to train a noise-aware discriminative model.

In this paper, we assume a multi-class setting with $K$ classes. We are given a dataset $X = \{x_1, \dots, x_n\}$ with unknown labels $y_1, \dots, y_n \in \{1, \dots, K\}$. Each data point $x_i$ is associated with a set of labeling function outputs $\lambda_{i1}, \dots, \lambda_{im}$. A labeling function votes for a label (e.g. $\lambda_{ij} = k$) or abstains ($\lambda_{ij} = 0$). This means that in this simple multi-class framework we assume that labeling functions only output positive evidence towards a particular class label (e.g. $\lambda_{ij} = k$), as opposed to providing information indicating that a sample does not belong to a particular class (e.g. $\lambda_{ij} = -k$).
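For concreteness, here is a small, hypothetical example (ours, not from the paper) of how such labeling functions and the resulting matrix of outputs might look in Python, with classes encoded as 1..K and 0 denoting abstention:

```python
import numpy as np

ABSTAIN = 0  # a labeling function that has no opinion returns 0

# Two illustrative keyword-based labeling functions for a hypothetical text task.
def lf_space(doc):
    return 3 if ("orbit" in doc.lower() or "nasa" in doc.lower()) else ABSTAIN

def lf_atheism(doc):
    return 1 if "atheism" in doc.lower() else ABSTAIN

def apply_lfs(docs, lfs):
    """Build the n x m matrix Lambda of labeling function outputs."""
    return np.array([[lf(doc) for lf in lfs] for doc in docs])

docs = ["NASA announces a new orbit insertion.", "A debate about atheism.", "Unrelated text."]
Lambda = apply_lfs(docs, [lf_space, lf_atheism])
# Lambda[i, j] = k -> labeling function j votes class k for sample i; 0 -> abstain.
```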

As in Ratner et al. (2016) and Bach et al. (2017), we can model the distribution over latent class variables and labeling functions as a factor graph. The density of the joint distribution of $\Lambda = \{\lambda_{ij}\}$ and $Y = \{y_1, \dots, y_n\}$ is:

$$p_\theta(\Lambda, Y) = \frac{1}{Z_\theta} \exp\Big( \sum_{i=1}^{n} \sum_{j=1}^{m} \theta_j \, \phi_{ij}(\Lambda, Y) \Big) \qquad (1)$$

where the parameter $\theta_j$ models the accuracy of labeling function $j$, $Z_\theta$ is the partition function, and for sample $i$ and labeling function $j$ we define the following factor function:

$$\phi_{ij}(\Lambda, Y) = \begin{cases} 1 & \text{if } \lambda_{ij} = y_i, \\ -1 & \text{if } \lambda_{ij} \neq y_i \text{ and } \lambda_{ij} \neq 0, \\ 0 & \text{if } \lambda_{ij} = 0. \end{cases}$$

In this model, the assumption is that the outputs of the labeling functions are conditionally independent given the true label. The parameter vector $\theta$ can be learned by minimizing the negative log marginal likelihood given the observed labeling function outputs $\Lambda$:

$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} \; -\log \sum_{Y} p_\theta(\Lambda, Y).$$

This objective can be optimized using (stochastic) gradient descent. The gradient for $\theta_j$ is

$$\nabla_{\theta_j} \Big( -\log \sum_{Y} p_\theta(\Lambda, Y) \Big) = \mathbb{E}_{(\Lambda', Y') \sim p_\theta}\Big[ \sum_{i=1}^{n} \phi_{ij}(\Lambda', Y') \Big] - \mathbb{E}_{Y \sim p_\theta(\cdot \mid \Lambda)}\Big[ \sum_{i=1}^{n} \phi_{ij}(\Lambda, Y) \Big]. \qquad (2)$$

Sampling is performed during each iteration of gradient descent to approximate the expectations, since the exact computation requires marginalization. The simple model for the multi-class scenario presented in this section is close to the conditionally independent model for binary classification presented in Bach et al. (2017).
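As an illustration of inference in this model (a sketch under the factor definition above, not the authors' implementation), note that in the conditionally independent case every factor touches only a single $y_i$, so once the accuracy parameters have been learned via the sampled-gradient procedure of Equation 2, the posterior over each latent label can be computed exactly:

```python
import numpy as np

def accuracy_factor(lam, k):
    """phi: +1 if the vote agrees with class k, -1 if it disagrees (non-abstaining), 0 if abstaining."""
    if lam == 0:
        return 0.0
    return 1.0 if lam == k else -1.0

def class_posterior(Lambda, theta, K):
    """Exact p(y_i = k | Lambda_i; theta) under Equation 1: the posterior factorizes over samples."""
    n, m = Lambda.shape
    scores = np.array([[sum(theta[j] * accuracy_factor(Lambda[i, j], k) for j in range(m))
                        for k in range(1, K + 1)] for i in range(n)])
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    post = np.exp(scores)
    return post / post.sum(axis=1, keepdims=True)

# MAP estimate of the latent class variable (classes encoded 1..K):
# y_hat = class_posterior(Lambda, theta, K).argmax(axis=1) + 1
```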

4 Data programming with pairwise feedback

We assume that users can write weak pairwise feedback functions which output an undirected, sparse graph over the samples. In the simplest case, users only write functions that indicate that pairs should be of the same class ($A_{il} = 1$) and abstain otherwise ($A_{il} = 0$), resulting in a symmetric matrix $A \in \{0, 1\}^{n \times n}$.

In some cases, users may also be able to gather reliable information about pairs which should not be of the same class ($A_{il} = -1$), resulting in $A \in \{-1, 0, 1\}^{n \times n}$. Finally, it may be possible to obtain functions outputting values that indicate a strength of belief that some pairs belong to the same or different class. In this paper, we constrain our analysis to the use of noisy pairwise feedback that is available as a discrete signal, i.e. a function outputs that two samples are of the same or different class rather than a probability thereof. We believe that this is the most intuitive and reliable form of pairwise feedback that users can provide.
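One convenient encoding of such a source is a sparse symmetric matrix with entries +1 (same class), -1 (different class), and 0 (abstain), which the sketch below builds with SciPy; the function name and interface are illustrative assumptions:

```python
from scipy import sparse

def pairwise_matrix(n, same_pairs, diff_pairs=()):
    """Encode one pairwise feedback source as a sparse, symmetric n x n matrix A:
    A[i, l] = +1 for same-class pairs, -1 for different-class pairs, 0 (implicit) otherwise."""
    rows, cols, vals = [], [], []
    for pairs, value in ((same_pairs, 1), (diff_pairs, -1)):
        for i, l in pairs:
            rows += [i, l]       # store both (i, l) and (l, i) to keep A symmetric
            cols += [l, i]
            vals += [value, value]
    return sparse.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()

A = pairwise_matrix(5, same_pairs=[(0, 1), (2, 3)], diff_pairs=[(0, 4)])
```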

Assume we have $S$ separate sources of pairwise feedback indexed by $s$, in the form of sparse, symmetric matrices $A^{(1)}, \dots, A^{(S)}$. We can again model the joint distribution using a factor graph, by simply using the $A^{(s)}$ to define factors amongst pairs of $Y$:

$$p_{\theta, w}(\Lambda, Y) = \frac{1}{Z_{\theta, w}} \exp\Big( \sum_{i=1}^{n} \sum_{j=1}^{m} \theta_j \, \phi_{ij}(\Lambda, Y) + \sum_{s=1}^{S} \sum_{i < l \,:\, A^{(s)}_{il} \neq 0} w_s \, A^{(s)}_{il} \big( \mathbb{1}\{y_i = y_l\} - \mathbb{1}\{y_i \neq y_l\} \big) \Big) \qquad (3)$$

where $w = (w_1, \dots, w_S)$ is a parameter vector modeling the accuracy of the pairwise sources. Notice that we do not model $A^{(s)}$ as a variable but merely use it to create factors, and learn one accuracy parameter $w_s$ per source $s$. As in Ratner et al. (2016), we perform inference via gradient descent and use Gibbs sampling to compute expectations of the distributions used in the gradient update.
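To illustrate how the pairwise factors enter the Gibbs sampler, the sketch below performs a single-site update of one latent label under the factors reconstructed in Equation 3; it reuses `accuracy_factor` from the earlier sketch, and all names and the exact factor form are our assumptions rather than the authors' code:

```python
import numpy as np

def gibbs_update_yi(i, y, Lambda, theta, A_list, w, K, rng):
    """Resample y_i given all other labels y, the labeling function outputs Lambda,
    and the sparse pairwise feedback matrices A_list (one per source) with weights w."""
    m = Lambda.shape[1]
    scores = np.zeros(K)
    for k in range(1, K + 1):
        s = sum(theta[j] * accuracy_factor(Lambda[i, j], k) for j in range(m))
        for A, w_s in zip(A_list, w):
            row = A.getrow(i)                    # only pairs with feedback contribute
            for l, a in zip(row.indices, row.data):
                agree = 1.0 if y[l] == k else -1.0
                s += w_s * a * agree             # a = +1 (same-class) or -1 (different-class)
        scores[k - 1] = s
    scores -= scores.max()                       # numerical stability
    p = np.exp(scores)
    p /= p.sum()
    y[i] = rng.choice(np.arange(1, K + 1), p=p)
```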

5 Experiments

To study the general usefulness of pairwise feedback in the programmatic creation of labeled data, we first conduct experiments on synthetic data in order to have full control over the precision and recall of the labeling functions. We then use the popular 20 Newsgroups dataset to show improvements in performance on real data and to demonstrate the ease with which same-class pairwise feedback can be generated in practice.

5.1 Synthetic data

5.1.1 Same-class feedback

Figure 1: Accuracy of the MAP estimate of the latent class variable on synthetic data, using simulated labeling functions and pairwise feedback. (a) Data programming without pairwise feedback; y-axis: accuracy of the MAP estimate, x-axis: average precision of the simulated labeling functions at fixed recall. (b) Data programming with same-class pairwise feedback; accuracy of the MAP estimate indicated by the color bar, y-axis: accuracy of the simulated pairwise feedback (for a fixed number of pairs), x-axis: average precision of the simulated labeling functions at fixed recall.
Figure 2: Increase in accuracy of the MAP estimate of the latent class variable achieved by data programming with pairwise feedback, compared to a model without pairwise feedback, on synthetic data. Panels (a) and (b) show results for two different numbers of pairs; the increase in accuracy is indicated by the contours.

We simulate a multi-class classification task with $K$ classes and generate two weak labeling functions per class. For each weak labeling function, we randomly add false positives (fp) and false negatives (fn) to the true label to achieve a target recall and precision. Given a target mean recall and precision, the specific recall and precision for one run of an experiment are drawn randomly from a truncated normal distribution. Throughout the experiments, we fix the target recall of each labeling function and increment the target precision over a range of values. To create pairwise feedback, we generate same-class information by first specifying a target count of pairs and a target accuracy, and then randomly creating fp and true positive (tp) pairs. As a point of reference for the quality of the simulated pairwise feedback, consider that for a dataset with $K$ classes and even class balance, the default accuracy of a pairwise matrix indicating same-class membership is $1/K$.
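The following sketch shows one way such labeling functions and same-class pairs could be simulated; the generation procedure is simplified relative to the paper (e.g. the truncated-normal draws are omitted), and all names are illustrative:

```python
import numpy as np

def simulate_lf(y_true, target_class, precision, recall, rng):
    """Simulate one weak labeling function: vote `target_class` on enough true
    positives to reach the target recall, then add false positives until the
    target precision is (approximately) reached; abstain (0) everywhere else."""
    lam = np.zeros_like(y_true)
    pos = np.flatnonzero(y_true == target_class)
    neg = np.flatnonzero(y_true != target_class)
    n_tp = int(round(recall * len(pos)))
    n_fp = int(round(n_tp * (1.0 - precision) / max(precision, 1e-9)))
    lam[rng.choice(pos, size=n_tp, replace=False)] = target_class
    lam[rng.choice(neg, size=min(n_fp, len(neg)), replace=False)] = target_class
    return lam

def simulate_same_class_pairs(y_true, n_pairs, accuracy, rng):
    """Simulate noisy same-class feedback: roughly a fraction `accuracy` of the
    returned pairs are truly same-class (tp), the remainder are fp pairs."""
    n_tp = int(round(accuracy * n_pairs))
    tp, fp = [], []
    while len(tp) < n_tp or len(fp) < n_pairs - n_tp:
        i, l = map(int, rng.choice(len(y_true), size=2, replace=False))
        if y_true[i] == y_true[l] and len(tp) < n_tp:
            tp.append((i, l))
        elif y_true[i] != y_true[l] and len(fp) < n_pairs - n_tp:
            fp.append((i, l))
    return tp + fp
```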

We repeat this process of simulating labeling functions and weak pairwise feedback ten times and present results based on the mean accuracy achieved by the MAP estimate of the latent class variable. Figure 1(a) shows the accuracy of this estimate produced by a data programming model with independent labeling functions (Equation 1) on the simulated task as the precision of the underlying labeling functions increases. The contour plot in Figure 1(b) shows the accuracy when same-class feedback of varying accuracy is added to the model (Equation 3). Figure 2(b) shows the increase in accuracy gained by using this weak pairwise feedback, and Figure 2(a) reveals that even a small number of pairs can lead to drastic improvements.

5.1.2 Same-class and different-class feedback

Figure 3: Increase in accuracy of the MAP estimate of the latent class variable achieved by data programming with same-class and different-class pairwise feedback, compared to a model without pairwise feedback, on synthetic data. Results are shown for one fixed number of pairs; the increase in accuracy is indicated by the contours.

Using the same approach as described in the previous section, we also simulated the acquisition of different-class feedback in addition to same-class feedback. To this end, we randomly sampled unique pairs from the data and then corrupted the ground-truth pairs to achieve the desired noise levels of pairwise feedback. Results for one such experiment are shown in Figure 3. Comparing Figure 3 to Figure 2, it appears that a higher pairwise accuracy is needed to achieve consistent improvements when pairwise feedback contains both same-class and different-class information, as the negative feedback in this setting is naturally less informative.
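A minimal version of this corruption procedure, with illustrative names, might look as follows:

```python
def simulate_pairwise_feedback(y_true, n_pairs, accuracy, rng):
    """Sample unique pairs, label each +1 (same class) or -1 (different class)
    from ground truth, then flip labels so that roughly a fraction `accuracy` is correct."""
    n, seen, pairs = len(y_true), set(), []
    while len(pairs) < n_pairs:
        i, l = sorted(map(int, rng.choice(n, size=2, replace=False)))
        if (i, l) in seen:
            continue
        seen.add((i, l))
        label = 1 if y_true[i] == y_true[l] else -1
        if rng.random() > accuracy:          # corrupt this pair
            label = -label
        pairs.append((i, l, label))
    return pairs
```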

5.2 Newsgroups data

Source Unique pairs Pair accuracy
Email pairs 21696 87.33%
MKNN pairs 12355 88.04%
Table 1: Weak pairwise sources created for a subset of 20 Newsgroups topics.
Labeling function Precision Recall Class
1 0.896970 0.185232 alt.atheism
2 0.916216 0.424280 alt.atheism
3 0.934263 0.474696 comp.windows.x
4 0.646739 0.120445 comp.windows.x
5 0.738806 0.100202 comp.windows.x
6 0.650000 0.039514 sci.space
7 0.835635 0.612969 sci.space
8 0.821739 0.191489 sci.space
9 0.818966 0.096251 sci.space
10 0.725118 0.310030 sci.space
11 0.898734 0.071935 sci.space
12 0.940741 0.257345 sci.space
13 0.617801 0.152258 talk.politics.misc
14 0.765152 0.130323 talk.politics.misc
15 0.787709 0.181935 talk.politics.misc
16 0.804878 0.052548 talk.religion.misc
Table 2: Weak labeling functions created for a subset of 20 Newsgroups topics.

We used a subset of the popular 20 Newsgroups text classification dataset (as available at https://scikit-learn.org/stable/datasets/index.html#newsgroups-dataset), restricted to the topics alt.atheism, talk.religion.misc, talk.politics.misc, comp.windows.x, and sci.space. This results in five classes with roughly even class balance; given the roughly even balance, the default accuracy of a classifier predicting the majority class is close to 20%. We manually created 16 labeling functions based on user-defined heuristics that look for mentions of specific words; see Table 2 for details regarding precision and recall.

To demonstrate the ease with which pairwise same-class feedback can be obtained, we create two different sources thereof, one using meta-data and one based on MKNN. We use a regular expression to extract the first email address mentioned in each document. Using the extracted addresses, we create pairs of documents containing the same email address and assign them a same-class label ($A_{il} = 1$). The assumption is that the first email address in a document identifies the author, and that documents by a unique author are likely to belong to the same topic. This results in a pairwise matrix covering a small fraction of all possible pairs; see Table 1.
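A sketch of this metadata-based source (the regular expression shown is a generic one, not necessarily the one used in the paper):

```python
import re
from collections import defaultdict
from itertools import combinations

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def email_pairs(documents):
    """Group documents by the first email address they mention and return all
    within-group pairs, which are then treated as same-class feedback (A[i, l] = 1)."""
    by_email = defaultdict(list)
    for idx, doc in enumerate(documents):
        match = EMAIL_RE.search(doc)
        if match:
            by_email[match.group(0).lower()].append(idx)
    pairs = []
    for doc_ids in by_email.values():
        pairs.extend(combinations(doc_ids, 2))   # all pairs sharing one address
    return pairs
```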

Next, we create a pairwise matrix based on MKNN, a popular heuristic that works well in many domains (see e.g. Fogel et al. (2019) for its use on image data). We use a term frequency-inverse document frequency (tf-idf) embedding, with a minimum document frequency cutoff, and compute the 10 nearest neighbors of each document using cosine similarity. Using this nearest-neighbor graph, we subsequently create pairs of documents $(x_i, x_l)$ that are mutual k-nearest neighbors ($x_i$ is amongst the 10 nearest neighbors of $x_l$, and $x_l$ is amongst the 10 nearest neighbors of $x_i$) and assign them a same-class label ($A_{il} = 1$). Note that both heuristic sources we use only cover same-class pairs.
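The MKNN source can be built with standard scikit-learn components, roughly as sketched below; the min_df cutoff value here is a placeholder, since the exact value used in the paper is not shown:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

topics = ["alt.atheism", "talk.religion.misc", "talk.politics.misc",
          "comp.windows.x", "sci.space"]
data = fetch_20newsgroups(subset="all", categories=topics)

vec = TfidfVectorizer(min_df=5)          # min_df value is illustrative
X = vec.fit_transform(data.data)

# 10 nearest neighbors under cosine distance; request 11 because the query
# document itself is typically returned as its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=11, metric="cosine").fit(X)
_, idx = nn.kneighbors(X)
neighbors = [set(row[1:]) for row in idx]

# Keep only mutual k-nearest-neighbor pairs; each such pair gets A[i, l] = 1.
mknn_pairs = [(i, l) for i in range(len(neighbors)) for l in neighbors[i]
              if i < l and i in neighbors[l]]
```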

Method Mean accuracy of posterior class label estimate Standard deviation
Majority vote 46.94% 0.000
Labeling functions 56.45% 0.005
+ Email pairs 59.64% 0.006
+ MKNN pairs 75.98% 0.008
+ MKNN + Email pairs 69.92% 0.009
Table 3: Results obtained for modeling of the latent class variable on a subset of the 20 Newsgroups dataset.

As baselines, we use the conditionally independent model in Equation 1 as well as a majority vote over the weak labeling functions. Results are shown in Table 3, demonstrating the increase in accuracy in modeling the latent class variable when pairwise feedback is incorporated.

6 Discussion

Figure 4: A histogram showing the distribution of pairs per sample in the 20 Newsgroups dataset for the two pairwise matrices that were created. The plot shows that, by construction, the text-based MKNN pairs are spread less unevenly (i.e. are less biased) across the dataset than the pairs created by matching email addresses.

Our experiments on synthetic and benchmark data show that even a single, small source of weak pairwise feedback can lead to substantial improvements in the posterior estimate of the latent class variable. Our experiment on the 20 Newsgroups data demonstrates the ease with which good sources of pairwise feedback can be obtained for common classification tasks. The difference in performance between the Email pairs and the text-based MKNN pairs is very likely due to the less biased pairwise information provided by the MKNN matrix (see Figure 4). The experiment demonstrates that the improvement in performance is not just a function of the accuracy of pairs obtained from a feedback source, but also of the distribution of this information across the samples. In the experiment, the MKNN pairs tie evidence from the labeling functions together across a larger portion of the dataset than the Email pairs do.

We emphasize that the performance of the model without pairwise feedback is not a shortcoming of data programming by any means. Without weak labeling functions, pairwise feedback by itself would not allow us to model the latent class structure. Furthermore, only a small amount of human labeling effort was required to create the weak labeling functions, and no outside source of information was used. More time and domain knowledge could certainly yield additional high-quality labeling functions that improve the performance of the model without pairwise feedback. Our experiments highlight the usefulness of the programmatic creation of labeled datasets and show that performance can be increased further by additionally gathering weak pairwise feedback.

7 Conclusion

We have empirically demonstrated that just one noisy pairwise function can substantially improve the accuracy of the posterior estimate of the latent class variable in data programming. Noisy pairwise feedback is a valuable resource for data programming as it ties evidence from labeling functions together across different samples of a dataset. Our experiments, as well as related literature (e.g. in semi-supervised clustering), demonstrate that pairwise feedback with sufficient accuracy can be collected with ease and at scale for many problem types.

Several issues remain that we aim to address in future work. We will expand the number of experiments on real-world data to comprehensively validate the current claims, and include weak sources indicating different-class membership. Furthermore, we will explore alternative inference algorithms, since as the number of samples grows, even a small fraction of all possible pairs can become quite large. We also intend to conduct a theoretical analysis of the expected improvements and guarantees when pairwise feedback is used in the programmatic creation of labeled data.

Acknowledgments

This work has been partially supported by DARPA FA8750-17-2-013 and by a Space Technology Research Institutes grant from NASA’s Space Technology Research Grants Program.

References

  • S. H. Bach, B. He, A. Ratner, and C. Ré (2017) Learning the structure of generative models without labeled data. In Proceedings of the 34th International Conference on Machine Learning, pp. 273–282.
  • A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall (2003) Learning distance functions using equivalence relations. In ICML, pp. 11–18.
  • S. Basu, A. Banerjee, and R. Mooney (2002) Semi-supervised clustering by seeding. In ICML.
  • S. Basu, M. Bilenko, and R. J. Mooney (2004) A probabilistic framework for semi-supervised clustering. In SIGKDD.
  • M. Bilenko, S. Basu, and R. J. Mooney (2004) Integrating constraints and metric learning in semi-supervised clustering. In ICML.
  • S. Dasgupta, A. Dey, N. Roberts, and S. Sabato (2018) Learning from discriminative feature feedback. In Advances in Neural Information Processing Systems 31, pp. 3959–3967.
  • G. Druck, G. Mann, and A. McCallum (2008) Learning from labeled features using generalized expectation criteria. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 595–602.
  • D. Eisenberg, E. M. Marcotte, I. Xenarios, and T. O. Yeates (2000) Protein function in the post-genomic era. Nature 405 (6788), pp. 823.
  • S. Fogel, H. Averbuch-Elor, D. Cohen-Or, and J. Goldberger (2019) Clustering-driven deep embedding with pairwise constraints. IEEE Computer Graphics and Applications 39 (4), pp. 16–27.
  • L. Haghverdi, A. T. Lun, M. D. Morgan, and J. C. Marioni (2018) Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology 36 (5), pp. 421.
  • D. Klein, S. D. Kamvar, and C. D. Manning (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. Technical report, Stanford.
  • D. Parikh and K. Grauman (2011) Relative attributes. In 2011 International Conference on Computer Vision, pp. 503–510.
  • S. Poulis and S. Dasgupta (2017) Learning with feature feedback: from theory to practice. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 54, pp. 1104–1113.
  • H. Raghavan, O. Madani, and R. Jones (2006) Active learning with feedback on features and instances. Journal of Machine Learning Research 7, pp. 1655–1686.
  • A. Ratner, B. Hancock, J. Dunnmon, R. Goldman, and C. Ré (2018a) Snorkel MeTaL: weak supervision for multi-task learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, pp. 3.
  • A. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, and C. Ré (2018b) Training complex models with multi-task weak supervision. arXiv preprint arXiv:1810.02840.
  • A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré (2016) Data programming: creating large training sets, quickly. In Advances in Neural Information Processing Systems 29, pp. 3567–3575.
  • F. Sala, P. Varma, J. Fries, D. Y. Fu, S. Sagawa, S. Khattar, A. Ramamoorthy, K. Xiao, K. Fatahalian, J. Priest, and C. Ré (2019) Multi-resolution weak supervision for sequential data. In NeurIPS.
  • B. Settles (2011) Closing the loop: fast, interactive semi-supervised annotation with queries on features and instances. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1467–1478.
  • P. Varma, B. He, D. Iter, P. Xu, R. Yu, C. De Sa, and C. Ré (2016) Socratic learning: augmenting generative models to incorporate latent subsets in training data. arXiv preprint arXiv:1610.08123.
  • P. Varma, F. Sala, A. He, A. Ratner, and C. Ré (2019) Learning dependency structures for weak supervision models. arXiv preprint arXiv:1903.05844.
  • K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al. (2001) Constrained k-means clustering with background knowledge. In ICML, pp. 577–584.
  • K. L. Wagstaff (2002) Intelligent clustering with instance-level constraints. Ph.D. thesis, Cornell University.
  • Y. Xu, H. Muthakana, S. Balakrishnan, A. Singh, and A. Dubrawski (2018) Nonparametric regression with comparisons: escaping the curse of dimensionality with ordinal information. In ICML.
  • Y. Xu, H. Zhang, K. Miller, A. Singh, and A. Dubrawski (2017) Noise-tolerant interactive learning using pairwise comparisons. In NIPS, pp. 2431–2440.
  • B. Yan and C. Domeniconi (2006) An adaptive kernel method for semi-supervised clustering. In ECML, pp. 521–532.