Automatic Training Data Generation with Affinity Coding
Generating large labeled training data is becoming the biggest bottleneck in building and deploying supervised machine learning models. Recently, data programming has been proposed in the data management community to reduce the human cost in training data generation. Data programming expects users to write a set of labeling functions, each of which is a weak supervision source that labels a subset of data points with better-than-random accuracy. However, the success of data programming heavily depends on the quality (in terms of both accuracy and coverage) of the labeling functions that users still need to design manually. We propose affinity coding, a new paradigm for fully automatic generation of training data. In affinity coding, the similarity between the unlabeled instances and prototypes that are derived from the same unlabeled instances serves as the signal (or source of weak supervision) for determining class membership. We term this implicit similarity the affinity score. Consequently, we can have as many sources of weak supervision as the number of unlabeled data points, without any human input. We also propose a system called GOGGLES that is an implementation of affinity coding for labeling image datasets. GOGGLES features novel techniques for deriving affinity scores from image datasets based on "semantic prototypes" extracted from convolutional neural nets, as well as an expectation-maximization approach for performing class label inference based on the computed affinity scores. Compared to the state-of-the-art data programming system Snorkel, GOGGLES achieves a 14.88% average improvement in the quality of generated labels for the binary labeling task. The GOGGLES system is open-sourced at https://github.com/chu-data-lab/GOGGLES/.
Machine learning (ML) and deep learning are being increasingly used by organizations to gain insights from data and to solve a diverse set of important problems, ranging from traditional applications such as fraud detection, product recommendation, and customer churn prediction, to more challenging and modern applications such as image recognition, natural language understanding, and even enabling health care and self-driving cars. A fundamental necessity for the success of ML algorithms is the existence of large high-quality labeled training data. For example, the current ConvNet revolution would not be possible without big labeled datasets such as the 1M labeled images from ImageNet [Russakovsky et al., 2015]. Modern deep learning methods often need tens of thousands to millions of training examples to reach peak predictive performance [Sun et al., 2017]. However, for many real-world applications, large hand-labeled training datasets either do not exist or are extremely expensive to create, as manually labeling data usually requires domain experts [Davis et al., 2013].
We are certainly not the first to recognize the need for addressing the challenges arising from the lack of sufficient training data. The ML community has been dealing with it, mostly by modifying how the models are trained, such as by using semi-supervised learning [Zhu, 2005], which we discuss further in Section 6. Only recently, the data programming paradigm [Ratner et al., 2016] and the Snorkel system [Ratner et al., 2017], which is an implementation of the paradigm, were proposed in the data management community to directly address the problem by reducing the human effort in generating labeled training data. Snorkel consists of two main steps: (1) users write labeling functions, each of which is a weak supervision source that assigns potentially noisy labels to a subset of data points in the training set. This produces a labeling matrix L, where L_ij stores the label assigned by source j to data point x_i (a null value denotes that source j fails to label x_i). (2) As different labeling functions may provide conflicting signals, the labeling matrix L is then de-noised to generate one probabilistic label for every x_i. Unsurprisingly, the quality of the assigned labels depends on the accuracy and coverage of the hand-designed labeling functions, as well as the de-noising algorithm.
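To make the labeling-matrix idea concrete, here is a toy, plain-Python sketch (this is not Snorkel's actual API; the rules, data, and a hypothetical spam task are invented for illustration):

```python
ABSTAIN = None  # a labeling function may decline to label a data point

# Two hypothetical labeling functions for a spam-detection task:
# each returns a class label (1 = spam, 0 = not spam) or abstains.
def lf_contains_link(text):
    return 1 if "http://" in text or "https://" in text else ABSTAIN

def lf_short_message(text):
    return 0 if len(text.split()) < 4 else ABSTAIN

def build_label_matrix(data, lfs):
    """Row i, column j stores the (possibly null) label source j assigns to x_i."""
    return [[lf(x) for lf in lfs] for x in data]

docs = ["win free money now http://spam.example", "see you at lunch", "ok"]
L = build_label_matrix(docs, [lf_contains_link, lf_short_message])
# L[0] == [1, None]: the first doc is flagged by the link rule only.
```

The de-noising step then reconciles the columns of `L` into one probabilistic label per row.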
Our Proposal and Challenges. Snorkel’s main limitation is that users still need to be domain experts in the labeling task, and need to provide many high-quality labeling functions to achieve high labeling accuracy. For example, Snorkel was developed for labeling text data, where users can write labeling functions using raw features from text. However, it is difficult for users to write labeling functions for images, as the raw features of images (pixels) are intractable for such a task. While Snorkel’s primary objective is to reduce the human effort for fast training data generation, we aim for a more ambitious goal of generating training data completely automatically. This is very challenging as we need to completely re-think how we generate the weak supervision sources without any human input. In other words, the unlabeled data must somehow itself serve as the sources of weak supervision to determine the class membership.
Contributions. We make the following contributions that address the above challenges for automatic training data generation.
The affinity coding paradigm. We propose affinity coding, a new paradigm for automatic generation of training data. The core premise of affinity coding is that instances belonging to the same class share certain similarities, which we call affinity, that instances belonging to different classes do not have. The paradigm thus includes two main components: how to encode the affinity between unlabeled instances that can reflect the class separation, and how to use the computed affinities for class label inference.
To the best of our knowledge, we are the first to propose a general approach to label training data using only the unlabeled data and without any further inputs, such as a small labeled development set or human annotations.
Coding affinity for image datasets.
We propose GOGGLES, a system that implements affinity coding for automatic image labeling. We choose image labeling as our first foray into affinity coding because it is much more difficult for Snorkel to label images as discussed before. GOGGLES must handle the challenges associated with labeling images: (1) noisy raw pixels have high variance even for images in the same class; (2) higher-order signals are unknown in the absence of any labeled data; and (3) a useful signal may reside in different regions of images from the same class.
To tackle these difficulties, GOGGLES features a novel approach to automatically select the most useful “prototypes” from unlabeled instances. Figure 1 shows one such extracted prototype for the leftmost image. The affinity scores are then calculated between other unlabeled instances and the prototype extracted from an image. It can be seen from Figure 1 that images that are in the same class as the prototype tend to have higher affinity scores than images that are in a different class.
Class inference from affinity scores. Given the affinity scores, GOGGLES proposes techniques to perform class inference that assigns a probabilistic label to every unlabeled instance. This turns out to be a challenging task for multiple reasons: (1) unlike the discrete values produced by labeling functions in Snorkel, our affinity scores are real values, which introduce modelling difficulties; and (2) a higher affinity score between an instance and a prototype merely suggests that they are likely to belong to the same class, but does not actually indicate which class that is.
To address these challenges, GOGGLES proposes a generative model that models the affinity scores generated by different prototypes conditioned on different classes, and formulates the class inference problem as a maximum likelihood estimation problem. GOGGLES then uses expectation-maximization to iteratively refine the class assignments until convergence. Figure 1 shows that the two distributions for the affinity scores of two classes are clearly separated, which allows us to achieve high labeling accuracy.
On the Caltech-UCSD Birds-200-2011 dataset and Animals with Attributes 2 dataset, labels generated by GOGGLES show an average accuracy of 98.09% and 92.66% for the binary labeling task, and 92.93% and 83.44% for the multi-class labeling task respectively. Compared to the state-of-the-art data programming system Snorkel, GOGGLES provides 14.88% average improvement in terms of the quality of labels generated for the binary labeling task.
The rest of the paper is organized as follows. We discuss the formal problem of training data generation and our proposed affinity coding paradigm in Section 2. We show how GOGGLES provides a way to generate affinity scores for image datasets in Section 3. We present an expectation-maximization based algorithm for performing class label inference using affinity scores in Section 4. We present the experimental evaluations in Section 5. We discuss the related work in Section 6. We conclude and present the future work in Section 7. The GOGGLES system is open-sourced at https://github.com/chu-data-lab/GOGGLES/.
We formally state the problem of automatic training data generation in Section 2.1. We then introduce affinity coding, a new paradigm for addressing the automatic training data generation problem in Section 2.2.
In traditional supervised classification applications, the goal is to learn a classifier f: X → Y based on a labeled training set D = {(x_i, y_i)}, where x_i ∈ X and y_i ∈ Y. The classifier is then used to make predictions on a test set.
In our setting, we do not assume access to any labeled training data, namely, we only have the x_i's and no y_i's. Let N denote the total number of unlabeled data points, and let y_i* denote the unknown true label for x_i. Our goal is to assign a probabilistic label ŷ_i for every x_i, where ŷ_i = (p_i1, ..., p_iK), with K being the number of classes in the labeling task, and Σ_k p_ik = 1.
These probabilistic labels can then be used to train any downstream ML models, such as a convolutional neural network (CNN) for image classification, or a long short-term memory network for text classification tasks. These probabilistic labels can be leveraged in several ways. For example, we can generate a discrete label for every instance x_i according to the highest p_ik. Another more principled approach is to use the probabilistic labels directly in the loss function, i.e., minimize the expected loss with respect to ŷ:

L̂(f) = Σ_i E_{y ∼ ŷ_i} [ℓ(f(x_i), y)]
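As an illustrative sketch (function and variable names are ours, not from the paper), when ℓ is the log loss this expected loss reduces to a soft-label cross-entropy:

```python
import numpy as np

def expected_cross_entropy(probs_pred, probs_label):
    """Expected loss E_{y ~ y_hat}[ -log p_model(y | x) ], averaged over instances.

    probs_pred:  (N, K) model output probabilities p(y = k | x_i)
    probs_label: (N, K) probabilistic labels y_hat_i, each row summing to 1
    """
    eps = 1e-12  # numerical guard against log(0)
    return float(-(probs_label * np.log(probs_pred + eps)).sum(axis=1).mean())

# A prediction that agrees with the probabilistic label incurs a lower
# expected loss than one that contradicts it.
y_hat = np.array([[0.9, 0.1]])          # probabilistic label for one instance
good  = np.array([[0.8, 0.2]])          # model mostly agrees
bad   = np.array([[0.2, 0.8]])          # model mostly disagrees
assert expected_cross_entropy(good, y_hat) < expected_cross_entropy(bad, y_hat)
```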
It has been shown that as the amount of unlabeled data increases, the generalization error of the model trained with probabilistic labels will decrease at the same asymptotic rate as traditional supervised learning models do with additional hand-labeled data [Ratner et al., 2016]. In this work, however, we only focus on producing probabilistic labels.
We propose affinity coding, a completely new paradigm for obtaining high quality sources of weak supervision. The core premise of affinity coding is that instances belonging to the same class share certain similarities, which we call affinity, that instances belonging to different classes do not have. Figure 2
shows the affinity coding framework for generating probabilistic training data, having two main components: affinity matrix construction and class inference.
Step 1: Affinity Matrix Construction. As the raw features of an instance may not be informative for computing affinity scores, the first step for affinity matrix construction involves extracting prototypes from input instances. Each unlabeled data point x_i can yield one or more prototypes, each of which is used as a weak supervision source. Given N unlabeled instances and M prototypes, we compute an affinity score α_ij between the unlabeled instance x_i and the prototype p_j. The output is thus represented as the affinity matrix A ∈ R^{N×M} storing all the affinity scores.
Step 2: Class Inference. Each column of the affinity matrix A provides a weak supervision source suggesting the class membership of unlabeled points. Therefore, we have M different weak supervision sources, each potentially providing conflicting information about the class membership of each of the N data points. Thus, the job of the class inference component is to de-noise A and produce a probabilistic label ŷ_i for each data point x_i. This is done by maximizing P(A | Θ), where Θ denotes the parameters of a probabilistic model that is used to model A.
The affinity coding paradigm offers a general framework for training data generation. Users only need to develop a technique for constructing the affinity matrix, and it can be reused later for similar training data generation tasks. In Section 3 and Section 4, we present a specific instantiation of the paradigm, which we call GOGGLES, that labels image datasets. We show in the experiments that the same approach is useful for labeling two different image datasets with completely different class semantics.
In this section, we describe the affinity coding step of GOGGLES, a system that automatically generates labels for image datasets. As discussed before, our affinity coding paradigm is based on the proposition that examples belonging to the same class should have certain similarities. For image datasets, we hypothesize that images from the same class would contain certain visually grounded features which are richly discriminative when compared with images of another class.
However, it is nontrivial to design affinity scores based on visually grounded features for images due to multiple reasons: (1) raw pixel values are excessively noisy and have high variance even for images of the same class, (2) arbitrary signals that encode the image as a whole may not be class-discriminative since the underlying higher-order features may be localized only in specific regions of the image; and most importantly, (3) discriminative features may not be known a priori, in the absence of any class labels.
To alleviate these complications, GOGGLES leverages convolutional neural networks (CNNs) to transplant the data representation from the raw pixel space to a semantic space, which makes it more tractable to identify higher-order discriminative features. It has been shown that the intermediate layers of a trained CNN are able to encode human-interpretable concepts, such as edges and corners in initial layers; and textures, objects and complex patterns in the final layers [Zeiler and Fergus, 2014].
In Section 3.1, we show how we take advantage of this inherent property of trained CNNs in order to automatically identify spatially localized class-definite “concept prototypes” from the given set of images. In Section 3.2, we show how we compute the affinity scores between every image x_i and every concept prototype p_j. These affinity scores act as a proxy for class-discriminative signals without requiring any class labels.
To extract signals for labeling the images in a dataset as different classes, we need to look at different regions of each image instead of encoding the image as a whole. This is based on the intuition that the higher-order concept that we are trying to identify may be spatially localized in the image. For example, if we are given an image of a tiger, it may contain a tiger’s head in one corner while the rest of the image contains the background of a forest. If we try to extract signals by looking at the complete image, the background may add noise to the signal. Therefore, we propose a technique to automatically discover patches in each image that may contain class-descriptive signals. These image patches in the pixel space are represented by “concept prototypes” in a semantic subspace. For each image, we try to extract k such prototypes that correspond to different patches in the pixel space.
Extracting all prototypes. Let us now formally define our approach. To begin, we pass an image x through a series of convolutional, activation and pooling layers of a CNN to obtain a feature volume Z, as illustrated in Figure 3. Z is also called a “filter map”, and has dimensions c × h × w, where c, h and w are the number of channels, height and width of the filter map respectively. Let us also denote indexes over the height and width dimensions of Z with a and b respectively. Hence, each vector v = Z[:, a, b] (spanning the channel axis) in the output volume can be backtracked to a rectangular patch at the corresponding location in the input image x. The location of the patch is determined by computing the gradients of v with respect to the input pixels. All pixels which are found to have non-zero gradients are a part of this patch. This region in the image is formally known as the receptive field of v, which is typically a contiguous region. Since any change in this patch will induce a change in the vector v, we say that v encodes the semantic concept present in the patch.
Suppose we obtain a filter map Z from the CNN model having dimensions c × h × w. There are then a total of h · w possible concept prototypes of dimension c.
Selecting the top-k best prototypes. In an image x, obviously not every patch and its corresponding encoded semantic concept is a good signal. In fact, many patches in an image correspond to background noise that is uninformative for labeling the image. Therefore, we need a way to intelligently select the k most informative semantic prototypes from all the h · w possible ones.
In this regard, we perform a 2D Global Max Pooling (GMP) operation across each channel of Z to determine the top k channels containing the highest magnitude of activation. That is, we reduce Z ∈ R^{c×h×w} to a vector z ∈ R^c by taking the max over the h and w dimensions for each channel. We use this vector containing the highest activation value of each channel to get the top k activated channels c_1, ..., c_k, where k ≤ c. Since these channels are most activated, we hypothesize that they capture class-specific abstract concepts present in the image. Consequently, for each maximally activated channel c_t, we identify the concept prototype at the spatial location that contains the highest activation value for the channel c_t. More concretely, we define the concept prototype for the image x as

p = Z[:, a*, b*],  where (a*, b*) = argmax_{a,b} Z[c_t, a, b]
Note that the pair (a*, b*) may not be unique across the top k channels, yielding the same concept prototypes. Hence, we drop the duplicates and only keep the unique prototypes. This approach is illustrated in Figure 4.
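The selection procedure above can be sketched in NumPy on a synthetic filter map (no CNN involved; the shapes, values, and the helper name are illustrative):

```python
import numpy as np

def select_prototypes(Z, k):
    """Pick up to k prototype vectors from a filter map Z of shape (c, h, w).

    1. Global max pooling: the peak activation of each channel.
    2. Take the k most activated channels.
    3. For each such channel, locate the spatial position (a, b) of its peak;
       the prototype is the channel-axis vector Z[:, a, b].
    4. Drop duplicate (a, b) positions.
    """
    c, h, w = Z.shape
    gmp = Z.reshape(c, -1).max(axis=1)        # (c,): peak activation per channel
    top_channels = np.argsort(gmp)[::-1][:k]  # k most activated channels
    seen, prototypes = set(), []
    for ch in top_channels:
        a, b = np.unravel_index(np.argmax(Z[ch]), (h, w))
        if (a, b) not in seen:                # dedupe shared peak locations
            seen.add((a, b))
            prototypes.append(Z[:, a, b])
    return prototypes

Z = np.zeros((4, 3, 3))
Z[0, 1, 2] = 5.0   # channels 0 and 2 peak at the same location (1, 2)
Z[2, 1, 2] = 4.0
Z[1, 0, 0] = 3.0
protos = select_prototypes(Z, k=3)
assert len(protos) == 2   # the duplicate location is kept only once
```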
Doing this for each image in the dataset, we can build a collection of prototypes P = {p_1, ..., p_M}, where M ≤ N · k.

Here, each prototype is represented as a vector in the semantic space, but also corresponds to a patch in the pixel space, as shown in Figure 3. We can see that by building semantic prototypes this way, we can immediately obtain many diverse signals (as many as N · k) that are useful for determining image labels.
Having extracted concept prototypes for each image, next we aim to identify other images containing similar concepts. To this end, we construct an affinity matrix A ∈ R^{N×M}, where N is the number of images in the dataset and M is the total number of concept prototypes extracted from all the images. Since we drop duplicate prototypes, we have M ≤ N · k. Each entry α_ij in the matrix is the affinity score between the image x_i and the prototype p_j. To compute the affinity between an image and a concept prototype, we again rely on the semantic space to capture this relationship. We calculate the similarity between the prototype p_j and every vector contained in the filter map Z_i of image x_i using a similarity function s, and pick the highest value as the affinity score α_ij. That is,

α_ij = max_{a,b} s(p_j, Z_i[:, a, b])
In other words, this approach tries to find the “most similar patch” in each image with respect to the patch corresponding to the prototype p_j. For example, say a prototype p_j corresponds to a patch containing a tiger’s head; then column j of the affinity matrix should intuitively have a high value in row i if the image x_i also contains some patch with a tiger’s head, and correspondingly, a low value in row i if the image does not contain that concept. This is illustrated in Figure 5, where there is a clear separation between the class images based on affinity scores assigned by the prototypes. Hence, this technique clusters images having similar visually grounded semantic features by assigning high affinity scores to prototype-image pairs of the same class, and low affinity scores to prototype-image pairs of different classes. Since we can also backtrack from prototypes and matched vectors to patches in the image, this approach has the added advantage of being interpretable. A human can visually inspect the concept prototypes in the pixel space as well as the image patches that induce high affinity scores in order to potentially refine the results.
For computing the affinity, we use the cosine similarity metric as the similarity function:

s(u, v) = (u · v) / (‖u‖ ‖v‖)

If the CNN model uses ReLU activation, all values of Z are non-negative. Hence, using the cosine similarity metric, we get affinity scores in the range [0, 1].
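Combining the max-over-patches rule with cosine similarity, an illustrative NumPy sketch of the affinity computation (names and test values are ours):

```python
import numpy as np

def affinity(Z, prototype):
    """Max cosine similarity between `prototype` (a length-c vector) and
    every channel-axis vector of the filter map Z (shape (c, h, w))."""
    c = Z.shape[0]
    vecs = Z.reshape(c, -1).T                  # (h*w, c): one vector per patch
    eps = 1e-12                                # guard against zero-norm patches
    sims = vecs @ prototype / (
        np.linalg.norm(vecs, axis=1) * np.linalg.norm(prototype) + eps)
    return float(sims.max())                   # best-matching patch wins

Z = np.zeros((2, 2, 2))
Z[:, 0, 0] = [1.0, 0.0]
Z[:, 1, 1] = [1.0, 1.0]
p = np.array([1.0, 1.0])
# The patch at (1, 1) matches p exactly, so the affinity is ~1.
assert abs(affinity(Z, p) - 1.0) < 1e-6
```

Because ReLU feature maps are non-negative, the returned score indeed stays in [0, 1].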
Discussion. In our particular instantiation of the GOGGLES pipeline, we use a pre-trained VGG-16 model [Simonyan and Zisserman, 2014], and take the output from the last 2D max-pooling layer to parameterize Z, with c = 512. This choice is based on the intuition that VGG-16, which has been trained for the classification task on the ImageNet dataset, can encode higher-order semantic abstractions in its last layer even if we show it images from a domain it was not trained on. However, the underlying paradigm of affinity coding which we introduce in this work is not tied to any particular CNN architecture. This choice can in fact be replaced by unsupervised CNN models such as autoencoders that can be trained to reproduce the images we want to label, which we leave for future work.
In summary, our approach automatically identifies semantically meaningful prototypes from the dataset, and leverages these prototypes for providing proxy discriminative signals in the form of affinity scores. However, these signals may be noisy and conflicting. In the next section, we propose a probabilistic approach to denoise these signals and learn probabilistic labels for the dataset.
In this section, we describe GOGGLES’ class inference module. For most of our discussion, we focus on the binary labeling task for clarity. We also show how we extend our method to the multi-class labeling task at the end of this section.
Given the affinity matrix A ∈ R^{N×M}, where each row represents an unlabeled data point and each column represents a prototype extracted from these data points, GOGGLES proceeds to assign a probabilistic label ŷ_i for every example x_i. On the surface, the affinity matrix A produced by GOGGLES is similar to the labeling matrix L produced by Snorkel [Ratner et al., 2017], in the sense that each column of A can be seen as a weak supervision signal and the goal is to de-noise these signals for label inference. However, doing class inference on A is more challenging than de-noising L for the following reasons:
The entries in L are discrete values (class labels, or a null value for abstention), while the affinity scores produced by the GOGGLES framework are continuous values in the range [0, 1]. This means that any probabilistic model used for class inference needs to handle learning and inference efficiently on continuous variables. The factor graph model employed by Snorkel assumes discrete variables and cannot be directly adapted to handle continuous variables. At the same time, our continuous affinity scores are actually more expressive in modelling weak signals than discrete scores, and if de-noised properly, should produce higher quality labels, as we show in Section 5.
The value L_ij provides a direct signal for the class assignment. For example, L_ij = 1 is a direct signal from weak supervision source j suggesting that x_i belongs to class 1. On the other hand, a high value of α_ij (e.g., a score close to 1) in A does not give a direct indication of the label assignment of x_i; it only suggests that x_i is likely to belong to the same class as the example from which the prototype p_j was derived. This additional uncertainty is a complication due to the affinity matrix design that Snorkel does not need to handle.
These difficulties might suggest turning to unsupervised clustering methods such as k-means clustering, but we follow a more principled approach [Raykar et al., 2009] to resolving the uncertainty over the supervision sources. We recognize that the class membership of each prototype is implicitly tied to the class label of the data point that the prototype is derived from. We denote the class membership of column j with z_j, and propose a probabilistic framework for inferring the class label for each data point x_i. We express our proposed formulation for binary classification first, and then extend it to the multi-class case.
In the following, we first introduce a generative model in Section 4.1 that models every prototype in order to handle the above difficulties. Based on the generative model, we formulate the label inference problem as a maximum likelihood estimation problem in Section 4.2. We propose an expectation-maximization approach for solving the problem in Section 4.3.
For every prototype p_j, we introduce a generative model that generates the affinity scores for all images. Let v_ij be a variable denoting the affinity score between an instance x_i and the prototype p_j. Let y_i be the actual unobserved class label for x_i. Each prototype provides some information about the hidden true class label y_i. If the true label is one, namely, y_i = 1, then v_ij follows a Gaussian distribution parameterized by (μ_j1, σ_j1). If the true label is zero, namely, y_i = 0, then v_ij follows another Gaussian distribution parameterized by (μ_j0, σ_j0). Putting them together, we formulate the generative model for v_ij using two Gaussians as follows:

v_ij | (y_i = 1) ∼ N(μ_j1, σ_j1²)    and    v_ij | (y_i = 0) ∼ N(μ_j0, σ_j0²)
As discussed earlier, a complication we have is that we do not know the class membership of each prototype p_j, which we denote as z_j. If z_j = 1, then we expect N(μ_j1, σ_j1²) to have a higher mean and smaller variance than N(μ_j0, σ_j0²); otherwise, we expect the opposite. Luckily, this complication can be circumvented by using the EM formulation (c.f. Section 4.3). This is because EM is an iterative algorithm, and at every iteration, we have an assignment of labels for all instances x_i. As all the prototypes are derived from the instances, we also have an assignment for every z_j. Therefore, the z_j's are implicitly tied to the y_i's, and we do not need to model uncertainty over the z_j's separately.
Given the generative model for prototype p_j and a current assignment of class labels (and hence of z_j), the question is what is the probability of observing a particular value v_ij, denoted P(v_ij | y_i). Depending on the assignment of z_j, we give the calculation as follows:

If z_j = 1:  P(v_ij | y_i = 1) = Φ_j1(v_ij)   and   P(v_ij | y_i = 0) = 1 − Φ_j0(v_ij)    (7)
If z_j = 0:  P(v_ij | y_i = 1) = 1 − Φ_j1(v_ij)   and   P(v_ij | y_i = 0) = Φ_j0(v_ij)    (8)

Here, Φ_j1 is the cumulative distribution function (CDF) of N(μ_j1, σ_j1²) and Φ_j0 is the CDF of N(μ_j0, σ_j0²).
Figure 5 shows the Gaussian distributions learned by our approach on the given affinity matrix. Let us consider the case when z_j = 0. This corresponds to the affinity scores marked as Class 0 Prototypes in the figure, and correspondingly, the histogram shown on the left. Following (7) and (8), P(v_ij | y_i = 1) corresponds to the area to the right of the observed score for the Gaussian distribution shown in orange, and P(v_ij | y_i = 0) corresponds to the area to the left of the observed score for the Gaussian distribution shown in blue. Hence, P(v_ij | y_i = 1) will give a high probability to low scores, and P(v_ij | y_i = 0) will give a high probability to high scores.
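A minimal standard-library sketch of this CDF-based computation, under our reading of (7) and (8) (the function names and the numeric parameters are illustrative; the Gaussian CDF is expressed through `math.erf`):

```python
import math

def norm_cdf(v, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at v."""
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))

def prob_given_label(v, y, z_j, mu1, s1, mu0, s0):
    """P(v | y) for prototype j with class membership z_j:
    the CDF (area to the left) rewards scores matching z_j's own class,
    the complementary CDF (area to the right) rewards the other class."""
    mu, s = (mu1, s1) if y == 1 else (mu0, s0)
    if y == z_j:                  # same class as the prototype: high score likely
        return norm_cdf(v, mu, s)
    return 1.0 - norm_cdf(v, mu, s)

# For a class-1 prototype (z_j = 1), a high affinity score should favor y = 1.
hi = prob_given_label(0.9, 1, 1, mu1=0.8, s1=0.1, mu0=0.3, s0=0.1)
lo = prob_given_label(0.2, 1, 1, mu1=0.8, s1=0.1, mu0=0.3, s0=0.1)
assert hi > lo
```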
Given the observed affinity matrix A, we want to estimate the parameters Θ = (π, μ₁, μ₀, σ₁, σ₀) that maximize the data likelihood, where μ₁ (and the other three bold parameters) denotes a vector of per-prototype parameters, e.g., μ₁ = (μ_11, ..., μ_M1). Given that all instances are independent, the likelihood function can be written as follows:

P(A | Θ) = Π_{i=1}^{N} P(v_i | Θ)

where v_i denotes the i-th row of A.
Conditioning on the true label, we get

P(v_i | Θ) = Σ_{k ∈ {0,1}} P(y_i = k) · P(v_i | y_i = k, Θ)
We use the loose assumption that the prototypes are independent given a label. This assumption mostly holds, as most prototypes are derived from different instances. Therefore, we define

P(v_i | y_i = k, Θ) = Π_{j=1}^{M} P(v_ij | y_i = k, Θ)

where we also define π_k = P(y_i = k), which captures the percentage of examples having class label k.
We find the maximum likelihood estimator by maximizing the log-likelihood:

Θ* = argmax_Θ Σ_{i=1}^{N} log ( Σ_k π_k Π_j P(v_ij | y_i = k, Θ) )
Given Θ*, the probabilistic label for x_i is thus ŷ_i = P(y_i = 1 | v_i, Θ*).
We can compute the maximum likelihood estimate using the Expectation Maximization (EM) technique [Dempster et al., 1977], by treating the class labels as the latent variables. The complete data log-likelihood can be written as:

log P(A, y | Θ) = Σ_i Σ_k 1[y_i = k] ( log π_k + Σ_j log P(v_ij | y_i = k, Θ) )    (14)
Each iteration of the expectation maximization algorithm comprises an (E)xpectation step and a (M)aximization step. In the E-step, given the current model parameters, we compute an estimate of the true label for each image. In the M-step, given the label estimates, we recompute the model parameters. These two steps are performed iteratively until the estimates converge, i.e., the estimated labels do not change any more.
E Step. Given the current estimate of model parameters Θ and the affinity matrix A, the expectation of the complete data log-likelihood is computed as

Q(Θ) = Σ_i Σ_k P(y_i = k | v_i, Θ) ( log π_k + Σ_j log P(v_ij | y_i = k, Θ) )    (16)

Using Bayes’ theorem, the posterior is

P(y_i = k | v_i, Θ) = π_k Π_j P(v_ij | y_i = k, Θ) / Σ_{k'} π_{k'} Π_j P(v_ij | y_i = k', Θ)
M Step. Given the probabilistic labels and the affinity matrix , the model parameters are then re-computed by maximizing (16).
Finding an analytic solution of Θ that maximizes (16) is challenging. Hence, we perform random sampling to get discrete labels for every image, where each image x_i has probability P(y_i = 1 | v_i, Θ) of being assigned label 1. Given the discrete assignments of labels, maximizing (16) becomes equivalent to maximizing (14), with π_k being the percentage of images that have label k in the assignments. We then fit the two normal distributions according to the two sets of instances, depending on the discrete labels, to update our parameters.
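The whole loop can be sketched end-to-end on a toy one-prototype affinity column (an intentional simplification with illustrative names: Gaussian densities stand in for the CDF formulation, soft posteriors replace the sampling step, and a median split stands in for the k-means initialization):

```python
import numpy as np

def em_binary(scores, iters=30):
    """Toy EM over a single column of affinity scores: fit a two-component
    Gaussian mixture and return P(y_i = 1) for every instance."""
    # Initialization: split around the median (stand-in for 2-means clustering).
    y = (scores > np.median(scores)).astype(float)
    for _ in range(iters):
        # M-step: refit the class prior and the two Gaussians from the labels.
        pi1 = y.mean()
        mu1 = np.average(scores, weights=y)
        mu0 = np.average(scores, weights=1 - y)
        s1 = np.sqrt(np.average((scores - mu1) ** 2, weights=y)) + 1e-6
        s0 = np.sqrt(np.average((scores - mu0) ** 2, weights=1 - y)) + 1e-6
        # E-step: posterior P(y_i = 1 | v_i) by Bayes' rule.
        p1 = pi1 * np.exp(-0.5 * ((scores - mu1) / s1) ** 2) / s1
        p0 = (1 - pi1) * np.exp(-0.5 * ((scores - mu0) / s0) ** 2) / s0
        y = p1 / (p1 + p0)
    return y

scores = np.array([0.9, 0.85, 0.8, 0.2, 0.15, 0.1])  # two clear clusters
labels = em_binary(scores)
assert all(labels[:3] > 0.5) and all(labels[3:] < 0.5)
```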
Initialization of the EM process. The EM process needs an initialization, i.e., an initial guess of the labels y_i. We perform a k-means clustering (2-means clustering for binary class inference) on the examples x_i, using row i of the affinity matrix A as the feature vector for x_i. The assumption is that, for two images x_i and x_i', if they belong to the same class, then they are more likely to share similar values α_ij and α_i'j for every prototype p_j, regardless of which class p_j was derived from.
We now describe how we extend the aforementioned approach for assigning class memberships in multi-class labeling tasks. Let K denote the number of classes (labels) in our labeling task. We still need a generative model for v_ij, except now we need K different Gaussian distributions depending on the hidden true label of an instance, that is,

v_ij | (y_i = k) ∼ N(μ_jk, σ_jk²),   for k ∈ {1, ..., K}
The probability of observing a particular value v_ij, given the current assignment of labels (and hence of z_j), is thus calculated as follows:

P(v_ij | y_i = k) = 1[z_j = k] · Φ_jk(v_ij) + 1[z_j ≠ k] · (1 − Φ_jk(v_ij))

where Φ_jk is the cumulative distribution function (CDF) of N(μ_jk, σ_jk²) and 1[·] is the indicator function.
Given the new generative model, we can thus define P(v_i | y_i = k) as follows:

P(v_i | y_i = k) = Π_{j=1}^{M} P(v_ij | y_i = k)
Hence, the new likelihood function is:

P(A | Θ) = Π_{i=1}^{N} Σ_{k=1}^{K} π_k P(v_i | y_i = k)

where π_k is the percentage of instances having label k and Σ_{k=1}^{K} π_k = 1.
The EM algorithm for maximizing the new likelihood function mostly stays the same. For the initial class assignments, we run the k-means clustering algorithm with the number of clusters equal to K.
Discussion. The class inference module technically produces a clustering of the unlabeled instances, where each cluster corresponds to a class in the multi-class labeling task. However, we do not automatically know which class label a particular cluster corresponds to. We assume that a domain expert will look at each cluster and decide the cluster-to-class mapping, which should be a very simple task.
(Table: dataset statistics — dataset, class count, dataset size, VGG-16.)
We compare GOGGLES, our system for implementing the affinity coding paradigm, to the Snorkel system that implements the data programming paradigm. The main points we seek to validate in this section include: (1) the end-to-end labeling accuracy of GOGGLES for the binary and multi-class labeling tasks; (2) comparing the weak supervision sources used for labeling, i.e., labeling functions in Snorkel vs. the affinity matrix in GOGGLES; and (3) comparing the class inference component for de-noising, i.e., the graphical model based approach in Snorkel vs. the EM-based approach in GOGGLES.
Datasets. We use the following two datasets for our experimental evaluation.
Caltech-UCSD Birds-200-2011 (CUB) dataset [Wah et al., 2011]: The CUB dataset comprises 11,788 images of 200 bird species. Each image is also provided with image-level attribute information, such as white head and grey wing. The attribute annotations help explain the visual characteristics of each image. We treat these human annotations as a proxy for labeling functions when evaluating Snorkel.
Animals with Attributes 2 (AwA2) dataset [Xian et al., 2018]: The AwA2 dataset comprises 37,322 images of 50 animal species with 85 class-level attribute annotations.
For binary labeling tasks, we randomly pick 10 pairs of classes from each dataset. For multi-class labeling tasks, we randomly pick 3 sets of 5 species from each dataset and report results for the first 3 classes, the first 4 classes, and all 5 classes in each set.
Competing Methods for binary labeling tasks. Both Snorkel and GOGGLES have two components: (1) constructing a matrix that includes all the weak supervision sources. Snorkel uses the labeling function based approach (LF), and GOGGLES uses the affinity matrix (AF) based approach. (2) performing class inference using the constructed matrix. Snorkel uses the probabilistic graphical model based approach (PGM), and GOGGLES uses the expectation-maximization based approach (EM). To compare the effects of these components on labeling performance, we construct and compare the following methods for training data generation.
VGG-16 as a labeler: Since we use VGG-16 as our CNN model to extract the concept prototypes, we compare its labeling accuracy against our algorithm's. For both binary and multi-class labeling tasks, we report results for VGG-16 whenever all the classes in the set are present in the ImageNet dataset on which the model was trained.
LF + PGM (Snorkel). This is Snorkel’s approach. However, Snorkel needs humans to write the LFs. Luckily, the CUB dataset contains human annotations of multiple class properties (e.g., whether an image has a blue tail), along with which class each property indicates. Hence, we use the human annotations in the CUB dataset to simulate the labeling functions needed by Snorkel's denoising technique. Note that this method is not applicable to the AwA2 dataset, as it does not have image-level human annotations.
LF + EM: This method swaps the PGM-based approach for de-noising with our EM-based approach. The purpose is to compare the de-noising performance of PGM vs. EM on a matrix that contains weak supervision signals provided by humans.
AF + PGM: This method tests the capabilities of using PGM for denoising the signal matrix provided by affinity coding. The PGM expects discrete labels in the matrix (with 0 denoting abstention), while our affinity matrix comprises continuous scores. Hence, we need to discretize the scores in the affinity matrix. One hurdle here is that we do not know the class membership of each column j a priori. Hence, we first assign a class label to each column by running the k-means algorithm over the columns with the rows as features, i.e., we run k-means on the transpose of the affinity matrix. Next, we calculate a threshold for each column by taking the mean of the minimum and maximum values in the column, and assign the class label of the column to each row having a score greater than the column threshold. Correspondingly, we assign the label 0 (uncertain) to rows having scores less than the column threshold. We denoise the resulting matrix using the PGM provided by Snorkel and report the results.
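The discretization steps above can be sketched as follows. Assumptions in this sketch: class ids are 1-based so that 0 can denote "uncertain", and the affinity matrix is a random stand-in.

```python
# Sketch of the AF + PGM discretization of a continuous affinity matrix.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
A = rng.random((100, 20))        # stand-in affinity matrix
n_classes = 2

# Step 1: assign a class to each column via k-means on the transpose.
col_classes = KMeans(n_clusters=n_classes, n_init=10,
                     random_state=0).fit_predict(A.T) + 1   # 1-based ids

# Step 2: per-column threshold = mean of the column's min and max.
thresholds = (A.min(axis=0) + A.max(axis=0)) / 2

# Step 3: scores above the threshold take the column's class; the rest
# become 0 (uncertain). Broadcasting applies the rule row by row.
discrete = np.where(A > thresholds, col_classes, 0)
```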
AF + KM: We also analyze the performance of the k-means (KM) clustering algorithm as a denoising technique on the generated affinity matrix for each set of species in both datasets. This is essentially the GOGGLES system without the EM step that iteratively refines the label assignments.
AF + EM (GOGGLES): We evaluate the performance of the expectation-maximization (EM) algorithm explained in Section 4.3 as a denoising technique on top of the constructed affinity matrix.
AF + EM (GOGGLES with random initialization): This is the same algorithm as GOGGLES, except that instead of using the results of k-means as the initial class assignments of the EM algorithm, we use random initialization. We include this method to verify the robustness of our EM algorithm.
Competing Methods for multi-class labeling tasks. As Snorkel does not provide an implementation for multi-class labeling tasks, we only compare the AF + KM method with the AF + EM (GOGGLES) method. We also report results for VGG-16 as a labeler when all the classes are found in the ImageNet dataset on which VGG-16 is trained.
Evaluation Metrics. Since the final labeling results from both Snorkel and GOGGLES are probabilistic, we need to convert them into discrete labels for comparison with the ground truth. For EM-based class inference methods, we simply use the discrete labels produced by the M-step in the last iteration of the EM algorithm. For PGM-based class inference methods, we follow Snorkel’s recommendation: we take the average of the minimum and maximum probabilities as the threshold, assign one class to instances with probabilities above the threshold, and the other class to the rest. The final labeling accuracy is the percentage of unlabeled instances whose discrete labels match the ground truth.
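The thresholding rule for PGM outputs can be sketched as follows; the probability values and the +1/-1 label encoding are illustrative stand-ins.

```python
# Sketch of Snorkel-style thresholding of probabilistic labels.
import numpy as np

probs = np.array([0.1, 0.4, 0.6, 0.9])  # stand-in probabilistic labels
truth = np.array([-1, -1, 1, 1])        # stand-in ground truth

# Threshold = average of the min and max probabilities.
threshold = (probs.min() + probs.max()) / 2
pred = np.where(probs > threshold, 1, -1)

# Labeling accuracy = fraction of instances matching the ground truth.
accuracy = float((pred == truth).mean())
```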
The code for computing the affinity matrix is vectorized end-to-end using PyTorch. In addition, since class inference is probabilistic for AF + KM, AF + EM (random), AF + EM (KM), and AF + PGM, we run the experiments 10 times and report the median accuracy over these runs for each experiment. However, PGM (Snorkel) takes a long time to denoise the affinity matrix for the AwA2 dataset; we therefore stop it after 24 hours and report results for 4 class pairs from only 1 run.
Table 1 shows the labeling accuracy comparisons for all the discussed methods. There are multiple observations from the results:
We show that GOGGLES consistently produces the best accuracy compared to the other methods. On the CUB dataset, GOGGLES's average labeling accuracy is 14.88% higher than Snorkel's. GOGGLES likewise achieves the best labeling accuracy on the AwA2 dataset.
We can see that methods using AF are much better than methods using LF as the weak supervision sources. This shows that our proposed affinity coding paradigm provides a much richer set of weak supervision signals than data programming, which follows directly from its design: every unlabeled image itself provides at least one supervision signal.
We also compare the two class inference components: PGM vs. EM. Comparing LF + PGM with LF + EM, we see no clear dominance in terms of de-noising human labeling functions: sometimes PGM is better than EM, and sometimes the reverse. Comparing AF + PGM with AF + EM, we see that EM is consistently better than PGM on the CUB dataset, while results are mixed on the AwA2 dataset. Note, however, the many '-' entries for AF + PGM on the AwA2 dataset, which indicate that PGM exceeded our preset 24-hour running time limit.
We have also compared AF + KM with AF + EM (GOGGLES) to verify the benefits of running EM on top of k-means. Except for two cases with small accuracy degradation (class pairs 35, 37 and 32, 41 on the AwA2 dataset), EM always improves the labeling accuracy of k-means.
We also compared AF + EM (GOGGLES) with AF + EM (GOGGLES with random initialization). They produce similar results, which suggests that our EM algorithm is able to iteratively refine class assignments and does not depend on a particular seed initialization. We choose k-means initialization as the default in GOGGLES because it already provides a good starting point, so EM can converge much faster.
[Table 2: labeling accuracy for multi-class labeling tasks. Columns: Dataset, Class #, Dataset Size, VGG, AF + KM, ...]
Table 2 shows the results for multi-class labeling tasks. We observe the following:
GOGGLES achieves an average labeling accuracy of 92.93% on the CUB dataset and 83.44% on the AwA2 dataset across the randomly selected 3-class, 4-class, and 5-class labeling tasks.
As the number of classes gets higher, the labeling accuracy usually goes down. This is expected as the labeling tasks are harder for more classes.
Comparing AF + KM with AF + EM (GOGGLES), we see that EM provides a small improvement over KM. This suggests that (1) the biggest factor in our labeling performance is the affinity matrix itself, and (2) EM does in fact further improve upon k-means.
In this section, we highlight how the ML community tackles the challenges associated with insufficient training data, describe the differences between GOGGLES and data programming, and give overviews of other related work in the data management community.
Model Training with Insufficient Data.
The ML community has dealt with the problem of insufficient training data in a variety of ways. Active learning techniques aim to involve human labelers judiciously, obtaining maximal benefit while minimizing labeling cost [Settles, 2012]. Semi-supervised learning techniques train models with some labeled data and a much larger set of unlabeled data [Zhu, 2005]; they usually leverage various assumptions about the data, such as smoothness and low-dimensional structure. Transfer learning uses models trained on other tasks with abundant labeled data to help train models on new tasks with less labeled data [Pan and Yang, 2010]. The end products of these approaches are the final predictive models, so they usually integrate the model training process with the labeling process. In contrast, we are not tied to any downstream model, and aim purely at producing labels for the unlabeled set that can be used to train any model.
Data programming, implemented in the Snorkel system, is a recent proposal that assists users in programmatically generating training data, and is the most relevant work to ours. In Snorkel, users need to write many labeling functions, where each labeling function provides labels for a subset of the unlabeled data. The labeling functions essentially allow users to encode any heuristic rules that may be useful for labeling data. Applying user-written labeling functions to all unlabeled data generates a labeling matrix. Given a labeling matrix, Snorkel uses the agreements and disagreements of the labeling functions on different data points to learn the accuracy and dependencies of labeling functions via factor graph models, and then produces the final probabilistic labels. Though GOGGLES and Snorkel share similar structures (GOGGLES’s affinity matrix corresponds to Snorkel’s labeling matrix, and GOGGLES’s class inference component corresponds to Snorkel’s matrix de-noising component), there are multiple critical differences between them: (1) GOGGLES does not need any user input, except for the affinity functions that developers need to code; once coded, however, they can be reused for labeling future datasets; (2) Snorkel still requires a small set of labeled data (termed the development set) to train the factor graph model for matrix de-noising, whereas GOGGLES requires zero labeled data and performs automatic class inference; and (3) while Snorkel uses complicated factor graph models for matrix de-noising, GOGGLES features a much simpler EM algorithm for class inference, which surprisingly produces better results, as shown in Section 5.
Related Work in the Database Community. Many research problems in the database community share similar technical challenges with our work. In particular, data fusion/truth discovery [Pochampally et al., 2014, Rekatsinas et al., 2017b], crowdsourcing [Das Sarma et al., 2016], and data cleaning [Rekatsinas et al., 2017a], in one form or another, all need to reconcile information from multiple sources to reach one answer. While the information sources are assumed as input in these problems, labeling training data faces the challenge of lacking enough information sources. In fact, one primary contribution of GOGGLES is the affinity coding paradigm, in which each unlabeled data point becomes an information source. On the other hand, our EM-based class inference approach is inspired by many of the reconciliation techniques proposed in the database community.
Related Work in the Vision Community. CNNs have been producing state-of-the-art results for image recognition tasks, spurring further research into the discriminative power of their filter banks. The highly sparse deep layers of VGG-16 retain the most discriminative information in the representation space, as opposed to AlexNet, which holds background information due to its larger number of max-pool layers and resulting low sparsity [Yu et al., 2016]. The DeepCluster method [Caron et al., 2018] performs discriminative learning by assigning pseudo-labels using k-means clustering and then learning them in the network by optimizing weights through backpropagation with a classification loss. However, that work treats the input as a whole, whereas our work tries to identify spatially localized concepts in the input.
In this paper, we proposed affinity coding, a new paradigm for automatic generation of training data. Affinity coding is based on the proposition that instances belonging to the same class share certain similarities that instances belonging to different classes do not share. We also proposed the GOGGLES system that implements the affinity coding paradigm for image datasets. GOGGLES also includes a novel algorithm for class inference given an affinity matrix.
As far as we know, GOGGLES is the first system that performs training data generation automatically. There are many interesting follow-up research directions: (1) how to further increase the labeling accuracy, especially for multi-class labeling tasks; (2) whether we can design other (better) ways to code the affinities for images; and (3) how to apply affinity coding to generate training data for labeling tasks on other types of data, such as text and structured data.
Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018.
Burr Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2012.