Adversarial Data Programming: Using GANs to Relax the Bottleneck of Curated Labeled Data

03/14/2018 ∙ by Arghya Pal, et al. ∙ Indian Institute of Technology Hyderabad

Paucity of large curated hand-labeled training data for every domain-of-interest forms a major bottleneck in the deployment of machine learning models in computer vision and other fields. Recent work (Data Programming) has shown how distant supervision signals in the form of labeling functions can be used to obtain labels for given data in near-constant time. In this work, we present Adversarial Data Programming (ADP), an adversarial methodology that generates data as well as a curated aggregated label, given a set of weak labeling functions. We validated our method on the MNIST, Fashion MNIST, CIFAR 10 and SVHN datasets, where it outperformed many state-of-the-art models. We conducted extensive experiments to study its usefulness, and showed how the proposed ADP framework can be used for transfer learning as well as multi-task learning, where data from two domains are generated simultaneously along with the label information. Our future work will involve understanding the theoretical implications of this new framework from a game-theoretic perspective, as well as exploring the performance of the method on more complex datasets.


1 Introduction

Curated labeled data is a key building block of modern machine learning algorithms, and a driving force for deep neural network models. The large parameter space of deep models requires very large labeled datasets to build effective models that work in practice. However, this inherited dependency on large curated labeled data has become the major bottleneck of progress in the use of machine learning and deep learning in computer vision and other domains [41]. Creation of large-scale hand-annotated datasets in every domain is a challenging task due to the requirement for extensive domain expertise, long hours of human labour and time, which collectively make the overall process expensive and time-consuming. Even when data annotation is carried out using crowdsourcing (e.g. Amazon Mechanical Turk), additional effort is required to measure the correctness (or goodness) of the obtained labels. We seek to address this problem in this work. (This paper is accepted in CVPR 2018.)

In particular, we focus on automatically learning the parameters of a given joint image-label probability distribution (as provided in training image-label pairs) with a view to automatically create labeled datasets. To achieve this objective, we exploit the use of distant supervision signals to generate labeled data. These distant supervision signals are provided to our framework as a set of weak labeling functions which represent domain knowledge or heuristics obtained from experts or crowd annotators. Writing a set of labeling functions (as we found in our experiments) is fairly easy and quick, and the functions can then be used in our framework to generate data as well as associated labels. More interestingly, such labeling functions are often easily generalizable, thus allowing our framework to be extended to transfer learning and multi-task learning (discussed in Section 5). Figure 1 shows a few examples of our results to illustrate the overall idea.

Figure 1: (a) Sample results of image-label pairs generated using the proposed ADP framework trained on CIFAR-10, MNIST and SVHN datasets (top to bottom respectively). Note that the label is generated by our model; (b) Demonstration of cross-domain multi-task learning using ADP, where the same model generates data from both Fashion MNIST and LookBook datasets (Section 5). Note that Fashion MNIST is grayscale while LookBook is color, and the model still generates both effectively; (c) Demonstration of transfer learning of ADP from the MNIST dataset (source domain) to generate image-label pairs on the SVHN dataset (target domain).

In practice, labeling functions can be associated with two kinds of dependencies: (i) relative accuracies, which measure the correctness of the labeling functions w.r.t. the true class label; and (ii) inter-function dependencies, which capture the relationships between the labeling functions with respect to the predicted class label. In this work, we propose a novel adversarial framework using Generative Adversarial Networks (GANs) that learns these dependencies along with the data distribution using a minimax game. Our GAN learns to generate a joint data-label distribution using a generator block, a discriminator block and a Labeling Functions Block (LFB), which contains another discriminator that helps in learning the two kinds of dependencies mentioned above. The overall architecture of the proposed ADP framework is presented in Figure 2a.

Our broad idea of learning relative accuracies and inter-function dependencies of labeling functions is inspired by the recently proposed Data Programming (DP) framework [36] (and hence, the name ADP), but our method is different in many ways: (i) DP is a strictly conditional model (i.e. it models $P(y|x)$) that requires additional unlabeled data points even at test time, while our model is a joint distribution model, i.e. it models $P(x, y)$, and does not require any additional unlabeled data points at test/generation time. (ii) DP learns a generative model using Maximum Likelihood Estimation (MLE) and gradient descent to learn the relative accuracies of labeling functions. We instead replace this approach with a GAN-based adversarial estimation of parameters. [11] and [42] provide insights on the advantage of using a GAN-based estimator over MLE to achieve a relatively quicker training time and good robustness of generated samples. (iii) To learn the statistical dependencies of labeling functions, DP models the dependency structure of labeling functions as a factor graph, and uses computationally expensive Gibbs sampling techniques to update the gradient in each step. We replace the factor graph and Gibbs sampling-based estimation of inter-function dependencies with another discriminator in our GAN-based estimation, which again speeds up the learning process and provides robust generation at run-time.

As our outcomes of this work, we show how a set of low-quality, weak labeling functions can be used within a framework that models a joint data-label distribution to generate robust samples. We also show that this idea can be generalized quite easily to transfer learning and multi-task learning settings, showing the generalizability of this work. Our contributions can be summarized as follows:


  • We propose a novel adversarial framework, ADP, to generate robust data-label pairs that can be used to obtain datasets in domains that have very little data, thus saving human labour and time.

  • We show how an adversarial framework can be used to learn dependencies between weak labeling functions and thus provide high-fidelity aggregated labels along with generated data in a GAN setting.

  • The proposed framework can also be used in a transfer learning setting where ADP can be trained on a source domain, and then finetuned on a target domain to then generate data-label pairs in the target domain.

  • We also show the potential of this ADP framework to generate cross-domain data in a multi-task setting, where images from two domains are generated simultaneously by the model along with the labels.

2 Related Work

Data augmentation seems a natural answer to the scarcity of curated hand-labeled training data. However, heuristic data augmentation techniques like [15] and [19] use a limited form of class-preserving image transformations such as rotation, mirroring, addition of small noise, random crop, etc. Interpolation-based methods proposed in [13] and class-conditional models of diffeomorphisms proposed in [20] interpolate between nearest-neighbor labeled data points. The popular SMOTE algorithm [7] performs oversampling to reduce class imbalance and augment the given data. All of these methods depend heavily on hand-tuned parameters, the order of geometric transformations, the optimal value of transformation parameters, etc. A small change in parameters can often have a negative impact on final performance, as studied in [37], [10] and [15].

In this work, we choose a more intuitive way of creating labeled data by learning a joint distribution model. Learning a joint data-label distribution using generative models such as [14], [18], and [28] is non-trivial, since the label often requires domain knowledge and is not directly inferrable from data. Our proposed model hence uses distant supervision signals (in the form of labeling functions) to generate novel labeled data points. Distant supervision signals such as labeling functions are cheaper than manual annotation of each data point, and have been successfully used in recent methods such as [36]. Ratner et al. proposed a generative model in [36] that uses a fixed number of user-defined labeling functions to programmatically generate synthetic labels for data in near-constant time. DP outperformed a number of approaches such as multiple-instance learning [38], co-training [4], crowdsourcing [17], and ensemble-based weak-learner methods like boosting [40], thus reinforcing our choice in this work. Alfonseca et al. [1] generated additional training data using hierarchical topic models for weak supervision. Heuristics for distant supervision are also proposed in [6], but this method does not model the inherent noise associated with such heuristics. Structure learning [43][37] also exploits the use of distant supervision signals for generating labels, but as described in Section 1, these methods, like [36], require unlabeled test data to generate a labeled dataset. Additionally, [36], [37] and [43] are computationally expensive due to their use of Gibbs sampling in MLE.

We instead use an adversarial approach to learn the joint distribution by weighting a set of domain-specific labeling functions using a Generative Adversarial Network (GAN). A GAN ([18]) approximates the real data distribution by optimizing a minimax objective function and thus generates novel out-of-sample data points. Broadly, GANs can be viewed in terms of three manifestations: (i) GANs can be trained to sample from a marginal distribution $P(x)$ ([12], [35], [2]), where $x$ refers to data. (ii) Recent efforts in the literature such as Conditional GAN [31], Auxiliary Classifier GAN [34] and InfoGAN [9] show training of GANs conditioned on class labels, so as to sample from a conditional distribution, i.e. $P(x|y)$. Other state-of-the-art models with similar objectives have exploited other modalities for the same purpose; for example, Zhang et al. [49] propose a GAN conditioned on images, while Hu et al. [21] propose a GAN conditioned on text. (iii) There have been a few very recent efforts ([46], [51] and [22]) which attempt to train GANs to sample from a joint distribution. For example, CoGAN ([29]) introduces a parameter-sharing approach to learn an unpaired joint distribution between two domains, while TripleGAN [27] brings together a classifier along with the discriminator and generator, which helps in a semi-supervised setting. In this work, we propose a novel idea to instead use distant supervision signals to learn the joint distribution of labeled images. We now describe the proposed methodology.

Figure 2: (a) Overall architecture of the Adversarial Data Programming (ADP) framework; (b) Example of a set of labeling functions

3 Adversarial Data Programming (ADP): Methodology

Our central aim in this work is to learn the parameters of a probabilistic model:

$P_\theta(x, y)$   (1)

that captures the joint distribution over the data $x$ and the corresponding labels $y$, thus allowing us to generate out-of-sample data points along with their corresponding labels (we focus on images in the rest of this paper).

While recent efforts such as [29] and [16] have considered complementary objectives, they largely focused on learning joint probability distributions in cross-domain understanding settings. In this work, we focus on learning the joint image-label probability distribution with a view to automatically create labeled datasets, by exploiting the use of distant supervision signals to generate labeled data. To the best of our knowledge, this is the first such work that invokes distant supervision while learning the joint distribution $P(x, y)$, so as to generate labeled data points at scale from it. Besides, automatic generation of labels for data based on training data-label pairs is non-trivial, and often does not work directly. Distant supervision provides us a mechanism to achieve this challenging goal. We encode distant supervision signals as a set of (weak) definitions by annotators, using which unlabeled data points can be labeled. These definitions can be harvested from knowledge bases, domain heuristics, ontologies, rules-of-thumb, educated guesses, decisions of weak classifiers, or obtained using crowdsourcing. Many application domains have such distant supervision available through domain knowledge or heuristics, which can be leveraged in the proposed framework. We provide examples in Section 4 when we describe our experiments.

We encapsulate all available distant supervision signals, henceforth called labeling functions, in a unified abstract container called the Labeling Functions Block (LFB, see Figure 2a). Let the LFB comprise labeling functions $\ell_1, \ldots, \ell_m$, where each labeling function $\ell_i$ is a mapping:

$\ell_i : \mathcal{X} \rightarrow [0, 1]^k$   (2)

that maps a data point $x \in \mathcal{X}$ to a $k$-dimensional probabilistic label vector $\ell_i(x)$, where $k$ is the number of class labels and the entries of $\ell_i(x)$ are non-negative and sum to one. For example, $x$ could be thought of as an image from the MNIST dataset, and $\ell_i(x)$ would be the corresponding label vector when the labeling function $\ell_i$ is applied to $x$; $\ell_i(x)$, for instance, could be the one-hot 10-dimensional class vector (see Figure 2b).
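To make this interface concrete, the following minimal sketch (not the authors' code; the heuristic and function names are hypothetical) shows a labeling function that maps an image to a probabilistic label vector over k classes:

```python
import numpy as np

def intensity_heuristic_lf(image: np.ndarray, k: int = 10) -> np.ndarray:
    """A toy labeling function: starts from a uniform score over k classes,
    boosts one class via a crude intensity-based heuristic, and returns a
    k-dimensional probabilistic label vector that sums to 1."""
    scores = np.ones(k)                      # uniform prior over classes
    bucket = int(image.mean() * k) % k       # crude intensity-based guess
    scores[bucket] += 4.0                    # boost the guessed class
    return scores / scores.sum()             # normalize to a distribution

# Example: a 28x28 grayscale image with values in [0, 1]
x = np.full((28, 28), 0.35)
vec = intensity_heuristic_lf(x)
```

Any function with this signature, however weak its heuristic, can be dropped into the LFB as one of the m labeling functions.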

We characterize the set of labeling functions with two kinds of dependencies: (i) relative accuracies of the labeling functions with respect to the true class label of a given data point; and (ii) inter-function dependencies that capture the relationships between the labeling functions with respect to the predicted class label. To obtain a final label for a given data point using the LFB, we use two different sets of parameters, $\theta_{acc}$ and $\theta_{dep}$, to capture each of these dependencies between the labeling functions. We hence denote the Labeling Functions Block (LFB) as:

$y = \mathrm{LFB}(x; \ell_1, \ldots, \ell_m, \theta_{acc}, \theta_{dep})$   (3)

i.e. given a set of labeling functions $\{\ell_i\}$, a set of parameters capturing the relative accuracy-based dependencies between the labeling functions, $\theta_{acc}$, and a second set of parameters capturing inter-function dependencies, $\theta_{dep}$, the LFB provides a probabilistic label vector $y$ for a given data input $x$.

The joint distribution we seek to model in this work (Equation 1) hence becomes:

$P_\theta(x, y) = P_\theta\big(x, \mathrm{LFB}(x; \ell_1, \ldots, \ell_m, \theta_{acc}, \theta_{dep})\big)$   (4)

In the rest of this section, we show how we can learn the parameters of the above distribution modeling image-label pairs using an adversarial framework with a high degree of label fidelity. We use Generative Adversarial Networks (GANs) to model the joint distribution in Equation 4. In particular, we provide a mechanism to integrate the LFB (Equation 3) into the GAN framework, and show how $\theta_{acc}$ and $\theta_{dep}$ can be learned through the framework itself. Our adversarial loss function is given by:

$\min_G \max_D \; \mathbb{E}_{(x, y) \sim p_{data}}[\log D(x, y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$   (5)

where $G$ is the generator module and $D$ is the discriminator module. The overall architecture of the proposed ADP framework is shown in Figure 2a.
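As a hedged illustration of this minimax objective (a numerical sketch, not the training code; `gan_objective` is a hypothetical helper), the value the discriminator maximizes and the generator minimizes can be estimated over a mini-batch from the discriminator's probability outputs:

```python
import numpy as np

def gan_objective(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Mini-batch estimate of E[log D(x, y)] + E[log(1 - D(G(z)))],
    where d_real / d_fake hold the discriminator's probabilities on
    real and generated image-label pairs respectively."""
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# D is rewarded for pushing d_real -> 1 and d_fake -> 0; G plays the opposite.
good_d = gan_objective(np.array([0.9, 0.95]), np.array([0.05, 0.1]))
poor_d = gan_objective(np.array([0.6, 0.55]), np.array([0.5, 0.45]))
```

A sharper discriminator yields a larger objective value, which is exactly the signal the generator trains against.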

This approach has a few advantages: (i) labeling functions (which can even be just loosely defined) are cheaper to obtain than collecting labels for a large dataset; (ii) labeling functions can help bring domain knowledge into such generative models; (iii) labeling functions act as an implicit regularizer in the label space, thus allowing good generalization; (iv) with a small amount of fine-tuning, labeling functions can be easily re-purposed for new domains (transfer learning), as we describe later in this paper.

The ADP architecture is designed to learn the parameters required to model the joint distribution in Equation 4, and thus generate out-of-sample image-label pairs. This architecture is broadly divided into three modules: the generator, discriminator and the LFB. We now describe each of these modules individually.

3.1 The ADP - Generator

Given a noise input $z$ and a set of labeling functions, the generator $G$ outputs an image $x_g$ and the parameters $\theta_{acc}$ and $\theta_{dep}$, the dependencies between the labeling functions described earlier. In particular, $G$ consists of three blocks, as shown in Figure 2a: a shared trunk, comprised only of fully connected (FC) layers, that captures the common high-level semantic relationships between the data and the label space; its output forks into two branches, an image branch that generates the image $x_g$, and a parameter branch that generates the parameters $\theta_{acc}$ and $\theta_{dep}$. While the parameter branch uses FC layers, the image branch uses Fully Convolutional (FCONV) layers to generate the image (more details in Section 4). Thus, the generator outputs $(x_g, \theta_{acc}, \theta_{dep})$ given input $z$ drawn from the standard normal distribution.

3.2 The ADP - Discriminator

The discriminator $D$ of ADP estimates the likelihood of an image-label input pair being drawn from the real distribution obtained from training data. $D$ takes a batch of image-label pairs as input and maps each pair to a probability score estimating the aforementioned likelihood. To accomplish this, $D$ has two branches, one for the image and one for the label (shown in the Discriminator block in Figure 2a). These two branches are not coupled in the initial layers, so as to separately extract the required low-level features. The branches share weights in later layers to extract joint semantic features that help classify correctly whether an image-label pair is fake or real.

We hence expand our objective function from Equation 5 to the following:

$\min_G \max_D \; \mathbb{E}_{(x, y) \sim p_{data}}[\log D(x, y)] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(x_g, y_g)\big)\big]$   (6)

where $x_g$ is the image generated by $G$ and $y_g$ is the corresponding aggregated label provided by the LFB.

3.3 The ADP - Labeling Function Block

This is a critical module of the proposed ADP framework. Our initial work revealed that a simple weighted (linear or non-linear) sum of the labeling functions does not perform well in generating out-of-sample image-label pairs. We hence used a separate adversarial methodology within this block to learn the dependencies, both relative accuracies and inter-function dependencies (discussed earlier in this section), between the labeling functions provided to the framework. We describe the components of the LFB below.

3.3.1 Relative Accuracies of Labeling Functions

The output $\theta_{acc}$ of the parameter branch of the ADP-Generator provides the relative accuracies of the labeling functions. Given the image $x_g$ generated by $G$, the labeling functions $\ell_1, \ldots, \ell_m$, and the probabilistic label vectors $\ell_i(x_g)$ obtained using the labeling functions (as in Eqn 2), we define the aggregated final label as:

$y_g = \sum_{i=1}^{m} \tilde{\theta}_{acc}^{(i)} \, \ell_i(x_g)$   (7)

where $\tilde{\theta}_{acc}$ is the normalized version of $\theta_{acc}$, i.e. $\sum_{i=1}^{m} \tilde{\theta}_{acc}^{(i)} = 1$. The aggregated label, $y_g$, is provided as an output of the LFB.
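The weighted aggregation above can be sketched in a few lines (a minimal illustration with hypothetical variable names, not the paper's implementation):

```python
import numpy as np

def aggregate_label(label_vectors: np.ndarray, rel_acc: np.ndarray) -> np.ndarray:
    """Eq. (7)-style aggregation: weight each labeling function's
    probabilistic label vector by its normalized relative accuracy.
    label_vectors: (m, k) array, one row per labeling function.
    rel_acc: (m,) array of non-negative relative accuracies."""
    w = rel_acc / rel_acc.sum()   # normalized weights sum to 1
    return w @ label_vectors      # (k,) aggregated label vector

# Three labeling functions vote over two classes; the first is trusted most.
lvs = np.array([[0.8, 0.2],
                [0.6, 0.4],
                [0.1, 0.9]])
y = aggregate_label(lvs, np.array([3.0, 1.0, 1.0]))
```

Because the weights and the individual label vectors each sum to one, the aggregated label remains a valid probability vector.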

3.3.2 Inter-function Dependencies

Our empirical studies with considering only the relative accuracies of labeling functions as a weighting mechanism led to mode collapse in the joint distribution space, a well-understood problem in GANs: either images of the same class with different labels, or images of different classes with the same label, were generated. The rationale behind using two discriminators is to penalize the missing modes. Related literature [36] shows that inter-function dependencies act as an implicit regularizer in the label space. We also conducted experiments on synthetic data to demonstrate this issue (please see the synthetic-data figure). We hence introduced an adversarial mechanism inside the LFB to influence the final relative accuracies, $\theta_{acc}$, using the inter-function dependencies between the labeling functions. $D_{LFB}$, a discriminator inside the LFB, receives two inputs: $\theta_{dep}$, which is output by the generator, and $\Phi$, which is obtained from the labeling function outputs using the procedure described in Algorithm 1.

      Input: Labeling functions $\ell_1, \ldots, \ell_m$; relative accuracies $\theta_{acc}$; output probability vectors $\ell_i(x)$ of the labeling functions
      Output: $\Phi$
Set $\Phi := I$;   /* $I$ = $m \times m$ identity matrix */
for $p := 1$ to $m$ do
       /* For each labeling function */
       for $q := p + 1$ to $m$ do
             /* For each other labeling function */
             /* If the one-hot encodings of the outputs of the two functions match, increment the $(p, q)$-th entry in $\Phi$ by 1 */
             if $\mathrm{onehot}(\ell_p(x)) = \mathrm{onehot}(\ell_q(x))$ then $\Phi(p, q) := \Phi(p, q) + 1$;
       end for
end for
for $p := 1$ to $m$ do
       $\Phi(p, \cdot) := \mathrm{normalize}(\Phi(p, \cdot))$;
end for
Set $\Phi(q, p) := \Phi(p, q)$ for all $p < q$;   /* Complete matrix using symmetry */
Algorithm 1: Procedure to compute $\Phi$
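One plausible reading of this procedure can be sketched as follows (a hedged sketch, not the authors' code; accumulating agreement counts over a mini-batch and normalizing only the filled upper-triangular part are our assumptions):

```python
import numpy as np

def interdependency_matrix(label_vectors: np.ndarray) -> np.ndarray:
    """Sketch of Algorithm 1: count, over a mini-batch, how often pairs of
    labeling functions agree on the one-hot (argmax) label, normalize each
    filled row, and complete the matrix by symmetry.
    label_vectors: (batch, m, k) probabilistic outputs of m labeling functions."""
    batch, m, _ = label_vectors.shape
    phi = np.eye(m)                          # start from the identity matrix
    votes = label_vectors.argmax(axis=2)     # (batch, m) hard label decisions
    for p in range(m):
        for q in range(p + 1, m):
            # count agreements between functions p and q across the batch
            phi[p, q] += np.sum(votes[:, p] == votes[:, q])
    for p in range(m):
        phi[p, p:] /= phi[p, p:].sum()       # normalize the filled row part
    # mirror the upper triangle into the lower triangle
    return np.triu(phi) + np.triu(phi, 1).T

lvs = np.random.default_rng(0).random((8, 3, 10))
phi = interdependency_matrix(lvs)
```

The resulting symmetric matrix serves as the "real" input to the discriminator inside the LFB.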

Algorithm 1 computes a matrix of interdependencies between the labeling functions, $\Phi$, by looking at the one-hot encodings of their predicted label vectors. If the one-hot encodings match for a given data input, we increase the count of their correlation by one, and compute this matrix across the particular mini-batch of data points under consideration. The counts are then normalized row-wise to obtain $\Phi$. The task of the discriminator $D_{LFB}$ is to recognize the computed interdependencies $\Phi$ as real, and the $\theta_{dep}$ generated through the network as fake. The gradient backpropagated through this discriminator to the parameter branch of the generator is critical as a regularizer in learning a better $\theta_{acc}$, which is finally used to weight the labeling functions (as in Section 3.3.1). Combining the gradient information from $D_{LFB}$ along with that from $D$ penalizes missing modes and helps to generate more variety in the samples. The objective function of our second adversarial module is hence:

$\min_G \max_{D_{LFB}} \; \mathbb{E}\big[\log D_{LFB}(\Phi)\big] + \mathbb{E}\big[\log\big(1 - D_{LFB}(\theta_{dep})\big)\big]$   (8)

where $\Phi$ and $\theta_{dep}$ are obtained as described above. More details of the LFB are provided in the implementation details in Section 4.

The overall architecture of ADP (Figure 2a) is trained using end-to-end backpropagation, with gradients from both discriminators, $D$ and $D_{LFB}$, influencing the weights learned inside the generator $G$. Mini-batches of image-label pairs from a given training distribution are provided as input to ADP, and Stochastic Gradient Descent (SGD) is used to learn the parameters of the model. At the end of training, we define the aggregated final label as:

$y_g = \sum_{i=1}^{m} \tilde{\theta}_{acc}^{(i)} \, \ell_i(x_g)$   (9)

The samples generated using the generator and LFB modules thus provide samples from the desired joint distribution (Eqn 1) modeled using the framework.

4 Experiments and Results

4.1 Datasets

We validated the ADP framework on standard datasets: MNIST ([26]), Fashion MNIST ([45]), SVHN ([33]), and CIFAR-10 ([23]). No additional pre-processing was performed on the MNIST, Fashion MNIST and CIFAR-10 datasets. For the SVHN dataset, we used the ‘Format 2 Cropped’ version, and included an additional crop on each image to reduce the presence of more than one digit, though the dimension is maintained. (Code available at https://github.com/ArghyaPal/Adversarial-Data-Programming.)

4.2 Labeling Functions

Labeling functions form a critical element of ADP, and we used different cues from state-of-the-art algorithms to help obtain labeling functions for our experiments. Table 1 shows the labeling functions we used for our experiments on MNIST and SVHN (digit recognition problems), and Table 2 shows the functions used for CIFAR and Fashion MNIST. We categorized labeling functions as: (i) heuristic; (ii) image processing-based; and (iii) deep learning-based labeling functions (as in Tables 1 and 2). Table 3 presents the number of labeling functions used for each of the considered datasets (the empirical study that motivated these choices is presented in Section 5). In this work, for each labeling function, a simple threshold rule on the norm of the aforementioned features is used, where the threshold is obtained empirically as the mean of the feature norms of a randomly chosen subset, trimmed to remove outliers. More examples of labeling functions and ablation studies on their usefulness are presented in the Supplementary Section.
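A minimal sketch of such a threshold rule (under assumptions: the toy feature extractor, the fixed class assignment, and the omission of the outlier-trimming step are simplified placeholders, not the paper's implementation):

```python
import numpy as np

def make_threshold_lf(feature_fn, reference_images, k: int = 10):
    """Build a labeling function from a feature extractor: the decision
    threshold is the mean feature-norm over a reference subset (the paper
    additionally trims outliers before taking this mean)."""
    norms = [np.linalg.norm(feature_fn(img)) for img in reference_images]
    threshold = float(np.mean(norms))

    def lf(image: np.ndarray) -> np.ndarray:
        vec = np.full(k, 1.0 / k)            # uninformative vote by default
        if np.linalg.norm(feature_fn(image)) > threshold:
            vec = np.zeros(k)
            vec[0] = 1.0                     # vote for a fixed class (toy choice)
        return vec

    return lf

# Toy feature: row-sum profile of a grayscale image
feature = lambda img: img.sum(axis=1)
lf = make_threshold_lf(feature, [np.ones((8, 8)) * s for s in (0.2, 0.4, 0.6)])
```

Each feature listed in Tables 1 and 2 would play the role of `feature_fn` here, yielding one weak labeling function per cue.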

Type Labeling Functions used
Heuristic Presence of long edges (vertical or horizontal) [30]; Image histogram
Image Processing based Bag-of-feature [39]; Haar wavelet [8]; Discrete-continuous ADM [25]; Compressive sensing [50]
Deep Learning based Convolution kernels from last conv layer (before fully connected layers) of LeNet
Table 1: Labeling functions used for MNIST and SVHN datasets, both of which represent the digit recognition task
Type Labeling Functions used
Heuristic PatchMatch [3]; Blob Detection; Presence of edges [1]; Textons; Image histogram
Image Processing based Global descriptor (GIST-based) [32]; Local descriptor (SIFT-based) [48]; Bag-of-visual-words; Histogram of Oriented Gradient (HOG)-based: HoGgles [44]
Deep Learning based Convolution kernels from last conv layer (before fully connected layers) of (ImageNet) pre-trained AlexNet
Table 2: Labeling functions used for CIFAR 10 and Fashion MNIST datasets
Heuristic Image Processing Deep Learning
MNIST 43 10 1
Fashion-MNIST 50 6 1
SVHN 43 10 2
CIFAR 10 46 18 2
Table 3: Number of labeling functions used for different datasets
Figure 3: (Best viewed in color) Image-label pairs generated by training on the CIFAR-10 dataset using CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN and our method, ADP. For a given model, the columns of images represent generations after 0.1k, 20k, 40k and 50k epochs, and the rows correspond to the associated class label. ‘ap’ stands for the airplane class and ‘am’ for the automobile class of CIFAR-10. Note the clarity of the generations of the proposed method.

4.3 Implementation Details

The shared trunk of the generator has 3 dense fully connected (FC) layers (128 nodes per layer) with batch normalization. The image branch continues with fractionally-strided convolutional layers, similar to [29] (FCONV: 1024 nodes per layer, stride 1, each followed by batch normalization and Parameterized ReLU), and generates the image $x_g$. The parameter branch uses FC layers and generates $\theta_{acc}$ and $\theta_{dep}$. The discriminator network follows the in-plane rotation network of [29]. $D_{LFB}$ is a stack of FC layers. Both branches of the discriminator have 2 FCONV layers followed by FC layers. We trained the complete model with mini-batch Stochastic Gradient Descent (SGD) with a batch size of 128, a learning rate of 0.0001, a momentum factor of 0.5, and Adam as the optimizer.

4.4 Comparison with State-of-the-Art Models

Qualitative Results:

We compared our method against other generative methods that allow generation of data along with a label: Conditional GAN or CGAN ([31]), ACGAN ([34]), InfoGAN ([9]), CoGAN ([29]) and TripleGAN ([27]). (We changed the use case setup of these methods to generate data-label pairs as required. For example, for a conditional GAN, we specified a class label, generated a corresponding image and used this as an image-label pair.) We used the publicly available code for each of the above methods, and the results for CIFAR-10 are shown in Figure 3 (results for other datasets are shown in the Supplementary Section). The figure shows that the proposed model generates images with very good clarity. Besides, while some of the aforementioned methods (such as CGAN and InfoGAN) generate images conditioned on a given label (and hence require a label to be provided as input), the label is provided by the model itself in our case.

Quantitative Results:

We considered three evaluation metrics for studying the performance of our method quantitatively: (i) Human Turing test (HTT): this metric studies how hard it is for a human annotator to tell the difference between real and generated samples. We asked 40 subjects to evaluate image quality and image-label correspondence (Table 5) on a scale of 10, given 50 random image-label samples from the generated pool for each method considered. Table 5 shows consistently good performance of ADP over other methods, especially in image-label correspondence, which is the focus of this work. (ii) Inception Score: the inception score, as used in [29] and [27], for the CIFAR-10 dataset is shown in Figure 5. The figure shows that ADP and TripleGAN perform significantly better than the rest of the methods (more results on other datasets are included in the Supplementary Section). (iii) We also used a Parzen window-based evaluation metric, and these results are included in the Supplementary Section.

We trained a ResNet-56 model on the CIFAR-10 dataset under different settings, and the results are shown in Table 4. The addition of labeled data generated using our method significantly lowers the test cross-entropy loss across the epochs.

(Training Data, Test Data) Epochs
5k 10k 15k 20k 30k 40k 50k
(Real data-50K, Real data-10K) 9.83 7.3 7.12 6.3 6.1 4.3 4.19
(ADP data-50K, Real data-10K) 9.32 8.9 8.13 7.0 6.75 5.53 5.0
(Real data-50K, ADP data-10K) 9.67 9.4 7.92 7.3 6.81 6.18 5.6
(ADP-25K + Real data-25K, Real data-10K) 8.5 6.6 6.21 5.7 5.5 4.83 3.5
(ADP-50K + Real data-50K, Real data-10K) 7.71 6.3 6.0 5.34 3.1 2.92 2.71
Table 4: Test cross-entropy loss of ResNet-56 on CIFAR-10 dataset. (Real data-50K, Real data-10K) = standard dataset; ADP = our method; 10/25/50K = the number of data points used in thousands. In ADP-25K + Real data-25K, class ratios were maintained as in the original dataset.
Dataset Image Quality Image-Label Correspondence
ACGAN CGAN InfoGAN TripleGAN ADP ACGAN CGAN InfoGAN TripleGAN ADP
MNIST
FMNIST
SVHN
CIFAR10
Table 5: Human Turing Test for image quality and image-label correspondence (Section 4.4, higher the better). Note that the proposed method, ADP performs the best in most cases, and is a close second when TripleGAN wins.
Classification Performance:

To study the usefulness of the generated image-label pairs, we studied the classification cross-entropy loss of a pretrained ResNet model on the image-label pairs generated by our ADP at test time. A lower cross-entropy loss in ResNet at test time indicates the efficacy of our model as a data augmentation method. We compared our method against TripleGAN, InfoGAN, CoGAN, as well as the popular oversampling technique SMOTE [7]. Figure 4 shows the results: the proposed model has a significantly lower cross-entropy loss than other methods, highlighting its usefulness.

Figure 4: Classification performance of a pretrained ResNet model on image-label pairs generated by various models trained on CIFAR 10
Figure 5: Inception scores on CIFAR 10 (Section 4.4, Quantitative Analysis)

5 Discussion and Analysis

Optimal Number of Labeling Functions:

We studied the performance of ADP when the number of labeling functions is varied, to understand the impact of this parameter. We studied the test cross-entropy error of a pretrained ResNet model with image-label pairs generated by ADP trained using different numbers of labeling functions. Table 6 shows our results, suggesting that 50-55 labeling functions provide the best performance, depending on the dataset. This justifies our choice of the number of labeling functions in Table 3.

No of Labeling Functions MNIST F-MNIST CIFAR10 SVHN
3 70.23% 81.02% 87.39% 83.82%
10 47.92% 71.52% 60.11% 75.91%
25 20.32% 30.53% 42.31% 38.30%
30 4.56% 12.47% 21.19% 26.66%
40 1.40% 6.81% 19.93% 16.62%
50 1.33% 4.92% 18.93% 13.05%
55 1.34% 4.80% 18.45% 12.83%
65 1.31% 4.73% 18.43% 12.82%
70 1.25% 4.75% 18.40% 12.80%
Table 6: Performance of ADP when number of labeling functions is varied (Section 5, Optimal Number of Labeling Functions).
Transfer Learning:

The use of distant supervision signals such as labeling functions (which can often be generic) allows us to extend the proposed ADP model to a transfer learning setting. In this setup, we trained ADP initially on a source dataset and then finetuned the model on a target dataset, with very limited training. In particular, we first trained ADP on the MNIST dataset, and subsequently finetuned a single branch of the generator alone with the SVHN dataset; the weights of the remaining blocks are unaltered. The final finetuned model is then used to generate image-label pairs (which we hypothesize will look similar to SVHN). Figure 1 shows encouraging results of our experiments in this regard.

Figure 6: Transfer learning from MNIST to SVHN dataset. Digits within parentheses indicate true label, while the other is the label generated using our method (Section 5, Transfer Learning)
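A minimal sketch of the finetuning setup described above, with hypothetical component names (the paper's actual component symbols were lost in this rendering): only the branch being transferred receives gradient updates, while all other weights stay frozen.

```python
import numpy as np

# Hypothetical components of a trained ADP model; when transferring from
# MNIST to SVHN, only one branch is marked trainable and finetuned.
params = {
    "generator_shared":  {"w": np.ones(4), "trainable": False},  # frozen
    "label_branch":      {"w": np.ones(4), "trainable": False},  # frozen
    "image_branch_svhn": {"w": np.ones(4), "trainable": True},   # finetuned
}

def sgd_step(params, grads, lr=0.1):
    """Apply a gradient step only to components marked trainable."""
    for name, p in params.items():
        if p["trainable"]:
            p["w"] -= lr * grads[name]

grads = {name: np.full(4, 2.0) for name in params}
sgd_step(params, grads)
```

After the step, the frozen components are unchanged while the finetuned branch has moved, which is the essence of the limited-training transfer described above.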
Multi-task Joint Distribution Learning:

Learning a cross-domain joint distribution from heterogeneous domains is a challenging task. We show that the proposed ADP method can be used to achieve this, by modifying its architecture as shown in Figure 7, to simultaneously generate data from two different domains. We study this architecture on the MNIST and SVHN datasets, and show the promising results of our experiments in Figure 8. The LFB acts as a regularizer and maintains the correlations between the domains in this case. More results on other datasets - in particular, LookBook and Fashion MNIST - are included in the Supplementary Section as well as Figure 1.

Figure 7: ADP for Multi-Task Learning: Proposed Architecture
Figure 8: Results of ADP - Multi-Task Learning on MNIST (black and white) and SVHN (RGB) datasets

6 Conclusions

Paucity of large curated hand-labeled training data for every domain-of-interest forms a major bottleneck in deploying machine learning methods in practice. Standard data augmentation techniques and other heuristics are often limited in their scope and require carefully picked, hand-tuned parameters. We instead propose a new adversarial framework called Adversarial Data Programming (ADP), which can learn the joint data-label distribution effectively using a set of weakly defined labeling functions. The method shows promise on standard datasets, as well as in settings such as transfer learning and multi-task learning. Our future work will involve understanding the theoretical implications of this new framework from a game-theoretic perspective, and exploring the performance of the method on more complex datasets.

References

Appendix A Algorithm

Algorithm LABEL:alg:main presents the overall stepwise routine of the proposed ADP method, as described in Section 3. During the training phase, the algorithm updates the weights of the model by estimating gradients over a batch of labeled data points. The hyperparameters that need to be provided include the standard parameters used while training a GAN: (i) the number of iterations of Algorithm LABEL:alg:main; (ii) a parameter (similar to [18]) that specifies how many times the discriminator is updated with respect to the generator; and (iii) the minibatch size. These values were chosen through empirical studies.
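The alternating update schedule described above can be sketched as follows; `sample_batch`, `update_discriminators` and `update_generator` are placeholders for the actual ADP update rules, and `k` is the discriminator-updates-per-generator-update parameter:

```python
def train_adp(num_iters, k, batch_size,
              sample_batch, update_discriminators, update_generator):
    """Skeleton of the alternating schedule: every outer iteration
    performs k discriminator updates followed by one generator update,
    as in standard GAN training [18]. Returns the update order for
    inspection."""
    log = []
    for _ in range(num_iters):
        for _ in range(k):
            batch = sample_batch(batch_size)   # minibatch of data points
            update_discriminators(batch)       # placeholder update
            log.append("D")
        update_generator()                     # placeholder update
        log.append("G")
    return log

# Two iterations with k = 3 produce the schedule D,D,D,G,D,D,D,G.
log = train_adp(2, 3, 8, lambda b: [0] * b, lambda b: None, lambda: None)
```

The real routine would replace the placeholders with the gradient estimates of the ADP objective; only the control flow is shown here.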

Appendix B Datasets

In this section, we provide more information on the datasets used to validate ADP in this work: MNIST, Fashion MNIST, SVHN and CIFAR 10. The MNIST dataset comprises grayscale images (one handwritten digit per image) along with the corresponding labels, with 50,000 training samples (image-label pairs). For SVHN, we used format 2 of the dataset, which comprises 73,257 images (each containing a digit captured from street views of house numbers) with the corresponding labels. For CIFAR 10, we merged the five training batches of the dataset to build a training set of 50,000 images. This dataset contains 32x32 RGB images spanning 10 classes: automobile, airplane, bird, cat, deer, dog, frog, horse, ship and truck; the samples are almost equally distributed across the classes. Fashion MNIST, similar to MNIST, consists of a training set of 50,000 grayscale images, each belonging to one of 10 classes: T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag and Ankle boot.

We also used the LookBook [47] dataset (Figure 9) to demonstrate cross-domain multi-task learning using ADP, as described in Section 5. This dataset contains 84,748 images across 17 classes: midi dress, mini dress, coat, jacket, fur jacket, padded jacket, hooded jacket, jumper, cardigan, knitwear, blouse, shirt, sleeveless tee, short sleeve tee, long sleeve tee, hoody and vest. In this work, we grouped these 17 classes into 4 classes (coat, pullover, t-shirt and dress), in order to match the Fashion MNIST dataset and thus help study cross-domain learning. We grouped coat, jacket, fur jacket, padded jacket, hooded jacket, jumper and cardigan into a single coat class; hoody into the pullover class; sleeveless tee and short sleeve tee into the t-shirt class; and knitwear, blouse, midi dress and mini dress into the dress class. The Fashion MNIST dataset has the same classes (coat, pullover, t-shirt and dress) among its labels, thus facilitating our study.
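The grouping above can be written as a simple lookup table. Class names follow the text; cardigan, which the text lists under two groups, is assigned to coat here, and classes the text leaves ungrouped (e.g. vest) are omitted:

```python
# LookBook class -> Fashion-MNIST-compatible class, per the grouping in the text.
LOOKBOOK_TO_FMNIST = {
    "coat": "coat", "jacket": "coat", "fur jacket": "coat",
    "padded jacket": "coat", "hooded jacket": "coat", "jumper": "coat",
    "cardigan": "coat",          # also listed under dress in the text
    "hoody": "pullover",
    "sleeveless tee": "t-shirt", "short sleeve tee": "t-shirt",
    "knitwear": "dress", "blouse": "dress",
    "midi dress": "dress", "mini dress": "dress",
}

def group_label(lookbook_class):
    """Map a LookBook class name to its grouped 4-way class."""
    return LOOKBOOK_TO_FMNIST[lookbook_class.lower()]
```

This kind of mapping is applied once to the LookBook labels before training the cross-domain model.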

No additional pre-processing was performed on the MNIST, Fashion MNIST, CIFAR 10 or LookBook datasets. For SVHN, an additional crop was performed on each image to ensure only one digit is present, and the cropped image was subsequently resampled to maintain a fixed size. Figure 9 shows illustrative examples of images from the chosen datasets.

Figure 9: Sample images from the datasets studied in this work: (a) CIFAR 10, (b) Fashion MNIST, (c) MNIST, (d) LookBook

Appendix C More on Labeling Functions

The Labeling Functions Block (LFB) in Figure 2a (Section 3) is implemented using the open-source framework Snorkel [36]. We modified the underlying architecture of the Snorkel framework, which otherwise estimates dependencies using MLE with Gibbs sampling, to include our adversarial approach. Labeling functions of three kinds are used in this work, as described in Section 4: heuristic, image processing-based and deep learned feature-based. Examples of the labeling functions used in this work are shown as Labeling Functions 2, 3, 4 and 5. For each labeling function, a simple threshold rule on the norm of the aforementioned features is used. For each class of a dataset, the threshold is obtained empirically as the average norm of the feature vectors of 20 random samples of that class (with trimming to remove outliers). It is worth mentioning that, for ease of understanding, the example Labeling Functions 4 and 5 return one-hot encodings; in practice, we fit a nonlinear function to obtain a probabilistic output.

      Input: Image
      Output: Probabilistic label vector
/* Decision tree for English numerals recognition [24] */
if blob(Image) == TRUE then
       if blob diameter(Image) ≤ 0.5cm then
             number = count blob(Image);
             if number == 2 then
                   return [0.2,0,0,0,0,0.1,0,0,0.6,0.1];
             end if
             if number == 1 then
                   return [0.6, 0, 0.2, 0, 0.1, 0, 0, 0, 0.1]
             end if
       end if
       if blob diameter(Image) > 0.5cm then
             return [0.4, 0, 0, 0, 0.1, 0, 0.3, 0.1, 0.1]
       end if
end if
Labeling Function 2 Sample heuristic labeling function (used for blobs in digits like: 0, 9, 6)
      Input: Image
      Output: Probabilistic label vector
/* Decision tree for English numerals recognition [24] */
if blob(Image) == TRUE then
       number = count stem (Image);
       if number == 0 then
            return [0.8, 0, 0, 0, 0.1, 0, 0, 0, 0.1]
       end if
      if number == 1 then
            return [0.1, 0, 0, 0, 0.4, 0, 0.4, 0, 0.1]
       end if
      if number == 2 then
            return [0, 0, 0, 0, 0.8, 0, 0, 0, 0.2]
       end if
      
end if
Labeling Function 3 Sample heuristic labeling function (used for digits with blob and stem like: 4, 6, 9)
      Input: Image, : Number of classes
      Output: Probabilistic label vector
/* NOTE: Loop presented below for purposes of clarity - it is implemented only once for a dataset */
for i = 1 to n do
       t_i = average value of the norm of Bag-of-features() over a subset of training samples from class i;
end for
s = norm of Bag-of-features(Image);
i* = argmin_i |s - t_i|;
return OneHot(i*)
Labeling Function 4 Sample image processing based labeling function (based on Bag-of-Words)
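A minimal sketch of the empirical threshold computation described above (the average feature-norm over a small per-class sample, with trimming to remove outliers). The trimming fraction is an assumption, as the paper's exact trimming parameter was lost in this rendering:

```python
import numpy as np

def class_threshold(feature_vectors, trim_frac=0.1):
    """Threshold for one class: trimmed mean of feature-vector norms,
    computed over a small random subset of that class (20 samples in
    the paper). trim_frac is an assumed trimming fraction that discards
    the smallest and largest norms as outliers."""
    norms = np.sort(np.linalg.norm(feature_vectors, axis=1))
    k = int(len(norms) * trim_frac)
    kept = norms[k: len(norms) - k] if k > 0 else norms
    return float(kept.mean())

# Toy example: 10 one-dimensional feature vectors with norms 1..10;
# trimming drops the extremes (1 and 10), leaving a mean of 5.5.
feats = np.arange(1.0, 11.0).reshape(10, 1)
t = class_threshold(feats)
```

A labeling function like Labeling Function 4 then compares the feature-norm of a new image against these per-class thresholds to pick a label.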
      Input: Image, : Number of classes, Kernels from first layer of pre-trained AlexNet (trained on ImageNet)
      Output: Probabilistic label vector
/* Deep learning based labeling function */
m = number of kernels in the first layer of pre-trained AlexNet;
for i = 1 to n do
       for j = 1 to m do
             t_{i,j} = average value of the Frobenius norm of the activation map of kernel j over a subset of training samples from class i;
       end for
end for
for j = 1 to m do
       a_j = Frobenius norm of the activation map of kernel j on Image;
end for
i* = argmin_i sum_j |a_j - t_{i,j}|;
return OneHot(i*)
Labeling Function 5 Sample deep learned feature-based labeling function
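A simplified reading of Labeling Function 5 in plain Python: given the per-kernel activation-map norms for one image and per-class average profiles, return a one-hot label for the nearest class profile. The distance used here (L1 over kernels) is an assumption, since the exact comparison was lost in this rendering:

```python
import numpy as np

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def deep_feature_label(act_norms, class_profiles):
    """act_norms: per-kernel Frobenius norms of the activation maps for
    one image (length m). class_profiles: (n, m) array of average norms
    per class. Returns a one-hot label for the nearest class profile."""
    n = len(class_profiles)
    dists = np.abs(class_profiles - act_norms).sum(axis=1)  # L1 per class
    return one_hot(int(np.argmin(dists)), n)

# Toy profiles for n = 3 classes and m = 2 kernels; the query image's
# norms are closest to the second class profile.
profiles = np.array([[1.0, 2.0], [5.0, 6.0], [9.0, 9.0]])
label = deep_feature_label(np.array([5.2, 5.9]), profiles)
```

In the actual labeling function, `act_norms` would come from the first-layer kernels of a pretrained AlexNet, and the one-hot output would be softened by a fitted nonlinear function as noted above.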

c.1 Ablation Studies with Labeling Functions

In order to understand the effect of different kinds of labeling functions, we performed an ablation study on the CIFAR10 dataset (chosen since it is the most natural of the considered datasets, and since it allows us to compute an Inception score for quantitative comparison of the various methods). In this study, we did not alter any hyperparameters described in Section 4. Our ablation study considers the following models:

  M1. ADP: Full model

  M2. ADP with no dependencies: Same model as ADP with 55 labeling functions, as in Table 3, but with each labeling function considered independent of the others.

  M3. ADP with only heuristic labeling functions: Same model as ADP with 36 heuristic labeling functions, but without any image processing-based or deep learned feature-based labeling functions.

  M4. ADP with only image processing labeling functions: Same model as ADP with only 17 image processing-based labeling functions, but without heuristic or deep learned feature-based labeling functions.

  M5. ADP with only deep learned feature-based labeling functions: Same model as ADP with only 2 deep learned feature-based labeling functions, but without heuristic or image processing-based labeling functions.

  M6. ADP with heuristic + deep learned feature-based labeling functions

  M7. ADP with deep learned feature-based + image processing labeling functions

  M8. ADP with image processing + heuristic labeling functions

Model             M1    M2    M3    M4    M5    M6    M7    M8
Inception Score   8.7   4.32  5.52  4.91  4.73  7.01  7.52  7.27
Table 7: Ablation study w.r.t labeling functions, as described in Section C.1

The Inception scores for the aforementioned eight models are presented in Table 7. The base ADP model (M1), comprising all labeling functions, outperforms all other variants, highlighting the usefulness of a variety of labeling functions in modeling this non-trivial distribution.

Appendix D More Qualitative Results

In addition to the results on CIFAR 10 presented in Section 4.4, we also compared the performance of our ADP method against other generative methods (CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN) on the MNIST, SVHN and Fashion MNIST datasets. As with CIFAR 10, we adapted the other methods to generate labeled images, using the publicly available code for each method. Figures 10, 11 and 12 present these results.

Figure 10: Image-label pairs generated by training on the MNIST dataset using CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN and our method, ADP. For a given model, the columns of images represent generations after 0.01k, 0.1k, 1k, 3k, 6k and 9k epochs, and the rows correspond to the associated class label. It is evident that from 6k epochs onward, the ADP model starts generating quality images across classes with good image-to-label correspondence.
Figure 11: Image-label pairs generated by training on the SVHN dataset using CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN and ADP. For a given model, the columns of images represent generations after 0.1k, 1k, 10k, 30k, 40k and 60k epochs, and the rows correspond to the associated class label.
Figure 12: Image-label pairs generated by training on the Fashion MNIST dataset using CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN and our method, ADP. For a given model, the columns of images represent generations after 0.1k, 1k, 10k, 15k, 20k and 25k epochs, and the rows correspond to the associated class label.
MNIST:

Figure 10 shows that both our method, ADP, and TripleGAN generate good-quality images on the MNIST dataset, and both achieve high image-to-label correspondence. Surprisingly, state-of-the-art methods such as CGAN, ACGAN, InfoGAN and CoGAN fail to capture image-to-label correspondence despite generating good-quality images.

SVHN:

As shown in Figure 11, our method generates human-recognizable images with good image-to-label correspondence in just 1k epochs on the relatively harder SVHN dataset. At higher epoch counts, CoGAN (30k epochs) and TripleGAN (40k epochs) also generate images of good quality, but broadly fail to capture the different styles, backgrounds and illuminations of the generated digits.

Figure 13: (a) Test-time classification cross-entropy loss of a pre-trained ResNet model on image-label pairs generated by ADP, ADP (i.e. only its Image-GAN component) with majority voting and ADP (i.e. only its Image-GAN component) with DP for labels; (b) Average running time of ADP against other methods to estimate the relative accuracies and inter-function dependencies in DP.
Dataset   GAN   CGAN   ACGAN   InfoGAN   CoGAN   ADP   TripleGAN
MNIST     198   201    204     225       278     344   321
FMNIST    213   206    234     276       254     292   312
SVHN       87   145    178     158       123     246   223
Table 8: Parzen window based evaluation on the MNIST, FMNIST and SVHN datasets.
Fashion MNIST:

Most of the considered methods do well on this dataset. ADP and TripleGAN provide the sharpest results on close visual observation.

Appendix E More Quantitative Results

Parzen Window Based Evaluation:

In addition to the results with the Inception score and HTT presented in Section 4.4, we compared our method against the other generative models (described in Section 4) on the Parzen window score at test time. The Parzen window [5] is a commonly used non-parametric density estimator for evaluating generative models (especially GANs [18]) whose exact likelihood is intractable. We fit a Parzen window with a Gaussian kernel to the samples generated by each model, obtaining a proxy for the true log-likelihood with which to evaluate test log-likelihood. These results are shown in Table 8: ADP performs significantly well on the MNIST (score 344) and SVHN (score 246) datasets, outperforming other state-of-the-art models including TripleGAN. On Fashion MNIST, our method is a close second to TripleGAN. We chose the Parzen window size using cross-validation, as described in [18].
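A sketch of the Parzen window evaluation, assuming a spherical Gaussian kernel; in practice the window size sigma would be chosen by cross-validation as noted above:

```python
import numpy as np

def parzen_log_likelihood(samples, test_points, sigma):
    """Average test log-likelihood under a Parzen window (spherical
    Gaussian kernel) fitted to generated samples, a common proxy when
    the model's exact likelihood is intractable [5, 18]."""
    d = samples.shape[1]
    # pairwise squared distances, shape (num_test, num_samples)
    diffs = test_points[:, None, :] - samples[None, :, :]
    sq = (diffs ** 2).sum(axis=-1)
    log_kernel = -sq / (2.0 * sigma ** 2)
    log_norm = np.log(len(samples)) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    # log-sum-exp over samples for numerical stability
    m = log_kernel.max(axis=1, keepdims=True)
    ll = m[:, 0] + np.log(np.exp(log_kernel - m).sum(axis=1)) - log_norm
    return float(ll.mean())

# Toy check: generated samples matching the test distribution score far
# higher than the same samples shifted away from it.
rng = np.random.default_rng(0)
gen = rng.normal(size=(200, 2))
test = rng.normal(size=(50, 2))
score_good = parzen_log_likelihood(gen, test, sigma=0.5)
score_bad = parzen_log_likelihood(gen + 10.0, test, sigma=0.5)
```

Higher scores indicate that held-out test data is more likely under the density estimated from the generator's samples, which is how the numbers in Table 8 are to be read.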

Figure 14: Sample results of image-label pairs generated by combining a vanilla GAN (for image generation) and DP [36] (for label generation) using the same labeling functions used in this work. Row labels represent the original class label (am = automobile) and column labels are provided by DP. Note the poor image-label correspondence, supporting the need for our work.

Appendix F More Results on Multi-task Joint Distribution Learning

Continuing the results presented in Section 5 (and Figure 1), we present further results on the capability of ADP to perform multi-task joint distribution learning in Figure 15. The figure shows that ADP is able to generate corresponding samples from two different domains, including samples of different colors.

Figure 15: Demonstration of cross-domain multi-task learning using ADP : (a)(b) Generated samples of Shirt (class 6 of Fashion MNIST dataset); (c) Generated samples of T-shirt (class 0 of Fashion MNIST dataset). Samples generated of the LookBook dataset are color images (top rows), while those of Fashion MNIST are grayscale images (bottom rows).

Appendix G Comparison against Vote Aggregation Methods

Comparison against Majority Voting and DP:

To study the usefulness of learning relative accuracies and inter-function dependencies with ADP, we compared the performance of our method against both majority voting and Data Programming (DP) [36]. Majority voting does not estimate the relative accuracies and inter-function dependencies of the labeling functions as described in Section 3; instead, for a given image, each labeling function makes a probabilistic prediction, and we take a maximum vote to obtain the final label. As in Section 4.4, we measured the test-time classification cross-entropy loss of a pre-trained ResNet model on image-label pairs generated by ADP, by ADP's Image-GAN component with majority voting, and by ADP's Image-GAN component with DP. The results are presented in Figure 13a: ADP has a significantly lower cross-entropy loss than the other two methods, corroborating its effectiveness.
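The majority-voting baseline can be sketched as follows; ties between labels are broken arbitrarily in this toy version:

```python
import numpy as np

def majority_vote(lf_outputs):
    """Aggregate probabilistic label vectors from labeling functions by
    taking each function's argmax as a vote and returning the label with
    the most votes. Unlike ADP, no relative accuracies or inter-function
    dependencies are estimated."""
    votes = [int(np.argmax(p)) for p in lf_outputs]
    return max(set(votes), key=votes.count)

# Three labeling functions voting over three classes: two vote for
# class 0, one for class 1, so the aggregated label is 0.
lfs = [np.array([0.7, 0.2, 0.1]),
       np.array([0.1, 0.6, 0.3]),
       np.array([0.5, 0.3, 0.2])]
label = majority_vote(lfs)
```

Because every function's vote counts equally, a few systematically inaccurate or correlated labeling functions can dominate the outcome, which is the weakness the learned aggregation in ADP addresses.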

Adversarial Data Programming vs MLE-based Data Programming:

To further quantify the benefits of ADP, we also compare our method against Data Programming (DP) [36] using different variants of MLE: exact MLE, maximum pseudo-likelihood and Hamiltonian Monte Carlo. We note that DP only aggregates labels; we hence combined a vanilla GAN with DP as separate components to conduct this study. We started with a small number of labeling functions (35) and progressively added more, noting the time taken by each parameter estimation method. Figure 13b presents the results and shows that ADP is almost 100x faster than MLE-based estimation. Figure 14 also shows sample images generated by the vanilla GAN, along with the corresponding labels assigned by MLE-based DP using the same labeling functions as in our work. The labels are clearly incorrect, supporting the value of learning the joint distribution as proposed, rather than combining two individual components.