Curated labeled data is a key building block of modern machine learning algorithms, and a driving force for deep neural network models. The large parameter space of deep models requires very large labeled datasets to build effective models that work in practice. However, this inherent dependency on large curated labeled data has become the major bottleneck for progress in the use of machine learning and deep learning in computer vision and other domains. Creation of large-scale hand-annotated datasets in every domain is a challenging task due to the requirement for extensive domain expertise and long hours of human labour, which collectively make the overall process expensive and time-consuming. Even when data annotation is carried out using crowdsourcing (e.g. Amazon Mechanical Turk), additional effort is required to measure the correctness (or goodness) of the obtained labels. We seek to address this problem in this work. (This paper is accepted at CVPR 2018.)
In particular, we focus on automatically learning the parameters of a given joint image-label probability distribution (as provided in training image-label pairs) with a view to automatically creating labeled datasets. To achieve this objective, we exploit the use of distant supervision signals to generate labeled data. These distant supervision signals are provided to our framework as a set of weak labeling functions which represent domain knowledge or heuristics obtained from experts or crowd annotators. Writing a set of labeling functions (as we found in our experiments) is fairly easy and quick, and these can then be used in our framework to generate data as well as associated labels. More interestingly, such labeling functions are often easily generalizable, thus allowing our framework to be extended to transfer learning and multi-task learning (discussed in Section 5). Figure 1 shows a few examples of our results to illustrate the overall idea.
In practice, labeling functions can be associated with two kinds of dependencies: (i) relative accuracies, which measure the correctness of the labeling functions w.r.t. the true class label; and (ii) inter-function dependencies, which capture the relationships between the labeling functions with respect to the predicted class label. In this work, we propose a novel adversarial framework using Generative Adversarial Networks (GANs) that learns these dependencies along with the data distribution using a minmax game. Our GAN learns to generate a joint data-label distribution using a generator block, a discriminator block and a Labeling Functions Block (LFB), which contains another discriminator that helps in learning the two kinds of dependencies mentioned above. The overall architecture of the proposed ADP framework is presented in Figure 2a.
Our broad idea of learning relative accuracies and inter-function dependencies of labeling functions is inspired by the recently proposed Data Programming (DP) framework (and hence the name ADP), but our method differs in several ways: (i) DP is a strictly conditional model, i.e. it models $P(y|x)$, and requires additional unlabeled data points even at test time, while our model is a joint distribution model, i.e. it models $P(x, y)$, and does not require any additional unlabeled data points at test/generation time. (ii) DP learns a generative model using Maximum Likelihood Estimation (MLE) and gradient descent to learn the relative accuracies of labeling functions. We instead replace this approach with a GAN-based adversarial estimation of parameters; prior work provides insights on the advantages of a GAN-based estimator over MLE, namely a relatively quicker training time and good robustness of generated samples. (iii) To learn the statistical dependencies of labeling functions, DP models the dependency structure of labeling functions as a factor graph, and uses computationally expensive Gibbs sampling techniques to update the gradient in each step. We replace the factor graph and Gibbs sampling-based estimation of inter-function dependencies with another discriminator in our GAN-based estimation, which again speeds up the learning process and provides robust generation at run-time.
As our outcomes of this work, we show how a set of low-quality, weak labeling functions can be used within a framework that models a joint data-label distribution to generate robust samples. We also show that this idea can be generalized quite easily to transfer learning and multi-task learning settings, showing the generalizability of this work. Our contributions can be summarized as follows:
We propose a novel adversarial framework, ADP, to generate robust data-label pairs that can be used to obtain datasets in domains that have very little data, thus saving human labor and time.
We show how an adversarial framework can be used to learn dependencies between weak labeling functions and thus provide high-fidelity aggregated labels along with generated data in a GAN setting.
The proposed framework can also be used in a transfer learning setting, where ADP can be trained on a source domain and then finetuned on a target domain to generate data-label pairs in the target domain.
We also show the potential of this ADP framework to generate cross-domain data in a multi-task setting, where images from two domains are generated simultaneously by the model along with the labels.
2 Related Work
Standard data augmentation techniques use a limited form of class-preserving image transformations such as rotation, mirroring, addition of small noise, and random crops. Interpolation-based methods and class-conditional models of diffeomorphisms interpolate between nearest-neighbor labeled data points. The popular SMOTE algorithm performs oversampling to reduce class imbalance and augment the given data. All of these methods depend heavily on hand-tuned parameters: the order of geometric transformations, the optimal values of transformation parameters, etc. A small change in parameters can often negatively impact final performance, as prior studies have shown.
In this work, we choose a more intuitive way of creating labeled data by learning a joint distribution model. Learning a joint data-label distribution using generative models is non-trivial, since the label often requires domain knowledge and is not directly inferable from data. Our proposed model hence uses distant supervision signals (in the form of labeling functions) to generate novel labeled data points. Distant supervision signals such as labeling functions are cheaper than manual annotation of each data point, and have been successfully used in recent methods. Ratner et al. proposed a generative model that uses a fixed number of user-defined labeling functions to programmatically generate synthetic labels for data in near-constant time. DP outperformed a number of approaches such as multiple-instance learning, co-training, crowdsourcing, and ensemble-based weak-learner methods like boosting, thus reinforcing our choice in this work. Alfonseca et al. generated additional training data using hierarchical topic models for weak supervision. Heuristics for distant supervision have also been proposed, but such methods do not model the inherent noise associated with these heuristics. Structure learning also exploits distant supervision signals for generating labels, but as described in Section 1, these methods require unlabeled test data to generate a labeled dataset. Additionally, they are computationally expensive due to their use of Gibbs sampling in MLE.
We instead use an adversarial approach to learn the joint distribution by weighting a set of domain-specific labeling functions using a Generative Adversarial Network (GAN). A GAN approximates the real data distribution by optimizing a minmax objective function, and can thus generate novel out-of-sample data points. Broadly, GANs can be viewed in terms of three manifestations: (i) GANs trained to sample from a marginal distribution $P(x)$, where $x$ refers to data. (ii) GANs conditioned on class labels, such as Conditional GAN, Auxiliary Classifier GAN and InfoGAN, which sample from a conditional distribution $P(x|y)$. Other state-of-the-art models with similar objectives have exploited other modalities for the same purpose; for example, Zhang et al. propose a GAN conditioned on images, while Hu et al. propose a GAN conditioned on text. (iii) A few very recent efforts attempt to train GANs to sample from a joint distribution. For example, CoGAN introduces a parameter-sharing approach to learn an unpaired joint distribution between two domains, while TripleGAN brings in a classifier along with the discriminator and generator, which helps in a semi-supervised setting. In this work, we propose a novel idea: to instead use distant supervision signals to learn the joint distribution of labeled images. We now describe the proposed methodology.
3 Adversarial Data Programming (ADP): Methodology
Our central aim in this work is to learn the parameters $\theta$ of a probabilistic model:

$$P_\theta(x, y) \quad (1)$$

that captures the joint distribution over the data $x$ and the corresponding labels $y$, thus allowing us to generate out-of-sample data points along with their corresponding labels (we focus on images in the rest of this paper).
While recent efforts have considered complementary objectives, they largely focused on learning joint probability distributions in cross-domain understanding settings. In this work, we focus on learning the joint image-label probability distribution with a view to automatically creating labeled datasets, by exploiting the use of distant supervision signals to generate labeled data. To the best of our knowledge, this is the first such work that invokes distant supervision while learning the joint distribution, so as to generate labeled data points at scale. Besides, automatic generation of labels for data based on training data-label pairs is non-trivial, and often does not work directly; distant supervision provides us with a mechanism to achieve this challenging goal. We encode distant supervision signals as a set of (weak) definitions, provided by annotators, using which unlabeled data points can be labeled. These definitions can be harvested from knowledge bases, domain heuristics, ontologies, rules-of-thumb, educated guesses, decisions of weak classifiers, or obtained using crowdsourcing. Many application domains have such distant supervision available through domain knowledge or heuristics, which can be leveraged in the proposed framework. We provide examples in Section 4 when we describe our experiments.
We encapsulate all available distant supervision signals, henceforth called labeling functions, in a unified abstract container called the Labeling Functions Block (LFB, see Figure 2a). Let the LFB comprise $m$ labeling functions $\{f_1, \dots, f_m\}$, where each labeling function $f_i$ is a mapping:

$$f_i : x \mapsto \tilde{y}_i \quad (2)$$

that maps a data point $x$ to a $k$-dimensional probabilistic label vector, $\tilde{y}_i = [p_1, \dots, p_k]$, where $k$ is the number of class labels, with $\sum_{j=1}^{k} p_j = 1$ for each $\tilde{y}_i$. For example, $x$ could be an image from the MNIST dataset, and $\tilde{y}_i$ would be the corresponding label vector when the labeling function $f_i$ is applied to $x$; $\tilde{y}_i$, for instance, could be the one-hot 10-dimensional class vector (see Figure 2b).
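As an illustration, a labeling function of this form can be sketched in a few lines of numpy (the heuristic feature and all names here are ours, purely for illustration): it maps a 28x28 grayscale image to a $k$-dimensional probabilistic label vector.

```python
import numpy as np

def edge_heuristic_lf(image, k=10):
    """Toy labeling function (hypothetical heuristic): maps a 28x28
    grayscale image to a k-dimensional probabilistic label vector."""
    # Crude feature: total vertical-gradient energy of the image.
    grad = np.abs(np.diff(image.astype(float), axis=0)).sum()
    scores = np.ones(k)
    scores[int(grad) % k] += 5.0      # favour one class based on the feature
    return scores / scores.sum()      # normalize so the vector sums to 1

img = np.zeros((28, 28))
img[10:18, 5:22] = 1.0                # synthetic stand-in "image"
y_tilde = edge_heuristic_lf(img)
assert y_tilde.shape == (10,) and abs(y_tilde.sum() - 1.0) < 1e-9
```

Any weak cue (an edge detector, a histogram statistic, a pretrained feature) can be wrapped this way, as long as its output is normalized into a probability vector over the classes.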
We characterize the set of labeling functions, $F = \{f_1, \dots, f_m\}$, with two kinds of dependencies: (i) relative accuracies of the labeling functions with respect to the true class label of a given data point; and (ii) inter-function dependencies that capture the relationships between the labeling functions with respect to the predicted class label. To obtain a final label for a given data point using the LFB, we use two different sets of parameters, $\omega$ and $\phi$, to capture each of these dependencies between the labeling functions. We hence denote the Labeling Function Block (LFB) as:

$$\tilde{y} = \mathrm{LFB}(F, \omega, \phi)(x) \quad (3)$$

i.e., given a set of labeling functions $F$, a set of parameters capturing the relative accuracy-based dependencies between the labeling functions, $\omega$, and a second set of parameters capturing inter-function dependencies, $\phi$, the LFB provides a probabilistic label vector, $\tilde{y}$, for a given data input $x$.

The joint distribution we seek to model in this work (Equation 1) hence becomes:

$$P_\theta\big(x, \mathrm{LFB}(F, \omega, \phi)(x)\big) \quad (4)$$
In the rest of this section, we show how we can learn the parameters of the above distribution modeling image-label pairs using an adversarial framework with a high degree of label fidelity. We use Generative Adversarial Networks (GANs) to model the joint distribution in Equation 4. In particular, we provide a mechanism to integrate the LFB (Equation 3) into the GAN framework, and show how $\omega$ and $\phi$ can be learned through the framework itself. Our adversarial loss function is given by:

$$\min_G \max_D \; \mathbb{E}_{(x, y) \sim p_{data}}[\log D(x, y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (5)$$

where $G$ is the generator module and $D$ is the discriminator module. The overall architecture of the proposed ADP framework is shown in Figure 2a.
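As a toy illustration of the minmax objective (not the paper's implementation), the following numpy snippet evaluates the discriminator's objective for batches of discriminator outputs; the discriminator ascends this quantity, the generator descends it:

```python
import numpy as np

def gan_objective(d_real, d_fake, eps=1e-12):
    """Value of E[log D(x,y)] + E[log(1 - D(G(z)))] for batches of
    discriminator outputs in (0,1)."""
    d_real = np.clip(d_real, eps, 1 - eps)   # numerical safety for log
    d_fake = np.clip(d_fake, eps, 1 - eps)
    return np.log(d_real).mean() + np.log(1.0 - d_fake).mean()

# A confident discriminator attains a higher objective value than one
# that has been fooled into outputting 0.5 everywhere.
v_good = gan_objective(np.array([0.9, 0.95]), np.array([0.05, 0.1]))
v_fooled = gan_objective(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
assert v_good > v_fooled
```

At the equilibrium of the game, the discriminator cannot distinguish real image-label pairs from generated ones, which is exactly when the generator has matched the joint training distribution.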
This approach has a few advantages: (i) labeling functions (which can even be just loosely defined) are cheaper to obtain than collecting labels for a large dataset; (ii) labeling functions can help bring domain knowledge into such generative models; (iii) labeling functions act as an implicit regularizer in the label space, thus allowing good generalization; (iv) with minimal fine-tuning, labeling functions can easily be re-purposed for new domains (transfer learning), as we describe later in this paper.
The ADP architecture is designed to learn the parameters required to model the joint distribution in Equation 4, and thus generate out-of-sample image-label pairs. This architecture is broadly divided into three modules: the generator, discriminator and the LFB. We now describe each of these modules individually.
3.1 The ADP - Generator
Given a noise input $z$ and a set of labeling functions $F$, the generator $G$ outputs an image $x$ and the parameters $\omega$ and $\phi$, the dependencies between the labeling functions described earlier. In particular, $G$ consists of three blocks, as shown in Figure 2a: a shared block, an image branch and a parameter branch. The shared block captures the common high-level semantic relationships between the data and the label space, and is comprised only of fully connected (FC) layers. Its output forks into two branches: the image branch, which generates the image $x$, and the parameter branch, which generates the parameters $\omega$ and $\phi$. While the parameter branch uses FC layers, the image branch uses Fully Convolutional (FCONV) layers to generate the image (more details in Section 4). Thus, the generator outputs $(x, \omega, \phi)$ given input $z$ drawn from the standard normal distribution.
3.2 The ADP - Discriminator
The discriminator $D$ of ADP estimates the likelihood of an image-label input pair being drawn from the real distribution obtained from training data. $D$ takes a batch of image-label pairs as input and maps each pair to a probability score estimating the aforementioned likelihood. To accomplish this, $D$ has two branches, one for the image and one for the label (shown in the Discriminator block in Figure 2a). These two branches are not coupled in the initial layers, so as to separately extract the required low-level features. The branches share weights in later layers to extract joint semantic features that help classify correctly whether an image-label pair is fake or real.
We hence expand our objective function from Equation 5 to the following:

$$\min_G \max_D \; \mathbb{E}_{(x, y) \sim p_{data}}[\log D(x, y)] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D\big(x_g, \mathrm{LFB}(F, \omega, \phi)(x_g)\big)\big)\big] \quad (6)$$

where $x_g$ denotes the image generated by the generator from noise $z$.
3.3 The ADP - Labeling Function Block
This is a critical module of the proposed ADP framework. Our initial work revealed that a simple weighted (linear or non-linear) sum of the labeling functions does not perform well in generating out-of-sample image-label pairs. We hence used a separate adversarial methodology within this block to learn the dependencies between the labeling functions provided to the framework, both relative accuracies and inter-function dependencies (discussed earlier in this section). We describe the components of the LFB below.
3.3.1 Relative Accuracies of Labeling Functions
The output, $\omega$, of the parameter branch in the ADP-Generator provides the relative accuracies of the labeling functions. Given the image $x_g$ generated by the generator, the labeling functions $\{f_1, \dots, f_m\}$, and the probabilistic label vectors $\tilde{y}_i$ obtained using the labeling functions (as in Eqn 2), we define the aggregated final label as:

$$\tilde{y} = \sum_{i=1}^{m} \bar{\omega}_i \, \tilde{y}_i \quad (7)$$

where $\bar{\omega}$ is the normalized version of $\omega$, i.e. $\bar{\omega}_i = \omega_i / \sum_{j=1}^{m} \omega_j$. The aggregated label, $\tilde{y}$, is provided as an output of the LFB.
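The weighted aggregation of labeling-function outputs can be sketched as follows (a minimal numpy illustration; the variable names are ours):

```python
import numpy as np

def aggregate_label(omega, y_tildes):
    """Aggregate per-function probabilistic labels with learned relative
    accuracies: y = sum_i omega_bar_i * y_i, omega_bar = omega / sum(omega)."""
    omega = np.asarray(omega, dtype=float)
    omega_bar = omega / omega.sum()                      # normalize weights
    return omega_bar @ np.asarray(y_tildes, dtype=float)  # (m,) @ (m,k) -> (k,)

# Three labeling functions over k=3 classes; the second one carries the
# highest relative accuracy and dominates the aggregated label.
y = aggregate_label([1.0, 3.0, 1.0],
                    [[0.8, 0.1, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.2, 0.2, 0.6]])
assert abs(y.sum() - 1.0) < 1e-9 and y.argmax() == 1
```

Since each input vector sums to 1 and the weights are normalized, the aggregated output is itself a valid probability vector over the classes.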
3.3.2 Inter-function Dependencies
Our preliminary empirical studies with considering only relative accuracies of labeling functions as a weighting mechanism led to mode collapse in the joint distribution space, a well-understood problem in GANs: either images of the same class with different labels, or images of different classes with the same label, were generated. The rationale behind using two discriminators is to penalize these missing modes. Related literature shows that inter-function dependencies act as an implicit regularizer in the label space. We also conducted experiments on synthetic data to demonstrate this issue (please see the synthetic data figure). We hence introduced an adversarial mechanism inside the LFB to influence the final relative accuracies, $\omega$, using the inter-function dependencies between the labeling functions. A discriminator inside the LFB receives two inputs: the generated dependency parameters $\phi$, which are output by the generator, and a reference dependency matrix $\hat{\phi}$, which is obtained from the labeling function outputs using the procedure described in Algorithm 1.
Algorithm 1 computes a matrix of interdependencies between the labeling functions, $\hat{\phi}$, by looking at the one-hot encodings of their predicted label vectors. If the one-hot encodings match for a given data input, we increase the count of their correlation by one, and compute this matrix across a particular mini-batch of data points under consideration. The counts are then normalized row-wise to obtain $\hat{\phi}$. The task of this discriminator is to recognize the computed interdependencies $\hat{\phi}$ as real, and the $\phi$ generated through the generator network as fake. The gradient backpropagated through this discriminator to the parameter branch of the generator is critical as a regularizer in learning a better $\omega$, which is finally used to weight the labeling functions (as in Section 3.3.1). Combining the gradient information from both discriminators penalizes missing modes and helps generate more variety in the samples. The objective function of our second adversarial module is hence:

$$\min_G \max_{D_{LF}} \; \mathbb{E}\big[\log D_{LF}(\hat{\phi})\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D_{LF}(\phi)\big)\big] \quad (8)$$

where $\hat{\phi}$ and $\phi$ are obtained as described above. More details of the LFB are provided in the implementation details in Section 4.
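The interdependency computation described for Algorithm 1 can be sketched as follows (a numpy illustration of counting one-hot agreements over a mini-batch and row-normalizing; the exact handling of ties and the data layout are our assumptions):

```python
import numpy as np

def dependency_matrix(label_vectors):
    """Sketch of the described procedure: over a mini-batch, count how often
    each pair of labeling functions agrees on the one-hot (argmax) label,
    then row-normalize the counts."""
    preds = np.argmax(label_vectors, axis=2)        # (batch, m) hard labels
    batch, m = preds.shape
    phi = np.zeros((m, m))
    for b in range(batch):
        agree = preds[b][:, None] == preds[b][None, :]
        phi += agree                                 # +1 where one-hots match
    return phi / phi.sum(axis=1, keepdims=True)      # row-normalize counts

# Batch of 2 data points, 3 labeling functions, k=2 classes.
lv = np.array([[[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]],
               [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]])
phi_hat = dependency_matrix(lv)
assert phi_hat.shape == (3, 3)
assert np.allclose(phi_hat.sum(axis=1), 1.0)
assert phi_hat[0, 1] > phi_hat[0, 2]   # functions 0 and 1 agree more often
```

The resulting matrix is what the LFB discriminator treats as the "real" sample of inter-function dependencies.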
The overall architecture of ADP (Figure 2a) is trained using end-to-end backpropagation, with gradients from both discriminators influencing the weights learned inside the generator. Mini-batches of image-label pairs from a given training distribution are provided as input to ADP, and Stochastic Gradient Descent (SGD) is used to learn the parameters of the model. At the end of training, the samples generated using the generator and LFB modules provide samples from the desired joint distribution (Eqn 1) modeled using the framework.
4 Experiments and Results
4.1 Datasets

We validated the ADP framework on standard datasets: MNIST, Fashion-MNIST, SVHN, and CIFAR-10. No additional pre-processing is performed on the MNIST, Fashion-MNIST and CIFAR-10 datasets. For the SVHN dataset, we used the ‘Format 2 Cropped’ version, and included an additional crop on each image to reduce the presence of more than one digit, though the dimension is maintained. Code is available at https://github.com/ArghyaPal/Adversarial-Data-Programming.
4.2 Labeling Functions
Labeling functions form a critical element of ADP, and we used different cues from state-of-the-art algorithms to help obtain labeling functions for our experiments. Table 1 shows the labeling functions we used for our experiments on MNIST and SVHN (digit recognition problems), and Table 2 shows the functions used for CIFAR and Fashion-MNIST. We categorized labeling functions as: (i) heuristic; (ii) image processing-based; and (iii) deep learning-based labeling functions (as in Tables 1 and 2). Table 3 presents statistics of the number of labeling functions used for each of the considered datasets (the empirical study that motivated these choices is presented in Section 5). In this work, for each labeling function, a simple threshold rule on the norm of the aforementioned features is used, where the threshold is obtained empirically as the mean of the norms of a randomly chosen subset, trimmed to remove outliers. More examples of labeling functions and ablation studies on their usefulness are presented in the Supplementary Section.
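The threshold rule can be sketched as follows (a minimal numpy illustration; the trimming fraction, the stand-in feature, and all names are our assumptions, since the paper leaves these unspecified):

```python
import numpy as np

def make_norm_threshold_lf(feature_fn, calibration_images, trim_frac=0.1):
    """Build a binary labeling-function sketch: threshold the norm of a
    feature at the trimmed mean of norms over a calibration subset."""
    norms = np.sort([np.linalg.norm(feature_fn(x)) for x in calibration_images])
    cut = int(len(norms) * trim_frac)          # drop extremes on both ends
    thresh = norms[cut:len(norms) - cut].mean() if cut else norms.mean()
    return lambda x: int(np.linalg.norm(feature_fn(x)) > thresh)

feature = lambda img: img.ravel()              # trivial stand-in feature
rng = np.random.default_rng(0)
calib = [rng.random((4, 4)) for _ in range(20)]
lf = make_norm_threshold_lf(feature, calib)
assert lf(np.ones((4, 4))) == 1 and lf(np.zeros((4, 4))) == 0
```

In practice, the same construction applies with any of the features in Tables 1 and 2 (edges, histograms, HOG, pretrained conv-layer activations) substituted for the stand-in feature.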
Table 1. Labeling functions used for MNIST and SVHN.

| Type | Labeling Functions used |
| Heuristic | Presence of long edges (vertical or horizontal); Image histogram |
| Image Processing based | Bag-of-features; Haar wavelet; Discrete-continuous ADM; Compressive sensing |
| Deep Learning based | Convolution kernels from the last conv layer (before fully connected layers) of LeNet |
Table 2. Labeling functions used for CIFAR-10 and Fashion-MNIST.

| Type | Labeling Functions used |
| Heuristic | PatchMatch; Blob detection; Presence of edges; Textons; Image histogram |
| Image Processing based | Global descriptor (GIST-based); Local descriptor (SIFT-based); Bag-of-visual-words; Histogram of Oriented Gradients (HOG)-based: HoGgles |
| Deep Learning based | Convolution kernels from the last conv layer (before fully connected layers) of (ImageNet) pre-trained AlexNet |

Table 3. Number of labeling functions of each type used per dataset.

| Heuristic | Image Processing | Deep Learning |
4.3 Implementation Details
The shared block of the generator has 3 dense fully connected (FC) layers (128 nodes per layer) with batch-normalization. The image branch continues with fractional-length convolutional layers (FCONV: 1024 nodes per layer). The parameter branch uses FC layers and generates the dependency parameters. The discriminator network follows an in-plane rotation network; its label branch is a stack of FC layers, and both discriminator branches have 2 FCONV layers followed by FC layers. We trained the complete model with mini-batch Stochastic Gradient Descent (SGD) with a batch size of 128, a learning rate of 0.0001, a momentum factor of 0.5, and Adam as the optimizer.
4.4 Comparison with State-of-the-Art Models
We compared our method against other generative methods that allow generation of data along with a label: Conditional GAN or CGAN (), ACGAN (), InfoGAN (), CoGAN () and TripleGAN (). (We changed the use case setup of these methods to generate data-label pairs as required. For example, for a conditional GAN, we specified a class label, generated a corresponding image and used this as an image-label pair.) We used the publicly available codes for each of the above methods, and the results for CIFAR10 are shown in Figure 3 (Results for other datasets are shown in the Supplementary Section). The figure shows that the proposed model generates images with very good clarity. Besides, while some of the aforementioned methods (such as CGAN and InfoGAN) generate images conditioned on a given label (and hence require a label to be provided as input), the label is provided by the model in our case.
We considered three evaluation metrics for studying the performance of our method quantitatively: (i) Human Turing test (HTT): This metric studies how hard it is for a human annotator to tell the difference between real and generated samples. We asked 40 subjects to evaluate image quality and image-label correspondence (Table 5) on a scale of 10, given 50 random image-label samples from the generated pool for each method considered. Table 5 shows consistently good performance of ADP over other methods, especially in image-label correspondence, which is the focus of this work; (ii) Inception Score: The inception score for the CIFAR-10 dataset is shown in Figure 5. The figure shows that ADP and TripleGAN perform significantly better than the rest of the methods (more results on other datasets are included in the Supplementary Section); (iii) we also used a Parzen window-based evaluation metric, and these results are included in the Supplementary Section.
We trained a ResNet-56 model on the CIFAR-10 dataset under different settings, and the results are shown in Table 4. The addition of labeled data generated using our method significantly lowers the test cross-entropy loss across the epochs.
Table 4. Test cross-entropy loss of ResNet-56 on CIFAR-10 across training epochs (left to right).

| (Training Data, Test Data) | Epochs |
| (Real data-50K, Real data-10K) | 9.83 | 7.3 | 7.12 | 6.3 | 6.1 | 4.3 | 4.19 |
| (ADP data-50K, Real data-10K) | 9.32 | 8.9 | 8.13 | 7.0 | 6.75 | 5.53 | 5.0 |
| (Real data-50K, ADP data-10K) | 9.67 | 9.4 | 7.92 | 7.3 | 6.81 | 6.18 | 5.6 |
| (ADP-25K + Real data-25K, Real data-10K) | 8.5 | 6.6 | 6.21 | 5.7 | 5.5 | 4.83 | 3.5 |
| (ADP-50K + Real data-50K, Real data-10K) | 7.71 | 6.3 | 6.0 | 5.34 | 3.1 | 2.92 | 2.71 |
Table 5. Human Turing test scores (scale of 10).

| Dataset | Image Quality | Image-Label Correspondence |
To study the usefulness of the generated image-label pairs, we studied the classification cross-entropy loss of a pretrained ResNet model on the image-label pairs generated by our ADP at test time. A lower cross-entropy loss in ResNet at test time indicates the efficacy of our model as a data augmentation method. We compared our method against TripleGAN, InfoGAN, CoGAN, as well as the popular oversampling technique SMOTE. Figure 4 shows that the proposed model has a significantly lower cross-entropy loss than the other methods, highlighting its usefulness.
5 Discussion and Analysis
Optimal Number of Labeling Functions:
We studied the performance of ADP when the number of labeling functions is varied, to understand the impact of this parameter. We studied the test cross-entropy error of a pretrained ResNet model on image-label pairs generated by ADP trained using different numbers of labeling functions. Table 6 shows our results, suggesting that 50-55 labeling functions provide the best performance, depending on the dataset. This justifies our choice of the number of labeling functions in Table 3.
Table 6. Test cross-entropy error for varying numbers of labeling functions.

| No. of Labeling Functions | MNIST | F-MNIST | CIFAR10 | SVHN |
Transfer Learning: The use of distant supervision signals such as labeling functions (which can often be generic) allows us to extend the proposed ADP model to a transfer learning setting. In this setup, we trained ADP initially on a source dataset and then finetuned the model on a target dataset with very limited training. In particular, we first trained ADP on the MNIST dataset, and subsequently finetuned the image-generation branch alone with the SVHN dataset; the weights of the remaining modules are unaltered. The final finetuned model is then used to generate image-label pairs (which we hypothesize will look similar to SVHN). Figure 1 shows encouraging results of our experiments in this regard.
Multi-task Joint Distribution Learning:
Learning a cross-domain joint distribution from heterogeneous domains is a challenging task. We show that the proposed ADP method can be used to achieve this, by modifying its architecture as shown in Figure 7, to simultaneously generate data from two different domains. We study this architecture on the MNIST and SVHN datasets, and show the promising results of our experiments in Figure 8. The LFB acts as a regularizer and maintains the correlations between the domains in this case. More results on other datasets - in particular, LookBook and Fashion MNIST - are included in the Supplementary Section as well as Figure 1.
6 Conclusion

Paucity of large curated hand-labeled training data for every domain-of-interest forms a major bottleneck in deploying machine learning methods in practice. Standard data augmentation techniques and other heuristics are often limited in their scope and require carefully picked, hand-tuned parameters. We instead propose a new adversarial framework called Adversarial Data Programming (ADP), which can learn the joint data-label distribution effectively using a set of weakly defined labeling functions. The method shows promise on standard datasets, as well as in settings such as transfer learning and multi-task learning. Our future work will involve understanding the theoretical implications of this new framework from a game-theoretic perspective, as well as exploring the performance of the method on more complex datasets.
-  E. Alfonseca, K. Filippova, J.-Y. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 54–59. Association for Computational Linguistics, 2012.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
-  C. Barnes, D. B. Goldman, E. Shechtman, and A. Finkelstein. The patchmatch randomized matching algorithm for image manipulation. Communications of the ACM, 54(11):103–110, 2011.
-  A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.
-  O. Breuleux, Y. Bengio, and P. Vincent. Unlearning for better mixing.
-  R. Bunescu and R. Mooney. Learning to extract relations from the web using minimal supervision. In ACL, 2007.
-  N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
-  X. Chen, X. Cheng, and S. Mallat. Unsupervised deep haar scattering on graphs. In Advances in Neural Information Processing Systems, pages 1709–1717, 2014.
-  X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
-  D. Ciresan, U. Meier, L. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. arXiv preprint arXiv:1003.0358, 2010.
-  I. Danihelka, B. Lakshminarayanan, B. Uria, D. Wierstra, and P. Dayan. Comparison of maximum likelihood and gan-based training of real nvps. arXiv preprint arXiv:1705.05263, 2017.
-  E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
-  T. DeVries and G. W. Taylor. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.
-  C. Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
-  A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1734–1747, 2016.
-  Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. arXiv preprint arXiv:1709.06548, 2017.
-  H. Gao, G. Barbier, and R. Goolsby. Harnessing the crowdsourcing power of social media for disaster relief. IEEE Intelligent Systems, 26(3):10–14, 2011.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  B. Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
-  S. Hauberg, O. Freifeld, A. B. L. Larsen, J. Fisher, and L. Hansen. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In Artificial Intelligence and Statistics, pages 342–350, 2016.
-  Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Controllable text generation. arXiv preprint arXiv:1703.00955, 2017.
-  T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
-  V. V. Kumar, A. Srikrishna, B. R. Babu, and M. R. Mani. Classification and recognition of handwritten digits by using mathematical morphology. Sadhana, 35(4):419–426, 2010.
-  E. Laude, J.-H. Lange, F. Schmidt, B. Andres, and D. Cremers. Discrete-continuous splitting for weakly supervised learning. arXiv preprint arXiv:1705.05020, 2017.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  C. Li, K. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. arXiv preprint arXiv:1703.02291, 2017.
-  Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1718–1727, 2015.
-  M. Liu, O. Tuzel, and A. Sullivan. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, 2016.
-  S. Mandal, S. Sur, A. Dan, and P. Bhowmick. Handwritten bangla character recognition in machine-printed forms using gradient information and haar wavelet. In Image Information Processing (ICIIP), 2011 International Conference on, pages 1–6. IEEE, 2011.
-  M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
-  H. Moudni, M. Er-rouidi, M. Oujaoura, and O. Bencharef. Recognition of amazigh characters using surf & gist descriptors. In International Journal of Advanced Computer Science and Application. Special Issue on Selected Papers from Third international symposium on Automatic Amazigh processing, pages 41–44, 2013.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
-  A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585, 2016.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pages 3567–3575, 2016.
-  A. J. Ratner, H. R. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré. Learning to compose domain-specific transformations for data augmentation. arXiv preprint arXiv:1709.01643, 2017.
-  S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. Machine learning and knowledge discovery in databases, pages 148–163, 2010.
-  L. Rothacker, S. Vajda, and G. A. Fink. Bag-of-features representations for offline handwriting recognition applied to arabic script. In Frontiers in Handwriting Recognition (ICFHR), 2012 International Conference on, pages 149–154. IEEE, 2012.
-  R. E. Schapire and Y. Freund. Boosting: Foundations and algorithms. MIT press, 2012.
-  C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.
-  L. Theis, A. v. d. Oord, and M. Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
-  P. Varma, B. He, P. Bajaj, I. Banerjee, N. Khandwala, D. L. Rubin, and C. Ré. Inferring generative model structure with static analysis. arXiv preprint arXiv:1709.02477, 2017.
-  C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. HOGgles: Visualizing Object Detection Features. ICCV, 2013.
-  H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
-  Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510, 2017.
-  D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, pages 517–532. Springer, 2016.
-  K. Yu, Y. Lin, and J. Lafferty. Learning image representations from the pixel level via hierarchical sparse coding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1713–1720. IEEE, 2011.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.
-  M. Zhenjiang and Y. Baozong. Handwritten character recognition by extended loop neural networks. In Speech, Image Processing and Neural Networks, 1994. Proceedings, ISSIPNN’94., 1994 International Symposium on, pages 460–463. IEEE, 1994.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
Appendix A Algorithm
Algorithm 1 presents the overall stepwise routine of the proposed method, ADP, as described in Section 3. During the training phase, the algorithm updates the weights of the model by estimating gradients over a batch of labeled data points. The hyperparameters to be provided include the standard parameters used when training a GAN: (i) the number of iterations of Algorithm 1; (ii) a parameter that specifies how many times the discriminator blocks are updated for each generator update; and (iii) the minibatch size. These values were chosen through empirical studies.
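The alternating schedule above can be sketched as a minimal training skeleton. This is an illustrative stand-in, not the actual ADP implementation: `train_adp_sketch`, the `k` ratio parameter, and the counter-based "updates" are all hypothetical placeholders for real gradient steps.

```python
import random

def train_adp_sketch(num_iterations, k, batch_size, data):
    """Illustrative skeleton of the alternating update schedule:
    the discriminator blocks are updated k times per generator update."""
    updates = {"discriminator": 0, "generator": 0}
    for _ in range(num_iterations):
        for _ in range(k):
            # sample a minibatch and take a discriminator gradient step
            batch = random.sample(data, min(batch_size, len(data)))
            updates["discriminator"] += 1  # stand-in for a gradient step
        # one generator update per outer iteration
        updates["generator"] += 1  # stand-in for a gradient step
    return updates

counts = train_adp_sketch(num_iterations=10, k=2, batch_size=4,
                          data=list(range(100)))
```

With `k = 2` and 10 outer iterations, the discriminators receive twice as many updates as the generator, mirroring the ratio parameter described above.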
Appendix B Datasets
In this section, we provide more information on the datasets used for validating ADP in this work: MNIST, Fashion MNIST, SVHN and CIFAR 10. The MNIST dataset comprises 28×28 grayscale images (one handwritten digit per image) along with the corresponding labels, with 50,000 training samples (image-label pairs). For SVHN, we used format 2 of the dataset, which comprises 73,257 images (each containing a digit captured from street views of house numbers) with the corresponding labels. For CIFAR 10, we merged the five training batches of the dataset to build a training set of 50,000 images. This dataset contains 32×32 RGB images spanning 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck; the samples are almost equally distributed across the classes. Fashion MNIST, similar to MNIST, consists of a training set of 50,000 grayscale images, each belonging to one of 10 classes: t-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot.
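The merging of the CIFAR 10 training batches into a single 50,000-image training set can be sketched as follows. The arrays here are zero-filled stand-ins for the five real batches (each real batch holds 10,000 RGB images of size 32×32), so only the shapes are meaningful.

```python
import numpy as np

# Stand-ins for the five CIFAR-10 training batches:
# each real batch holds 10,000 RGB images of size 32x32.
batches = [np.zeros((10000, 32, 32, 3), dtype=np.uint8) for _ in range(5)]
labels = [np.zeros(10000, dtype=np.int64) for _ in range(5)]

# Concatenate the five batches into one 50,000-sample training set.
train_x = np.concatenate(batches, axis=0)
train_y = np.concatenate(labels, axis=0)
```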
We also used the LookBook dataset (Figure 9) to demonstrate cross-domain multi-task learning using ADP, as described in Section 5. This dataset contains 84,748 images across 17 classes: midi dress, mini dress, coat, jacket, fur jacket, padded jacket, hooded jacket, jumper, cardigan, knitwear, blouse, shirt, sleeveless tee, short sleeve tee, long sleeve tee, hoody and vest. In this work, we grouped these 17 classes into 4 classes (coat, pullover, t-shirt and dress) in order to match the Fashion MNIST dataset and thus study cross-domain learning. We mapped coat, jacket, fur jacket, padded jacket, hooded jacket, jumper and cardigan to a single coat class; hoody to the pullover class; sleeveless tee and short sleeve tee to the t-shirt class; and cardigan, knitwear, blouse, midi dress and mini dress to the dress class. The Fashion MNIST dataset has the same classes (coat, pullover, t-shirt and dress) among its labels, thus facilitating our study.
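The class grouping above can be expressed as a simple lookup table. This is an illustrative sketch: the text lists cardigan under both the coat and dress groups, so here it is assigned once (to dress), and classes the text does not mention (e.g. vest) are left unmapped.

```python
# Mapping from LookBook classes to the 4 Fashion-MNIST-aligned classes.
GROUPS = {
    "coat": ["coat", "jacket", "fur jacket", "padded jacket",
             "hooded jacket", "jumper"],
    "pullover": ["hoody"],
    "t-shirt": ["sleeveless tee", "short sleeve tee"],
    "dress": ["cardigan", "knitwear", "blouse", "midi dress", "mini dress"],
}
# Invert the grouping into a per-class lookup table.
CLASS_MAP = {src: dst for dst, srcs in GROUPS.items() for src in srcs}

def relabel(lookbook_label):
    """Map a LookBook class name to its merged class, or None if unused."""
    return CLASS_MAP.get(lookbook_label.lower())
```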
No additional pre-processing was performed on the MNIST, Fashion MNIST, CIFAR 10 or LookBook datasets. For SVHN, an additional crop was performed on each image to ensure that only one digit is present in the image; the cropped image was subsequently resized to maintain the original image size. Figure 9 shows illustrative examples of images from the chosen datasets.
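The SVHN crop-and-resize step can be sketched as below. The crop size and the nearest-neighbour resampling are assumptions for illustration; the original pipeline's exact crop window and interpolation are not specified.

```python
import numpy as np

def center_crop_and_resize(img, crop, out_size):
    """Center-crop an HxWxC image to crop x crop, then resize back to
    out_size x out_size with nearest-neighbour sampling (a stand-in for
    whatever interpolation the original pipeline used)."""
    h, w = img.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    patch = img[top:top + crop, left:left + crop]
    # Nearest-neighbour index map from output pixels to patch pixels.
    idx = np.arange(out_size) * crop // out_size
    return patch[idx][:, idx]

img = np.arange(32 * 32 * 3, dtype=np.uint8).reshape(32, 32, 3)
out = center_crop_and_resize(img, crop=24, out_size=32)
```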
Appendix C More on Labeling Functions
The Labeling Functions Block (LFB) in Figure 2a (Section 3) is implemented using the open-source Snorkel framework. We modified the underlying architecture of Snorkel to adopt an adversarial approach; by default, it estimates dependencies using maximum likelihood estimation (MLE) with Gibbs sampling. Three kinds of labeling functions are used in this work, as described in Section 4: heuristics, image processing-based functions, and functions based on deep learned features. Examples of the labeling functions used in this work are shown as Labeling Functions 2, 3, 4 and 5. Each labeling function applies a simple threshold rule to the norm of the aforementioned features. For each class of a dataset, the threshold is obtained empirically as the average norm of the feature vectors of 20 random samples of that class (with trimming to remove outliers). Note that, for ease of exposition, the example Labeling Functions 4 and 5 return one-hot encodings; in practice, we fit a nonlinear function to obtain a probabilistic output.
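A threshold-rule labeling function of the kind described above can be sketched as follows. This is a simplified illustration, not the paper's code: the Euclidean norm, the 10% trimming fraction, and the abstain-as-zeros convention are assumptions, and the one-hot return value corresponds to the simplified form of the example labeling functions (the probabilistic variant is omitted).

```python
import numpy as np

def trimmed_mean(values, trim_frac=0.1):
    """Mean after discarding the smallest/largest trim_frac of values."""
    v = np.sort(np.asarray(values, dtype=float))
    k = int(len(v) * trim_frac)
    return v[k:len(v) - k].mean() if len(v) > 2 * k else v.mean()

def make_threshold_lf(class_idx, class_samples, num_classes, trim_frac=0.1):
    """Build a labeling function for one class: the threshold is the
    trimmed average norm of the feature vectors of a few random samples
    of that class; the function casts a one-hot vote or abstains."""
    threshold = trimmed_mean([np.linalg.norm(f) for f in class_samples],
                             trim_frac)
    def lf(features):
        vote = np.zeros(num_classes)
        if np.linalg.norm(features) >= threshold:
            vote[class_idx] = 1.0  # one-hot vote for this class
        return vote  # all-zeros = abstain
    return lf

rng = np.random.default_rng(0)
samples = [rng.normal(size=8) + 2.0 for _ in range(20)]  # 20 class samples
lf = make_threshold_lf(class_idx=3, class_samples=samples, num_classes=10)
```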
C.1 Ablation Studies with Labeling Functions
In order to understand the effect of different kinds of labeling functions, we performed an ablation study on the CIFAR 10 dataset (since it is the most natural of the considered datasets, and since it allows us to compute an Inception score to quantitatively compare the performance of various methods). In this study, we did not alter any of the hyperparameters described in Section 4. Our ablation study considers the following models:
ADP: the full model.
ADP with no dependencies: the same model as ADP with all 55 labeling functions (as in Table 3), but with each labeling function considered independent of the others.
ADP with only heuristic labeling functions: the same model as ADP with the 36 heuristic labeling functions, but without any image processing or deep learned feature-based labeling functions.
ADP with only image processing labeling functions: the same model as ADP with only the 17 image processing-based labeling functions, but without heuristic or deep learned feature-based labeling functions.
ADP with only deep learned feature-based labeling functions: the same model as ADP with only the 2 deep learned feature-based labeling functions, but without heuristic or image processing labeling functions.
ADP with heuristic and deep learned feature-based labeling functions, but without image processing labeling functions.
ADP with deep learned feature-based and image processing labeling functions, but without heuristic labeling functions.
ADP with image processing and heuristic labeling functions, but without deep learned feature-based labeling functions.
The Inception scores for these 8 models are presented in Table 7. The base ADP model, comprising all labeling functions, outperforms all other models, highlighting the usefulness of a variety of labeling functions in modeling the non-trivial joint distribution.
Appendix D More Qualitative Results
In addition to the results on CIFAR 10 presented in Section 4.4, we also studied the performance of our ADP method against other generative methods (CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN) on the MNIST, SVHN and Fashion MNIST datasets. As with CIFAR 10 generation, we adapted the use-case setup of the other methods to generate labeled images, using the publicly available code for each method. Figures 10, 11 and 12 present these results.
Figure 10 shows that both our method, ADP, and TripleGAN generate good-quality images on the MNIST dataset, and that both exhibit high image-to-label correspondence. Surprisingly, state-of-the-art methods such as CGAN, ACGAN, InfoGAN and CoGAN fail to capture image-to-label correspondence despite generating good-quality images.
As shown in Figure 11, our method generates human-recognizable images with good image-to-label correspondence in just 1k epochs on the relatively harder SVHN dataset. At higher epochs, CoGAN (epoch 30) and TripleGAN (epoch 40) also generate images of good quality, but broadly fail to capture the different styles, backgrounds and illuminations of the generated digits.
On Fashion MNIST (Figure 12), most of the considered methods do well; on close visual inspection, ADP and TripleGAN provide the sharpest results.
Appendix E More Quantitative Results
Parzen Window Based Evaluation:
In addition to the results with the Inception score and HTT presented in Section 4.4, we compared our method against the other generative models (described in Section 4) based on the Parzen window score at test time. The Parzen window is a commonly used non-parametric density estimation method for evaluating generative models (especially GANs) whose exact likelihood is intractable. Using the samples generated by a model, we fit a Parzen window with a Gaussian kernel as a density estimator; this provides a proxy for the true log-likelihood and thereby an estimate of the test log-likelihood. These results are shown in Table 8. The table shows that ADP performs significantly well on the MNIST (score: 344) and SVHN (score: 246) datasets, outperforming other state-of-the-art models including TripleGAN. On Fashion MNIST, our method is a close second to TripleGAN. We chose the Parzen window size using cross-validation.
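The Parzen window estimate above can be sketched as follows: fit a Gaussian-kernel density to generated samples and score held-out points by their average log-likelihood. This is a minimal 2-D sketch with an arbitrary bandwidth, not the paper's evaluation code (which operates on images and cross-validates the window size).

```python
import numpy as np

def parzen_log_likelihood(test_points, generated, sigma):
    """Average log-likelihood of test points under a Parzen window
    (Gaussian-kernel) density fitted to generated samples."""
    test = np.asarray(test_points)[:, None, :]   # (T, 1, D)
    gen = np.asarray(generated)[None, :, :]      # (1, G, D)
    d2 = ((test - gen) ** 2).sum(-1) / (2 * sigma ** 2)  # (T, G)
    n, dim = gen.shape[1], gen.shape[2]
    # log of the mean of Gaussian kernels, computed stably (log-sum-exp)
    m = (-d2).max(axis=1, keepdims=True)
    log_kernel = m.squeeze(1) + np.log(np.exp(-d2 - m).sum(axis=1))
    log_norm = np.log(n) + 0.5 * dim * np.log(2 * np.pi * sigma ** 2)
    return float((log_kernel - log_norm).mean())

rng = np.random.default_rng(0)
gen = rng.normal(size=(200, 2))  # stand-in for generated samples
score_near = parzen_log_likelihood(rng.normal(size=(50, 2)), gen, sigma=0.5)
score_far = parzen_log_likelihood(rng.normal(size=(50, 2)) + 10, gen, sigma=0.5)
```

Test points drawn from the same distribution as the generated samples receive a higher score than points far from it, which is what makes the score usable as a likelihood proxy.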
Appendix F More Results on Multi-task Joint Distribution Learning
Continuing the results presented in Section 5 (and Section 1), we present additional results on the capability of ADP to perform multi-task joint distribution learning in Figure 15. The figure shows that ADP is able to generate samples from two different domains, including samples of different colors.
Appendix G Comparison against Vote Aggregation Methods
Comparison against Majority Voting and DP:
To study the usefulness of learning relative accuracies and inter-function dependencies using ADP, we compared the performance of our method with both majority voting and Data Programming (DP). Majority voting does not estimate the relative accuracies and inter-function dependencies of the labeling functions described in Section 3; instead, for a given image, each labeling function makes a probabilistic prediction, and we take the majority vote to obtain the final label. As in Section 4.4, we studied the test-time classification cross-entropy loss of a pre-trained ResNet model on image-label pairs generated by ADP, by its Image-GAN component combined with majority voting, and by DP. The results are presented in Figure 13a: ADP has a significantly lower cross-entropy loss than the other two methods, corroborating its effectiveness.
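The majority-voting baseline above can be sketched as follows. This is an illustrative aggregator under the assumption that each labeling function's probabilistic prediction is reduced to its argmax vote; the tie-breaking rule (lowest class index wins) is an arbitrary choice for the sketch.

```python
import numpy as np

def majority_vote(prob_predictions):
    """Aggregate probabilistic predictions from several labeling
    functions: take each function's argmax as its vote and return the
    most-voted class (ties broken by lowest class index)."""
    votes = [int(np.argmax(p)) for p in prob_predictions]
    return int(np.bincount(votes).argmax())

preds = [np.array([0.1, 0.7, 0.2]),   # votes for class 1
         np.array([0.2, 0.5, 0.3]),   # votes for class 1
         np.array([0.6, 0.3, 0.1])]   # votes for class 0
label = majority_vote(preds)
```

Unlike ADP, this aggregation weighs every labeling function equally, which is exactly the information (relative accuracies, inter-function dependencies) the comparison in Figure 13a probes.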
Adversarial Data Programming vs MLE-based Data Programming:
To further quantify the benefits of ADP, we also show how our method compares against Data Programming (DP) using different variants of maximum likelihood estimation: standard MLE, maximum pseudo-likelihood, and Hamiltonian Monte Carlo. We note that DP only aggregates labels; we hence combined a vanilla GAN with DP as separate components to conduct this study. We started with a small number of labeling functions (35 functions) and progressively added more, noting the time taken by each parameter estimation method. Figure 13b presents the results and shows that ADP is almost 100x faster than MLE-based estimation. Figure 14 also shows sample images generated by the vanilla GAN, along with the corresponding labels assigned by MLE-based DP using the same labeling functions as in our work. Clearly, the labels are incorrect, supporting the value of learning a joint distribution rather than combining two individual components.