1 Introduction
Curated labeled data is a key building block of modern machine learning algorithms, and a driving force behind deep neural network models. The large parameter space of deep models requires very large labeled datasets to build models that are effective in practice. However, this inherent dependency on large curated labeled datasets has become the major bottleneck for progress in applying machine learning and deep learning to computer vision and other domains
[41]. Creation of large-scale hand-annotated datasets in every domain is challenging: it requires extensive domain expertise and long hours of human labor, which collectively make the overall process expensive and time-consuming. Even when data annotation is carried out using crowdsourcing (e.g. Amazon Mechanical Turk), additional effort is required to measure the correctness (or goodness) of the obtained labels. We seek to address this problem in this work. (This paper is accepted at CVPR 2018.) In particular, we focus on automatically learning the parameters of a given joint image-label probability distribution (as provided in training image-label pairs) with a view to automatically creating labeled datasets. To achieve this objective, we exploit distant supervision signals to generate labeled data. These distant supervision signals are provided to our framework as a set of weak labeling functions which encode domain knowledge or heuristics obtained from experts or crowd annotators. Writing a set of labeling functions is, as we found in our experiments, fairly easy and quick; the functions can then be used in our framework to generate data along with associated labels. More interestingly, such labeling functions are often easily generalizable, allowing our framework to be extended to transfer learning and multi-task learning (discussed in Section 5). Figure 1 shows a few examples of our results to illustrate the overall idea.
In practice, labeling functions can be associated with two kinds of dependencies: (i) relative accuracies, which measure the correctness of the labeling functions w.r.t. the true class label; and (ii) inter-function dependencies, which capture the relationships between the labeling functions with respect to the predicted class label. In this work, we propose a novel adversarial framework based on Generative Adversarial Networks (GANs) that learns these dependencies along with the data distribution using a min-max game. Our GAN learns to generate a joint data-label distribution using a generator block, a discriminator block and a Labeling Functions Block (LFB), which contains another discriminator that helps in learning the two kinds of dependencies mentioned above. The overall architecture of the proposed ADP framework is presented in Figure 2a.
Our broad idea of learning relative accuracies and inter-function dependencies of labeling functions is inspired by the recently proposed Data Programming (DP) framework [36] (hence the name ADP), but our method differs in several ways: (i) DP is a strictly conditional model, i.e. it models $P(y|x)$, and requires additional unlabeled data points even at test time, while our model is a joint distribution model, i.e. it models $P(x, y)$, and does not require any additional unlabeled data points at test/generation time. (ii) DP learns a generative model using Maximum Likelihood Estimation (MLE) and gradient descent to learn the relative accuracies of labeling functions. We instead replace this approach with a GAN-based adversarial estimation of the parameters.
[11] and [42] provide insights on the advantages of a GAN-based estimator over MLE: relatively quicker training and better robustness of generated samples. (iii) To learn the statistical dependencies of labeling functions, DP models the dependency structure of labeling functions as a factor graph, and uses computationally expensive Gibbs sampling to update the gradient at each step. We replace the factor graph and Gibbs-sampling-based estimation of inter-function dependencies with another discriminator in our GAN-based estimation, which again speeds up learning and provides robust generation at run time. As outcomes of this work, we show how a set of low-quality, weak labeling functions can be used within a framework that models a joint data-label distribution to generate robust samples. We also show that this idea generalizes quite easily to transfer learning and multi-task learning settings, demonstrating the generalizability of this work. Our contributions can be summarized as follows:


We propose a novel adversarial framework, ADP, to generate robust data-label pairs that can be used to obtain datasets in domains that have very little data, thus saving human labor and time.

We show how an adversarial framework can be used to learn dependencies between weak labeling functions, and thus provide high-fidelity aggregated labels along with generated data in a GAN setting.

The proposed framework can also be used in a transfer learning setting, where ADP is trained on a source domain and then fine-tuned on a target domain to generate data-label pairs in the target domain.

We also show the potential of the ADP framework to generate cross-domain data in a multi-task setting, where images from two domains are generated simultaneously by the model along with their labels.
2 Related Work
Data augmentation seems a natural answer to the scarcity of curated hand-labeled training data. However, heuristic data augmentation techniques like [15] and [19] use a limited set of class-preserving image transformations such as rotation, mirroring, addition of small noise and random crops. Interpolation-based methods proposed in [13] and the class-conditional models of diffeomorphisms proposed in [20] interpolate between nearest-neighbor labeled data points. The popular SMOTE algorithm [7] performs oversampling to reduce class imbalance and augment the given data. All of these methods depend heavily on hand-tuned parameters: the order of geometric transformations, the optimal values of transformation parameters, etc. A small change in parameters can often negatively impact final performance, as studied in [37], [10] and [15]. In this work, we choose a more intuitive way of creating labeled data by learning a joint distribution model. Learning a joint data-label distribution using generative models such as [14], [18] and [28] is non-trivial, since the label often requires domain knowledge and is not directly inferable from the data. Our proposed model hence uses distant supervision signals (in the form of labeling functions) to generate novel labeled data points. Distant supervision signals such as labeling functions are cheaper than manual annotation of each data point, and have been successfully used in recent methods such as [36]. Ratner et al. proposed a generative model in [36] that uses a fixed number of user-defined labeling functions to programmatically generate synthetic labels for data in near-constant time. DP outperformed a number of approaches such as multiple-instance learning [38], co-training [4], crowdsourcing [17] and ensemble-based weak-learner methods like boosting [40], reinforcing our choice in this work. Alfonseca et al. [1] generated additional training data using hierarchical topic models for weak supervision. Heuristics for distant supervision are also proposed in [6], but this method does not model the inherent noise associated with such heuristics.
Structure learning [43], [37] also exploits distant supervision signals for generating labels, but as described in Section 1, these methods, like [36], require unlabeled test data to generate a labeled dataset. Additionally, [36], [37] and [43] are computationally expensive due to their use of Gibbs sampling in MLE.
We instead use an adversarial approach to learn the joint distribution by weighting a set of domain-specific labeling functions using a Generative Adversarial Network (GAN). A GAN [18] approximates the real data distribution by optimizing a min-max objective function, and thus generates novel out-of-sample data points. Broadly, GANs can be viewed in terms of three manifestations: (i) GANs can be trained to sample from a marginal distribution $P(x)$ ([12], [35], [2]), where $x$ refers to data. (ii) Recent efforts such as Conditional GAN [31], Auxiliary Classifier GAN [34] and InfoGAN [9] train GANs conditioned on class labels, thus sampling from a conditional distribution $P(x|y)$. Other state-of-the-art models with similar objectives have exploited other modalities for the same purpose; for example, Zhang et al. [49] propose a GAN conditioned on images, while Hu et al. [21] propose a GAN conditioned on text. (iii) There have been a few very recent efforts ([46], [51] and [22]) which attempt to train GANs to sample from a joint distribution. For example, CoGAN [29] introduces a parameter-sharing approach to learn an unpaired joint distribution between two domains, while TripleGAN [27] brings in a classifier along with the discriminator and generator, which helps in a semi-supervised setting. In this work, we propose a novel idea: to instead use distant supervision signals to learn the joint distribution of labeled images. We now describe the proposed methodology.
3 Adversarial Data Programming (ADP): Methodology
Our central aim in this work is to learn the parameters $\theta$ of a probabilistic model:
$P_\theta(x, y)$    (1)
that captures the joint distribution over the data $x$ and the corresponding labels $y$, thus allowing us to generate out-of-sample data points along with their corresponding labels (we focus on images in the rest of this paper).
While recent efforts such as [29] and [16] have considered complementary objectives, they largely focus on learning joint probability distributions in cross-domain understanding settings. In this work, we focus on learning the joint image-label probability distribution with a view to automatically creating labeled datasets, by exploiting distant supervision signals to generate labeled data. To the best of our knowledge, this is the first such work that invokes distant supervision while learning the joint distribution $P(x, y)$, so as to generate labeled data points at scale from the learned model. Besides, automatic generation of labels for data based on training data-label pairs is non-trivial, and often does not work directly. Distant supervision provides us a mechanism to achieve this challenging goal. We encode distant supervision signals as a set of (weak) annotator-provided definitions using which unlabeled data points can be labeled. These definitions can be harvested from knowledge bases, domain heuristics, ontologies, rules of thumb, educated guesses or decisions of weak classifiers, or obtained using crowdsourcing. Many application domains have such distant supervision available through domain knowledge or heuristics, which can be leveraged in the proposed framework. We provide examples in Section 4 when we describe our experiments.
We encapsulate all available distant supervision signals, henceforth called labeling functions, in a unified abstract container called the Labeling Functions Block (LFB, see Figure 2a). Let the LFB comprise $m$ labeling functions $\{\ell_1, \ldots, \ell_m\}$, where each labeling function $\ell_i$ is a mapping:
$\ell_i : x \mapsto \tilde{y}_i$    (2)
that maps a data point $x$ to a $k$-dimensional probabilistic label vector $\tilde{y}_i \in [0, 1]^k$, where $k$ is the number of class labels and the entries of $\tilde{y}_i$ sum to 1 for each $i$. For example, $x$ could be an image from the MNIST dataset, and $\tilde{y}_i$ would be the corresponding label vector when the labeling function $\ell_i$ is applied to $x$; $\tilde{y}_i$, for instance, could be the one-hot 10-dimensional class vector (see Figure 2b). We characterize the set of labeling functions with two kinds of dependencies: (i) relative accuracies of the labeling functions with respect to the true class label of a given data point; and (ii) inter-function dependencies that capture the relationships between the labeling functions with respect to the predicted class label. To obtain a final label for a given data point using the LFB, we use two sets of parameters, $\theta_{acc}$ and $\theta_{dep}$, to capture these two kinds of dependencies between the labeling functions. We hence denote the Labeling Functions Block (LFB) as:
$\text{LFB}(\{\ell_i\}_{i=1}^m, \theta_{acc}, \theta_{dep})$    (3)
i.e., given a set of labeling functions $\{\ell_i\}_{i=1}^m$, a set of parameters $\theta_{acc}$ capturing the relative-accuracy-based dependencies between the labeling functions, and a second set of parameters $\theta_{dep}$ capturing inter-function dependencies, the LFB provides a probabilistic label vector $\tilde{y}$ for a given data input $x$.
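To make the idea of a labeling function (Equation 2) concrete, the sketch below is a toy example for an MNIST-like input. Both the edge-density feature and the per-class affinities are illustrative assumptions on our part, not functions used in the paper; the only property the paper requires is that the output is a probabilistic label vector whose entries sum to 1.

```python
import numpy as np

def edge_density_lf(image, num_classes=10):
    """Toy labeling function: scores each digit class via a crude
    pixel-density heuristic, then normalizes the scores into a
    probabilistic label vector (non-negative, sums to 1).
    `image` is a 28x28 float array in [0, 1]."""
    # Crude feature: fraction of "on" pixels (stand-in for a real heuristic).
    density = image.mean()
    # Hypothetical per-class affinities: digits drawn with more strokes
    # (e.g. 8) get higher scores for dense images. Purely illustrative.
    stroke_weight = np.array([0.4, 0.1, 0.5, 0.6, 0.4, 0.5, 0.6, 0.2, 0.9, 0.5])
    scores = np.exp(stroke_weight * density)
    return scores / scores.sum()   # probabilistic label vector

vec = edge_density_lf(np.random.rand(28, 28))
```

A real labeling function in the framework would replace the density heuristic with one of the cues listed in Tables 1 and 2 (edges, histograms, HOG features, CNN features, etc.).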
The joint distribution we seek to model in this work (Equation 1) hence becomes:
$P_\theta\big(x, \text{LFB}(\{\ell_i\}_{i=1}^m, \theta_{acc}, \theta_{dep})(x)\big)$    (4)
In the rest of this section, we show how we can learn the parameters of the above distribution modeling image-label pairs using an adversarial framework with a high degree of label fidelity. We use Generative Adversarial Networks (GANs) to model the joint distribution in Equation 4. In particular, we provide a mechanism to integrate the LFB (Equation 3) into the GAN framework, and show how $\theta_{acc}$ and $\theta_{dep}$ can be learned through the framework itself. Our adversarial loss function is given by:
$\min_G \max_D \; \mathbb{E}_{(x, y) \sim p_{data}}[\log D(x, y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$    (5)
where $G$ is the generator module and $D$ is the discriminator module. The overall architecture of the proposed ADP framework is shown in Figure 2a.
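As a numerical illustration of the min-max objective in Equation 5 (a sketch under our own naming, assuming the discriminator outputs probabilities in (0, 1)):

```python
import numpy as np

def gan_objective(d_real, d_fake):
    """Minibatch estimate of the Eq. (5)-style value
    E[log D(x, y)] + E[log(1 - D(G(z)))].
    d_real: D's scores on real image-label pairs, values in (0, 1).
    d_fake: D's scores on generated pairs, values in (0, 1)."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# The discriminator takes gradient steps to increase this value,
# while the generator takes steps to decrease it (the min-max game).
v_confused = gan_objective(np.array([0.5]), np.array([0.5]))
v_confident = gan_objective(np.array([0.9]), np.array([0.1]))
```

A discriminator that cannot distinguish real from fake pairs (scores near 0.5) yields a lower objective value than a confident one, which is what drives the adversarial training signal.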
This approach has a few advantages: (i) labeling functions (which can even be loosely defined) are cheaper to obtain than labels for a large dataset; (ii) labeling functions can bring domain knowledge into such generative models; (iii) labeling functions act as an implicit regularizer in the label space, thus allowing good generalization; (iv) with a small amount of fine-tuning, labeling functions can be easily repurposed for new domains (transfer learning), as we describe later in this paper.
The ADP architecture is designed to learn the parameters required to model the joint distribution in Equation 4, and thus generate out-of-sample image-label pairs. The architecture is broadly divided into three modules: the generator, the discriminator and the LFB. We now describe each of these modules individually.
3.1 The ADP Generator
Given a noise input $z$ and a set of labeling functions, the generator $G$ outputs an image $x$ and the parameters $\theta_{acc}$ and $\theta_{dep}$, the dependencies between the labeling functions described earlier. In particular, $G$ consists of three blocks, as shown in Figure 2a: a shared block of fully connected (FC) layers that captures the common high-level semantic relationships between the data and the label space, whose output forks into two branches: an image branch that generates the image $x$ using fully convolutional (FCONV) layers, and a parameter branch that generates the parameters $\theta_{acc}$ and $\theta_{dep}$ using FC layers (more details in Section 4). Thus, the generator outputs $(x, \theta_{acc}, \theta_{dep})$ given an input $z$ drawn from the standard normal distribution.
3.2 The ADP Discriminator
The discriminator $D$ of ADP estimates the likelihood of an image-label input pair being drawn from the real distribution given by the training data. $D$ takes a batch of image-label pairs as input and maps each pair to a probability score estimating this likelihood. To accomplish this, $D$ has two branches, one for the image and one for the label (shown in the Discriminator block in Figure 2a). The two branches are not coupled in the initial layers, so as to separately extract the required low-level features; they share weights in later layers to extract joint semantic features that help classify whether an image-label pair is fake or real.
We hence expand our objective function from Equation 5 to the following:
$\min_G \max_D \; \mathbb{E}_{(x, y) \sim p_{data}}[\log D(x, y)] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D\big(G_x(z), \text{LFB}(\{\ell_i\}_{i=1}^m, \theta_{acc}, \theta_{dep})(G_x(z))\big)\big)\big]$    (6)
where $G_x(z)$ denotes the image output of the generator.
3.3 The ADP Labeling Function Block
This is a critical module of the proposed ADP framework. Our initial work revealed that a simple weighted (linear or non-linear) sum of the labeling functions does not perform well in generating out-of-sample image-label pairs. We hence use a separate adversarial methodology within this block to learn the dependencies between the labeling functions provided to the framework, both relative accuracies and inter-function dependencies (discussed earlier in this section). We describe the components of the LFB below.
3.3.1 Relative Accuracies of Labeling Functions
The output $\theta_{acc}$ of the parameter branch of the ADP generator provides the relative accuracies of the labeling functions. Given the image $x$ generated by $G$, the labeling functions $\{\ell_i\}_{i=1}^m$, and the probabilistic label vectors $\tilde{y}_1, \ldots, \tilde{y}_m$ obtained by applying the labeling functions (as in Eqn 2), we define the aggregated final label as:
$\tilde{y} = \sum_{i=1}^m \hat{\theta}_{acc}^i \, \tilde{y}_i$    (7)
where $\hat{\theta}_{acc}$ is the normalized version of $\theta_{acc}$, i.e. $\hat{\theta}_{acc}^i = \theta_{acc}^i / \sum_{j=1}^m \theta_{acc}^j$. The aggregated label $\tilde{y}$ is provided as the output of the LFB.
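The aggregation step is a convex combination of the per-function label vectors, weighted by the normalized relative accuracies. A minimal numpy sketch (variable names are ours):

```python
import numpy as np

def aggregate_labels(label_vectors, theta_acc):
    """Aggregate per-function probabilistic label vectors into a final
    label: normalize the relative-accuracy parameters so they sum to 1,
    then take the weighted sum of the label vectors.
    label_vectors: (m, k) array, one k-dim label vector per function.
    theta_acc: (m,) array of non-negative relative accuracies."""
    weights = theta_acc / theta_acc.sum()   # normalized relative accuracies
    return weights @ label_vectors          # convex combination; still sums to 1

# Three labeling functions voting over two classes; the first function
# is trusted twice as much as each of the others.
label_vectors = np.array([[0.7, 0.3],
                          [0.2, 0.8],
                          [0.6, 0.4]])
aggregated = aggregate_labels(label_vectors, np.array([2.0, 1.0, 1.0]))
```

Because each input vector sums to 1 and the weights sum to 1, the aggregated output is itself a valid probabilistic label vector.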
3.3.2 Inter-function Dependencies
Our preliminary empirical studies showed that considering only relative accuracies of labeling functions as a weighting mechanism led to mode collapse in the joint distribution space, a well-understood problem in GANs: either images of the same class were generated with different labels, or images of different classes with the same label. We also conducted experiments on synthetic data to demonstrate this issue (see the synthetic-data figure below). The rationale behind using two discriminators is to penalize the missing modes; related literature [36] shows that inter-function dependencies act as an implicit regularizer in the label space. We hence introduce an adversarial mechanism inside the LFB that uses the inter-function dependencies between the labeling functions to influence the final relative accuracies. A discriminator inside the LFB receives two inputs: $\theta_{dep}$, which is output by the generator, and $\theta'_{dep}$, which is computed from the labeling function outputs using the procedure described in Algorithm 1.
Algorithm 1 computes a matrix $\theta'_{dep}$ of inter-dependencies between the labeling functions by looking at the one-hot encodings of their predicted label vectors. If the one-hot encodings of two labeling functions match for a given data input, we increase the count of their correlation by one, and we accumulate these counts across the minibatch of data points under consideration. The counts are then normalized row-wise to obtain $\theta'_{dep}$. The task of the discriminator inside the LFB, $D_{LFB}$, is to recognize the computed inter-dependencies $\theta'_{dep}$ as real, and the $\theta_{dep}$ generated through the network as fake. The gradient backpropagated through this discriminator to the generator's parameter branch is critical as a regularizer in learning a better $\theta_{acc}$, which is finally used to weight the labeling functions (as in Section 3.3.1). Combining the gradient information from $D_{LFB}$ with that from $D$ penalizes missing modes and helps generate more variety in the samples. The objective function of our second adversarial module is hence:
$\min_G \max_{D_{LFB}} \; \mathbb{E}[\log D_{LFB}(\theta'_{dep})] + \mathbb{E}[\log(1 - D_{LFB}(\theta_{dep}))]$    (8)
where $\theta_{dep}$ and $\theta'_{dep}$ are obtained as described above. More details of the LFB are provided in the implementation details in Section 4.
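The inter-dependency computation described for Algorithm 1 can be sketched as follows; array shapes and names are our own assumptions, and only the count-agreements-then-row-normalize logic is taken from the text:

```python
import numpy as np

def interdependency_matrix(label_vectors):
    """Count, over a minibatch, how often each pair of labeling functions
    agrees on the one-hot (argmax) prediction; row-normalize the counts.
    label_vectors: (batch, m, k) array of per-function label vectors.
    Returns an (m, m) row-stochastic matrix of inter-dependencies."""
    preds = label_vectors.argmax(axis=2)               # (batch, m) class indices
    m = label_vectors.shape[1]
    counts = np.zeros((m, m))
    for p in preds:                                    # one minibatch element
        counts += (p[:, None] == p[None, :])           # +1 for each agreeing pair
    return counts / counts.sum(axis=1, keepdims=True)  # row-wise normalization
```

The diagonal is always counted (a function agrees with itself), so every row sum is positive and the normalization is well defined.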
The overall architecture of ADP (Figure 2a) is trained using end-to-end backpropagation, with gradients from both discriminators, $D$ and $D_{LFB}$, influencing the weights learned inside the generator $G$. Minibatches of image-label pairs from a given training distribution are provided as input to ADP, and Stochastic Gradient Descent (SGD) is used to learn the parameters of the model. At the end of training, the aggregated final label for a generated image is given, as before, by:
$\tilde{y} = \sum_{i=1}^m \hat{\theta}_{acc}^i \, \tilde{y}_i$    (9)
The samples generated using the generator and LFB modules thus provide samples from the desired joint distribution (Eqn 1) modeled by the framework.
4 Experiments and Results
4.1 Datasets
We validated the ADP framework on standard datasets: MNIST [26], Fashion-MNIST [45], SVHN [33], and CIFAR-10 [23]. No additional preprocessing is performed on the MNIST, Fashion-MNIST and CIFAR-10 datasets. For the SVHN dataset, we used the 'Format 2 Cropped' version, and included an additional crop on each image to reduce the presence of more than one digit, while the image dimensions are maintained. (Code is available at https://github.com/ArghyaPal/AdversarialDataProgramming.)
4.2 Labeling Functions
Labeling functions form a critical element of ADP, and we used different cues from state-of-the-art algorithms to help obtain labeling functions for our experiments. Table 1 shows the labeling functions we used for our experiments on MNIST and SVHN (digit recognition problems), and Table 2 shows the functions used for CIFAR-10 and Fashion-MNIST. We categorize labeling functions as: (i) heuristic; (ii) image-processing-based; and (iii) deep-learning-based labeling functions (as in Tables 1 and 2). Table 3 presents the number of labeling functions used for each of the considered datasets (the empirical study that motivated these choices is presented in Section 5). In this work, each labeling function uses a simple threshold rule on the norm of the aforementioned features, where the threshold is obtained empirically as the mean of the norms over a randomly chosen subset, trimmed to remove outliers. More examples of labeling functions and ablation studies on their usefulness are presented in the Supplementary Section.
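The threshold rule described above can be sketched as follows. The trim fraction, the abstain behavior (a uniform vector), and the class a heuristic votes for are illustrative assumptions on our part; the paper specifies only the trimmed-mean threshold on feature norms:

```python
import numpy as np

def trimmed_mean(values, trim_frac=0.1):
    """Mean after discarding the lowest and highest trim_frac of values,
    to remove outliers (the trim fraction is an assumed setting)."""
    v = np.sort(values)
    k = int(len(v) * trim_frac)
    return v[k:len(v) - k].mean() if len(v) > 2 * k else v.mean()

def make_threshold_lf(feature_fn, calibration_images, num_classes, target_class):
    """Build a labeling function of the kind described: it fires when the
    norm of the feature vector exceeds a threshold set to the trimmed mean
    of feature norms over a randomly chosen calibration subset."""
    norms = np.array([np.linalg.norm(feature_fn(im)) for im in calibration_images])
    threshold = trimmed_mean(norms)

    def lf(image):
        if np.linalg.norm(feature_fn(image)) > threshold:
            vec = np.zeros(num_classes)
            vec[target_class] = 1.0            # confident vote for the target class
            return vec
        return np.full(num_classes, 1.0 / num_classes)  # abstain: uniform vector
    return lf
```

Here `feature_fn` would be one of the cues from Tables 1 and 2 (edges, histograms, HOG, CNN features, etc.); the identity function works for a quick sanity check.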
Type  Labeling Functions used 

Heuristic  Presence of long edges (vertical or horizontal) [30]; Image histogram 
Image Processing based  Bag-of-features [39]; Haar wavelet [8]; Discrete-continuous ADM [25]; Compressive sensing [50] 
Deep Learning based  Convolution kernels from the last conv layer (before the fully connected layers) of LeNet 
Type  Labeling Functions used 

Heuristic  PatchMatch [3]; Blob detection; Presence of edges [1]; Textons; Image histogram 
Image Processing based  Global descriptor (GIST-based) [32]; Local descriptor (SIFT-based) [48]; Bag-of-visual-words; Histogram of Oriented Gradients (HOG)-based: HOGgles [44] 
Deep Learning based  Convolution kernels from the last conv layer (before the fully connected layers) of (ImageNet) pretrained AlexNet 
Dataset  Heuristic  Image Processing  Deep Learning  

MNIST  43  10  1 
FashionMNIST  50  6  1 
SVHN  43  10  2 
CIFAR 10  46  18  2 
4.3 Implementation Details
The shared generator block has 3 dense fully connected (FC) layers (128 nodes per layer) with batch normalization. The image branch continues with fractionally-strided convolutional layers, similar to [29] (FCONV: 1024 nodes per layer, stride 1, followed by batch normalization and parameterized ReLU), and generates the image x. The parameter branch uses FC layers and generates the parameters. The discriminator network follows the in-plane rotation network of [29]; its label branch is a stack of FC layers, and both discriminators have 2 FCONV layers followed by FC layers. We trained the complete model with minibatch Stochastic Gradient Descent (SGD) with a batch size of 128, a learning rate of 0.0001, a momentum factor of 0.5 and Adam as the optimizer.
4.4 Comparison with State-of-the-Art Models
Qualitative Results:
We compared our method against other generative methods that allow generation of data along with a label: Conditional GAN or CGAN [31], ACGAN [34], InfoGAN [9], CoGAN [29] and TripleGAN [27]. (We adapted the use-case setup of these methods to generate data-label pairs as required; for example, for a conditional GAN, we specified a class label, generated a corresponding image and used this as an image-label pair.) We used the publicly available code for each of the above methods, and the results for CIFAR-10 are shown in Figure 3 (results for other datasets are shown in the Supplementary Section). The figure shows that the proposed model generates images with very good clarity. Moreover, while some of the aforementioned methods (such as CGAN and InfoGAN) generate images conditioned on a given label (and hence require a label as input), in our case the label is provided by the model itself.
Quantitative Results:
We considered three evaluation metrics for studying the performance of our method quantitatively. (i) Human Turing test (HTT): this metric studies how hard it is for a human annotator to tell the difference between real and generated samples. We asked 40 subjects to evaluate image quality and image-label correspondence (Table 5) on a scale of 10, given 50 random image-label samples from the generated pool of each method considered. Table 5 shows consistently good performance of ADP over the other methods, especially on image-label correspondence, which is the focus of this work. (ii) Inception score: the inception score, as used in [29] and [27], on the CIFAR-10 dataset is shown in Figure 5. The figure shows that ADP and TripleGAN perform significantly better than the other methods (more results on other datasets are included in the Supplementary Section). (iii) We also used a Parzen-window-based evaluation metric; these results are included in the Supplementary Section. Additionally, we trained a ResNet-56 model on the CIFAR-10 dataset under different settings, with the results shown in Table 4. The addition of labeled data generated using our method significantly lowers the test cross-entropy loss across the epochs.
(Training Data, Test Data)  Epochs  

5k  10k  15k  20k  30k  40k  50k  
(Real data50K, Real data10K)  9.83  7.3  7.12  6.3  6.1  4.3  4.19 
(ADP data50K, Real data10K)  9.32  8.9  8.13  7.0  6.75  5.53  5.0 
(Real data50K, ADP data10K)  9.67  9.4  7.92  7.3  6.81  6.18  5.6 
(ADP25K + Real data25K, Real data10K)  8.5  6.6  6.21  5.7  5.5  4.83  3.5 
(ADP50K + Real data50K, Real data10K)  7.71  6.3  6.0  5.34  3.1  2.92  2.71 
Dataset  Image Quality  ImageLabel Correspondence  

ACGAN  CGAN  InfoGAN  TripleGAN  ADP  ACGAN  CGAN  InfoGAN  TripleGAN  ADP  
MNIST  
FMNIST  
SVHN  
CIFAR10 
Classification Performance:
To study the usefulness of the generated image-label pairs, we studied the classification cross-entropy loss of a pretrained ResNet model on the image-label pairs generated by ADP at test time. A lower cross-entropy loss at test time indicates the efficacy of our model as a data augmentation method. We compared our method against TripleGAN, InfoGAN, CoGAN, as well as the popular oversampling technique SMOTE [7]. Figure 4 shows the results: the proposed model has significantly lower cross-entropy loss than the other methods, highlighting its usefulness.
5 Discussion and Analysis
Optimal Number of Labeling Functions:
We studied the performance of ADP when the number of labeling functions is varied, to understand the impact of this parameter. We measured the test cross-entropy error of a pretrained ResNet model on image-label pairs generated by ADP trained with different numbers of labeling functions. Table 6 shows our results, suggesting that performance largely saturates at around 50-55 labeling functions, depending on the dataset. This justifies our choice of the number of labeling functions in Table 3.
No. of Labeling Functions  MNIST  FMNIST  CIFAR10  SVHN 

3  70.23%  81.02%  87.39%  83.82% 
10  47.92%  71.52%  60.11%  75.91% 
25  20.32%  30.53%  42.31%  38.30% 
30  4.56%  12.47%  21.19%  26.66% 
40  1.40%  6.81%  19.93%  16.62% 
50  1.33%  4.92%  18.93%  13.05% 
55  1.34%  4.80%  18.45%  12.83% 
65  1.31%  4.73%  18.43%  12.82% 
70  1.25%  4.75%  18.40%  12.80% 
Transfer Learning:
The use of distant supervision signals such as labeling functions (which can often be generic) allows us to extend the proposed ADP model to a transfer learning setting. In this setup, we train ADP on a source dataset and then fine-tune the model on a target dataset with very limited training. In particular, we first trained ADP on the MNIST dataset, and subsequently fine-tuned only the image branch of the generator with the SVHN dataset; the weights of the remaining blocks are unaltered. The final fine-tuned model is then used to generate image-label pairs (which we hypothesize will look similar to SVHN). Figure 1 shows encouraging results of our experiments in this regard.
Multitask Joint Distribution Learning:
Learning a cross-domain joint distribution over heterogeneous domains is a challenging task. We show that the proposed ADP method can achieve this by modifying its architecture, as shown in Figure 7, to simultaneously generate data from two different domains. We studied this architecture on the MNIST and SVHN datasets, and show the promising results of our experiments in Figure 8. The LFB acts as a regularizer and maintains the correlations between the domains in this case. More results on other datasets, in particular LookBook and Fashion-MNIST, are included in the Supplementary Section as well as Figure 1.
6 Conclusions
Paucity of large curated hand-labeled training data for every domain of interest forms a major bottleneck in deploying machine learning methods in practice. Standard data augmentation techniques and other heuristics are often limited in their scope and require carefully picked, hand-tuned parameters. We instead propose a new adversarial framework called Adversarial Data Programming (ADP), which can learn the joint data-label distribution effectively using a set of weakly defined labeling functions. The method shows promise on standard datasets, as well as in settings such as transfer learning and multi-task learning. Our future work will involve understanding the theoretical implications of this framework from a game-theoretic perspective, as well as exploring the performance of the method on more complex datasets.
References
 [1] E. Alfonseca, K. Filippova, J.Y. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short PapersVolume 2, pages 54–59. Association for Computational Linguistics, 2012.
 [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 [3] C. Barnes, D. B. Goldman, E. Shechtman, and A. Finkelstein. The patchmatch randomized matching algorithm for image manipulation. Communications of the ACM, 54(11):103–110, 2011.
 [4] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.
 [5] O. Breuleux, Y. Bengio, and P. Vincent. Unlearning for better mixing.
 [6] R. Bunescu and R. Mooney. Learning to extract relations from the web using minimal supervision. In ACL, 2007.
 [7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
 [8] X. Chen, X. Cheng, and S. Mallat. Unsupervised deep Haar scattering on graphs. In Advances in Neural Information Processing Systems, pages 1709–1717, 2014.
 [9] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
 [10] D. Ciresan, U. Meier, L. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. arXiv preprint arXiv:1003.0358, 2010.
 [11] I. Danihelka, B. Lakshminarayanan, B. Uria, D. Wierstra, and P. Dayan. Comparison of maximum likelihood and ganbased training of real nvps. arXiv preprint arXiv:1705.05263, 2017.
 [12] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
 [13] T. DeVries and G. W. Taylor. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.
 [14] C. Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
 [15] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1734–1747, 2016.
 [16] Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. arXiv preprint arXiv:1709.06548, 2017.
 [17] H. Gao, G. Barbier, and R. Goolsby. Harnessing the crowdsourcing power of social media for disaster relief. IEEE Intelligent Systems, 26(3):10–14, 2011.
 [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [19] B. Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
 [20] S. Hauberg, O. Freifeld, A. B. L. Larsen, J. Fisher, and L. Hansen. Dreaming more data: Classdependent distributions over diffeomorphisms for learned data augmentation. In Artificial Intelligence and Statistics, pages 342–350, 2016.
 [21] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Controllable text generation. arXiv preprint arXiv:1703.00955, 2017.
 [22] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
 [23] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 [24] V. V. Kumar, A. Srikrishna, B. R. Babu, and M. R. Mani. Classification and recognition of handwritten digits by using mathematical morphology. Sadhana, 35(4):419–426, 2010.
 [25] E. Laude, J.-H. Lange, F. Schmidt, B. Andres, and D. Cremers. Discrete-continuous splitting for weakly supervised learning. arXiv preprint arXiv:1705.05020, 2017.
 [26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [27] C. Li, K. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. arXiv preprint arXiv:1703.02291, 2017.

[28] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1718–1727, 2015.
[29] M. Liu, O. Tuzel, and A. Sullivan. Coupled generative adversarial nets. 2016.
 [30] S. Mandal, S. Sur, A. Dan, and P. Bhowmick. Handwritten bangla character recognition in machine-printed forms using gradient information and haar wavelet. In Image Information Processing (ICIIP), 2011 International Conference on, pages 1–6. IEEE, 2011.
 [31] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [32] H. Moudni, M. Errouidi, M. Oujaoura, and O. Bencharef. Recognition of amazigh characters using SURF & GIST descriptors. In International Journal of Advanced Computer Science and Application. Special Issue on Selected Papers from Third international symposium on Automatic Amazigh processing, pages 41–44, 2013.
 [33] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [34] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585, 2016.
 [35] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [36] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pages 3567–3575, 2016.
 [37] A. J. Ratner, H. R. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré. Learning to compose domain-specific transformations for data augmentation. arXiv preprint arXiv:1709.01643, 2017.
 [38] S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. Machine learning and knowledge discovery in databases, pages 148–163, 2010.
 [39] L. Rothacker, S. Vajda, and G. A. Fink. Bag-of-features representations for offline handwriting recognition applied to arabic script. In Frontiers in Handwriting Recognition (ICFHR), 2012 International Conference on, pages 149–154. IEEE, 2012.
 [40] R. E. Schapire and Y. Freund. Boosting: Foundations and algorithms. MIT press, 2012.
 [41] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.
 [42] L. Theis, A. v. d. Oord, and M. Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
 [43] P. Varma, B. He, P. Bajaj, I. Banerjee, N. Khandwala, D. L. Rubin, and C. Ré. Inferring generative model structure with static analysis. arXiv preprint arXiv:1709.02477, 2017.
 [44] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. HOGgles: Visualizing Object Detection Features. ICCV, 2013.
 [45] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
 [46] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510, 2017.
 [47] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, pages 517–532. Springer, 2016.

[48] K. Yu, Y. Lin, and J. Lafferty. Learning image representations from the pixel level via hierarchical sparse coding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1713–1720. IEEE, 2011.
[49] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.
 [50] M. Zhenjiang and Y. Baozong. Handwritten character recognition by extended loop neural networks. In Speech, Image Processing and Neural Networks, 1994. Proceedings, ISSIPNN’94., 1994 International Symposium on, pages 460–463. IEEE, 1994.
 [51] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
Appendix A Algorithm
Algorithm LABEL:alg:main presents the overall step-wise routine of the proposed method, ADP, as described in Section 3. During the training phase, the algorithm updates the weights of the model by estimating gradients for a batch of labeled data points. The hyperparameters that need to be provided include the standard parameters used when training a GAN: (i) the number of iterations of Algorithm LABEL:alg:main; (ii) a parameter (similar to [18]) that controls how many times the discriminator components are updated with respect to the generator; and (iii) the mini-batch size. These hyperparameter values, including the number of iterations, were chosen through empirical studies.

Appendix B Datasets
In this section, we provide more information on the datasets used for validating ADP in this work: MNIST, Fashion MNIST, SVHN and CIFAR 10. The MNIST dataset comprises grayscale images (one handwritten digit per image) along with the corresponding label, with 50,000 training samples (image-label pairs). For SVHN, we used format 2 of the dataset, which comprises 73,257 images (each containing a digit captured from street views of house numbers) with the corresponding labels. For CIFAR 10, we merged the five training batches of the dataset to build a training set of 50,000 images. This dataset contains 32×32 RGB images spanning 10 classes: automobile, airplane, bird, cat, deer, dog, frog, horse, ship and truck. The samples are almost equally distributed across the classes. Fashion MNIST, similar to MNIST, consists of a training set of 50,000 grayscale images, each belonging to one of 10 classes: T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag and Ankle boot.
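The CIFAR 10 merging step described above can be sketched as follows. This is a minimal illustration assuming the standard python-version batch files distributed with the dataset; the helper names (`load_batch`, `merge_cifar_batches`) are hypothetical and not the names used in our code:

```python
import pickle
import numpy as np

def load_batch(path):
    """Load one CIFAR-10 python-version batch file (e.g. 'data_batch_1')."""
    with open(path, "rb") as f:
        return pickle.load(f, encoding="latin1")

def merge_cifar_batches(batches):
    """Concatenate CIFAR-10 batches (dicts with 'data': (N, 3072) uint8 rows
    and 'labels') into one (N, 32, 32, 3) image array and an (N,) label array."""
    data = np.concatenate([b["data"] for b in batches], axis=0)
    labels = np.concatenate([np.asarray(b["labels"]) for b in batches])
    # Each row is a flat 3072-vector in channel-major (CHW) order.
    images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, labels
```

Calling `merge_cifar_batches` on the five loaded training batches yields the 50,000-image training set used in this work.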
We also used the LookBook [47] dataset (Figure 9) to demonstrate cross-domain multitask learning using ADP, as described in Section 5. This dataset contains 84,748 images across 17 classes: midi dress, mini dress, coat, jacket, fur jacket, padded jacket, hooded jacket, jumper, cardigan, knitwear, blouse, shirt, sleeveless tee, short sleeve tee, long sleeve tee, hoody and vest. In this work, we grouped these 17 classes into 4 classes: coat, pullover, t-shirt and dress, in order to match the Fashion MNIST dataset and thus help study cross-domain learning. We grouped coat, jacket, fur jacket, padded jacket, hooded jacket, jumper and cardigan into a single coat class; hoody into the pullover class; sleeveless tee and short sleeve tee into the t-shirt class; and cardigan, knitwear, blouse, midi dress and mini dress into the dress class. The Fashion MNIST dataset has the same classes (coat, pullover, t-shirt, dress) among its labels, thus facilitating our study. No additional preprocessing was performed on the MNIST, Fashion MNIST, CIFAR 10 or LookBook datasets. For SVHN, an additional crop was performed on each image to ensure that only one digit is present in the image; the cropped image was subsequently resampled to maintain the original size. Figure 9 shows illustrative examples of images from the chosen datasets.
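The SVHN crop-and-resample step described above can be sketched as follows. This is an illustrative nearest-neighbour version assuming digit bounding boxes are available as (x0, y0, x1, y1) tuples; it is a sketch, not the exact routine used in our pipeline:

```python
import numpy as np

def crop_to_digit(img, box, out_size=32):
    """Crop img (H, W, 3) to the digit bounding box (x0, y0, x1, y1), then
    resample the patch back to out_size x out_size by nearest-neighbour
    index selection so only one digit remains in a fixed-size image."""
    x0, y0, x1, y1 = box
    patch = img[y0:y1, x0:x1]
    h, w = patch.shape[:2]
    # Map output pixel coordinates back to source rows/columns.
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return patch[rows][:, cols]
```

A real pipeline would typically use a proper interpolating resize (e.g. bilinear), but the shape bookkeeping is the same.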
Appendix C More on Labeling Functions
The Labeling Functions Block (LFB) in Figure 2a (Section 3) is implemented using the open-source framework Snorkel [36]. We modified the underlying architecture of the Snorkel framework, which otherwise estimates dependencies using MLE via Gibbs sampling, to include an adversarial approach. Three kinds of labeling functions have been used in this work, as described in Section 4: heuristic, image processing-based, and deep learned feature-based. Examples of labeling functions used in this work are shown as Labeling Functions 2, 3, 4 and 5. Each labeling function applies a simple threshold rule on the norm of the aforementioned features. For each class of a dataset, the threshold is obtained empirically as the average norm of the feature vectors of 20 random samples of that class (with trimming to remove outliers). It is worth mentioning that, to give an abstract understanding of how our labeling functions work, the return values of the example Labeling Functions 4 and 5 are one-hot encodings; in practice, however, we fit a nonlinear function to obtain a probabilistic output.
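As an illustration of the threshold rule described above, the following sketch shows how a norm-threshold labeling function with a one-hot vote might look. All names here are hypothetical (the paper's actual labeling functions are given as Labeling Functions 2–5), and the one-hot return value corresponds to the simplified form; the probabilistic variant would replace the hard threshold with a fitted nonlinearity:

```python
import numpy as np

NUM_CLASSES = 10  # e.g. CIFAR-10

def estimate_threshold(feature_vectors, trim=0.1):
    """Empirical threshold for one class: the trimmed mean of the feature
    norms of ~20 random samples of that class (trimming removes outliers)."""
    norms = np.sort(np.linalg.norm(feature_vectors, axis=1))
    k = int(len(norms) * trim)
    kept = norms[k:len(norms) - k] if k > 0 else norms
    return kept.mean()

def make_labeling_function(cls, threshold, extract):
    """Build a labeling function that casts a one-hot vote for `cls` when
    the feature norm of an image falls under the class threshold, and
    abstains (all-zero vote) otherwise."""
    def lf(image):
        vote = np.zeros(NUM_CLASSES)
        if np.linalg.norm(extract(image)) <= threshold:
            vote[cls] = 1.0
        return vote
    return lf
```

Here `extract` stands in for any of the heuristic, image processing-based or deep learned feature extractors used in the paper.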
C.1 Ablation Studies with Labeling Functions
In order to understand the effect of different kinds of labeling functions, we performed an ablation study on the CIFAR-10 dataset (since it is the most natural of the considered datasets, and allows us to compute an Inception score to quantitatively compare the performance of various methods). In this study, we did not alter any of the hyperparameters described in Section 4. Our ablation study considers the following models:

M1: ADP: Full model
M2: ADP with no dependencies: Same model as ADP with 55 labeling functions, as in Table 3, but with each labeling function considered independent of the others.
M3: ADP with only heuristic labeling functions: Same model as ADP with 36 heuristic labeling functions, but without any image processing-based or deep learned feature-based labeling functions.
M4: ADP with only image processing labeling functions: Same model as ADP with only 17 image processing-based labeling functions, but without heuristic or deep learned feature-based labeling functions.
M5: ADP with only deep learned feature-based labeling functions: Same model as ADP with only 2 deep learned feature-based labeling functions, but without heuristic or image processing-based labeling functions.
M6: ADP with (heuristic + deep learned feature-based) labeling functions
M7: ADP with (deep learned feature-based + image processing-based) labeling functions
M8: ADP with (image processing-based + heuristic) labeling functions
Model:            M1    M2    M3    M4    M5    M6    M7    M8
Inception score:  8.7   4.32  5.52  4.91  4.73  7.01  7.52  7.27
The Inception scores for the aforementioned 8 models are presented in Table 7. The base ADP model, comprising all labeling functions, outperforms all other models, highlighting the usefulness of a variety of labeling functions in modeling the non-trivial distribution.
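For reference, the Inception score reported in Table 7 is computed as exp(E_x KL(p(y|x) || p(y))) over the class posteriors of a pretrained Inception network. A minimal sketch, assuming a (N, C) matrix of softmax outputs is already available:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from a (N, C) matrix of classifier posteriors p(y|x):
    exp of the mean KL divergence between each conditional p(y|x) and the
    marginal label distribution p(y)."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y) over the batch
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

A model whose conditionals are all uniform scores 1, while confident, class-balanced predictions push the score toward the number of classes.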
Appendix D More Qualitative Results
In addition to the results on CIFAR 10 presented in Section 4.4, we also studied the performance of our ADP method against other generative methods (CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN) on the MNIST, SVHN and Fashion MNIST datasets. As with the CIFAR 10 generation experiments, we adapted the setup of the other methods to generate labeled images, using the publicly available code for each method. Figures 10, 11 and 12 present these results.
MNIST:
Figure LABEL:fig:mnist_all shows that both our method ADP and TripleGAN generate good quality images on the MNIST dataset, and both give high image-to-label correspondence. Surprisingly, state-of-the-art methods such as CGAN, ACGAN, InfoGAN and CoGAN fail to capture image-to-label correspondence despite generating good quality images.
SVHN:
As shown in Figure LABEL:fig:svhn_all, our method generates human-recognizable images with good image-to-label correspondence in just 1k epochs on the relatively harder SVHN dataset. At higher epochs, CoGAN (epoch = 30) and TripleGAN (epoch = 40) also generate images of good quality, but broadly fail to capture the different styles, backgrounds and illuminations of the generated digits.
Dataset   GAN  CGAN  ACGAN  InfoGAN  CoGAN  ADP  TripleGAN
MNIST     198   201    204      225    278  344        321
FMNIST    213   206    234      276    254  292        312
SVHN       87   145    178      158    123  246        223
Fashion MNIST:
Most of the considered methods do well on this dataset; on close visual inspection, ADP and TripleGAN provide the sharpest results.
Appendix E More Quantitative Results
Parzen Window Based Evaluation:
In addition to the results with Inception score and HTT presented in Section 4.4, we compared our method against the other generative models (described in Section 4) based on the Parzen window score at test time. The Parzen window [5] is a commonly used non-parametric density estimation method for evaluating generative models (especially GANs [18]) whose exact likelihood is not tractable. Based on samples generated by the model, we use a Parzen window with a Gaussian kernel as a density estimator; this provides a proxy for the true log-likelihood and thereby allows us to evaluate the test log-likelihood. These results are shown in Table 8. The table shows that ADP performs significantly well on the MNIST (score 344) and SVHN (score 246) datasets, outperforming other state-of-the-art models including TripleGAN. On Fashion MNIST, our method is a close second behind TripleGAN. We chose the Parzen window size using cross-validation, as described in [18].
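The Parzen window estimator described above can be sketched as follows: a log-sum-exp over isotropic Gaussian kernels centred on the generated samples, with the bandwidth sigma assumed to be chosen by cross-validation on a held-out split as in [18]. This is a minimal illustration, not the exact evaluation code used:

```python
import numpy as np

def parzen_log_likelihood(samples, test_points, sigma):
    """Average test log-likelihood under a Parzen window estimator with an
    isotropic Gaussian kernel of bandwidth sigma centred on each generated
    sample -- a proxy for the model's intractable likelihood."""
    n, d = samples.shape
    # Pairwise squared distances between test points and generated samples.
    diffs = test_points[:, None, :] - samples[None, :, :]
    sq = (diffs ** 2).sum(axis=-1)
    log_kernels = -sq / (2 * sigma ** 2)
    # log p(x) = logsumexp_i log N(x; s_i, sigma^2 I) - log n
    m = log_kernels.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(log_kernels - m).sum(axis=1))
    const = -0.5 * d * np.log(2 * np.pi * sigma ** 2) - np.log(n)
    return float((lse + const).mean())
```

The log-sum-exp shift by the row maximum keeps the computation numerically stable for high-dimensional images, where raw kernel values underflow.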
Appendix F More Results on Multitask Joint Distribution Learning
Continuing the results presented in Sections 5 and 1, we present further evidence of the capability of ADP to perform multitask joint distribution learning in Figure 15. The figure shows that ADP is able to generate samples from two different domains, including samples of different colors.
Appendix G Comparison against Vote Aggregation Methods
Comparison against Majority Voting and DP:
To study the usefulness of learning relative accuracies and inter-function dependencies using ADP, we compared the performance of our method with both majority voting and Data Programming (DP, [36]). Majority voting does not estimate the relative accuracies and inter-function dependencies of the labeling functions described in Section 3; instead, for a given image, each labeling function makes a probabilistic prediction, and we take a majority vote to obtain the final label. As in Section 4.4, we studied the test-time classification cross-entropy loss of a pretrained ResNet model on image-label pairs generated by ADP, and by ADP's ImageGAN component combined with majority voting and with DP. The results are presented in Figure 13a, which shows that ADP has significantly lower cross-entropy loss than the other two methods, corroborating its effectiveness.
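The majority-voting baseline can be sketched as follows. This is a minimal illustration in which each labeling function is assumed to return a (C,)-dimensional probabilistic vote, as in Appendix C; no relative accuracies or inter-function dependencies are modelled:

```python
import numpy as np

def majority_vote(lf_outputs):
    """Aggregate probabilistic labeling-function predictions for one image
    by summing the (num_lfs, C) vote vectors and returning the argmax class.
    Every labeling function is weighted equally, regardless of accuracy."""
    votes = np.sum(lf_outputs, axis=0)
    return int(np.argmax(votes))
```

Because all functions count equally, a few confidently wrong labeling functions can outvote an accurate one, which is precisely the failure mode that learning relative accuracies in ADP is meant to avoid.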
Adversarial Data Programming vs MLEbased Data Programming:
To further quantify the benefits of ADP, we also show how our method compares against Data Programming (DP) [36] using different variants of MLE: MLE, maximum pseudo-likelihood, and Hamiltonian Monte Carlo. We note that DP only aggregates labels; we hence combined a vanilla GAN with DP as separate components to conduct this study. We started with a small number of labeling functions (3–5 functions) and progressively added more, noting the time taken by each of the aforementioned parameter estimation methods. Figure 13b presents the results and shows that ADP is almost 100x faster than MLE-based estimation. Figure 14 also shows sample images generated by the vanilla GAN, along with the corresponding labels assigned by MLE-based DP using the same labeling functions as used in our work. The labels are clearly incorrect, supporting the value of learning a joint distribution as proposed, rather than combining two individual components.