I Introduction
Over the last few years, a growing number of neural network configurations and frameworks have proposed the use of multimodal datasets, or external evidence, as a way to increase their effectiveness. Among other tasks, question answering [1], action recognition [2] and social media inference [3] have made use of external information to learn increasingly meaningful representations. These representations are a product of co-learning the primary and the external data through a common objective. In the context of supervised learning, this learning process depends on the availability of external data as well as on their relation to the primary dataset.
However, in practice the availability of external data is not guaranteed, or we may observe the outcome of external processes without having explicit access to the corresponding dataset. For instance, suppose we want to cluster towns according to their observed weather (primary task). At the same time we observe a grouping of geographical regions (which may include more than one town) according to rainfall data (auxiliary task). The outcome of the auxiliary task clearly relates to the primary task. However, (a) the actual relationship is unknown, i.e. we do not know the extent to which rainfall can predict the overall weather of an area; (b) we do not have access to the data that led to the outcome of the auxiliary task (in this case, the rainfall-based grouping), and in fact we may prefer not to perform two tasks but rather to improve one based on our observation of the second; and (c) even though data items (and consequently task outcomes) in the two tasks will be related, the cardinality of this relationship is unknown (in this case, each region contains many towns). Here, we consider the auxiliary categorical outcome as external evidence, and the process of influencing the latent representations of a dataset to improve a primary task as evidence transfer.
In this paper we propose a general framework that uses external categorical evidence, when available, to improve unsupervised learning, and in particular clustering. Using autoencoders that learn the primary dataset distribution, we learn a latent space that can later be manipulated to reflect new external evidence.
More specifically, we present a general evidence transfer method for combining multiple sources of external evidence to improve the outcome of clustering tasks. Our method makes no assumptions regarding the quality, source or availability of external information. It aids clustering tasks by learning augmented latent representations that are disentangled according to external categorical evidence. By manipulating the latent space, we increase the effectiveness of clustering algorithms that rely on linear distance metrics, such as k-means. Our method is effective, robust when presented with low quality additional evidence, and modular, as it can be incrementally applied to new pieces of evidence. The proposed evidence transfer method, although related to style transfer, has fundamental differences in the training procedure that are further discussed in Section IV.
II Methodology
In this section we define the problem of combining external evidence in a primary clustering task, introduce appropriate fitness criteria, and propose an evidence transfer method that satisfies these criteria.
II-A Problem Statement
Consider the task of clustering a dataset . Clustering yields a set of memberships , where can be modelled as the categorical probability distribution of over the target classes, . For the case of hard cluster assignment, one would eventually assign to cluster . In this paper denotes the primary dataset and clustering is the primary task.
External evidence is the set of outcomes of an auxiliary task applied either on the primary dataset or on some auxiliary dataset. Similar to the primary task, each can be seen as the categorical distribution of a subset of over the target classes of the auxiliary task, with the most straightforward case being that and that there is a one-to-one correspondence between the elements of and . There may exist multiple sources of external evidence yielding observable membership outcomes .
Our objective is to use external evidence to improve accuracy by reducing uncertainty in the primary clustering task. In any clustering task there are data samples that the clustering algorithm is not able to distinguish with high certainty. Clustering on latent representations of generally leads to better results due to their increased linear separability in the latent space learned [4]. By allowing external evidence to also influence the learning process, we posit that we can further improve linear separability, therefore achieving increased certainty in the primary task.
In this work we do not take into account the auxiliary datasets directly, only external categorical evidence produced on an unseen dataset by an unseen procedure. It follows that the proposed method makes no assumptions regarding the relation of the external evidence to the primary dataset. The only assumption made is that the external evidence is somehow related to the primary dataset but the mapping is unknown or too complex.
Methods that attempt to improve a task via evidence transfer should be at least effective and robust. In practice, sources of evidence may not be known and available at the beginning of evidence transfer, and therefore methods should also be modular in order to allow incrementally improving representations. More specifically, the fitness criteria we evaluate evidence transfer methods against are:

Effectiveness: If the evidence has a meaningful relation to the primary dataset, this relation should be discovered and utilised to reduce uncertainty in the latent space. Meaningful relations are characterised by consistency in the outcome of the auxiliary tasks. Intuitively, introducing more than one source of consistent evidence should lead to more effective performance than using a single source of evidence.

Robustness: Since the mapping is unknown, there may be evidence that does not contribute meaningful information to the primary task, i.e. that does not contribute to disentangling the latent representations of . For example, in cases where distribution or in cases where the evidence is consistent but is introduced in a non-corresponding order. The algorithm should be able to identify such evidence as low quality and reject it without making significant changes in the latent space, thereby maintaining its prior effectiveness.

Modularity: The method should not require complete retraining in view of additional evidence. For instance, the proposed method includes evidence as a fine-tuning step that augments the baseline representations. The added step should not disrupt the latent space in such a way that the original objective changes. The transformations that take place during the fine-tuning step should be restricted by the original objective of the baseline solution.
Modularity and Robustness are measures against low quality evidence, while Effectiveness reduces uncertainty in the latent space, leading to better performance.
II-B Dealing with external evidence
In order to satisfy the above criteria we consider the minimization of cross entropy as an appropriate objective. Cross entropy is an asymmetric measure that combines the entropy of the “true” distribution with its divergence from an auxiliary distribution (Equation 1). If we consider the external evidence as the “true” distribution and the latent space as the “auxiliary” distribution, cross entropy quantifies the uncertainty of the evidence distribution as well as its relation to the latent space. As a task outcome, the evidence distribution is considered fixed and therefore its entropy is constant. On the other hand, the distribution of the latent space belongs to a parametric family governed by the trainable parameters of the neural network.
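$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q) \qquad (1)$$
where $p$ denotes the “true” distribution and $q$ the auxiliary distribution.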
We use the cross entropy to shift these parameters towards reducing the divergence between the evidence distribution and the latent space. In cases where evidence correlates with the latent space, their divergence is minimized, thereby satisfying the effectiveness criterion. In cases where evidence cannot be correlated with the latent space, their divergence converges to high values that affect the latent space less and less with each epoch.
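The decomposition of cross entropy into the entropy of the “true” distribution plus a divergence term can be checked numerically. The following is a minimal sketch in plain Python; the three-class distributions are hypothetical values chosen for illustration and are not taken from our experiments.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """Cross entropy H(p, q) = -sum_k p_k log q_k in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical fixed evidence distribution and two candidate predictions.
evidence = [0.7, 0.2, 0.1]
close_prediction = [0.6, 0.3, 0.1]
far_prediction = [0.1, 0.2, 0.7]

# Cross entropy equals the (constant) evidence entropy plus the divergence.
decomposed = entropy(evidence) + kl_divergence(evidence, close_prediction)
loss_close = cross_entropy(evidence, close_prediction)
loss_far = cross_entropy(evidence, far_prediction)
```

Because the entropy of the evidence is constant, any reduction in `loss_close` or `loss_far` can only come from the divergence term, which is the part that depends on the trainable parameters.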
Since the relation between the evidence and the primary dataset is unknown, we do not introduce the evidence samples in their raw format. We use an additional biased autoencoder with a single hidden layer to upscale or downscale the width of the latent evidence samples so that it corresponds to the width of the inner hidden layer of the primary autoencoder. We use the term biased because we do not allow the evidence autoencoder to generalise over the input dataset: we train it for a low number of epochs so that it acts as an identity function.
We use latent categorical representations of the biased evidence autoencoders, denoted as , since the identity bias forces the evidence autoencoder to produce the same latent space distribution for both white noise evidence and inconsistent evidence. In both cases, the latent evidence samples approximate a uniform distribution, which results in high cross entropy that converges to a constant loss.
II-C Evidence transfer
Our method is a sequence of two steps: initialization and evidence transfer. During the initialization step we introduce a baseline clustering method that we later fine-tune using additional external evidence. In order to initialize the latent space of the baseline solution, we train an autoencoder using the squared error loss, namely (Equation 2).
(2) 
We believe that generative types of autoencoders, such as Denoising Autoencoders, Variational Autoencoders [5] or Adversarial Autoencoders [6], are fit for the initialization of the latent space. Generative autoencoders approximate a latent space distribution that is close to, or exactly the same as, the true underlying data-generating distribution. In our experiments, we use Denoising Autoencoders that maximize the expected log-likelihood of dataset given corrupted dataset (by minimizing (3), as defined in [7]), with the expectation taken over the joint data-generating distribution.
(3) 
When the training of the autoencoder is done, we apply the k-means algorithm to the initial latent representations. We introduce this method as a baseline solution to the clustering problem. Before we proceed to the evidence transfer step, we also train the additional evidence autoencoders to produce latent categorical samples as part of the initialization step.
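As an illustration of the baseline step, the sketch below runs a plain Lloyd-style k-means on a handful of latent vectors. It is a simplified stand-in written with the Python standard library only, not the implementation used in our experiments; the two-dimensional latent codes and the function names are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, hard assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid under squared Euclidean distance.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

# Hypothetical 2-D latent codes forming two well-separated groups.
latent = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(latent, k=2)
```

In this toy case the two groups are linearly separable, so the assignments recover them; the purpose of evidence transfer is to move real latent spaces closer to this situation.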
(4) 
(5) 
The evidence transfer step follows our predefined guidelines. We satisfy the criterion of modularity by introducing the process of evidence transfer as an additional step to the initialization. Objective minimizes the mean cross entropy between the additional evidence sources and the predictors . Objective restricts the latent space to preserve its baseline structure, meaning that latent samples remain able to perform their original task, which is the reconstruction of primary dataset samples. We jointly minimize both and by minimizing Equation (5), a weighted sum of both losses that uses the and hyperparameters as coefficients for the respective losses. By jointly optimizing both tasks we approach the maximization of the expected log-likelihood by using evidence-informed parameters .
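The weighted sum of the two objectives can be sketched for a single sample as follows. This is a minimal illustration under stated assumptions: the names `reconstruction_weight` and `evidence_weight` are hypothetical stand-ins for the two hyperparameters, the evidence rows are one-hot vectors, and the numbers are invented for the example.

```python
import math

def squared_error(x, x_reconstructed):
    """Squared reconstruction error for one sample."""
    return sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))

def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between a categorical target and a prediction."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

def joint_loss(x, x_reconstructed, evidence, predictions,
               reconstruction_weight=1.0, evidence_weight=1.0):
    """Weighted sum of reconstruction error and mean evidence cross entropy."""
    reconstruction = squared_error(x, x_reconstructed)
    # Mean cross entropy over all evidence sources and their predictors.
    evidence_term = sum(
        cross_entropy(e, q) for e, q in zip(evidence, predictions)
    ) / len(evidence)
    return reconstruction_weight * reconstruction + evidence_weight * evidence_term

# Hypothetical sample with two evidence sources of different widths.
x = [1.0, 0.0]
x_reconstructed = [0.9, 0.1]
evidence = [[1.0, 0.0, 0.0], [0.0, 1.0]]
predictions = [[0.8, 0.1, 0.1], [0.5, 0.5]]

total = joint_loss(x, x_reconstructed, evidence, predictions)
reconstruction_only = joint_loss(x, x_reconstructed, evidence, predictions,
                                 evidence_weight=0.0)
```

Setting the evidence coefficient to zero recovers the reconstruction objective of the initialization step, which is the sense in which the reconstruction term anchors the latent space to its baseline structure.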
We use additional layers (one for each source of evidence) at the output of the autoencoder in order to predict the latent categorical variables , as depicted in Figure 1. As opposed to directly manipulating the latent space, predictors adjust their weights depending on the quality of the evidence. In cases of low quality evidence, their weights decay and the joint minimization of Equation (5) is achieved by minimizing .
Relation to InfoGAN
Both the InfoGAN framework [8] and our framework make use of information-theoretic measures and added layers to manipulate the latent space. We use auxiliary task outcomes, obtained either on the primary dataset or on other auxiliary datasets, to manipulate an initialized latent space in order to improve a clustering task. InfoGAN introduces latent code to disentangle latent representations in order to manipulate the task of generation. In our case, evidence such as would be considered as low quality and rejected by our configuration, making no changes in the latent space.
III Evaluation and Results
To evaluate our solution we tried three different qualities of evidence (real corresponding evidence, random values / white noise, random index evidence) and three different quantities of evidence (single, double, triple). The fitness criteria we evaluate against are effectiveness and robustness. Random index evidence is essentially real corresponding evidence that we introduce in a non-corresponding order, in order to evaluate the robustness of our solution on inconsistent evidence. The code for all experiments and configurations is available at https://github.com/davidath/evitrac.
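The three qualities of evidence can be illustrated with a short sketch. The helper names below are hypothetical, the labels are toy values, and drawing white noise uniformly from [0, 1) is an assumption made for this example rather than a description of the released code.

```python
import random

def one_hot(label, num_classes):
    """Encode a categorical outcome as a one-hot vector."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def real_evidence(labels, num_classes):
    """Corresponding evidence: one categorical outcome per primary sample."""
    return [one_hot(y, num_classes) for y in labels]

def white_noise_evidence(num_samples, num_classes, rng):
    """Random values with no relation to the primary dataset."""
    return [[rng.random() for _ in range(num_classes)] for _ in range(num_samples)]

def random_index_evidence(labels, num_classes, rng):
    """Real evidence presented in a non-corresponding (shuffled) order."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return [one_hot(y, num_classes) for y in shuffled]

# Hypothetical auxiliary task outcomes for five primary samples.
rng = random.Random(0)
labels = [0, 1, 2, 1, 0]
real = real_evidence(labels, num_classes=3)
noise = white_noise_evidence(len(labels), num_classes=3, rng=rng)
shuffled = random_index_evidence(labels, num_classes=3, rng=rng)
```

Random index evidence keeps the marginal distribution of the real evidence but breaks its alignment with the primary samples, which is why it is a useful robustness test.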
III-A Datasets and Metrics
We briefly introduce the datasets and the preprocessing techniques that were used in our experiments.

MNIST: The MNIST dataset consists of 70000 images of handwritten digits. Each 28 x 28 image is reshaped into a single vector with 784 features.

CIFAR-10: CIFAR-10 contains 60000 32x32 colour images of 10 classes. In a similar manner to the experiments in VADE [9] and DEC [10], we do not cluster the raw images. Instead, we use feature vectors extracted by a VGG-16 network [11] pretrained on ImageNet [12]. We use the output of the first dense layer of VGG-16 as input to our configuration, so each image is transformed into a single vector of 4096 features.

20 Newsgroups: A dataset of 20000 newsgroup documents. For our experiments we use features from a word2vec model [13] pretrained on the Google News corpus, which yields a 300-dimensional vector for each word. After preprocessing we retain 18282 documents. To represent each document we use the mean of its word embeddings.

Reuters: Reuters Corpus Volume I [14] contains 804414 documents of 103 categories. For our experiments we use a subset of 96933 documents from 10 subcategories. In the same manner as DEC, we compute tf-idf features on the 2000 most frequent word stems.
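The document representation used for 20 Newsgroups, the mean of a document's word embeddings, can be sketched as follows. The three-dimensional lookup table is a toy stand-in for the 300-dimensional word2vec vectors, and skipping out-of-vocabulary tokens is an assumption of this sketch.

```python
def document_vector(tokens, embeddings):
    """Average the embeddings of the tokens found in the lookup table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known token, so no representation can be built
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Toy embedding table (hypothetical values, 3 dimensions instead of 300).
embeddings = {"rain": [1.0, 0.0, 2.0], "wind": [3.0, 2.0, 0.0]}
doc = document_vector(["rain", "wind", "unknown"], embeddings)
empty = document_vector(["unknown"], embeddings)
```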
To evaluate the effectiveness and robustness of each experiment we use the unsupervised clustering accuracy (ACC) and the normalized mutual information score (NMI) metrics. The unsupervised clustering accuracy was introduced in DEC. Both metrics are used in the evaluation of latent representation clustering frameworks such as DEC, VADE, DCEC [15], DEPICT [16], JULE [17], etc.
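Both metrics can be sketched with the Python standard library. Two simplifications are assumed here and should not be read as the exact evaluation code: the best one-to-one mapping between clusters and classes is found by brute force over permutations (practical only for a small number of clusters, whereas an assignment solver such as the Hungarian algorithm is the usual choice), and NMI is normalised by the arithmetic mean of the two label entropies.

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels, num_classes):
    """Best accuracy over all one-to-one cluster-to-class mappings."""
    counts = Counter(zip(cluster_labels, true_labels))
    best = max(
        sum(counts[(cluster, cls)] for cluster, cls in enumerate(mapping))
        for mapping in permutations(range(num_classes))
    )
    return best / len(true_labels)

def entropy(labels):
    """Shannon entropy of a label sequence in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(true_labels, cluster_labels):
    """NMI with arithmetic-mean normalisation of the two label entropies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    true_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)
    mutual_information = sum(
        (c / n) * math.log(c * n / (true_counts[t] * cluster_counts[k]))
        for (t, k), c in joint.items()
    )
    denominator = (entropy(true_labels) + entropy(cluster_labels)) / 2
    return mutual_information / denominator if denominator > 0 else 1.0

# Toy labels: cluster ids are arbitrary, so a permuted labelling is perfect.
truth = [0, 0, 1, 1, 2, 2]
permuted = [1, 1, 0, 0, 2, 2]
imperfect = [0, 0, 0, 1, 2, 2]

acc_permuted = clustering_accuracy(truth, permuted, num_classes=3)
acc_imperfect = clustering_accuracy(truth, imperfect, num_classes=3)
nmi_permuted = normalized_mutual_information(truth, permuted)
nmi_imperfect = normalized_mutual_information(truth, imperfect)
```

Both metrics are invariant to how cluster ids are numbered, which is what makes them suitable for evaluating an unsupervised task against ground-truth classes.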
III-B Effectiveness and Robustness
We evaluate our solution based on the results of our experiments when introducing one, two and three sources of evidence (Tables I, II and III). In all cases where a corresponding source of evidence is present, our solution is able to utilize it effectively, leading to an increase in unsupervised clustering accuracy and normalized mutual information score. The effectiveness criterion is satisfied, considering that the gain in effectiveness scales with the number of corresponding evidence sources. The robustness criterion is also satisfied, since there is no significant loss in cases where we introduce any source of low quality evidence.




Evidence configuration  ACC (%)  NMI (%)
Baseline  22.79  13.44
3 Real (w: 3,4,5)  64.75 (+41.96)  74.23 (+60.79)
2 Real + 1 Noise (w: 3,4,3)  53.04 (+30.26)  61.74 (+48.30)
1 Real + 2 Noise (w: 3,3,10)  36.67 (+13.89)  46.21 (+32.77)
2 Real + 1 Noise (w: 3,5,3)  60.56 (+37.77)  71.39 (+57.94)
1 Real + 2 Noise (w: 3,3,10)  44.68 (+21.89)  54.37 (+40.92)
2 Real + 1 Noise (w: 4,5,3)  63.42 (+40.63)  77.16 (+63.72)
1 Real + 2 Noise (w: 5,3,10)  62.49 (+39.70)  65.58 (+52.14)
3 White Noise (w: 3,10,5)  25.21 (+2.43)  14.90 (+1.46)
During the incremental manipulation of the initial latent space, latent representations are disentangled according to corresponding evidence. The joint minimization of the cross entropy and the reconstruction error leads to the “tagging” of samples. In Figure 2 we showcase the data sample “tagging” performed by our framework. The “tagging” refers to data samples being reconstructed with added symbols that are consistent with the groupings of the evidence.
To evaluate the ability of a linear algorithm to distinguish between samples of different classes after the incremental manipulation, we use an SVM classifier (with a linear kernel) to perform binary classification before and after the evidence transfer. Figure 4 showcases the ability of an SVM classifier to distinguish between two specific classes during the initial and incremental manipulation stages. Figure 3 visualizes both states of the latent space as a whole.
Reuters Effectiveness
Reuters consists of 4 root categories, each of which branches out into multiple subcategories, leading to 103 categories in total. For our experiments, the primary task is to cluster 10 categories. The latent representations produced during the initial solution are naturally clustered into the original 4 root categories, with overlaps between the representations of each cluster. For that reason, we incrementally trained the initial solution by alternating between minimizing and in each batch. The natural clustering into 4 clusters restricts the effectiveness of our framework during the 10 category clustering, yet we still satisfy both the effectiveness and robustness criteria.
IV Related Work
IV-A Deep Generative Networks
CrossingNets [18] is one of the frameworks that have considered combining external information in an unsupervised learning task. CrossingNets uses two different data sources for the task of hand pose estimation. It combines the two data sources by using a shared latent space. Although effective for the task of hand pose estimation, CrossingNets is a task-specific configuration that does not propose a scalable way to handle multiple sources of evidence.
InfoGAN is another case of using external information in the unsupervised task of generation. InfoGAN uses structured latent variables in order to manipulate the generation process of a GAN. InfoGAN utilizes the Information Maximization [19] algorithm in order to disentangle latent representations. The structured latent variables are arbitrarily chosen by observing the dataset and are randomly sampled from known distributions. Such latent variables are considered low quality evidence in our case, since they do not provide any insight into the primary task.
IV-B Autoencoders
Adversarial Autoencoders (AAE) [6] assimilate external information into the latent space using random samples. AAE uses external information by making assumptions about the prior distribution of the latent space. They depend on sampling from known distributions, such as the Normal or Categorical distribution, to manipulate the latent space of an autoencoder. As with InfoGAN, such types of evidence are considered low quality in our setting and serve a specific purpose in these frameworks. In addition, AAE makes assumptions regarding the structure of the latent space and uses the external information accordingly.
Multimodal datasets have been used in autoencoders in order to fill in missing sensor data [20]. The latent space of the Multimodal Autoencoder (MMAE) has assimilated all modalities of a mood prediction dataset and is able to reconstruct images even in cases where some modalities are missing. MMAE is a specific configuration designed to accommodate cases of missing sensor data and proposes a way of assimilating modalities of the same dataset.
To address class imbalance in classification, Ng et al. [21] proposed a dual autoencoder solution. These autoencoders use different activations of the same autoencoder to capture different aspects of the same information. Each configuration can be treated as an additional source of external information, in the same way as ensemble methods combine different task outcomes. Another dual autoencoder configuration learns representations used for recommendation [22]. These autoencoders focus on two different sources (namely item and user) in order to create a latent space that will be used for the recommendation task.
These methods make assumptions about the availability of the external information. As proposed, dual autoencoders require both datasets to be present during both training and evaluation.
IV-C Semi-Supervised Learning
In semi-supervised learning the various configurations make use of both labeled and unlabeled data. The presence of labels or lack thereof usually refers directly to the label set associated with the training dataset as well as the primary task. Semi-supervised learning configurations ([23], [24], [25]), Principled Hybrids of Generative and Discriminative models [26] and our framework share the notion of training generative models to also perform discriminative tasks. Nevertheless, in our framework the external categorical evidence is not guaranteed to be directly associated with the primary task.
IV-D Style transfer and Transfer Learning
During transfer learning a model pretrained on a primary task is subsequently used to learn a new auxiliary task. It is not uncommon for transfer learning techniques to freeze the pretrained layer weights and iteratively train the additional layers. In style transfer [27] freezing the layers of the network is the standard process. An initial random image is reconstructed so that it shares similar features with some content and style feature maps. Backpropagation involves gradients computed with respect to the reconstructed image and not the trainable parameters of the network. In addition, the variables informed during style transfer are continuous. Our framework is incrementally trained to learn latent representations that satisfy both the reconstruction and the cross entropy objectives. Through joint optimization of both objectives, we produce evidence-informed weights and biases. In our case, to improve the clustering outcome we transfer features of categorical variables (evidence) to the continuous variables of the latent representations.
V Conclusions and Future Work
In this paper we presented the evidence transfer method. Evidence transfer is a general method of combining external additional evidence to increase the performance of clustering tasks. It makes no assumptions about the relation of the evidence to the primary dataset or its availability. We introduced a set of guidelines to cope with the categorical evidence, as well as the evaluation criteria of a solution that exploits one or more pieces of external evidence to improve a primary task. Our evidence transfer approach manipulates the latent representation to be more linearly separable and therefore leads to increased performance. The effectiveness achieved scales with increasing number of pieces of evidence, while at the same time it is robust when introducing low quality or ineffective evidence.
We evaluated our proposed solution using evidence of different quantitative and qualitative properties; however, the effective evidence we used is directly related to our primary clustering task. Future work is directed towards evaluating this method when there is a nonlinear relationship between the evidence and the primary dataset. Although this case may be partially covered by using the latent representations of the additional evidence autoencoder, it remains to be validated experimentally. Additionally, evaluating the ability of evidence transfer to improve other unsupervised tasks, such as generation, is also considered for future work.
Satisfying the robustness criterion during the optimization was a challenging task. The choice of optimization algorithm proved to be crucial for satisfying both the effectiveness and robustness criteria: adaptive optimizers disrupted the initial latent space during joint training with low quality evidence. These experiments indicate the need for further investigation of optimization techniques for the incremental training of multi-task learning models with conflicting or unrelated objectives.
References
 [1] D. Savenkov and E. Agichtein, “EviNets: Neural Networks for Combining Evidence Signals for Factoid Question Answering,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017, pp. 299–304. [Online]. Available: https://doi.org/10.18653/v1/P172047

 [2] E. Park, X. Han, T. L. Berg, and A. C. Berg, “Combining multiple sources of knowledge in deep CNNs for action recognition,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, Mar. 2016, pp. 1–8. [Online]. Available: http://ieeexplore.ieee.org/document/7477589/
 [3] J. Li, A. Ritter, and D. Jurafsky, “Learning multifaceted representations of individuals from heterogeneous evidence using neural networks,” 2015. [Online]. Available: https://arxiv.org/pdf/1510.05198.pdf
 [4] Y. Bengio, A. Courville, and P. Vincent, “Representation Learning: A Review and New Perspectives,” Tech. Rep. [Online]. Available: http://www.imagenet.org/challenges/LSVRC/2012/results.html
 [5] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” 2013. [Online]. Available: http://arxiv.org/abs/1312.6114
 [6] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial Autoencoders,” 2015. [Online]. Available: http://arxiv.org/abs/1511.05644
 [7] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized Denoising Auto-Encoders as Generative Models,” pp. 1–9, 2013. [Online]. Available: http://arxiv.org/abs/1305.6663
 [8] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets,” 2016. [Online]. Available: http://arxiv.org/abs/1606.03657
 [9] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou, “Variational deep embedding: An unsupervised generative approach to Clustering,” IJCAI International Joint Conference on Artificial Intelligence, pp. 1965–1972, 2017. [Online]. Available: https://arxiv.org/pdf/1611.05148.pdf
 [10] J. Xie, R. B. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” CoRR, vol. abs/1511.06335, 2015. [Online]. Available: http://arxiv.org/abs/1511.06335
 [11] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” International Conference on Learning Representations (ICLR), pp. 1–14, 2015. [Online]. Available: http://arxiv.org/abs/1409.1556
 [12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
 [13] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” Tech. Rep. [Online]. Available: https://arxiv.org/abs/1301.3781
 [14] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “RCV1: A New Benchmark Collection for Text Categorization Research,” Tech. Rep., 2004. [Online]. Available: http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf

 [15] X. Guo, X. Liu, E. Zhu, and J. Yin, “Deep Clustering with Convolutional Autoencoders,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10635 LNCS, 2017, pp. 373–382. [Online]. Available: https://xifengguo.github.io/papers/ICONIP17DCEC.pdf
 [16] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang, “Deep Clustering via Joint Convolutional Autoencoder Embedding and Relative Entropy Minimization,” in Proceedings of the IEEE International Conference on Computer Vision, vol. 2017-October, 2017, pp. 5747–5756. [Online]. Available: https://arxiv.org/pdf/1704.06327.pdf
 [17] J. Yang, D. Parikh, and D. Batra, “Joint Unsupervised Learning of Deep Representations and Image Clusters,” 2016. [Online]. Available: https://arxiv.org/pdf/1604.03628.pdf
 [18] C. Wan, T. Probst, L. Van Gool, and A. Yao, “Crossing Nets: Dual Generative Models with a Shared Latent Space for Hand Pose Estimation,” in CVPR2017, 2017, p. 10. [Online]. Available: http://arxiv.org/abs/1702.03431
 [19] D. Barber and F. V. Agakov, “The IM Algorithm: A Variational Approach to Information Maximization,” Advances in Neural Information Processing Systems, no. 2, 2003.
 [20] N. Jaques, S. Taylor, A. Sano, and R. Picard, Multimodal Autoencoder : A Deep Learning Approach to Filling In Missing Sensor Data and Enabling Better Mood Prediction, 2017. [Online]. Available: http://affect.media.mit.edu/pdfs/17.Jaques_autoencoder_ACII.pdf
 [21] W. W. Ng, G. Zeng, J. Zhang, D. S. Yeung, and W. Pedrycz, “Dual autoencoders features for imbalance classification problem,” Pattern Recognition, vol. 60, pp. 875–889, 2016.
 [22] F. Zhuang, Z. Zhang, M. Qian, C. Shi, X. Xie, and Q. He, “Representation learning via DualAutoencoder for recommendation,” Neural Networks, vol. 90, pp. 83–89, 2017.
 [23] J. T. Springenberg, “Unsupervised and Semisupervised Learning with Categorical Generative Adversarial Networks,” 2015. [Online]. Available: http://arxiv.org/abs/1511.06390
 [24] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, “Adversarially Learned Inference.” [Online]. Available: https://arxiv.org/pdf/1606.00704.pdf
 [25] M. I. Belghazi, S. Rajeswar, O. Mastropietro, N. Rostamzadeh, J. Mitrovic, and A. Courville, “Hierarchical Adversarially Learned Inference.” [Online]. Available: https://arxiv.org/pdf/1802.01071.pdf
 [26] J. A. Lasserre, C. M. Bishop, and T. P. Minka, “Principled hybrids of generative and discriminative models,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006.
 [27] L. A. Gatys, A. S. Ecker, and M. Bethge, “A Neural Algorithm of Artistic Style,” Tech. Rep., 2015. [Online]. Available: https://arxiv.org/pdf/1508.06576.pdf