
Evidence Transfer for Improving Clustering Tasks Using External Categorical Evidence

by   Athanasios Davvetas, et al.

In this paper we introduce evidence transfer for clustering, a deep learning method that can incrementally manipulate the latent representations of an autoencoder, according to external categorical evidence, in order to improve a clustering outcome. It is deployed on a baseline solution to reduce the cross entropy between the external evidence and an extension of the latent space. By evidence transfer we define the process by which the categorical outcome of an external, auxiliary task is exploited to improve a primary task, in this case representation learning for clustering. Our proposed method makes no assumptions regarding the categorical evidence presented, nor the structure of the latent space. We compare our method against the baseline solution by performing k-means clustering before and after its deployment. Experiments with three different kinds of evidence show that our method effectively manipulates the latent representations when presented with real corresponding evidence, while remaining robust when presented with low quality evidence.


I Introduction

Over the last few years more and more neural network configurations and frameworks have proposed the use of multimodal datasets, or external evidence, as a way to increase their effectiveness. Among other tasks, question answering [1], action recognition [2] and social media inference [3] have made use of external information to learn increasingly meaningful representations. These representations are a product of co-learning the primary and the external data through a common objective. In the context of supervised learning, this learning process depends on the availability of external data as well as on their relation to the primary dataset.

However, in practice the availability of external data is either not guaranteed, or we may observe the outcome of external processes without having explicit access to the corresponding dataset. For instance, suppose we want to cluster towns according to their observed weather (primary task). At the same time we observe a grouping of geographical regions (which may include more than one town) according to rainfall data (auxiliary task). The outcome of the auxiliary task clearly relates to the primary task, however (a) the actual relationship is unknown, i.e. we do not know the extent to which rainfall can predict the overall weather of an area; (b) we do not have access to the data which led to the auxiliary tasks (in this case, the rainfall-based outcome), in fact we may prefer to not perform two tasks but rather to improve one based on our observation of the second; and (c) even though data items (and consequently task outcomes) in the two tasks will be related, the cardinality of this relationship is unknown (in this case, each region contains many towns). Here, we consider the auxiliary categorical outcome as external evidence, and the process of influencing the latent representations of a dataset to improve a primary task, as evidence transfer.

In this paper we propose a general framework that uses external categorical evidence when available to improve unsupervised learning, and in particular clustering. Using autoencoders that learn the primary dataset distribution, we learn a latent space that can be later manipulated to reflect new external evidence.

We present a general evidence transfer method for combining multiple sources of external evidence to improve the outcome of clustering tasks. Our method makes no assumptions regarding the quality, source or availability of external information. It aids clustering tasks by learning augmented latent representations that are disentangled according to external categorical evidence. By manipulating the latent space, we increase the effectiveness of clustering algorithms that rely on linear distance metrics, such as k-means. Our method is effective, robust when presented with low quality additional evidence, and modular, as it can be incrementally applied to new pieces of evidence. The proposed evidence transfer method, although related to style transfer, has fundamental differences in the training procedure that are further discussed in Section IV.

II Methodology

In this section we define the problem of combining external evidence in a primary clustering task, introduce appropriate fitness criteria, and propose an evidence transfer method that satisfies these criteria.

II-A Problem Statement

Consider the task of clustering a dataset. Clustering yields a set of memberships, where each membership can be modelled as the categorical probability distribution of a data sample over the target classes. For the case of hard cluster assignment, one would eventually assign each sample to its most probable cluster. In this paper this dataset is the primary dataset and clustering is the primary task.

External evidence is the set of outcomes of an auxiliary task applied either on the primary dataset or on some auxiliary dataset. Similar to the primary task, each outcome can be seen as the categorical distribution of a subset of the data over the target classes of the auxiliary task, with the most straightforward case being a one-to-one correspondence between the evidence outcomes and the primary data samples. There may exist multiple sources of external evidence, each yielding observable membership outcomes.

Our objective is to use external evidence to improve accuracy by reducing uncertainty in the primary clustering task. In any clustering task there are data samples that the clustering algorithm is not able to distinguish with high certainty. Clustering on latent representations of the primary dataset generally leads to better results due to their increased linear separability in the learned latent space [4]. By allowing external evidence to also influence the learning process, we posit that we can further improve linear separability, therefore achieving increased certainty in the primary task.

In this work we do not take into account the auxiliary datasets directly, only external categorical evidence produced on an unseen dataset by an unseen procedure. It follows that the proposed method makes no assumptions regarding the relation of the external evidence to the primary dataset. The only assumption made is that the external evidence is somehow related to the primary dataset but the mapping is unknown or too complex.

Methods that attempt to improve a task via evidence transfer should be at least effective and robust. In practice, sources of evidence may not be known and available at the beginning of evidence transfer, and therefore methods should also be modular in order to allow incrementally improving representations. More specifically, the fitness criteria we evaluate evidence transfer methods against are:

  1. Effectiveness: In the case that the evidence corresponds to a meaningful relation between itself and the primary dataset, this should be discovered and utilised to reduce uncertainty in the latent space. Meaningful relations are characterised by consistency in the outcome of the auxiliary tasks. Intuitively, introducing more than one source of consistent evidence should lead to more effective performance than using a single source of evidence.

  2. Robustness: Since the mapping is unknown, there may be evidence that does not contribute meaningful information to the primary task, i.e. that does not contribute to disentangling the latent representations of the primary dataset. For example, the evidence may be white noise, or it may be consistent but introduced in a non-corresponding order. The algorithm should be able to distinguish this evidence as low quality evidence and be able to reject it without making significant changes in the latent space, therefore maintaining its prior effectiveness.

  3. Modularity: The method should not require complete re-training in view of additional evidence. For instance, the proposed method includes evidence as a fine-tuning step that augments the baseline representations. The added step should not disrupt the latent space in a way that leads to changes in the original objective. The transformations that take place during the fine-tuning step should be restricted by the original objective of the baseline solution.

Modularity and Robustness are measures against low quality evidence, while Effectiveness reduces uncertainty in the latent space, leading to better performance.

II-B Dealing with external evidence

In order to satisfy the above criteria we consider the minimization of cross entropy as an appropriate objective. Cross entropy is an asymmetric measure that combines the entropy of the “true” distribution with its divergence from an auxiliary distribution (Equation 1). Considering the external evidence as the “true” distribution and the latent space as the “auxiliary” distribution, cross entropy quantifies the uncertainty of the evidence distribution as well as its relation to the latent space. As a task outcome, the evidence distribution is considered fixed and therefore its entropy is constant. On the other hand, the distribution of the latent space belongs to a parametric family that involves the trainable parameters of the neural network.
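The decomposition referred to as Equation 1 is the standard identity below. The symbols are assumed notation rather than the authors' own: $p$ stands for the evidence distribution and $q$ for the distribution induced by the latent space.

```latex
H(p, q) \;=\; -\sum_{k} p(k) \log q(k) \;=\; H(p) + D_{\mathrm{KL}}(p \,\|\, q)
```

Because $H(p)$ is fixed for a given piece of evidence, minimizing $H(p, q)$ with respect to the network parameters amounts to minimizing the divergence term.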


We use the cross entropy to shift these parameters towards reducing the divergence between the evidence distribution and the latent space. In cases where evidence correlates with the latent space, their divergence is minimized, thereby satisfying the effectiveness criterion. In cases where evidence cannot be correlated with the latent space, their divergence converges to high values that affect the latent space less and less with each epoch.

Since the relation between the evidence and the primary dataset is unknown, we do not introduce the evidence samples in their raw format. We use an additional biased autoencoder with a single hidden layer to upscale or downscale the width of the latent evidence samples so that it corresponds to the width of the inner hidden layer of the primary autoencoder. We use the term biased because we do not allow the evidence autoencoder to generalise over the input dataset. We train the evidence autoencoder for a low number of epochs so that it acts as an identity function.

We use the latent categorical representations of the biased evidence autoencoders, since the identity bias forces the evidence autoencoder to produce the same latent space distribution for both white noise evidence and inconsistent evidence. In both cases, the latent evidence samples approximate a uniform distribution, which results in a high cross entropy that converges to a constant loss.
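The constant-loss behaviour can be checked numerically. The sketch below is an illustration under the assumption that the evidence target is exactly uniform over K categories; the function names and example vectors are hypothetical. With a uniform target, the cross entropy cannot drop below log K whatever the predictor outputs, whereas an informative one-hot target can be driven close to zero.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in nats, with p as the fixed target distribution."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

K = 4
uniform = [1.0 / K] * K           # stand-in for low quality evidence
one_hot = [1.0, 0.0, 0.0, 0.0]    # stand-in for informative evidence

confident = [0.97, 0.01, 0.01, 0.01]   # a confident predictor output
flat = [0.25, 0.25, 0.25, 0.25]        # an uninformative predictor output

# Informative target: a confident, correct predictor gets a loss near zero.
loss_informative = cross_entropy(one_hot, confident)

# Uniform target: the best any predictor can do is log K (the flat output).
loss_uniform_flat = cross_entropy(uniform, flat)
loss_uniform_confident = cross_entropy(uniform, confident)

# The decomposition H(p, q) = H(p) + D_KL(p || q) holds for these values.
decomposed = entropy(uniform) + kl_divergence(uniform, confident)
```

Here `loss_uniform_flat` equals log K, so once the predictor output has flattened there is no gradient signal left that could reshape the latent space.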

II-C Evidence transfer

Our method is a sequence of two steps: initialization and evidence transfer. During the initialization step we introduce a baseline clustering method that we later fine-tune using additional external evidence. In order to initialize the latent space of the baseline solution, we train an autoencoder using the squared error loss (Equation 2).
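The squared error loss referred to as Equation 2 has the usual form below. The symbols are assumed notation: $x$ is an input sample, $\hat{x}$ its reconstruction, and $f_{\theta}$, $g_{\phi}$ are hypothetical names for the encoder and decoder.

```latex
\ell_{AE}(x) \;=\; \lVert x - \hat{x} \rVert_2^2, \qquad \hat{x} = g_{\phi}\big(f_{\theta}(x)\big)
```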


We believe that generative types of autoencoders such as Denoising Autoencoders, Variational Autoencoders [5] or Adversarial Autoencoders [6] are fit for the initialization of the latent space. Generative autoencoders approximate a latent space distribution that is close to, or exactly the same as, the true underlying data-generating distribution. In our experiments, we use Denoising Autoencoders that maximize the expected log-likelihood of the dataset given its corrupted version (by minimizing Equation (3), as defined in [7]), with the expectation taken over the joint data-generating distribution.


When the training of the autoencoder is done, we use the k-means algorithm on the initial latent representations. We introduce this method as a baseline solution to the clustering problem. Before we proceed to the evidence transfer step, we also train the additional evidence autoencoders to produce latent categorical samples as part of the initialization step.


The evidence transfer step follows the guidelines defined above. We satisfy the criterion of modularity by introducing the process of evidence transfer as an additional step to the initialization. The cross entropy objective minimizes the mean cross entropy between each additional evidence source and its predictor. The reconstruction objective restricts the latent space to preserve its baseline structure, meaning that latent samples are still able to perform their original task, which is the reconstruction of primary dataset samples. We jointly minimize both objectives by minimizing Equation (5), a weighted sum in which two hyperparameters act as coefficients for the respective losses. By jointly optimizing both tasks we approach the maximization of the expected log-likelihood using evidence-informed parameters.
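The weighted sum described above can be sketched for a single sample as follows. This is an illustration rather than the authors' implementation: the names `recon_weight` and `evidence_weight` stand in for the two hyperparameters, and the evidence targets and predictor outputs are assumed to be categorical distributions supplied as plain lists.

```python
import math

def squared_error(x, x_hat):
    """Reconstruction loss: squared error between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy of a predictor output against one evidence target."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

def joint_loss(x, x_hat, evidence, predictions,
               recon_weight=1.0, evidence_weight=1.0):
    """Weighted sum of the reconstruction loss and the mean cross entropy.

    evidence:    one target distribution per evidence source
    predictions: one predictor output per evidence source (same order)
    """
    recon = squared_error(x, x_hat)
    mean_ce = sum(cross_entropy(v, q)
                  for v, q in zip(evidence, predictions)) / len(evidence)
    return recon_weight * recon + evidence_weight * mean_ce

# One sample, two evidence sources of widths 3 and 2 (made-up values).
x, x_hat = [0.0, 1.0], [0.1, 0.8]
evidence = [[1.0, 0.0, 0.0], [0.0, 1.0]]
predictions = [[0.8, 0.1, 0.1], [0.5, 0.5]]

total = joint_loss(x, x_hat, evidence, predictions)
recon_only = joint_loss(x, x_hat, evidence, predictions, evidence_weight=0.0)
```

Setting `evidence_weight` to zero recovers the baseline reconstruction objective, which is the sense in which the added term acts as a fine-tuning step on top of the initialization.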

We use additional layers (one for each source of evidence) at the output of the autoencoder in order to predict the latent categorical variables, as depicted in Figure 1. As opposed to directly manipulating the latent space, predictors adjust their weights depending on the quality of the evidence. In cases of low quality evidence, their weights decay and the joint minimization of Equation (5) is achieved by minimizing the reconstruction objective.

Relation to InfoGAN

The InfoGAN framework [8] and our framework both make use of information theory metrics and added layers to manipulate the latent space. We use auxiliary task outcomes performed either on the primary dataset or on other auxiliary datasets to manipulate an initialized latent space in order to improve a clustering task. InfoGAN introduces a latent code to disentangle latent representations in order to manipulate the task of generation. In our case, evidence such as InfoGAN's latent code would be considered low quality and rejected by our configuration, making no changes in the latent space.

(a) Primary Autoencoder with additional layers (Stacked)
(b) Primary Autoencoder with additional layers (Convolutional)
(c) Evidence Autoencoder (Auxiliary)
Fig. 1: Neural network configurations used by the evidence transfer method. Figures (a) and (b) depict the neural network configuration of the primary task, which is the initial baseline solution along with the additional layers needed to manipulate the latent space. We showcase the Stacked Denoising Autoencoder version (Figure (a)) that was used for the experiments in CIFAR, 20newsgroups and REUTERS-100k. For the MNIST dataset, we use a Convolutional Autoencoder (Figure (b)) whose topology is similar to the Convolutional Autoencoder proposed in DCEC. The Stacked Denoising Autoencoder has widths of d-500-500-200-10, as seen in DEC. Figure (c) depicts the topology of the evidence autoencoder, from which we acquire the latent representations that are used during evidence transfer. The widths of the fully connected layers in the evidence autoencoder depend on the width of each evidence source and on the latent space width.

III Evaluation and Results

For the purpose of evaluating our solution we tried three different qualities of evidence (real corresponding evidence, random values / white noise, random index evidence) and three different quantities of evidence (single, double, triple). The fitness criteria for our solution are effectiveness and robustness. Random index evidence is essentially real corresponding evidence that we introduce in a non-corresponding order to evaluate the robustness of our solution on inconsistent evidence. The code for all experiments and configurations is available at
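The three evidence qualities can be illustrated with a small sketch. It rests on assumptions not stated in the text: real evidence is taken to be one-hot encoded, white noise is drawn uniformly from [0, 1), and all function names are hypothetical.

```python
import random

def one_hot(label, width):
    """Encode an integer label as a one-hot vector of the given width."""
    return [1.0 if i == label else 0.0 for i in range(width)]

def real_evidence(labels, width):
    """Real corresponding evidence: row i describes sample i."""
    return [one_hot(y, width) for y in labels]

def white_noise_evidence(n_samples, width, rng):
    """Random values with no relation to the samples (assumed uniform)."""
    return [[rng.random() for _ in range(width)] for _ in range(n_samples)]

def random_index_evidence(evidence, rng):
    """Real evidence presented in a non-corresponding (shuffled) order."""
    shuffled = list(evidence)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)
labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]   # made-up auxiliary task outcomes
width = 3

real = real_evidence(labels, width)
noise = white_noise_evidence(len(labels), width, rng)
shuffled = random_index_evidence(real, rng)
```

The shuffled variant keeps exactly the same rows as the real evidence, so it is internally consistent; only the correspondence between rows and primary samples is destroyed.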

III-A Datasets and Metrics

We briefly introduce the datasets and the preprocessing techniques that were used in our experiments.

  • MNIST: The MNIST dataset consists of 70000 images of handwritten digits. Each 28 x 28 image is reshaped into a single vector with 784 features.

  • CIFAR-10: CIFAR-10 contains 60000 32x32 colour images of 10 classes. In a similar manner as the experiments in VADE [9] and DEC [10], we do not cluster the raw images. We use feature vectors acquired by a VGG-16 network [11] pretrained on ImageNet [12]. We use the output of the first dense layer of VGG-16 as input to our configuration, so each image is transformed to a single vector of 4096 features.

  • 20 Newsgroups: A dataset of 20000 newsgroup documents. For our experiments we use features acquired from a word2vec model [13] pretrained on the Google News corpus. We acquire a 300 dimensional vector for each word. After preprocessing we retain 18282 documents. To represent each document we use the mean of its word embeddings.

  • Reuters: Reuters Corpus Volume I [14] contains 804414 documents of 103 categories. For our experiments we use a subset of 96933 documents of 10 sub categories. In the same manner as DEC, we compute tf-idf features on the 2000 most frequent word stems.

To evaluate the effectiveness and robustness of each experiment we use the unsupervised clustering accuracy (ACC) and the normalized mutual information score (NMI) metrics. The unsupervised clustering accuracy was introduced in DEC. Both metrics are used in the evaluation of latent representation clustering frameworks such as DEC, VADE, DCEC [15], DEPICT [16], JULE [17], etc.
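Both metrics can be sketched in a few lines. The version below is illustrative: ACC is computed by brute force over all one-to-one mappings from clusters to labels, which is only practical for a handful of clusters (implementations commonly use the Hungarian algorithm instead), and NMI is normalized by the arithmetic mean of the two entropies, which is an assumption since the text does not state the normalization used.

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one mappings from clusters to labels.

    Assumes there are no more clusters than distinct labels.
    """
    labels = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    best_hits = 0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, c in zip(y_true, y_pred) if mapping[c] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(y_true)

def label_entropy(labels):
    """Empirical entropy (nats) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_info(y_true, y_pred):
    """Mutual information divided by the mean of the two entropies."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    count_t, count_p = Counter(y_true), Counter(y_pred)
    mi = sum((nij / n) * math.log(n * nij / (count_t[t] * count_p[c]))
             for (t, c), nij in joint.items())
    denom = (label_entropy(y_true) + label_entropy(y_pred)) / 2
    return mi / denom if denom > 0 else 1.0

y_true = [0, 0, 1, 1, 2, 2]
relabelled = [1, 1, 2, 2, 0, 0]     # same partition, different cluster ids
one_mistake = [1, 1, 2, 0, 0, 0]    # one sample placed in the wrong cluster

acc_perfect = clustering_accuracy(y_true, relabelled)
acc_mistake = clustering_accuracy(y_true, one_mistake)
nmi_perfect = normalized_mutual_info(y_true, relabelled)
nmi_mistake = normalized_mutual_info(y_true, one_mistake)
```

Both metrics are invariant to how cluster ids are numbered, which is why the relabelled partition scores as a perfect match.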

III-B Effectiveness and Robustness

We evaluate our solution based on the results of our experiments when introducing one, two and three sources of evidence (Tables I, II and III). In all cases where a corresponding source of evidence is present, our solution is able to utilize it effectively, leading to an increase in the unsupervised clustering accuracy and the normalized mutual information score. The effectiveness criterion is satisfied, considering that the gain in effectiveness scales with the number of corresponding evidence sources. The robustness criterion is also satisfied, since there is no significant loss in cases where we introduce any source of low quality evidence.

| Evidence | ACC (%) | NMI (%) |
|---|---|---|
| Baseline | 82.03 | 76.25 |
| Real evidence (w: 3) | 95.57 (+13.54) | 89.59 (+13.34) |
| Real evidence (w: 10) | 96.71 (+14.68) | 91.77 (+15.52) |
| White noise (w: 3) | 82.32 (+0.29) | 76.40 (+0.14) |
| White noise (w: 10) | 82.32 (+0.29) | 76.40 (+0.14) |
| Random index (w: 3) | 82.16 (+0.13) | 76.29 (+0.04) |
| Random index (w: 10) | 82.34 (+0.32) | 76.43 (+0.18) |
| 2 Real evidence (w: 3,4) | 97.72 (+15.69) | 93.93 (+17.68) |
| 2 White noise (w: 3,10) | 82.20 (+0.17) | 76.38 (+0.13) |
| 1 Real + 1 Noise (w: 3,3) | 95.52 (+13.50) | 89.50 (+13.25) |

(a) MNIST

| Evidence | ACC (%) | NMI (%) |
|---|---|---|
| Baseline | 22.79 | 13.44 |
| Real evidence (w: 3) | 37.34 (+14.56) | 46.24 (+32.80) |
| Real evidence (w: 10) | 91.97 (+69.18) | 83.06 (+69.62) |
| White noise (w: 3) | 24.62 (+1.83) | 14.66 (+1.22) |
| White noise (w: 10) | 24.61 (+1.82) | 14.56 (+1.12) |
| Random index (w: 3) | 26.18 (+3.39) | 15.35 (+1.91) |
| Random index (w: 10) | 26.01 (+3.22) | 15.08 (+1.63) |
| 2 Real evidence (w: 3,4) | 52.86 (+30.07) | 61.44 (+48.00) |
| 2 White noise (w: 3,10) | 25.00 (+2.22) | 14.80 (+1.35) |
| 1 Real + 1 Noise (w: 3,3) | 36.97 (+14.18) | 46.22 (+32.78) |

(b) CIFAR-10

TABLE I: For our experiments in both MNIST and CIFAR-10, we report the average of 4 runs for each evidence configuration; w indicates the width of each evidence vector. For MNIST, real evidence of width 3 represents a relation over the digit label that yields three groups, while evidence of width 4 corresponds to a relation that yields four groups. Real evidence of width 10 is the full labelset of MNIST. For CIFAR-10, real evidence does not correspond to any notable relations. Width 3 real evidence separates the samples into three categories: vehicles, pets and wild animals. Width 4 real evidence expands the pets category into another group of two classes. Width 10 evidence corresponds to the labelset of CIFAR-10.
| Evidence | ACC (%) | NMI (%) |
|---|---|---|
| Baseline | 21.19 | 25.01 |
| Real evidence (w: 5) | 34.18 (+12.99) | 57.35 (+32.34) |
| Real evidence (w: 20) | 88.90 (+67.71) | 90.01 (+65.00) |
| White noise (w: 3) | 22.36 (+1.17) | 25.49 (+0.49) |
| White noise (w: 10) | 22.46 (+1.27) | 26.11 (+1.10) |
| Random index (w: 5) | 21.77 (+0.58) | 25.32 (+0.32) |
| Random index (w: 20) | 22.40 (+1.21) | 25.54 (+0.53) |
| 2 Real evidence (w: 5,6) | 46.19 (+25.00) | 68.31 (+43.30) |
| 2 White noise (w: 3,10) | 22.89 (+1.70) | 26.35 (+1.34) |
| 1 Real + 1 Noise (w: 5,3) | 31.41 (+10.22) | 54.24 (+29.24) |

(a) 20 Newsgroups

| Evidence | ACC (%) | NMI (%) |
|---|---|---|
| Baseline | 41.12 | 32.72 |
| Real evidence (w: 4) | 43.34 (+2.22) | 36.24 (+3.52) |
| Real evidence (w: 10) | 48.27 (+7.15) | 41.23 (+8.51) |
| White noise (w: 3) | 41.42 (+0.30) | 32.77 (+0.05) |
| White noise (w: 10) | 41.38 (+0.26) | 32.74 (+0.03) |
| Random index (w: 4) | 41.37 (+0.25) | 32.82 (+0.10) |
| Random index (w: 10) | 41.38 (+0.26) | 32.68 (-0.03) |
| 2 Real evidence (w: 4,5) | 50.54 (+9.42) | 41.81 (+9.10) |
| 2 White noise (w: 3,10) | 41.16 (+0.04) | 32.65 (-0.06) |
| 1 Real + 1 Noise (w: 4,3) | 43.44 (+2.32) | 36.29 (+3.57) |

(b) REUTERS-100k

TABLE II: For our experiments in both 20 Newsgroups and REUTERS-100k, we report the average of 4 runs for each evidence configuration; w indicates the width of each evidence vector. For the 20 Newsgroups dataset, we used its natural structure as evidence: the 20 original labels are accompanied by a prefix that indicates a root category. We divided the labelset into “comp(uters).”, “rec(reational)”, “sci(ence)”, “talk” and “misc.” to produce real evidence of width 5. To produce real evidence of width 6, we divided the labelset into 6 classes, namely “sport”, “politics”, “religion”, “vehicles”, “systems” and “science”. Width 20 evidence corresponds to the labelset of the full dataset. For REUTERS-100k, we used the labelset of 103 categories in order to create a subset that contains 10 sub categories. Width 4 real evidence represents the 4 root categories, while width 5 real evidence is a simple re-categorization of the 10 labels into 5 groups (applying mod 5 to the label number). Width 10 corresponds to the labelset of 10 sub categories.
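The mod 5 re-categorization mentioned in the caption of Table II can be illustrated as follows. This is a sketch that assumes the 10 sub categories are numbered 0 to 9; the function name is hypothetical.

```python
def regroup_by_modulus(labels, modulus):
    """Coarsen integer labels into `modulus` groups via the remainder."""
    return [y % modulus for y in labels]

sub_category_labels = list(range(10))   # assumed numbering 0..9
width5_groups = regroup_by_modulus(sub_category_labels, 5)
```

Each of the five resulting groups contains exactly two of the original labels, so the evidence is consistent with the labelset without reflecting any semantic grouping.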
| Evidence | ACC (%) | NMI (%) |
|---|---|---|
| Baseline | 22.79 | 13.44 |
| 3 Real (w: 3,4,5) | 64.75 (+41.96) | 74.23 (+60.79) |
| 2 Real + 1 Noise (w: 3,4,3) | 53.04 (+30.26) | 61.74 (+48.30) |
| 1 Real + 2 Noise (w: 3,3,10) | 36.67 (+13.89) | 46.21 (+32.77) |
| 2 Real + 1 Noise (w: 3,5,3) | 60.56 (+37.77) | 71.39 (+57.94) |
| 1 Real + 2 Noise (w: 3,3,10) | 44.68 (+21.89) | 54.37 (+40.92) |
| 2 Real + 1 Noise (w: 4,5,3) | 63.42 (+40.63) | 77.16 (+63.72) |
| 1 Real + 2 Noise (w: 5,3,10) | 62.49 (+39.70) | 65.58 (+52.14) |
| 3 White Noise (w: 3,10,5) | 25.21 (+2.43) | 14.90 (+1.46) |

TABLE III: In this table we report our experiment of introducing three sources of evidence in the CIFAR-10 dataset. We report the average of 4 runs for each evidence configuration; w indicates the width of each evidence vector. We experimented with simple re-categorizations of the original labelset into smaller groups, in order to test the scalability of our solution for both the effectiveness and robustness criteria.

During the incremental manipulation of the initial latent space, latent representations are disentangled according to corresponding evidence. The joint minimization of the cross entropy and the reconstruction leads to the “tagging” of samples. In Figure 2 we showcase the data sample “tagging” performed by our framework. The “tagging” refers to data samples being reconstructed with added symbols that are consistent with the groupings of the evidence.

To evaluate the ability of a linear algorithm to distinguish between samples of different classes after the incremental manipulation, we use an SVM classifier (with linear kernel) to perform binary classification before and after the evidence transfer. Figure 4 showcases the ability of an SVM classifier to distinguish between two specific classes during the initial and incremental manipulation stages. Figure 3 visualizes both states of the latent space as a whole.

Reuters Effectiveness

Reuters consists of 4 root categories, each of which branches out to multiple sub categories, leading to 103 total categories. For our experiments, the primary task is to cluster 10 categories. The latent representations produced during the initial solution are naturally clustered into the original 4 root categories, with overlaps between the representations of each cluster. For that reason, we incrementally trained the initial solution by alternating between minimizing the cross entropy and the reconstruction objectives in each batch. The natural clustering of 4 clusters restricts the effectiveness of our framework during the 10 category clustering, yet we still satisfy both the effectiveness and robustness criteria.

Fig. 2: Reconstructed digits after introducing evidence of three groups. In this case the evidence introduced is a relation over the digit label that yields three groups. For visualization purposes, we move estimators after the reconstruction and use the Adam optimizer to clearly showcase the “tagging” of samples. For one group there is a common pattern of marking the top left corner of the digit. A pattern of drawing two dots on the left side of the digit is deployed for samples of a second group, and two shapes drawn to the right of the digit are deployed for the third.
(a) Initial state of latent space for CIFAR-10
(b) Latent space of CIFAR-10, after incremental manipulation using real evidence (w:3)
Fig. 3: We showcase the latent space of our primary autoencoder by transforming the latent representations into 2d samples, using t-SNE. Figure (a) depicts the initial state of the latent space for the CIFAR-10 dataset. Each latent sample is clustered around the mean. Figure (b) showcases the latent space after we introduced evidence of three classes. Incremental manipulation according to that evidence reforms the latent space into three distinct groups. They still preserve their initial structure, yet samples that indicate disentanglement are completely separated.
(a) Initial latent representations of classes “frog” and “automobile”
(b) Latent representations after introducing evidence that indicates disentanglement
(c) Initial latent representations of classes “frog” and “bird”
(d) Latent representations after introducing evidence that “frog” and “bird” should not be disentangled
Fig. 4: In this figure we highlight the separations among the latent representations by comparing pairs of labels. Straight lines indicate the decision boundary of an SVM classifier with linear kernel. For this experiment we used evidence that separates CIFAR-10 samples into 3 groups: vehicles, pets and wild animals. According to our evidence, “frog” and “automobile” belong in different groups and therefore their latent representations should reflect their relation. On the other hand, “frog” and “bird” are both “wild animals” and their latent representations should indicate their shared group.

IV Related Work

IV-A Deep Generative Networks

CrossingNets [18] is one of the frameworks that have considered combining external information in an unsupervised learning task. CrossingNets uses two different data sources for the task of hand pose estimation. It combines the two data sources by using a shared latent space. Although effective for the task of hand pose estimation, CrossingNets is a task specific configuration that does not propose a scalable way to handle multiple sources of evidence.

InfoGAN is another case of using external information in the unsupervised task of generation. InfoGAN uses structured latent variables in order to manipulate the generation process of a GAN. InfoGAN utilizes the Information Maximization [19] algorithm in order to disentangle latent representations. The structured latent variables are arbitrarily chosen by observing the dataset and are randomly sampled from known distributions. Such latent variables are considered low quality evidence in our case, since they do not provide any insight for the primary task.

IV-B Autoencoders

Adversarial Autoencoders (AAE) [6] assimilate external information into the latent space, using random samples. AAE uses external information by making assumptions about the prior distribution of the latent space. They depend on sampling from known distributions, such as the Normal or Categorical distribution, to manipulate the latent space of an autoencoder. As with InfoGAN, such types of evidence are considered low quality in our setting and serve a specific purpose in these frameworks. In addition, AAE makes assumptions regarding the structure of the latent space and uses the external information accordingly.

Multimodal datasets have been used in autoencoders in order to fill in missing sensor data [20]. The latent space of the multimodal autoencoder (MMAE) has assimilated all modalities of a mood prediction dataset and is able to reconstruct images even in cases where some modalities are missing. MMAE is a specific configuration designed to accommodate cases of missing sensor data, and it proposes a way of assimilating modalities of the same dataset.

To solve imbalance in classification samples, [21] proposed a dual autoencoder solution. These autoencoders use different activations of the same autoencoder to capture different aspects of the same information. Each different configuration can be treated as additional external information, in the same way as ensemble methods combine different task outcomes. Another dual autoencoder configuration learns representations used for recommendation [22]. These autoencoders focus on two different sources (namely item and user) in order to create a latent space that will be used for the recommendation task.

These methods make assumptions about the availability of the external information. Dual autoencoders, as proposed, require both datasets to be present throughout training and evaluation.

IV-C Semi-Supervised Learning

In semi-supervised learning the various configurations make use of both labeled and unlabeled data. The presence of labels or lack thereof usually refers directly to the labelset associated with the training dataset as well as the primary task. Semi-supervised learning configurations ([23], [24], [25]), Principled Hybrids of Generative and Discriminative models [26] and our framework share the notion of training generative models to also perform discriminative tasks. Nevertheless, in our framework the external categorical evidence is not guaranteed to be directly associated with the primary task.

IV-D Style transfer and Transfer Learning

During transfer learning a model pre-trained on a primary task is used after training in order to learn a new auxiliary task. It is not uncommon for transfer learning techniques to freeze the pre-trained layer weights and iteratively train the additional layers. In style transfer [27] freezing the layers of the network is the standard process. An initial random image is reconstructed so that it shares similar features with some content and style feature maps. Back-propagation computes gradients with respect to the reconstructed image and not the trainable parameters of the network. In addition, the variables informed in style transfer are continuous. Our framework is incrementally trained to learn latent representations that satisfy both the reconstruction and the cross entropy objectives. Through joint optimization of both objectives, we produce evidence-informed weights and biases. In our case, to improve the clustering outcome we transfer features of categorical variables (evidence) to the continuous variables of the latent representations.

V Conclusions and Future Work

In this paper we presented the evidence transfer method. Evidence transfer is a general method of combining external additional evidence to increase the performance of clustering tasks. It makes no assumptions about the relation of the evidence to the primary dataset or its availability. We introduced a set of guidelines to cope with the categorical evidence, as well as the evaluation criteria of a solution that exploits one or more pieces of external evidence to improve a primary task. Our evidence transfer approach manipulates the latent representation to be more linearly separable and therefore leads to increased performance. The effectiveness achieved scales with increasing number of pieces of evidence, while at the same time it is robust when introducing low quality or ineffective evidence.

We evaluated our proposed solution using evidence of different quantitative and qualitative properties; however, the effective evidence was directly related to our primary clustering task. Future work is directed to evaluating this method when there is a non-linear relationship between the evidence and the primary dataset. Although this case may be partially covered by using the latent representations of the additional evidence autoencoder, it remains to be validated experimentally. Additionally, evaluating the ability of evidence transfer to improve other unsupervised tasks such as generation is also considered for future work.

Satisfying the robustness criterion during the optimization was a challenging task. The choice of optimization algorithm proved to be crucial for satisfying both the effectiveness and robustness criteria: adaptive optimizers proved to disrupt the initial latent space during joint training with low quality evidence. These experiments call for further investigation of optimization techniques deployed on incremental training of multi-task learning models with conflicting or unrelated objectives.