The availability of large-scale labelled datasets has been credited as one of the reasons for the increased effectiveness of deep learning. In vision applications, a logarithmic relation between performance and the volume of training data has been observed [Sun2017]. However, data collection procedures such as acquisition, augmentation and labelling are considered to be a major bottleneck [Roh2018], since they often require human intervention. In some domains such as semantic detection [Socher2013], progress is held back by the lack of labelled resources, while other domains have been adapting to learning from limited labelled data [Tajbakhsh2019].
Augmenting the sample size of labelled data is one proposed way of dealing with insufficient labelled data. Labelsets can be augmented either in automated ways [Tang2018] or by crowdsourcing the labelling procedure [Servadei2018]. Labelset augmentation is often prone to introducing noise. Depending on the augmentation technique, noise can lead to issues such as uncertainty in the decision boundary [Reamaroon2019] or selective bias from disproportionate class sizes.
An alternative way of dealing with insufficient labelled data is to use weak labels [Fuentes-Hurtado2019]. Weak labels are usually less informative than ground truth labels. Although weak labels carry less information, acquiring them for unlabelled data samples is usually less complicated than acquiring ground truth labels. Nevertheless, depending on the nature of the application (e.g. real-time prediction), it is often unrealistic to assume that weak labels have been acquired for every data sample, which leads to incomplete weak labelsets.
Evidence transfer is a representation learning method that exploits external categorical evidence to manipulate the initial representations of an autoencoder [Davvetas2019]. Learning representations according to external categorical evidence can be considered as weakly supervised representation learning. External evidence consists of categorical variables that represent weak labels rather than ground truth labels. Additionally, the relation between the dataset and any evidence source is unknown and may introduce uncertainty.
Evidence transfer is effective when introduced with meaningful evidence sources and robust against low quality evidence. Its experimental evaluation includes introducing evidence sources individually, as well as combining external categorical evidence sources. Furthermore, it was evaluated against two types of low quality evidence: high uncertainty evidence (random values or white noise evidence) and non-corresponding real evidence (high certainty evidence randomly distributed among samples).
In this paper we investigate whether evidence transfer can be effective and robust against incomplete evidence. We consider two cases of incomplete evidence: uniformly missing samples and selectively missing samples. Depending on the level of incompleteness, selectively missing samples can be considered as low quality evidence, since they can introduce selective bias.
Our contributions are to:
Evaluate the effectiveness of evidence transfer in a weak supervision setting of missing corresponding samples, known as incomplete supervision
Evaluate the robustness of evidence transfer against two cases of incomplete evidence
Establish evidence transfer as a versatile weak supervision method that can be used to exploit both unknown and incomplete external categorical evidence
The rest of the paper is organised as follows. In Section II we introduce and discuss literature regarding weak supervision and its individual types. In Section III we briefly present the background of the evidence transfer method and introduce an effective and robust way of using incomplete evidence. We report and discuss the results of the experimental evaluation of incomplete evidence transfer in Section IV. In Section V we conclude our work and propose future directions for evidence transfer.
II Related Work
Using a limited amount of supervised information, otherwise known as weak supervision, is a difficult task to define, since a “limited” amount can be interpreted in multiple ways depending on the use case. Limited can mean containing noise, being incomplete or even being non-corresponding. The most frequent types of weak supervision are “Incomplete Supervision”, “Inexact Supervision” and “Inaccurate Supervision” [Zhou2017]. Incomplete supervision includes cases in which the amount of labelled data is disproportionate to the amount of available unlabelled data. Inexact supervision includes cases where label information is coarse-grained, while inaccurate supervision includes cases where labels include errors.
We investigate the ability of evidence transfer to be effective and robust in the incomplete supervision setting, similar to previous evaluations in the inaccurate supervision setting. Since evidence transfer does not operate on coarse-grained evidence, evaluating its performance in an inexact supervision setting is considered impractical.
The task of semi-supervised learning is a well known case of incomplete supervision. In this task, only a small subset of data samples has available ground truth labels. The dataset is divided into unlabelled and labelled data samples. The objective of most methods is to exploit both unlabelled and labelled data samples. Using a hybrid of generative and discriminative models has been a well-received method in semi-supervised tasks [Zhu2005, Lasserre2006]. Evidence transfer can also be considered as such a method, since it uses the layers of an autoencoder both for their generative properties and to discriminate between samples according to evidence. Multiple versions of autoencoders have been used for semi-supervised tasks, such as Adversarial Autoencoders [Makhzani2015], PixelGAN Autoencoders [Makhzani2017] or Variational Autoencoders [Kingma2014], [Narayanaswamy2017].
Other generative models such as Generative Adversarial Networks (GAN) [Goodfellow2014] have also been used in hybrid models. These range from variations of the original GAN, called Categorical GAN [Springenberg2015] and Semi-Supervised GAN [Odena2016], to adversarial inference models [Dumoulin2016], [Belghazi2018], the use of generative models as a prior [Jafari2019] and domain-specific GANs [Tu2018]. Generative adversarial networks are often evaluated and used in semi-supervised tasks.
Beyond hybrid methods, self-training methods based on meta learning and on distinguishing high-noise samples [Sun2019], [Roli2006], as well as semi-supervised learning based on graphs or kernels [He2007], [Liu2010], [Yang2016], [Goldberg2006], have also been proposed.
Semi-supervised learning is based on the notion that the available subset of labels corresponds to ground truth labels. Ground truth labels represent the class labels of the dataset. In our case, evidence transfer uses weak labels (external categorical evidence) of unknown relation to the dataset. Incomplete evidence transfer refers to incomplete correspondence between all data samples and weak labels.
Inaccurate supervision includes multiple scenarios of incorrect, non-corresponding or noisy label information. In realistic applications, label noise or uncertainty from mislabelled samples can be introduced to the labelset either during data acquisition or during automated labelling [Imoto2019], [Northcutt2019], [Brodley2011]. Other inaccurate supervision cases include noisy labels, regarding either value noise or semantic noise (e.g. fake news) [Wang2014], [Wang2019], [Shu2019], [Helmstetter2018], [Yao2019]. Lastly, intended biased class proportions [Li2019] or unintended biased class proportions [Wang2013] can also be considered as inaccurate supervision, since they introduce selective bias in the labelset.
Despite evidence transfer’s previous performance in the inaccurate supervision setting, where it remained robust against low quality evidence, incomplete evidence can also be considered as noisy. Selectively missing samples that can occur during acquisition of evidence can introduce selective bias, which can impact the outcome of evidence transfer.
III Evidence Transfer
Evidence transfer is a two step method. The first step is the initialisation step, followed by the evidence transfer step. During initialisation, an autoencoder is trained to acquire the baseline latent representations of a primary dataset. The autoencoder is trained as a generative model that learns latent representations which approximate the data generating distribution. During this initial representation learning, no labels are used (ground truth or weak), so the step is fully unsupervised.
After initialisation, external categorical evidence (weak labels in the form of external auxiliary task outcomes, not necessarily derived from or referring to the primary task) is used to manipulate the initial learned representations. When introduced with meaningful evidence, evidence transfer, through weak label discrimination, produces manipulated latent representations that are more linearly separable. Increased linear separation is an effect of conditioning the initial latent representations to represent the relation between primary data samples and external categorical evidence sources.
In contrast to other methods of representation learning that involve auxiliary variables, evidence transfer avoids the underlying assumption of the constant availability of the auxiliary variables, since in practice, auxiliary variables are either not guaranteed or we may observe the outcome of external processes without having explicit access to the corresponding dataset.
In the context of evidence transfer, any categorical variable or set of categorical variables can be considered as external evidence, as long as it satisfies the assumption that there is a relation between the primary dataset and the categorical variable. The term external or auxiliary refers to the fact that these categorical variables are not necessarily extracted from the primary dataset. They could be outcomes of an auxiliary task that could be performed on the primary dataset or on other unknown auxiliary datasets.
In order to deal with unknown external evidence, evidence transfer was designed to satisfy the following criteria:
Effectiveness: Evidence transfer should discover and utilise meaningful evidence to effectively manipulate the initial latent representations
Robustness: In case of low quality evidence, the evidence transfer step should maintain the initial latent representation quality
Modularity: Evidence transfer should be deployed as an incremental step since evidence availability is not guaranteed
III-B Incomplete Evidence Transfer
Let be the primary dataset for representation learning and the external categorical evidence. can either be a single set of auxiliary task outcomes or it may contain additional sources noted as . In the case of complete evidence, there is full correspondence between each data sample in and in each categorical evidence item in with . In other words, for each . However, during cases of incomplete evidence across evidence sources. The objective of incomplete evidence transfer is to learn latent representations of according to incomplete evidence which approximate the effectiveness of complete evidence transfer.
For consistency with incomplete supervision setting notation, let be the data samples with no corresponding evidence and with corresponding or additional .
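The split described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that missing evidence is marked with `None` and that evidence values are integer class ids; the helper name `split_by_evidence` is hypothetical and not part of the method's definition.

```python
from typing import List, Optional, Sequence, Tuple


def split_by_evidence(
    samples: Sequence[Sequence[float]],
    evidence: Sequence[Optional[int]],
) -> Tuple[List[int], List[int]]:
    """Return (indices with evidence, indices without evidence).

    Assumption: a missing evidence item is encoded as None.
    """
    if len(samples) != len(evidence):
        raise ValueError("samples and evidence must be aligned one-to-one")
    with_ev = [i for i, e in enumerate(evidence) if e is not None]
    without_ev = [i for i, e in enumerate(evidence) if e is None]
    return with_ev, without_ev


# Toy primary samples and an incomplete evidence source.
samples = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
evidence = [2, None, 0, None]
with_ev, without_ev = split_by_evidence(samples, evidence)
print(with_ev, without_ev)
```

All samples would take part in the unsupervised initialisation, while only the indices in `with_ev` have evidence available for the evidence transfer step.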
Denoising autoencoders are used for both phases of evidence transfer. During the initialisation step we train the autoencoder to reconstruct the data samples for all , after being corrupted. We use mean squared error between the reconstruction and the primary data samples as defined in Equation 1.
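As a rough illustration of the denoising reconstruction objective, the sketch below corrupts a sample and measures the mean squared error against the clean sample. The corruption type (masking noise), the `drop_prob` parameter and the identity stand-in for the autoencoder are assumptions made only for this example; the text does not specify them.

```python
import random
from typing import Callable, List, Sequence


def corrupt(x: Sequence[float], drop_prob: float, rng: random.Random) -> List[float]:
    """Masking noise (assumed): zero each feature independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def mse(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error between a clean sample and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def denoising_loss(
    x: Sequence[float],
    autoencoder: Callable[[Sequence[float]], Sequence[float]],
    drop_prob: float,
    rng: random.Random,
) -> float:
    """Reconstruct from the corrupted input, but score against the clean input."""
    x_tilde = corrupt(x, drop_prob, rng)
    return mse(x, autoencoder(x_tilde))


def identity_autoencoder(x: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a trained autoencoder: returns its input unchanged."""
    return list(x)


rng = random.Random(0)
sample = [1.0, 2.0, 3.0, 4.0]
print(denoising_loss(sample, identity_autoencoder, 0.0, rng))  # no corruption: 0.0
print(denoising_loss(sample, identity_autoencoder, 1.0, rng))  # everything masked: 7.5
```

The point of the sketch is that the target of the reconstruction is the uncorrupted sample, so a model that merely copies its input is penalised in proportion to the corruption.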
Since the relation between any external evidence source and the primary dataset is unknown, we refrain from using external evidence in its raw form. Instead, as an intermediate step between initialisation and evidence transfer, we extract latent features from an evidence autoencoder to ensure the robustness of the method. Low quality evidence can be divided into categorical variables with noisy values (e.g. random values, white noise, uniformly distributed values) and categorical variables that introduce decision boundary uncertainty, such as non-corresponding labels (e.g. one-hot categorical variable samples introduced in non-corresponding order). White noise evidence is easier to identify by observing the distribution properties of the evidence items.
In order to create a generic method against any type of evidence (including low quality evidence), we train an uninitialised biased autoencoder to reconstruct each evidence source in the evidence set. We bias the autoencoder by restricting its generalisation properties through training for a small number of iterations. Meaningful evidence is characterised by consistency, and therefore its distribution can be learned within a small number of iterations. However, low quality evidence is characterised by uncertainty or inconsistency, which leads to uniformly distributed latent representations. The objective of the biased evidence autoencoders is defined in Equation 2.
The evidence transfer step is then deployed using the latent representations acquired by the biased evidence autoencoder instead of raw values. In order to manipulate the initial latent representations of the primary dataset we use the cross entropy metric. The asymmetric computation of cross entropy allows the manipulation of the latent space according to external evidence by considering the evidence samples as the “true” distribution. The intermediate step of pre-processing the evidence samples, in combination with the cross entropy loss, ensures the robustness criterion of evidence transfer. Meaningful evidence can successfully manipulate the latent space since its representations produce declining values of cross entropy. At the same time, low quality evidence with high uncertainty representations produces high values of cross entropy.
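The asymmetry of cross entropy, and the high floor imposed by a uniform "true" distribution, can be checked with a small numerical sketch. The distributions below are made-up toy values, and the function name `cross_entropy` with its `eps` clipping parameter is an illustrative choice rather than part of the method.

```python
import math
from typing import Sequence


def cross_entropy(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """H(p, q) = -sum_k p_k * log(q_k), with p treated as the 'true' distribution."""
    return -sum(pk * math.log(max(qk, eps)) for pk, qk in zip(p, q))


# Toy values: a confident evidence representation and a less confident network output.
evidence = [0.9, 0.05, 0.05]
output = [0.4, 0.3, 0.3]

# Swapping the roles of the two distributions changes the value.
print(cross_entropy(evidence, output))
print(cross_entropy(output, evidence))

# With a uniform 'true' distribution over K outcomes, the cross entropy
# cannot fall below log(K), whatever the output is.
uniform = [1.0 / 3.0] * 3
print(cross_entropy(uniform, output), math.log(3))
```

Under these toy values, treating the evidence as the first argument makes the loss reward outputs that move towards the evidence, while a uniformly distributed evidence representation leaves the loss high regardless of the output.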
To reject evidence of low quality, we incorporate the representations acquired from the evidence autoencoder by using new additional uninitialised layers Q in our primary autoencoder. The cross entropy is computed between the output of the additional layers Q and the evidence representations, as defined in Equation 3. We cooperatively train the primary autoencoder to manipulate its latent representations by using both the original reconstruction objective and the mean cross entropy over all evidence sources, with all sources being treated equally. The objective of the evidence transfer step is defined in Equation 4. An algorithmic overview of incomplete evidence transfer is depicted in Figure 1, while Figure 2 depicts the artificial neural network configuration of evidence transfer. With low quality evidence, the cross entropy loss produces high values that gradually decay the weights of layers Q, allowing the reconstruction error to return the latent space to its initial version.
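A minimal sketch of how the joint objective might be assembled is given below. It assumes an unweighted sum of the reconstruction error and the mean cross entropy across evidence sources; any weighting terms, and the exact form of the referenced equations, are not recoverable from the text, so the function `evidence_transfer_loss` and its arguments are hypothetical.

```python
import math
from typing import Sequence


def mse(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared reconstruction error for one sample."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cross_entropy(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """H(p, q) with p (the evidence representation) as the 'true' distribution."""
    return -sum(pk * math.log(max(qk, eps)) for pk, qk in zip(p, q))


def evidence_transfer_loss(
    x: Sequence[float],
    x_hat: Sequence[float],
    evidence_reprs: Sequence[Sequence[float]],
    q_outputs: Sequence[Sequence[float]],
) -> float:
    """Reconstruction error plus the mean cross entropy over evidence sources.

    Assumption: both terms are summed without weights, and every evidence
    source contributes equally to the mean.
    """
    if len(evidence_reprs) != len(q_outputs) or not evidence_reprs:
        raise ValueError("need one Q output per evidence source")
    ce_terms = [cross_entropy(v, q) for v, q in zip(evidence_reprs, q_outputs)]
    return mse(x, x_hat) + sum(ce_terms) / len(ce_terms)


# One sample, its reconstruction, and two evidence sources (toy values).
x, x_hat = [1.0, 0.0], [0.5, 0.0]
evidence_reprs = [[1.0, 0.0], [0.0, 1.0]]
q_outputs = [[0.5, 0.5], [0.5, 0.5]]
loss = evidence_transfer_loss(x, x_hat, evidence_reprs, q_outputs)
print(loss)
```

In a real training loop the same quantity would be averaged over a batch and minimised with respect to the autoencoder and Q layer weights; the sketch only shows how the two terms combine for a single sample.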
Introducing new uninitialised layers Q additionally benefits evidence transfer by avoiding catastrophic forgetting. Evidence transfer belongs to the “joint optimization” methods that do not suffer from catastrophic forgetting [Li2018]. This means that, after training with all the data samples, explicitly training the evidence transfer objective with only the samples that have corresponding evidence will not restrict the generalisation of the learned representations for the samples without corresponding evidence. Instead, it can be considered as finetuning the pretrained layers to minimise the new objective of evidence transfer.
IV Evaluation and Results
IV-A Experimental setup
We evaluate incomplete evidence transfer on both image and text datasets. We use the MNIST dataset, which contains handwritten digits with 10 class labels from 0 to 9, as well as the CIFAR-10 dataset [cifar10], which contains RGB images depicting different vehicles or animals (e.g. airplane, horse, etc.). For the CIFAR-10 experiments we use features extracted from a VGG-16 [vgg] network pretrained on ImageNet [imagenet] instead of raw images.
Furthermore, we use the 20newsgroups dataset, which contains articles that can be classified into 20 news topics, as well as Reuters Corpus Volume I (RCV1) [Lewis2004], which contains articles that can be classified into 103 categories (4 root categories with additional sub-categories). To achieve consistency with the other three datasets, we created and used a subset of RCV1, Reuters-100k, with 10 categories (4 root categories plus 6 sub-categories) and 96,933 data samples. To train our models for the 20newsgroups dataset, we use features extracted from a word2vec model [Mikolov] pretrained on the Google News corpus. For the Reuters-100k subset we used TF-IDF features.
We simulated two types of incomplete evidence: sample percent incompleteness and class incompleteness. Sample percent incompleteness is simulated by uniformly removing a percentage of the complete evidence set. In other words, all evidence classes are still represented in the evidence set, but with fewer samples. Class incompleteness is simulated by removing a number of classes from the complete evidence set, i.e. removing all samples of a given class from the evidence set. Incomplete class evidence can be seen as a case where evidence is still being gathered and therefore some classes are missing. For all experiments we use incomplete yet corresponding real evidence. For all datasets we experiment with one and two sources of evidence. For CIFAR-10 we additionally experimented with three sources of evidence.
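The two simulation schemes can be sketched as follows. The sketch assumes that removed evidence items are marked with `None`, that uniform removal means sampling positions uniformly at random without stratifying by class, and that a fixed seed is used for repeatability; the function names `drop_uniform` and `drop_classes` are hypothetical.

```python
import random
from typing import Collection, List, Optional, Sequence


def drop_uniform(
    evidence: Sequence[int], missing_fraction: float, seed: int = 0
) -> List[Optional[int]]:
    """Sample percent incompleteness: remove a fraction of items uniformly at random."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_missing = round(len(evidence) * missing_fraction)
    missing = set(rng.sample(range(len(evidence)), n_missing))
    return [None if i in missing else e for i, e in enumerate(evidence)]


def drop_classes(
    evidence: Sequence[int], classes: Collection[int]
) -> List[Optional[int]]:
    """Class incompleteness: remove every item that belongs to the given classes."""
    return [None if e in classes else e for e in evidence]


# Toy evidence source with 3 balanced classes and 30 samples.
evidence = [i % 3 for i in range(30)]

half = drop_uniform(evidence, missing_fraction=0.5)
no_class_2 = drop_classes(evidence, classes={2})

print(sum(e is None for e in half))        # 15 items removed
print(sorted({e for e in no_class_2 if e is not None}))  # [0, 1]
```

With small evidence sets, purely random removal is not guaranteed to leave every class represented; a stratified variant would be needed to enforce that property exactly.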
We quantitatively evaluate the robustness and effectiveness of incomplete evidence transfer by measuring its performance on the task of clustering the primary dataset samples. We perform clustering before and after applying various stages of incomplete evidence in order to measure their distance from the baseline solution. The baseline solution represents clustering the latent representations acquired during the initialisation phase, i.e. before applying incomplete evidence transfer. The metrics that we use in our experiments are Unsupervised Clustering Accuracy (ACC) and Normalised Mutual Information (NMI).
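For readers unfamiliar with the two metrics, a small pure-Python sketch follows. It makes two assumptions that the text does not fix: ACC is computed by brute-force search over one-to-one cluster-to-label mappings (practical only for a handful of clusters; an assignment solver such as the Hungarian algorithm is the usual choice at scale), and NMI is normalised by the arithmetic mean of the two entropies.

```python
import math
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def clustering_accuracy(labels: Sequence[Hashable], clusters: Sequence[Hashable]) -> float:
    """ACC: best accuracy over one-to-one mappings from cluster ids to label ids."""
    label_ids = sorted(set(labels))
    cluster_ids = sorted(set(clusters))
    if len(cluster_ids) > len(label_ids):
        raise ValueError("this sketch assumes no more clusters than labels")
    counts = Counter(zip(clusters, labels))
    best = max(
        sum(counts[(c, l)] for c, l in zip(cluster_ids, perm))
        for perm in permutations(label_ids, len(cluster_ids))
    )
    return best / len(labels)


def entropy(assignments: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of an assignment vector."""
    n = len(assignments)
    return -sum((c / n) * math.log(c / n) for c in Counter(assignments).values())


def nmi(labels: Sequence[Hashable], clusters: Sequence[Hashable]) -> float:
    """NMI with arithmetic-mean normalisation (an assumed convention)."""
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    n_label, n_cluster = Counter(labels), Counter(clusters)
    mi = sum(
        (c / n) * math.log(c * n / (n_label[l] * n_cluster[k]))
        for (l, k), c in joint.items()
    )
    denom = (entropy(labels) + entropy(clusters)) / 2.0
    return mi / denom if denom > 0 else 1.0


labels = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]   # same partition, permuted cluster ids
merged = [0, 0, 0, 0, 1, 1]    # two classes merged into one cluster

print(clustering_accuracy(labels, perfect), nmi(labels, perfect))
print(clustering_accuracy(labels, merged), nmi(labels, merged))
```

Both metrics are invariant to how cluster ids are numbered, which is why they suit a setting where clusters are obtained without access to the ground truth labels.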
IV-B Discussion of results
As observed in Tables I, II, III and IV, introducing incomplete evidence in evidence transfer does not affect the performance metrics of the robustness or effectiveness criteria. Our experiments show that percent incomplete evidence is mostly effective. The effectiveness gain of percent incomplete evidence depends on the amount of missing samples. Therefore, using incomplete evidence that represents all classes always leads to a performance gain that is commensurate with the number of samples in each class.
On the other hand, class incomplete evidence (limited class proportions) in some experiments approaches the performance of low quality evidence. In experiments where the number of evidence classes is low (e.g. evidence with 3 classes), having no samples representing one or two classes heavily biases the generalisation performance of evidence transfer. In cases where the number of evidence classes is high (e.g. evidence with 10 classes), limited class proportions are not as heavily biasing and are equivalently effective.
Despite some cases of class incomplete evidence behaving as low quality evidence, there is no significant decrease in effectiveness from the baseline solution. From our experiments we can conclude that using unknown evidence, which may be incomplete either in the number of representative samples of each class or in class proportions, does not decrease the initial performance and may lead to a significant gain in effectiveness.
Experimental evaluations on MNIST (Table I) confirm the general notion of uniformly incomplete evidence being proportionally effective. The performance of selectively incomplete evidence relies on the total number of auxiliary task classes. The increase in performance becomes more significant as the total number of auxiliary task classes increases. Experimental evaluations on 20newsgroups (Table II) and CIFAR-10 (Table IV) are consistent with those on MNIST.
The evaluation with Reuters-100k (Table III) is fairly consistent with the other evaluations. However, with selectively incomplete evidence, explicit optimisation was required to achieve robustness, due to the intrinsic properties of the labelset. As mentioned in Subsection IV-A, the structure of the RCV1 labels is derived from 4 root categories. The same structure is preserved in our subset Reuters-100k, which uses 4 root categories along with 6 sub-categories. This intrinsic structure is prone to selective bias with selectively incomplete evidence, and therefore requires explicit optimisation of the evidence transfer objective.
V Conclusions and Future Work
In this paper we evaluated the effectiveness and robustness of evidence transfer in the weak supervision setting of incomplete supervision. Evidence transfer proved to be both effective and robust during experimental evaluation with two types of incomplete evidence, as well as with multiple sources of incomplete evidence. Incomplete evidence was simulated by both uniformly and selectively reducing the class proportion samples. From the conducted experiments we can conclude that evidence transfer works as an all-round weak supervision method for learning representations when primary dataset labels are lacking.
Although we tried to simulate realistic cases of weak supervision during experimental evaluation, there is a need to evaluate evidence transfer on a realistic use case scenario or domain-specific application. Even though evidence transfer proved to be fairly robust during all experiments, estimating noisy or uncertain evidence might help during optimisation and hyperparameter finetuning.
This work has been supported by the Industrial Scholarships program of Stavros Niarchos Foundation.