1 Introduction
Semi-supervised learning has become one of the most prevalent topics in image processing and computer vision research in recent years. With the ever-increasing availability of high-powered GPU hardware and the success of deep learning in applications such as computer vision [9, 12], speech recognition [7], and natural language processing [18, 6], the need for large-scale datasets to support these methods has become both a higher priority and a bottleneck to performance improvement. Semi-supervised processes have been applied successfully in many areas, such as image classification and segmentation [10], natural language processing and artificial intelligence [3]. Typically, semi-supervised deep learning uses novel model architectures, regularization methods or loss functions that combine outputs from known and unknown labels to produce more accurate predictions [8, 4]. Laine and Aila [13] utilise an architecture based on ensemble predictions, acquired during the training of a network at different epochs or under different regularization and input conditions. Tarvainen and Valpola [19] take the concept of temporal ensembling and extend it to the model weights. French et al. [5] extend this approach by introducing class balancing and confidence thresholding. Miyato et al. [14] consider a novel regularization method for semi-supervised learning.
The following method is an extension of [rob2019tip]. The method iteratively reclassifies a dataset such that the model being trained is only ever exposed to what it considers fully labelled data. The contributions are twofold. Firstly, a simple and easily implemented semi-supervised learning framework that is independent of model architecture and loss function, making it applicable to a wide range of classification tasks. Secondly, novel learned thresholding techniques and metrics that supervise the dataset growth, ensuring only confidently classed samples are added to the training dataset.
2 Methodology
The core assumption in this work is that generalization error decreases with more training samples, as shown by [1] and more recently [2]. To exploit this, the Iterative Learning Ensemble (ILE) approach is presented. The iterative nature of the ILE is given by the train, classify, analyse and update cycle. Firstly, a model $f_\theta$ is trained on a cleanly labelled dataset $\mathcal{D}_L$ and validated on the validation dataset $\mathcal{D}_V$. The training of the model is performed in a way appropriate to the application and task; neither the architecture nor the loss functions are changed in any way. Secondly, the unlabelled samples are classified and the process of updating the training set is run.
Let $x \in \mathbb{R}^d$ represent an input variable in $d$ dimensions and $y \in \{1, \dots, C\}$ represent the label associated with that sample, where $C$ is the number of possible class labels. In this work, $x$ represents an image and $y$ its label from one of $C$ classes. From the pool of cleanly labelled and unlabelled data, three datasets are constructed: labelled ($\mathcal{D}_L$), derived from a portion of the cleanly labelled data; unlabelled ($\mathcal{D}_U$), indexed from only unlabelled data; and validation ($\mathcal{D}_V$), derived from the remaining subset of the cleanly labelled data.
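In outline, the train, classify, analyse and update cycle can be sketched as follows. This is a minimal illustration, not the authors' implementation: `train`, `classify` and `confidence` are hypothetical stand-ins for the application-specific training, inference and confidence-scoring routines, and `threshold` is assumed to be given (the paper learns it from data).

```python
# Minimal sketch of the ILE cycle: retrain from scratch, score every
# unlabelled sample, and promote confident samples into the labelled set.
def ile_loop(labelled, unlabelled, train, classify, confidence,
             threshold, n_iterations):
    for _ in range(n_iterations):
        model = train(labelled)              # retrain from scratch on D_L
        if not unlabelled:
            break
        keep = []
        for x in unlabelled:
            y_hat = classify(model, x)       # predicted label
            if confidence(model, x) >= threshold:
                labelled.append((x, y_hat))  # promote to D_L with its label
            else:
                keep.append(x)               # remains in D_U for later rounds
        unlabelled = keep
    return labelled, unlabelled
```

The model is deliberately retrained from scratch each iteration, matching the reinitialization described later in this section.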
The primary issue when adding newly labelled samples to the training dataset is ensuring the model is confident that the additions are labelled correctly. This confidence is achieved in two ways: firstly, well-established ensembling techniques are utilised to produce better predictions from a trained model [13, 16, 19]; and secondly, a novel set of confidence metrics has been devised, based solely on the posterior probabilities produced by the model $f_\theta$. Importantly, no additional clustering or preprocessing steps of any kind are applied to the unlabelled data; the only assumption made in this work is that the unlabelled data is of a similar quality, context and application to the cleanly labelled data.
The goal of ensembling is to find the most positive class distribution for use with the confidence metrics. To this end, a number of augmentations are applied to an unlabelled sample $x$, such that $\hat{x}_m$ represents the single sample $x$ augmented in $m = 1, \dots, M$ different ways, each sharing the same (unknown) label $y$. The augmented samples, including the original, are passed to the model for inference and the posterior probability vectors, or class distributions, for all the augmented samples are returned:
$$P = \{\, p_m = f_\theta(\hat{x}_m) \mid m = 0, \dots, M \,\}, \qquad \hat{x}_0 = x. \qquad (1)$$
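The ensemble inference of Eq. 1 can be sketched as below; `model` and `augmentations` are hypothetical stand-ins for the trained network and the augmentation functions, and the model is assumed to return a posterior class distribution per input.

```python
import numpy as np

# Sketch of the augmentation ensemble (Eq. 1): one unlabelled sample is
# augmented M ways, every variant (plus the original) is passed through
# the trained model, and the per-variant posteriors are stacked.
def ensemble_posteriors(model, x, augmentations):
    variants = [x] + [aug(x) for aug in augmentations]  # x_hat_0 = x
    return np.stack([model(v) for v in variants])       # shape (M + 1, C)
```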
The returned posterior probability vectors are then scaled by the similarity of the class distributions returned as a result of Eq. 1. The standard deviation of the posterior probabilities for each class label across the augmented samples is subtracted from each posterior vector as a form of scaling:
$$\tilde{p}_{m,c} = p_{m,c} - \sigma_c, \qquad \sigma_c = \operatorname{std}_m\!\left(p_{m,c}\right). \qquad (2)$$
Augmented ensembles whose evaluated class distributions differ greatly result in a larger standard deviation; in turn, this penalises the final confidence score more than when the model produces similar distributions across the ensemble. Finally, the augmented sample with the highest posterior probability for any class label is selected, and its original, unscaled class distribution is used as input for the confidence metrics given in Eqs. 4, 5 and 6.
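The scaling and selection steps above can be sketched as follows; this is an illustrative reading under the stated definitions, with `posteriors` holding the $(M+1) \times C$ output of the ensemble.

```python
import numpy as np

# Sketch of Eq. 2 and the selection step: the per-class standard deviation
# across the ensemble is subtracted from every posterior vector, penalising
# ensembles that disagree; the variant with the highest resulting activation
# is chosen and its original, unscaled distribution is returned.
def select_best(posteriors):
    sigma = posteriors.std(axis=0)             # per-class std across variants
    scaled = posteriors - sigma                # Eq. 2
    best = int(np.argmax(scaled.max(axis=1)))  # variant with highest activation
    return posteriors[best]                    # unscaled distribution
```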
The confidence metrics used to further ensure unlabelled samples are correctly labelled cover three distinct areas, each computed from the posterior probabilities after evaluation of an unlabelled sample $x \in \mathcal{D}_U$ (see supplementary material for visual representations of these concepts). The first is the single highest class activation obtained from the posterior distribution (higher is better). Formally, for an unlabelled sample $x$, let
$$y_1 = \arg\max_{c} p(c \mid x), \qquad y_2 = \arg\max_{c \neq y_1} p(c \mid x), \qquad (3)$$
where $y_1$ and $y_2$ are the labels corresponding to the first and second highest posterior probabilities. The first metric $S_1$ is then
$$S_1(x) = p(y_1 \mid x). \qquad (4)$$
Figure 1: (Left) a) Iter 1: A, B, C and D are classified as red or blue using proximity to the respective manifolds, while E and F remain unclassified (confidence is not high enough). b) Iter 2: after retraining the model with the new samples A-D, confidence becomes sufficient to classify E and F and add them to the new training set. c) Iter 3: the manifolds are updated with the samples E and F. (Right) t-SNE of the last fully connected layer (1024 neurons) of JFNet2 when evaluating the SVHN validation dataset. d) Clusters with the model trained on the initial 1,000 samples. e) Clusters after ILE has been run for 75 iterations, increasing the training dataset size and improving classification accuracy.
Second, the difference between the highest and second highest activations (a larger difference is better) is computed according to Eq. 5:
$$S_2(x) = p(y_1 \mid x) - p(y_2 \mid x). \qquad (5)$$
Lastly, $S_3$ is calculated as the Euclidean distance between the posterior distribution for the unlabelled sample and the average distribution $\bar{p}_{y_1}$ for the predicted class (a lower score is better). $\bar{p}_c$ is computed over all training samples of class $c$; these average distributions per class are computed at the end of each model training iteration and recorded for use in these confidence computations:
$$S_3(x) = \left\lVert p(\cdot \mid x) - \bar{p}_{y_1} \right\rVert_2. \qquad (6)$$
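The three metrics can be sketched together as below; `class_means[c]` stands in for the recorded average training posterior of class $c$, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

# Sketch of the three confidence metrics (Eqs. 4-6) for one posterior p.
def confidence_metrics(p, class_means):
    order = np.argsort(p)[::-1]
    y1, y2 = int(order[0]), int(order[1])      # top-two predicted classes
    s1 = p[y1]                                 # Eq. 4: highest activation
    s2 = p[y1] - p[y2]                         # Eq. 5: top-two activation gap
    s3 = np.linalg.norm(p - class_means[y1])   # Eq. 6: distance to class mean
    return s1, s2, s3
```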
For each of these three metrics a value is returned: in the cases of $S_1$ and $S_2$ the value should be high, while for $S_3$ the distance between the two posterior probability distributions should be low; the $S_3$ scores are therefore inverted so as to have a uniform, higher-is-better policy. The weighted sum of these metric scores is then used to provide a final confidence score for a specific unlabelled sample $x$. As some metrics are more informative than others, their contribution to the final confidence should reflect this. The weighting is found experimentally but is rooted in the accuracy of each metric on a set of unlabelled samples. Importantly, these weights may change with the application, as certain metrics may be more informative in different problems.

Table 1: Error rates (%) on the benchmark subset, with ILE run from that subset, and on the full training set. Brackets in the benchmark columns give the standard deviation; brackets in the ILE column give the improvement over the subset benchmark; the final column gives the number of unlabelled samples added and the accuracy of their assigned labels.

SVHN
Model            | 1k Benchmark  | 1k Samples (Improvement) | Full Benchmark | Added Samples (Acc. %)
GAN [17]         | N/A           | 8.11%                    | N/A            |
Π model [13]     | N/A           | 4.82%                    | 2.54% (0.04)   |
Temporal E. [13] | N/A           | 4.42%                    | 2.74% (0.06)   |
VAT+EntMin [14]  | N/A           | 3.86%                    | N/A            |
ResNet18 (ILE)   | 19.74% (0.32) | 4.29% (15.45)            | 2.98% (0.04)   | 71,068 (94.89%)
LeNet5 (ILE)     | 25.24% (1.55) | 11.11% (14.13)           | 7.16% (0.09)   | 42,999 (96.86%)
JFNet (ILE)      | 20.18% (0.50) | 5.64% (14.54)            | 3.84% (0.05)   | 66,421 (96.13%)

CIFAR100
Model            | 5k Benchmark  | 5k Samples (Improvement) | Full Benchmark | Added Samples (Acc. %)
Temporal E. [13] | N/A           | 38.65% (10k samples)     | N/A            |
ResNet18 (ILE)   | 32.49% (0.45) | 28.09% (4.4)             | 17.53% (0.09)  | 42,526 (75.1%)
LeNet5 (ILE)     | 89.21% (0.22) | 87.47% (1.74)            | 65.55% (0.38)  | 375 (72.53%)
JFNet (ILE)      | 39.66% (0.22) | 66.49% (1.36)            | 39.66% (0.22)  | 4,786 (73.21%)

Tiny ImageNet
Model            | 10k Benchmark | 10k Samples (Improvement) | Full Benchmark | Added Samples (Acc. %)
ResNet18 (ILE)   | 37.47% (0.46) | 33.68% (3.79)             | 27.38% (0.15)  | 56,619 (81.37%)
LeNet5 (ILE)     | 95.48% (0.43) | 94.43% (1.05)             | 81.58% (0.27)  | 69 (43.49%)
JFNet (ILE)      | 83.40% (0.12) | 81.61% (1.79)             | 60.98% (0.25)  | 684 (83.19%)
$$S(x) = w_1 S_1(x) + w_2 S_2(x) + w_3 \left(1 - S_3(x)\right). \qquad (7)$$
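As a minimal sketch of the weighted combination of Eq. 7, with the distance metric inverted so that higher is uniformly better (the default weights here are placeholders; in the paper they are found experimentally and are application dependent):

```python
# Sketch of the final confidence score (Eq. 7).
def confidence_score(s1, s2, s3, w=(1.0, 1.0, 1.0)):
    return w[0] * s1 + w[1] * s2 + w[2] * (1.0 - s3)
```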
Using a defined threshold $t$, samples can now be approved for inclusion in the labelled dataset $\mathcal{D}_L$, which is updated for use in the next training iteration. The threshold could be defined manually, allowing for policies where only very confidently analysed samples are added or, through the use of a lower threshold, a more "quantity over quality" policy. In this work the threshold value is learned: a process is run to find a threshold $t$ which, when applied, would add samples to $\mathcal{D}_L$ with a defined accuracy $\alpha$, i.e. a threshold whereby, for example, 99% of samples added to $\mathcal{D}_L$ are correctly labelled. The process is run using only cleanly labelled data. The function $A(\mathcal{D}, f_\theta, t)$ calculates the percentage of correctly labelled samples in a dataset $\mathcal{D}$ classified by model $f_\theta$ against their ground-truth labels $y$, given the threshold $t$:
$$A(\mathcal{D}, f_\theta, t) = \frac{\left|\{\, x \in \mathcal{D} \mid S(x) \geq t,\ y_1 = y \,\}\right|}{\left|\{\, x \in \mathcal{D} \mid S(x) \geq t \,\}\right|}. \qquad (8)$$
Therefore, given the required addition accuracy $\alpha$, the corresponding threshold $t$ can be calculated:
$$t = \min \left\{\, t' \mid A(\mathcal{D}_L, f_\theta, t') \geq \alpha \,\right\}. \qquad (9)$$
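The threshold-learning step can be sketched as below. This is one plausible reading of Eqs. 8-9 under the stated definitions, not the authors' code: `accuracy_at` implements Eq. 8 over a set of cleanly labelled samples, and the final line picks the loosest candidate threshold that still meets the required addition accuracy; all names are illustrative.

```python
# Sketch of learning the threshold t from cleanly labelled data (Eqs. 8-9),
# given per-sample confidence scores, predicted labels and ground truth.
def learn_threshold(scores, predicted, truth, alpha, candidates):
    def accuracy_at(t):  # Eq. 8
        kept = [(p, y) for s, p, y in zip(scores, predicted, truth) if s >= t]
        if not kept:
            return 1.0   # no samples admitted, vacuously accurate
        return sum(p == y for p, y in kept) / len(kept)
    admissible = [t for t in candidates if accuracy_at(t) >= alpha]
    return min(admissible) if admissible else max(candidates)
```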
Accuracy $\alpha$ was set to 99%. This process is run once on the training data, as the model will be most confident on samples it has already seen and will, as a result, impose a higher threshold than one defined using the validation set. During these incremental updates the model is trained on an ever-growing dataset. The dataset volume increases through the addition of unlabelled samples which the model has confidently identified as belonging to a respective class (i.e. $S(x) \geq t$). As a result, the model develops its knowledge of specific classes and is therefore better able to identify additional samples in later iterations. This process is shown symbolically in Figure 1 a-c, whereby a subset of new, unlabelled samples is projected closer to the existing manifolds due to already-learned characteristics of the respective classes. Figure 1 d-e shows a real-world example of the effect ILE has had on the JFNet2 model's class manifolds. Additionally, as the model is reinitialized at the beginning of each iteration, this method can leverage randomly initialised weights to help with the classification of unlabelled samples.
3 Results and Conclusions
SVHN [15] is used for benchmarking and, to better validate the performance of this iterative approach on more challenging tasks, CIFAR100 [11] and a 200-class subset of ImageNet known as Tiny ImageNet are also used. Initially, benchmarks are run for each of the three models on the three datasets. Table 1 (columns 1 & 3) outlines the benchmark error rates for each model architecture on both a subset of the training data and the full training set. The training subset size is based on 50 samples per class: CIFAR100 uses 5,000 samples and Tiny ImageNet uses 10,000 samples. As SVHN is one of the most commonly used datasets for comparing semi-supervised learning techniques, the standard 1,000 samples is used (100 samples from each of the 10 classes). Each training subset is made up of an even distribution of classes, with images from each class chosen at random. Each experiment was conducted four times, with the average results presented and the standard deviation given in brackets. The inclusion of these benchmarks is vital, especially for any result that uses a customised loss function or architecture, as without them it is difficult to ascertain whether improvement gains can be attributed to the model architecture or to the semi-supervised method. As demonstrated, the simple iterative approach to semi-supervised learning, ILE, has a number of benefits, most notably state-of-the-art error rates on the CIFAR100 dataset and near state-of-the-art on SVHN, achieved with no changes to the training methods, loss functions or model architectures used. The ILE demonstrates, through the application of novel confidence metrics, the ability of a model to leverage its own confidence scores to improve classification accuracy.
4 Acknowledgment
This work is co-funded by NATO within the WITNESS project under grant agreement number G5437. The Titan X Pascal used for this research was donated by NVIDIA.
References

[1] A. R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115–133, 1994.
[2] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241–6250, 2017.
[3] A. Carlson, J. Betteridge, and B. Kisiel. Toward an architecture for never-ending language learning. In Conference on Artificial Intelligence (AAAI), pages 1306–1313, 2010.
[4] S. Cicek, A. Fawzi, and S. Soatto. SaaS: Speed as a supervisor for semi-supervised learning. arXiv preprint arXiv:1805.00980, 2018.
[5] G. French, M. Mackiewicz, and M. Fisher. Self-ensembling for visual domain adaptation. In International Conference on Learning Representations, 2018.
[6] Y. Goldberg. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309, 2017.
[7] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. JMLR Workshop and Conference Proceedings, 32(1):1764–1772, 2014.
[8] P. Haeusser, A. Mordvintsev, and D. Cremers. Learning by association - a versatile semi-supervised training method for neural networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[10] S. Hong, H. Noh, and B. Han. Decoupled deep neural network for semi-supervised semantic segmentation. Advances in Neural Information Processing Systems, 28:1495–1503, 2015.
[11] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[13] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, pages 1–13, 2017.
[14] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. pages 1–14, 2017.
[15] Y. Netzer and T. Wang. Reading digits in natural images with unsupervised feature learning. In Neural Information Processing Systems (NIPS) Workshop, 2011.
[16] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.
[17] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 1–9, 2016.
[18] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[19] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, 2017.