1 Introduction
The modern world generates vast amounts of data and provides many opportunities to exploit it. However, this data is frequently complex, noisy, and lacking in obvious structure, so explicitly modeling, for example, its distribution is too challenging for a human agent. On the other hand, a human can specify an explicit procedure, i.e., an algorithm, for constructing such a model. Machine learning (ML) is concerned with algorithms that enable computers to learn from data in this way, especially algorithms for prediction. Many ML algorithms need labeled data for this task, but it is common for fewer labeled data to be available than unlabeled data, because manual labeling is costly and time-consuming. Hence, there is an ever-growing need for ML methods that work with a limited amount of labeled data and also make efficient use of the side information available from unlabeled data. Algorithms designed to do so are known as semi-supervised learning algorithms.
Supervised learning algorithms employ labeled data to accurately predict class labels for unlabeled examples. Unsupervised learning algorithms search for structure in data, which can then be used as a heuristic to infer labels for these examples, on the basis of assumptions about the structure of the data. Semi-supervised learning (SSL) algorithms lie between supervised and unsupervised learning. SSL methods are designed to work with labeled instances $\{(x_i, y_i)\}_{i=1}^{l}$ and unlabeled instances $\{x_j\}_{j=l+1}^{l+u}$, where $x \in X$ and $y \in Y$ relate to an input space $X$ and an output space $Y = \{1, \dots, c\}$, with $c$ being the number of classes. Usually, these methods assume a much smaller number of labeled instances than unlabeled ones, i.e., $l \ll u$, because unlabeled instances are most useful when only a few labeled instances are available. SSL has proven to be useful especially when dealing with anti-causal or confounded problems [15]. Without making any assumptions about how the inputs and outputs are related, it is impossible to justify semi-supervised learning as a principled approach [4]. Like the authors of that work, we make the same three assumptions:

1. If two points are close in a high-density region, then their corresponding outputs should also be close.

2. If points are in the same structure (referred to as a cluster or manifold), they are likely to be of the same class.

3. The decision boundary between classes should lie in a low-density region of the input space.
In this work, we consider a new training method designed to be used with deep neural networks in the semi-supervised learning setting. Instead of the usual approach of learning a direct classification model based on the cross-entropy loss, we use the labeled examples to learn a similarity function between instances, such that instances of the same class are considered similar and instances belonging to different classes are considered dissimilar. Under this similarity function, which is parameterized by a neural network, the features (embeddings) of the labeled examples are grouped together in Euclidean space according to the class labels. We then use these learned embeddings to assign class labels to unlabeled examples with a simple nearest-neighbor classifier. Following that, confident predictions for unlabeled instances are added to the labeled examples, and the neural network is retrained iteratively. In this way, we are able to achieve significant performance improvements over supervised-only training.
2 Related Work
Semi-supervised learning has been under study since the 1970s [12]. Expectation-Maximization (EM) [14] works by labeling unlabeled instances with the current supervised model's best prediction in an iterative fashion (self-learning), thereby providing more training instances for the supervised learning algorithm. Co-training [1] is a similar approach, where two models are trained on two separate subsets of the data features; confident predictions from one model are then used as labeled data for the other model. Co-EM [2] combines co-training with EM and achieved better results than either of them. Another, graph-based SSL method, LLGC (Learning with Local and Global Consistency) [22], works by propagating labels from labeled to unlabeled instances until the labels are stable, maintaining local and global consistency. There is also a substantial amount of literature on SSL techniques using deep neural networks, based on autoencoders [16, 11], generative adversarial networks (GANs) [18, 6, 20], and regularization [9, 17, 13]. The pseudo-label approach [10] is a deep learning version of self-learning with an extra loss from regularization and the reconstruction of a denoising autoencoder.
Our method builds on work investigating similarity metric learning using neural networks. The authors of [5] used a network with the contrastive loss for face verification in a supervised fashion. [19] suggested training the network on triplets of examples; this work was extended to the semi-supervised paradigm [21] for the image classification task. [7] minimizes the sum of the cross-entropy loss and a ratio loss between class indicators (sampled from the labeled examples of each class) and the intra-class distances of instances, calculated on the embeddings.
We train our network on triplets of images and use the triplet margin loss [19], which we found to perform better than the contrastive loss or the ratio loss in our experiments; the network is trained in a self-learning fashion. To improve intermediate predictions, we use LLGC [22] to obtain better labels for the unlabeled instances in subsequent iterations. Although triplet networks and LLGC are not new, this is, to our knowledge, the first attempt to combine these two approaches for semi-supervised learning.
3 Siamese Networks
Siamese networks [3] are neural networks that are particularly efficient when there is a large number of classes and only a few labeled instances per class. A Siamese network can be thought of as multiple copies of the same function with identical weights. Siamese networks can be employed for training a similarity function given labeled data. Fig. 1 shows a simple network architecture based on convolutional (CONV) and max-pooling (MP) layers; an input example is passed to the network to compute its embedding.
Different losses are used for training Siamese networks, such as the contrastive loss, margin-based losses, and the triplet loss. The network parameters are updated according to the loss calculated on the embeddings.
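In code, weight sharing simply means applying one embedding function, with one set of parameters, to every input. The following minimal numpy sketch illustrates this, with a hypothetical single linear map standing in for the convolutional embedding network described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of weights shared by every "branch" of the Siamese network.
# A single linear layer is a stand-in here; the paper uses a small
# convolutional network instead.
W = rng.normal(size=(4, 2))  # maps 4-d inputs to 2-d embeddings

def embed(x):
    """Shared embedding function f(x); identical weights for every input."""
    return x @ W

x1 = rng.normal(size=(4,))
x2 = rng.normal(size=(4,))

# Both inputs pass through the same function, so their embeddings are
# directly comparable under the Euclidean distance.
distance = np.linalg.norm(embed(x1) - embed(x2))
```

Because the weights are shared, the distance between two embeddings depends only on the inputs, not on which branch processed them.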
3.1 Triplet Loss
The triplet loss [19] has been used for face recognition. A triplet's anchor example $x^a$, positive example $x^p$, and negative example $x^n$ are provided as a training example to the network to obtain the corresponding embeddings. During optimisation of the network parameters, we draw all possible triplets from the labeled examples based on the class labels. For each mini-batch used in stochastic gradient descent, all valid triplets $(x^a_i, x^p_i, x^n_i)$ are selected, where $y^a_i = y^p_i$ and $y^a_i \neq y^n_i$. Then the loss is calculated according to the following equation, using the Euclidean distance between the embedded examples:

$$L = \sum_i \max\left( \|f(x^a_i) - f(x^p_i)\|^2 - \|f(x^a_i) - f(x^n_i)\|^2 + \alpha,\; 0 \right) \qquad (1)$$

where $\alpha$ is the so-called "margin" and constitutes a hyperparameter.
As illustrated in Fig. 2, the triplet loss attempts to push the embedded negative example away from the embedded anchor example, based on the given margin $\alpha$ and the given positive example. Depending on the location of the negative example with respect to the anchor and the positive example, it is possible to distinguish between hard negative examples, semi-hard negative examples, and easy negative examples. The latter are effectively ignored during optimisation because they yield a loss of zero.
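The distinction between hard, semi-hard, and easy negatives follows directly from the loss. A small numpy sketch (hypothetical embedding values, unit margin) computes the loss of a single triplet for one negative of each kind:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=1.0):
    """Triplet margin loss on the embeddings of anchor, positive, negative."""
    d_ap = np.sum((f_a - f_p) ** 2)  # squared distance anchor-positive
    d_an = np.sum((f_a - f_n) ** 2)  # squared distance anchor-negative
    return max(d_ap - d_an + margin, 0.0)

f_a = np.array([0.0, 0.0])
f_p = np.array([1.0, 0.0])          # d(a, p)^2 = 1

hard     = np.array([0.5, 0.0])     # closer to the anchor than the positive
semihard = np.array([1.2, 0.0])     # farther than the positive, within margin
easy     = np.array([3.0, 0.0])     # beyond the margin

loss_hard = triplet_loss(f_a, f_p, hard)          # > margin: strongly penalised
loss_semihard = triplet_loss(f_a, f_p, semihard)  # in (0, margin): penalised
loss_easy = triplet_loss(f_a, f_p, easy)          # 0.0: ignored by the optimiser
```

Only the hard and semi-hard negatives produce a gradient; the easy negative contributes nothing, which is why such triplets are effectively ignored.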
3.2 Self-learning using Siamese networks
In the first iteration of our semi-supervised learning approach, to be able to label (some of) the unlabeled instances, the Siamese network is trained on the labeled examples only, using the triplet loss. Then a standard nearest-neighbor classifier is used to predict labels for the unlabeled examples, and a fixed percentage of unlabeled examples is chosen based on their distance to the labeled instances and added to the set of labeled examples for the next iteration. Throughout, the embedded data is used to calculate distances. For more details, see the pseudocode in Listing 1.
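One meta-iteration of this self-learning loop can be sketched as follows (a simplified numpy version with hypothetical helper names; the retraining of the Siamese network between iterations is omitted, and the smallest nearest-neighbor distance serves as the confidence measure):

```python
import numpy as np

def nearest_neighbor_labels(emb_l, y_l, emb_u):
    """Label each unlabeled embedding with its nearest labeled neighbor's
    class; also return that nearest distance (smaller = more confident)."""
    # Pairwise distances, shape (num_unlabeled, num_labeled).
    d = np.linalg.norm(emb_u[:, None, :] - emb_l[None, :, :], axis=-1)
    nn = d.argmin(axis=1)
    return y_l[nn], d.min(axis=1)

def self_learning_step(emb_l, y_l, emb_u, select_frac=0.2):
    """Pseudo-label the unlabeled data, keep the closest select_frac of
    them, and move them into the labeled pool for the next iteration."""
    y_pred, dist = nearest_neighbor_labels(emb_l, y_l, emb_u)
    k = max(1, int(select_frac * len(emb_u)))
    chosen = np.argsort(dist)[:k]   # most confident = smallest distance
    rest = np.setdiff1d(np.arange(len(emb_u)), chosen)
    new_emb_l = np.vstack([emb_l, emb_u[chosen]])
    new_y_l = np.concatenate([y_l, y_pred[chosen]])
    return new_emb_l, new_y_l, emb_u[rest]

# Toy example: two labeled cluster centers, three unlabeled points.
emb_l = np.array([[0.0, 0.0], [10.0, 10.0]])
y_l = np.array([0, 1])
emb_u = np.array([[0.5, 0.0], [9.5, 10.0], [5.0, 5.0]])
emb_l2, y_l2, emb_u2 = self_learning_step(emb_l, y_l, emb_u, select_frac=0.67)
```

In the toy example, the two points close to a cluster center are confidently pseudo-labeled and absorbed into the labeled pool, while the ambiguous mid-point remains unlabeled for a later iteration.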
4 Local Learning with Global Consistency (LLGC)
We also investigate learning with local and global consistency [22] in addition to the nearest-neighbor classifier. LLGC works by propagating label information to the neighbors of an example; the goal of LLGC is to predict labels for unlabeled instances. The algorithm initializes a matrix $Y$ to represent the label information, where $Y_{ij} = 1$ if example $x_i$ is labeled with class $j$, and $Y_{ij} = 0$ otherwise. We implement a small variation here for the unlabeled examples: instead of setting $Y_{ij} = 0$ for all $j$ when $x_i$ is unlabeled, we use the predicted labels obtained with the nearest-neighbor classifier after training the Siamese network.
LLGC is based on calculating an adjacency matrix $W$. This adjacency matrix is then used to establish a matrix $S$ that is applied to update the label probabilities for the unlabeled examples. The adjacency matrix is calculated using Eq. 2 by employing the embeddings $f(x_i)$ and $f(x_j)$ for each pair of examples $x_i$ and $x_j$, obtained from the Siamese network. The parameter $\sigma$ is a hyperparameter.

$$W_{ij} = \exp\left(-\frac{\|f(x_i) - f(x_j)\|^2}{2\sigma^2}\right) \text{ for } i \neq j, \qquad W_{ii} = 0 \qquad (2)$$

The matrix $S$ is computed as:

$$S = D^{-1/2} W D^{-1/2} \qquad (3)$$

where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. The initial matrix of label probabilities is set to $F(0) = Y$, and the probabilities are updated by:

$$F(t+1) = \alpha S F(t) + (1 - \alpha) Y \qquad (4)$$

where $\alpha$ is a hyperparameter for controlling the propagation of label information. The above operation is repeated until convergence to $F^*$. Finally, labels for the unlabeled instances are calculated as:

$$y_i = \arg\max_{j} F^*_{ij} \qquad (5)$$
To use the unlabeled instances efficiently, the Siamese network is first trained on labeled examples only, using the triplet loss. Then the nearest-neighbor classifier is used to predict labels for the unlabeled examples. Following that, the labeled and unlabeled embeddings, along with the labels, are passed to LLGC. After a certain number of iterations of LLGC, a fixed percentage of unlabeled examples is chosen based on their LLGC score and added to the labeled examples for the next iteration. For more details, see the pseudocode in Listing 2.
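The label-propagation step described by Eqs. 2 to 5 can be sketched in a few lines of numpy. The example below is a simplified illustration (hypothetical function name, a fixed iteration count in place of a convergence check) that propagates two seed labels across two small clusters of embeddings:

```python
import numpy as np

def llgc(emb, Y0, sigma=1.0, alpha=0.9, n_iter=50):
    """Label propagation: build the affinity matrix W (Eq. 2), normalise it
    symmetrically to S (Eq. 3), iterate F <- alpha*S*F + (1-alpha)*Y (Eq. 4),
    and read off the labels via argmax (Eq. 5)."""
    d2 = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                         # W_ii = 0
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]  # D^{-1/2} W D^{-1/2}
    F = Y0.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y0
    return F.argmax(axis=1)

# Two tight clusters; one labeled point per cluster.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Y0 = np.zeros((4, 2))
Y0[0, 0] = 1.0   # first point labeled class 0
Y0[2, 1] = 1.0   # third point labeled class 1
labels = llgc(emb, Y0)
```

The two unlabeled points inherit the label of their nearby seed, since the affinity between the well-separated clusters is negligible for this $\sigma$.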
5 Experiments
We consider four standard image classification problems for our evaluation. For all experiments, a small subset of labeled examples was chosen according to standard semi-supervised learning practice, with a balanced number of examples from each class, and the rest were treated as unlabeled. Final accuracy was calculated on the standard test split for each dataset. No data augmentation was applied to the training sets. Siamese networks were trained using the triplet loss, with the same margin for all datasets.
A simple convolutional network architecture was chosen for each dataset to ensure that the performance achieved was due to the proposed method and not the network architecture. For more details about the network architectures, see Table 1. Layer descriptions use (feature maps, kernel size, stride, padding) for convolutional layers and (pool size, stride) for pooling layers. The simple model is used for MNIST, Fashion MNIST, and SVHN, and produces 16-dimensional embeddings, while the CIFAR10 model produces 64-dimensional embeddings. We trained the networks using mini-batch sizes of 50, 100, and 200. We found that batch size 50 was insufficient and that 200 did not yield significant improvements compared to batch size 100; a batch size of 100 is therefore used for all experiments, with Adam [8] as the optimizer for updating the network parameters for 200 epochs. Our proposed approaches, Siamese self-training (Algorithm 1) and LLGC self-training (Algorithm 2), were run for 25 meta-iterations. For LLGC, the same $\alpha$ is used in all experiments, while $\sigma$ is optimized for each dataset. The final test accuracy is computed using a kNN classifier for simplicity. Our results were averaged over 3 random runs, using a different random initialization of the Siamese network parameters for each run and a random selection of the initially labeled examples, except for SVHN. We set baselines by (a) training the network on the small number of labeled instances only, and (b) using all the labeled instances. These two baselines should provide good empirical lower and upper bounds for the semi-supervised error rates.

Simple (#parameters = 163908)    CIFAR10 (#parameters = 693792)
INPUT                            INPUT
ConvRelu(32,7,1,2)               ConvReluBN(192,5,1,2)
MaxPooling(2,2)                  ConvReluBN(160,1,1,2)
ConvRelu(64,5,1,2)               ConvReluBN(96,1,1,2)
MaxPooling(2,2)                  MaxPooling(3,2)
ConvRelu(128,3,1,2)              ConvReluBN(96,5,1,2)
MaxPooling(2,2)                  ConvReluBN(192,1,1,2)
ConvRelu(256,1,1,2)              ConvReluBN(192,1,1,2)
MaxPooling(2,2)                  MaxPooling(3,2)
Conv(4,1,1,2)                    ConvReluBN(192,3,1,2)
Flatten()                        ConvReluBN(64,1,1,2)
                                 AvgPooling(8,1)
We now consider the datasets used in our experiments. The MNIST dataset consists of grayscale 28 by 28 images of handwritten digits. We initially select only 100 instances (10 from each class) as labeled. We apply our algorithms with a fixed selection percentage, and the LLGC-based method with a tuned $\sigma$. Table 2 shows noticeable improvements of the proposed semi-supervised approaches over the supervised-only approach when the same number of labeled examples is used.
# labels                 100 Labeled    All (60000)
Supervised-only
Siamese self-training                    –
LLGC self-training                       –
The Fashion MNIST dataset consists of 28 by 28 grayscale images of fashion items. 100 instances are initially considered as labeled. Again, we use a fixed selection percentage and a tuned $\sigma$. Table 3 again shows a noticeable improvement of the proposed semi-supervised approaches over the supervised-only approach when the same amount of labeled data is used.
# labels                 100 Labeled    All (60000)
Supervised-only
Siamese self-training                    –
LLGC self-training                       –
SVHN comprises 32x32 RGB images of house numbers, taken from the Street View House Numbers dataset. Each image can contain multiple digits, but only the digit in the center is considered for prediction. The proposed approaches are evaluated using 1000 initially labeled instances, with a fixed selection percentage and a tuned $\sigma$. Table 4 shows a noticeable improvement of the proposed approaches over the supervised-only approach when 1000 labeled examples are used. Interestingly, pure Siamese self-training performs better than LLGC self-training in this case.
# labels                 1000 Labeled   All (73275)
Supervised-only
Siamese self-training                    –
LLGC self-training                       –
The CIFAR10 dataset contains 32 by 32 RGB images of ten classes. The proposed semi-supervised approaches are evaluated using 4000 initially labeled instances, with a fixed selection percentage and a tuned $\sigma$. Table 5 shows little improvement of the proposed semi-supervised approaches over the supervised-only approach. Siamese self-training performs better than LLGC self-training.
# labels                 4000 Labeled   All (50000)
Supervised-only
Siamese self-training                    –
LLGC self-training                       –
Figures 3, 4, 5, and 6 show a detailed comparison between Siamese self-training and LLGC self-training across three different runs on all four datasets: MNIST, Fashion MNIST, SVHN, and CIFAR10. The accuracy curves show a definite improvement with respect to the supervised-only version on all datasets, using Siamese self-training as well as LLGC self-training. However, CIFAR10 and SVHN appear to obtain little or no additional improvement from LLGC self-training compared to Siamese self-training alone.
We also visualized the quality of the embeddings learned using the proposed method. We trained an additional model by slightly modifying the simple model from Table 1: in order to obtain a 2-dimensional embedding, two feature maps are used instead of 4 in the last convolutional layer, followed by average pooling (2,2) before the final flattening layer. For this purpose, we considered MNIST. Figure 7(a) depicts the embeddings of the test instances, colored according to their true class, after random initialization of the network. Figure 7(b) depicts the embeddings of the test instances after training the Siamese network with only the 100 labeled MNIST instances. It can be seen that the embeddings of the 10000 test examples form clusters in Euclidean space according to the class labels after the network has been trained, whereas they are largely scattered randomly throughout the 2D space before training.
6 Conclusion
In this work, we have shown how neural networks can be used to learn in a semi-supervised setting from small sets of labeled data by replacing the classification objective with an objective for learning a similarity function. This objective is compatible with standard techniques for training deep neural networks and requires no modification of the embedding model. To improve the intermediate predictions for unlabeled instances, we evaluated LLGC, but this yielded little additional benefit compared to kNN classification alone. Using the method presented in this work, we were able to achieve significant improvements compared to supervised-only learning on MNIST, Fashion MNIST, and SVHN when training on a small subset of labeled examples, but obtained little improvement on CIFAR10. We speculate that, instead of a fixed selection of unlabeled instances from LLGC's predictions, a threshold-based selection using the LLGC score would be more beneficial for subsequent iterations of our meta-algorithm. Also, a more robust convolutional model may help the network learn distinctive embeddings and achieve state-of-the-art results in the semi-supervised setting.
References

[1] (1998) Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100.
[2] (2004) Co-EM support vector learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 16.
[3] (1993) Signature verification using a "Siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence 7.
[4] (2006) Semi-Supervised Learning. Ser. Adaptive Computation and Machine Learning. Cambridge, MA: The MIT Press.
[5] (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR (1), pp. 539–546.
[6] (2017) Good semi-supervised learning that requires a bad GAN. In Advances in Neural Information Processing Systems, pp. 6513–6523.
[7] (2016) Semi-supervised deep learning by metric embedding. arXiv preprint arXiv:1611.01449.
[8] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[9] (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
[10] (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3, pp. 2.
[11] (2016) Auxiliary deep generative models. arXiv preprint arXiv:1602.05473.
[12] (1975) Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association 70 (350), pp. 365–369.
[13] (2017) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976.
[14] (2006) Semi-supervised text classification using EM. In Semi-Supervised Learning, pp. 33–56.
[15] (2017) Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
[16] (2015) Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554.
[17] (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 1163–1171.
[18] (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
[19] (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
[20] (2018) Improving the improved training of Wasserstein GANs: a consistency term and its dual effect. arXiv preprint arXiv:1803.01541.
[21] (2012) Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655.
[22] (2004) Learning with local and global consistency. In Advances in Neural Information Processing Systems, pp. 321–328.