Transfer of Pretrained Model Weights Substantially Improves Semi-Supervised Image Classification

by Attaullah Sahito, et al.
University of Waikato

Deep neural networks produce state-of-the-art results when trained on a large number of labeled examples but tend to overfit when small amounts of labeled examples are used for training. Creating a large number of labeled examples requires considerable resources, time, and effort. If labeling new data is not feasible, so-called semi-supervised learning can achieve better generalisation than purely supervised learning by employing unlabeled instances as well as labeled ones. The work presented in this paper is motivated by the observation that transfer learning provides the opportunity to potentially further improve performance by exploiting models pretrained on a similar domain. More specifically, we explore the use of transfer learning when performing semi-supervised learning using self-learning. The main contribution is an empirical evaluation of transfer learning using different combinations of similarity metric learning methods and label propagation algorithms in semi-supervised learning. We find that transfer learning always substantially improves the model's accuracy when few labeled examples are available, regardless of the type of loss used for training the neural network. This finding is obtained by performing extensive experiments on the SVHN, CIFAR10, and Plant Village image classification datasets and applying pretrained weights from Imagenet for transfer learning.




1 Introduction

Neural networks are frequently used for image classification tasks and yield state-of-the-art results in this application. However, for training, these models generally need a lot of labeled samples, and they tend to overfit on small amounts of labeled data. This problem is of particular importance when limited labeled samples are available due to time or financial constraints. Addressing this problem requires machine learning methods that are able to work with a limited amount of labeled data and also make efficient use of the side information available from unlabeled data.

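Semi-supervised learning (SSL) aims to improve performance by exploiting both labeled and unlabeled examples. Given an input space $X$ containing the examples, SSL methods are designed to work with $l$ labeled examples $X_L = \{x_1, \dots, x_l\}$ and $u$ unlabeled examples $X_U = \{x_{l+1}, \dots, x_{l+u}\}$, where $X = X_L \cup X_U$ with $l \ll u$, and $Y_L = \{y_1, \dots, y_l\}$ are the labels of $X_L$, with $y_i \in \{1, \dots, c\}$ ($c$ being the number of classes).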

A few assumptions are required to make semi-supervised learning a principled approach [3]:

  1. If two instances are close in a high-density region, then their corresponding outputs should also be close.

  2. If instances are in the same structure (referred to as a cluster or manifold), they are likely to be of the same class.

  3. The decision boundary between classes should lie in a low-density region of the input space.

Almost all standard neural networks for image classification are trained by minimising cross-entropy loss on labeled training data. In this paper, along with cross-entropy loss, we also consider another class of losses, comprising so-called similarity metric learning losses, which operate on the relationships between samples such that instances of the same class are considered similar and those belonging to different classes are considered dissimilar. Once a similarity function has been trained, which is parameterised by a neural network, feature vectors (embeddings) of examples produced by the network will be grouped together according to class labels, normally in Euclidean space. These learned embeddings lend themselves naturally to semi-supervised learning because they can be employed to assign class labels to unlabeled examples using very simple classification methods such as nearest-neighbor classifiers.

Figure 1: Overview of the approach.

This approach is related to work on pseudo-labeling [10, 15], where the model is initially trained on limited data. However, in this paper, instead of applying random initialisation of network parameters when training starts, we investigate using pretrained weights from another domain and show that this provides much better generalisation ability. Using pretrained model weights is a standard approach for transfer learning in supervised settings, but appears to have received little attention in the context of semi-supervised learning, particularly when applying self-learning with metric learning.

We use a pretrained neural network model trained on Imagenet [16]. A schematic overview of the proposed approach is shown in Figure 1. Fine-tuning on data from the target domain is performed on the (very small) initial set of labeled examples. Following that, confident predictions for unlabeled examples are added to labeled examples for iterative retraining of the neural network—this is the standard self-learning method for semi-supervised learning. It enables us to obtain more labeled training data and the assumption is that this eventually helps in achieving significant performance improvements. In our experiments on image classification tasks, we compare using pretrained weights for the neural network to random initialisation of the weights.

The main contribution of this work is an extensive empirical investigation of transfer learning in the context of self-learning. Using cross-entropy loss as well as combinations of similarity metric learning losses (e.g., triplet loss, contrastive loss, and Arcface loss) with simple nearest-neighbor-based label propagation, we find that transfer learning always substantially improves the classification accuracy of the model when few labeled examples are available, regardless of which loss function is used for training the neural network. More specifically, for semi-supervised learning using self-learning on the SVHN, CIFAR10, and Plant Village image classification datasets, we obtain a substantial improvement using pretrained weights when few labeled examples are available for training. Thus, our results indicate that the well-established method of performing transfer learning by re-using pretrained weights—commonly applied when performing a purely supervised training of a neural network—is particularly useful in the context of semi-supervised learning.

2 Related Work

In this section, we briefly discuss some existing work on semi-supervised learning and transfer learning.

2.1 Semi-supervised Learning

Semi-supervised Learning (SSL) lies between supervised and unsupervised learning. SSL tries to employ labeled examples as well as unlabeled examples for more accurate prediction. There are many different techniques available from the literature on SSL using deep neural networks. Some employ autoencoders [12, 14], others use generative models [5, 23, 19] or are based on regularization ideas [13, 18]. In pseudo-labeling [10], the model is trained on the limited labeled data first and then re-trained on an extended set of labeled data, based on the predictions of the original model for the unlabeled training data.

Our method builds on work investigating transfer learning using both cross-entropy loss and similarity-based metric learning with neural networks. Pair and triplet based loss functions provide the foundation for standard approaches to metric learning. A classic pair-based method is to use contrastive loss [4], which tries to bring similar pairs closer together and push dissimilar pairs farther apart. Pairs can be extended to triplets. They consist of an anchor, a positive, and a negative example, where the anchor is more similar to the positive example than the negative one. The resulting triplet loss function [20] was originally used on triplets of images for face verification. Metric learning-based loss functions [24] have also been successfully employed for image classification.

Another related class of metric learning methods is based on modified classification losses. Examples include Arcface [6], Sphereface [11] and Cosface [22]. For metric learning, Sphereface, Cosface, and Arcface apply multiplicative-angular, additive-cosine, and additive-angular margins, respectively.

2.2 Transfer Learning

Since the successful Imagenet challenge [16], transfer learning has been used widely in visual recognition tasks such as object detection [7]. Transfer learning uses the network weights learned by training on the large and labeled Imagenet dataset and fine-tunes the weights for the respective target domain. When the target domain is sufficiently closely related to the source domain of Imagenet, then transfer learning usually generalizes much better than training from scratch on the smaller target domain alone.

3 Semi-supervised Learning using Self-learning

The semi-supervised learning approach we apply is based on self-learning. The model is initially trained using a limited number of labeled examples. Then confident predictions for unlabeled examples are added to the set of labeled examples for retraining of the model. Generally, multiple iterations of labeling and retraining are performed. One important hyper-parameter is the selection percentage $p$, which specifies how many of the most confident predictions are added to the training set after each iteration. We use a small value of $p$ in our experiments to select the most confident predictions only. Generating many more labeled data points in this fashion allows for deep neural networks to be trained to their full capacity, and generally results in significant performance improvements. For more details on this approach see, for instance, our previous work in [17].
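The self-learning loop described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the callables `fit` and `predict_with_confidence` are hypothetical placeholders for training the network and scoring unlabeled examples, and the nearest-centroid model is only a toy stand-in so that the loop can be run without a neural network.

```python
import numpy as np

def self_learning(x_lab, y_lab, x_unlab, fit, predict_with_confidence,
                  p=0.05, meta_iterations=25):
    """Move the most confident pseudo-labeled examples into the labeled set.

    fit(x, y) -> model and predict_with_confidence(model, x) -> (labels, scores)
    are hypothetical callables; a higher score means a more confident prediction.
    """
    x_lab, y_lab, x_unlab = map(np.asarray, (x_lab, y_lab, x_unlab))
    model = fit(x_lab, y_lab)
    for _ in range(meta_iterations):
        if len(x_unlab) == 0:
            break
        labels, scores = predict_with_confidence(model, x_unlab)
        # Select the top p fraction of the remaining unlabeled pool (at least one).
        n_select = max(1, int(p * len(x_unlab)))
        chosen = np.argsort(-scores)[:n_select]
        x_lab = np.concatenate([x_lab, x_unlab[chosen]])
        y_lab = np.concatenate([y_lab, labels[chosen]])
        x_unlab = np.delete(x_unlab, chosen, axis=0)
        model = fit(x_lab, y_lab)  # retrain on the enlarged labeled set
    return model, x_lab, y_lab

# Toy stand-in model (assumption for illustration): nearest class centroid.
def fit_centroids(x, y):
    classes = np.unique(y)
    return classes, np.stack([x[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, x):
    classes, centroids = model
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    # Negative distance as score: closer to a centroid means more confident.
    return classes[dists.argmin(axis=1)], -dists.min(axis=1)
```

Passing the two callables in keeps the loop independent of the loss function, which mirrors how the same procedure is paired with different losses and label propagation schemes in the text.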

In this paper, for network weight initialisation, we transfer pretrained weights from Imagenet classification and fine-tune on the target domain. We compare the performance achieved by this weight transfer to the performance of training using a fully random initialisation of the weights of the neural network.

The proposed approach is very general, suggesting that a spectrum of loss functions and label propagation algorithms can all work well in this framework. We use the most widely used classification loss, i.e., softmax cross-entropy, as a first option. In addition, we explore loss functions based on similarity metric learning. The embeddings produced by the neural network after training with a similarity function can be employed to assign class labels to unlabeled examples using very simple classification methods such as a nearest-neighbor classifier. Below we review the loss functions used for the experiments.

3.1 Softmax Cross-entropy Loss

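The single most frequently used classification loss function is softmax cross-entropy, which is a measure of the difference between the desired probability distribution and the predicted probability distribution:

$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{c} e^{W_j^{T} x_i + b_j}}$$

Here, $x_i \in \mathbb{R}^d$ denotes the deep features (the "embedding") of the $i$-th sample, belonging to the $y_i$-th class, and $d$ is the dimension of the embedding; $W_j \in \mathbb{R}^d$ denotes the $j$-th column of the weight matrix $W \in \mathbb{R}^{d \times c}$ and $b_j$ is the bias term. The batch size for gradient descent is $N$ and $c$ is the number of classes.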
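As a small numerical illustration of softmax cross-entropy over a mini-batch, the following NumPy sketch computes the mean loss from embeddings, a weight matrix, and a bias vector. The function name and argument layout are assumptions made for this example; a deep learning framework would normally provide this loss.

```python
import numpy as np

def softmax_cross_entropy(x, y, W, b):
    """Mean softmax cross-entropy.

    x: (N, d) embeddings, y: (N,) integer class labels,
    W: (d, c) weight matrix, b: (c,) bias vector.
    """
    logits = x @ W + b
    # Subtract the row-wise maximum for numerical stability; the loss is unchanged.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()
```

With all-zero weights and biases the predicted distribution is uniform, so the loss equals the logarithm of the number of classes, which is a convenient sanity check.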

3.2 Siamese Networks

Siamese networks [2] are neural networks for training a similarity function given labeled data using one of several possible loss functions. They can be thought of as two identical copies of the same network, sharing all weights. They are particularly suitable for datasets with many classes containing only a few labeled instances per class and can employ any of the loss functions listed below.

3.2.1 Triplet Loss

The triplet loss [20] is widely used. A triplet's anchor example $x_a$, positive example $x_p$, and negative example $x_n$ are provided as a training example to the network for getting corresponding embeddings. Normally $x_a$ and $x_p$ come from the same class, and $x_n$ is from a different class. Triplet loss tries to push the negative example's embedding farther away from the anchor than the positive example's embedding, by at least a user-specified margin $m$. Using, e.g., Euclidean distance $d(\cdot, \cdot)$ between embedded examples, the triplet loss is calculated as:

$$L_{triplet} = \max\big(d(x_a, x_p) - d(x_a, x_n) + m,\ 0\big)$$

Triplet loss tries to push $d(x_a, x_p)$ to 0 and $d(x_a, x_n)$ to be greater than $d(x_a, x_p) + m$. Triplets can be categorized as:

  • Easy triplets: those with a loss of 0.

  • Hard triplets: those where the negative $x_n$ is closer to the anchor $x_a$ than the positive $x_p$.

  • Semi-hard triplets: those where $x_n$ is not closer to $x_a$ than $x_p$, but is within the margin, thus still returning a positive loss.

In our experiments, we use semi-hard triplets for training of the neural network as they yield more distinctive embeddings [20].
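The triplet loss and the three triplet categories can be illustrated with a short NumPy sketch operating directly on embedding vectors. The function names and the default margin are assumptions for this example, and the sketch does not cover how semi-hard triplets are mined within a mini-batch.

```python
import numpy as np

def triplet_loss(a, p, n, margin=1.0):
    """Triplet loss on embeddings of anchor a, positive p, and negative n."""
    d_ap = np.linalg.norm(a - p, axis=-1)
    d_an = np.linalg.norm(a - n, axis=-1)
    return np.maximum(d_ap - d_an + margin, 0.0)

def triplet_kind(a, p, n, margin=1.0):
    """Classify a single triplet as 'hard', 'semi-hard', or 'easy'."""
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    if d_an < d_ap:
        return "hard"        # negative is closer to the anchor than the positive
    if d_an < d_ap + margin:
        return "semi-hard"   # negative is farther, but still inside the margin
    return "easy"            # loss is zero

anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])
for negative in (np.array([0.5, 0.0]), np.array([1.5, 0.0]), np.array([3.0, 0.0])):
    print(triplet_kind(anchor, positive, negative),
          triplet_loss(anchor, positive, negative))
```

Only hard and semi-hard triplets produce a non-zero loss and therefore a gradient, which is why triplet selection matters during training.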

3.2.2 Contrastive loss

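The contrastive loss [8] is a pair-based loss that attempts to bring similar examples closer to each other and push dissimilar examples farther away with respect to a minimum margin $m$. Contrastive loss for embeddings of two examples $x_1$ and $x_2$ can be calculated as follows:

$$L_{contrastive} = y \, d(x_1, x_2)^2 + (1 - y) \, \max\big(m - d(x_1, x_2),\ 0\big)^2$$

Here, $d(x_1, x_2)$ is the Euclidean distance between the two embeddings, $y = 1$ if $x_1$ and $x_2$ are from the same class, and $y = 0$ otherwise.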
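A minimal NumPy sketch of a contrastive loss on a pair of embeddings is given below. It assumes the common convention in which the pair label is 1 for a same-class pair and 0 otherwise, and it omits any constant scaling factor; the function name and default margin are chosen for illustration only.

```python
import numpy as np

def contrastive_loss(e1, e2, y, margin=1.0):
    """Contrastive loss for embeddings e1, e2 with pair label y.

    y = 1 for a same-class pair, y = 0 for a different-class pair
    (an assumed convention; some formulations use the opposite coding).
    """
    d = np.linalg.norm(e1 - e2, axis=-1)
    pull = y * d ** 2                                   # similar pairs: shrink distance
    push = (1 - y) * np.maximum(margin - d, 0.0) ** 2   # dissimilar pairs: enforce margin
    return pull + push
```

A dissimilar pair that is already farther apart than the margin contributes nothing to the loss, so training effort concentrates on dissimilar pairs that are still too close.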

3.2.3 ArcFace loss

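Arcface loss [6] is a modified cross-entropy loss with angular margins in the softmax expression, which is designed for improved discrimination in metric learning. The loss is calculated as:

$$L_{arc} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j=1, j \neq y_i}^{c} e^{s \cos \theta_j}}$$

Here, $\theta_j$ is the angle between the $\ell_2$-normalized weight vector $W_j$ and the feature vector $x_i$. The bias term is ignored for simplicity. The feature vector $x_i$ is $\ell_2$-normalised and scaled to $s$, the radius of the hypersphere. An additive angular margin penalty $m$ is added to the ground truth angle $\theta_{y_i}$.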
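The additive angular margin can be illustrated with the following NumPy sketch, which normalises features and class weights, adds the margin to the angle of the true class, and then applies softmax cross-entropy to the scaled cosines. The default values of the scale `s` and margin `m` are placeholders chosen for this example, not settings reported in the text.

```python
import numpy as np

def arcface_loss(x, y, W, s=30.0, m=0.5):
    """Mean Arcface-style loss.

    x: (N, d) features, y: (N,) integer labels, W: (d, c) class weights.
    s (scale) and m (angular margin) are illustrative defaults.
    """
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)   # l2-normalise features
    W_n = W / np.linalg.norm(W, axis=0, keepdims=True)   # l2-normalise each class column
    cos = np.clip(x_n @ W_n, -1.0, 1.0)
    theta = np.arccos(cos)
    rows = np.arange(len(y))
    logits = s * cos
    # Add the angular margin only to the ground-truth class angle.
    logits[rows, y] = s * np.cos(theta[rows, y] + m)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, y].mean()
```

Setting `m=0` reduces the sketch to ordinary softmax cross-entropy on scaled cosine similarities, which makes the effect of the margin easy to isolate.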

4 Experiments

For evaluating the effect of transfer learning, we consider three image classification problems. For all datasets, a small subset of labeled examples was chosen according to standard semi-supervised learning practice, with a balanced number of examples from each class. All remaining examples were used as unlabeled training examples. For triplet, contrastive and Arcface loss, $k$-nearest neighbor is used for label prediction, with $k = 1$ for simplicity. We always include two network versions in the comparison: one using randomly initialised weights, and one using pretrained weights from ImageNet. All models are evaluated on the standard test split for each dataset in three different ways: after training only on the initially labeled examples, after training for a number of meta-iterations using our semi-supervised learning approaches, and, for comparison, after training on all labeled training examples. The two sets of results computed from a) only the initial labeled examples, and b) all labeled training examples, act as an empirical lower and upper bound for the semi-supervised approaches.

We used the VGG16 network architecture for all experiments. A fully connected layer is added at the end of the model for generating a 256-dimensional embedding space. A mini-batch size of 100 is used for all the experiments. For updating the network parameters, Adam is used as the optimizer, except for contrastive loss, which uses Rmsprop. For triplet, contrastive, and Arcface loss, the distance to the nearest labeled example is used as the confidence score when selecting unlabeled examples for labeling. For softmax cross-entropy loss, the softmax probability score is used as the confidence score. Our proposed self-learning approach was run for 25 meta-iterations and results were averaged over 3 runs with a random selection of initially labeled examples.
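For the metric learning losses, label prediction and confidence scoring by distance to the nearest labeled example can be sketched as below. This is an illustrative brute-force version operating on precomputed embeddings; the function names are hypothetical, and the full pairwise distance computation would need batching for large datasets.

```python
import numpy as np

def nearest_neighbor_predict(emb_lab, y_lab, emb_unlab):
    """Assign each unlabeled embedding the label of its nearest labeled embedding.

    Returns (predicted labels, distance to the nearest labeled example).
    A smaller distance is treated as a more confident prediction.
    """
    dists = np.linalg.norm(emb_unlab[:, None, :] - emb_lab[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    return y_lab[nearest], dists.min(axis=1)

def select_most_confident(distances, p=0.05):
    """Indices of the fraction p of examples with the smallest distances."""
    n_select = max(1, int(p * len(distances)))
    return np.argsort(distances)[:n_select]
```

For softmax cross-entropy the same selection step would instead rank examples by the largest softmax probability, in descending order.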

4.1 Results

SVHN (Street View House Numbers) comprises 32x32 color images of house numbers. A single image can contain multiple digits, but only the digit in the center is considered for the label prediction. The proposed approaches are evaluated using 1000 labeled instances initially and use a selection percentage of $p = 5\%$ (i.e., in each meta-iteration of self-training, 5% of the remaining unlabeled examples are selected for labeling). Table 1 shows test accuracy for SVHN using all four losses, with random as well as pretrained weights, for the 1000-labeled, the self-learning, and the all-labeled setup.

Loss                    Pretrained   1000 Labels     Self-learning   73257 Labels
Cross-entropy loss      No           75.81 ± 2.28    92.07 ± 0.35    95.72 ± 0.23
                        Yes          80.84 ± 0.74    92.73 ± 0.52    96.10 ± 0.21
Triplet loss [20]       No           57.22 ± 1.81    64.69 ± 1.39    94.79 ± 0.06
                        Yes          82.52 ± 2.14    86.14 ± 1.11    95.12 ± 0.23
Contrastive loss [8]    No           54.73 ± 0.57    62.80 ± 0.63    81.82 ± 2.29
                        Yes          79.46 ± 0.99    82.59 ± 0.31    93.41 ± 0.26
Arcface loss [6]        No           68.33 ± 0.91    70.42 ± 1.59    93.74 ± 0.11
                        Yes          80.84 ± 0.21    82.01 ± 1.41    95.66 ± 0.31

Table 1: SVHN Test Accuracy %.
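To give a sense of scale for the SVHN setup (1000 initially labeled examples out of 73257, 25 meta-iterations, 5% of the remaining unlabeled examples selected per iteration), the following sketch counts how many pseudo-labels would be added. It assumes that exactly 5% of the remaining pool, rounded down, is moved in every iteration; the rounding rule is an assumption, not something stated in the text.

```python
def pseudo_label_schedule(n_unlabeled, p=0.05, meta_iterations=25):
    """Number of examples selected per meta-iteration and the remaining pool size."""
    added = []
    for _ in range(meta_iterations):
        n_select = int(p * n_unlabeled)  # assumed rounding: floor
        added.append(n_select)
        n_unlabeled -= n_select
    return added, n_unlabeled

added, remaining = pseudo_label_schedule(73257 - 1000)
print(sum(added), remaining)
```

Because each iteration takes a fixed fraction of a shrinking pool, the number of newly labeled examples decreases geometrically, and under these assumptions a bit under 30% of the unlabeled pool is still unlabeled after 25 meta-iterations.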

The CIFAR-10 dataset comprises 32x32 RGB images of ten different object classes. The proposed semi-supervised approaches are evaluated using 4000 labeled instances initially, with a selection percentage $p$ for self-training. Table 2 shows accuracy on the standard test set for all losses using 4000-labeled, all-labeled and self-learning, for pretrained weights from Imagenet as well as random initial weights.

Loss                    Pretrained   4000 Labels     Self-learning   50000 Labels
Cross-entropy loss      No           70.43 ± 1.43    79.15 ± 0.80    87.84 ± 0.39
                        Yes          77.07 ± 0.91    83.33 ± 0.19    89.37 ± 0.49
Triplet loss [20]       No           68.35 ± 3.63    70.57 ± 1.17    86.54 ± 0.42
                        Yes          76.42 ± 2.19    78.36 ± 1.39    88.15 ± 0.36
Contrastive loss [8]    No           34.90 ± 0.73    44.58 ± 1.67    71.16 ± 0.05
                        Yes          71.98 ± 0.95    76.58 ± 0.05    85.92 ± 0.32
Arcface loss [6]        No           55.04 ± 1.36    69.54 ± 3.69    75.31 ± 0.24
                        Yes          74.76 ± 0.72    76.55 ± 1.80    87.76 ± 0.24

Table 2: CIFAR10 Test Accuracy %.

The Plant Village [9] dataset consists of plant leaves. It has 43,456 training and 10,849 test RGB images resized to 96x96 from the original format (256x256). It has 38 categories of species and diseases. A sample image for each class is shown in Figure 2. The proposed semi-supervised approaches are evaluated using 10 images per class (380 in total) as labeled instances initially, with a selection percentage $p$ in self-learning.

Figure 2: Plant Village disease [9] dataset

Table 3 shows accuracy on test examples for all four losses using 380-labeled, all-labeled and self-learning, with random weight initialization and pretrained weights.

Loss                    Pretrained   380 Labels      Self-learning   43456 Labels
Cross-entropy loss      No           45.78 ± 4.09    54.58 ± 2.65    98.24 ± 0.62
                        Yes          73.76 ± 1.70    84.62 ± 1.2     99.24 ± 0.08
Triplet loss [20]       No           29.81 ± 2.59    33.16 ± 1.96    92.15 ± 1.63
                        Yes          76.88 ± 0.36    77.80 ± 1.15    99.02 ± 0.11
Contrastive loss [8]    No           13.12 ± 1.56    16.35 ± 0.88    34.75 ± 3.20
                        Yes          30.22 ± 2.14    32.46 ± 2.65    45.66 ± 2.64
Arcface loss [6]        No           54.85 ± 0.09    58.39 ± 3.61    98.11 ± 0.38
                        Yes          60.67 ± 0.13    71.80 ± 2.58    99.32 ± 0.04

Table 3: Plant Village 96x96 Test Accuracy %.

As we can see from the results for all three datasets, using pretrained weights generally results in substantial improvements over random initialisation. When comparing the four loss functions, cross-entropy emerges as the winner, with triplet loss often being second best. However, especially for small numbers of labeled examples, triplet loss seems competitive with cross-entropy: with pretrained weights and only the initial labeled examples, it outperforms cross-entropy on two of the three datasets (SVHN and Plant Village). This seems reasonable, as paying explicit attention to the similarities of particular instances may be more important when only a few labeled instances are available.

Comparing the three metric losses with each other, triplet loss generally outperforms the other two when using pretrained weights. On the other hand, when using random initial weights, none of the three losses seems to have a clear advantage over the others, except for the Plant Village dataset, where Arcface performs very well, even outperforming cross-entropy in the 380-label and self-learning settings.

Figure 3 shows a comparison of self-learning using random weights and pretrained weights, across three different runs on CIFAR10, using softmax cross-entropy loss for 4000 initially labeled examples and 25 meta-iterations of self-learning. The accuracy curves show similar improvements for both scenarios, with the pretrained version starting from a higher initial accuracy level, and retaining this advantage over the 25 meta-iterations of self-learning.

[Figure 3 plots, both titled "CIFAR10 VGG16 (Cross-entropy loss)"; x-axis: meta-iterations, y-axis: accuracy %. Legend entries in each panel: self-learning runs 1 to 3, 4000-label baseline, all-label baseline.]

(a) Self-learning with random weights (4000-label 70.13 ± 1.45%, all-label 87.84 ± 0.39%)
(b) Self-learning with Imagenet weights (4000-label 77.04 ± 0.97%, all-label 89.37 ± 0.49%)
Figure 3: CIFAR10 meta-iterations of self-learning using random and pretrained Imagenet weights.

In order to investigate the effect of self-learning on the embeddings, we visualize the embeddings obtained using all four loss functions. Figure 4 shows the output of TSNE [21] on embeddings of CIFAR10 test instances after training on 4000 labeled examples and after 25 meta-iterations of self-learning using all four losses. It is evident that self-learning improves class separation, with cross-entropy showing the most dramatic improvement, consistent with its high final accuracy.

(a) Cross-entropy: initial training
(b) Cross-entropy: after 25 meta-iterations
(c) Triplet loss: initial training
(d) Triplet loss: after 25 meta-iterations
(e) Contrastive loss: initial training
(f) Contrastive loss: after 25 meta-iterations
(g) Arcface: initial training
(h) Arcface: after 25 meta-iterations
Figure 4: t-SNE visualization of CIFAR10 embeddings for all losses after training on the first 4000 labeled examples and after 25 meta-iterations of self-learning.

5 Conclusions

In this paper, we have shown that transfer learning can be highly beneficial for semi-supervised image classification. In terms of loss functions, overall, cross-entropy outperforms more specialised losses like triplet loss, contrastive loss, or Arcface loss. Still, for a small number of labels, triplet loss is very competitive.

There are a number of directions for future work. Exploring combinations of well-performing loss functions, exploring alternatives to the label propagation scheme, and exploring connections to few-shot learning, are just a few obvious ones. Additionally, more lower-level engineering ideas, like mini-batch composition strategies as pointed out in [1], might help to further improve the performance of semi-supervised image classification.


References

  • [1] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness (2020) Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
  • [2] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1993) Signature verification using a “Siamese” time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence 07 (04), pp. 669–688.
  • [3] O. Chapelle, B. Schölkopf, and A. Zien (2006) Semi-supervised learning. The MIT Press.
  • [4] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1, pp. 539–546.
  • [5] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov (2017) Good semi-supervised learning that requires a bad GAN. In Advances in Neural Information Processing Systems, pp. 6513–6523.
  • [6] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699.
  • [7] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
  • [8] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742.
  • [9] D. Hughes, M. Salathé, et al. (2015) An open access repository of images on plant health to enable the development of mobile disease diagnostics. arXiv preprint arXiv:1511.08060.
  • [10] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3, pp. 2.
  • [11] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) Sphereface: deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220.
  • [12] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther (2016) Auxiliary deep generative models. In Proceedings of the 33rd International Conference on Machine Learning, Volume 48, pp. 1445–1454.
  • [13] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993.
  • [14] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko (2015) Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554.
  • [15] C. Rosenberg, M. Hebert, and H. Schneiderman (2005) Semi-supervised self-training of object detection models. In 2005 Seventh IEEE Workshops on Applications of Computer Vision (WACV/MOTION’05), Vol. 1, pp. 29–36.
  • [16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • [17] A. Sahito, E. Frank, and B. Pfahringer (2019) Semi-supervised learning using Siamese networks. In AI 2019: Advances in Artificial Intelligence, J. Liu and J. Bailey (Eds.), Cham, pp. 586–597. ISBN 978-3-030-35288-2.
  • [18] M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 1163–1171.
  • [19] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
  • [20] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
  • [21] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
  • [22] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274.
  • [23] X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang (2018) Improving the improved training of Wasserstein GANs: a consistency term and its dual effect. In International Conference on Learning Representations.
  • [24] J. Weston, F. Ratle, H. Mobahi, and R. Collobert (2012) Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655.