1 Introduction
Supervised learning with deep neural networks has proved its efficacy, achieving outstanding successes in a wide range of machine learning domains such as image recognition, language modeling, speech recognition, and machine translation. There is an empirical observation that better performance can be obtained if the model is trained on larger datasets with more labeled data (Hestness et al., 2017; Mahajan et al., 2018; Kolesnikov et al., 2019; Xie et al., 2020; Raffel et al., 2019). However, data labeling is costly and labor-intensive, sometimes even requiring the participation of experts (for example, in medical applications, data labeling must be done by doctors). In many real-world problems, it is often very difficult to create a large amount of labeled training data. Therefore, numerous studies have focused on how to leverage unlabeled data, leading to a variety of research fields such as self-supervised learning (Doersch et al., 2015; Noroozi and Favaro, 2016; Gidaris et al., 2018), semi-supervised learning (Berthelot et al., 2019b; Nair et al., 2019; Berthelot et al., 2019a; Sohn et al., 2020), and metric learning (Hermans et al., 2017; Zhang et al., 2019). In self-supervised learning, pretext tasks are designed so that the model can learn meaningful information from a large number of unlabeled images; the model is then fine-tuned on a smaller set of labeled data. Semi-supervised learning (SSL), in contrast, leverages both labeled and unlabeled data in a single training process. Metric learning, on the other hand, does not directly predict the semantic labels of given inputs but aims to measure the similarity among inputs.
In this paper, we unify the ideas of semi-supervised learning (SSL) and metric learning to propose RankingMatch, a more powerful SSL method for image classification (Figure 1). We adopt the FixMatch SSL method (Sohn et al., 2020), which utilizes pseudo-labeling and consistency regularization to produce artificial labels for unlabeled data. Specifically, given an unlabeled image, a weakly-augmented and a strongly-augmented version of it are created. The model prediction corresponding to the weakly-augmented image is used as the target label for the strongly-augmented image, encouraging the model to produce the same prediction for different perturbations of the same input.
The consistency regularization approach encourages the model to produce an unchanged output under different perturbations of the same input, but this is not enough. Inspired by the observation that images from the same class (having the same label) should also have similar model outputs, we utilize loss functions from metric learning, called Ranking losses, to apply additional constraints to the objective function of our model. Concretely, we use Triplet and Contrastive loss with the aim of encouraging the model to produce similar outputs for images from the same class. Given an image from a class (say, dog), Triplet loss tries to pull positive samples (images of class dog) closer to the given image and push negative samples (images not of class dog) further away from it. Contrastive loss, on the other hand, maximizes the similarity of images from the same class and minimizes the similarity of images from different classes. However, instead of applying Triplet and Contrastive loss to the image representation as previous works did (Hermans et al., 2017; Chen et al., 2020a), we apply them directly to the model output (the “logits” score), which is the output of the classification head.
We argue that images from the same class do not strictly have to have similar representations, but their model outputs should be as similar as possible. Our motivation and argument are consolidated in Appendix A. In particular, we propose a new version of Triplet loss called BatchMean. Our BatchMean Triplet loss has the computational efficiency of the existing BatchHard Triplet loss while taking into account all input samples when computing the loss. More details are presented in Section 3.3.1. Our key contributions are summarized as follows:
We introduce a novel SSL method, RankingMatch, which encourages the model to produce similar outputs not only for different perturbations of the same input but also for input samples from the same class.

Our proposed BatchMean Triplet loss surpasses the two existing versions of Triplet loss, BatchAll and BatchHard Triplet loss (Section 4.5).

Our method is simple yet effective, achieving state-of-the-art results across many standard SSL benchmarks with various amounts of labeled data.
2 Related Work
Many recent works have achieved success in semi-supervised learning (SSL) by adding a loss term for unlabeled data. This section reviews two classes of this loss term, consistency regularization and entropy minimization, that are related to our work. Ranking loss is also reviewed in this section.
Consistency Regularization This is a widely used SSL technique which encourages the model to produce an unchanged output under different perturbations of the same input sample. Consistency regularization was introduced early on by Sajjadi et al. (2016) and Laine and Aila (2016) with the methods named “Regularization With Stochastic Transformations and Perturbations” and “Π-Model”, respectively. Both of these approaches used Mean Squared Error (MSE) to enforce the model to produce the same output for different perturbed versions of the same input. Later state-of-the-art methods adopted consistency regularization in diverse ways. In MixMatch (Berthelot et al., 2019b), a guessed label, computed from weakly-augmented versions of an unlabeled sample, was used as the target label for all of these weakly-augmented samples. On the other hand, in FixMatch (Sohn et al., 2020), a pseudo-label, computed from the weakly-augmented unlabeled sample, became the target label for the strongly-augmented version of the same unlabeled sample.
Entropy Minimization One of the requirements in SSL is that the model prediction for unlabeled data should have low entropy. Grandvalet and Bengio (2005) and Miyato et al. (2018) introduced an additional loss term, explicitly incorporated into the objective function, to minimize the entropy of the model's predictive distribution for unlabeled data. On the other hand, MixMatch (Berthelot et al., 2019b) used a sharpening function to adjust the model's predictive distribution and thereby reduce the entropy of the predicted label. FixMatch (Sohn et al., 2020) obtained entropy minimization implicitly by constructing hard labels from high-confidence predictions (predictions higher than a predefined threshold) on weakly-augmented unlabeled data; these hard labels were then used as the target labels for strongly-augmented unlabeled data.
Metric Learning and Ranking Loss Metric learning is an approach that does not directly predict the semantic labels of given images but trains the model to learn the similarity among samples (Kulis et al., 2012; Kaya and Bilge, 2019). Various objective functions are used in metric learning, including Triplet and Contrastive loss, which are used in our work. Triplet loss was successfully exploited in the person re-identification problem (Hermans et al., 2017). A triplet contains a person image referred to as the anchor, a positive sample which is an image of the same person as the anchor, and a negative sample which is an image of a different person. Triplet loss was used to enforce that the distance between the anchor and the negative sample is larger than the distance between the anchor and the positive sample by at least a margin $m$. Besides, SimCLR (Chen et al., 2020a) utilized Contrastive loss to maximize the similarity between two different augmented versions of the same sample while minimizing the similarity between different samples. Both Hermans et al. (2017) and Chen et al. (2020a) applied Triplet and Contrastive loss to the image representation. Contrastive loss was also used by Chen et al. (2020b) for semi-supervised image retrieval and person re-identification. Given feature (or image) representations, Chen et al. (2020b) computed class-wise similarity scores using a similarity measurement to learn a semantics-oriented similarity representation; Contrastive loss was then applied to both the image and the semantics-oriented similarity representation in two learning phases. If the model output in image classification is viewed as a form of class-wise similarity scores, the high-level idea of our method might be similar to Chen et al. (2020b) in utilizing Contrastive loss. However, in our case, the model itself produces the class-wise similarity scores, and Contrastive loss is applied only to the model output (the “logits” score, not the image representation) in a single training process. More details are presented in Section 3.3.
3 RankingMatch
This section starts by describing the overall framework and objective function of RankingMatch. Next, the two important factors of the objective function, Cross-Entropy and Ranking loss, are presented in detail. Concretely, Triplet and Contrastive loss are shown separately, along with our proposed and modified versions.
3.1 Overall Framework
The overall framework of RankingMatch is illustrated in Figure 1. Both labeled and unlabeled data are leveraged simultaneously in a single training process. Two kinds of augmentation are used to perturb the input sample. While weak augmentation uses the standard padding-and-cropping and horizontal-flipping augmentation strategies, more complex transformations are used for strong augmentation. We utilize RandAugment (Cubuk et al., 2020) for strong augmentation, which consists of multiple transformation methods such as contrast adjustment, shear, rotation, and translation. Given a collection of transformations, two of them are randomly selected to strongly perturb the input sample. Cutout (DeVries and Taylor, 2017) is then applied to obtain the final strongly-augmented sample.

As shown in Figure 1, only weak augmentation is used for labeled data. The weakly-augmented labeled image is fed into the model to produce scores for labels. These scores are actually the output of the classification head, and we call them the “logits” score for convenience of explanation. A softmax function converts the “logits” scores into probabilities for labels, which are then used along with the ground-truth labels to compute Cross-Entropy loss. An $\ell_2$ normalization is applied to the “logits” scores before they are used for computing Ranking loss. We experimented and found that $\ell_2$ normalization is an important factor contributing to the success of RankingMatch, as shown in Section 4.5. The ground-truth labels are used to determine positive samples (images from the same class) and negative samples (images from different classes) when computing Ranking loss. The same procedure is used for unlabeled data, except that pseudo-labels, obtained from weakly-augmented unlabeled samples, are used instead of the ground-truth labels.

Let $\mathcal{X} = \{(x_b, p_b) : b \in (1, \dots, B)\}$ define a batch of $B$ labeled samples, where $x_b$ is the training sample and $p_b$ is the corresponding one-hot label. Let $\mathcal{U} = \{u_b : b \in (1, \dots, \mu B)\}$ be a batch of unlabeled samples, with $\mu$ a coefficient determining the relative sizes of $\mathcal{X}$ and $\mathcal{U}$. We denote weak and strong augmentation as $\omega(\cdot)$ and $\Omega(\cdot)$ respectively. Let $z = f(x)$ be the “logits” score produced by the model $f$ for a given input $x$; accordingly, $\sigma(z)$ and $\hat{z} = z / \lVert z \rVert_2$ are the softmax function and $\ell_2$ normalization applied to the “logits” score, respectively. Finally, let $\mathrm{H}(p, q)$ be the Cross-Entropy loss between the predicted class distribution $q$ and the target label $p$. Notably, $p$ corresponds to the ground-truth label or pseudo-label in the case of labeled or unlabeled data, respectively.
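To make these definitions concrete, the two preprocessing steps for Ranking loss, the $\ell_2$ normalization of the “logits” scores and the label-based selection of positive and negative samples, can be sketched in NumPy. This is an illustrative sketch rather than the paper's code; `l2_normalize` and `pair_masks` are hypothetical helper names.

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    """Project each "logits" vector onto the unit sphere before Ranking loss."""
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)

def pair_masks(labels):
    """Positive mask: pairs sharing a label (self-pairs excluded).
    Negative mask: pairs with different labels.
    `labels` holds ground-truth labels for a labeled batch, or
    pseudo-labels for an unlabeled batch."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    self_pairs = np.eye(len(labels), dtype=bool)
    return same & ~self_pairs, ~same
```

The masks can then index a pairwise distance or similarity matrix to pick out the anchor–positive and anchor–negative terms of either Ranking loss.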
As illustrated in Figure 1, four terms contribute to the overall loss function of RankingMatch. Two of them are the Cross-Entropy losses for labeled and unlabeled data, denoted by $L_x$ and $L_u$ respectively. The remaining two are the Ranking losses for labeled and unlabeled data, $L_x^R$ and $L_u^R$ respectively. The objective is to minimize the loss function defined as follows:

$$L = L_x + \lambda_u L_u + \lambda_R \big(L_x^R + L_u^R\big) \qquad (1)$$

where $\lambda_u$ and $\lambda_R$ are scalar hyperparameters denoting the weights of the loss terms. In Sections 3.2 and 3.3, we present in detail how Cross-Entropy and Ranking loss are computed for labeled and unlabeled data. We also show comparisons between RankingMatch and other methods in Appendix B. The full algorithm of RankingMatch is provided in Appendix C.
3.2 Cross-Entropy Loss
For labeled data, since the ground-truth labels are available, the standard Cross-Entropy loss is computed as follows:

$$L_x = \frac{1}{B} \sum_{b=1}^{B} \mathrm{H}\big(p_b, \sigma(f(\omega(x_b)))\big) \qquad (2)$$

For unlabeled data, we adopt the idea of FixMatch (Sohn et al., 2020) to obtain the pseudo-label, which plays a role similar to the ground-truth label of labeled data. Given an unlabeled image $u_b$, the model first produces the “logits” score for the weakly-augmented unlabeled image: $z_b = f(\omega(u_b))$. A softmax function is then applied to $z_b$ to obtain the model prediction: $q_b = \sigma(z_b)$. The pseudo-label corresponds to the class having the highest probability: $\hat{q}_b = \arg\max(q_b)$. Note that, for simplicity, $\arg\max$ is assumed to produce a valid one-hot pseudo-label. A threshold $\tau$ is used to ignore predictions that have low confidence. Finally, the high-confidence pseudo-labels are used as the target labels for the strongly-augmented versions of the corresponding unlabeled images, leading to:

$$L_u = \frac{1}{\mu B} \sum_{b=1}^{\mu B} \mathbb{1}\big(\max(q_b) \ge \tau\big)\, \mathrm{H}\big(\hat{q}_b, \sigma(f(\Omega(u_b)))\big) \qquad (3)$$
Equation 3 satisfies both consistency regularization and entropy minimization. The model is encouraged to produce outputs for strongly-augmented samples that are consistent with the model outputs for weakly-augmented samples; this is referred to as consistency regularization. As advocated in Lee (2013) and Sohn et al. (2020), using a pseudo-label, which is based on the model prediction for an unlabeled sample, as a hard target for the same sample can be regarded as entropy minimization.
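The pseudo-labeling step of Equation 3 can be sketched as follows. This is a minimal NumPy illustration of the mechanism, not the paper's implementation; `unlabeled_ce` and the helper names are assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def unlabeled_ce(logits_weak, logits_strong, tau=0.95):
    """Pseudo-labels come from the weak views; cross-entropy is applied
    to the strong views, masked by the confidence threshold tau."""
    q = softmax(logits_weak)                  # model prediction for weak views
    pseudo = q.argmax(axis=1)                 # hard pseudo-labels
    mask = q.max(axis=1) >= tau               # keep only confident predictions
    p_strong = softmax(logits_strong)
    ce = -np.log(p_strong[np.arange(len(pseudo)), pseudo] + 1e-12)
    return (ce * mask).sum() / max(len(pseudo), 1)   # average over the batch
```

Low-confidence samples contribute nothing to the loss, which is exactly how the indicator term in Equation 3 behaves.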
3.3 Ranking Loss
This section presents the two types of Ranking loss used in RankingMatch: Triplet and Contrastive loss. We apply these two loss functions directly to the “logits” scores, which differs from previous works such as Hermans et al. (2017) and Chen et al. (2020a). In particular, our novel version of Triplet loss, BatchMean Triplet loss, is also presented in this section.

Let $\mathcal{Z} = \{\hat{z}_i : i \in (1, \dots, N)\}$ be a batch of $\ell_2$-normalized “logits” scores of the network shown in Figure 1. Let $y_i$ denote the label of the normalized “logits” score $\hat{z}_i$; this label is the ground-truth label or pseudo-label in the case of labeled or unlabeled data, respectively. The procedure for obtaining the pseudo-label for unlabeled data was presented in Section 3.2. Notably, Ranking loss is computed separately for labeled and unlabeled data, and the Ranking loss $L^R$ in Equation 1 can be either Triplet loss (Section 3.3.1) or Contrastive loss (Section 3.3.2). Let $a$, $p$, and $n$ index the anchor, positive, and negative sample, respectively. While the anchor and a positive sample are normalized “logits” scores having the same label, the anchor and a negative sample are normalized “logits” scores having different labels.
3.3.1 BatchMean Triplet Loss
Let $D(z_i, z_j)$ denote the distance between two “logits” scores $z_i$ and $z_j$. Following Schroff et al. (2015) and Hermans et al. (2017), the two existing versions of Triplet loss, BatchAll and BatchHard, can be defined as follows, using the Euclidean distance for $D$.

BatchAll Triplet loss:

$$L_{BA} = \frac{1}{N_t} \sum_{a=1}^{N} \; \sum_{\substack{p=1 \\ y_p = y_a,\; p \ne a}}^{N} \; \sum_{\substack{n=1 \\ y_n \ne y_a}}^{N} h\big(m + D(\hat{z}_a, \hat{z}_p) - D(\hat{z}_a, \hat{z}_n)\big) \qquad (4)$$

where $N_t$ is the number of triplets. A triplet consists of an anchor, a positive sample, and a negative sample.

BatchHard Triplet loss:

$$L_{BH} = \frac{1}{N} \sum_{a=1}^{N} h\Big(m + \max_{\substack{p:\; y_p = y_a,\; p \ne a}} D(\hat{z}_a, \hat{z}_p) - \min_{\substack{n:\; y_n \ne y_a}} D(\hat{z}_a, \hat{z}_n)\Big) \qquad (5)$$

In Equations 4 and 5, $m$ is the margin, and $h$ indicates the function used to avoid revising “already correct” triplets. A hinge function ($h(x) = \max(x, 0)$) can be used in this circumstance. For instance, if a triplet already satisfies the condition that the distance between the anchor and the negative sample is larger than the distance between the anchor and the positive sample by at least the margin $m$, that triplet should be excluded from the training process by assigning it zero loss ($h(x) = 0$ if $x \le 0$, corresponding to $D(\hat{z}_a, \hat{z}_n) - D(\hat{z}_a, \hat{z}_p) \ge m$). However, as mentioned in Hermans et al. (2017), the softplus function ($h(x) = \ln(1 + e^x)$) gives better results than the hinge function. Thus, we use the softplus function, referred to as soft-margin, in all our experiments.
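As a quick illustration (a sketch, not the paper's code), the two choices of $h$ differ only in how they treat triplets that already satisfy the margin:

```python
import numpy as np

def hinge(x):
    """h(x) = max(x, 0): "already correct" triplets (x <= 0) get zero loss
    and therefore contribute zero gradient."""
    return np.maximum(x, 0.0)

def softplus(x):
    """Soft-margin h(x) = ln(1 + e^x): decays smoothly toward zero instead
    of cutting off, so easy triplets still contribute a small gradient."""
    return np.log1p(np.exp(x))
```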
While BatchAll considers all triplets, BatchHard only takes the hardest triplets into account. A hardest triplet consists of an anchor, the furthest positive sample, and the nearest negative sample relative to that anchor. The intuition behind BatchHard is that if we pull an anchor and its furthest positive sample together, the other positive samples of that anchor will obviously be pulled along as well. BatchHard is more computationally efficient than BatchAll. However, because the $\max$ and $\min$ functions are used in BatchHard, only the hardest triplets (anchors, furthest positive samples, and nearest negative samples) are taken into account when the network does backpropagation. We argue that it is beneficial if all samples are considered and contribute to updating the network parameters. Therefore, we introduce a novel variant of Triplet loss, called BatchMean Triplet loss, as follows:

$$L_{BM} = \frac{1}{N} \sum_{a=1}^{N} h\Big(m + \operatorname*{mean}_{\substack{p:\; y_p = y_a,\; p \ne a}} D(\hat{z}_a, \hat{z}_p) - \operatorname*{mean}_{\substack{n:\; y_n \ne y_a}} D(\hat{z}_a, \hat{z}_n)\Big) \qquad (6)$$

By using the “mean” function instead of the $\max$ and $\min$ functions, our proposed BatchMean Triplet loss not only retains the computational efficiency of BatchHard but also takes all samples into account. The efficacy of BatchMean Triplet loss is clarified in Section 4.5.
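The three variants can be compared side by side in a short NumPy sketch. This is illustrative only, not the paper's implementation; it assumes the soft-margin $h$, the Euclidean $D$, and that every sample has at least one positive and one negative in the batch.

```python
import numpy as np

def triplet_losses(z, labels, margin=0.0):
    """BatchAll / BatchHard / BatchMean Triplet losses on a batch of
    (normalized) "logits" scores z, with soft-margin h(x) = ln(1 + e^x)."""
    z, labels = np.asarray(z), np.asarray(labels)
    n = len(z)
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)  # Euclidean D
    same = labels[:, None] == labels[None, :]
    pos = same & ~np.eye(n, dtype=bool)   # positives, self excluded
    neg = ~same                           # negatives
    softplus = lambda x: np.log1p(np.exp(x))

    # BatchAll: every (anchor, positive, negative) triplet.
    all_terms = [margin + dist[a, p] - dist[a, q]
                 for a in range(n) for p in np.where(pos[a])[0]
                 for q in np.where(neg[a])[0]]
    batch_all = softplus(np.array(all_terms)).mean()

    # BatchHard: furthest positive and nearest negative per anchor.
    hard = [margin + dist[a][pos[a]].max() - dist[a][neg[a]].min()
            for a in range(n)]
    batch_hard = softplus(np.array(hard)).mean()

    # BatchMean (ours): mean positive and mean negative distance per anchor,
    # so every sample contributes to the gradient.
    mean = [margin + dist[a][pos[a]].mean() - dist[a][neg[a]].mean()
            for a in range(n)]
    batch_mean = softplus(np.array(mean)).mean()
    return batch_all, batch_hard, batch_mean
```

Because the max-positive minus min-negative gap is never smaller than the mean-positive minus mean-negative gap, BatchHard always upper-bounds BatchMean on the same batch.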
3.3.2 Contrastive Loss
Let $\mathrm{sim}(\hat{z}_i, \hat{z}_j)$ denote the similarity between two normalized “logits” scores $\hat{z}_i$ and $\hat{z}_j$. Referring to Chen et al. (2020a), we define the Contrastive loss applied in our work as follows:

$$L_{C} = \frac{1}{N_p} \sum_{\substack{a, p:\; y_p = y_a,\; p \ne a}} -\log \frac{\exp\big(\mathrm{sim}(\hat{z}_a, \hat{z}_p)/T\big)}{\sum_{k=1,\; k \ne a}^{N} \exp\big(\mathrm{sim}(\hat{z}_a, \hat{z}_k)/T\big)} \qquad (7)$$

where $N_p$ is the number of valid anchor–positive pairs, and $T$ is a constant denoting the temperature parameter. Note that if the $i$-th and $j$-th “logits” scores have the same label, there are two valid anchor–positive pairs: the $i$-th “logits” score can become the anchor with the $j$-th as the positive sample, and vice versa. The form of $L_C$ is referred to as the normalized temperature-scaled cross-entropy loss. The objective is to minimize $L_C$; this corresponds to maximizing the anchor–positive similarity and minimizing the anchor–negative similarity. Moreover, we want the anchor and positive sample to be as similar as possible; as a result, cosine similarity is a suitable choice for the similarity function $\mathrm{sim}$. For instance, if two “logits” score vectors are identical, the cosine similarity between them attains its maximum value of 1.
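A compact NumPy sketch of Equation 7 (illustrative, not the paper's implementation; `contrastive_loss` is an assumed name): after $\ell_2$ normalization, the dot product of two rows is exactly their cosine similarity, and each valid anchor–positive pair contributes one term.

```python
import numpy as np

def contrastive_loss(z, labels, temperature=0.5):
    """Normalized temperature-scaled cross-entropy on "logits" scores.
    Each anchor is paired with every same-label sample; the denominator
    runs over all other samples in the batch."""
    z = np.asarray(z, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # l2 normalization
    sim = z @ z.T / temperature                        # cosine similarity / T
    n = len(z)
    labels = np.asarray(labels)
    not_self = ~np.eye(n, dtype=bool)
    pos = (labels[:, None] == labels[None, :]) & not_self
    terms = []
    for a in range(n):
        # log of the denominator: all samples except the anchor itself
        denom = np.log(np.exp(sim[a][not_self[a]]).sum())
        for p in np.where(pos[a])[0]:
            terms.append(-(sim[a, p] - denom))
    return np.mean(terms)   # average over the N_p valid anchor-positive pairs
```

Batches whose same-label samples already produce similar “logits” vectors yield a lower loss than batches where they do not, which is the behavior Equation 7 rewards.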
4 Experiments
We evaluate the efficacy of RankingMatch on standard semi-supervised learning (SSL) benchmarks: CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), and STL-10 (Coates et al., 2011). We also conduct experiments on Tiny ImageNet (Stanford University, http://cs231n.stanford.edu/) to verify the performance of our method on a larger dataset. Our method is compared against MixMatch (Berthelot et al., 2019b), RealMix (Nair et al., 2019), ReMixMatch (Berthelot et al., 2019a), and FixMatch (Sohn et al., 2020). As recommended by Oliver et al. (2018), all methods should be implemented using the same codebase. However, due to limited computing resources, we only reimplemented MixMatch and FixMatch. Our target is not to reproduce the state-of-the-art results of these papers, but to make the comparison with our method as fair as possible.
4.1 Implementation Details
Unless otherwise noted, we utilize the Wide ResNet-28-2 network architecture (Zagoruyko and Komodakis, 2016) with 1.5 million parameters, and our experiments are trained for 128 epochs with a batch size of 64. Concretely, for RankingMatch we use the same set of hyperparameters (batch size, unlabeled-batch ratio $\mu$, loss weights, confidence threshold $\tau$, margin $m$, and temperature $T$) across all datasets and all amounts of labeled samples, except that a batch size of 32 ($B = 32$) is used for the STL-10 dataset. More details of the training protocol and hyperparameters are reported in Appendix D. In all our experiments, FixMatch (RA) and FixMatch (CTA) refer to FixMatch using RandAugment and CTAugment respectively (Sohn et al., 2020); RankingMatch (BM), RankingMatch (BH), RankingMatch (BA), and RankingMatch (C) refer to RankingMatch using BatchMean Triplet loss, BatchHard Triplet loss, BatchAll Triplet loss, and Contrastive loss respectively. For each benchmark dataset, our results are reported on the corresponding test set.
4.2 CIFAR-10 and CIFAR-100
Results with the same settings We first implement all methods using the same codebase and evaluate them under the same conditions to show how effective our method is. The results are shown in Table 1; note that different folds mean different random seeds. As shown in Table 1, RankingMatch outperforms all other methods across all numbers of labeled samples, especially with a small portion of labels. For example, on CIFAR-10, RankingMatch (BM) with 40 labels reduces the error rate by 29.61% and 4.20% compared to MixMatch and FixMatch (RA) respectively. The results also show that cosine similarity might be more suitable than Euclidean distance as the dimension of the “logits” score grows. For instance, on CIFAR-100, where the “logits” score is a 100-dimensional vector, RankingMatch (C) reduces the error rate by 1.07% and 1.19% compared to RankingMatch (BM) in the case of 2500 and 10000 labels respectively.
CIFAR-10  CIFAR-100
Method  40 labels  250 labels  4000 labels  400 labels  2500 labels  10000 labels
MixMatch  44.83±8.70  19.46±1.25  7.74±0.21  82.10±0.78  48.98±0.88  35.11±0.36
FixMatch (RA)  19.42±6.46  7.30±0.79  4.84±0.23  61.02±1.61  38.17±0.40  30.23±0.43
RankingMatch (BM)  15.22±4.51  6.77±0.89  4.76±0.17  60.59±2.05  38.26±0.39  30.46±0.24
RankingMatch (C)  16.66±2.77  7.26±1.20  4.81±0.33  64.26±0.80  37.19±0.55  29.27±0.30
CIFAR-10 with more training epochs Since FixMatch, which our RankingMatch is based on, was trained for 1024 epochs, we attempted to train our models for more epochs to make our results comparable. Our results on CIFAR-10, trained for 256 epochs, are reported on the left of Table 2. RankingMatch (BM) achieves a state-of-the-art result of 4.87% error rate with 250 labels, even though it is trained for fewer epochs than FixMatch. With 40 labels, RankingMatch (BM) performs worse than FixMatch (CTA) but reduces the error rate by 0.38% compared to FixMatch (RA). Note that both RankingMatch (BM) and FixMatch (RA) use RandAugment, so the results are comparable.
CIFAR-100 with a larger model We use a larger version of the Wide ResNet-28-2 network architecture, called Wide ResNet-28-2-Large, obtained by using more filters per layer. Specifically, the number of filters in the second group of layers is increased from 32 to 135, and similarly for the following groups with a multiplication factor of 2. This Wide ResNet-28-2-Large architecture with 26 million parameters was also used in MixMatch and FixMatch, making our results comparable. As shown on the right of Table 2, RankingMatch (C) achieves a state-of-the-art result of 22.35% error rate with 10000 labels, even though it is trained for only 128 epochs.
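The widening scheme above amounts to simple arithmetic on the per-group filter counts. The sketch below is illustrative (`wrn_group_widths` is a hypothetical helper, not from the paper's code):

```python
def wrn_group_widths(first_group_width, num_groups=3, factor=2):
    """Filter counts for the residual groups of a Wide ResNet variant:
    each group doubles the width of the previous one. The standard
    Wide ResNet-28-2 starts its first group at 32 filters, while the
    -Large variant described here starts at 135."""
    return [first_group_width * factor**i for i in range(num_groups)]
```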
CIFAR-10  CIFAR-100
Method  40 labels  250 labels  4000 labels  400 labels  2500 labels  10000 labels
MixMatch*  –  11.08±0.87  6.24±0.06  –  –  25.88±0.30
RealMix*  –  9.79±0.75  6.39±0.27  –  –  –
ReMixMatch*  –  6.27±0.34  5.14±0.04  –  –  –
FixMatch (RA)*  13.81±3.37  5.07±0.65  4.26±0.05  48.85±1.75  28.29±0.11  22.60±0.12
FixMatch (CTA)*  11.39±3.35  5.07±0.33  4.31±0.15  49.95±3.01  28.64±0.24  23.18±0.11
RankingMatch (BM)  13.43±2.33  4.87±0.08  4.29±0.03  49.57±0.67  29.68±0.60  23.18±0.03
RankingMatch (C)  14.98±3.06  5.13±0.02  4.32±0.12  56.90±1.47  28.39±0.67  22.35±0.10
4.3 SVHN and STL-10
SVHN The results for SVHN are shown in Table 4. We achieve state-of-the-art results of 2.24% and 2.23% error rate in the case of 250 and 1000 labels, respectively. With 40 labels, our results are worse than those of FixMatch; this may be excusable because our models were trained for 128 epochs while FixMatch's models were trained for 1024 epochs.
STL-10 STL-10 is a dataset designed for unsupervised learning, containing 5000 labeled images and 100000 unlabeled images. To deal with the higher resolution of the images in the STL-10 dataset (96×96), we add one more group to the Wide ResNet-28-2 network, resulting in a Wide ResNet-37-2 architecture with 5.9 million parameters. There are ten predefined folds with 1000 labeled images each, and Table 4 shows our results on three of these ten folds. The results of SWWAE and CC-GAN are cited from Zhao et al. (2015) and Denton et al. (2016) respectively. We achieve better results than numerous methods: our RankingMatch (BM) obtains an error rate of 5.96%, while the current state-of-the-art method (FixMatch) has error rates of 7.98% and 5.17% when using RandAugment and CTAugment respectively.
Method  40 labels  250 labels  1000 labels

MixMatch*  –  3.78±0.26  3.27±0.31
RealMix*  –  3.53±0.38  –
ReMixMatch*  –  3.10±0.50  2.83±0.30
FixMatch (RA)*  3.96±2.17  2.48±0.38  2.28±0.11
FixMatch (CTA)*  7.65±7.65  2.64±0.64  2.36±0.19
MixMatch  42.55±15.94  6.25±0.17  5.87±0.12
FixMatch (RA)  24.95±10.29  2.37±0.26  2.28±0.12
RankingMatch (BM)  21.02±8.06  2.24±0.07  2.32±0.07
RankingMatch (C)  27.20±2.90  2.33±0.06  2.23±0.11
Method  Error Rate
SWWAE*  25.67
CC-GAN*  22.21
MixMatch*  10.18±1.46
ReMixMatch*  6.18±1.24
FixMatch (RA)*  7.98±1.50
FixMatch (CTA)*  5.17±0.63
FixMatch (RA)  6.10±0.11
RankingMatch (BM)  5.96±0.07
RankingMatch (C)  7.55±0.37
4.4 Tiny ImageNet
Tiny ImageNet is a compact version of ImageNet, consisting of 200 classes. We use the Wide ResNet-37-2 architecture with 5.9 million parameters, as used for STL-10. Only 9000 out of 100000 training images are used as labeled data. While FixMatch obtains an error rate of 52.09±0.14%, RankingMatch (BM) and RankingMatch (C) achieve error rates of 51.47±0.25% and 49.10±0.41% respectively (reducing the error rate by 0.62% and 2.99% compared to FixMatch, respectively). Moreover, RankingMatch (C) has a better result than RankingMatch (BM), advocating the efficiency of cosine similarity in a high-dimensional space, as presented in Section 4.2.
4.5 Ablation Study
CIFAR-10  CIFAR-100  SVHN
Ablation  250 labels  4000 labels  2500 labels  10000 labels  250 labels  1000 labels
RankingMatch (BM)  5.50  4.49  37.79  30.14  2.11  2.23
RankingMatch (BM, without ℓ2 normalization)  70  310  730  580  30  2.29
RankingMatch (BH)  11.96  8.59  38.83  31.19  3.13  2.72
RankingMatch (BA)  24.17  12.05  49.06  34.96  3.00  3.45
RankingMatch (C)  5.76  4.64  36.53  28.99  2.25  2.19
RankingMatch (C, without ℓ2 normalization)  6.31  4.67  36.77  29.15  2.43  2.20
We carried out an ablation study, summarized in Table 5. Our BatchMean Triplet loss outperforms BatchHard and BatchAll Triplet loss. For example, on CIFAR-10 with 250 labels, RankingMatch (BM) reduces the error rate by roughly a factor of two and a factor of four compared to RankingMatch (BH) and RankingMatch (BA), respectively. We also show the efficacy of $\ell_2$ normalization, which contributes to the success of RankingMatch: applying $\ell_2$ normalization helps reduce the error rate of RankingMatch across all numbers of labels. Especially for RankingMatch (BM), if $\ell_2$ normalization is not used, the model may not converge due to the very large value of the loss.
4.6 Analysis
Qualitative results We visualize the “logits” scores of the methods, as shown in Figure 2. While MixMatch has many more overlapping points, the other three methods have class cyan (cat) and yellow (dog) close together. Interestingly, RankingMatch produces cluster shapes different from those of MixMatch and FixMatch; this might open a research direction for future work. More details about our visualization are presented in Appendix E.
Computational efficiency Figure 3 shows the training time per epoch of RankingMatch using BatchAll, BatchHard, or BatchMean Triplet loss on CIFAR-10, SVHN, and CIFAR-100. While the training time of BatchHard and BatchMean is stable across epochs, the training time of BatchAll gradually increases during the training process. Moreover, BatchAll requires more training time than BatchHard and BatchMean. For example, on SVHN, BatchAll has an average training time per epoch of 434.05 seconds, which is 126.25 and 125.82 seconds more than BatchHard and BatchMean respectively.
We also measured the GPU memory usage of the methods during the training process. On average, BatchAll occupies two times more GPU memory than BatchHard and BatchMean. For instance, on CIFAR-10, the GPU memory usage of BatchAll is 9039.72±2043.30 MB, while it is 4845.92±0.72 MB for BatchHard and BatchMean. More details are presented in Appendix F.
5 Conclusion
In this paper, we propose RankingMatch, a novel semi-supervised learning (SSL) method that unifies the ideas of the consistency regularization SSL approach and metric learning. Our method encourages the model to produce the same prediction not only for different augmented versions of the same input but also for samples from the same class. Delving into the objective functions of metric learning, we introduce a new variant of Triplet loss, called BatchMean Triplet loss, which has the advantage of computational efficiency while taking all samples into account. Extensive experiments show that our method performs well, achieving state-of-the-art results across many standard SSL benchmarks with various amounts of labeled data. For future work, we are interested in combining Triplet and Contrastive loss in a single objective function so that we can take advantage of both loss functions.
References
Berthelot et al. (2019a). ReMixMatch: Semi-supervised learning with distribution matching and augmentation anchoring. In International Conference on Learning Representations.
Berthelot et al. (2019b). MixMatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5049–5059.
Chen et al. (2020a). A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
Chen et al. (2020b). Learning to learn in a semi-supervised fashion. arXiv preprint arXiv:2008.11203.
Coates et al. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223.
Cubuk et al. (2020). RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703.
Denton et al. (2016). Semi-supervised learning with context-conditional generative adversarial networks. arXiv preprint arXiv:1611.06430.
DeVries and Taylor (2017). Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552.
Doersch et al. (2015). Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430.
Gidaris et al. (2018). Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
Grandvalet and Bengio (2005). Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pp. 529–536.
Hermans et al. (2017). In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
Hestness et al. (2017). Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.
Kaya and Bilge (2019). Deep metric learning: A survey. Symmetry 11(9), 1066.
Kolesnikov et al. (2019). Big Transfer (BiT): General visual representation learning. arXiv preprint arXiv:1912.11370.
Krizhevsky et al. (2009). Learning multiple layers of features from tiny images. Technical report.
Kulis et al. (2012). Metric learning: A survey. Foundations and Trends in Machine Learning 5(4), 287–364.
Laine and Aila (2016). Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
Lee (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3.
Loshchilov and Hutter (2016). SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
van der Maaten and Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605.
Mahajan et al. (2018). Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196.
Miyato et al. (2018). Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(8), 1979–1993.
Nair et al. (2019). RealMix: Towards realistic semi-supervised deep learning algorithms. arXiv preprint arXiv:1912.08766.
Netzer et al. (2011). Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Noroozi and Favaro (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
Oliver et al. (2018). Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pp. 3235–3246.
Raffel et al. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Sajjadi et al. (2016). Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 1163–1171.
Schroff et al. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
Sohn et al. (2020). FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.
Xie et al. (2020). Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698.
Zagoruyko and Komodakis (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
Zhang et al. (2019). Learning incremental triplet margin for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9243–9250.
Zhao et al. (2015). Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351.
Appendix A Details of Our Motivation and Argument
For our motivation of utilizing Ranking loss in semi-supervised image classification. FixMatch (Sohn et al., 2020) is a simple combination of existing semi-supervised learning (SSL) approaches, namely consistency regularization and pseudo-labeling. FixMatch, like other consistency-regularization approaches, only considers different perturbations of the same input: the model should produce unchanged predictions for different perturbations of the same input. However, this is not enough, and our work fulfills this shortcoming. Our main motivation is that different inputs of the same class (for example, two different cat images) should also yield similar model outputs. We showed that by simply integrating Ranking loss (especially our proposed BatchMean Triplet loss) into FixMatch, we can achieve promising results, as quantitatively shown in Section 4.
For our argument. We argue that images from the same class do not strictly have to have similar representations, but their model outputs should be as similar as possible. Our work aims to solve the image classification task. Basically, a model for image classification consists of two main parts: a feature extractor and a classification head. Given an image, the feature extractor is responsible for understanding the image and generating the image representation. The image representation is then fed into the classification head to produce the model output (the "logits" vector), which contains the scores for all classes.

If the feature extractor generates very similar image representations for images from the same class, the classification head benefits.

Otherwise, if these image representations are not entirely similar, the classification head has to expend more effort to produce similar model outputs for same-class images.
Therefore, the model outputs depend to some extent on the image representations. For image classification, the goal is to obtain similar model outputs for same-class images even when their image representations are not entirely similar. That is the main motivation for applying Ranking loss directly to the model outputs. Figure 4 illustrates the image representations and model outputs produced for same-class images.
As shown in Figure 4, given two images from the same class, the model can correctly predict the semantic labels and produce very similar model outputs even though the image representations are not entirely similar. For instance, two cat images can have model outputs with a cosine similarity of 0.9633 while the cosine similarity of the two corresponding image representations is only 0.6813. To support why applying Ranking loss directly to the model outputs is beneficial, we visualize the image representations and model outputs of our method on the CIFAR-10 dataset, as shown in Figure 5.
As illustrated in Figure 5(b), the model outputs of samples from the same class are clustered relatively well. As a result, the image representations of same-class samples are also clustered relatively well, as shown in Figure 5(a). Consequently, by forcing the model outputs of same-class samples to be as similar as possible, we obtain similar image representations as well.
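The gap between representation similarity and output similarity described above can be made concrete with a small numeric sketch. The vectors below are fabricated for illustration (chosen so the two similarities roughly echo the 0.68 vs 0.96 figures from Figure 4); only the cosine-similarity computation itself is standard:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 4-D representations of two same-class (cat) images:
# related, but clearly not identical.
rep_a = np.array([1.0, 0.2, 0.9, 0.1])
rep_b = np.array([0.3, 1.0, 0.8, 0.0])

# Made-up 3-class logits for the same two images: both peak strongly
# on the "cat" class, so the vectors are nearly parallel.
logits_a = np.array([5.1, 0.4, 0.2])
logits_b = np.array([4.8, 0.2, 0.6])

print(cosine_similarity(rep_a, rep_b))        # ~0.68: representations differ
print(cosine_similarity(logits_a, logits_b))  # ~0.99: outputs nearly agree
```

The point of the sketch: a loss applied to the logits can pull the second number toward 1 without demanding that the first one be 1 as well.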
Appendix B Comparison of Methods
As presented in Section 4, we evaluate our method against four methods: MixMatch (Berthelot et al., 2019b), RealMix (Nair et al., 2019), ReMixMatch (Berthelot et al., 2019a), and FixMatch (Sohn et al., 2020). The comparison of the methods is shown in Table 6. The four RankingMatch variants refer to RankingMatch with BatchAll Triplet, BatchHard Triplet, BatchMean Triplet, and Contrastive loss, respectively.
Method | Data augmentation | Pseudo-label post-processing | Ranking loss | Note
MixMatch | Weak | Sharpening | None | Uses squared loss for unlabeled data
RealMix | Weak & Strong | Sharpening & Confidence threshold | None | Uses training signal annealing (TSA) to avoid overfitting
ReMixMatch | Weak & Strong | Sharpening | None | Uses extra rotation loss for unlabeled data
FixMatch | Weak & Strong | Confidence threshold | None |
RankingMatch (BatchAll) | Weak & Strong | Confidence threshold | BatchAll Triplet loss |
RankingMatch (BatchHard) | Weak & Strong | Confidence threshold | BatchHard Triplet loss |
RankingMatch (BatchMean) | Weak & Strong | Confidence threshold | BatchMean Triplet loss |
RankingMatch (Contrastive) | Weak & Strong | Confidence threshold | Contrastive loss |
Appendix C RankingMatch Algorithm
The full algorithm of RankingMatch is provided in Algorithm 1. Note that the notations used in Algorithm 1 are defined in Section 3.3. Algorithm 1 illustrates the use of BatchMean Triplet loss. When using Contrastive loss, most parts of Algorithm 1 are kept unchanged, except that the Ranking losses for labeled and unlabeled data are replaced by Contrastive losses, as presented in Section 3.3.2.
Appendix D Details of Training Protocol and Hyperparameters
D.1 Optimizer and Learning Rate Schedule
We use the same codebase, data preprocessing, optimizer, and learning rate schedule for all methods implemented by us. An SGD optimizer with momentum is used for training the models. Additionally, we apply a cosine learning rate decay (Loshchilov and Hutter, 2016), which decays the learning rate by following a cosine curve. Given a base learning rate η, the learning rate at training step k is set to
η · cos(7πk / 16K)   (8)
where K is the total number of training steps.
Concretely, K is equal to the number of epochs multiplied by the number of training steps within one epoch. Finally, we use an Exponential Moving Average (EMA) of the model parameters to obtain the model for evaluation.
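The schedule and the EMA step above can be sketched as follows. The 7/16 constant is an assumption borrowed from FixMatch's schedule, whose training setup this paper follows (Loshchilov & Hutter's original restarts formulation differs), and the EMA decay of 0.999 is a typical value not confirmed by the text:

```python
import math

def cosine_decayed_lr(base_lr, step, total_steps):
    """Cosine learning rate decay: lr_k = base_lr * cos(7*pi*k / (16*K)).
    The rate follows a cosine curve and never reaches zero."""
    return base_lr * math.cos(7.0 * math.pi * step / (16.0 * total_steps))

def ema_update(ema_params, model_params, decay=0.999):
    """One Exponential Moving Average update of the evaluation model."""
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, model_params)]

total_steps = 128 * 700  # epochs * steps-per-epoch (illustrative numbers)
print(cosine_decayed_lr(0.03, 0, total_steps))            # 0.03 at step 0
print(cosine_decayed_lr(0.03, total_steps, total_steps))  # ~0.0059 at the end
```

With the base learning rate 0.03 from Appendix D.2, the rate decays smoothly to roughly a fifth of its initial value by the final step.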
D.2 List of Hyperparameters
For all our experiments, we use

A batch size of 64 for all datasets, except STL-10, which uses a batch size of 32,

Nesterov momentum with a momentum of 0.9,

A weight decay of 0.0005 and a base learning rate of 0.03.
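The optimizer settings above can be sketched as a single scalar update step. This follows PyTorch-style SGD semantics (weight decay folded into the gradient, Nesterov look-ahead); `sgd_nesterov_step` is an illustrative helper, not code from the paper:

```python
def sgd_nesterov_step(w, grad, buf, lr=0.03, momentum=0.9, weight_decay=5e-4):
    """One SGD step with Nesterov momentum; `buf` is the momentum buffer
    carried across steps. Hyperparameter defaults match the list above."""
    d = grad + weight_decay * w   # coupled weight decay
    buf = momentum * buf + d      # momentum buffer update
    d = d + momentum * buf        # Nesterov look-ahead
    return w - lr * d, buf

# A single scalar parameter, one step from a zero-initialized buffer.
w, buf = sgd_nesterov_step(1.0, 0.5, 0.0)
print(w, buf)
```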
For other hyperparameters, we first define notations as in Table 7.
Notation | Definition
T | Temperature parameter for sharpening used in MixMatch
K | Number of augmentations used when guessing labels in MixMatch
α | Hyperparameter for the Beta distribution used in MixMatch
τ | Confidence threshold used in FixMatch and RankingMatch for choosing high-confidence predictions
m | Margin used in RankingMatch with Triplet loss
t | Temperature parameter used in RankingMatch with Contrastive loss
λ_u | A hyperparameter weighting the contribution of the unlabeled examples to the training loss. In RankingMatch, λ_u determines the contribution of the Cross-Entropy loss of unlabeled data to the overall loss.
λ_r | A hyperparameter used in RankingMatch to determine the contribution of the Ranking loss term to the overall loss
The details of hyperparameters for all methods are shown in Table 8.
Method | λ_u | λ_r | T | K | α | τ | m | t | soft-margin | normalization
MixMatch | 75 | – | 0.5 | 2 | 0.75 | – | – | – | – | –
FixMatch | 1 | – | – | – | – | 0.95 | – | – | – | –
RankingMatch (Triplet) | 1 | 1 | – | – | – | 0.95 | 0.5 | – | True | True
RankingMatch (Contrastive) | 1 | 1 | – | – | – | 0.95 | – | 0.2 | – | True
D.3 Augmentation Details
For weak augmentation, we adopt the standard padding-and-cropping and horizontal-flipping augmentation strategies. We set the padding to 4 for CIFAR-10, CIFAR-100, and SVHN. Because STL-10 and Tiny ImageNet have larger image sizes, a padding of 12 and 8 is used for STL-10 and Tiny ImageNet, respectively. Notably, we did not apply horizontal flipping to the SVHN dataset.
For strong augmentation, we first randomly pick 2 out of 14 transformations. These 14 transformations consist of Autocontrast, Brightness, Color, Contrast, Equalize, Identity, Posterize, Rotate, Sharpness, ShearX, ShearY, Solarize, TranslateX, and TranslateY. Then, Cutout (DeVries and Taylor, 2017) is applied to obtain the final strongly-augmented sample. We set the cutout size to 16 for CIFAR-10, CIFAR-100, and SVHN. A cutout size of 48 and 32 is used for STL-10 and Tiny ImageNet, respectively. For more details about the 14 transformations used for strong augmentation, readers can refer to FixMatch (Sohn et al., 2020).
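The selection logic of the recipe above can be sketched as follows. Only the op-picking and cutout-size lookup are shown; the real transformations operate on images, and the image-size-to-cutout mapping assumes the standard dataset resolutions (32×32 for CIFAR/SVHN, 96×96 for STL-10, 64×64 for Tiny ImageNet):

```python
import random

# The 14 transformations listed above; each name stands in for the
# actual image operation (implementations omitted for brevity).
TRANSFORMATIONS = [
    "Autocontrast", "Brightness", "Color", "Contrast", "Equalize",
    "Identity", "Posterize", "Rotate", "Sharpness", "ShearX", "ShearY",
    "Solarize", "TranslateX", "TranslateY",
]

def strong_augmentation_plan(image_size, rng=random):
    """Pick 2 of the 14 ops at random, then finish with Cutout.
    Cutout sizes follow the text: 16 for 32x32 datasets, 48 for STL-10,
    32 for Tiny ImageNet."""
    ops = rng.sample(TRANSFORMATIONS, 2)  # two distinct ops
    cutout_size = {32: 16, 64: 32, 96: 48}[image_size]
    return ops + [f"Cutout({cutout_size})"]

print(strong_augmentation_plan(32))  # e.g. ['Rotate', 'Color', 'Cutout(16)']
```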
D.4 Dataset Details
CIFAR-10 and CIFAR-100 are widely used datasets consisting of color images. Each dataset contains 50000 training images and 10000 test images. Following standard practice, as mentioned in Oliver et al. (2018), we divide the training images into train and validation splits, with 45000 images for training and 5000 for validation. The validation split is used for hyperparameter tuning and model selection. In the train split, we discard all except a number of labels (40, 250, and 4000 labels for CIFAR-10; 400, 2500, and 10000 labels for CIFAR-100) to vary the labeled data set size.
SVHN is a real-world dataset containing 73257 training images and 26032 test images. We use a similar data strategy as for CIFAR-10 and CIFAR-100, dividing the training images into train and validation splits, with 65937 images for training and 7320 for validation. In the train split, we discard all except a number of labels (40, 250, and 1000 labels) to vary the labeled data set size.
STL-10 is a dataset designed for unsupervised learning, containing 5000 labeled training images and 100000 unlabeled images. There are ten predefined folds with 1000 labeled images each. Given a fold of 1000 labeled images, we use the other 4000 of the 5000 labeled training images as the validation split. The STL-10 test set has 8000 labeled images.
Tiny ImageNet is a compact version of ImageNet, comprising 100000 training images, 10000 validation images, and 10000 test images. Since the ground-truth labels of the test images are not available, we evaluate our method on the 10000 validation images and use them as the test set. There are 200 classes in Tiny ImageNet. We divide the training images into a train split of 90000 images and a validation split of 10000 images. For the semi-supervised learning setting, we use 10% of the train split as labeled data and treat the rest as unlabeled data, yielding 9000 labeled images and 81000 unlabeled images.
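The "discard all except a number of labels" step above can be sketched as follows. Class-balanced selection is an assumption here (a common convention in SSL benchmarks; the text does not spell out its exact sampling), and `make_ssl_split` is an illustrative helper:

```python
import numpy as np

def make_ssl_split(labels, num_labeled, num_classes, seed=0):
    """Keep `num_labeled` labels, balanced across classes. All training
    images remain available as unlabeled data; this only returns the
    indices of the examples whose labels are retained."""
    rng = np.random.RandomState(seed)
    per_class = num_labeled // num_classes
    labeled_idx = []
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        labeled_idx.extend(rng.choice(idx, per_class, replace=False))
    return np.array(labeled_idx)

# Toy stand-in for a train split: 100 images, 10 classes, keep 40 labels.
labels = np.repeat(np.arange(10), 10)
labeled_idx = make_ssl_split(labels, num_labeled=40, num_classes=10)
print(len(labeled_idx))  # 40 labeled images, 4 per class
```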
Appendix E Qualitative Results
E.1 RankingMatch versus Other Methods
To shed light on how the models have learned to classify the images, we visualize the "logits" scores using t-SNE, introduced by Maaten and Hinton (2008). t-SNE reduces high-dimensional features to a lower dimension, helping grasp the tendency of the learned models. We visualize the "logits" scores of four methods, MixMatch, FixMatch, and two RankingMatch variants, as shown in Figure 6. These four methods were trained on CIFAR-10 with 4000 labels for 128 epochs with the same random seed. At first glance at Figure 6, all four methods tend to group the points of the same class into the same cluster, depicted by the same color. The shape of the clusters differs among methods, and it is hard to say which method is best based on cluster shape alone. However, the fewer the overlapping points among classes, the better the method. We can easily see that MixMatch (Figure 6(a)) has more overlapping points than the other methods, leading to worse performance. This observation is consistent with the accuracy of the method. We quantify the overlapping points by computing confusion matrices, as shown in Figure 7.
If we pay closer attention to the t-SNE visualization in Figure 6, we can see that all methods have many overlapping points between class 3 (cat) and class 5 (dog). These overlapping points can be regarded as confusion points, where the model misclassifies. For example, as shown in the confusion matrices in Figure 7, MixMatch misclassifies 100 points as dog while they are actually cat; this number is 66, 60, or 64 in the case of FixMatch and the two RankingMatch variants, respectively. We leave studying the shape of the clusters and the relationship between the t-SNE visualization and the confusion matrix for future work.
E.2 RankingMatch with Variants of Triplet Loss
Figure 8 shows the t-SNE visualization of the "logits" scores of the models in Table 5, trained on CIFAR-10 with 4000 labels. Triplet loss utilizes a series of (anchor, positive, negative) triplets to satisfy its objective function: for each triplet, the loss is optimized to minimize the distance between the anchor and the positive while maximizing the distance between the anchor and the negative. The way the series of triplets is treated can therefore significantly affect how the model is updated. BatchAll, for instance, takes into account all possible triplets when calculating the loss. Since BatchAll treats all samples equally, it is likely to be biased by samples with predominant features, which might hurt performance. Supporting this argument, Figure 8(a) shows that BatchAll has numerous overlapping points and even achieves lower accuracy by a large margin compared to the others. Especially at the center of the figure, the model confuses almost all the labels. It is thus natural to argue that BatchAll is poor at generalizing to unseen data. BatchHard (Figure 8(b)) is better than BatchAll, but it still has many overlapping points at the center of the figure. Our BatchMean surpasses both BatchHard and BatchAll by clustering the classes much better, leading to the best accuracy among these methods. The confusion matrices shown in Figure 9 quantify the overlapping points, which can be regarded as confusion points where the model misclassifies.
E.3 RankingMatch with Normalization
We use the models reported in Table 5, trained on CIFAR-10 with 4000 labels. Notably, we do not visualize one of the RankingMatch variants without normalization because that model does not converge. The t-SNE visualizations of the "logits" scores of the RankingMatch models and the corresponding confusion matrices are shown in Figures 10 and 11, respectively. There is not much difference between RankingMatch with and without normalization in terms of cluster shape and overlapping points. However, in terms of accuracy, normalization does help improve classification performance, as shown in Table 5.
Appendix F Computational Efficiency of BatchMean Triplet Loss
As presented in Section 3.3,

BatchAll Triplet loss considers all possible triplets when computing the loss.

BatchHard Triplet loss only takes into account the hardest triplets when calculating the loss.

Our BatchMean Triplet loss only considers the “mean” triplets (consisting of anchors, “mean” positive samples, and “mean” negative samples) when computing the loss.
Because BatchMean does not consider all triplets but only the "mean" triplets, it shares the computational efficiency of BatchHard Triplet loss. On the other hand, since all samples are used to compute the "mean" samples, BatchMean also takes all samples into account, as done in BatchAll Triplet loss. The efficacy of BatchMean Triplet loss was demonstrated in Table 5, where it achieves the lowest error rates compared to the other methods. Therefore, this section focuses only on computational efficiency. Firstly, let us take a simple example to intuitively show the computational efficiency of BatchHard and BatchMean against BatchAll Triplet loss. Assume we have an anchor a, three positive samples corresponding to a: p1, p2, and p3, and two negative samples with respect to a: n1 and n2.

In BatchAll, six possible triplets are considered: (a, p1, n1), (a, p1, n2), (a, p2, n1), (a, p2, n2), (a, p3, n1), and (a, p3, n2).

BatchHard only takes into account the single hardest triplet (a, p_h, n_h), where p_h is the hardest (farthest) positive and n_h is the hardest (closest) negative.

Finally, in our BatchMean, only one "mean" triplet is considered: (a, p̄, n̄), where p̄ and n̄ are the means of the positive and negative samples, respectively.
As a result, BatchHard and BatchMean require fewer computations than BatchAll Triplet loss.
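The three selection strategies can be sketched for one anchor as follows. The margin value 0.5 matches the Triplet setting in Table 8, but the Euclidean distance, the hinge form, and the toy vectors are assumptions here; the paper applies these losses to batches of logits:

```python
import numpy as np

def triplet_losses(anchor, positives, negatives, margin=0.5):
    """Compare the three triplet-selection strategies for a single anchor,
    returning (batch_all, batch_hard, batch_mean) hinge losses."""
    d_pos = np.linalg.norm(positives - anchor, axis=1)
    d_neg = np.linalg.norm(negatives - anchor, axis=1)

    # BatchAll: every (positive, negative) pair -> 3 * 2 = 6 triplets.
    batch_all = np.mean([max(dp - dn + margin, 0.0)
                         for dp in d_pos for dn in d_neg])

    # BatchHard: hardest positive (farthest) vs hardest negative (closest).
    batch_hard = max(d_pos.max() - d_neg.min() + margin, 0.0)

    # BatchMean: one triplet built from the mean positive and mean negative.
    d_mean_pos = np.linalg.norm(anchor - positives.mean(axis=0))
    d_mean_neg = np.linalg.norm(anchor - negatives.mean(axis=0))
    batch_mean = max(d_mean_pos - d_mean_neg + margin, 0.0)
    return batch_all, batch_hard, batch_mean

# Anchor a with positives p1..p3 and negatives n1, n2, as in the example.
a = np.array([0.0, 0.0])
p = np.array([[0.1, 0.0], [0.0, 0.2], [0.2, 0.1]])
n = np.array([[0.5, 0.1], [0.1, 0.6]])
print(triplet_losses(a, p, n))
```

Note how BatchAll evaluates all six pairwise distances while BatchHard and BatchMean each reduce the batch to a single triplet before the hinge, which is exactly where their computational savings come from.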
To quantitatively demonstrate the computational efficiency of BatchHard and our BatchMean compared to BatchAll Triplet loss, we measure the training time and GPU memory usage, as presented in Appendices F.1 and F.2. We use the same hyperparameters for all methods to ensure a fair comparison. Notably, for clarity and simplicity, we use BatchAll, BatchHard, and BatchMean to denote the corresponding RankingMatch variants.
F.1 Measurement per Epoch
Table 9 shows the training time per epoch (seconds) and GPU memory usage (MB) of the methods over 128 epochs on CIFAR-10, SVHN, and CIFAR-100. As shown in Table 9, BatchHard and our BatchMean have similar training times and similar GPU memory usage across datasets. The results also show that BatchHard and BatchMean are more computationally efficient than BatchAll across all datasets. For example:

On SVHN, BatchHard and BatchMean reduce the training time per epoch by 126.25 and 125.82 seconds compared to BatchAll, respectively.

BatchAll occupies much more GPU memory than BatchHard and BatchMean: about 1.87, 1.85, and 1.79 times as much on CIFAR-10, SVHN, and CIFAR-100, respectively.
Training time per epoch (seconds)

Method | CIFAR-10 | SVHN | CIFAR-100
BatchAll | 385.87±15.35 | 434.05±20.07 | 323.03±8.81
BatchHard | 308.99±0.79 | 307.80±0.71 | 310.16±0.98
BatchMean | 309.27±0.74 | 308.23±0.57 | 309.33±1.03

GPU memory usage (MB)

Method | CIFAR-10 | SVHN | CIFAR-100
BatchAll | 9039.72±2043.30 | 8967.59±1535.99 | 8655.81±2299.36
BatchHard | 4845.92±0.72 | 4845.92±0.72 | 4847.86±0.97
BatchMean | 4845.92±0.72 | 4845.92±0.72 | 4847.86±0.97
The training time per epoch (seconds) and GPU memory usage (MB) measured across the 128 epochs are illustrated in Figure 12. In addition to being more computationally efficient than BatchAll, BatchHard and BatchMean have stable training times per epoch and stable GPU memory usage. The training time of BatchAll, on the other hand, gradually increases during training. In particular, there is a point at which the training time of BatchAll grows significantly, and this point differs among datasets. Moreover, the amount of computation BatchAll performs also appears to differ among datasets. These differences are clarified in the following section (Appendix F.2).
F.2 Measurement per Batch
In contrast to the previous section (Appendix F.1), this section measures the training time and monitors the GPU memory usage per training step, where each training step corresponds to training over one batch. Table 10 and Figure 13 show the measurements for the first 5100 training steps on CIFAR-10 and SVHN.
Method | CIFAR-10 time per batch (ms) | CIFAR-10 GPU memory (MB) | SVHN time per batch (ms) | SVHN GPU memory (MB)
BatchAll | 318.48±21.02 | 6643.71±2222.58 | 371.04±28.64 | 8974.89±2167.56
BatchHard | 300.97±7.28 | 4844.00±3.10 | 302.21±7.77 | 4843.97±3.11
BatchMean | 302.37±7.92 | 4843.98±3.11 | 303.18±7.87 | 4843.95±3.12
Table 10 demonstrates that BatchHard and our BatchMean are much more computationally efficient than BatchAll. For instance, on SVHN, BatchHard and BatchMean reduce the training time per batch by 68.83 and 67.86 milliseconds compared to BatchAll, respectively. Figure 13 reveals an even more interesting pattern.
In Figures 13(a) and 13(b), the "peak" values indicate the start of a new epoch, which involves some extra steps such as initialization and thus takes more time. As shown in Figure 13, BatchAll starts to require more computation from around the 2200th and 500th training steps on CIFAR-10 and SVHN, respectively. This is reasonable because we used a threshold to ignore low-confidence predictions for unlabeled data (Section 3.2). At the beginning of training, the model is not yet well trained and produces predictions with very low confidence, so many samples are discarded. As a result, there are few possible triplets for unlabeled data at the beginning of training, leading to fewer computations for BatchAll.
As training progresses, the model produces more high-confidence predictions, leading to more possible triplets and therefore more computation for BatchAll. Figure 13 also shows that the point at which BatchAll's computation starts to increase comes earlier on SVHN than on CIFAR-10. This is reasonable because SVHN only consists of digits from 0 to 9 and is thus simpler than CIFAR-10. Consequently, it is easier for the model to learn SVHN, yielding more high-confidence predictions and more possible triplets earlier in training than on CIFAR-10. Moreover, the training time per batch and GPU memory usage of BatchAll on SVHN are larger than those on CIFAR-10 over the first 5100 training steps. We can therefore argue that the less complex the dataset is, the earlier and the more computation BatchAll requires. This is also why we monitor the computational efficiency over more training steps on CIFAR-100.
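The growth in BatchAll's cost as more unlabeled samples pass the confidence threshold can be made concrete by counting the triplets it must evaluate; `count_batch_all_triplets` is an illustrative helper, not code from the paper:

```python
from collections import Counter

def count_batch_all_triplets(labels):
    """Number of (anchor, positive, negative) triplets BatchAll evaluates:
    for each anchor, every other same-class sample is a positive and every
    different-class sample is a negative."""
    counts = Counter(labels)
    n = len(labels)
    total = 0
    for c, k in counts.items():
        total += k * (k - 1) * (n - k)  # anchors * positives * negatives
    return total

# Early in training, few unlabeled samples pass the confidence threshold
# and almost no triplets exist; as confident samples accumulate, the
# triplet count grows rapidly.
print(count_batch_all_triplets([0, 1]))              # 0
print(count_batch_all_triplets([0, 0, 1]))           # 2
print(count_batch_all_triplets([0, 0, 1, 1, 2, 2]))  # 24
```

This cubic-style growth in the number of confident samples is consistent with the rising per-batch time and memory of BatchAll observed in Figure 13, while BatchHard and BatchMean stay flat because each anchor contributes one triplet regardless of batch composition.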
Since CIFAR-100 has 100 classes, it is more complex than CIFAR-10 and SVHN, so the model needs more training steps to become confident. Table 11 and Figure 14 show the training time per batch and GPU memory usage of the methods on CIFAR-100 over the first 31620 training steps. BatchHard and our BatchMean are still more computationally efficient than BatchAll, though the gap is smaller, especially for the training time per batch. Observing discernible changes requires monitoring the training time per batch and GPU memory usage over more training steps, as presented in the previous section (Appendix F.1). Figure 14 also shows that BatchAll starts to consume more computation at around the 17500th training step, much later than on CIFAR-10 and SVHN.
Method | Time per batch (ms) | GPU memory usage (MB)
BatchAll | 306.32±8.48 | 5255.91±593.29
BatchHard | 303.29±7.67 | 4846.80±2.58
BatchMean | 302.86±7.59 | 4846.77±2.57