1 Introduction
Deep neural networks (DNNs) have achieved impressive performance on tasks across a variety of domains, including vision [krizhevsky2012imagenet; simonyan2014very; he2016deep; he2017mask], speech recognition [hinton2012deep; hannun2014deep; deng2013new; noda2015audio], and natural language processing [young2018recent; hirschberg2015advances; conneau2016very; kumar2016ask]. However, these achievements often rely heavily on large-scale labelled datasets, requiring burdensome and expensive annotation efforts. This problem is especially acute in specialized domains such as medical image processing, where annotation may involve performing an invasive process on patients. To avoid the need for large numbers of labels in training DNNs, researchers have proposed unsupervised methods that operate solely with ubiquitously available unlabelled data. Such methods have made significant recent progress in the visual domain, where state-of-the-art unsupervised learning algorithms have begun to rival their supervised counterparts on large-scale transfer learning tests [caron2018deep; wu2018unsupervised; zhuang2019local; donahue2016adversarial; zhang2016colorful; noroozi2016unsupervised; noroozi2017representation].

While task-generic unsupervised methods may provide good starting points for feature learning, they must be adapted with at least some labelled data to solve any specific desired target task. Semi-supervised learning seeks to leverage limited amounts of labelled data, in conjunction with extensive unlabelled data, to bridge the gap between the unsupervised and fully supervised cases. Recent work in semi-supervised learning has shown significant promise [liu2018deep; iscen2019label; zhai2019s4l; miyato2018virtual; tarvainen2017mean; lee2013pseudo; grandvalet2005semi; qiao2018deep], although gaps to supervised performance levels remain significant, especially on large-scale datasets where very few labels are available. An important consideration is that many recently proposed semi-supervised methods rely on techniques whose efficiency scales poorly with dataset size, and thus cannot be readily applied to many real-world machine learning problems [liu2018deep; iscen2019label].

Here, we propose a novel semi-supervised learning algorithm that is specifically adapted for use with large, sparsely labelled datasets. This algorithm, termed Local Label Propagation (LLP), learns a nonlinear embedding of the input data and exploits the local geometric structure of the latent embedding space to help infer useful pseudo-labels for unlabelled datapoints. LLP borrows the framework of non-parametric embedding learning, which has recently shown utility in unsupervised learning [wu2018unsupervised; zhuang2019local], to first train a deep neural network that embeds labelled and unlabelled examples into a lower-dimensional latent space. LLP then propagates labels from known examples to unknown datapoints, weighting the likelihood of propagation by a factor involving the local density of known examples. The neural network embedding is then optimized to categorize all datapoints according to their pseudo-labels (with stronger emphasis on true known labels), while simultaneously encouraging datapoints sharing the same (pseudo-)labels to aggregate in the latent embedding space. The resulting embedding thus gathers both labelled images within the same class and unlabelled images sharing statistical similarities with the labelled ones. By iteratively applying the propagation and network training steps, the LLP algorithm builds a good underlying representation for supporting downstream tasks, and trains an accurate classifier for the specific desired task.
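The propagate/train alternation just described can be sketched in a few lines. The toy below is purely illustrative: the helper names (`embed`, `propagate`) are hypothetical, a fixed random projection stands in for a trained network, and nearest-labelled-neighbor propagation stands in for the density-weighted scheme of §3. It shows only the control flow, not the actual LLP losses.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(X, W):
    """Project inputs and L2-normalize, so embeddings lie on the unit sphere."""
    V = X @ W
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def propagate(V, labeled_idx, y_labeled):
    """Give each point the label of its nearest labelled embedding, with
    confidence = cosine similarity to that neighbor (1.0 for labelled points)."""
    sims = V @ V[labeled_idx].T              # cosine similarities to labelled set
    nearest = sims.argmax(axis=1)
    pseudo = y_labeled[nearest]
    conf = sims.max(axis=1)
    pseudo[labeled_idx] = y_labeled          # known labels are kept as-is
    conf[labeled_idx] = 1.0
    return pseudo, conf

X = rng.normal(size=(100, 16))               # toy dataset
labeled_idx = np.arange(10)                  # first 10 points are labelled
y_labeled = rng.integers(0, 3, size=10)
W = rng.normal(size=(16, 8))                 # stand-in for network weights

for step in range(3):                        # propagate / (re)train alternation
    V = embed(X, W)
    pseudo, conf = propagate(V, labeled_idx, y_labeled)
    # ...the real algorithm would now update W by minimizing the
    # confidence-weighted classification + aggregation loss of Eq. (4)...
```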
We apply the LLP procedure in the context of object categorization on the ImageNet dataset [deng2009imagenet], learning a high-performing network while discarding most of the known labels. The LLP procedure substantially outperforms previous state-of-the-art semi-supervised algorithms that are sufficiently scalable to be applied to ImageNet [zhai2019s4l; miyato2018virtual; tarvainen2017mean; lee2013pseudo; grandvalet2005semi; qiao2018deep], with gains that are consistent across a wide variety of training regimes. LLP-trained features also support improved transfer to Places205, a large-scale scene-recognition task. In the sections that follow, we first discuss related literature (§2), describe the LLP method (§3), show experimental results (§4), and present analyses that provide insights into the learning procedure and justification of key parameter choices (§5).
2 Related Work
Below we describe conceptual relationships between our work and recent related approaches, and identify relevant major alternatives for comparison.
Deep Label Propagation. Like LLP, Deep Label Propagation (DLP) [iscen2019label] also iterates between steps of label propagation and neural network optimization. In contrast to LLP, the DLP label propagation scheme is based on computing pairwise similarity matrices of learned visual features across all (unlabelled) examples. Unlike in LLP, the DLP loss function is simply classification with respect to pseudo-labels, without any additional aggregation terms ensuring that the pseudo-labelled and true-labelled points have similar statistical structure. The DLP method is effective on comparatively small datasets, such as CIFAR-10 and Mini-ImageNet. However, DLP is challenging to apply to large-scale datasets such as ImageNet, since its label propagation method requires computing similarities between all pairs of datapoints, and is not readily parallelizable. In contrast, LLP is $O(N_L \cdot N)$, where $N_L$ is the number of labelled datapoints and $N$ the total number of datapoints, and is easily parallelized, making its effective complexity $O(N_L \cdot N / P)$, where $P$ is the number of parallel processes. In addition, DLP uniformly propagates labels across the network's implied embedding space, while LLP's use of local density-driven propagation weights specifically exploits the geometric structure of the learned embedding space, improving pseudo-label inference.

Deep Metric Transfer and Pseudolabels. The Deep Metric Transfer (DMT) [liu2018deep] and Pseudolabels [lee2013pseudo] methods both use non-iterative two-stage procedures. In the first stage, the representation is initialized either with a self-supervised task such as non-parametric instance recognition (DMT), or via direct supervision on the known labels (Pseudolabels). In the second stage, pseudo-labels are obtained either by applying a label propagation algorithm (DMT) or naively from the pretrained classifier (Pseudolabels), and these are then used to fine-tune the network. As in DLP, the label propagation algorithm used by DMT cannot be applied to large-scale datasets, and does not specifically exploit local statistical features of the learned representation. While more scalable, the Pseudolabels approach achieves comparatively poor results. A key point of contrast between LLP and the two-stage methods is that in LLP, the representation learning and label propagation processes interact via the iterative training process, an important driver of LLP's improvements.
Self-Supervised Semi-Supervised Learning. Self-Supervised Semi-Supervised Learning (S4L) [zhai2019s4l] co-trains a network using self-supervised methods on unlabelled images and a traditional classification loss on labelled images. Unlike LLP, S4L simply "copies" self-supervised learning tasks as parallel co-training loss branches. In contrast, LLP involves a non-trivial interaction between known and unknown labels via label propagation and the combination of categorization and aggregation terms in the shared loss function, both factors that are important for improved performance.
Consistency-based regularization. Several recent semi-supervised methods rely on data-consistency regularizations. Virtual Adversarial Training (VAT) [miyato2018virtual] adds small input perturbations, requiring outputs to be robust to the perturbation. Mean Teacher (MT) [tarvainen2017mean] requires the learned representation to be similar to its exponential moving average during training. Deep Co-Training (DCT) [qiao2018deep] requires the outputs of two views of the same image to be similar, while ensuring outputs vary widely across adversarial pairs. These methods all use unlabelled data in a "pointwise" fashion, applying the proposed consistency metric separately to each datapoint. They thus differ significantly from LLP, or indeed any method that explicitly relates unlabelled to labelled points. LLP benefits from training a shared embedding space that aggregates statistically similar unlabelled datapoints together with their (putative) labelled counterparts. As a result, increasing the number of unlabelled images consistently increases the performance of LLP, unlike for the Mean Teacher method.
3 Methods
We first give an overview of the LLP method. At a high level, LLP learns a model $f_\theta$ from labelled examples $\mathcal{X}_L = \{x_1, \ldots, x_{N_L}\}$, their associated labels $\{y_1, \ldots, y_{N_L}\}$, and unlabelled examples $\mathcal{X}_U = \{x_{N_L+1}, \ldots, x_N\}$. $f_\theta$ is realized via a deep neural network whose parameters $\theta$ are network weights. For each input $x$, $f_\theta$ generates two outputs (Fig. 1): an "embedding output", realized as a vector $v$ on the $d$-dimensional unit sphere, and a category prediction output $z$. In learning $\theta$, the LLP procedure repeatedly alternates between two steps: label propagation and representation learning. First, known labels are propagated from $\mathcal{X}_L$ to $\mathcal{X}_U$, creating pseudo-labels $\{\tilde{y}_i\}$. Then, network parameters are updated to minimize a loss function balancing category prediction accuracy, evaluated on the $z$ outputs, with a metric of statistical consistency, evaluated on the $v$ outputs.

In addition to pseudo-labels, the label propagation step also generates $[0,1]$-valued confidence scores $c_i$ for each example $x_i$. For labelled points, confidence scores are automatically set to 1, while for pseudo-labelled examples, confidence scores are computed from the local geometric structure of the embedded points, reflecting how close the embedding vectors of the pseudo-labelled points are to those of their putative labelled counterparts. The confidence values are then used as loss weights during representation learning.
Representation Learning. Assume that datapoints $\{x_i\}$, labels and pseudo-labels $\{\tilde{y}_i\}$, and confidences $\{c_i\}$ are given. Let $\{v_i\}$ denote the set of corresponding embedded vectors, and $\{z_i\}$ the set of corresponding category prediction outputs. In the representation learning step, we update the network parameters $\theta$ by simultaneously minimizing the standard cross-entropy loss between predicted and propagated pseudo-labels, and optimizing a global aggregation metric that enforces overall consistency between known labels and pseudo-labels.
The definition of $\mathcal{L}_{\mathrm{agg}}$ is based on the non-parametric softmax operation proposed by Wu et al. [wu2018unsupervised; wu2018improving], which defines the probability that an arbitrary embedding vector $v$ is recognized as the $i$-th example as:

$$P(i \mid v) = \frac{\exp(v_i^\top v / \tau)}{\sum_{j=1}^{N} \exp(v_j^\top v / \tau)}, \tag{1}$$
where the temperature $\tau$ is a fixed hyperparameter. For a subset $A \subseteq \{1, \ldots, N\}$, the probability of a given $v$ being recognized as an element of $A$ is:

$$P(A \mid v) = \sum_{i \in A} P(i \mid v). \tag{2}$$
We then define the aggregation metric as the (negative) log likelihood that $v_i$ will be recognized as a member of the set $A_i = \{j : \tilde{y}_j = \tilde{y}_i\}$ of examples sharing its pseudo-label:

$$\mathcal{L}_{\mathrm{agg}}(v_i) = -\log P(A_i \mid v_i). \tag{3}$$
Optimizing $\mathcal{L}_{\mathrm{agg}}$ encourages the embedding corresponding to a given datapoint to selectively become close to the embeddings of other datapoints with the same pseudo-label (Fig. 1). The cross-entropy and aggregation loss terms are scaled on a per-example basis by the confidence score, and an $L_2$ weight regularization penalty is added. Thus, the final loss for example $x_i$ is:

$$\mathcal{L}_i = c_i \left[ \mathcal{L}_{\mathrm{ce}}(z_i, \tilde{y}_i) + \mathcal{L}_{\mathrm{agg}}(v_i) \right] + \lambda \lVert \theta \rVert_2^2, \tag{4}$$

where $\lambda$ is a regularization hyperparameter.
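A minimal numpy sketch of Eqs. (1), (3), and the confidence weighting in Eq. (4) on toy data can make the mechanics concrete. The cross-entropy and weight-decay terms are omitted, and the array names are illustrative rather than taken from any released code:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, tau = 6, 4, 0.07

V = rng.normal(size=(N, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)     # embeddings on the unit sphere
pseudo = np.array([0, 0, 1, 1, 2, 2])             # (pseudo-)labels
conf = np.array([1.0, 0.8, 1.0, 0.6, 1.0, 0.9])   # confidence scores c_i

def p_instance(v, V, tau):
    """Eq. (1): probability that v is recognized as each stored example."""
    logits = V @ v / tau
    e = np.exp(logits - logits.max())             # numerically stable softmax
    return e / e.sum()

def agg_loss(i):
    """Eq. (3): -log P(A_i | v_i), A_i = examples sharing i's pseudo-label."""
    p = p_instance(V[i], V, tau)
    A = pseudo == pseudo[i]
    return -np.log(p[A].sum())

# Confidence-weighted aggregation part of Eq. (4), per example.
losses = np.array([conf[i] * agg_loss(i) for i in range(N)])
```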
Label Propagation. We now describe how LLP generates the pseudo-labels $\tilde{y}_i$ and confidence scores $c_i$. To understand our actual procedure, it is useful to start from the weighted K-nearest-neighbor classification algorithm [wu2018unsupervised], in which a "vote" is obtained from the top $k$ nearest labelled examples of each unlabelled example $x$, denoted $\mathcal{N}_k(v)$. The vote of each $i \in \mathcal{N}_k(v)$ is weighted by the corresponding probability $P(i \mid v)$ that $x$ will be identified as example $i$. Assuming $C$ classes, the total weight for $x$ being pseudo-labelled as class $j$ is thus:

$$w_j(v) = \sum_{i \in \mathcal{N}_k(v),\; y_i = j} P(i \mid v). \tag{5}$$
Therefore, the probability that datapoint $x$ is of class $j$, the associated inferred pseudo-label $\tilde{y}$, and the corresponding confidence $c$, may be defined as:

$$p_j(v) = \frac{w_j(v)}{\sum_{j'=1}^{C} w_{j'}(v)}, \qquad \tilde{y} = \arg\max_j\, p_j(v), \qquad c = \max_j\, p_j(v). \tag{6}$$
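The vote of Eqs. (5)-(6) is easy to sketch for a single unlabelled point. Here `p` is assumed to hold the probabilities $P(i \mid v)$ restricted to the labelled examples, and the function name is illustrative:

```python
import numpy as np

def knn_pseudo_label(p, y_labeled, k, num_classes):
    """Weighted k-NN vote: return (class probabilities, pseudo-label, confidence)."""
    top = np.argsort(p)[-k:]                 # k most likely labelled examples
    w = np.zeros(num_classes)
    for i in top:                            # Eq. (5): accumulate per-class weights
        w[y_labeled[i]] += p[i]
    probs = w / w.sum()                      # Eq. (6): normalize, label, confidence
    return probs, probs.argmax(), probs.max()

p = np.array([0.05, 0.30, 0.10, 0.25, 0.02, 0.28])   # P(i | v) for 6 labelled points
y_labeled = np.array([0, 1, 0, 1, 2, 2])
probs, pseudo, conf = knn_pseudo_label(p, y_labeled, k=3, num_classes=3)
```

With these toy numbers, the three strongest neighbors are two class-1 points (0.30 + 0.25) and one class-2 point (0.28), so the vote picks class 1.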
Although intuitive, weighted-KNN ignores all the other unlabelled examples in the embedding space when inferring $\tilde{y}$ and $c$. Fig. 1c depicts a scenario where this can be problematic: the embedded point $v$ is near examples of three different known classes (red, green, and blue points) that are of similar distance to $v$, but each having different local data densities. If we directly use $P(i \mid v)$ as the weights for these three labelled neighbors, they will contribute similar weights when calculating $w_j$. However, we should expect a higher weight from the lower-density red neighbor. The lower density indicates that the prior probability that any given point near $v$ is identified as the red instance is lower than for the other possible labels. However, this information is not reflected in $P(i \mid v)$, which represents the joint probability that $v$ is red. We should instead use the posterior probability as the vote weight, which by Bayes' theorem means that $P(i \mid v)$ should be divided by its prior. To formalize this reasoning, we replace $P(i \mid v)$ in the definition of $w_j$ with a locally weighted probability:

$$\tilde{P}(i \mid v) = \frac{P(i \mid v)}{\sum_{v' \in \mathcal{N}_m(v_i)} P(i \mid v')}, \tag{7}$$

where $\mathcal{N}_m(v_i)$ are the $m$ nearest neighbors of $v_i$ and the denominator is a measure of the local embedding density around $v_i$. For consistency, we replace $\mathcal{N}_k$ with $\tilde{\mathcal{N}}_k$, which contains the $k$ labelled neighbors with highest locally weighted probability, to ensure that the votes come from the most relevant labelled examples. The final form of the LLP propagation weight equation is thus:

$$w_j(v) = \sum_{i \in \tilde{\mathcal{N}}_k(v),\; y_i = j} \tilde{P}(i \mid v). \tag{8}$$
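One plausible numpy rendering of the density correction in Eq. (7), under the assumption that the denominator sums the probability mass a labelled example receives from its $m$ nearest embedded neighbors; the matrix `P` and the neighbor selection rule below are illustrative, not the paper's exact implementation:

```python
import numpy as np

def locally_weighted(P, m):
    """P[a, i] = P(i | v_a) for all points a and labelled examples i.
    Divide each column by a local-density estimate: the summed probability
    mass that labelled example i receives from its m strongest neighbors."""
    density = np.sort(P, axis=0)[-m:].sum(axis=0)  # per-example local density
    return P / density                             # Eq. (7): posterior-style weight

# Toy example: labelled example 0 sits in a dense region (large column mass),
# so its raw votes are down-weighted relative to sparser examples 1 and 2.
P = np.array([[0.50, 0.10, 0.10],
              [0.45, 0.15, 0.05],
              [0.40, 0.05, 0.20]])
Pw = locally_weighted(P, m=2)
```

After the correction, the gap between the dense example's vote and the sparse examples' votes shrinks, matching the intuition illustrated in Fig. 1c.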
Memory Bank. Both the label propagation and representation learning steps implicitly require access to all the embedded vectors $\{v_i\}$ at every computational step. However, recomputing these vectors rapidly becomes intractable as dataset size increases. We address this issue by approximating the real-time embeddings with a memory bank that keeps a running average of the embeddings. As this procedure is directly taken from [wu2018unsupervised; wu2018improving; zhuang2019local], we refer readers to these works for a detailed description.
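A memory bank of this kind can be sketched as a running average with re-normalization onto the unit sphere. The momentum value below is illustrative rather than the one used in the cited works:

```python
import numpy as np

class MemoryBank:
    """Keep a smoothed copy of all embeddings; update only entries seen in a batch."""

    def __init__(self, n, d, momentum=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.bank = rng.normal(size=(n, d))
        self.bank /= np.linalg.norm(self.bank, axis=1, keepdims=True)
        self.momentum = momentum

    def update(self, idx, new_embeddings):
        # Exponential moving average, then re-project onto the unit sphere.
        mixed = self.momentum * self.bank[idx] + (1 - self.momentum) * new_embeddings
        self.bank[idx] = mixed / np.linalg.norm(mixed, axis=1, keepdims=True)

bank = MemoryBank(n=100, d=8)
fresh = np.ones((4, 8)) / np.sqrt(8.0)        # a batch of new unit-norm embeddings
bank.update(np.array([0, 1, 2, 3]), fresh)
```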
4 Results
We first evaluate the LLP method on visual object categorization in the large-scale ImageNet dataset [deng2009imagenet], under a variety of training regimes. We also illustrate transfer learning to Places205 [zhou2014learning], a large-scale scene-recognition dataset.
Experimental settings. Following [wu2018unsupervised; zhuang2019local], the embedding dimension is $d = 128$ and the temperature is $\tau = 0.07$. Optimization uses SGD with momentum of 0.9, batch size of 128, and weight-decay coefficient $\lambda$. The learning rate is initialized to 0.03 and then dropped by a factor of 10 whenever validation performance saturates. Depending on the training regime specifics (how many labelled and unlabelled examples), training takes 200–400 epochs, comprising three learning rate drops. Similarly to [zhuang2019local], we initialize networks with the IR loss function [wu2018unsupervised] in a completely unsupervised fashion for 10 epochs, and then switch to the LLP method. Most hyperparameters are directly taken from [wu2018unsupervised; wu2018improving; zhuang2019local], although as shown in [zhai2019s4l], a hyperparameter search could potentially improve performance. In the label propagation stage, we set the neighborhood sizes $k$ and $m$ (these choices are justified in Section 5). The density estimate in the denominator of Eq. 7 is recomputed for all labelled images at once at the end of every epoch. For each training regime, we train both ResNet-18v2 and ResNet-50v2 [he2016identity] architectures, with an additional fully connected layer added alongside the standard softmax categorization layer to generate the embedding output.

ImageNet with varying training regimes. We train on ImageNet with $N_L$ labels and $N$ total images available; different training regimes are defined by the choices of $N_L$ and $N$. Results for each regime are shown in Tables 1–3. Due to the inconsistency of reporting metrics across different papers, we alternate between comparing top-1 and top-5 accuracy, depending on which metric was reported in the relevant previous work.
The results show that: 1. LLP significantly outperforms previous state-of-the-art methods by large margins within all training regimes tested, regardless of network architecture, number of labels, and number of available unlabelled images; 2. LLP shows especially large improvements over other methods when only a small number of labels are known. For example, ResNet-18 trained using LLP with only 3% of labels achieves 53.24% top-1 accuracy, which is 12.43% better than Mean Teacher; 3. unlike Mean Teacher, for which the number of unlabelled images appears essentially irrelevant, LLP consistently benefits from additional unlabelled images (see Table 3), and is not yet saturated using all the images in the ImageNet set. This suggests that with additional unlabelled data, LLP could potentially achieve even better performance.
Table 1: ImageNet top-1 accuracy (%) of ResNet-18 trained with varying fractions of labels.

| Method | 1% labels | 3% labels | 5% labels | 10% labels |
| --- | --- | --- | --- | --- |
| Supervised | 17.35 | 28.61 | 36.01 | 47.89 |
| DCT [qiao2018deep] | – | – | – | 53.50 |
| MT [tarvainen2017mean] | 16.91 | 40.81 | 48.34 | 56.70 |
| LLP (ours) | 27.14 | 53.24 | 57.04 | 61.51 |
Table 2: ImageNet top-5 accuracy (%) with 1% and 10% of labels.

| # labels | Supervised | Pseudolabels | VAT | VAT+EM [grandvalet2005semi] | S4L* | MT | LLP (ours) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1% | 48.43 | 51.56 | 44.05 | 46.96 | 53.37 | 40.54 | 61.89 |
| 10% | 80.43 | 82.41 | 82.78 | 83.39 | 83.82 | 85.42 | 88.53 |
Table 3: Top-1 accuracy (%) with 10% of labels and varying fractions of the unlabelled images.

| Method | 30% unlabelled | 70% unlabelled | 100% unlabelled |
| --- | --- | --- | --- |
| MT | 56.07 | 55.59 | 55.65 |
| LLP (ours) | 58.62 | 60.27 | 61.51 |
Table 4: Transfer learning accuracy (%) on Places205.

| # labels | Supervised | Pseudolabels | VAT | VAT+EM | S4L* | MT | LLP (ours) | LA* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10% | 44.7 | 48.2 | 45.8 | 46.2 | 46.6 | 46.4 | 50.4 | 48.3 |
| 1% | 36.2 | 41.8 | 35.9 | 36.4 | 38.0 | 31.6 | 44.6 | – |
Table 5: Accuracy (%) on 50-category, 50-images-per-category ImageNet subsets. ± values are standard deviations over 10 independent data subsamples.

| Method | LS [zhou2004learning] | LP [zhu2002learning] | LP-DMT [liu2018deep] | LP-DLP [iscen2019label] | LLP (ours) |
| --- | --- | --- | --- | --- | --- |
| Perf. | 84.6 ± 3.4 | 87.7 ± 2.2 | 88.2 ± 2.3 | 89.2 ± 2.4 | 88.1 ± 2.3 |
Table 6: Nearest-neighbor validation performance (%) for ablation models.

| Model | Top50woc | Top50 | Top20 | Top10 | Top5 | Top50wc | Top50lw | Top10wclw |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NN perf. | 52.43 | 54.37 | 54.82 | 55.42 | 55.44 | 55.46 | 56.27 | 57.54 |
Transfer learning to Scene Recognition. To evaluate the quality of our learned representation on downstream tasks other than ImageNet classification, we assess its transfer learning performance on the Places205 [zhou2014learning] dataset. This dataset has 2.45 million images in 205 distinct scene categories. We fix the nonlinear weights learned on ImageNet, add a linear readout layer on top of the penultimate layer, and train the readout with a cross-entropy loss using SGD as above. The learning rate is initialized at 0.01 and dropped by a factor of 10 whenever validation performance on Places205 saturates. Training requires approximately 500,000 steps, comprising two learning rate drops. We only evaluate our ResNet-50 trained with 1% or 10% of labels, as previous work [zhai2019s4l] reported performance with that architecture. Table 4 shows that LLP again significantly outperforms previous state-of-the-art results. It is notable, however, that when trained with only 1% of ImageNet labels, all current semi-supervised learning methods show somewhat worse transfer performance to Places205 than the Local Aggregation (LA) method, the current state-of-the-art unsupervised learning method [zhuang2019local].
5 Analysis
To better understand the LLP procedure, we analyze both the final learned embedding space, and how it changes during training.
Emerging clusters during training. Intuitively, the aggregation term in Eq. 3 should cause embedding outputs with the same ground-truth label, whether known or propagated, to cluster together during training. Indeed, visualization (Fig. 2a) shows clustering becoming more pronounced along the training trajectory for both labelled and unlabelled datapoints, while unlabelled datapoints surround labelled datapoints increasingly densely. A simple metric measuring the aggregation of a group of embedding vectors is the L2 norm of the group mean, which, since all embeddings lie on the 128-dimensional unit sphere, is inversely related to the group dispersion. Computing this metric separately for each ImageNet category and averaging across categories, we obtain a quantitative description of aggregation over the learning time course (Fig. 2b), further supporting the conclusion that LLP embeddings become increasingly clustered.
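The aggregation metric just described is straightforward to compute. The sketch below contrasts a tightly clustered toy category with a dispersed one (function and variable names are illustrative):

```python
import numpy as np

def aggregation_score(V, labels):
    """Mean, over categories, of the L2 norm of each category's mean embedding.
    For unit-sphere embeddings this is near 1 when clustered, near 0 when dispersed."""
    scores = []
    for c in np.unique(labels):
        mean_vec = V[labels == c].mean(axis=0)
        scores.append(np.linalg.norm(mean_vec))
    return float(np.mean(scores))

rng = np.random.default_rng(2)
tight = rng.normal(size=(50, 16)) * 0.05 + 1.0   # points near one common direction
tight /= np.linalg.norm(tight, axis=1, keepdims=True)
spread = rng.normal(size=(50, 16))               # points in random directions
spread /= np.linalg.norm(spread, axis=1, keepdims=True)
labels = np.zeros(50, dtype=int)                 # one toy category
```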
Effect of architecture. We also investigate how network architecture influences the learning trajectory and the final representation, comparing ResNet-50 and ResNet-18 trained with 10% of labels (Fig. 2b–d). The more powerful ResNet-50 achieves a more clustered representation than ResNet-18, both at all time points during training and for almost every ImageNet category.
Category structure analysis: successes, failures, and subcategory discovery. It is instructive to systematically analyze statistical patterns on a per-category basis. To illustrate this, we visualize the embeddings for three representative categories with 2D multidimensional scaling (MDS). For an "easy" category with a high aggregation score (Fig. 2e), the LLP embedding identifies images with strong semantic similarity, supporting successful semi-supervised image retrieval. For a "hard" category with a low aggregation score (Fig. 2f), image statistics vary much more and the embedding fails to properly cluster examples together. Most interestingly, for multimodal categories with intermediate aggregation scores (Fig. 2g), the learned embedding can reconstruct semantically meaningful subclusters even when these are not present in the original labelling, e.g. the "labrador" category decomposing into "black" and "yellow" subcategories.

Comparison to global propagation in the small-dataset regime. To understand how LLP compares to methods that use global similarity information, but therefore lack scalability to large datasets, we test several such methods on ImageNet subsets containing 50 categories and 50 images per category. Table 5 shows that local propagation methods can be effective even in this regime, as LLP's performance is comparable to that of the global propagation algorithm used in DMT [liu2018deep] and only slightly lower than that of DLP [iscen2019label].
Ablations. To justify key parameters and design choices, we conduct a series of ablation studies exploring the following alternatives: 1. different values for $k$ (experiments Top50, Top20, Top10, and Top5 in Table 6); 2. with or without confidence weights (Top50woc vs. Top50); 3. with or without the combined categorization loss (Top50wc vs. Top50); 4. with or without the density-weighted probability (Top50lw vs. Top50). Results in Table 6 show the significant contribution of each design choice. Since some models in these studies are not trained with softmax classifier layers, comparisons are made with a simple nearest-neighbor (NN) method, which for each test example finds the closest neighbor in the memory bank and uses that neighbor's label as its prediction. As reported in [wu2018unsupervised; zhuang2019local], higher NN performance strongly predicts higher softmax categorization performance.
6 Discussion
In this work, we presented LLP, a method for semi-supervised deep neural network training that scales to large datasets. LLP efficiently propagates labels from known to unknown examples in a common embedding space, ensuring high-quality propagation by exploiting the local structure of the embedding. The embedding itself is simultaneously co-trained to achieve high categorization performance while enforcing statistical consistency between real and pseudo-labels. LLP achieves state-of-the-art semi-supervised learning results across all tested training regimes, including those with very small amounts of labelled data, and transfers effectively to tasks it was not trained on.
In future work, we seek to improve LLP by better integrating it with state-of-the-art unsupervised learning methods (e.g. [zhuang2019local]). This is especially relevant in the regime with very low fractions of known labels (e.g. 1% of ImageNet labels), where the best purely unsupervised methods appear to outperform the state-of-the-art semi-supervised approaches. In addition, in its current formulation, LLP may be less effective on small datasets than alternatives that exploit global similarity structure (e.g. [iscen2019label; liu2018deep]). We thus hope to improve upon LLP by identifying methods of label propagation that can take advantage of global structure while remaining scalable.
However, the real promise of semi-supervised learning is that it will enable tasks that are not already essentially solvable with supervision (cf. [yalniz2019billion]). Thus, an important direction for future work lies in applying LLP or related algorithms to tasks beyond object categorization where dense labelling is possible but very costly, such as medical imaging, multi-object scene understanding, 3D shape reconstruction, video-based action recognition and tracking, or understanding multimodal audio-visual data streams.
References

(1) M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018.
(2) A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun. Very deep convolutional networks for natural language processing. arXiv preprint arXiv:1606.01781, 2016.
(3) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
(4) L. Deng, G. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8599–8603. IEEE, 2013.
(5) J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
(6) Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pages 529–536, 2005.
(7) A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
(8) K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
(9) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
(10) K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
(11) G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 2012.
(12) J. Hirschberg and C. D. Manning. Advances in natural language processing. Science, 349(6245):261–266, 2015.
(13) A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Label propagation for deep semi-supervised learning. arXiv preprint arXiv:1904.04717, 2019.
(14) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
(15) A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387, 2016.
(16) D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
(17) B. Liu, Z. Wu, H. Hu, and S. Lin. Deep metric transfer for label propagation with limited annotated data. arXiv preprint arXiv:1812.08781, 2018.
(18) T. Miyato, S.-i. Maeda, S. Ishii, and M. Koyama. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
(19) K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, 2015.
(20) M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
(21) M. Noroozi, H. Pirsiavash, and P. Favaro. Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision, pages 5898–5906, 2017.
(22) S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille. Deep co-training for semi-supervised image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–152, 2018.
(23) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
(24) A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pages 1195–1204, 2017.
(25) Z. Wu, A. A. Efros, and S. X. Yu. Improving generalization via scalable neighborhood component analysis. In Proceedings of the European Conference on Computer Vision (ECCV), pages 685–701, 2018.
(26) Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
(27) I. Z. Yalniz, H. Jegou, K. Chen, M. Paluri, and D. Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
(28) T. Young, D. Hazarika, S. Poria, and E. Cambria. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018.
(29) X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer. S4L: Self-supervised semi-supervised learning. arXiv preprint arXiv:1905.03670, 2019.
(30) R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
(31) B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
(32) D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems, pages 321–328, 2004.
(33) X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer, 2002.
(34) C. Zhuang, A. L. Zhai, and D. Yamins. Local aggregation for unsupervised learning of visual embeddings. arXiv preprint arXiv:1903.12355, 2019.