Deep neural networks (DNNs) have achieved impressive performance on tasks across a variety of domains, including vision krizhevsky2012imagenet ; simonyan2014very ; he2016deep ; he2017mask , speech recognition hinton2012deep ; hannun2014deep ; deng2013new ; noda2015audio , and natural language processing young2018recent ; hirschberg2015advances ; conneau2016very ; kumar2016ask . However, these achievements often rely heavily on large-scale labelled datasets, requiring burdensome and expensive annotation efforts. This problem is especially acute in specialized domains such as medical image processing, where annotation may involve performing an invasive process on patients. To avoid the need for large numbers of labels in training DNNs, researchers have proposed unsupervised methods that operate solely on ubiquitously available unlabelled data. Such methods have made significant recent progress in the visual domain, where state-of-the-art unsupervised learning algorithms have begun to rival their supervised counterparts on large-scale transfer learning tests caron2018deep ; wu2018unsupervised ; zhuang2019local ; donahue2016adversarial ; zhang2016colorful ; noroozi2016unsupervised ; noroozi2017representation .
While task-generic unsupervised methods may provide good starting points for feature learning, they must be adapted with at least some labelled data to solve any specific desired target task. Semi-supervised learning seeks to leverage limited amounts of labelled data, in conjunction with extensive unlabelled data, to bridge the gap between the unsupervised and fully supervised cases. Recent work in semi-supervised learning has shown significant promise liu2018deep ; iscen2019label ; zhai2019s4l ; miyato2018virtual ; tarvainen2017mean ; lee2013pseudo ; grandvalet2005semi ; qiao2018deep
, although gaps to fully supervised performance levels remain significant, especially on large-scale datasets where very few labels are available. An important consideration is that many recently proposed semi-supervised methods rely on techniques whose efficiency scales poorly with dataset size, and thus cannot be readily applied to many real-world machine learning problems liu2018deep ; iscen2019label .
Here, we propose a novel semi-supervised learning algorithm that is specifically adapted for use with large sparsely-labelled datasets. This algorithm, termed Local Label Propagation (LLP), learns a nonlinear embedding of the input data, and exploits the local geometric structure of the latent embedding space to help infer useful pseudo-labels for unlabelled datapoints. LLP borrows the framework of non-parametric embedding learning, which has recently shown utility in unsupervised learning wu2018unsupervised ; zhuang2019local
, to first train a deep neural network that embeds labelled and unlabelled examples into a lower-dimensional latent space. LLP then propagates labels from known examples to unknown datapoints, weighting the likelihood of propagation by a factor involving the local density of known examples. The neural network embedding is then optimized to categorize all datapoints according to their pseudo-labels (with stronger emphasis on true known labels), while simultaneously encouraging datapoints sharing the same (pseudo-)labels to aggregate in the latent embedding space. The resulting embedding thus gathers both labelled images within the same class and unlabelled images sharing statistical similarities with the labelled ones. Through iteratively applying the propagation and network training steps, the LLP algorithm builds a good underlying representation for supporting downstream tasks, and trains an accurate classifier for the specific desired task.
We apply the LLP procedure in the context of object categorization in the ImageNet dataset deng2009imagenet , learning a high-performing network while discarding most of the known labels. The LLP procedure substantially outperforms previous state-of-the-art semi-supervised algorithms that are sufficiently scalable that they can be applied to ImageNet zhai2019s4l ; miyato2018virtual ; tarvainen2017mean ; lee2013pseudo ; grandvalet2005semi ; qiao2018deep , with gains that are consistent across a wide variety of training regimes. LLP-trained features also support improved transfer to Places205, a large-scale scene-recognition task. In the sections that follow, we first discuss related literature (§2), describe the LLP method (§3), show experimental results (§4), and present analyses that provide insights into the learning procedure and justification of key parameter choices (§5).
2 Related Work
Below we describe conceptual relationships between our work and recent related approaches, and identify relevant major alternatives for comparison.
Deep Label Propagation. Like LLP, Deep Label Propagation iscen2019label
(DLP) also iterates between steps of label propagation and neural network optimization. In contrast to LLP, the DLP label propagation scheme is based on computing pairwise similarity matrices of learned visual features across all (unlabelled) examples. Unlike in LLP, the DLP loss function is simply classification with respect to pseudo-labels, without any additional aggregation terms ensuring that the pseudo-labelled and true-labelled points have similar statistical structure. The DLP method is effective on comparatively small datasets, such as CIFAR10 and Mini-ImageNet. However, DLP is challenging to apply to large-scale datasets such as ImageNet, since its label propagation method is $O(N^2)$ in the number of datapoints $N$, and is not readily parallelizable. In contrast, LLP is $O(N \cdot N_L)$, where $N_L$ is the number of labelled datapoints, and is easily parallelized, making its effective complexity $O(N \cdot N_L / P)$, where $P$ is the number of parallel processes. In addition, DLP uniformly propagates labels across the network's implied embedding space, while LLP's use of local density-driven propagation weights specifically exploits the geometric structure of the learned embedding space, improving pseudo-label inference.
Deep Metric Transfer and Pseudolabels. The Deep Metric Transfer liu2018deep (DMT) and Pseudolabels lee2013pseudo methods both use non-iterative two-stage procedures. In the first stage, the representation is initialized either with a self-supervised task such as non-parametric instance recognition (DMT), or via direct supervision on the known labels (Pseudolabels). In the second stage, pseudo-labels are obtained either by applying a label propagation algorithm (DMT) or naively from the pre-trained classifier (Pseudolabels), and these are then used to fine-tune the network. As in DLP, the label propagation algorithm used by DMT cannot be applied to large-scale datasets, and does not specifically exploit local statistical features of the learned representation. While more scalable, the Pseudolabels approach achieves comparatively poor results. A key point of contrast between LLP and the two-stage methods is that in LLP, the representation learning and label propagation processes interact via the iterative training process, an important driver of LLP’s improvements.
Self-Supervised Semi-Supervised Learning. Self-Supervised Semi-Supervised Learning zhai2019s4l (S4L) co-trains a network using self-supervised methods on unlabelled images and a traditional classification loss on labelled images. Unlike LLP, S4L simply "copies" self-supervised learning tasks as parallel co-training loss branches. In contrast, LLP involves a nontrivial interaction between known and unknown labels via label propagation and the combination of categorization and aggregation terms in the shared loss function, both factors that are important for its improved performance.
Consistency-based regularization. Several recent semi-supervised methods rely on data-consistency regularizations. Virtual Adversarial Training (VAT) miyato2018virtual adds small input perturbations, requiring outputs to be robust to these perturbations. Mean Teacher (MT) tarvainen2017mean requires the learned representation to be similar to its exponential moving average during training. Deep Co-Training (DCT) qiao2018deep requires the outputs of two views of the same image to be similar, while ensuring outputs vary widely using adversarial pairs. These methods all use unlabelled data in a "point-wise" fashion, applying the proposed consistency metric separately to each example. They thus differ significantly from LLP, or indeed any method that explicitly relates unlabelled to labelled points. LLP benefits from training a shared embedding space that aggregates statistically similar unlabelled datapoints together with their (putative) labelled counterparts. As a result, increasing the number of unlabelled images consistently increases the performance of LLP, unlike for the Mean Teacher method.
3 Methods

We first give an overview of the LLP method. At a high level, LLP learns a model $f_\theta$ from labelled examples $X^L$, their associated labels $Y^L$, and unlabelled examples $X^U$. $f_\theta$ is realized via a deep neural network whose parameters $\theta$ are network weights. For each input $x$, $f_\theta$ generates two outputs (Fig. 1): an "embedding output" $v$, realized as a vector on the $D$-dimensional unit sphere, and a category prediction output $c$. In learning $\theta$, the LLP procedure repeatedly alternates between two steps: label propagation and representation learning. First, known labels are propagated from $X^L$ to $X^U$, creating pseudo-labels $\tilde{Y}^U$. Then, network parameters are updated to minimize a loss function balancing category prediction accuracy, evaluated on the $c$ outputs, and a metric of statistical consistency, evaluated on the $v$ outputs.
In addition to pseudo-labels, the label propagation step also generates a [0,1]-valued confidence score $w_i$ for each example $x_i$. For labelled points, confidence scores are automatically set to 1, while for pseudo-labelled examples, confidence scores are computed from the local geometric structure of the embedded points, reflecting how close the embedding vectors of the pseudo-labelled points are to those of their putative labelled counterparts. The confidence values are then used as loss weights during representation learning.
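As a concrete sketch, the alternation between propagation and representation learning can be written as a short loop. The function names and signatures below are illustrative placeholders, not the paper's actual interfaces:

```python
import numpy as np

def llp_train(x_l, y_l, x_u, embed_fn, propagate_fn, update_fn, n_rounds=3):
    """One possible shape of the LLP alternation: embed all datapoints,
    propagate labels from labelled to unlabelled points, then update the
    network on every point, with per-example confidence weights."""
    for _ in range(n_rounds):
        v_l, v_u = embed_fn(x_l), embed_fn(x_u)
        pseudo, conf = propagate_fn(v_l, y_l, v_u)
        labels = np.concatenate([y_l, pseudo])
        # True labels get confidence 1; pseudo-labels keep propagated scores.
        weights = np.concatenate([np.ones(len(y_l)), conf])
        update_fn(np.concatenate([x_l, x_u]), labels, weights)
```

In practice `embed_fn` and `update_fn` would be the forward pass and one training epoch of the deep network; here they are left abstract.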
Representation Learning. Assume that datapoints $X = X^L \cup X^U$, labels $Y^L$ and pseudo-labels $\tilde{Y}^U$, and confidences $\{w_i\}$ are given. Let $V = \{v_i\}$ denote the set of corresponding embedded vectors, and $C = \{c_i\}$ denote the set of corresponding category prediction outputs. In the representation learning step, we update the network parameters $\theta$ by simultaneously minimizing the standard cross-entropy loss between predicted and propagated pseudo-labels, while maximizing a global aggregation metric to enforce overall consistency between known labels and pseudo-labels.
The aggregation metric builds on the non-parametric softmax formulation of wu2018unsupervised , which defines the probability that an arbitrary embedding vector $v$ is recognized as the $i$-th example as:
$$P(i \mid v) = \frac{\exp(v_i^\top v / \tau)}{\sum_{j=1}^{N} \exp(v_j^\top v / \tau)},$$
where $\tau$ is a fixed temperature hyperparameter. For a set of examples $A$, the probability of a given $v$ being recognized as an element of $A$ is:
$$P(A \mid v) = \sum_{i \in A} P(i \mid v).$$
We then define the aggregation metric as the (negative) log likelihood that $v_i$ will be recognized as a member of the set $A_i$ of examples sharing its pseudo-label:
$$L_{\mathrm{agg}}(v_i) = -\log P(A_i \mid v_i).$$
Optimizing encourages the embedding corresponding to a given datapoint to selectively become close to embeddings of other datapoints with the same pseudo-label (Fig. 1).
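A minimal numpy sketch of the instance-recognition probability and the aggregation loss described above (the temperature value and function names are illustrative, and embeddings are assumed unit-norm):

```python
import numpy as np

def p_instance(v, bank, tau=0.07):
    """Non-parametric softmax: the probability that embedding v is
    recognized as each stored example, given a bank of unit-norm
    embeddings (one row per example) and temperature tau."""
    logits = bank @ v / tau
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

def aggregation_loss(i, bank, same_label_idx, tau=0.07):
    """Negative log likelihood that example i's embedding is recognized
    as a member of the set of examples sharing its (pseudo-)label."""
    p = p_instance(bank[i], bank, tau)
    return -np.log(p[same_label_idx].sum())
```

Minimizing this loss pulls an embedding toward the embeddings indexed by `same_label_idx`, matching the aggregation behaviour described in the text.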
The cross-entropy and aggregation loss terms are scaled on a per-example basis by the confidence score, and an $L_2$ weight regularization penalty is added. Thus, the final loss for example $x_i$ is:
$$L_i = w_i \left[ L_{\mathrm{ce}}(c_i, \tilde{y}_i) + L_{\mathrm{agg}}(v_i) \right] + \lambda \lVert \theta \rVert_2^2,$$
where $\lambda$ is a regularization hyperparameter.
Label Propagation. We now describe how LLP generates pseudo-labels $\tilde{y}$ and confidence scores $w$. To understand our actual procedure, it is useful to start from the weighted K-Nearest-Neighbor classification algorithm wu2018unsupervised , in which a "vote" is obtained from the top $K$ nearest labelled examples for each unlabelled example $x$, denoted $N_K(x)$. The vote of each $x_i \in N_K(x)$ is weighted by the corresponding probability $P(i \mid v)$ that $v$ will be identified as example $i$. Assuming $C$ classes, the total weight for $x$ to be pseudo-labelled as class $c$ is thus:
$$W_c = \sum_{i \in N_K(x)} P(i \mid v) \cdot \mathbb{1}[y_i = c].$$
Therefore, the probability $p_c$ that datapoint $x$ is of class $c$, the associated inferred pseudo-label $\tilde{y}$, and the corresponding confidence $w$ may be defined as:
$$p_c = \frac{W_c}{\sum_{c'=1}^{C} W_{c'}}, \qquad \tilde{y} = \operatorname*{arg\,max}_c \, p_c, \qquad w = \max_c \, p_c.$$
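The weighted-KNN baseline above can be sketched as follows. Here the vote weight is taken proportional to $\exp(v_i^\top v / \tau)$, dropping the shared softmax normalizer, which does not affect the vote shares; parameter values are illustrative:

```python
import numpy as np

def weighted_knn_pseudolabel(v, v_labelled, y_labelled, n_classes, k=3, tau=0.07):
    """Weighted-KNN vote: the top-k nearest labelled embeddings vote for
    v's pseudo-label, each vote weighted by its (unnormalized) recognition
    probability; the confidence is the winning class's share of the vote."""
    sims = v_labelled @ v
    top = np.argsort(-sims)[:k]          # indices of the k nearest labelled points
    votes = np.zeros(n_classes)
    for i in top:
        votes[y_labelled[i]] += np.exp(sims[i] / tau)
    probs = votes / votes.sum()
    return int(probs.argmax()), float(probs.max())
```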
Although intuitive, weighted-KNN ignores all the other unlabelled examples in the embedding space when inferring $\tilde{y}$ and $w$. Fig. 1c depicts a scenario where this can be problematic: the embedded point $v$ is near examples of three different known classes (red, green and blue points) that are at similar distances from $v$, but each of which has a different local data density. If we directly use $P(i \mid v)$ as the weights for these three labelled neighbors, they will contribute similar weights when calculating $W_c$. However, we should expect a higher weight from the lower-density red neighbor: the lower density indicates that the prior probability that any given point near $v$ is identified as the red instance is lower than for the other possible labels. This information is not reflected in $P(i \mid v)$. We should instead use the posterior probability as the vote weight, which by Bayes' theorem means that $P(i \mid v)$ should be divided by its prior. To formalize this reasoning, we replace $P(i \mid v)$ in the definition of $W_c$ with a locally weighted probability:
$$\tilde{P}(i \mid v) = \frac{P(i \mid v)}{\sum_{j \in N(v_i)} P(j \mid v_i)},$$
where $N(v_i)$ are the nearest neighbors of $v_i$ and the denominator is a measure of the local embedding density around $v_i$. For consistency, we also replace $N_K(x)$ with $\tilde{N}_K(x)$, which contains the $K$ labelled neighbors with the highest locally-weighted probability, ensuring that the votes come from the most relevant labelled examples. The final form of the LLP propagation weight equation is thus:
$$W_c = \sum_{i \in \tilde{N}_K(x)} \tilde{P}(i \mid v) \cdot \mathbb{1}[y_i = c].$$
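A sketch of this density correction, using summed neighbour similarity as a simple proxy for the prior set by the local density around a labelled point (neighborhood size and temperature are illustrative):

```python
import numpy as np

def locally_weighted_prob(v, i, bank, tau=0.07, k=2):
    """Density-corrected propagation weight: divide the raw similarity
    weight exp(v_i . v / tau) by the summed similarity of v_i to its own
    k nearest neighbours, so that points in dense regions of the
    embedding contribute smaller votes."""
    raw = np.exp(bank[i] @ v / tau)
    sims_i = bank @ bank[i]
    sims_i[i] = -np.inf                   # exclude v_i itself
    neigh = np.argsort(-sims_i)[:k]
    density = np.exp(sims_i[neigh] / tau).sum()
    return raw / density
```

With two labelled candidates equidistant from $v$, the one sitting in the sparser region receives the larger vote, matching the intuition in Fig. 1c.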
Memory Bank. Both the label propagation and representation learning steps implicitly require access to all the embedded vectors at every computational step. However, recomputing every embedding at each step rapidly becomes intractable as dataset size increases. We address this issue by approximating the real-time embeddings with a memory bank that keeps a running average of the embeddings. As this procedure is taken directly from wu2018unsupervised ; wu2018improving ; zhuang2019local , we refer readers to those works for a detailed description.
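The memory-bank update can be sketched as a momentum-blended running average followed by re-projection onto the unit sphere (the momentum value here is illustrative):

```python
import numpy as np

def update_memory_bank(bank, idx, new_v, momentum=0.5):
    """Running-average memory bank update: blend each stored embedding
    with the freshly computed one for the same example, then re-normalize
    so all bank entries stay on the unit sphere."""
    mixed = momentum * bank[idx] + (1.0 - momentum) * new_v
    bank[idx] = mixed / np.linalg.norm(mixed, axis=-1, keepdims=True)
    return bank
```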
4 Results

We first evaluate the LLP method on visual object categorization in the large-scale ImageNet dataset deng2009imagenet , under a variety of training regimes. We also illustrate transfer learning to Places205 zhou2014learning , a large-scale scene-recognition dataset.
Training details. Models are trained with SGD; the learning rate is initialized to 0.03 and then dropped by a factor of 10 whenever validation performance saturates. Depending on the training regime specifics (how many labelled and unlabelled examples), training takes 200-400 epochs, comprising three learning rate drops. Similarly to zhuang2019local , we initialize networks with the IR loss function in a completely unsupervised fashion for 10 epochs, and then switch to the LLP method. Most hyperparameters are taken directly from wu2018unsupervised ; wu2018improving ; zhuang2019local , although as shown in zhai2019s4l , a hyperparameter search could potentially improve performance. The label propagation hyperparameter choices are justified in Section 5. The local density estimate is recomputed for all labelled images at once at the end of every epoch. For each training regime, we train both ResNet-18v2 and ResNet-50v2 he2016identity architectures, with an additional fully-connected layer added alongside the standard softmax categorization layer to generate the embedding output.
ImageNet with varying training regimes. We train on ImageNet with $N_L$ labelled images out of $N$ total available images. Different regimes are defined by the label fraction $N_L/N$ and the total number of images $N$ made available. Results for each regime are shown in Tables 1-3. Due to the inconsistency of reporting metrics across different papers, we alternate between comparing top-1 and top-5 accuracy, depending on which metric was reported in the relevant previous work.
The results show that: (1) LLP significantly outperforms previous state-of-the-art methods by large margins in all training regimes tested, regardless of network architecture, number of labels, and number of available unlabelled images; (2) LLP shows especially large improvements over other methods when only a small number of labels is known; for example, ResNet-18 trained using LLP with only 3% of labels achieves 53.24% top-1 accuracy, 12.43% better than Mean Teacher; (3) unlike Mean Teacher, for which the number of unlabelled images appears essentially irrelevant, LLP consistently benefits from additional unlabelled images (see Table 3), and has not yet saturated using all the images in the ImageNet set. This suggests that with additional unlabelled data, LLP could achieve even better performance.
[Tables 1-2: ImageNet accuracy with 1%, 3%, 5%, and 10% labels, comparing Supervised, Pseudolabels lee2013pseudo , VAT miyato2018virtual , VAT-EM grandvalet2005semi , MT tarvainen2017mean , and LLP (ours); numerical entries omitted.]
[Table 3: ImageNet accuracy with 30%, 70%, and 100% of images available as unlabelled data, comparing Supervised, Pseudolabels, VAT, VAT-EM, MT, LLP (ours), and LA*; numerical entries omitted.]
|Method||LS zhou2004learning||LP zhu2002learning||LP_DMT liu2018deep||LP_DLP iscen2019label||LLP (ours)|
|Perf.||84.6 ± 3.4||87.7 ± 2.2||88.2 ± 2.3||89.2 ± 2.4||88.1 ± 2.3|
Values after ± are standard deviations over 10 independent data subsamples.
Transfer learning to Scene Recognition. To evaluate the quality of our learned representation on downstream tasks besides ImageNet classification, we assess its transfer learning performance on the Places205 zhou2014learning dataset, which contains 2.45M images in 205 distinct scene categories. We fix the nonlinear network weights learned on ImageNet, add a linear readout layer on top of the penultimate layer, and train the readout with a cross-entropy loss using SGD as above. The learning rate is initialized at 0.01 and dropped by a factor of 10 whenever validation performance on Places205 saturates. Training requires approximately 500,000 steps, comprising two learning rate drops. We evaluate only our ResNet-50 trained with 1% or 10% labels, as previous work zhai2019s4l reported performance with that architecture. Table 4 shows that LLP again significantly outperforms previous state-of-the-art results. It is notable, however, that when trained with only 1% of ImageNet labels, all current semi-supervised learning methods show somewhat worse transfer performance to Places205 than Local Aggregation (LA), the current state-of-the-art unsupervised learning method zhuang2019local .
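As a toy illustration of this transfer protocol, a linear softmax readout on frozen features can be trained with cross-entropy via plain full-batch gradient descent (the real experiments use SGD with the step-wise learning-rate schedule described above; the learning rate and epoch count here are illustrative):

```python
import numpy as np

def train_linear_readout(features, labels, n_classes, lr=0.5, epochs=500):
    """Train a linear softmax classifier on frozen features with
    cross-entropy loss and full-batch gradient descent."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n                      # d(loss)/d(logits)
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```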
5 Analysis

To better understand the LLP procedure, we analyze both the final learned embedding space and how it changes during training.
Emerging clusters during training. Intuitively, the aggregation term in eq. 3 should cause embedding outputs with the same ground truth label, whether known or propagated, to cluster together during training. Indeed, visualization (Fig. 2a) shows clustering becoming more pronounced along the training trajectory both for labelled and unlabelled datapoints, while unlabelled datapoints surround labelled datapoints increasingly densely. A simple metric measuring the aggregation of a group of embedding vectors is the L2 norm of the group mean, which, since all embeddings lie in the 128-D unit sphere, is inversely related to the group dispersion. Computing this metric separately for each ImageNet category and averaging across categories, we obtain a quantitative description of aggregation over the learning timecourse (Fig. 2b), further supporting the conclusion that LLP embeddings become increasingly clustered.
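The aggregation metric used in this analysis is simple to compute:

```python
import numpy as np

def mean_category_concentration(embeddings, labels):
    """For each category, compute the L2 norm of the mean embedding;
    since embeddings lie on the unit sphere, values near 1 indicate a
    tight cluster and values near 0 a dispersed one. The per-category
    norms are then averaged across categories."""
    norms = [np.linalg.norm(embeddings[labels == c].mean(axis=0))
             for c in np.unique(labels)]
    return float(np.mean(norms))
```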
Effect of architecture. We also investigate how network architecture influences learning trajectory and the final representation, comparing ResNet-50 and ResNet-18 trained with 10% labels (Fig. 2b-d). The more powerful ResNet-50 achieves a more clustered representation than ResNet-18, both at all time points during training and for almost every ImageNet category.
Category structure analysis: successes, failures, and sub-category discovery. It is instructive to systematically analyze statistical patterns on a per-category basis. To illustrate this, we visualize the embeddings for three representative categories with 2D multi-dimensional scaling (MDS). For an "easy" category with a high aggregation score (Fig. 2e), the LLP embedding identifies images with strong semantic similarity, supporting successful semi-supervised image retrieval. For a "hard" category with a low aggregation score (Fig. 2f), image statistics vary much more and the embedding fails to properly cluster examples together. Most interestingly, for multi-modal categories with intermediate aggregation scores (Fig. 2g), the learned embedding can reconstruct semantically meaningful sub-clusters even when these are not present in the original labelling, e.g. the "labrador" category decomposing into "black" and "yellow" subcategories.
Comparison to global propagation in the small-dataset regime. To understand how LLP compares to methods that use global similarity information, but therefore lack scalability to large datasets, we test several such methods on ImageNet subsets containing 50 categories and 50 images per category. Table 5 shows that local propagation can be effective even in this regime, as LLP's performance is comparable to that of the global propagation algorithm used in DMT liu2018deep and only slightly lower than that of DLP iscen2019label .
Ablations. To justify key parameters and design choices, we conduct a series of ablation studies exploring the following alternatives, using: (1) different values for the neighborhood size $K$ (experiments Top50, Top20, Top10, and Top5 in Table 6); (2) confidence weights or not (Top50woc vs. Top50); (3) the combined categorization loss or not (Top50wc vs. Top50); (4) the density-weighted probability or not (Top50lw vs. Top50). Results in Table 6 show the significant contribution of each design choice. Since some models in these studies are not trained with softmax classifier layers, comparisons are made with a simple Nearest-Neighbor (NN) method, which for each test example finds the closest neighbor in the memory bank and uses that neighbor's label as its prediction. As reported in wu2018unsupervised ; zhuang2019local , higher NN performance strongly predicts higher softmax categorization performance.
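The Nearest-Neighbor evaluation used for these comparisons can be sketched as:

```python
import numpy as np

def nearest_neighbour_predict(v_test, bank, bank_labels):
    """Label each test embedding with the label of its closest
    (highest inner-product) memory-bank entry."""
    return bank_labels[np.argmax(v_test @ bank.T, axis=1)]
```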
6 Discussion

In this work, we presented LLP, a method for semi-supervised deep neural network training that scales to large datasets. LLP efficiently propagates labels from known to unknown examples in a common embedding space, ensuring high-quality propagation by exploiting the local structure of the embedding. The embedding itself is simultaneously co-trained to achieve high categorization performance while enforcing statistical consistency between real and pseudo-labels. LLP achieves state-of-the-art semi-supervised learning results across all tested training regimes, including those with very small amounts of labelled data, and transfers effectively to tasks on which it was not trained.
In future work, we seek to improve LLP by better integrating it with state-of-the-art unsupervised learning methods (e.g. zhuang2019local ). This is especially relevant in the regime with very low fractions of known labelled datapoints (e.g. 1% of ImageNet labels), where the best pure unsupervised methods appear to outperform the state-of-the-art semi-supervised approaches. In addition, in its current formulation, LLP may be less effective on small datasets than alternatives that exploit global similarity structure (e.g. iscen2019label ; liu2018deep ). We thus hope to improve upon LLP by identifying methods of label propagation that can take advantage of global structure while remaining scalable.
However, the real promise of semi-supervised learning is that it will enable tasks that are not already essentially solvable with supervision (cf. yalniz2019billion ). Thus, an important direction for future work lies in applying LLP or related algorithms to tasks beyond object categorization where dense labeling is possible but very costly, such as medical imaging, multi-object scene understanding, 3D shape reconstruction, video-based action recognition and tracking, or understanding multi-modal audio-visual datastreams.
- (1) M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018.
- (2) A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun. Very deep convolutional networks for natural language processing. arXiv preprint arXiv:1606.01781, 2, 2016.
- (3) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009.
- (4) L. Deng, G. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8599–8603. IEEE, 2013.
- (5) J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
- (6) Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2005.
- (7) A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
- (8) K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- (9) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- (10) K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
- (11) G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.
- (12) J. Hirschberg and C. D. Manning. Advances in natural language processing. Science, 349(6245):261–266, 2015.
- (13) A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Label propagation for deep semi-supervised learning. arXiv preprint arXiv:1904.04717, 2019.
- (14) A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- (15) A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387, 2016.
- (16) D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
- (17) B. Liu, Z. Wu, H. Hu, and S. Lin. Deep metric transfer for label propagation with limited annotated data. arXiv preprint arXiv:1812.08781, 2018.
- (18) T. Miyato, S.-i. Maeda, S. Ishii, and M. Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 2018.
- (19) K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, 2015.
- (20) M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
- (21) M. Noroozi, H. Pirsiavash, and P. Favaro. Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision, pages 5898–5906, 2017.
- (22) S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille. Deep co-training for semi-supervised image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–152, 2018.
- (23) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- (24) A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
- (25) Z. Wu, A. A. Efros, and S. X. Yu. Improving generalization via scalable neighborhood component analysis. In Proceedings of the European Conference on Computer Vision (ECCV), pages 685–701, 2018.
- (26) Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
- (27) I. Z. Yalniz, H. Jegou, K. Chen, M. Paluri, and D. Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
- (28) T. Young, D. Hazarika, S. Poria, and E. Cambria. Recent trends in deep learning based natural language processing. ieee Computational intelligenCe magazine, 13(3):55–75, 2018.
- (29) X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer. S4l: Self-supervised semi-supervised learning. arXiv preprint arXiv:1905.03670, 2019.
- (30) R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European conference on computer vision, pages 649–666. Springer, 2016.
- (31) B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in neural information processing systems, pages 487–495, 2014.
- (32) D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in neural information processing systems, pages 321–328, 2004.
- (33) X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer, 2002.
- (34) C. Zhuang, A. L. Zhai, and D. Yamins. Local aggregation for unsupervised learning of visual embeddings. arXiv preprint arXiv:1903.12355, 2019.