Local Label Propagation for Large-Scale Semi-Supervised Learning

05/28/2019 ∙ by Chengxu Zhuang, et al. ∙ 0

A significant issue in training deep neural networks to solve supervised learning tasks is the need for large numbers of labelled datapoints. The goal of semi-supervised learning is to leverage ubiquitous unlabelled data, together with small quantities of labelled data, to achieve high task performance. Though substantial recent progress has been made in developing semi-supervised algorithms that are effective for comparatively small datasets, many of these techniques do not scale readily to the large (unlaballed) datasets characteristic of real-world applications. In this paper we introduce a novel approach to scalable semi-supervised learning, called Local Label Propagation (LLP). Extending ideas from recent work on unsupervised embedding learning, LLP first embeds datapoints, labelled and otherwise, in a common latent space using a deep neural network. It then propagates pseudolabels from known to unknown datapoints in a manner that depends on the local geometry of the embedding, taking into account both inter-point distance and local data density as a weighting on propagation likelihood. The parameters of the deep embedding are then trained to simultaneously maximize pseudolabel categorization performance as well as a metric of the clustering of datapoints within each psuedo-label group, iteratively alternating stages of network training and label propagation. We illustrate the utility of the LLP method on the ImageNet dataset, achieving results that outperform previous state-of-the-art scalable semi-supervised learning algorithms by large margins, consistently across a wide variety of training regimes. We also show that the feature representation learned with LLP transfers well to scene recognition in the Places 205 dataset.



There are no comments yet.


page 1

page 2

page 3

page 4

page 5

page 6

page 7

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deep neural networks (DNNs) have achieved impressive performance on tasks across a variety of domains, including vision krizhevsky2012imagenet ; simonyan2014very ; he2016deep ; he2017mask , speech recognition hinton2012deep ; hannun2014deep ; deng2013new ; noda2015audio

, and natural language processing 

young2018recent ; hirschberg2015advances ; conneau2016very ; kumar2016ask

. However, these achievements often heavily rely on large-scale labelled datasets, requiring burdensome and expensive annotation efforts. This problem is especially acute in specialized domains such as medical image processing, where annotation may involve performing an invasive process on patients. To avoid the need for large numbers of labels in training DNNs, researchers have proposed unsupervised methods that operate solely with ubiquitously available unlabeled data. Such methods have attained significant recent progress in the visual domain, where state-of-the-art unsupervised learning algorithms have begun to rival their supervised counterparts on large-scale transfer learning tests 

caron2018deep ; wu2018unsupervised ; zhuang2019local ; donahue2016adversarial ; zhang2016colorful ; noroozi2016unsupervised ; noroozi2017representation .

While task-generic unsupervised methods may provide good starting points for feature learning, they must be adapted with at least some labelled data to solve any specific desired target task. Semi-supervised learning seeks to leverage limited amounts of labelled data, in conjunction with extensive unlabelled data, to bridge the gap between the unsupervised and fully supervised cases. Recent work in semi-supervised learning has shown significant promise liu2018deep ; iscen2019label ; zhai2019s4l ; miyato2018virtual ; tarvainen2017mean ; lee2013pseudo ; grandvalet2005semi ; qiao2018deep

, although gaps to supervised performance levels still remain significant, especially in large-scale datasets where very few labels are available. An important consideration is that many recently proposed semi-supervised methods rely on techniques whose efficiency scales poorly with dataset size and thus cannot be readily applied to many real-world machine learning problems 

liu2018deep ; iscen2019label .

Here, we propose a novel semi-supervised learning algorithm that is specifically adapted for use with large sparsely-labelled datasets. This algorithm, termed Local Label Propagation (LLP), learns a nonlinear embedding of the input data, and exploits the local geometric structure of the latent embedding space to help infer useful pseudo-labels for unlabelled datapoints. LLP borrows the framework of non-parametric embedding learning, which has recently shown utility in unsupervised learning wu2018unsupervised ; zhuang2019local

, to first train a deep neural network that embeds labelled and unlabelled examples into a lower-dimensional latent space. LLP then propagates labels from known examples to unknown datapoints, weighting the likelihood of propagation by a factor involving the local density of known examples. The neural network embedding is then optimized to categorize all datapoints according to their pseudo-labels (with stronger emphasis on true known labels), while simultaneously encouraging datapoints sharing the same (pseudo-)labels to aggregate in the latent embedding space. The resulting embedding thus gathers both labelled images within the same class and unlabelled images sharing statistical similarities with the labelled ones. Through iteratively applying the propagation and network training steps, the LLP algorithm builds a good underlying representation for supporting downstream tasks, and trains an accurate classifier for the specific desired task.

We apply the LLP procedure in the context of object categorization in the ImageNet dataset deng2009imagenet , learning a high-performing network while discarding most of the known labels. The LLP procedure substantially outperforms previous state-of-the-art semi-supervised algorithms that are sufficiently scalable that they can be applied to ImageNet zhai2019s4l ; miyato2018virtual ; tarvainen2017mean ; lee2013pseudo ; grandvalet2005semi ; qiao2018deep , with gains that are consistent across a wide variety of training regimes. LLP-trained features also support improved transfer to Places205, a large-scale scene-recognition task. In the sections that follow, we first discuss related literature (§2), describe the LLP method (§3), show experimental results (§4), and present analyses that provide insights into the learning procedure and justification of key parameter choices (§5).

2 Related Work

Below we describe conceptual relationships between our work and recent related approaches, and identify relevant major alternatives for comparison.

Deep Label Propagation. Like LLP, Deep Label Propagation iscen2019label

(DLP) also iterates between steps of label propagation and neural network optimization. In contrast to LLP, the DLP label propagation scheme is based on computing pairwise similarity matrices of learned visual features across all (unlabelled) examples. Unlike in LLP, the DLP loss function is simply classification with respect to pseudo-labels, without any additional aggregation terms ensuring that the pseudo-labelled and true-labelled points have similar statistical structure. The DLP method is effective on comparatively small datasets, such as CIFAR10 and Mini-ImageNet. However, DLP is challenging to apply to large-scale datasets such as ImageNet, since its label propagation method is

in the number of datapoints, and is not readily parallelizable. In contrast, LLP is , where is the number of labelled datapoints, and is easily parallelized, making its effective complexity , where is the number of parallel processes. In addition, DLP uniformly propagates labels across networks’ implied embedding space, while LLP’s use of local density-driven propagation weights specifically exploits the geometric structure in the learned embedding space, improving pseudo-label inference.

Deep Metric Transfer and Pseudolabels. The Deep Metric Transfer liu2018deep (DMT) and Pseudolabels lee2013pseudo methods both use non-iterative two-stage procedures. In the first stage, the representation is initialized either with a self-supervised task such as non-parametric instance recognition (DMT), or via direct supervision on the known labels (Pseudolabels). In the second stage, pseudo-labels are obtained either by applying a label propagation algorithm (DMT) or naively from the pre-trained classifier (Pseudolabels), and these are then used to fine-tune the network. As in DLP, the label propagation algorithm used by DMT cannot be applied to large-scale datasets, and does not specifically exploit local statistical features of the learned representation. While more scalable, the Pseudolabels approach achieves comparatively poor results. A key point of contrast between LLP and the two-stage methods is that in LLP, the representation learning and label propagation processes interact via the iterative training process, an important driver of LLP’s improvements.

Self-Supervised Semi-Supervised Learning. Self-Supervised Semi-Supervised Learning zhai2019s4l (SL) co-trains a network using self-supervised methods on unlabelled images and traditional classification loss on labelled images. Unlike LLP, SL simply “copies” self-supervised learning tasks as parallel co-training loss branches. In contrast, LLP involves a nontrivial interaction between known and unknown labels via label propagation and the combination of categorization and aggregation terms in the shared loss function, both factors that are important for improved performance.

Consistency-based regularization. Several recent semi-supervised methods rely on data-consistency regularizations. Virtual Adversarial Training (VAT) miyato2018virtual adds small input perturbations, requiring outputs to be robust to this perturbation. Mean Teacher (MT) tarvainen2017mean requires the learned representation to be similar to its exponential moving average during training. Deep Co-Training (DCT) qiao2018deep requires the outputs of two views of the same image to be similar, while ensuring outputs vary widely using adversarial pairs. These methods all use unlabeled data in a “point-wise” fashion, applying the proposed consistency metric separately on each. They thus differ significantly from LLP, or indeed any method that explicitly relates unlabelled to labelled points. LLP benefits from training a shared embedding space that aggregates statistically similar unlabelled datapoints together with labelled (putative) counterparts. As a result, increasing the number of unlabelled images consistently increases the performance of LLP, unlike for the Mean-Teacher method.

3 Methods

Figure 1: Schematic of the Local Label Propagation (LLP) method. a.-b.

We use deep convolutional neural networks to simultaneously generate a lower-dimensional embedding and a category prediction for each input example.

c. If the embedding of the input (denoted by ) is unlabelled, we identify its close labelled neighbors (colored points), and infer ’s pseudo-label by votes from these neighbors, with voting weights jointly determined by their distances from and the density of their local neighborhoods (the highlighted circular areas). d. The pseudolabels thereby created (colored points) come equipped with a confidence (color brightness), measuring how accurate the pseudo-label is likely to be. The network in b. is optimized (with per-example confidence weightings) so that its category predictions match the pseudo-labels, while its embedding is attracted () toward other embeddings sharing the same pseudo-labels and repelled () by embeddings of other pseudo-labels.

We first give an overview of the LLP method. At a high level, LLP learns a model from labeled examples , their associated labels , and unlabelled examples . is realized via a deep neural network whose parameters are network weights. For each input , generates two outputs (Fig. 1

): an “embedding output”, realized as a vector

in a -dimensional sphere, and a category prediction output . In learning , the LLP procedure repeatedly alternates between two steps: label propagation and representation learning. First, known labels are propagated from to , creating pseudo-labels . Then, network parameters are updated to minimize a loss function balancing category prediction accuracy evaluated on the outputs, and a metric of statistical consistency evaluated on the outputs.

In addition to pseudo-labels, the label propagation step also generates [0,1]-valued confidence scores for each example . For labelled points, confidence scores are automatically set to 1, while for pseudo-labelled examples, confidence scores are computed from the local geometric structure of the embedded points, reflecting how close the embedding vectors of the pseudo-labelled points are to those of their putative labelled counterparts. The confidence values are then used as loss weights during representation learning.

Representation Learning. Assume that datapoints , labels and pseudolabels , and confidences are given. Let denote the set of corresponding embedded vectors, and denote the set of corresponding category prediction outputs. In the representation learning step, we update the network embedding parameters by simultaneously minimizing the standard cross-entropy loss between predicted and propagated pseudolabels, while maximizing a global aggregation metric to enforce overall consistency between known labels and pseudolabels.

The definition of is based on the non-parametric softmax operation proposed by Wu et. al. wu2018unsupervised ; wu2018improving

, which defines the probability that an arbitrary embedding vector

is recognized as the -th example as:


where temperature

is a fixed hyperparameter. For

, the probability of a given being recognized as an element of is:


We then define the aggregation metric as the (negative) log likelihood that will be recognized as a member of the set of examples sharing its pseudo-label:


Optimizing encourages the embedding corresponding to a given datapoint to selectively become close to embeddings of other datapoints with the same pseudo-label (Fig. 1).

The cross-entropy and aggregation loss terms are scaled on a per-example basis by the confidence score, and an weight regularization penalty is added. Thus, the final loss for example is:


where is a regularization hyperparameter.

Label Propagation. We now describe how LLP generates pseudo-labels and confidence scores . To understand our actual procedure, it is useful to start from the weighted K-Nearest-Neighbor classification algorithm wu2018unsupervised , in which a “vote” is obtained from the top nearest labelled examples for each unlabelled example , denoted . The vote of each is weighted by the corresponding probabilities that will be identified as example . Assuming classes, the total weight for pseudo-labelled as class is thus:


Therefore, the probability that datapoint is of class , the associated inferred pseudo-label , and the corresponding confidence , may be defined as:


Although intuitive, weighted-KNN ignores all the other

unlabelled examples in the embedding space when inferring and . Fig. 1c depicts a scenario where this can be problematic: the embedded point is near examples of three different known classes (red, green and blue points) that are of similar distance to , but each having different local data densities. If we directly use as the weights for these three labelled neighbors, they will contribute similar weights when calculating . However, we should expect a higher weight from the lower-density red neighbor. The lower density indicates that the prior that any given point near is identified as the red instance is lower than for the other possible labels. However, this information is not reflected in , which represents the joint probability that is red. We should instead use the posterior

probability as the vote weight, which by Bayes’ theorem means that

should divided by its prior. To formalize this reasoning, we replace in the definition of with a locally weighted probability:


where are nearest neighbors and denominator is a measure of the local embedding density. For consistency, we replace with , which contains the labelled neighbors with highest locally-weighted probability, to ensure that the votes come from the most relevant labelled examples. The final form of the LLP propagation weight equation is thus:


Memory Bank. Both the label propagation and representation learning steps implicitly require access to all the embedded vectors at every computational step. However, recomputing rapidly becomes intractable as dataset size increases. We address this issue by approximating realtime with a memory bank that keeps a running average of the embeddings. As this procedure is directly taken from wu2018unsupervised ; wu2018improving ; zhuang2019local , we refer readers to these works for a detailed description.

4 Results

We first evaluate the LLP method on visual object categorization in the large-scale ImageNet dataset deng2009imagenet , under a variety of training regimes. We also illustrate transfer learning to Places 205 zhou2014learning , a large-scale scene-recognition dataset.

Experimental settings. Following wu2018unsupervised ; zhuang2019local , and . Optimization uses SGD with momentum of 0.9, batch size of 128, and weight-decay parameter

. Learning rate is initialized to 0.03 and then dropped by a factor of 10 whenever validation performance saturates. Depending on the training regime specifics (how many labelled and unlabelled examples), training takes 200-400 epochs, comprising three learning rate drops. Similarly to 

zhuang2019local , we initialize networks with the IR loss function in a completely unsupervised fashion for 10 epochs, and then switch to the LLP method. Most hyperparameters are directly taken from wu2018unsupervised ; wu2018improving ; zhuang2019local , although as shown in zhai2019s4l , a hyperparameter search can potentially improve performance. In the label propagation stage, we set and (these choices are justified in Section 5

). The density estimate

is recomputed for all labelled images at once at the end of every epoch. For each traning regmine, we train both ResNet-18v2 and ResNet-50v2 he2016identity architectures, with an additional fully connected layer added alongside the standard softmax categorization layer to generate the embedding output.

ImageNet with varying training regimes. We train on ImageNet with labels and total images available, meaning that , . Different regimes are defined by and . Results for each regime are shown in Tables 1-3. Due to the inconsistency of reporting metrics across different papers, we alternate between comparing top1 and top5 accuracy, depending on which metric was reported in the relevant previous work.

The results show that: 1. LLP significantly outperforms previous state-of-the-art methods by large margins within all training regimes tested, regardless of network architectures used, number of labels, and number of available unlabelled images; 2. LLP shows especially large improvements to other methods when only small number of labels are known. For example, ResNet-18 trained using LLP with only 3% labels achieves 53.24% top1 accuracy, which is 12.43% better than Mean Teacher; 3. Unlike Mean Teacher, where the number of unlabelled images appears essentially irrelevant, LLP consistently benefits from additional unlabelled images (see Table 3), and is not yet saturated using all the images in the ImageNet set. This suggests that with additional unlabelled data, LLP could potentially achieve even better performance.

Method 1% labels 3% labels 5% labels 10% labels
Supervised 17.35 28.61 36.01 47.89
DCT qiao2018deep 53.50
MT tarvainen2017mean 16.91 40.81 48.34 56.70
LLP (ours) 27.14 53.24 57.04 61.51
Table 1: ResNet-18 Top-1 accuracy (%) on the ImageNet validation set trained on ImageNet with 10%, 5%, 3%, or 1% of labels. Mean Teacher performance is generated by us.
# labels Supervised Pseudolabels VAT VAT-EM grandvalet2005semi * MT LLP (ours)
1% 48.43 51.56 44.05 46.96 53.37 40.54 61.89
10% 80.43 82.41 82.78 83.39 83.82 85.42 88.53
Table 2: ResNet-50 Top-5 accuracy (%) on the ImageNet validation set trained on ImageNet using 10% or 1% of labels. The numbers for models except ours and Mean Teacher’s are from zhai2019s4l . Because zhai2019s4l only reports top-5 accuracies for these models, we also report top-5 accuracy here to be comparable to theirs. *: for , we list their -Rotation performance, which is their best reported performance using ResNet-50. Note that although a model with higher performance is reported by , that model uses a much more complex architecture than ResNet-50.
Method 30% unlabeled 70% unlabeled 100% unlabeled
MT 56.07 55.59 55.65
LLP (ours) 58.62 60.27 61.51
Table 3: ResNet-18 Top-1 accuracy (%) on the ImageNet validation set using 10% labels and 30%, 70%, or 100% unlabeled images.
# labels Supervised Pseudolabels VAT VAT-EM MT LLP (ours) LA*
10% 44.7 48.2 45.8 46.2 46.6 46.4 50.4 48.3
1% 36.2 41.8 35.9 36.4 38.0 31.6 44.6
Table 4: ResNet-50 transfer learning Top-1 accuracy (%) on the Places205 validation set using fixed weights trained on ImageNet using 10% or 1% of labels. The numbers for models except ours, Mean Teacher’s, and LA’s are from zhai2019s4l . *: LA number is produced by us through training ResNet-50 with Local Aggregation algorithm zhuang2019local without any labels and then using the pretrained weights to do the same transfer learning task to Places205. Note that this reported number is lower than the corresponding number in zhuang2019local . This is because they reported 10-crop top-1 accuracy.
Method LS zhou2004learning LP zhu2002learning LP_DMT liu2018deep LP_DLP iscen2019label LLP (ours)
Perf. 84.6 3.4 87.7 2.2 88.2 2.3 89.2 2.4 88.1 2.3
Table 5: Label propagation performance for different methods on subsets of ImageNet. LS represents Label Spreading introduced by zhou2004learning . LP represents Label Propagation introduced by zhu2002learning . LP_DMT represents the label propagation method used in DMT liu2018deep . Similarly, LP_DLP represents the method used in DLP iscen2019label . For all methods, we randomly sample 50 categories from ImageNet and 50 images from each category. For each category selected, we choose 5 images to be labelled. For all methods, we use embedding outputs of our trained ResNet-50 with 10% labels as data features. The numbers after

are standard deviations after 10 independent data subsamples.

Model Top50woc Top50 Top20 Top10 Top5 Top50wc Top50lw Top10wclw
NN perf. 52.43 54.37 54.82 55.42 55.44 55.46 56.27 57.54
Table 6: Nearest-Neighbor (NN) validation performance (%) for ResNet-18 trained using 10% labels with various settings. NN performances are lower than but correlate strongly with softmax classification performances reported in Table 1. Experiments labelled “TopX” involve training with X, optimizing only embedding loss, and updating the labels and confidence using the naive weighted NN algorithm as defined in Eq. 5. With these default settings, experiment with “woc” in the name have optimization loss not weighted by confidence. “wc” means the loss includes category loss. “lw” means using locally weighted probability for label propagation. Thus, “Top10wclw” indicates the LLP model.

Transfer learning to Scene Recognition. To evaluate the quality of our learned representation in other downstream tasks besides ImageNet classification, we assess its transfer learning performance to the Places205 zhou2014learning dataset. This dataset has 2.45 images total in 205 distinct scene categories. We fix the nonlinear weights learned on ImageNet, add another linear readout layer on top of the penultimate layer, and train the readout using cross-entropy loss using SGD as above. Learning rate is initialized at 0.01 and dropped by factor of 10 whenever validation performance on Places205 saturates. Training requires approximately 500,000 steps, comprising two learning rate drops. We only evaluate our ResNet-50 trained with 1% or 10% labels, as previous work zhai2019s4l reported performance with that architecture. Table 4 show that LLP again significantly outperforms previous state-of-the-art results. It is notable, however, that when trained with only 1% ImageNet labels, all current semi-supervised learning methods show somewhat worse transfer performance to Places205 than the Local Aggregation (LA) method, the current state-of-the-art unsupervised learning method zhuang2019local .

5 Analysis

To better understand the LLP procedure, we analyze both the final learned embedding space, and how it changes during training.

Emerging clusters during training. Intuitively, the aggregation term in eq. 3 should cause embedding outputs with the same ground truth label, whether known or propagated, to cluster together during training. Indeed, visualization (Fig. 2a) shows clustering becoming more pronounced along the training trajectory both for labelled and unlabelled datapoints, while unlabelled datapoints surround labelled datapoints increasingly densely. A simple metric measuring the aggregation of a group of embedding vectors is the L2 norm of the group mean, which, since all embeddings lie in the 128-D unit sphere, is inversely related to the group dispersion. Computing this metric separately for each ImageNet category and averaging across categories, we obtain a quantitative description of aggregation over the learning timecourse (Fig. 2b), further supporting the conclusion that LLP embeddings become increasingly clustered.

Effect of architecture. We also investigate how network architecture influences learning trajectory and the final representation, comparing ResNet-50 and ResNet-18 trained with 10% labels (Fig. 2b-d). The more powerful ResNet-50 achieves a more clustered representation than ResNet-18, both at all time points during training and for almost every ImageNet category.

Category structure analysis: successes, failures, and sub-category discovery. It is instructive to systematically analyze statistical patterns on a per-category basis. To illustrate this, we visualize the embeddings for three representative categories with 2D multi-dimensional scaling (MDS). For an “easy” category with a high aggregation score (Fig. 2

e), the LLP embedding identifies images with strong semantic similarity, supporting successful semi-supervised image retrieval. For a “hard” category with low aggregation score (Fig. 

2f), images statistics vary much more and the embedding fails to properly cluster examples together. Most interestingly, for multi-modal categories with intermediate aggregation scores (Fig. 2g), the learned embedding can reconstruct semantically meaningful sub-clusters even when these are not present in the original labelling e.g. the “labrador” category decomposing into “black” and “yellow” subcategories.

Figure 2: a. 2-dimensional MDS embeddings of 128-dimensional embedding outputs on 100 randomly sampled images from each of five randomly chosen ImageNet categories, from the beginning to the end of LLP training. Larger points with black borders are images with known labels. b. Trajectory of cross-category average of L2-norms of category-mean embedding vectors for each ImageNet category. Sudden changes in trajectories are due to learning rate drops. c. Histogram of the L2-norm metrics for each category, for fully-trained ResNet-18 and ResNet-50 networks. d. Scatter plot of L2-norm metric for ResNet-18 (-axis) and ResNet-50 (-axis). Each dot represents one category. e.-g. MDS embeddings and exemplar images for images of “Monarch”, ”Spatula”, and ”Labrador”. For each category, 700 images were randomly sampled to compute MDS embedding. In g., exemplar images are chosen from the indicated subclusters.

Comparison to global propagation in the small-dataset regime. To understand how LLP compares to methods that use global similarity information, but therefore lack scalability to large datasets, we test several such methods on ImageNet subsets containing 50 categories and 50 images per category. Table 5 shows that local propagtion methods can be effective even in this regime, as LLP’s performance is comparable to that of the global propagation algorithm used in DMT liu2018deep and only slightly lower than that of DLP iscen2019label .

Ablations. To justify key parameters and design choices, we conduct a series of ablation studies exploring the following alternatives, using: 1. Different values for (experiments Top50, Top20, Top10, and Top5 in Table 6); 2. Confidence weights or not (Top50woc and Top50); 3. Combined categorization loss or not (Top50wc and Top50); 4. Density-weighted probability or not (Top50lw and Top50). Results in Table 6 shows the significant contributions of each design choice. Since some models in these studies are not trained with softmax classifier layers, comparisons are made with a simple Nearest-Neighbor (NN) method, which for each test example finds the closest neighbor in the memory bank and uses that neighbor’s label as its prediction. As reported in wu2018unsupervised ; zhuang2019local , higher NN performance strongly predicts higher softmax categorization performance.

6 Discussion

In this work, we presented LLP, a method for semi-supervised deep neural network training that scales to large datasets. LLP efficiently propagates labels from known to unknown examples in a common embedding space, ensuring high-quality propagation by exploiting the local structure of the embedding. The embedding itself is simultaneously co-trained to achieve high categorization performance while enforcing statistical consistency between real and pseudo-labels. LLP achieves state-of-the-art semi-supervised learning results across all tested training regimes, including those with very small amounts of labelled data, and transfers effectively to other non-trained tasks.

In future work, we seek to improve LLP by better integrating it with state-of-the-art unsupervised learning methods (e.g. zhuang2019local ). This is especially relevant in the regime with very-low fractions of known labelled datapoints (e.g. 1% of ImageNet labels), where the best pure unsupervised methods appear to outperform the state-of-the-art semi-supervised approaches. In addition, in its current formulation, LLP may be less effective on small datasets than alternatives that exploit global similarity structure (e.g. iscen2019label ; liu2018deep ). We thus hope to improve upon LLP by identifying methods of label propagation that can take advantage of global structure while remaining scalable.

However, the real promise of semi-supervised learning is that it will enable tasks that are not already essentially solvable with supervision (cf. izeki2019billi ). Thus, an important direction for future work lies in applying LLP or related algorithms to tasks beyond object categorization where dense labeling is possible but very costly, such as medical imaging, multi-object scene understanding, 3D shape reconstruction, video-based action recognition and tracking, or understanding multi-modal audio-visual datastreams.