1 Introduction
Currently the predominant systems for visual object recognition and detection (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Girshick et al., 2013; Sermanet et al., 2013; Szegedy et al., 2014) use purely supervised training with regularization such as dropout (Hinton et al., 2012) to avoid overfitting. These systems implicitly assume that the training labels are complete and correct; they do not account for missing labels, subjective labeling or inexhaustively-annotated images. However, this assumption often does not hold, especially for very large datasets and in high-resolution images with complex scenes. For example, in recognition, the class labels may be missing; in detection, the objects in the image may not all be localized; in subjective tasks such as facial emotion recognition, humans may not even agree on the class label. As training sets for deep networks become larger (as they should), the problem of missing and noisy labels becomes more acute, and so we argue it is a fundamental problem for scaling up vision.
In this work we propose a simple approach to handle noisy and incomplete labeling in weakly-supervised deep learning, by augmenting the usual prediction objective with a notion of perceptual consistency. We consider a prediction consistent if the same prediction is made given similar percepts, where the notion of similarity incorporates features learned by the deep network.
One interpretation of the perceptual consistency objective is that the learner makes use of its representation of the world (implicit in the network parameters) to match incoming percepts to known categories, or in general structured outputs. This provides the learner justification to “disagree” with a perceptually-inconsistent training label, and effectively re-label the data while training. More accurate labels may lead to a better model, which allows further label clean-up, and the learner bootstraps itself in this way. Of course, too much skepticism of the labels carries the risk of ending up with a delusional agent, so it is important to balance the trade-off between prediction and the learner’s perceptual consistency.
In our experiments we demonstrate that our approach yields substantial robustness to several types of label noise on several datasets. On MNIST handwritten digits (LeCun & Cortes, 1998) we show that our model is robust to label corruption. On the Toronto Face Database (Susskind et al., 2010) we show that our model handles subjective labels in emotion recognition well, achieving state-of-the-art results, and can also benefit from unlabeled face images with no modification to our method. On the ILSVRC2014 detection challenge data (Russakovsky et al., 2014), we show that our approach improves single-shot person detection using a MultiBox network (Erhan et al., 2014), and also improves performance in full 200-way detection using MultiBox for region proposal and a deep CNN for post-classification.
2 Related Work
The literature on semi-supervised and weakly-supervised learning is vast (see Zhu (2005) for a survey), and so in this section we focus on the key previous papers that inspired this work and on other papers on weakly- and semi-supervised deep learning.
The notion of bootstrapping, or “self-training” a learning agent, was proposed in (Yarowsky, 1995) as a way to do word-sense disambiguation with only unlabeled examples and a small list of seed example sentences with labels. The algorithm proceeds by building an initial classifier using the seed examples, and then iteratively classifying unlabeled examples, extracting new seed rules for the classifier using the now expanded training data, and repeating these steps until convergence. The algorithm was analyzed by Abney (2004) and more recently by Haffari & Sarkar (2012).
Co-training (Blum & Mitchell, 1998; Nigam et al., 2000; Nigam & Ghani, 2000) was similarly motivated but used a pair of classifiers with separate views of the data to iteratively learn and generate additional training labels. Whitney & Sarkar (2012) proposed bootstrapping labeled training examples with graph-based label propagation. Brodley & Friedl (1999) developed statistical methods for identifying mislabeled training data.
Rosenberg et al. (2005) also trained an object detection system in a weakly-supervised manner using self-training, and demonstrated that their proposed model achieved comparable performance to models trained with a much larger set of labels. However, that approach works as a wrapper around an existing detection system, whereas in this work we integrate a consistency objective for bootstrapping into the training of the deep network itself.
Our work shares a similar motivation to these earlier works, but instead of explicitly generating new training labels and adding new examples to the training set in an outer loop, we incorporate our consistency objective directly into the model. In addition, we consider not only the case of learning from unlabeled examples, but also from noisy labels and inexhaustively-annotated examples.
Mnih & Hinton (2012) developed deep neural networks for improved labeling of aerial images, with robust loss functions to handle label omission and registration errors. This work shares a similar motivation of robustness to noisy labels, but rather than formulating loss functions for specific types of noise, we add a generic consistency objective to the loss to achieve robustness.
Minimum entropy regularization, proposed in (Grandvalet & Bengio, 2005, 2006), performs semi-supervised learning by augmenting cross-entropy loss with a term encouraging the classifier to make predictions with high confidence on the unlabeled examples (see eq. 9.7 in Grandvalet & Bengio, 2006). This is notable because in their approach training on unlabeled examples does not require a generative model, which is beneficial for training on high-resolution images and other sensory data. We take a similar approach by side-stepping the difficulty of fully-generative models of high-dimensional sensory data. However, we extend beyond shallow models to deep networks, and to structured output prediction.
Never ending language learning (NELL) (Carlson et al., 2010) and never ending image learning (NEIL) (Chen et al., 2013, 2014) are lifelong-learning systems for language and image understanding, respectively. They continuously bootstrap themselves using a cycle of data collection, propagation of labels to the newly collected data, and self-improvement by training on the new data. Our work is complementary to these efforts, and focuses on building robustness to noisy and missing labels into the model for weakly-supervised deep learning.
Larochelle & Bengio (2008) developed an RBM for classification that uses a hybrid generative and discriminative training objective. Deep Boltzmann Machines (Salakhutdinov & Hinton, 2009) can also be trained in a semi-supervised manner with labels connected to the top layer. More recently, multi-prediction DBM training (Goodfellow et al., 2013) and Generative Stochastic Networks (Bengio & Thibodeau-Laufer, 2013) improved the performance and simplified the training of deep generative models, enabling training via backpropagation much like in standard deep supervised networks. However, fully-generative unsupervised training on high-dimensional sensory data, e.g. ImageNet images, is still far behind supervised methods in terms of performance, and so in this work we do not follow the generative approach directly. Instead, this work focuses on a way to benefit from unlabeled and weakly-labeled examples with minimal modification to existing deep supervised networks. We demonstrate increased robustness to label noise and performance improvements from unlabeled data for a minimal engineering effort.
More recently, the problem of deep learning from noisy labels has begun to receive attention. Lee (2013) also followed the idea of minimum entropy regularization, and proposed generating “pseudo-labels” as training targets for unlabeled data, and showed improved performance on MNIST with few labeled examples. Sukhbaatar & Fergus (2014) developed two deep learning techniques for handling noisy labels, learning to model the noise distribution in a top-down and bottom-up fashion. In this work, we push further by extending beyond class labels to structured outputs, and we achieve state-of-the-art scalable detection performance on ILSVRC2014, despite the fact that our method does not require explicitly modeling the noise distribution.
3 Method
In this section we describe two approaches: section 3.1 uses reconstruction error as a consistency objective and explicitly models the noise distribution as a matrix mapping model predictions to training labels. A reconstruction loss is added to promote top-down consistency of model predictions with the observations, which allows the model to discover the pattern of noise in the data.
The method presented in section 3.2 (bootstrapping) uses a convex combination of training labels and the current model’s predictions to generate the training targets, and thereby avoids directly modeling the noise distribution. This property is well-suited to the case of structured outputs, for which modeling dense interactions among all pairs of output units may be neither practical nor useful. These two approaches are compared empirically in section 4.
In section 3.3 we show how to apply our bootstrapping approach to structured outputs by using the MultiBox (Erhan et al., 2014) region proposal network to handle the case of inexhaustive structured output labeling for single-shot person detection and for class-agnostic region proposal.
3.1 Consistency in multi-class prediction via reconstruction
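Let $x \in \{0,1\}^D$ be the data (or deep features computed from the data) and $t \in \{0,1\}^L$, $\sum_k t_k = 1$, the observed noisy multinomial labels. The standard softmax regresses $x$ onto $t$ without taking into account noisy or missing labels. In addition to optimizing the conditional log-likelihood $\log P(t \mid x)$, we add a regularization term encouraging the class prediction to be perceptually consistent.
We first introduce into our model the “true” class label (as opposed to the noisy label observations) as a latent multinomial variable $q \in \{0,1\}^L$, $\sum_j q_j = 1$. Our deep feed-forward network models the posterior over $q$ using the usual softmax regression:
$$P(q_j = 1 \mid x) = \frac{\exp\left(\sum_{i=1}^{D} W^{(1)}_{ij} x_i + b^{(1)}_j\right)}{\sum_{j'=1}^{L} \exp\left(\sum_{i=1}^{D} W^{(1)}_{ij'} x_i + b^{(1)}_{j'}\right)} = \frac{\tilde{P}(q_j = 1 \mid x)}{\sum_{j'=1}^{L} \tilde{P}(q_{j'} = 1 \mid x)} \tag{1}$$
where $\tilde{P}$ denotes the unnormalized probability distribution. Given the true label $q$, the label noise can be modeled using another softmax with logits as follows:
$$\log \tilde{P}(t_k = 1 \mid q) = \sum_{j=1}^{L} W^{(2)}_{kj} q_j + b^{(2)}_k \tag{2}$$
Roughly, $W^{(2)}_{kj}$ learns the log-probability of observing true label $j$ as noisy label $k$. Given only an observation $x$, we can marginalize over $q$ to compute the posterior of target $t$ given $x$:
$$P(t_k = 1 \mid x) = \sum_{j=1}^{L} P(t_k = 1, q_j = 1 \mid x) = \sum_{j=1}^{L} P(t_k = 1 \mid q_j = 1)\, P(q_j = 1 \mid x) \tag{3}$$
where the label noise distribution and posterior over true labels are defined above. We can perform discriminative training by gradient ascent on $\log P(t \mid x)$. However, this purely discriminative training does not yet incorporate perceptual consistency, and there is no explicit incentive for the model to treat $q$ as the “true” label; it can be viewed as another hidden layer, with the multinomial constraint resulting in an information bottleneck.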
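The marginalization over the latent true label can be made concrete with a small sketch. It assumes the noise model is a softmax whose logits, for a one-hot true label $j$, are column $j$ of a matrix `W2` plus a bias `b2`; the function names and all toy numbers are illustrative and are not taken from our implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def true_label_posterior(x, W1, b1):
    """P(q_j = 1 | x): softmax over logits sum_i W1[i][j] * x[i] + b1[j]."""
    num_classes = len(b1)
    logits = [
        sum(W1[i][j] * x[i] for i in range(len(x))) + b1[j]
        for j in range(num_classes)
    ]
    return softmax(logits)


def noise_conditional(W2, b2):
    """Table noise[j][k] = P(t_k = 1 | q_j = 1), one softmax per true class j."""
    num_classes = len(b2)
    return [
        softmax([W2[k][j] + b2[k] for k in range(num_classes)])
        for j in range(num_classes)
    ]


def noisy_label_posterior(x, W1, b1, W2, b2):
    """P(t_k = 1 | x) = sum_j P(t_k = 1 | q_j = 1) * P(q_j = 1 | x)."""
    q = true_label_posterior(x, W1, b1)
    noise = noise_conditional(W2, b2)
    num_classes = len(q)
    return [
        sum(noise[j][k] * q[j] for j in range(num_classes))
        for k in range(num_classes)
    ]


# Toy example: 2 binary features, 3 classes (arbitrary numbers).
x = [1.0, 0.0]
W1 = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]  # near-identity noise
b2 = [0.0, 0.0, 0.0]
p_t = noisy_label_posterior(x, W1, b1, W2, b2)
```

With a near-identity `W2` the noisy-label posterior stays close to the true-label posterior; off-diagonal mass in `W2` moves probability toward the classes that labels are commonly corrupted into.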
In unpublished work (known from personal correspondence), Hinton & Mnih (2009) developed a Restricted Boltzmann Machine (RBM) (Smolensky, 1986) variant with hidden multinomial output unit $q$ and observed noisy label unit $t$ as described above. The associated energy function can be written as
$$E(x, t, q) = -\sum_{i,j} W^{(1)}_{ij} x_i q_j - \sum_{j,k} W^{(2)}_{kj} t_k q_j - \sum_{j} b^{(1)}_j q_j - \sum_{k} b^{(2)}_k t_k - \sum_{i} b^{(3)}_i x_i \tag{4}$$
Due to the bipartite structure of the RBM, $t$ and $x$ are conditionally independent given $q$, and so the energy function in eq. (4) leads to a similar form of the posterior as in eq. (3), marginalizing out the hidden multinomial unit. The probability distribution arising from (4) is given by
$$P(x, t) = \frac{\sum_{q} \exp(-E(x, t, q))}{Z}$$
where $Z = \sum_{x,t,q} \exp(-E(x, t, q))$ is the partition function. The model can be trained with a generative objective, e.g. by approximate gradient ascent on $\log P(x, t)$ (Hinton, 2002). Generative training naturally provides a notion of consistency between observations and predictions because the model learns to draw sample observations via the conditional likelihood $P(x_i = 1 \mid q) = \sigma\left(\sum_{j} W^{(1)}_{ij} q_j + b^{(3)}_i\right)$, assuming binary observations.
However, fully-generative training is complicated by the fact that the exact likelihood gradient is intractable due to computing the partition function $Z$, and in practice MCMC is used. Generative training is further complicated (though certainly still possible) in cases where the features are non-binary. To avoid these complications, and to make our approach rapidly applicable to existing deep networks using rectified linear activations, and trainable via exact gradient descent, we propose an analogous autoencoder version.
Figure 1 compares the RBM and autoencoder approaches to the multi-class prediction problem with perceptual consistency. The overall objective in the feed-forward version is as follows:
$$L_{recon}(x, t) = -\sum_{k=1}^{L} t_k \log P(t_k = 1 \mid x) + \beta \left\| x - \hat{x}(q(x)) \right\|_2^2 \tag{5}$$
where $q(x)_j = P(q_j = 1 \mid x)$ as in equation 1 and $\hat{x}(q(x))$ is the reconstruction of $x$ from $q(x)$. The parameter $\beta$ can be found via cross-validation. Experimental results using this method are presented in sections 4.1 and 4.2.
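As a rough sketch of how the two terms of the feed-forward objective combine, the following computes a cross-entropy on the noisy label plus a weighted squared reconstruction error. The decoder matrix `W_dec`, which maps class probabilities back to input space, is a hypothetical stand-in chosen for illustration, as are the function name and the toy values.

```python
import math


def recon_objective(x, t, p_t, q, W_dec, beta):
    """Noisy-label cross-entropy plus beta times squared reconstruction error.

    x     : input features (length D)
    t     : one-hot noisy label (length L)
    p_t   : model probabilities P(t_k = 1 | x) (length L)
    q     : model probabilities P(q_j = 1 | x) (length L)
    W_dec : hypothetical D x L linear decoder from q back to input space
    beta  : weight of the reconstruction (consistency) term
    """
    cross_entropy = -sum(tk * math.log(pk) for tk, pk in zip(t, p_t) if tk > 0)
    x_hat = [
        sum(W_dec[i][j] * q[j] for j in range(len(q))) for i in range(len(x))
    ]
    recon_error = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    return cross_entropy + beta * recon_error


# Toy example with D = 2 features and L = 3 classes (arbitrary numbers).
x = [1.0, 0.0]
t = [1.0, 0.0, 0.0]
p_t = [0.5, 0.3, 0.2]
q = [0.8, 0.1, 0.1]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
loss = recon_objective(x, t, p_t, q, W_dec, beta=0.5)
```

Larger `beta` puts more weight on being able to explain the input from the predicted class, and correspondingly less on agreeing with the given label.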
3.2 Consistency in multi-class prediction via bootstrapping
In this section we develop a simple consistency objective that does not require an explicit noise distribution or a reconstruction term. The idea is to dynamically update the targets of the prediction objective based on the current state of the model. The resulting targets are a convex combination of (1) the noisy training label, and (2) the current prediction of the model. Intuitively, as the learner improves over time, its predictions can be trusted more. This mitigates the damage of incorrect labeling, because incorrect labels are likely to be eventually highly inconsistent with other stimuli predicted to have the same label by the model.
By paying less heed to inconsistent labels, the learner can develop a more coherent model, which further improves its ability to evaluate the consistency of noisy labels. We refer to this approach as “bootstrapping”, in the sense of pulling oneself up by one’s own bootstraps, and also due to inspiration from the work of Yarowsky (1995) which is also referred to as bootstrapping.
Concretely, we use a cross-entropy objective as before, but generate new regression targets for each SGD mini-batch based on the current state of the model. We empirically evaluated two types of bootstrapping. “Soft” bootstrapping uses predicted class probabilities $q_k = P(q_k = 1 \mid x)$ directly to generate regression targets for each batch as follows:
$$L_{soft}(q, t) = -\sum_{k=1}^{L} \left[\beta t_k + (1 - \beta) q_k\right] \log(q_k) \tag{6}$$
In fact, it can be shown that the resulting objective is equivalent to softmax regression with minimum entropy regularization, which was previously studied in (Grandvalet & Bengio, 2006). Intuitively, minimum entropy regularization encourages the model to have a high confidence in predicting labels (even for the unlabeled examples, which enables semisupervised learning).
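The equivalence can be checked numerically. The sketch below assumes the soft objective is the cross-entropy between the mixed target $\beta t + (1-\beta) q$ and the prediction $q$; expanding the sum then gives $\beta$ times the usual cross-entropy plus $(1-\beta)$ times the entropy of $q$, which is the minimum entropy regularizer. Function names and numbers are illustrative.

```python
import math


def soft_bootstrap_loss(q, t, beta):
    """Cross-entropy between the target beta * t + (1 - beta) * q and q."""
    return -sum(
        (beta * tk + (1.0 - beta) * qk) * math.log(qk) for qk, tk in zip(q, t)
    )


def cross_entropy(q, t):
    """Standard cross-entropy of prediction q against the label t."""
    return -sum(tk * math.log(qk) for qk, tk in zip(q, t))


def entropy(q):
    """Shannon entropy of the predicted distribution q."""
    return -sum(qk * math.log(qk) for qk in q)


q = [0.7, 0.2, 0.1]   # predicted class probabilities (all strictly positive)
t = [0.0, 1.0, 0.0]   # noisy one-hot label
beta = 0.95

loss = soft_bootstrap_loss(q, t, beta)
decomposed = beta * cross_entropy(q, t) + (1.0 - beta) * entropy(q)
```

Because the entropy term does not depend on $t$, it is also defined for unlabeled examples, which is what allows the same objective to be used for semi-supervised learning.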
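“Hard” bootstrapping modifies regression targets using the MAP estimate of $q$ given $x$, which we denote as $z_k := \mathbb{1}\left[k = \arg\max_{i=1,\dots,L} q_i\right]$:
$$L_{hard}(q, t) = -\sum_{k=1}^{L} \left[\beta t_k + (1 - \beta) z_k\right] \log(q_k) \tag{7}$$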
When used with mini-batch stochastic gradient descent, this leads to an EM-like algorithm: In the E-step, estimate the “true” confidence targets as a convex combination of training labels and model predictions; in the M-step, update the model parameters to better predict those generated targets.
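The two steps can be sketched for a single example as follows. This is a minimal illustration, assuming one-hot noisy labels and treating the generated targets as constants when computing the loss; the function names and the value of `beta` are illustrative.

```python
import math


def hard_bootstrap_targets(q, t, beta):
    """E-step: mix the noisy one-hot label t with the one-hot MAP prediction."""
    map_class = max(range(len(q)), key=lambda k: q[k])
    z = [1.0 if k == map_class else 0.0 for k in range(len(q))]
    return [beta * tk + (1.0 - beta) * zk for tk, zk in zip(t, z)]


def cross_entropy(targets, q):
    """Loss minimized in the M-step; the targets are held fixed."""
    return -sum(y * math.log(qk) for y, qk in zip(targets, q) if y > 0.0)


# The model is confident in class 0 but the (possibly wrong) label says class 1.
q = [0.9, 0.05, 0.05]
t = [0.0, 1.0, 0.0]

targets = hard_bootstrap_targets(q, t, beta=0.8)
loss_bootstrap = cross_entropy(targets, q)
loss_plain = cross_entropy(t, q)
```

In this example the bootstrapped loss is smaller than the plain cross-entropy, because part of the target mass has moved to the class the model already predicts; the inconsistent label is penalized less heavily.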
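Both hard and soft bootstrapping can be viewed as instances of a more general approach in which model-generated regression targets are modulated by a softmax temperature parameter $T$; i.e.
$$P(q_j = 1 \mid x, T) = \frac{\exp\left(T \cdot \left(\sum_{i=1}^{D} W^{(1)}_{ij} x_i + b^{(1)}_j\right)\right)}{\sum_{j'=1}^{L} \exp\left(T \cdot \left(\sum_{i=1}^{D} W^{(1)}_{ij'} x_i + b^{(1)}_{j'}\right)\right)} \tag{8}$$
Setting $T = 1$ recovers soft bootstrapping, and $T = \infty$ recovers hard bootstrapping. We only use these two operating points in our experiments, but it may be worthwhile to explore other values for $T$, and learning $T$ for each dataset.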
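A small sketch of the interpolation between the two operating points, under the assumption that the temperature parameter scales the logits multiplicatively, so that a value of 1 leaves the softmax unchanged and very large values approach a one-hot argmax. The function name and logits are illustrative.

```python
import math


def tempered_softmax(logits, T):
    """Softmax of T * logits, computed in a numerically stable way."""
    scaled = [T * v for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    s = sum(exps)
    return [e / s for e in exps]


logits = [2.0, 1.0, 0.0]
soft = tempered_softmax(logits, 1.0)            # ordinary softmax probabilities
nearly_hard = tempered_softmax(logits, 1000.0)  # close to a one-hot MAP estimate
```

Intermediate values give targets that are sharper than the raw predictions but not fully committed to the MAP class.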
3.3 Consistency with structured output prediction
Noisy labels also occur in structured output prediction problems such as object detection. Current state-of-the-art object detection systems train on images annotated with bounding box labels of the relevant objects in each image, and the class label for each box. However, it is expensive to exhaustively annotate each image, and for some commonly-appearing categories the data may be prone to missing annotations. In this section, we modify the training objective of the MultiBox (Erhan et al., 2014) network for object detection to incorporate a notion of perceptual consistency into the loss.
In the MultiBox approach, ground-truth bounding boxes are clustered and the resulting centroids are used as “priors” for predicting object location. A deep neural network is trained to predict, for each ground-truth object in an image, a residual of that ground-truth bounding box to the best-matching bounding box prior. The network also outputs a logistic confidence score for each prior, indicating the model’s belief of whether or not an object appears in the corresponding location. Because MultiBox gives proposals with confidence scores, it enables very efficient runtime-quality trade-offs for detection via thresholding the top-scoring proposals within budget. Thus it is an attractive target for further quality improvements, as we pursue in this section.
Denote the confidence score training targets as $t \in \{0,1\}^L$ and the predicted confidence scores as $c \in [0,1]^L$. The objective for MultiBox (we omit the bounding box regression term for simplicity; see Erhan et al., 2014 for full details) can be written as the following cross-entropy loss:
$$L_{multibox}(c, t) = -\sum_{k=1}^{L} \left( t_k \log(c_k) + (1 - t_k) \log(1 - c_k) \right) \tag{9}$$
Note that the sum here is over object locations, not class labels as in the case of sections 3.1 and 3.2.
If there is an object at location $k$, but $t_k = 0$ due to inexhaustive annotation, the model pays a large cost for correctly predicting $c_k = 1$. Training naively on the noisy labels leads to perverse learning situations such as the following: two objects of the same category (potentially within the same image) appear in the training data, but only one of them is labeled. To reduce the loss, the confidence prediction layer must learn to distinguish the two objects, which is exactly contrary to the objective of visual invariance to category-preserving differences.
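To incorporate a notion of perceptual consistency into the loss, we follow the same approach as in the case of multi-class classification: augment the regression targets using the model’s current state. In the “hard” case, MAP estimates can be obtained by thresholding $c_k > 1/2$.
$$L_{multibox\text{-}hard}(c, t) = -\sum_{k=1}^{L} \left[\beta t_k + (1 - \beta)\, \mathbb{1}_{c_k > 0.5}\right] \log(c_k) - \sum_{k=1}^{L} \left[\beta (1 - t_k) + (1 - \beta)\left(1 - \mathbb{1}_{c_k > 0.5}\right)\right] \log(1 - c_k) \tag{10}$$
$$L_{multibox\text{-}soft}(c, t) = -\sum_{k=1}^{L} \left[\beta t_k + (1 - \beta)\, c_k\right] \log(c_k) - \sum_{k=1}^{L} \left[\beta (1 - t_k) + (1 - \beta)(1 - c_k)\right] \log(1 - c_k) \tag{11}$$
With the bootstrap variants of the MultiBox objective, unlabeled positives pose less of a problem because penalties for large $c_k$ are down-scaled by factor $\beta$ in the first term and $(1 - c_k)$ in the second term. By mitigating penalties due to missing positives in the data, our approach allows the model to learn to predict $c_k$ with high confidence even if the objects at location $k$ are often unlabeled.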
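The effect on an unlabeled positive can be seen in a small sketch of the confidence loss. It assumes a per-location binary cross-entropy whose targets are a convex combination of the annotation and either the thresholded prediction (hard) or the prediction itself (soft); the function name, the threshold of 0.5 and the toy numbers are illustrative.

```python
import math


def multibox_bootstrap_loss(c, t, beta, hard=True):
    """Per-location binary cross-entropy with bootstrapped confidence targets.

    c    : predicted confidences, each strictly between 0 and 1
    t    : 0/1 annotation targets, one per location
    beta : weight on the annotation; beta = 1 gives the plain loss
    hard : use the thresholded prediction (True) or the raw prediction (False)
    """
    total = 0.0
    for ck, tk in zip(c, t):
        model_target = (1.0 if ck > 0.5 else 0.0) if hard else ck
        positive = beta * tk + (1.0 - beta) * model_target
        negative = beta * (1.0 - tk) + (1.0 - beta) * (1.0 - model_target)
        total -= positive * math.log(ck) + negative * math.log(1.0 - ck)
    return total


# Location 0 holds an object that was never annotated (t = 0) but is
# confidently detected; location 1 is true background.
c = [0.9, 0.1]
t = [0.0, 0.0]

plain = multibox_bootstrap_loss(c, t, beta=1.0)
hard_loss = multibox_bootstrap_loss(c, t, beta=0.8, hard=True)
soft_loss = multibox_bootstrap_loss(c, t, beta=0.8, hard=False)
```

Both variants reduce the penalty for the confident prediction at the unannotated location relative to the plain loss, while the background location is affected only slightly or not at all.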
4 Experiments
We perform experiments on three image understanding tasks: MNIST handwritten digits recognition, Toronto Faces Database facial emotion recognition, and ILSVRC2014 detection. In all tasks, we train a deep neural network with our proposed consistency objective. In our figures, “bootstrap-recon” refers to training as described in section 3.1, using reconstruction as a consistency objective. “bootstrap-soft” and “bootstrap-hard” refer to our method described in sections 3.2 and 3.3.
4.1 MNIST with noisy labels
In this section we train using our reconstruction-based objective (detailed in section 3.1) on MNIST handwritten digits with varying degrees of noise in the labels. Specifically, we used a fixed random permutation of the labels as visualized in figure 2, and we perform control experiments while varying the probability of applying the label permutation to each training example.
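The corruption protocol can be sketched as follows: draw one permutation of the class indices, keep it fixed, and apply it to each training label independently with a given probability. The function name, seed handling and toy labels are illustrative assumptions rather than the exact procedure used in our experiments.

```python
import random


def make_label_noise(labels, num_classes, noise_prob, seed=0):
    """Apply a fixed random label permutation to each label with noise_prob.

    Returns the corrupted labels and the permutation that was used. A random
    permutation may map some classes to themselves.
    """
    rng = random.Random(seed)
    perm = list(range(num_classes))
    rng.shuffle(perm)
    noisy = [perm[y] if rng.random() < noise_prob else y for y in labels]
    return noisy, perm


labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 3

clean, perm = make_label_noise(labels, num_classes=10, noise_prob=0.0)
fully_noisy, perm_full = make_label_noise(labels, num_classes=10, noise_prob=1.0)
half_noisy, _ = make_label_noise(labels, num_classes=10, noise_prob=0.5)
```

Because the permutation is fixed, the noise is systematic rather than uniform: each true class is always confused with the same wrong class, which is the structure a learned noise matrix can pick up.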
All models were trained with mini-batch SGD, with the same architecture: a 784-500-300-10 neural network with rectified linear units. We used weight decay for regularization, and tuned $\beta$ separately for bootstrap-hard, bootstrap-soft, and bootstrap-recon. We initialize $W^{(2)}$ to the identity matrix.
For the network trained with our proposed consistency objective, we initialized the network layers from the baseline prediction-only model. It is also possible to initialize from scratch using our approach, but we found that with pre-training we could use a larger $\beta$ and more quickly converge to a good result. Intuitively, this is similar to the initial collection of “seed” rules in the original bootstrapping algorithm of (Yarowsky, 1995). During the fine-tuning training phase, all network weights are updated by backpropagating gradients through the layers.
Figure 2 shows that our bootstrapping method provides a very significant benefit in the case of permuted labels. The bootstrap-recon method performs the best, and bootstrap-hard nearly as well. The bootstrap-soft method provides some benefit in the high-noise regime, but only slightly better than the baseline overall.
Figure 2 also shows that bootstrap-recon effectively learns the noise distribution via the parameters $W^{(2)}$. Intuitively, the loss from the reconstruction term provides the learner a basis on which predictions $q$ may disagree with training labels $t$. Since $x$ must be able to be reconstructed from $q$ in bootstrap-recon, learning a non-identity $W^{(2)}$ allows $q$ to flexibly vary from $t$ to better reconstruct $x$ without incurring a penalty from prediction error.
However, it is interesting to note that bootstrap-hard achieves nearly equivalent performance without explicitly parameterizing the noise distribution. This is useful because reconstruction may be challenging in many cases, such as when $x$ is drawn from a complicated, high-dimensional distribution, and bootstrap-hard is trivial to implement on top of existing deep supervised networks.
4.2 Toronto Faces Database emotion recognition
In this section we present results on emotion recognition. The Toronto Faces Database has 112,234 images, 4,178 of which have emotion labels. In all experiments we first extracted spatial-pyramid-pooled OMP-1 features as described in (Coates & Ng, 2011) to get 3200-dimensional features. We then trained a 3200-1000-500-7 network to predict the 1-of-7 emotion labels for each image.
As in the case for our MNIST experiments, we initialize our model from the network pre-trained with prediction only, and then fine-tuned all layers with our hybrid objective.
Figure 4 summarizes our TFD results. As in the case of MNIST, bootstrap-recon and bootstrap-hard perform the best, significantly outperforming the softmax baseline, and bootstrap-soft provides a more modest improvement.
Training | Accuracy (%)
--- | ---
baseline |
bootstrap-recon |
bootstrap-hard |
bootstrap-soft |
disBM (Reed et al., 2014) |
CDA+CCA (Rifai et al., 2012) |
There is significant off-diagonal weight in $W^{(2)}$ learned by bootstrap-recon on TFD, suggesting that the model learns to “hedge” its emotion prediction during training by spreading probability mass from the predicted class to commonly-confused classes, such as “afraid” and “surprised”. The strongest off-diagonals are in the “happy” column, which may be due to the fact that “happy” is the most common expression, or perhaps that happy expressions have a large visual diversity. Our method improves the emotion recognition performance, which to our knowledge is state-of-the-art. The performance improvement from all three bootstrap methods on TFD suggests that our approach can be useful not just for mistaken labels, but also for semi-supervised learning (missing labels) and learning from weak labels such as emotion categories.
4.3 ILSVRC 2014 fast single-shot person detection
In this section we apply our method to detecting persons using a MultiBox network built on top of the Inception architecture proposed in (Szegedy et al., 2014). We first pre-trained MultiBox on class-agnostic localization using the full ILSVRC2014 training set, since there are only several thousand images labeled with persons in ILSVRC, and then fine-tuned on person images only.
An important point for comparison is the top-$K$ bootstrapping heuristic introduced for MultiBox training in (Szegedy et al., 2014) for person detection in the presence of missing annotations. In that approach, the top-$K$ largest confidence predictions are dropped from the loss (which we show here in eq. (9)), for a fixed setting of $K$. In other words, there is no gradient coming from the top-$K$ most confident location predictions. In fact, it can be viewed as a form of bootstrapping where only the top-$K$ most confident locations modify their targets, which become the predictions themselves. In this work, we aim to achieve similar or better performance in a more general way that can be applied to MultiBox and other discriminative models.
Training | AP (%) | Recall @ 60p
--- | --- | ---
baseline | 30.9 | 14.3
top-K heuristic | 44.1 | 43.4
bootstrap-hard | 44.6 | 43.9
bootstrap-soft | 43.6 | 42.8
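For comparison with the bootstrap objectives, the top-K heuristic itself can be sketched as a per-location binary cross-entropy in which the K most confident predictions contribute nothing to the loss. This is an illustrative reading of the heuristic; the function names and the toy confidences are assumptions.

```python
import math


def binary_cross_entropy_terms(c, t):
    """Per-location confidence loss for predictions c and 0/1 targets t."""
    return [
        -(tk * math.log(ck) + (1 - tk) * math.log(1.0 - ck))
        for ck, tk in zip(c, t)
    ]


def top_k_dropped_loss(c, t, k):
    """Sum the per-location loss, skipping the k most confident predictions."""
    terms = binary_cross_entropy_terms(c, t)
    order = sorted(range(len(c)), key=lambda i: c[i], reverse=True)
    dropped = set(order[:k])
    return sum(term for i, term in enumerate(terms) if i not in dropped)


# Location 1 is a confidently detected object that was never annotated.
c = [0.9, 0.8, 0.1, 0.05]
t = [1, 0, 0, 0]

full_loss = top_k_dropped_loss(c, t, k=0)
dropped_loss = top_k_dropped_loss(c, t, k=2)
```

Dropping the top predictions removes the large penalty at the unannotated location, but it also removes the gradient from correctly labeled confident predictions, and the number of dropped locations is fixed regardless of how many objects an image contains.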
The precision-recall curves in figure 6 show that our proposed bootstrapping improves substantially over the prediction-only baseline. At the high-precision end of the PR curve, the approaches introduced in this paper perform better, while the top-K heuristic is slightly better at high-recall.
4.4 ILSVRC 2014 200-category detection
In this section we apply our method to the case of object detection on the large-scale ImageNet data. Our proposed method is applied in two ways: first to the MultiBox network for region proposal, and second to the classifier network that predicts labels for each cropped image region. We follow the approach in (Szegedy et al., 2014) and combine image crops from MultiBox region proposals with deep network context features as the input to the classifier for each proposed region.
We trained the MultiBox network as described in section 3.3, and the post-classifier network as described in section 3.2. We found that “hard” performed better than the “soft” form of bootstrapping.
MultiBox | Post-classifier | mAP (%) | Recall @ 60p
--- | --- | --- | ---
baseline | baseline | |
baseline | bootstrap-hard | |
bootstrap-hard | baseline | |
bootstrap-hard | bootstrap-hard | |
- | GoogLeNet single model (Szegedy et al., 2014) | 38.8 | -
- | DeepID-Net single model (Ouyang et al., 2014) | 40.1 | -
Using our proposed bootstrapping method, we observe modest improvement on the ILSVRC2014 detection “val2” data, mainly attributable to bootstrapping in MultiBox training. (In a previous version of this draft we included numbers on the full validation set. However, we discovered that context and post-classifier models used “val1” data, so we re-ran experiments only on the “val2” subset.)
5 Conclusions
In this paper we developed novel training methods for weakly-supervised deep learning, and demonstrated the effectiveness of our approach on multi-class prediction and structured output prediction for several datasets. Our method is exceedingly simple and can be applied with very little engineering effort to existing networks trained using a purely-supervised objective. The improvements that we show, even with very simple methods, suggest that moving beyond purely-supervised deep learning is worthy of further research attention. In addition to achieving better performance with the data we already have, our results suggest that performance gains may be achieved from collecting more data at a cheaper price, since image annotation need not be as exhaustive and mistaken labels are not as harmful to the performance.
In future work, it may be promising to consider learning a time-dependent policy for tuning $\beta$, the scaling factor between prediction and perceptual consistency objectives, and also to extend our approach to the case of a situated agent. Another promising direction is to augment large-scale training for detection (e.g. ILSVRC) with unlabeled and more weakly-labeled images, to further benefit from our proposed perceptual consistency objective.
References
 Abney (2004) Abney, Steven. Understanding the Yarowsky algorithm. Computational Linguistics, 30(3):365–395, 2004.
 Bengio & Thibodeau-Laufer (2013) Bengio, Yoshua and Thibodeau-Laufer, Eric. Deep generative stochastic networks trainable by backprop. arXiv preprint arXiv:1306.1091, 2013.
 Blum & Mitchell (1998) Blum, Avrim and Mitchell, Tom. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pp. 92–100. ACM, 1998.
 Brodley & Friedl (1999) Brodley, Carla E and Friedl, Mark A. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999.
 Carlson et al. (2010) Carlson, Andrew, Betteridge, Justin, Kisiel, Bryan, Settles, Burr, Hruschka Jr, Estevam R, and Mitchell, Tom M. Toward an architecture for never-ending language learning. In AAAI, volume 5, pp. 3, 2010.
 Chen et al. (2013) Chen, Xinlei, Shrivastava, Abhinav, and Gupta, Abhinav. NEIL: Extracting visual knowledge from web data. In Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 1409–1416. IEEE, 2013.
 Chen et al. (2014) Chen, Xinlei, Shrivastava, Abhinav, and Gupta, Abhinav. Enriching visual knowledge bases via object discovery and segmentation. CVPR, 2014.
 Coates & Ng (2011) Coates, Adam and Ng, Andrew Y. The importance of encoding versus training with sparse coding and vector quantization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 921–928, 2011.
 Erhan et al. (2014) Erhan, Dumitru, Szegedy, Christian, Toshev, Alexander, and Anguelov, Dragomir. Scalable Object Detection Using Deep Neural Networks. In CVPR, pp. 2155–2162, 2014.
 Girshick et al. (2013) Girshick, Ross, Donahue, Jeff, Darrell, Trevor, and Malik, Jitendra. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.
 Goodfellow et al. (2013) Goodfellow, Ian, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems, pp. 548–556, 2013.
 Grandvalet & Bengio (2005) Grandvalet, Yves and Bengio, Yoshua. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pp. 529–536, 2005.
 Grandvalet & Bengio (2006) Grandvalet, Yves and Bengio, Yoshua. Entropy regularization. 2006.
 Haffari & Sarkar (2012) Haffari, Gholam Reza and Sarkar, Anoop. Analysis of semi-supervised learning with the Yarowsky algorithm. arXiv preprint arXiv:1206.5240, 2012.
 Hinton (2002) Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
 Hinton & Mnih (2009) Hinton, Geoffrey E and Mnih, Volodymyr. Restricted Boltzmann machine with hidden multinomial output unit. Unpublished work, 2009.
 Hinton et al. (2012) Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
 Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Larochelle & Bengio (2008) Larochelle, Hugo and Bengio, Yoshua. Classification using discriminative restricted Boltzmann machines. In Proceedings of the 25th international conference on Machine learning, pp. 536–543. ACM, 2008.
 LeCun & Cortes (1998) LeCun, Yann and Cortes, Corinna. The MNIST database of handwritten digits, 1998.
 Lee (2013) Lee, Dong-Hyun. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
 Mnih & Hinton (2012) Mnih, Volodymyr and Hinton, Geoffrey E. Learning to label aerial images from noisy data. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 567–574, 2012.
 Nigam & Ghani (2000) Nigam, Kamal and Ghani, Rayid. Analyzing the effectiveness and applicability of co-training. In Proceedings of the ninth international conference on Information and knowledge management, pp. 86–93. ACM, 2000.
 Nigam et al. (2000) Nigam, Kamal, McCallum, Andrew Kachites, Thrun, Sebastian, and Mitchell, Tom. Text classification from labeled and unlabeled documents using EM. Machine learning, 39(2-3):103–134, 2000.
 Ouyang et al. (2014) Ouyang, Wanli, Luo, Ping, Zeng, Xingyu, Qiu, Shi, Tian, Yonglong, Li, Hongsheng, Yang, Shuo, Wang, Zhe, Xiong, Yuanjun, Qian, Chen, et al. DeepID-Net: multi-stage and deformable deep convolutional neural networks for object detection. arXiv preprint arXiv:1409.3505, 2014.
 Reed et al. (2014) Reed, Scott, Sohn, Kihyuk, Zhang, Yuting, and Lee, Honglak. Learning to disentangle factors of variation with manifold interaction. In Proceedings of The 31st International Conference on Machine Learning, 2014.
 Rifai et al. (2012) Rifai, Salah, Bengio, Yoshua, Courville, Aaron, Vincent, Pascal, and Mirza, Mehdi. Disentangling factors of variation for facial expression recognition. In Computer Vision–ECCV 2012, pp. 808–822. Springer, 2012.
 Rosenberg et al. (2005) Rosenberg, Chuck, Hebert, Martial, and Schneiderman, Henry. Semi-supervised self-training of object detection models. 2005.
 Russakovsky et al. (2014) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014.
 Salakhutdinov & Hinton (2009) Salakhutdinov, Ruslan and Hinton, Geoffrey E. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pp. 448–455, 2009.
 Sermanet et al. (2013) Sermanet, Pierre, Eigen, David, Zhang, Xiang, Mathieu, Michaël, Fergus, Rob, and LeCun, Yann. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
 Smolensky (1986) Smolensky, Paul. Information processing in dynamical systems: Foundations of harmony theory. 1986.
 Sukhbaatar & Fergus (2014) Sukhbaatar, Sainbayar and Fergus, Rob. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2014.
 Susskind et al. (2010) Susskind, Josh M, Anderson, Adam K, and Hinton, Geoffrey E. The Toronto face database. Department of Computer Science, University of Toronto, Toronto, ON, Canada, Tech. Rep, 2010.
 Szegedy et al. (2014) Szegedy, C., Reed, S., Erhan, D., and Anguelov, D. Scalable, High-Quality Object Detection. ArXiv e-prints, December 2014.
 Szegedy et al. (2014) Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
 Whitney & Sarkar (2012) Whitney, Max and Sarkar, Anoop. Bootstrapping via graph propagation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 620–628. Association for Computational Linguistics, 2012.
 Yarowsky (1995) Yarowsky, David. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics, pp. 189–196. Association for Computational Linguistics, 1995.
 Zeiler & Fergus (2013) Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.
 Zhu (2005) Zhu, Xiaojin. Semi-supervised learning literature survey. 2005.