1 Introduction
The success of deep neural networks to date depends strongly on the availability of labeled data, which is costly and not always easy to obtain. Usually it is much easier to obtain small quantities of high-quality labeled data and large quantities of unlabeled data. The problem of how to best integrate these two different sources of information during training is an active pursuit in the field of semi-supervised learning (Chapelle et al., 2006). However, for a large class of tasks it is also easy to define one or more so-called “weak annotators”, additional (albeit noisy) sources of weak supervision based on heuristics or “weaker”, biased classifiers trained on e.g. non-expert crowd-sourced data or data from different domains that are related. While easy and cheap to generate, it is not immediately clear if and how these additional weakly-labeled data can be used to train a stronger classifier for the task we care about. More generally, in almost all practical applications machine learning systems have to deal with data samples of variable quality. For example, in a large dataset of images only a small fraction of samples may be labeled by experts and the rest may be crowd-sourced using e.g. Amazon Mechanical Turk (Veit et al., 2017). In addition, in some applications, labels are intentionally perturbed due to privacy issues (Wainwright et al., 2012; Papernot et al., 2017).
Assuming we can obtain a large set of weakly-labeled data in addition to a much smaller training set of “strong” labels, the simplest approach is to expand the training set by including the weakly-supervised samples (all samples are equal). Alternatively, one may pretrain on the weak data and then fine-tune on observations from the true function or distribution (which we call strong data). Indeed, it has recently been shown that a small amount of expert-labeled data can be augmented in such a way by a large set of raw data, with labels coming from a heuristic function, to train a more accurate neural ranking model (Dehghani et al., 2017d). The downside is that such approaches are oblivious to the amount or source of noise in the labels.
In this paper, we argue that treating weakly-labeled samples uniformly (i.e. each weak sample contributes equally to the final classifier) ignores potentially valuable information about the label quality. Instead, we propose Fidelity-Weighted Learning (FWL), a Bayesian semi-supervised approach that leverages a small amount of data with true labels to generate a larger training set with confidence-weighted weakly-labeled samples, which can then be used to modulate the fine-tuning process based on the fidelity (or quality) of each weak sample. By directly modeling the inaccuracies introduced by the weak annotator in this way, we can control the extent to which we make use of this additional source of weak supervision: more for confidently-labeled weak samples close to the true observed data, and less for uncertain samples further away from the observed data.
We propose a setting consisting of two main modules. One is called the student and is in charge of learning a suitable data representation and performing the main prediction task; the other is the teacher, which modulates the learning process by modeling the inaccuracies in the labels. We explain our approach in much more detail in Section 2, but at a high level it works as follows (see Figure 1): We pretrain the student network on weak data to learn an initial task-dependent data representation, which we pass to the teacher along with the strong data. The teacher then learns to predict the strong data, but crucially, based on the student’s learned representation. This then allows the teacher to generate new labeled training data from unlabeled data, and in the process correct the student’s mistakes, leading to a better final data representation and better final predictor.
We introduce the proposed FWL approach in more detail in Section 2. We then present our experimental setup in Section 3, where we evaluate FWL on a toy task and two real-world tasks, namely document ranking and sentence sentiment classification. In all cases, FWL outperforms competitive baselines and yields state-of-the-art results, indicating that FWL makes better use of the limited true labeled data and is thereby able to learn a better and more meaningful task-specific representation of the data. Section 4 provides analysis of the bias-variance trade-off and the learning rate, suggesting also to view FWL from the perspective of Vapnik’s learning using privileged information (LUPI) framework (Vapnik & Izmailov, 2015). Section 5 situates FWL relative to related work, and we end the paper by drawing the main conclusions in Section 6.
2 Fidelity-Weighted Learning (FWL)
In this section, we describe our proposed FWL approach for semi-supervised learning when we have access to weak supervision (e.g. heuristics or weak annotators). We assume we are given a large set of unlabeled data samples, a heuristic labeling function called the weak annotator, and a small set of high-quality samples labeled by experts, called the strong dataset, consisting of tuples of training samples $x_i$ and their true labels $y_i$, i.e. $\mathcal{D}_s = \{(x_i, y_i)\}$. We consider the latter to be observations from the true target function that we are trying to learn. We use the weak annotator to generate labels for the unlabeled samples. Generated labels are noisy due to the limited accuracy of the weak annotator. This gives us the weak dataset consisting of tuples of training samples $x_i$ and their weak labels $\tilde{y}_i$, i.e. $\mathcal{D}_w = \{(x_i, \tilde{y}_i)\}$. Note that we can generate a large amount of weak training data at almost no cost using the weak annotator. In contrast, we have only a limited number of observations from the true function, i.e. $|\mathcal{D}_s| \ll |\mathcal{D}_w|$.
Our proposed setup comprises a neural network called the student and a Bayesian function approximator called the teacher. The training process consists of three phases which we summarize in Algorithm 1 and Figure 1.
Step 1 Pretrain the student on $\mathcal{D}_w$ using weak labels generated by the weak annotator.
The main goal of this step is to learn a task-dependent representation of the data as well as to pretrain the student. The student function is a neural network consisting of two parts. The first part $\psi(\cdot)$ learns the data representation and the second part $\phi(\cdot)$ performs the prediction task (e.g. classification). Therefore the overall function is $\hat{y} = \phi(\psi(x_j))$. The student is trained on all samples of the weak dataset $\mathcal{D}_w = \{(x_j, \tilde{y}_j)\}$. For brevity, in the following, we will refer to both a data sample and its representation by $x_j$ when it is obvious from the context. From the self-supervised feature learning point of view, we can say that representation learning in this step is solving a surrogate task of approximating the expert knowledge, for which a noisy supervision signal is provided by the weak annotator.
Step 2 Train the teacher on the strong data $(x_j, y_j) \in \mathcal{D}_s$ represented in terms of the student representation $\psi(\cdot)$, and then use the teacher to generate a soft dataset $\mathcal{D}_{sw}$ consisting of a soft label and an uncertainty for all data samples.
We use a Gaussian process (GP) as the teacher to capture the label uncertainty in terms of the student representation, estimated w.r.t. the strong data. We explain the finer details of the GP in Appendix C, and just present the overall description here. A prior mean and covariance function is chosen for the GP. The embedding function $\psi(\cdot)$ learned in Step 1 is then used to map the data samples to dense vectors as input to the GP. We use the representation learned by the student in the previous step to compensate for the lack of data in $\mathcal{D}_s$, so that the teacher can benefit from the knowledge learned from the large quantity of weakly annotated data. This way, we also let the teacher see the data through the lens of the student.
The GP is trained on the samples from $\mathcal{D}_s$ to learn the posterior mean $m_{\mathrm{post}}$ (used to generate soft labels) and posterior covariance $K_{\mathrm{post}}(\cdot, \cdot)$ (which represents label uncertainty). We then create the soft dataset $\mathcal{D}_{sw} = \{(x_t, \bar{y}_t)\}$ using the posterior GP, input samples $x_t$ from $\mathcal{D}_w \cup \mathcal{D}_s$, and predicted labels $\bar{y}_t$ with their associated uncertainties as computed by $T(x_t)$ and $\Sigma(x_t)$:
$$T(x_t) = g(m_{\mathrm{post}}(x_t)), \qquad \Sigma(x_t) = h(K_{\mathrm{post}}(x_t, x_t)).$$
The generated labels $\bar{y}_t$ are called soft labels. Therefore, we refer to $\mathcal{D}_{sw}$ as a soft dataset. $g(\cdot)$ transforms the output of the GP to the suitable output space. For example, in classification tasks, $g(\cdot)$ would be the softmax function to produce probabilities that sum up to one. For multidimensional-output tasks where a vector of variances is provided by the GP, the vector $K_{\mathrm{post}}(x_t, x_t)$ is passed through an aggregating function $h(\cdot)$ to generate a scalar value for the uncertainty of each sample. Note that we train the GP only on the strong dataset $\mathcal{D}_s$ but then use it to generate soft labels $\bar{y}_t = T(x_t)$ and uncertainty $\Sigma(x_t)$ for samples belonging to $\mathcal{D}_{sw} = \mathcal{D}_w \cup \mathcal{D}_s$.
In practice, we furthermore divide the space of data into several regions and assign each region a separate GP trained on samples from that region. This leads to a better exploration of the data space and makes use of the inherent structure of data. The algorithm, called clustered GP, gave better results compared to a single GP. See Appendix A for the detailed description and the empirical observations that make the use of multiple GPs reasonable.
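The teacher step above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments (which relies on GPflow and a clustered GP): it assumes a zero prior mean, a squared-exponential kernel, a regression output (so the output transform and the aggregating function are both the identity), and hypothetical names such as `rbf_kernel` and `gp_posterior`. The inputs stand for student representations of the samples.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and b (assumed kernel)."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_posterior(z_strong, y_strong, z_query, noise=1e-2):
    """Posterior mean (soft labels) and per-point variance (uncertainty)
    of a zero-mean GP regressor fitted on the strong data only."""
    k_ss = rbf_kernel(z_strong, z_strong) + noise * np.eye(len(z_strong))
    k_qs = rbf_kernel(z_query, z_strong)
    chol = np.linalg.cholesky(k_ss)
    # alpha = (K + noise * I)^{-1} y, computed via the Cholesky factor
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_strong))
    mean = k_qs @ alpha
    v = np.linalg.solve(chol, k_qs.T)
    var = np.diag(rbf_kernel(z_query, z_query)) - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)

# Toy usage: three strong samples, one query on a strong sample, one far away.
z_strong = np.array([[0.0], [1.0], [2.0]])
y_strong = np.array([0.0, 1.0, 0.5])
z_query = np.array([[1.0], [10.0]])
soft_labels, uncertainty = gp_posterior(z_strong, y_strong, z_query)
```

In this sketch the query that coincides with a strong sample receives a low uncertainty, while the distant query falls back to the prior variance, which is the behaviour the fine-tuning step relies on.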
Step 3 Fine-tune the weights of the student network on the soft dataset, while modulating the magnitude of each parameter update by the corresponding teacher confidence in its label.
The student network of Step 1 is fine-tuned using samples from the soft dataset $\mathcal{D}_{sw} = \{(x_t, \bar{y}_t)\}$, where $\bar{y}_t$ is the soft label predicted by the teacher. The corresponding uncertainty $\Sigma(x_t)$ of each sample is mapped to a confidence value according to Equation 1 below, and this is then used to determine the step size for each iteration of stochastic gradient descent (SGD). So, intuitively, for data points where we have true labels, the uncertainty of the teacher is almost zero, which means we have high confidence and a large step size for updating the parameters. However, for data points where the teacher is not confident, we down-weight the training steps of the student. This means that at these points, we keep the student function as it was trained on the weak data in Step 1.
More specifically, we update the parameters $w$ of the student by training on $\mathcal{D}_{sw}$ using SGD:
$$w^{*} = \operatorname*{arg\,min}_{w \in \mathcal{W}} \; \frac{1}{N} \sum_{(x_t, \bar{y}_t) \in \mathcal{D}_{sw}} l(w, x_t, \bar{y}_t) + \mathcal{R}(w),$$
$$w_{t+1} = w_t - \eta_t \left( \nabla l(w, x_t, \bar{y}_t) + \nabla \mathcal{R}(w) \right),$$
where $l(\cdot)$ is the per-example loss, $\eta_t$ is the total learning rate, $N$ is the size of the soft dataset $\mathcal{D}_{sw}$, $w$ is the parameters of the student network, and $\mathcal{R}(\cdot)$ is the regularization term.
We define the total learning rate as $\eta_t = \eta_1(t)\,\eta_2(x_t)$, where $\eta_1(t)$ is the usual learning rate of our chosen optimization algorithm that anneals over training iterations, and $\eta_2(x_t)$ is a function of the label uncertainty $\Sigma(x_t)$ that is computed by the teacher for each data point. Multiplying these two terms gives us the total learning rate. In other words, $\eta_2$ represents the fidelity (quality) of the current sample, and is used to multiplicatively modulate $\eta_1$. Note that the first term does not necessarily depend on each data point, whereas the second term does. We propose
$$\eta_2(x_t) = \exp[-\beta\, \Sigma(x_t)] \qquad (1)$$
to exponentially decrease the learning rate for data point $x_t$ if its corresponding soft label $\bar{y}_t$ is unreliable (far from a true sample). In Equation 1, $\beta$ is a positive scalar hyper-parameter. Intuitively, a small $\beta$ results in a student which listens more carefully to the teacher and copies its knowledge, while a large $\beta$ makes the student pay less attention to the teacher, staying with its initial weak knowledge. More concretely speaking, as $\beta \to 0$ the student places more trust in the labels $\bar{y}_t$ estimated by the teacher and the student copies the knowledge of the teacher. On the other hand, as $\beta \to \infty$, the student puts less weight on the extrapolation ability of the GP and the parameters of the student are not affected by the correcting information from the teacher.
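The confidence-modulated update can be illustrated with a small sketch. It assumes the exponential fidelity form described for Equation 1 and, purely for illustration, a linear model with squared loss and an L2 regularizer; the names `fidelity` and `weighted_sgd_step` are hypothetical and do not come from the paper's code.

```python
import numpy as np

def fidelity(uncertainty, beta):
    """Confidence factor exp(-beta * uncertainty); equals 1 when beta is 0
    or when the teacher's uncertainty is 0, and decays toward 0 otherwise."""
    return np.exp(-beta * np.asarray(uncertainty, dtype=float))

def weighted_sgd_step(w, x, y_soft, uncertainty, base_lr, beta, l2=0.0):
    """One SGD step for a linear model with squared loss (an assumed stand-in
    for the student), with the step size scaled by the sample's fidelity."""
    grad_loss = 2.0 * (x @ w - y_soft) * x   # gradient of (x.w - y)^2
    grad_reg = 2.0 * l2 * w                  # gradient of l2 * ||w||^2
    step = base_lr * fidelity(uncertainty, beta)
    return w - step * (grad_loss + grad_reg)

# A confident sample moves the weights; a highly uncertain one barely does.
w0 = np.zeros(2)
x = np.array([1.0, 2.0])
w_confident = weighted_sgd_step(w0, x, 1.0, uncertainty=0.0, base_lr=0.1, beta=1.0)
w_uncertain = weighted_sgd_step(w0, x, 1.0, uncertainty=50.0, base_lr=0.1, beta=1.0)
```

Setting `beta=0` in this sketch makes every sample contribute with the full base learning rate, which corresponds to ignoring the teacher's confidence.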
3 Experiments
In this section, we apply FWL first to a toy problem and then to two different real tasks: document ranking and sentiment classification. The neural networks are implemented in TensorFlow (Abadi et al., 2015; Tang, 2016). GPflow (Matthews et al., 2017) is employed for developing the GP modules. For both tasks, we evaluate the performance of our method compared to the following baselines:
WA. The weak annotator, i.e. the unsupervised method used for annotating the unlabeled data.
NN$_{W}$. The student trained only on weak data.
NN$_{S}$. The student trained only on strong data.
NN$_{S^{+}/W}$. The student trained on samples that are alternately drawn from the weak dataset $\mathcal{D}_w$ without replacement, and from the strong dataset $\mathcal{D}_s$ with replacement. Since $|\mathcal{D}_s| \ll |\mathcal{D}_w|$, it oversamples the strong data.
NN$_{W \to S}$. The student trained on the weak dataset $\mathcal{D}_w$ and fine-tuned on the strong dataset $\mathcal{D}_s$.
NN$_{W^{\omega} \to S}$. The student trained on the weak data, but the step-size of each weak sample is weighted by a fixed value $\omega$, and fine-tuned on strong data. As an approximation for the optimal value for $\omega$, we have used the mean of the confidence $\eta_2$ of our model (below).
FWL$_{\mathrm{unsuprep}}$. The representation in the first step is trained in an unsupervised way (in the document ranking task, as the representation of documents and queries, we use weighted averaging over pretrained embeddings of their words based on their inverse document frequency (Dehghani et al., 2017d); in the sentiment analysis task, we use skip-thoughts vectors (Kiros et al., 2015)) and the student is trained on examples labeled by the teacher using the confidence scores.
FWL$\setminus\Sigma$. The student trained on the weakly labeled data and fine-tuned on examples labeled by the teacher without taking the confidence into account. This baseline is similar to (Veit et al., 2017).
FWL. Our FWL model, i.e. the student trained on the weakly labeled data and fine-tuned on examples labeled by the teacher using the confidence scores.
In the following, we introduce each task and the results produced for it. More details about the exact student network and teacher for each task are given in the appendix.
3.1 Toy Problem
We first apply FWL to a one-dimensional toy problem to illustrate the various steps. Consider a true function (red dotted line in Figure 1(a)) from which a small set of observations is provided (red points in Figure 1(b)). These observations might be noisy, in the same way that labels obtained from a human labeler could be noisy. A weak annotator function (magenta line in Figure 1(a)) is provided as an approximation to the true function.
The task is to obtain a good estimate of the true function given the set of strong observations and the weak annotator function. We can easily obtain a large set of observations from the weak annotator at almost no cost (magenta points in Figure 1(a)).
We consider two experiments:
A neural network trained on weak data and then fine-tuned on strong data from the true function, which is the most common semi-supervised approach (Figure 1(c)).
A teacher-student framework working by the proposed FWL approach.
As can be seen in Figure 1(d), FWL, by taking into account label confidence, gives a better approximation of the true hidden function. We repeated the above experiment 10 times. The average RMSE of the student with respect to the true function on a set of test points over those 10 experiments was as follows:

Student is trained on weak data (blue line in Figure 1(a)): ,
Student is trained on weak data then fine-tuned on true observations (blue line in Figure 1(c)): ,
Student is trained on weak data, then fine-tuned by soft labels and confidence information provided by the teacher (blue line in Figure 1(d)): (best).
3.2 Document Ranking
This task is the core information retrieval problem and is challenging as the ranking model needs to learn a representation for long documents and capture the notion of relevance between queries and documents. Furthermore, the size of publicly available datasets with query-document relevance judgments is unfortunately quite small ( queries). We employ a state-of-the-art pairwise neural ranker architecture as the student (Dehghani et al., 2017d). In this model, ranking is cast as a regression task. Given each training sample $x$ as a triple of query $q$ and two documents $d^{+}$ and $d^{-}$, the goal is to learn a function that maps each data sample $x$ to a scalar output value $y$ indicating the probability of $d^{+}$ being ranked higher than $d^{-}$ with respect to $q$.
The student follows the architecture proposed in (Dehghani et al., 2017d). The first layer of the network, i.e. the representation learning layer, maps each input sample to an $m$-dimensional real-valued vector. In general, besides learning embeddings for words, the representation function learns to compose word embeddings based on their global importance in order to generate query/document embeddings. The representation layer is followed by a simple fully-connected feed-forward network with a sigmoidal output unit to predict the probability of $d^{+}$ ranking higher than $d^{-}$. The general schema of the student is illustrated in Figure 3. More details are provided in Appendix B.1.
The teacher is implemented by the clustered GP algorithm. See Appendix C for more details.
The weak annotator is BM25 (Robertson & Zaragoza, 2009), a well-known unsupervised method for scoring query-document pairs based on statistics of the matched terms. More details are provided in Appendix D.1.
The data with weak labels and the data with true labels, as well as the setup of the document-ranking experiments, are described in more detail in Appendix E.2.
Results and Discussion
We conducted k-fold cross validation on the strong data $\mathcal{D}_s$ and report two standard evaluation metrics for ranking: mean average precision (MAP) of the top-ranked documents and normalized discounted cumulative gain calculated for the top 20 retrieved documents (nDCG@20). Table 1 shows the performance on both datasets. As can be seen, FWL provides a significant boost in performance on all datasets. In the ranking task, the student is designed in particular to be trained on weak annotations (Dehghani et al., 2017d), hence training the network only on weak supervision, i.e. NN$_{W}$, performs better than NN$_{S}$ (the student trained only on strong data). This can be due to the fact that ranking is a complex task requiring many training samples, while relatively few data with true labels are available.
Alternating between strong and weak data during training, i.e. NN$_{S^{+}/W}$, seems to bring a small (but statistically significant) improvement. However, we can gain better results by the typical fine-tuning strategy, NN$_{W \to S}$. Comparing the performance of FWL$_{\mathrm{unsuprep}}$ to FWL indicates, first of all, that learning the representation of the input data downstream of the main task leads to better results compared to a task-independent unsupervised or self-supervised way. Also, the dramatic drop in performance compared to FWL emphasizes the importance of pretraining the student on weakly labeled data. We can gain improvement by fine-tuning the weakly pretrained student (NN$_{W}$) using labels generated by the teacher without considering their confidence score, i.e. FWL$\setminus\Sigma$. This means we just augmented the fine-tuning process by generating a fine-tuning set using the teacher, which is better than $\mathcal{D}_s$ in terms of quantity and better than $\mathcal{D}_w$ in terms of quality. This baseline is equivalent to setting $\beta = 0$ in Equation 1. However, we see a big jump in performance when we use FWL to include the estimated label quality from the teacher, leading to the best overall results.
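For readers unfamiliar with the two ranking metrics, the following sketch shows one common way to compute them. It is an illustration under stated assumptions, not the evaluation script used for Table 1: average precision is normalized by the number of relevant documents present in the ranked list, nDCG uses the raw relevance grade as gain with a log2 discount, and the function names are hypothetical.

```python
import math

def average_precision(relevance):
    """AP of one ranked list of binary relevance labels (1 = relevant).
    Assumes every relevant document appears somewhere in the list."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: mean of the per-query average precision values."""
    return sum(average_precision(r) for r in runs) / len(runs)

def ndcg_at_k(gains, k=20):
    """nDCG@k with a log2 rank discount; the ideal ranking sorts the same gains."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

Other gain functions (for example an exponential gain) and other normalizations of average precision are also in use, so absolute numbers depend on the evaluation tool.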
3.3 Sentiment Classification
In sentiment classification, the goal is to predict the sentiment (e.g., positive, negative, or neutral) of a sentence. Each training sample consists of a sentence and its sentiment label.
The student for the sentiment classification task is a convolutional model which has been shown to perform best on the dataset we used (Deriu et al., 2017; Severyn & Moschitti, 2015a, b; Deriu et al., 2016). The first layer of the network learns the function that maps the input sentence to a dense vector as its representation. The inputs are first passed through an embedding layer mapping the sentence to a matrix, followed by a series of 1d convolutional layers with max-pooling. The representation layer is followed by feed-forward layers and a softmax output layer which returns the probability distribution over all three classes. Figure 4 presents the general schema of the architecture of the student. See Appendix B.2 for more details.
The teacher for this task is modeled by a GP. See Appendix C for more details.
The weak annotator is a simple unsupervised lexicon-based method (Hamdan et al., 2013; Kiritchenko et al., 2014), which estimates a distribution over sentiments for each sentence, based on sentiment labels of its terms. More details are provided in Appendix D.2.
The specification of the data with weak labels and the data with true labels, along with the detailed experimental setup, is given in Appendix E.3.
Results and Discussion
We report Macro-F1, the official SemEval metric, in Table 2. We see that the proposed FWL is the best performing approach.
For this task, since the amount of data with true labels is larger compared to the ranking task, the performance of NN$_{S}$ (the student trained only on strong data) is acceptable. Alternately sampling from weak and strong data gives better results. Pretraining on weak labels and then fine-tuning the network on true labels further improves the performance. Weighting the gradient updates from weak labels during pretraining and fine-tuning the network with true labels, i.e. NN$_{W^{\omega} \to S}$, seems to work quite well in this task. For this task, like the ranking task, learning the representation in an unsupervised task-independent fashion, i.e. FWL$_{\mathrm{unsuprep}}$, does not lead to good results compared to FWL. Similar to the ranking task, fine-tuning the student based on labels generated by the GP instead of data with true labels, regardless of the confidence score, works better than standard fine-tuning.
Besides the baselines, we also report the best performing systems, which are also convolution-based models (Rouvier & Favre 2016 on SemEval-14; Deriu et al. 2016 on SemEval-15). Using FWL and taking the confidence into consideration outperforms the best systems and leads to the highest reported results on both datasets.
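As a reminder of how a macro-averaged F1 score is formed, here is a small sketch. The text does not say which classes enter the average, so the hypothetical `labels` argument leaves that choice to the caller; this is an illustration rather than the official scorer.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores over the given labels."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["pos", "neg", "neu", "pos"]
y_pred = ["pos", "neg", "pos", "neu"]
all_classes = macro_f1(y_true, y_pred, labels=["pos", "neg", "neu"])
pos_neg_only = macro_f1(y_true, y_pred, labels=["pos", "neg"])
```

Because each class contributes equally, a rare class with poor recall lowers the score as much as a frequent one.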
4 Analysis
In this section, we provide further analysis of FWL by investigating the bias-variance trade-off and the learning rate.
4.1 Handling the Bias-Variance Trade-off
As mentioned in Section 2, $\beta$ is a hyper-parameter that controls the contribution of weak and strong data to the training procedure. In order to investigate its influence, we fixed everything in the model and ran the fine-tuning step with different values of $\beta$ in all the experiments.
Figure 5 illustrates the performance on the ranking (on Robust04 dataset) and sentiment classification tasks (on SemEval-14 dataset). For both sentiment classification and ranking, gives the best results (higher scores are better). We also experimented on the toy problem with different values of $\beta$ in three cases: 1) having 10 observations from the true function (same setup as Section 3.1), marked as “Toy Data” in the plot, 2) having only 5 observations from the true function, marked as “Toy Data *” in the plot, and 3) having as the weak function, which is an extremely bad approximator of the true function, marked as “Toy Data **” in the plot. For the “Toy Data” experiment, turned out to be optimal (here, lower scores are better). However, for “Toy Data *”, where we have an extremely small number of observations from the true function, setting $\beta$ to a higher value acts as a regularizer by relying more on weak signals, and eventually leads to better generalization. On the other hand, for “Toy Data **”, where the quality of the weak annotator is extremely low, lower values of $\beta$ put more focus on the true observations. Therefore, $\beta$ lets us control the bias-variance trade-off in these extreme cases.
4.2 A Good Teacher is Better Than Many Observations
We now look at the rate of learning for the student as the amount of training data is varied. We performed two types of experiments for all tasks: In the first experiment, we use all the available strong data but consider different percentages of the entire weak dataset. In the second experiment, we fix the amount of weak data and provide the model with varying amounts of strong data. We use standard fine-tuning with similar setups as for the baseline models. Details on the experiments for the toy problem are provided in Appendix E.1.
Figure 6 presents the results of these experiments. In general, for all tasks and both setups, the student learns faster when there is a teacher. One caveat is in the case where we have a very small amount of weak data. In this case the student cannot learn a suitable representation in the first step, and hence the performance of FWL is pretty low, as expected. It is highly unlikely that this situation occurs in reality as obtaining weakly labeled data is much easier than strong data.
The empirical observation of Figure 6 that our model learns more with less data can also be seen as evidence in support of another perspective to FWL, called learning using privileged information (Vapnik & Izmailov, 2015). We elaborate more on this connection in Appendix F.


4.3 Sensitivity of the FWL to the Quality of the Weak Annotator
Our proposed setup in FWL requires defining a so-called “weak annotator” to provide a source of weak supervision for unlabelled data. In Section 4.1 we discussed the role of the parameter $\beta$ for controlling the bias-variance trade-off by trying two weak annotators for the toy problem. Now, in this section, we study how the quality of the weak annotator may affect the performance of FWL, for the task of document ranking as a real-world problem.
To do so, besides BM25 (Robertson & Zaragoza, 2009), we use three other weak annotators: the vector space model (Salton & Yang, 1973) with binary term occurrence (BTO) weighting schema and the vector space model with TF-IDF weighting schema, which are both weaker than BM25, and BM25+RM3 (Abduljaleel et al., 2004), which uses RM3 as the pseudo-relevance feedback method on top of BM25, leading to better labels.
Figure 7 illustrates the performance of these four weak annotators in terms of their mean average precision (MAP) on the test data, versus the performance of FWL given the corresponding weak annotator. As expected, the performance of FWL depends on the quality of the employed weak annotator. The percentage of improvement of FWL over its corresponding weak annotator on the test data is also presented in Figure 7. As can be seen, the better the weak annotator performs, the smaller the improvement of FWL over it.
4.4 From Modifying the Learning Rate to Weighted Sampling
FWL provides a confidence score based on the certainty associated with each generated label $\bar{y}_t$, given sample $x_t$. We can interpret the confidence score as how likely including $(x_t, \bar{y}_t)$ in the training set for the student model is to improve the performance. Rather than using this score as the multiplicative factor in the learning rate, we can use it to bias the sampling procedure of mini-batches so that the frequency of training samples is proportional to the confidence score of their labels.
We design an experiment to try FWL with this setup (FWL$_{s}$), in which we keep the architectures of the student and the teacher and the procedure of the first two steps of FWL fixed, but change Step 3 as follows: Given the soft dataset $\mathcal{D}_{sw}$, consisting of each sample $x_t$, its label $\bar{y}_t$, and the associated confidence score generated by the teacher, we normalize the confidence scores over all training samples and set the normalized score of each sample as its probability of being sampled. Afterward, we train the student model on mini-batches sampled from this set with respect to the probabilities associated with each sample, but without considering the original confidence scores in parameter updating. This means the more confident the teacher is about the generated label for each sample, the more chance that sample has to be seen by the student model.
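The modified third step can be sketched as follows. This is a minimal illustration assuming NumPy's `Generator.choice` for sampling and drawing with replacement (the text does not specify this); the helper names are hypothetical.

```python
import numpy as np

def sampling_probabilities(confidence):
    """Normalize teacher confidence scores so that they sum to one."""
    conf = np.asarray(confidence, dtype=float)
    return conf / conf.sum()

def sample_minibatch(rng, probabilities, batch_size):
    """Draw sample indices with probability proportional to confidence.
    Sampling with replacement is an assumption of this sketch."""
    return rng.choice(len(probabilities), size=batch_size, replace=True, p=probabilities)

# Two confidently labeled samples and two uncertain ones.
rng = np.random.default_rng(0)
confidence = np.array([0.9, 0.9, 0.1, 0.1])
probs = sampling_probabilities(confidence)
batch = sample_minibatch(rng, probs, batch_size=1000)
```

The sampled indices then select the training examples for an ordinary, unweighted parameter update, so confident samples are simply seen more often.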
Figure 8 illustrates the performance of both FWL and its sampling-based variant FWL$_{s}$ trained on different amounts of data sampled from the soft dataset $\mathcal{D}_{sw}$, in the document ranking and sentiment classification tasks. As can be seen, compared to FWL, the performance of FWL$_{s}$ increases rapidly in the beginning but slows down afterward. We have looked into the sampling procedure and noticed that the confidence scores provided by the teacher form a rather skewed distribution and there is a strong bias in FWL$_{s}$ toward sampling from data points that are either in the strong dataset $\mathcal{D}_s$ or close to the points in $\mathcal{D}_s$, as the GP has less uncertainty around these points and the confidence scores are high. We observed that the performance of FWL$_{s}$ gets closer to the performance of FWL after many epochs, while FWL had already a log convergence. The skewness of the confidence distribution gives FWL$_{s}$ a tendency toward more exploitation than exploration, whereas FWL has more chance to explore the input space while it controls the effect of updates on the parameters for samples based on their merit.
5 Related Work
In this section, we position our FWL approach relative to related work.
Learning from imperfect labels has been thoroughly studied in the literature (Frénay & Verleysen, 2014). The imperfect (weak) signal can come from non-expert crowd workers, be the output of other models that are weaker (for instance with low accuracy or coverage), biased, or models trained on data from different related domains. Among these forms, in the distant supervision setup, a heuristic labeling rule (Deriu et al., 2016; Severyn & Moschitti, 2015b) or function (Dehghani et al., 2017d), which can rely on a knowledge base (Mintz et al., 2009; Min et al., 2013; Han & Sun, 2016), is employed to devise noisy labels.
Learning from weak data sometimes aims at encoding various forms of domain expertise or cheaper supervision from lay annotators. For instance, in structured learning, the label space is quite complex and obtaining a training set with strong labels is extremely expensive, hence this class of problems leads to a wide range of works on learning from weak labels (Roth, 2017). Indirect supervision is considered a form of learning from weak labels that is employed in particular in structured learning, in which a companion binary task is defined for which obtaining training data is easier (Chang et al., 2010; Raghunathan et al., 2016). In response-based supervision, the model receives feedback from interacting with an environment in a task, and converts this feedback into a supervision signal to update its parameters (Roth, 2017; Clarke et al., 2010; Riezler et al., 2014). Constraint-based supervision is another form of weak supervision in which constraints that are represented as weak label distributions are taken as signals for updating the model parameters. Examples include physics-based constraints on the output (Stewart & Ermon, 2017) or output constraints on execution of logical forms (Clarke et al., 2010).
In the proposed FWL model, we can employ these approaches as the weak annotator to provide imperfect labels for the unlabeled data. However, a small amount of data with strong labels is also needed, which puts our model in the class of semi-supervised models. In the semi-supervised setup, some ideas were developed to utilize weakly or even unlabeled data. For instance, the ideas of self(incremental)-training (Rosenberg et al., 2005), pseudo-labeling (Lee, 2013; Hinton et al., 2014), and Co-training (Blum & Mitchell, 1998) were introduced for augmenting the training set by unlabeled data with predicted labels. Some research used the idea of self-supervised (or unsupervised) feature learning (Noroozi & Favaro, 2016; Dosovitskiy et al., 2016; Donahue et al., 2017) to exploit different labelings that are freely available besides or within the data, and to use them as intrinsic signals to learn general-purpose features. These features, which are learned using a proxy task, are then used in a supervised task like object classification/detection or description matching.
As a common approach in semi-supervised learning, the unlabeled set can be used for learning the distribution of the data. In particular for neural networks, greedy layer-wise pretraining of weights using unlabeled data is followed by supervised fine-tuning (Hinton et al., 2006; Deriu et al., 2017; Severyn & Moschitti, 2015b, a; Go et al., 2009). Other methods learn unsupervised encoding at multiple levels of the architecture jointly with a supervised signal (Ororbia II et al., 2015; Weston et al., 2012).
Alternatively, some noise cleansing methods have been proposed to remove or correct mislabeled samples (Brodley & Friedl, 1999). There are some studies showing that weak or noisy labels can be leveraged by modifying the loss function (Reed et al., 2015; Patrini et al., 2017, 2016; Vahdat, 2017) or changing the update rule to avoid imperfections of the noisy data (Malach & Shalev-Shwartz, 2017; Dehghani et al., 2017b, c).
One direction of research focuses on modeling the pattern of the noise or weakness in the labels. For instance, some methods use a generative model to correct weak labels such that a discriminative model can be trained more effectively (Ratner et al., 2016; Rekatsinas et al., 2017; Varma et al., 2017). Furthermore, some methods aim at capturing the pattern of the noise by inserting an extra layer (Goldberger & Ben-Reuven, 2017) or a separate module that tries to infer better labels from noisy ones and uses them to supervise the training of the network (Sukhbaatar et al., 2015; Veit et al., 2017; Dehghani et al., 2017b). Our proposed FWL can be categorized in this class, as the teacher tries to infer better labels and provides certainty information which is incorporated into the update rule for the student model.
6 Conclusion
Training neural networks using large amounts of weakly annotated data is an attractive approach in scenarios where an adequate amount of data with true labels is not available, a situation which often arises in practice. In this paper, we introduced fidelity-weighted learning (FWL), a new student-teacher framework for semi-supervised learning in the presence of weakly labeled data. We applied FWL to document ranking and sentiment classification, and empirically verified that FWL speeds up the training process and improves over state-of-the-art semi-supervised alternatives.
References
 Abadi et al. (2015) Martín Abadi et al. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
 Abduljaleel et al. (2004) Nasreen Abduljaleel, James Allan, W. Bruce Croft, O Diaz, Leah Larkey, Xiaoyan Li, Mark D. Smucker, and Courtney Wade. Umass at trec 2004: Novelty and hard. In TREC13, 2004.
 Baccianella et al. (2010) Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pp. 2200–2204, 2010.

 Blum & Mitchell (1998) Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT’ 98, pp. 92–100, 1998.
 Brodley & Friedl (1999) Carla E. Brodley and Mark A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999.
 Chang et al. (2010) Ming-Wei Chang, Vivek Srikumar, Dan Goldwasser, and Dan Roth. Structured output learning with indirect supervision. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 199–206, 2010.
 Chapelle et al. (2006) Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. SemiSupervised Learning. The MIT Press, 1st edition, 2006.
 Clarke et al. (2010) James Clarke, Dan Goldwasser, MingWei Chang, and Dan Roth. Driving semantic parsing from the world’s response. In Proceedings of the fourteenth conference on computational natural language learning, pp. 18–27, 2010.
 Cormack et al. (2011) Gordon V. Cormack, Mark D. Smucker, and Charles L. Clarke. Efficient and effective spam filtering and reranking for large web datasets. Inf. Retr., 14(5):441–465, 2011.
 Dehghani et al. (2017a) Mostafa Dehghani, Sascha Rothe, Enrique Alfonseca, and Pascal Fleury. Learning to attend, copy, and generate for sessionbased query suggestion. In Proceedings of The international Conference on Information and Knowledge Management (CIKM’17), 2017a.
 Dehghani et al. (2017b) Mostafa Dehghani, Aliaksei Severyn, Sascha Rothe, and Jaap Kamps. Learning to learn from weak supervision by full supervision. In NIPS2017 workshop on MetaLearning (MetaLearn 2017), 2017b.
 Dehghani et al. (2017c) Mostafa Dehghani, Aliaksei Severyn, Sascha Rothe, and Jaap Kamps. Avoiding your teacher’s mistakes: Training neural networks with controlled weak supervision. arXiv preprint arXiv:1711.00313, 2017c.
 Dehghani et al. (2017d) Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. Neural ranking models with weak supervision. In SIGIR’17, 2017d.

 Deriu et al. (2016) Jan Deriu, Maurice Gonzenbach, Fatih Uzdilli, Aurelien Lucchi, Valeria De Luca, and Martin Jaggi. SwissCheese at SemEval-2016 task 4: Sentiment classification using an ensemble of convolutional neural networks with distant supervision. Proceedings of SemEval, pp. 1124–1128, 2016.
 Deriu et al. (2017) Jan Deriu, Aurelien Lucchi, Valeria De Luca, Aliaksei Severyn, Simon Müller, Mark Cieliebak, Thomas Hofmann, and Martin Jaggi. Leveraging large amounts of weakly supervised data for multi-language sentiment classification. In Proceedings of the 26th International World Wide Web Conference (WWW’17), pp. 1045–1052, 2017.
 Desautels et al. (2014) Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing explorationexploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research, 15(1):3873–3923, 2014.
 Donahue et al. (2017) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR2017, 2017.
 Dosovitskiy et al. (2016) Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence, 38(9):1734–1747, 2016.
 Frénay & Verleysen (2014) Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845–869, 2014.
 Go et al. (2009) Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12), 2009.
 Goldberger & BenReuven (2017) Jacob Goldberger and Ehud BenReuven. Training deep neuralnetworks using a noise adaptation layer. In ICLR2017, 2017.
 Hamdan et al. (2013) Hussam Hamdan, Frederic Béchet, and Patrice Bellot. Experiments with dbpedia, wordnet and sentiwordnet as resources for sentiment analysis in microblogging. In Second Joint Conference on Lexical and Computational Semantics (* SEM), volume 2, pp. 455–459, 2013.
 Han & Sun (2016) Xianpei Han and Le Sun. Global distant supervision for relation extraction. In AAAI’16, pp. 2950–2956, 2016.
 Hensman et al. (2015) James Hensman, Alexander G. de G. Matthews, and Zoubin Ghahramani. Scalable variational gaussian process classification. In Proceedings of AISTATS, 2015.
 Hinton et al. (2014) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS 2014 Deep Learning Workshop, 2014. arXiv preprint arXiv:1503.02531.
 Hinton et al. (2006) Geoffrey E. Hinton, Simon Osindero, and YeeWhye Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, 2006.
 Kingma & Ba (2015) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. arXiv preprint arXiv:1412.6980.
 Kiritchenko et al. (2014) Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50:723–762, 2014.
 Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skipthought vectors. In Advances in neural information processing systems, pp. 3294–3302, 2015.
 Lee (2013) DongHyun Lee. Pseudolabel: The simple and efficient semisupervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, pp. 2, 2013.
 LopezPaz et al. (2016) David LopezPaz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information. In ICLR’16, 2016. arXiv preprint arXiv:1511.03643.
 Malach & ShalevShwartz (2017) Eran Malach and Shai ShalevShwartz. Decoupling “when to update” from “how to update”. In NIPS2017, 2017.
 Matthews et al. (2017) Alexander G. de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo LeónVillagrá, Zoubin Ghahramani, and James Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In NIPS ’13, pp. 3111–3119, 2013.
 Min et al. (2013) Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. Distant supervision for relation extraction with an incomplete knowledge base. In HLTNAACL, pp. 777–782, 2013.
 Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In ACL, pp. 1003–1011, 2009.
 Nair & Hinton (2010) Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML10), pp. 807–814, 2010.
 Nakov et al. (2016) Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. Semeval2016 task 4: Sentiment analysis in twitter. Proceedings of SemEval, pp. 1–18, 2016.

 Noroozi & Favaro (2016) Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Springer, 2016.
 Ororbia II et al. (2015) Alexander G. Ororbia II, C. Lee Giles, and David Reitter. Learning a deep hybrid model for semi-supervised text classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
 Papernot et al. (2017) Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semisupervised knowledge transfer for deep learning from private training data. In ICLR, 2017. arXiv preprint arXiv:1610.05755.
 Pass et al. (2006) Greg Pass, Abdur Chowdhury, and Cayley Torgeson. A picture of search. In InfoScale ’06, 2006.
 Patrini et al. (2016) Giorgio Patrini, Frank Nielsen, Richard Nock, and Marcello Carioni. Loss factorization, weakly supervised learning and label noise robustness. In International Conference on Machine Learning, pp. 708–717, 2016.
 Patrini et al. (2017) Giorgio Patrini, Alessandro Rozza, Aditya Menon, Richard Nock, and Lizhen Qu. Making neural networks robust to label noise: a loss correction approach. In CVPR, 2017. arXiv preprint arXiv:1609.03683.

 Raghunathan et al. (2016) Aditi Raghunathan, Roy Frostig, John Duchi, and Percy Liang. Estimation from indirect supervision with linear moments. In International Conference on Machine Learning, pp. 2568–2577, 2016.
 Ratner et al. (2016) Alexander J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pp. 3567–3575, 2016.
 Reed et al. (2015) Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In ICLR2015Workshop, 2015.
 Rekatsinas et al. (2017) Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. Holoclean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, 2017.
 Riezler et al. (2014) Stefan Riezler, Patrick Simianer, and Carolin Haas. Responsebased learning for grounded machine translation. In ACL (1), pp. 881–891, 2014.
 Robertson & Zaragoza (2009) Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
 Rosenberg et al. (2005) Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semisupervised selftraining of object detection models. In Seventh IEEE Workshop on Applications of Computer Vision, 2005.
 Rosenthal et al. (2015) Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif M. Mohammad, Alan Ritter, and Veselin Stoyanov. Semeval2015 task 10: Sentiment analysis in twitter. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pp. 451–463, 2015.
 Roth (2017) Dan Roth. Incidental supervision: Moving beyond supervised learning. In AAAI, pp. 4885–4890, 2017.
 Rouvier & Favre (2016) Mickael Rouvier and Benoit Favre. Senseilif at semeval2016 task 4: Polarity embedding fusion for robust sentiment analysis. Proceedings of SemEval, pp. 202–208, 2016.
 Salton & Yang (1973) Gerard Salton and ChungShu Yang. On the specification of term values in automatic indexing. Journal of documentation, 29(4):351–372, 1973.
 Severyn & Moschitti (2015a) Aliaksei Severyn and Alessandro Moschitti. Twitter sentiment analysis with deep convolutional neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 959–962. ACM, 2015a.
 Severyn & Moschitti (2015b) Aliaksei Severyn and Alessandro Moschitti. Unitn: Training deep convolutional neural network for twitter sentiment classification. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Association for Computational Linguistics, Denver, Colorado, pp. 464–469, 2015b.
 Shen et al. (2006) Yirong Shen, Matthias Seeger, and Andrew Y. Ng. Fast gaussian process regression using kdtrees. In Advances in neural information processing systems, pp. 1225–1232, 2006.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014.
 Stewart & Ermon (2017) Russell Stewart and Stefano Ermon. Labelfree supervision of neural networks with physics and domain knowledge. In AAAI, pp. 2576–2582, 2017.
 Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. In Workshop contribution at ICLR 2015, 2015.
 Tang (2016) Yuan Tang. Tf.learn: Tensorflow’s highlevel module for distributed machine learning. arXiv preprint arXiv:1612.04251, 2016.
 Titsias (2009) Michalis K. Titsias. Variational learning of inducing variables in sparse gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 567–574, 2009.
 Vahdat (2017) Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In NIPS ’17, 2017.
 Vapnik & Izmailov (2015) Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. Journal of Machine Learning Research, 16:2023–2049, 2015.
 Vapnik & Vashist (2009) Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural networks, 22(5):544–557, 2009.
 Vapnik (1998) Vladimir N. Vapnik. Statistical Learning Theory. WileyInterscience, 1998.
 Varma et al. (2017) Paroma Varma, Bryan He, Dan Iter, Peng Xu, Rose Yu, Christopher De Sa, and Christopher Ré. Socratic learning: Correcting misspecified generative models using discriminative models. arXiv preprint arXiv:1610.08123, 2017.

 Veit et al. (2017) Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In The Conference on Computer Vision and Pattern Recognition, 2017.
 Wainwright et al. (2012) Martin J. Wainwright, Michael I. Jordan, and John C. Duchi. Privacy aware learning. In Advances in Neural Information Processing Systems, pp. 1430–1438, 2012.
 Weston et al. (2012) Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semisupervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655. Springer, 2012.

 Wilson & Nickisch (2015) Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, Volume 37, ICML’15, pp. 1775–1784, 2015.
Appendices
We moved additional details to the appendices in order to keep the main text focused on the overall idea of the FidelityWeighted Learning approach. Specifically, we include further details on the clustered Gaussian process approach (Appendix A); on the student network architectures (Appendix B); on the teacher Gaussian process model (Appendix C); on the weak annotators (Appendix D); on the experimental data and setup (Appendix E); and on the connection to “learning with privileged information” (Appendix F).
Appendix A Detailed description of clustered GP
We suggest using several GPs to explore the entire data space more effectively. Even though inducing points and stochastic methods make GPs more scalable, we still observed poor performance when the entire dataset was modeled by a single GP. The reason for using multiple GPs is therefore mainly empirical, inspired by Shen et al. (2006), and is explained in the following.
We used the sparse Gaussian process implemented in GPflow. The algorithm is scalable in the sense that it avoids the $\mathcal{O}(N^3)$ cost of the original GP. It introduces inducing points in the data space and defines a variational lower bound for the marginal likelihood. The variational bound can then be optimized by stochastic methods, which makes the algorithm applicable to large datasets. However, the tightness of the bound depends on the locations of the inducing points, which are found through the optimization process.
We empirically observed that a single GP does not give satisfactory accuracy on a left-out test dataset. We hypothesized that this is due to the inability of the algorithm to find good inducing points when their number is restricted to just a few.
We then increased the number of inducing points, which trades off the scalability of the algorithm because its cost grows with the number of inducing points. Moreover, apart from scalability, which is partly addressed by stochastic methods, we argue that the structure of the entire space may not be explored well by a single GP and its inducing points.
We conjecture that this is because our datasets are distributed in a highly sparse way within the high-dimensional embedding space.
We also tried to remedy the problem by using PCA to reduce the input dimensions and give a denser representation, but it did not result in a considerable improvement. The results are presented in Table 3.
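| Method | Document Ranking: Robust04 MAP | Document Ranking: Robust04 nDCG@20 | Document Ranking: ClueWeb MAP | Document Ranking: ClueWeb nDCG@20 | Sentiment Classification: Robust04 F1 | Sentiment Classification: ClueWeb F1 |
|---|---|---|---|---|---|---|
| FWL | 0.2614 | 0.4192 | 0.1205 | 0.2121 | 0.6904 | 0.6173 |
| FWL | 0.2864 | 0.4411 | 0.1331 | 0.2388 | 0.7022 | 0.6340 |
| FWL | 0.3124 | 0.4607 | 0.1472 | 0.2453 | 0.7470 | 0.6830 |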
One may argue that the clustered GP makes better use of the data structure, roughly in the spirit of KISS-GP (Wilson & Nickisch, 2015).
In inducing-point methods, it is normally assumed that $M \ll N$ ($M$ is the number of inducing points and $N$ is the number of training samples) to save computation and storage. However, our intuition is that too few inducing points make the model unable to explore the inherent structure of the data. By employing several GPs, we were able to use a large number of inducing points, even when this assumption no longer holds for the total number of inducing points $M$, which seemingly better exploits the structure of the datasets. Because our work was not aimed to be a close investigation of GPs, we considered the clustered GP as the engineering side of the work: a tool that gives us a measure of confidence. Other tools, such as a single GP with inducing points that form a Kronecker or Toeplitz covariance matrix, are also conceivable. We therefore do not claim to have proposed a new method of inference for GPs.
Here is a practical description of the clustered GP algorithm:
Clustered GP: Let $N$ be the size of the dataset on which we train the teacher. Assume we allocate $K$ teachers to the entire data space, so that each GP sees a dataset of size $n = N/K$. We then use a simple clustering method (e.g. k-means) to find the centroids of $K$ clusters $C_1, \dots, C_K$, where $C_i$ consists of samples $\{x_{i,1}, \dots, x_{i,n}\}$. We take the centroid $c_i$ of cluster $C_i$ as the representative sample for all its content. Note that $c_i$ does not necessarily belong to $C_i$. We assign each cluster a GP trained on the samples belonging to that cluster. More precisely, cluster $C_i$ is assigned a GP whose data points are $\{x_{i,1}, \dots, x_{i,n}\}$. Because there is no dependency among different clusters, we train them in parallel to speed up the procedure further.
The pseudocode of the clustered GP is presented in Algorithm 2. When the main issue is computational resources (when the number of inducing points for each GP is large), we can first choose the number $n$, the maximum size of the dataset on which our resources allow us to train a GP, and then find the number of clusters $K$ accordingly. The rest of the algorithm remains unchanged.
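The cluster-then-fit procedure described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation used in the experiments: it replaces the sparse variational GP from GPflow with exact GP regression under a fixed RBF kernel, and it routes each test point to the GP of its nearest centroid, which is an assumption about how the per-cluster predictions are combined. The names `kmeans`, `ExactGP` and `ClusteredGP` are hypothetical.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, cluster index per sample)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        assign = sq_dists(X, centroids).argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, sq_dists(X, centroids).argmin(axis=1)

class ExactGP:
    """Exact GP regression with a unit-variance RBF kernel (stand-in for a sparse GP)."""
    def __init__(self, X, y, length_scale=1.0, noise=1e-2):
        self.X, self.ls = X, length_scale
        K = np.exp(-0.5 * sq_dists(X, X) / self.ls ** 2) + noise * np.eye(len(X))
        self.L = np.linalg.cholesky(K)
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, y))

    def predict(self, Xs):
        Ks = np.exp(-0.5 * sq_dists(Xs, self.X) / self.ls ** 2)
        mean = Ks @ self.alpha
        v = np.linalg.solve(self.L, Ks.T)
        var = 1.0 - (v ** 2).sum(axis=0)  # prior variance is 1 for this kernel
        return mean, np.maximum(var, 0.0)

class ClusteredGP:
    """One GP per k-means cluster; queries go to the GP of the nearest centroid."""
    def __init__(self, X, y, n_clusters):
        self.centroids, assign = kmeans(X, n_clusters)
        self.gps = {j: ExactGP(X[assign == j], y[assign == j])
                    for j in range(n_clusters) if np.any(assign == j)}

    def predict(self, Xs):
        valid = np.array(sorted(self.gps))  # clusters that actually own a GP
        nearest = valid[sq_dists(Xs, self.centroids[valid]).argmin(axis=1)]
        mean, var = np.empty(len(Xs)), np.empty(len(Xs))
        for j in valid:
            mask = nearest == j
            if mask.any():
                mean[mask], var[mask] = self.gps[j].predict(Xs[mask])
        return mean, var

# Toy usage: noisy sine data, four clusters.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=200)
teacher = ClusteredGP(X, y, n_clusters=4)
mean, var = teacher.predict(np.array([[0.5], [2.0]]))
```

Because the per-cluster models share no state, the dictionary comprehension in `ClusteredGP.__init__` is the part that could be distributed across workers.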
Appendix B Detailed Architecture of the Students
B.1 Ranking Task
For the ranking task, we employ the student proposed by Dehghani et al. (2017d). The first layer of the network models the function $\psi$ that learns the representation of the input data samples, i.e. $(q, d^+, d^-)$, and consists of three components: (1) an embedding function $\varepsilon: \mathcal{V} \rightarrow \mathbb{R}^m$ (where $\mathcal{V}$ denotes the vocabulary set and $m$ is the number of embedding dimensions), (2) a weighting function $\omega: \mathcal{V} \rightarrow \mathbb{R}$, and (3) a compositionality function $\odot: (\mathbb{R}^m, \mathbb{R})^n \rightarrow \mathbb{R}^m$. More formally, the function $\psi$ is defined as:
$$\psi(q, d^+, d^-) = \left[ \odot_{i=1}^{|q|} \big(\varepsilon(t_i^q), \omega(t_i^q)\big) \;\|\; \odot_{i=1}^{|d^+|} \big(\varepsilon(t_i^{d^+}), \omega(t_i^{d^+})\big) \;\|\; \odot_{i=1}^{|d^-|} \big(\varepsilon(t_i^{d^-}), \omega(t_i^{d^-})\big) \right] \qquad (2)$$
where $t_i^q$ and $t_i^d$ denote the $i$-th term in query $q$ and document $d$, respectively. The embedding function $\varepsilon$ maps each term to a dense $m$-dimensional real-valued vector, which is learned during the training phase. The weighting function $\omega$ assigns a weight to each term in the vocabulary. It has been shown that $\omega$ simulates the effect of inverse document frequency (IDF), which is an important feature in information retrieval (Dehghani et al., 2017d).
The compositionality function $\odot$ projects a set of $n$ embedding-weighting pairs to an $m$-dimensional representation, independent of the value of $n$:
$$\odot_{i=1}^{n} \big(\varepsilon(t_i), \omega(t_i)\big) = \frac{\sum_{i=1}^{n} \exp(\omega(t_i)) \cdot \varepsilon(t_i)}{\sum_{j=1}^{n} \exp(\omega(t_j))} \qquad (3)$$
which is in fact the normalized weighted element-wise summation of the terms’ embedding vectors. Again, it has been shown that having a global term weighting function along with the embedding function improves ranking performance, as it simulates the effect of inverse document frequency (IDF). In our experiments, we initialize the embedding function with word2vec embeddings (Mikolov et al., 2013) pre-trained on Google News and the weighting function with IDF.
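As an illustration of the representation layer, the sketch below composes term embeddings with learned term weights and concatenates the results for the query and the two documents. It assumes that "normalized" means a softmax over the term weights, which the text does not spell out, and it uses integer term ids indexing a dense embedding matrix. The names `compose` and `represent` are hypothetical.

```python
import numpy as np

def compose(embeddings, weights):
    """Normalized weighted sum of term embeddings.

    embeddings: array of shape (n_terms, m)
    weights:    array of shape (n_terms,), unnormalized term weights
    Returns a vector of shape (m,), whatever the number of terms.
    Assumption: the normalization is a softmax over the weights.
    """
    w = np.exp(weights - weights.max())  # subtract max for numerical stability
    w = w / w.sum()
    return w @ embeddings

def represent(query, doc_pos, doc_neg, E, W):
    """Concatenate the composed representations of a (query, d+, d-) triple.

    query, doc_pos, doc_neg: lists of integer term ids
    E: embedding matrix of shape (vocab_size, m)
    W: term weight vector of shape (vocab_size,)
    """
    parts = [compose(E[terms], W[terms]) for terms in (query, doc_pos, doc_neg)]
    return np.concatenate(parts)
```

With all weights equal, `compose` reduces to a plain average of the embeddings, so the learned weights only matter when they differ across terms, as IDF-like weights would.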
The representation layer is followed by a simple fully connected feed-forward network with $l$ hidden layers, which receives the vector representation of the inputs processed by the representation learning layer and outputs a prediction $\hat{y}$. Each hidden layer in this network computes $h_i = \alpha(W_i h_{i-1} + b_i)$, where $W_i$ and $b_i$ denote the weight matrix and the bias term corresponding to the $i$-th hidden layer and $\alpha$ is the nonlinearity. These layers are followed by a sigmoid output. We employ the cross entropy loss:
$$\mathcal{L} = \sum_{i \in B} \big[ -y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i) \big] \qquad (4)$$
where $B$ is a batch of data samples.
B.2 Sentiment Classification Task
The student for the sentiment classification task is a convolutional model which has been shown to perform best on the dataset we used (Deriu et al., 2017; Severyn & Moschitti, 2015a, b; Deriu et al., 2016). The first layer of the network learns the function $\psi$, which maps an input sentence $s$ to a vector as its representation, and consists of an embedding function $\varepsilon: \mathcal{V} \rightarrow \mathbb{R}^m$, where $\mathcal{V}$ denotes the vocabulary set and $m$ is the number of embedding dimensions.
This function maps the sentence to a matrix $S \in \mathbb{R}^{m \times |s|}$, where each column represents the embedding of a word at the corresponding position in the sentence. Matrix $S$ is passed through a convolution layer. In this layer, a set of $f$ filters is applied to a sliding window of length $h$ over $S$ to generate a feature map matrix $O$. Each feature map $o_i$ for a given filter $F$ is generated by $o_i = \sum_{k,j} S[i:i+h]_{k,j} F_{k,j}$, where $S[i:i+h]$ denotes the concatenation of word vectors from position $i$ to $i+h$. The concatenation of all $o_i$ produces a feature vector $o \in \mathbb{R}^{|s|-h+1}$. The vectors $o$ are then aggregated over all $f$ filters into a feature map matrix $O \in \mathbb{R}^{f \times (|s|-h+1)}$.
We also add a bias vector $b \in \mathbb{R}^f$ to the result of a convolution. Each convolutional layer is followed by a nonlinear activation function (we use ReLU (Nair & Hinton, 2010)) which is applied element-wise. Afterward, the output is passed to the max pooling layer, which operates on columns of the feature map matrix $O$, returning the largest value: $\mathrm{pool}(o_i): \mathbb{R}^{1 \times (|s|-h+1)} \rightarrow \mathbb{R}$ (see Figure 4). This architecture is similar to the state-of-the-art model for Twitter sentiment classification from SemEval 2015 and 2016 (Severyn & Moschitti, 2015b; Deriu et al., 2016).
We initialize the embedding matrix with word2vec embeddings (Mikolov et al., 2013) pre-trained on a collection of 50M tweets.
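The convolution, bias, ReLU and max-pooling steps can be written out directly. The sketch below is an explicit-loop NumPy version for a single sentence, intended only to make the shapes concrete; the array layout (one embedding per column, filters of shape filters x embedding size x window) and the function name `conv_max_pool` are assumptions made for illustration, and the sentence is assumed to have at least as many words as the window length.

```python
import numpy as np

def conv_max_pool(S, filters, bias):
    """Convolve filters over a sentence matrix, apply ReLU, max-pool over positions.

    S:       sentence matrix of shape (m, n_words), one embedding per column
    filters: array of shape (f, m, h), f filters spanning h consecutive words
    bias:    array of shape (f,)
    Returns a vector of shape (f,): one pooled feature per filter.
    """
    f, m, h = filters.shape
    n_pos = S.shape[1] - h + 1                 # number of valid window positions
    O = np.empty((f, n_pos))
    for i in range(n_pos):
        window = S[:, i:i + h]                 # (m, h) slice of the sentence
        O[:, i] = (filters * window).sum(axis=(1, 2))
    O = np.maximum(O + bias[:, None], 0.0)     # add bias, then element-wise ReLU
    return O.max(axis=1)                       # largest response per filter
```

The pooled output has a fixed size regardless of sentence length, which is what allows the feed-forward layers that follow to use a fixed input width.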
The representation layer is then followed by a feed-forward layer similar to the ranking task (with different width and depth), but with softmax instead of sigmoid as the output layer, which returns $\hat{y}_i$, the probability distribution over all three classes. We employ the cross entropy loss:
$$\mathcal{L} = \sum_{i \in B} \sum_{k \in K} -y_i^k \log(\hat{y}_i^k) \qquad (5)$$
where $B$ is a batch of data samples, and $K$ is a set of classes.
Appendix C Detailed Architecture of the Teachers
We use a Gaussian process as the teacher in all the experiments. For each task, either regression or (multi-class) classification, in order to generate soft labels we pass the mean of the GP through the same function that is applied to the output of the student network for that task, e.g. softmax or sigmoid. For binary classification or one-dimensional regression, the GP variance is a scalar and the aggregation function is the identity. For multi-class classification or multi-dimensional regression tasks, the aggregation function takes the variances over several dimensions and outputs a single measure of variance. As a reasonable choice, the aggregation function in our sentiment classification task (three classes) is the mean of the variances over the dimensions.
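For the multi-class case, the two teacher outputs described here can be sketched as below: the GP mean is passed through a softmax to obtain soft labels, and the per-class variances are averaged into one uncertainty value per sample. The input arrays stand for the posterior mean and variance returned by whatever GP library is used; the function name `teacher_outputs` is hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_outputs(gp_mean, gp_var):
    """Turn a multi-output GP posterior into soft labels and a scalar uncertainty.

    gp_mean, gp_var: arrays of shape (n_samples, n_classes)
    Returns (soft_labels, uncertainty) with shapes
    (n_samples, n_classes) and (n_samples,).
    """
    soft_labels = softmax(gp_mean)      # same output function as the student
    uncertainty = gp_var.mean(axis=-1)  # mean of variances over the classes
    return soft_labels, uncertainty
```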
In the teacher, linear combinations of different kernels are used for different tasks in our experiments.
Toy Problem: We use standard Gaussian process regression (http://gpflow.readthedocs.io/en/latest/notebooks/regression.html) with this kernel:
(6) 
Document Ranking: We use sparse variational GP regression (http://gpflow.readthedocs.io/en/latest/notebooks/SGPR_notes.html) (Titsias, 2009) with this kernel:
(7) 
Sentiment Classification: We use sparse variational GP for multi-class classification (http://gpflow.readthedocs.io/en/latest/notebooks/multiclass.html) (Hensman et al., 2015) with the following kernel:
(8) 
where,
We empirically found a satisfactory value for the length scale of the RBF and Matern3/2 kernels. We also set the constant offset of the linear kernel to zero to obtain a homogeneous linear kernel. The constant value of the noise term determines the level of noise in the labels. This is different from the noise in weak labels: the term accounts for the fact that even true labels might carry a trace of noise due to the inaccuracy of human labelers.
We set the number of clusters in the clustered algorithm for the ranking task to and for the sentiment classification task to .
Appendix D Weak Annotators
D.1 Document Ranking
The weak annotator in the document ranking task is BM25 (Robertson & Zaragoza, 2009), a well-known unsupervised retrieval method. This method heuristically scores a given query-document pair based on the statistics of their matched terms. In the pairwise document ranking setup, the weak label $\tilde{y}$ for a given sample $(q, d^+, d^-)$ is the probability of document $d^+$ being ranked higher than $d^-$: $\tilde{y} = s_{q,d^+} / (s_{q,d^+} + s_{q,d^-})$, where $s_{q,d}$ is the score obtained from the weak annotator.
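A pairwise weak label of this kind can be computed from two retrieval scores. The sketch below assumes that the probability is obtained by normalizing one score by the sum of both, and that scores are non-negative; both points are assumptions for illustration, as is the fallback to 0.5 when neither document matches the query. The function name `pairwise_weak_label` is hypothetical.

```python
def pairwise_weak_label(score_a, score_b):
    """Probability that document A should be ranked above document B.

    score_a, score_b: non-negative retrieval scores (e.g. from BM25)
    for the same query. Normalizing by the sum is an assumption.
    """
    total = score_a + score_b
    if total <= 0:
        return 0.5  # no evidence either way; this fallback is an assumption
    return score_a / total
```

For example, scores of 3.0 and 1.0 give a soft label of 0.75 rather than a hard "A beats B" label, so the size of the score gap is preserved in the training signal.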
D.2 Sentiment Classification
The weak annotator for the sentiment classification task is a simple lexicon-based method (Hamdan et al., 2013; Kiritchenko et al., 2014). We use SentiWordNet 3.0 (Baccianella et al., 2010) to assign probabilities (positive, negative and neutral) to each token of the input. We use a bag-of-words model for the sentence-level probabilities (i.e. just averaging the distributions of the terms), yielding a noisy label $\tilde{y} \in \mathbb{R}^{|K|}$, where $|K|$ is the number of classes. We found empirically that using soft labels from the weak annotator works better than assigning a single hard label.
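The bag-of-words averaging can be illustrated with a toy lexicon. The sketch below does not call SentiWordNet: the two lexicon entries carry made-up probabilities, and treating out-of-lexicon tokens as fully neutral is an assumption. The names `TOY_LEXICON`, `NEUTRAL` and `weak_sentiment_label` are hypothetical.

```python
import numpy as np

# Made-up (positive, negative, neutral) probabilities for illustration only.
TOY_LEXICON = {
    "good": (0.75, 0.0, 0.25),
    "bad": (0.0, 0.625, 0.375),
}
NEUTRAL = (0.0, 0.0, 1.0)  # assumption: unknown tokens count as neutral

def weak_sentiment_label(tokens, lexicon=TOY_LEXICON):
    """Average the per-token class distributions into one soft label."""
    if not tokens:
        return np.array(NEUTRAL)
    dists = np.array([lexicon.get(t, NEUTRAL) for t in tokens], dtype=float)
    return dists.mean(axis=0)
```

Since every per-token row sums to one, the averaged label is itself a distribution over the three classes and can be used directly as a soft target.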
Appendix E Data Collection, Parameters and Setup
E.1 Toy Problem
Weak/True Data In all the experiments with the toy problem, we have randomly sampled 100 data points from the weak function and 10 data points from the true function. We introduce a small amount of noise to the observation of the true function to model the noise in the human labeled data.
Setup
The neural network employed in the toy problem experiments is a simple feed-forward network with a depth of 3 layers and a width of 128 neurons per layer. We have used as the nonlinearity for the intermediate layers and a linear output layer. As the optimizer, we used Adam (Kingma & Ba, 2015) and the initial learning rate has been set to . For the teacher in the toy problem, we fit only one GP on all the data points (i.e. no clustering). Also during fine-tuning, we set .
Setup of experiments in Section 4.2 We fixed everything in the model and tried running the fine-tuning step with different values for in all the experiments. For the experiments on the toy problem in Section 4.2, the reported numbers are averaged over 10 trials. In the first experiment (i.e. Figure 5(a)), the size of the sampled data is: and (fixed) and for the second one (i.e. Figure 5(a)): and (fixed).
E.2 Ranking Task
Collections We use two standard TREC collections for the task of ad-hoc retrieval. The first collection (Robust04) consists of 500k news articles from different news agencies and is a homogeneous collection. The second collection (ClueWeb) is ClueWeb09 Category B, a large-scale web collection with over 50 million English documents, which is considered a heterogeneous collection. Spam documents were filtered out using the Waterloo spam scorer (http://plg.uwaterloo.ca/~gvcormac/clueweb09spam/) (Cormack et al., 2011) with the default threshold.
Data with true labels We take query sets that contain human-labeled judgments: a set of 250 queries (TREC topics 301–450 and 601–700) for the Robust04 collection and a set of 200 queries (topics 1–200) for the experiments on the ClueWeb collection. For each query, we take all documents judged as relevant plus the same number of documents judged as non-relevant and form pairwise combinations among them.
Data with weak labels We create a query set using the unique queries appearing in the AOL query logs (Pass et al., 2006). This query set contains web queries initiated by real users in the AOL search engine that were sampled from a three-month period from March 2006 to May 2006. We applied standard preprocessing (Dehghani et al., 2017d, a) to the queries: we filtered out a large volume of navigational queries containing URL substrings (“http”, “www.”, “.com”, “.net”, “.org”, “.edu”), and we removed all non-alphanumeric characters from the queries. For each dataset, we took queries that have at least ten hits in the target corpus using our weak annotator method. After applying all these steps, we collected 6.15 million queries to train on for Robust04 and 6.87 million queries for ClueWeb. To prepare the weakly labeled training set , we take the top retrieved documents using BM25 for each query from training query set , which in total leads to training samples.
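The query filtering steps can be expressed compactly. The sketch below drops navigational queries that contain any of the listed URL substrings, strips non-alphanumeric characters, and keeps unique queries. Lowercasing, keeping whitespace as the token separator and collapsing repeated spaces are assumptions not stated in the text, and the "at least ten hits" filter is omitted because it needs the retrieval index. The function name `preprocess_queries` is hypothetical.

```python
import re

URL_MARKERS = ("http", "www.", ".com", ".net", ".org", ".edu")

def preprocess_queries(raw_queries):
    """Filter navigational queries, strip punctuation, and deduplicate.

    Returns the cleaned unique queries in first-seen order.
    """
    seen, cleaned = set(), []
    for query in raw_queries:
        query = query.lower()  # assumption: case-insensitive matching
        if any(marker in query for marker in URL_MARKERS):
            continue           # navigational query, drop it
        # Replace anything that is not a letter, digit or whitespace.
        query = re.sub(r"[^a-z0-9\s]", " ", query)
        query = " ".join(query.split())  # collapse repeated whitespace
        if query and query not in seen:
            seen.add(query)
            cleaned.append(query)
    return cleaned
```

Filtering on URL substrings before stripping punctuation matters: once the dots are removed, ".com" can no longer be detected.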
Setup For the evaluation of the whole model, we conducted a 3fold crossvalidation. However, for each dataset, we first tuned all the hyperparameters of the student in the first step on the set with true labels using batched GP bandits with an expected improvement acquisition function (Desautels et al., 2014) and kept the optimal parameters of the student fixed for all the other experiments. The size and number of hidden layers for the student is selected from . The initial learning rate and the dropout parameter were selected from and , respectively. We considered embedding sizes of . The batch size in our experiments was set to . We use ReLU (Nair & Hinton, 2010) as a nonlinear activation function in student. We use the Adam optimizer (Kingma & Ba, 2015) for training, and dropout (Srivastava et al., 2014) as a regularization technique.
At inference time, for each query, we take the top retrieved documents using BM25 as candidate documents and rerank them using the trained models. We use the Indri (https://www.lemurproject.org/indri.php) implementation of BM25 with default parameters (i.e., , , and ).
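The two-stage inference described above (first-stage BM25 candidates, then re-ranking by the trained model) can be sketched as below. This is an illustrative outline only: `rerank` and `score_fn` are hypothetical names, and the scoring function stands in for whatever trained ranker is used; the candidate list is assumed to come from the BM25 run.

```python
def rerank(query, candidates, score_fn):
    """Re-order first-stage candidates by a learned relevance score.

    candidates: list of doc ids returned by the first-stage retriever.
    score_fn:   callable (query, doc_id) -> float, higher is better
                (assumed interface for the trained model).
    """
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    # Python's sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```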
E.3 Sentiment Classification Task
Collections We test our model on the Twitter message-level sentiment classification task of SemEval-15 Task 10B (Rosenthal et al., 2015). The SemEval-15 datasets subsume the test sets from previous editions of SemEval, i.e. SemEval-13 and SemEval-14. Each tweet was preprocessed so that URLs and usernames are masked.
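The masking step mentioned above could look like the following sketch. The regular expressions and the placeholder tokens `<url>` and `<user>` are assumptions for illustration; the text does not state which patterns or tokens were used.

```python
import re

# Assumed patterns: URLs start with http(s):// or www.; usernames are
# an @ followed by word characters.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")


def mask_tweet(text, url_token="<url>", user_token="<user>"):
    """Replace URLs and @usernames with placeholder tokens."""
    # Mask URLs first so an '@' inside a URL is not treated as a username.
    text = URL_RE.sub(url_token, text)
    return USER_RE.sub(user_token, text)
```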
Data with true labels We use train (9,728 tweets) and development (1,654 tweets) data from SemEval-13 for training and SemEval-13 test (3,813 tweets) for validation. To make our results comparable to the official runs on SemEval, we use SemEval-14 (1,853 tweets) and SemEval-15 (2,390 tweets) as test sets (Rosenthal et al., 2015; Nakov et al., 2016).
Data with weak labels We use a large corpus containing 50M tweets collected during two months both for training the word embeddings and for creating the weakly annotated set using the lexicon-based method explained in Section 3.3.
Setup Similar to the document ranking task, we tuned hyperparameters for the student in the first step with respect to the true labels of the validation set using batched GP bandits with an expected improvement acquisition function (Desautels et al., 2014) and kept the optimal parameters fixed for all the other experiments. The size and number of hidden layers for the classifier and is selected from . We tested the model with both, and convolutional layers. The number of convolutional feature maps and the filter width is selected from and , respectively. The initial learning rate and the dropout parameter were selected from and , respectively. We considered embedding sizes of and the batch size in these experiments was set to . ReLU (Nair & Hinton, 2010) is used as the non-linear activation function in the student. The Adam optimizer (Kingma & Ba, 2015) is used for training, and dropout (Srivastava et al., 2014) as a regularizer.
Appendix F Connection with Vapnik’s learning using privileged information
In this section, we highlight the connections of our work with Vapnik’s learning using privileged information (LUPI) (Vapnik & Vashist, 2009; Vapnik & Izmailov, 2015). FWL makes use of information from a small set of correctly labeled data to improve the performance of a semi-supervised learning algorithm. The main idea behind LUPI comes from the fact that humans learn much faster than machines. This may be due to the role that an Intelligent Teacher plays in human learning. In this framework, the training data is a collection of triplets
$$\{(x_1, x_1^{*}, y_1), \ldots, (x_n, x_n^{*}, y_n)\} \sim P^{n}(x, x^{*}, y), \tag{9}$$
where each $(x_i, y_i)$ is a feature-label pair and $x_i^{*}$ is the additional information provided by an intelligent teacher to ease the learning process for the student. The additional information $x_i^{*}$ for each $(x_i, y_i)$ is available only during training time, and the learning machine must rely only on $x_i$ at test time. The theory of LUPI studies how to leverage such a teaching signal to outperform learning algorithms utilizing only the normal features $x_i$. For example, MRI brain images can be augmented with high-level medical or even psychological descriptions of Alzheimer’s disease to build a classifier that predicts the probability of Alzheimer’s disease from an MRI image at test time. It is known from statistical learning theory (Vapnik, 1998) that the following bound for test error is satisfied with probability $1-\delta$:
$$R(f) \le R_n(f) + O\left(\left(\frac{|\mathcal{F}|_{VC} - \log\delta}{n}\right)^{\alpha}\right), \tag{10}$$
where $R_n(f)$ denotes the training error over $n$ samples, $|\mathcal{F}|_{VC}$ is the VC dimension of the space of functions $\mathcal{F}$ from which $f$ is chosen, and $\alpha \in [\frac{1}{2}, 1]$. When the classes are not separable, $\alpha = \frac{1}{2}$, i.e. the machine learns at a slow rate of $O(n^{-1/2})$. For easier problems where classes are separable, $\alpha = 1$, resulting in a learning rate of $O(n^{-1})$. The difference between these two cases is severe. The same error bound achieved for a separable problem with 10 thousand data points is only obtainable for a non-separable problem when 100 million data points are provided. This is prohibitive even when obtaining large datasets is not so costly. The theory of LUPI shows that an intelligent teacher can reduce resulting in a faster learning process for the student.

In this paper, we proposed a teacher-student framework for semi-supervised learning. Similar to LUPI, in FWL a student is supposed to solve the main prediction task while an intelligent teacher provides additional information to improve its learning. In addition, we first train the student network so that it obtains initial knowledge of weakly labeled data and learns a good data representation. Then the teacher is trained on truly labeled data, benefiting from the representation learned by the student. This extends LUPI in that the teacher provides the privileged information that is most useful for the current state of the student’s knowledge. FWL also extends LUPI by introducing several teachers, each of which is specialized to correct the student’s knowledge related to a specific region of the data space.
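The sample-size comparison above can be checked with a short calculation, assuming the standard rates of $O(n^{-1})$ for the separable case and $O(n^{-1/2})$ for the non-separable case and ignoring constants. Equating the two rates for a separable problem with $n_{\text{sep}} = 10^{4}$ samples gives the number of samples $n_{\text{nonsep}}$ needed in the non-separable case (the subscripted names are introduced here for illustration only):

```latex
\[
n_{\text{sep}}^{-1} = n_{\text{nonsep}}^{-1/2}
\;\Longrightarrow\;
n_{\text{nonsep}} = n_{\text{sep}}^{2}
= \left(10^{4}\right)^{2} = 10^{8},
\]
```

that is, 100 million data points, matching the figure quoted in the text.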
Figure 6(a) provides evidence for the assumption that privileged information in our task can accelerate the learning process of the student. It shows how the privileged information from an intelligent teacher affects the exponent of the error bound in Equation 10. Figure 6(b) shows the test error for various numbers of samples with true labels. As expected, in both extremes where is too small or too large, the performance of our model becomes close to that of the models without a teacher. In the latter extreme, the reason is that the student has enough strong samples to learn a good model of the true function. In more realistic cases where but is still large enough to be informative about , our model gives a lower test error than models without the intelligent teacher.
The theory of LUPI was first developed and proved for support vector machines by Vapnik as a method for knowledge transfer. Hinton introduced dark knowledge as a spiritually close idea in the context of neural networks (Hinton et al., 2006). He proposed to use a large network or an ensemble of networks for training and a smaller network at test time. It turned out that compressing the knowledge of a large system into a smaller system can improve the generalization ability. Lopez-Paz et al. (2016) showed that dark knowledge and LUPI can be unified under a single umbrella, called generalized distillation. The core idea of these models is machines-teaching-machines. As the name suggests, a machine learns the knowledge embedded in another machine. In our case, the student corrects its knowledge by receiving privileged information about label uncertainty from the teacher.

Our framework extends the core idea of LUPI in the following directions:


Trainable teacher: It is often assumed that the teacher in the LUPI framework has some additional true information. We show that when this extra information is not available, one can still use the LUPI setup and define an implicit teacher whose knowledge is learned from the true data. In this approach, the performance of the final student-teacher system depends on a clever answer to the following question: which information should be considered as the privileged knowledge of the teacher?

Bayesian teacher: The proposed teacher is Bayesian. It provides posterior uncertainty of the label of each sample.

Mutual representation: We introduced a module which learns a mutual embedding (representation) for both student and teacher. This is particularly interesting because it defines a two-way channel between teacher and student.

Multiple teachers: We proposed a scalable method to introduce several teachers such that each teacher is specialized in a particular region of the data space.