The success of deep neural networks to date depends strongly on the availability of labeled data, which is costly and not always easy to obtain. Usually it is much easier to obtain small quantities of high-quality labeled data and large quantities of unlabeled data. The problem of how to best integrate these two different sources of information during training is an active pursuit in the field of semi-supervised learning (Chapelle et al., 2006). However, for a large class of tasks it is also easy to define one or more so-called “weak annotators”, additional (albeit noisy) sources of weak supervision based on heuristics or “weaker”, biased classifiers trained on e.g. non-expert crowd-sourced data or data from different domains that are related. While such labels are easy and cheap to generate, it is not immediately clear if and how these additional weakly-labeled data can be used to train a stronger classifier for the task we care about. More generally, in almost all practical applications machine learning systems have to deal with data samples of variable quality. For example, in a large dataset of images only a small fraction of samples may be labeled by experts and the rest may be crowd-sourced using e.g. Amazon Mechanical Turk.
Assuming we can obtain a large set of weakly-labeled data in addition to a much smaller training set of “strong” labels, the simplest approach is to expand the training set by including the weakly-supervised samples (all samples are equal). Alternatively, one may pretrain on the weak data and then fine-tune on observations from the true function or distribution (which we call strong data). Indeed, it has recently been shown that a small amount of expert-labeled data can be augmented in such a way by a large set of raw data, with labels coming from a heuristic function, to train a more accurate neural ranking model (Dehghani et al., 2017d, c, b). The downside is that such approaches are oblivious to the amount or source of noise in the labels. Simply speaking, they do not consider the cause of noise in the labels and only focus on the effect.
In this paper, we argue that treating weakly-labeled samples uniformly (i.e. each weak sample contributes equally to the final classifier) ignores potentially valuable information about label quality. Instead, we propose Fidelity-Weighted Learning (FWL), a Bayesian semi-supervised approach that leverages a small amount of data with true labels to generate a larger training set with confidence-weighted weakly-labeled samples, which can then be used to modulate the fine-tuning process based on the fidelity (or quality) of each weak sample. By directly modeling the inaccuracies introduced by the weak annotator in this way, we can control the extent to which we make use of this additional source of weak supervision: more for confidently-labeled weak samples close to the true observed data, and less for uncertain samples further away from the observed data.
We propose a setting consisting of two main modules. One is called the student and is in charge of learning a suitable data representation and performing the main prediction task; the other is the teacher, which modulates the learning process by modeling the inaccuracies in the labels. We explain our approach in much more detail in Section 2, but at a high level it works as follows (see Figure 1): we pretrain the student network on weak data to learn an initial task-dependent data representation, which we pass to the teacher along with the strong data. The teacher then learns to predict the strong data, but crucially, based on the student’s learned representation. This then allows the teacher to generate new labeled data from unlabeled data, and in the process correct the student’s mistakes, leading to a better data representation and a better final predictor.
2. Fidelity-Weighted Learning (FWL)
In this section, we describe our proposed FWL approach for semi-supervised learning when we have access to weak supervision (e.g. heuristics or weak annotators). We assume we are given a large set of unlabeled data samples, a heuristic labeling function called the weak annotator, and a small set of high-quality samples labeled by experts, called the strong dataset, consisting of tuples of training samples $x_j$ and their true labels $y_j$, i.e. $\mathcal{D}_s = \{(x_j, y_j)\}$. We consider the latter to be observations from the true target function that we are trying to learn. We use the weak annotator to generate labels for the unlabeled samples. Generated labels are noisy due to the limited accuracy of the weak annotator. This gives us the weak dataset consisting of tuples of training samples $x_i$ and their weak labels $\tilde{y}_i$, i.e. $\mathcal{D}_w = \{(x_i, \tilde{y}_i)\}$. Note that we can generate a large amount of weak training data $\mathcal{D}_w$ at almost no cost using the weak annotator. In contrast, we have only a limited amount of observations from the true function, i.e. $|\mathcal{D}_s| \ll |\mathcal{D}_w|$.
Our proposed setup comprises a neural network called the student and a Bayesian function approximator called the teacher. The training process consists of three phases which we summarize in Algorithm 1 and Figure 1.
Step 1 Pre-train the student on the weak dataset $\mathcal{D}_w$ using weak labels generated by the weak annotator.
The main goal of this step is to learn a task-dependent representation of the data as well as to pretrain the student. The student function is a neural network consisting of two parts. The first part $\psi(\cdot)$ learns the data representation and the second part $\phi(\cdot)$ performs the prediction task (e.g. classification). Therefore the overall function is $\hat{y} = \phi(\psi(x_i))$. The student is trained on all samples of the weak dataset $\mathcal{D}_w = \{(x_i, \tilde{y}_i)\}$. For brevity, in the following, we will refer to both a data sample and its representation by $x$ when it is obvious from the context. From the self-supervised feature learning point of view, we can say that representation learning in this step is solving a surrogate task of approximating the expert knowledge, for which a noisy supervision signal is provided by the weak annotator.
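Step 2 Train the teacher on the strong data $(x_j, y_j) \in \mathcal{D}_s$ represented in terms of the student representation $\psi(\cdot)$ and then use the teacher to generate a soft dataset $\mathcal{D}_{sw}$ consisting of $\langle$sample, predicted label, confidence$\rangle$ for all data samples.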
We use a Gaussian process (GP) as the teacher to capture the label uncertainty in terms of the student representation, estimated w.r.t. the strong data. A prior mean and co-variance function is chosen for the GP. The learned embedding function $\psi(\cdot)$ in Step 1 is then used to map the data samples to dense vectors as input to the GP. We use the representation learned by the student in the previous step to compensate for the lack of data in the strong dataset $\mathcal{D}_s$, so that the teacher can benefit from the knowledge learned from the large quantity of weakly annotated data. This way, we also let the teacher see the data through the lens of the student.
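The GP is trained on the samples from the strong dataset $\mathcal{D}_s$ to learn the posterior mean $m_{\text{post}}$ (used to generate soft labels) and posterior co-variance $K_{\text{post}}(\cdot, \cdot)$ (which represents label uncertainty). We then create the soft dataset $\mathcal{D}_{sw} = \{(x_t, \bar{y}_t)\}$ using the posterior GP, input samples $x_t$ from $\mathcal{D}_w \cup \mathcal{D}_s$, and predicted labels $\bar{y}_t$ with their associated uncertainties as computed by $T(x_t)$ and $\Sigma(x_t)$:
$$T(x_t) = g\big(m_{\text{post}}(x_t)\big), \qquad \Sigma(x_t) = h\big(K_{\text{post}}(x_t, x_t)\big)$$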
The generated labels are called soft labels. Therefore, we refer to $\mathcal{D}_{sw}$ as a soft dataset. The function $g(\cdot)$ transforms the output of the GP to the suitable output space. For example, in classification tasks, $g(\cdot)$ would be the softmax function to produce probabilities that sum up to one. For multidimensional-output tasks where a vector of variances is provided by the GP, the vector $K_{\text{post}}(x_t, x_t)$ is passed through an aggregating function $h(\cdot)$ to generate a scalar value for the uncertainty of each sample. Note that we train the GP only on the strong dataset $\mathcal{D}_s$ but then use it to generate soft labels $\bar{y}_t$ and uncertainty $\Sigma(x_t)$ for samples belonging to $\mathcal{D}_{sw} = \mathcal{D}_w \cup \mathcal{D}_s$.
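As a rough illustration of the teacher in Step 2, the following sketch computes the posterior mean and variance of a zero-mean GP at a query point, given a handful of strong samples. The squared-exponential kernel, the noise level, the helper names (`rbf`, `solve`, `gp_posterior`) and the toy data are all assumptions made for this example; they are not the kernel or implementation used in our experiments.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two vectors (an assumed choice)."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq / length_scale ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior(X_strong, y_strong, x_query, noise=1e-6):
    """Posterior mean and variance at x_query under a zero-mean GP prior."""
    n = len(X_strong)
    K = [[rbf(X_strong[i], X_strong[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_query) for x in X_strong]
    alpha = solve(K, y_strong)   # K^{-1} y
    v = solve(K, k_star)         # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_query, x_query) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, max(var, 0.0)

# Toy strong set in a 1-D "student representation" space (made-up numbers).
X_s = [[0.0], [1.0], [2.0]]
y_s = [0.0, 1.0, 0.0]
near = gp_posterior(X_s, y_s, [1.0])    # at a strong sample: small variance
far = gp_posterior(X_s, y_s, [10.0])    # far from strong data: variance close to the prior
```

The point of the sketch is the behaviour of the variance: it is small next to observed strong samples and grows back to the prior variance away from them, which is the signal the student later uses as label uncertainty.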
In practice, we furthermore divide the space of data into several regions and assign each region a separate GP trained on samples from that region. This leads to a better exploration of the data space and makes use of the inherent structure of data. This algorithm, called clustered GP, gave better results compared to a single GP.
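The region-wise training described above can be sketched as follows. This is only a schematic under assumptions: the region centres are taken as given (how they are obtained is not specified here), regions are assigned by nearest centre, and `fit` is a hypothetical caller-supplied routine that trains one teacher on the samples of one region.

```python
def nearest_region(x, centres):
    """Index of the centre closest to x in squared Euclidean distance."""
    return min(range(len(centres)),
               key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, centres[c])))

def fit_regional_teachers(X, y, centres, fit):
    """Fit one teacher per region; fit(X_r, y_r) is a caller-supplied routine."""
    groups = {}
    for xi, yi in zip(X, y):
        X_r, y_r = groups.setdefault(nearest_region(xi, centres), ([], []))
        X_r.append(xi)
        y_r.append(yi)
    return {region: fit(X_r, y_r) for region, (X_r, y_r) in groups.items()}

# Toy usage with a stand-in "teacher" that just stores the mean label.
teachers = fit_regional_teachers(
    X=[[0.1], [0.2], [9.9]], y=[1.0, 3.0, 5.0],
    centres=[[0.0], [10.0]],
    fit=lambda X_r, y_r: sum(y_r) / len(y_r),
)
```

At prediction time, a query sample would be routed with `nearest_region` to the teacher of its own region.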
Step 3 Fine-tune the weights of the student network on the soft dataset, while modulating the magnitude of each parameter update by the corresponding teacher-confidence in its label.
The student network of Step 1 is fine-tuned using samples from the soft dataset $\mathcal{D}_{sw} = \{(x_t, \bar{y}_t)\}$, where $\bar{y}_t = T(x_t)$ is the soft label produced by the teacher. The corresponding uncertainty $\Sigma(x_t)$ of each sample is mapped to a confidence value according to Equation 1 below, and this is then used to determine the step size for each iteration of stochastic gradient descent (SGD). So, intuitively, for data points where we have true labels, the uncertainty of the teacher is almost zero, which means we have high confidence and a large step size for updating the parameters. However, for data points where the teacher is not confident, we down-weight the training steps of the student. This means that at these points, we keep the student function as it was trained on the weak data in Step 1.
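More specifically, we update the parameters $w$ of the student by training on the soft dataset $\mathcal{D}_{sw}$ using SGD:
$$w^{*} = \arg\min_{w \in \mathcal{W}} \frac{1}{N} \sum_{(x_t, \bar{y}_t) \in \mathcal{D}_{sw}} \ell(w, x_t, \bar{y}_t) + \mathcal{R}(w),$$
$$w_{t+1} = w_t - \eta_t \big( \nabla \ell(w_t, x_t, \bar{y}_t) + \nabla \mathcal{R}(w_t) \big),$$
where $\ell(\cdot)$ is the per-example loss, $\eta_t$ is the total learning rate, $N$ is the size of the soft dataset $\mathcal{D}_{sw}$, $w$ is the parameters of the student network, and $\mathcal{R}(\cdot)$ is the regularization term.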
We define the total learning rate as $\eta_t = \eta_1(t)\,\eta_2(x_t)$, where $\eta_1(t)$ is the usual learning rate of our chosen optimization algorithm that anneals over training iterations, and $\eta_2(x_t)$ is a function of the label uncertainty $\Sigma(x_t)$ that is computed by the teacher for each data point. Multiplying these two terms gives us the total learning rate. In other words, $\eta_2$ represents the fidelity (quality) of the current sample, and is used to multiplicatively modulate $\eta_1$. Note that the first term does not necessarily depend on each data point, whereas the second term does. We propose
$$\eta_2(x_t) = \exp\big[-\beta\, \Sigma(x_t)\big] \tag{1}$$
to exponentially decrease the learning rate for data point $x_t$ if its corresponding soft label $\bar{y}_t$ is unreliable (far from a true sample). In Equation 1, $\beta$ is a positive scalar hyper-parameter. Intuitively, a small $\beta$ results in a student which listens more carefully to the teacher and copies its knowledge, while a large $\beta$ makes the student pay less attention to the teacher, staying with its initial weak knowledge. More concretely, as $\beta \to 0$ the student places more trust in the labels estimated by the teacher and copies the knowledge of the teacher. On the other hand, as $\beta \to \infty$, the student puts less weight on the extrapolation ability of the GP and the parameters of the student are not affected by the correcting information from the teacher.
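To make the modulated update concrete, here is a minimal sketch of a single fidelity-weighted SGD step. The linear student with squared loss and L2 regularization is a toy stand-in chosen for brevity, and the function names and numeric values are illustrative assumptions; only the exponential mapping from teacher uncertainty to a step-size multiplier follows the description above.

```python
import math

def fidelity(uncertainty, beta):
    """Map teacher uncertainty to a step-size multiplier exp(-beta * uncertainty)."""
    return math.exp(-beta * uncertainty)

def fwl_sgd_step(w, x, soft_label, uncertainty, base_lr, beta, l2=0.0):
    """One SGD step on a linear student with squared loss (toy stand-in)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    err = pred - soft_label
    lr = base_lr * fidelity(uncertainty, beta)  # total rate = base rate * fidelity
    # Gradient of 0.5 * err^2 + 0.5 * l2 * ||w||^2 with respect to w.
    return [wi - lr * (err * xi + l2 * wi) for wi, xi in zip(w, x)]

# Same sample and soft label, but different teacher uncertainty (made-up values).
confident = fwl_sgd_step([0.0], [1.0], 1.0, uncertainty=0.0, base_lr=0.1, beta=1.0)
uncertain = fwl_sgd_step([0.0], [1.0], 1.0, uncertainty=10.0, base_lr=0.1, beta=1.0)
```

With zero uncertainty the student takes the full base step toward the soft label, whereas with high uncertainty the parameters barely move, so the student keeps what it learned from the weak data.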
3. Experiments
In this section, we apply FWL to document ranking. We evaluate the performance of our method compared to the following baselines:
WA. The weak annotator, i.e. the unsupervised method used for annotating the unlabeled data.
$\text{NN}_W$. The student trained only on weak data.
$\text{NN}_S$. The student trained only on strong data.
$\text{NN}_{S^+/W}$. The student trained on samples that are alternately drawn from the weak dataset $\mathcal{D}_w$ without replacement, and the strong dataset $\mathcal{D}_s$ with replacement. Since $|\mathcal{D}_s| \ll |\mathcal{D}_w|$, it oversamples the strong data.
$\text{NN}_{W \to S}$. The student trained on the weak dataset $\mathcal{D}_w$ and fine-tuned on the strong dataset $\mathcal{D}_s$.
FWL. Our FWL model, i.e. the student trained on the weakly labeled data and fine-tuned on examples labeled by the teacher using the confidence scores.
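The alternating baseline in the list above can be sketched as a simple sampling scheme. The generator name, the strict one-to-one alternation and the toy data are assumptions for illustration; the source only states that weak samples are drawn without replacement and strong samples with replacement.

```python
import random

def alternating_samples(weak, strong, seed=0):
    """Yield (source, sample) pairs, alternating weak and strong draws.

    Weak samples are drawn without replacement (each is used once per pass);
    strong samples are drawn with replacement, so they are oversampled.
    """
    rng = random.Random(seed)
    order = list(weak)
    rng.shuffle(order)
    for sample in order:
        yield ("weak", sample)
        yield ("strong", rng.choice(strong))

# Toy usage: 6 weak samples and 2 strong samples (made-up identifiers).
stream = list(alternating_samples(weak=list(range(6)), strong=["a", "b"]))
```

Because the strong set is much smaller than the weak set, each strong sample is revisited many times within one pass over the weak data.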
3.1. Document Ranking
This task is the core information retrieval problem and is challenging as the ranking model needs to learn a representation for long documents and capture the notion of relevance between queries and documents. Furthermore, the size of publicly available datasets with query-document relevance judgments is unfortunately quite small ( queries). We employ a pairwise neural ranker architecture as the student (Dehghani et al., 2017d). In this model, ranking is cast as a regression task. Given each training sample $x$ as a triple of query $q$, and two documents $d^+$ and $d^-$, the goal is to learn a function $\mathcal{F}: \{\langle q, d^+, d^- \rangle\} \to \mathbb{R}$, which maps each data sample $x$ to a scalar output value $y$ indicating the probability of $d^+$ being ranked higher than $d^-$ with respect to $q$.
The student follows the architecture proposed in (Dehghani et al., 2017d). The first layer of the network, i.e. the representation learning layer $\psi$, maps each input sample to an $m$-dimensional real-valued vector. In general, besides learning embeddings for words, the function $\psi$ learns to compose word embeddings based on their global importance in order to generate query/document embeddings. The representation layer is followed by a simple fully-connected feed-forward network with a sigmoidal output unit to predict the probability of $d^+$ ranking higher than $d^-$. The general schema of the student is illustrated in Figure 2.
The teacher is implemented by the clustered GP algorithm.
The weak annotator is BM25 (Robertson and Zaragoza, 2009), a well-known unsupervised method for scoring query-document pairs based on statistics of the matched terms.
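For readers unfamiliar with BM25, the following sketch shows one common variant of the scoring function. The smoothed IDF form, the parameter values `k1=1.2` and `b=0.75`, the function name and the toy documents are assumptions for this example and are not necessarily the configuration used to produce our weak labels.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

# Toy collection (made-up documents).
docs = [
    ["neural", "ranking", "model"],
    ["weak", "supervision", "for", "ranking"],
    ["gaussian", "process"],
]
scores = bm25_scores(["neural", "ranking"], docs)
```

Scores of this kind require no relevance judgments, which is why such a method can annotate arbitrarily many query-document pairs at almost no cost.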
Results and Discussions We conducted 3-fold cross validation on $\mathcal{D}_s$ (the strong data) with an 80/20 training/validation split, and report two standard evaluation metrics for ranking: mean average precision (MAP) of the top-ranked documents and normalized discounted cumulative gain calculated for the top 20 retrieved documents (nDCG@20). In all of the experiments, the experimental setup and preprocessing are similar to (Dehghani et al., 2017d, a). Table 1 shows the performance on both datasets. As can be seen, FWL provides a significant boost in performance over all datasets. In the ranking task, the student is designed in particular to be trained on weak annotations (Dehghani et al., 2017d), hence training the network only on weak supervision, i.e. $\text{NN}_W$ (the student trained only on weak data), performs better than $\text{NN}_S$ (the student trained only on strong data). This can be due to the fact that ranking is a complex task requiring many training samples, while relatively few data with true labels are available.
Alternating between strong and weak data during training, i.e. $\text{NN}_{S^+/W}$, seems to bring little (but statistically significant) improvement. However, we can gain better results with the typical fine-tuning strategy, $\text{NN}_{W \to S}$.
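The two reported metrics can be computed as in the sketch below. It assumes binary relevance for average precision and linear gains with a log2 discount for nDCG; other conventions exist (for example exponential gains), and the function names and toy inputs are illustrative only.

```python
import math

def average_precision(ranked_relevance, n_relevant=None):
    """AP of one ranked list of binary relevance labels (1 = relevant).

    n_relevant is the total number of relevant documents for the query;
    if omitted, all relevant documents are assumed to appear in the list.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    denom = n_relevant if n_relevant is not None else hits
    return total / denom if denom else 0.0

def mean_average_precision(runs):
    """MAP over a list of per-query ranked relevance lists."""
    return sum(average_precision(r) for r in runs) / len(runs)

def ndcg_at_k(ranked_gains, k=20):
    """nDCG@k with linear gains; the ideal ranking re-sorts the same gains."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

# Toy usage with made-up relevance labels for two queries.
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1]])
ndcg_score = ndcg_at_k([0, 1], k=20)
```

Both metrics reward placing relevant documents near the top of the ranking, with nDCG additionally accounting for graded relevance.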
3.2. Sensitivity to Weak Annotation Quality
Our proposed setup in FWL requires defining a so-called “weak annotator” to provide a source of weak supervision for unlabeled data. In this section, we study how the quality of the weak annotator affects the performance of FWL for the task of document ranking.
To do so, besides BM25 (Robertson and Zaragoza, 2009), we use three other weak annotators: a vector space model (Salton and Yang, 1973) with binary term occurrence (BTO) weighting schema and a vector space model with TF-IDF weighting schema, which are both weaker than BM25, and BM25+RM3 (Abdul-jaleel et al., 2004), which uses pseudo-relevance feedback (Dehghani et al., 2016) on top of BM25, leading to better labels.
Figure 3 illustrates the performance of these four weak annotators in terms of their mean average precision (MAP) on the test data, versus the performance of FWL given the corresponding weak annotator. As expected, the performance of FWL depends on the quality of the employed weak annotator. The percentage of improvement of FWL over its corresponding weak annotator on the test data is also presented in Figure 3. As can be seen, the better the weak annotator performs, the smaller the improvement FWL achieves over it.
4. Conclusion
Training neural networks using large amounts of weakly annotated data is an attractive approach in scenarios where an adequate amount of data with true labels is not available, a situation which often arises in practice. In this paper, we make use of fidelity-weighted learning (FWL), a new student-teacher framework for semi-supervised learning in the presence of weakly-labeled data. We applied FWL to document ranking and empirically verified that FWL speeds up the training process and improves over state-of-the-art semi-supervised alternatives.
Our general conclusion is that explicitly modeling label quality is both possible and useful for learning task dependent data representations. The student-teacher configuration conceptually allows us to distinguish between the role of the student who learns the target representation, and the teacher who both learns to estimate the confidence in labels as well as adapts to the needs of the student by taking the student’s current state into account. One key observation is that the model with explicit feedback about the strong labels (i.e., pre-training and fine-tuning) is not as effective as the teacher model that implicitly gives feedback from the strong labels in terms of label confidence. While this can be explained in terms of avoiding overfitting or loss of generalization, there is also a conceptual explanation in terms of promoting the student to learn by not overruling or correcting, but by giving the student the right feedback to allow for a learning by discovery approach, retaining the full generative power of the student model. Arguably, this even resembles a kind of “Socratic dialog” between student and teacher.
Acknowledgments This research is funded in part by the Netherlands Organization for Scientific Research (NWO; ExPoSe project, NWO CI # 314.99.108).
References
- Abdul-jaleel et al. [2004] N. Abdul-jaleel, J. Allan, W. B. Croft, O. Diaz, L. Larkey, X. Li, M. D. Smucker, and C. Wade. UMass at TREC 2004: Novelty and HARD. In TREC-13, 2004.
- Chapelle et al. [2006] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2006.
- Cormack et al. [2011] G. V. Cormack, M. D. Smucker, and C. L. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Inf. Retr., 14(5):441–465, 2011.
- Dehghani et al. [2016] M. Dehghani, H. Azarbonyad, J. Kamps, D. Hiemstra, and M. Marx. Luhn revisited: Significant words language models. In CIKM ’16, 2016.
- Dehghani et al. [2017a] M. Dehghani, S. Rothe, E. Alfonseca, and P. Fleury. Learning to attend, copy, and generate for session-based query suggestion. In Proceedings of the International Conference on Information and Knowledge Management (CIKM ’17), 2017a.
- Dehghani et al. [2017b] M. Dehghani, A. Severyn, S. Rothe, and J. Kamps. Learning to learn from weak supervision by full supervision. In NIPS 2017 Workshop on Meta-Learning (MetaLearn 2017), 2017b.
- Dehghani et al. [2017c] M. Dehghani, A. Severyn, S. Rothe, and J. Kamps. Avoiding your teacher’s mistakes: Training neural networks with controlled weak supervision. arXiv preprint arXiv:1711.00313, 2017c.
- Dehghani et al. [2017d] M. Dehghani, H. Zamani, A. Severyn, J. Kamps, and W. B. Croft. Neural ranking models with weak supervision. In SIGIR ’17, 2017d.
- Dehghani et al. [2018] M. Dehghani, A. Mehrjou, S. Gouws, J. Kamps, and B. Schölkopf. Fidelity-weighted learning. In International Conference on Learning Representations (ICLR), 2018.
- Robertson and Zaragoza [2009] S. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
- Salton and Yang [1973] G. Salton and C.-S. Yang. On the specification of term values in automatic indexing. Journal of Documentation, 29(4):351–372, 1973.