Deep learning models have shown promising results in the field of document retrieval. Specifically, attention-based models (Vaswani et al. (2017a); Devlin et al. (2018); Xiong et al. (2016)) demonstrate clear improvements in the performance of neural models in question answering tasks. In such models, rich encodings of the claim (question) and document (answer) are generated using various attention mechanisms. A challenge when using such models in large-scale document retrieval systems is the lack of separation between document and claim encodings, which makes it infeasible to pre-index the document encodings and retrieve them efficiently at runtime. In this paper we explore the use of knowledge distillation as a means to transfer the embedded attention information to a simpler attention-free neural model.
Knowledge distillation, which uses the posterior probabilities of one model to improve the performance of another, has been widely studied (Bucila et al. (2006)). Hinton et al. (2015) discuss using the aggregate posteriors of an ensemble of acoustic deep models to improve the performance of a single model. Kim and Rush (2016) suggest using word-level knowledge distillation in Neural Machine Translation. Zagoruyko and Komodakis (2016) define an attention mechanism in Convolutional Neural Networks and use knowledge distillation to improve the performance of a student model by forcing it to mimic that mechanism. Romero et al. (2014) explore training a student model that is deeper and thinner than the teacher while utilizing both the softmax posteriors and the intermediate layer representations of the teacher. Mou et al. (2015) experiment with distilling knowledge from a large embedding to a smaller one. Hu et al. (2018) use an ensemble of models as the teacher, similar to Hinton et al. (2015), to guide the alignments of the student model in machine reading comprehension.
We conduct knowledge distillation experiments on document retrieval for the fact extraction and verification task introduced by Thorne et al. (2018). In order to keep our approach generic, no restrictions are imposed on the type of attention that the teacher model can employ. Furthermore, the student model does not need to be of the same type as the teacher model; for example, the teacher can be a CNN-based model while the student is an LSTM. The student models experimented with in this paper are both faster (up to 12x) and smaller (up to 20x) than the teacher model. We start with the problem definition and task setup in Section 2. Next, we look at model training with knowledge distillation in Section 3 and present experimental results in Section 4.
2 Problem Description
In the knowledge distillation, or teacher-student training, framework, the student model is the target model, trained using annotation labels together with information such as posteriors or hidden unit activations from a complex teacher model. In this paper we consider single-layer CNN and LSTM models with a linear layer on top as our student models. Specifically, we avoid models that require interactions between claim and document to create the encodings of both the document and the claim. The teacher model is more complex than the student model both in the number of parameters and in the structure of the network. As shown in 1, the teacher model uses claim-dependent document encodings.
The document retrieval task can be treated as a classification task: given a (claim, document) pair, shortened as $(c, d)$, assign a score indicating the relevance of the document to the claim. For each claim, the documents are sorted by the assigned score, and the top ones are picked. The metrics used are discussed in later sections.
The publicly available FEVER dataset (Thorne et al. (2018)) is used in this paper. In FEVER, a corpus of Wikipedia documents is given, and the task is to classify a given claim as supported, refuted or not enough info using the given corpus. Three sub-tasks are defined: document retrieval, sentence retrieval and textual entailment. In this paper, we focus on the document retrieval task. The corpus consists of 5.4 million pages and more than 175,000 claims. Each sample in FEVER consists of a claim, all the relevant documents, all the relevant sentences in those documents, and the annotated label.
For training and evaluating our model, we construct (claim, document, label) tuples. For each claim, all the annotated relevant documents are labeled as positive samples. DrQA (Chen et al. (2017)) is employed to find the nearest documents to a claim from the entire corpus, based on the cosine similarity of TF-IDF vectors. The top results returned by DrQA that are not annotated as relevant are labeled as negative samples. The rationale is to use the irrelevant documents most similar to the claim as negative samples, which makes the resulting dataset non-trivial. Each claim has a fixed number of documents, C. The claims are split into train, dev and test sets of 145,000, 20,000 and 10,000 claims, respectively.
Table 1 shows, for a given C, the percentage of claims for which all the annotated relevant documents are included. We use C=10, as it covers the vast majority of the claims.
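The sample construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the paper: `build_samples` and its parameters are hypothetical names, `retrieved_ids` stands in for the ranked output of the TF-IDF retriever, and the policy of filling the remaining slots up to C with the highest-ranked non-relevant documents is our reading of the text.

```python
def build_samples(claim, relevant_ids, retrieved_ids, num_docs=10):
    """Build up to `num_docs` (claim, doc_id, label) tuples for one claim.

    relevant_ids:  annotated relevant documents (positives, label 1).
    retrieved_ids: retriever output, most similar first (hypothetical input
                   standing in for the TF-IDF ranking).
    num_docs:      the fixed number of documents per claim (C in the text).
    """
    # De-duplicate while preserving order.
    relevant = list(dict.fromkeys(relevant_ids))
    relevant_set = set(relevant)

    # Assumption: positives are capped at C if a claim has more than C of them.
    positives = relevant[:num_docs]

    # Negatives are the top retrieved documents that are not annotated as relevant.
    negatives = [d for d in dict.fromkeys(retrieved_ids) if d not in relevant_set]
    negatives = negatives[: num_docs - len(positives)]

    return [(claim, d, 1) for d in positives] + [(claim, d, 0) for d in negatives]
```

With C=4, two annotated documents and five retrieved candidates, the sketch yields the two positives followed by the two best-ranked irrelevant candidates.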
Since this is a ranking task, Discounted Cumulative Gain (DCG) and Recall at the top k positions are used as the performance metrics. In order to aggregate per-claim recall values, we define the following:

$$\mathrm{Recall@}k = \frac{1}{N} \sum_{c=1}^{N} \frac{1}{R_c} \sum_{i=1}^{k} r_{c,i}$$

where $R_c$ indicates the number of relevant documents for claim $c$, $r_{c,i}$ is 1 if the document at the $i$-th position in the sorted documents list is relevant to claim $c$ and 0 otherwise, and $N$ indicates the total number of claims.
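The two ranking metrics can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names are hypothetical, per-claim recall is averaged uniformly over claims, and DCG uses the common binary-relevance form with a log2(i + 1) discount, which may differ from the exact variant used in the experiments.

```python
import math


def recall_at_k(relevance, num_relevant, k):
    """Fraction of a claim's relevant documents found in the top k positions.

    relevance:    list of 0/1 flags for the ranked documents of one claim.
    num_relevant: total number of relevant documents for that claim.
    """
    return sum(relevance[:k]) / num_relevant


def mean_recall_at_k(per_claim, k):
    """Average per-claim recall at k over all claims.

    per_claim: list of (relevance, num_relevant) pairs, one per claim.
    """
    return sum(recall_at_k(rel, n, k) for rel, n in per_claim) / len(per_claim)


def dcg_at_k(relevance, k):
    """Discounted Cumulative Gain with a log2(position + 1) discount (assumed form)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))
```

For example, a claim with two relevant documents ranked first and third has recall 0.5 at k=1 and 1.0 at k=3.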
In this section we discuss the training setup for our knowledge distillation experiments.
3.1 Teacher Model
As our teacher models, we experimented with architectures that have been shown to give state-of-the-art performance on the SQuAD task (Rajpurkar et al. (2016)). The two models that performed best were DCN (Xiong et al. (2016)) without highway maxout network layers and the Transformer (Vaswani et al. (2017b)). Note that the purpose of our study is not to find the best teacher model, but a teacher model that significantly outperforms the baseline student model. Table 2 shows the performance of these two models. We picked the DCN model as the teacher for further experiments. Both models were modified for use in our classification task.
3.2 Student Model
The following candidate models were employed as student models:
- SimpleCNN: CNN and max-pooling layers are applied separately to the claim and the document to create the encodings. A linear layer is used to join the encodings.
- SimpleLSTM: Recurrent layers are applied separately to the claim and the document to create the encodings. As in SimpleCNN, a linear layer is used to join the encodings.
The final claim and document encodings are independent of each other, as mentioned in Section 2.
3.3 Objective Function
In order to train the student model, the trained teacher model is run over the entire training, dev and test sets, and the similarity score of each (claim, document) pair is recorded. When training the student network, these similarity scores are used alongside the annotated labels. The losses are defined over the following quantities: $z^{t}_{c,d}$, $z^{s}_{c,d}$ and $y_{c,d}$ denote the teacher logits, student logits and true label of document $d$ and claim $c$, respectively. $\alpha$ is a hyperparameter that dictates the importance of the teacher loss. Setting it to 0 indicates no teacher training. $T$ (temperature) is another hyperparameter that controls how much the classification scores are smoothed. As $T$ approaches 0, the smoothed scores approach picking the largest value only.
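To make the roles of the two hyperparameters concrete, here is a minimal sketch of a distillation objective in plain Python. It is an assumption-laden illustration rather than the exact losses used in the experiments: the convex combination `(1 - alpha) * hard + alpha * soft`, the application of the MSE variant to temperature-softened probabilities, and all function names are our own choices. The sketch agrees with the text in that `alpha = 0` removes the teacher term and a smaller temperature sharpens the scores towards the largest value.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a larger temperature gives a smoother output."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target_probs, pred_probs))


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0, kind="ce"):
    """Combine a hard-label loss with a teacher (soft-label) loss.

    alpha weights the teacher term (assumed convex combination);
    kind selects the teacher loss: "ce" or "mse".
    """
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))

    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    if kind == "ce":
        soft = cross_entropy(soft_teacher, soft_student)
    elif kind == "mse":
        soft = mean_squared_error(soft_teacher, soft_student)
    else:
        raise ValueError(f"unknown teacher loss: {kind}")

    return (1.0 - alpha) * hard + alpha * soft
```

Under this formulation, sweeping `alpha` between 0 and 1 moves the objective from hard labels only to teacher scores only.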
4 Experimental Results
4.1 Full Training Data
We first experiment with training the teacher model (Section 3.1) on the entire training data, and then using its posteriors to train the student models (Section 3.2). Tables 4 and 5 show the top performing results. For each loss type and model, the top three performing models are picked. Some observations are as follows:
- Using teacher-student training improves the performance of the student models.
- Improvements resulting from knowledge distillation are larger with SimpleLSTM. This suggests that the LSTM module is better able to benefit from the information embedded in the soft labels provided by the teacher model, in addition to being better at encoding sequential inputs (Tan et al. (2015); Bahdanau et al. (2014)).
- The best performance is achieved with temperatures above 1. This shows that smoothing the logits is crucial. Hinton et al. (2015) also show improvements from smoothing. In fact, none of the top performing runs used a temperature of 1.
- The best performance is achieved when using a mix of teacher and hard labels. It can be seen that values generate the largest improvements. Using only the soft labels from the teacher model leaves out the more credible annotated labels, while using only the hard labels leaves out the extra information provided by the soft labels. This indicates that soft and hard labels provide complementary information.
- MSE vs CE: Results do not show any consistent pattern of distillation favoring one over the other. For SimpleLSTM, MSE performs better, and for SimpleCNN, CE is the better choice.
4.2 Partial Training Data
It has been claimed (Hinton et al. (2015)) that teacher-student training can act as a regularizer. We test this claim with an experiment in which only a small portion of the dataset is used for training: if the claim holds, we expect less overfitting when knowledge distillation is employed than when no teacher training is involved.
In this section, we discuss the results of experiments in which only partial training data is used to train the student model. Note that the teacher model is still trained with the entire training dataset. We experimented with the SimpleLSTM model, training it on two small fractions of the training set.
4.3 Running time
The experiments were run on AWS EC2 instances with Tesla V100 GPUs. PyTorch (Paszke et al. (2017)) was used to implement the neural models. The student model is up to 12x faster than the teacher and has up to 20x fewer parameters. This is in addition to the reduction in computational complexity obtained by reusing the indexed document encodings, as discussed in Section 1. In particular, if there are D documents and N claims, and each document must be evaluated against each claim, the computation cost of the student model is O(N+D) while the teacher's is O(ND), where the cost is measured in units of computing one claim or document encoding. Table 8 shows detailed running time metrics.
| Model | # Parameters | Loading Time | Evaluation Time |
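The O(N+D) versus O(ND) argument can be illustrated by counting encoder invocations. The sketch below is a toy under stated assumptions: `CountingEncoder`, `student_scores` and `teacher_scores` are hypothetical names, token sets stand in for neural encodings, and token overlap stands in for the scoring layer. Only the call counts are meant to carry over to real models.

```python
class CountingEncoder:
    """Toy encoder that counts how many encodings it computes."""

    def __init__(self):
        self.calls = 0

    def encode(self, text, context=None):
        # The toy encoding ignores `context`; a teacher-style model would use it
        # to produce a claim-dependent document encoding.
        self.calls += 1
        return frozenset(text.lower().split())


def overlap(claim_tokens, doc_tokens):
    """Stand-in scoring function: number of shared tokens."""
    return len(claim_tokens & doc_tokens)


def student_scores(claims, documents, encoder):
    # Document encodings are claim-independent, so they are computed once
    # (D calls) and could be indexed offline.
    doc_encodings = [encoder.encode(doc) for doc in documents]
    scores = []
    for claim in claims:
        claim_encoding = encoder.encode(claim)  # one call per claim (N calls)
        scores.append([overlap(claim_encoding, d) for d in doc_encodings])
    return scores


def teacher_scores(claims, documents, encoder):
    # Claim-dependent encodings must be recomputed for every pair (N * D calls).
    scores = []
    for claim in claims:
        claim_tokens = frozenset(claim.lower().split())
        row = []
        for doc in documents:
            doc_encoding = encoder.encode(doc, context=claim)
            row.append(overlap(claim_tokens, doc_encoding))
        scores.append(row)
    return scores
```

With 3 claims and 4 documents, the student path performs 7 encodings and the teacher path 12; the gap widens quickly as either count grows.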
In this paper, we proposed using knowledge distillation to improve the performance of student models that generate claim-independent document encodings in the document retrieval task for factual verification. We experimented with various configurations for adding the teacher model posteriors to student training, and the results show that significant improvements can be achieved across the ranking metrics without sacrificing the running time advantages of simpler models. In the future, we propose applying this work to a larger set of input documents (C), with the aim of replacing the DrQA retriever with the student model.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
- Bucila et al. (2006) Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression.
- Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. CoRR, abs/1704.00051.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. CoRR, abs/1503.02531.
- Hu et al. (2018) Minghao Hu, Yuxing Peng, Furu Wei, Zhen Huang, Dongsheng Li, Nan Yang, and Ming Zhou. 2018. Attention-guided answer distillation for machine reading comprehension.
- Kim and Rush (2016) Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. CoRR, abs/1606.07947.
- Mou et al. (2015) Lili Mou, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2015. Distilling word embeddings: An encoding approach. CoRR, abs/1506.04488.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.
- Romero et al. (2014) Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for thin deep nets. CoRR, abs/1412.6550.
- Tan et al. (2015) Ming Tan, Bing Xiang, and Bowen Zhou. 2015. LSTM-based deep learning models for non-factoid answer selection. CoRR, abs/1511.04108.
- Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and VERification.
- Vaswani et al. (2017a) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. CoRR, abs/1706.03762.
- Vaswani et al. (2017b) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. CoRR, abs/1706.03762.
- Xiong et al. (2016) Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. CoRR, abs/1611.01604.
- Zagoruyko and Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. 2016. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. CoRR, abs/1612.03928.