Model Compression with Multi-Task Knowledge Distillation for Web-scale Question Answering System

04/21/2019 ∙ by Ze Yang, et al. ∙ Microsoft

Deep pre-training and fine-tuning models (such as BERT and OpenAI GPT) have demonstrated excellent results on question answering tasks. However, due to the sheer number of model parameters, the inference speed of these models is very slow. How to apply these complex models to real business scenarios is a challenging but practical problem. Previous works often leverage model compression to resolve this problem, but compression usually induces information loss, so the compressed model cannot match the performance of the original model. To tackle this challenge, we propose a Multi-task Knowledge Distillation Model (MKDM for short) for a web-scale Question Answering system, which distills knowledge from multiple teacher models into a light-weight student model. In this way, more generalized knowledge can be transferred. Experimental results show that our method significantly outperforms the baseline methods and even achieves comparable results to the original teacher models, along with a significant speedup of model inference.




1 Introduction

The Question Answering relevance task is a fundamental task in a Q&A system Cimiano et al. (2014) (Table 1 shows an example): it is to distinguish whether an answer can well address a given question. This task provides a more natural way to retrieve information and helps users find answers more efficiently.

Question: Can CT scan detect polyps?
Passage: Polyps are diagnosed by either looking at the colon lining directly (colonoscopy) or by a specialized CT scan called CT colography (also called a virtual colonoscopy). Barium enema x-rays have been used in the past and may be appropriate …
Label: Relevant
Table 1: An example of Q&A Relevance Task.

This task is formalized as a text matching problem Xue et al. (2008). Traditional methods usually used vector space models Salton et al. (1975); Robertson et al. (1999) or shallow neural network models Huang et al. (2013); Shen et al. (2014); Palangi et al. (2014) to model the interaction similarity.

In recent years, deep pre-training approaches Radford (2018); Devlin et al. (2018) have brought great breakthroughs in NLP tasks. For question answering systems, they also show very promising results (e.g. on QnA relevance and MRC tasks). However, due to the sheer number of parameters, model inference is very time-consuming. Even with powerful GPU machines, the speed is still very limited, as shown in Table 2 (for a fair comparison, we set the batch size to 1 and limit the GPU memory to 1 GB).

Model Name Samples Per Second
BERT Base 52
BERT Large 16
Table 2: The inference speed of BERT on 1080Ti GPU.

In a commercial question answering system, two approaches are adopted for model inference: i) for head and body queries, large-scale batch-mode processing is used to compute answers offline; here, the number of QnA pairs is at the magnitude of 100 billion; ii) for tail queries, online inference is used, with a latency requirement of about 10 ms. Both approaches require fast model inference. Therefore, we have to perform model compression for inference speedup.

A popular method called knowledge distillation Hinton et al. (2015) has been widely used for model compression. It implements a teacher-student framework to transfer knowledge from complex networks to simple networks, by having the student learn the distribution of the teacher model's soft target (the label distribution given by the teacher's output) rather than only the golden label. However, model compression usually induces information loss, i.e. the performance of the student model usually cannot reach parity with its teacher model. Is it possible to have compressed models with comparable or even better performance than that of the teacher model?
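As a minimal sketch of the soft-target idea (not the paper's implementation), the distillation loss can be written as a cross-entropy between the teacher's temperature-softened output distribution and the student's; the function names here are illustrative:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature T; higher T yields a softer distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the
    teacher's soft target, in the spirit of Hinton et al. (2015)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
```

By Gibbs' inequality the loss is minimized exactly when the student reproduces the teacher's softened distribution, which is what "learning the soft target" means.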

To address the above challenge, we may consider an ensemble approach: first train multiple teacher models, then compress a separate student model from each teacher, and finally treat the ensemble of student models as the final model. Although this approach performs better than the single-teacher approach, it takes more capacity due to the multiple student models; compared with the teacher model, the ensemble is actually trading capacity for speed. The reason the ensemble beats a single student model is the following: each teacher may over-fit the training data somehow, and with multiple teachers the ensemble can cancel out the over-fitting effect to a certain degree. However, the over-fitting bias has already been transferred from teacher to student during the distillation process, so this cancelling out is a kind of "late calibration". Can we do "early calibration" during the distillation stage instead?

Based on the above motivations, we propose a unified Multi-task Knowledge Distillation Model (MKDM for short) for model compression. Specifically, we train multiple teacher models to obtain knowledge, then design a multi-task framework to train a single student by leveraging multiple teachers' knowledge, thereby improving generalization performance and cancelling out the over-fitting bias during the distillation stage.

The major contributions of our work are summarized as follows:


  • We design a multi-task learning paradigm to jointly learn multiple teachers' knowledge, so our model can improve generalization performance by leveraging the complementary knowledge among different teachers.

  • We make the first attempt to investigate the effective training of multi-task knowledge distillation for model compression, and also explore a Two-Stage Multi-task Knowledge Distillation Model for a web-scale Question Answering system.

  • We conduct experiments on large-scale datasets from a business scenario to verify the effectiveness of our proposed approach against different baseline methods.

The rest of the paper is organized as follows. After a summary of related work in Section 2, we describe the overall design of MKDM in Section 3. We then describe our proposed model in detail in Section 4, and conduct experiments for comprehensive evaluation in Section 5. Finally, Section 6 concludes this paper and discusses future directions.

Figure 1: The Overall Architecture of The Proposed Multi-task Knowledge Distillation Model.

2 Related Work

In this section we briefly review two research areas related to our work: transfer learning and model compression.

2.1 Transfer Learning

Transfer learning is a method to transfer knowledge from one task to another, and has been widely used in various fields Li et al. (2016); Murez et al. (2018); Bansal et al. (2018); He et al. (2016). For example, Cao et al. (2018) proposed a novel adversarial transfer learning framework to make full use of task-shared boundary information. Lv et al. (2018) proposed a learning-to-rank based mutual promotion procedure to incrementally optimize the classifiers based on the unlabeled data in the target domain. Peng et al. (2016) transferred the view-invariant representation of persons' appearance from a labeled source dataset to an unlabeled target dataset by dictionary learning mechanisms.

In recent years, transfer learning has achieved impressive performance on many NLP tasks Howard and Ruder (2018); Peters et al. (2018); Devlin et al. (2018). These methods leverage general-domain pre-training and novel fine-tuning techniques to prevent over-fitting even with a small amount of labeled data, and achieve state-of-the-art results. In Q&A systems, these methods provide significant improvements. However, these pre-training/fine-tuning methods incur a large computation cost due to their large model size.

2.2 Model Compression

As the size of neural network parameters grows larger and larger, how to deploy and apply these models industrially becomes an important problem. Low-rank approximation is a factorization approach Zhang et al. (2015); Jaderberg et al. (2014); Denton et al. (2014) that uses multiple low-rank matrices to approximate the original matrix. The main idea of network pruning is to remove the relatively unimportant weights in the network and then fine-tune it LeCun et al. (1989); Hassibi and Stork (1993); He et al. (2017). Hinton et al. (2015) proposed knowledge distillation (KD for short) for model compression, in which the output of a complex network is used as a soft target for training a simple network; in this way, the knowledge of complex models can be transferred to simple models. Polino et al. (2018) proposed a quantized distillation method, which incorporates a distillation loss, expressed with respect to the teacher network, into the training of a smaller student network whose weights are quantized to a limited set of levels.

Our proposed method is also a knowledge distillation based method. We use a multi-task paradigm to jointly learn different teachers' knowledge and distill it into a light-weight student.

3 The Overall Design of Our Model

Figure 1 shows the core idea of MKDM. It leverages multiple teachers to jointly train a single student in a unified framework. First, several teacher models are trained using different hyper-parameters. Then we use these teacher models to predict soft labels on the training data, so each case in the training data contains two parts: a golden label (the ground-truth label for the instance, given by human judges) and multiple soft labels (predicted by the different teacher models). At the training stage, the student model with multiple headers jointly learns the golden label and the soft labels. At the inference stage, the final output is a weighted aggregation of all the student headers' outputs. The intuition is very similar to how humans learn: a person learns not from a single teacher but from multiple teachers simultaneously, and can thereby gain less biased, more generalized knowledge.
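The data-preparation step described above can be sketched as follows. `build_distillation_data` and the callable-teacher interface are illustrative names under our reading of the pipeline, not APIs from the paper:

```python
def build_distillation_data(instances, teachers):
    """Attach one soft label per teacher to each (question, passage, golden)
    training case; the student's K headers each fit one teacher's score.
    `teachers` is a list of callables mapping (question, passage) -> [0, 1]."""
    augmented = []
    for question, passage, golden in instances:
        soft_labels = [teach(question, passage) for teach in teachers]
        augmented.append({"question": question, "passage": passage,
                          "golden": golden, "soft_labels": soft_labels})
    return augmented
```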

4 Our Approach

In this section, we first describe the proposed approach MKDM in detail (the code will be released soon), and then discuss the model training and prediction details.

4.1 MKDM

MKDM is implemented on top of BERT Devlin et al. (2018). Our model consists of three layers: the encoder layer uses the lexicon to embed both the question and the passage into a low-dimensional embedding space; the transformer layer maps the lexicon embedding to a contextual embedding; and the multi-task student layer jointly learns from multiple teachers and generates the prediction output.

4.1.1 Encoder Layer

In a Q&A system, each question and passage is described by a sequence of words. We take word pieces as input, just like BERT. Let X denote all the instances; each instance is a (Q, P) pair, where Q is a question with m word pieces, P is a passage with n word pieces, and w_i is the bag-of-word representation of the i-th word piece. Each token representation is constructed by summing the corresponding token, segment and position embeddings; let E denote all the summed vectors in a continuous space.

We concatenate a [CLS] token, Q and P, with [CLS] as the first token, and add a [SEP] token between Q and P. After that, we obtain the concatenated input of a given instance. With the encoder layer, we map this input into the continuous representations E.
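The input construction can be illustrated with a small helper, a sketch assuming standard BERT conventions ([CLS]/[SEP] special tokens, segment id 0 for the question and 1 for the passage); `build_bert_input` is a hypothetical name:

```python
def build_bert_input(question_tokens, passage_tokens):
    """Build the [CLS] Q [SEP] P [SEP] token sequence plus the segment and
    position ids whose embeddings are summed in the encoder layer."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
    segments = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    positions = list(range(len(tokens)))
    return tokens, segments, positions
```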

4.1.2 Transformer Layer

We use a bidirectional transformer encoder to map the lexicon embedding into a sequence of continuous contextual embeddings. Different from the original BERT, to compress the model we use only three transformer layers.

4.1.3 Multi-task Student Layer

To jointly learn from the multiple teacher models, we design a multi-task layer. In our model, a teacher model zoo is built with different hyper-parameters.

Our multi-task student layer consists of two parts, golden label task and soft label task:

Golden Label Task

Given an instance x, this task aims to learn the ground-truth label. Following BERT, we select the transformer hidden state h of x's first token ([CLS]) as the global representation of the input. The probability that x is labeled as class c is defined as:

p(c | x) = softmax(W_g · h)

where W_g is a learnable parameter matrix and c ∈ {0, 1} indicates whether the pair is relevant or not. The objective function of the golden label task is then defined as the cross-entropy:

L_golden = − Σ_x Σ_c 1[y_x = c] · log p(c | x)

Soft Label Task

For a given instance x, each teacher model can predict a score indicating the probability that Q and P are relevant. Taking the k-th teacher as an example, the corresponding student header estimates the relevance probability of x as:

s_k(x) = softmax(W_k · h)

where W_k is a learnable parameter matrix and s_k(x) is the relevance score. The objective function of the soft label task is defined as the mean squared error:

L_k = Σ_x (s_k(x) − t_k(x))²

where t_k(x) is the predicted score of teacher k for the given (Q, P) pair.

4.2 Training and Prediction

In order to learn the parameters of the MKDM model, we combine Equation (2) and Equation (4) to obtain our multi-task learning objective function:

L = (1 − α) · L_golden + α · Σ_k L_k     (5)

where α is the loss weighted ratio and L_k is the loss of the k-th teacher. Details of our learning algorithm are shown in Algorithm 1.

1: Initialize the student model parameters Θ
2: repeat
3:     for each mini-batch in the training data do
4:         for each instance x, compute the gradient using Equation (5)
5:         update the model parameters Θ
6:     end for
7: until convergence
8: return Θ
Algorithm 1: Training framework of MKDM
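Assuming the weighting scheme discussed in Sections 4.2 and 5.6.3 (the golden-label loss weighted by 1 − α and the teachers' losses by α, so that α = 1.0 uses only soft labels), the per-batch objective can be sketched as:

```python
def mkdm_loss(golden_ce, soft_mses, alpha=0.9):
    """Multi-task objective sketch: combine the golden-label cross-entropy
    with the per-teacher soft-label MSE losses via the loss weighted ratio
    alpha. The exact weighting is our reading of Equation (5)."""
    return (1.0 - alpha) * golden_ce + alpha * sum(soft_mses)
```

With α = 1.0 the golden-label term vanishes, matching the ablation row in Table 7 where only the soft label headers drive training.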
Dataset Number of Samples Average Question Length Average Answer Length
DeepQA 1,000,000 5.86 43.74
Table 3: Statistics of experiment datasets.

At the inference stage, we use an aggregation operation to calculate the final result:

y = Σ_k w_k · y_k

where y_k represents the k-th student header's output and the weights w_k sum to one.
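Since the aggregation weights are not specified in this version of the text, the following sketch defaults to uniform averaging over the K headers:

```python
def aggregate_headers(header_scores, weights=None):
    """Weighted aggregation of the K student headers' outputs at inference.
    Defaults to uniform averaging; the paper's exact weights are unspecified."""
    if weights is None:
        weights = [1.0 / len(header_scores)] * len(header_scores)
    return sum(w * s for w, s in zip(weights, header_scores))
```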

5 Experiment

5.1 Dataset

Our experimental dataset (called DeepQA) is randomly sampled from a commercial Q&A system's large dataset. It contains 1 million labeled Q&A cases covering various domains, such as health, tech and sports. Each case consists of three parts: a question, a passage, and a binary label (0 or 1) given by human judges indicating whether the question can be answered by the passage. The statistics are shown in Table 3.

5.2 Evaluation metrics

We use the following metrics for model performance evaluation:


  • Accuracy: the number of correct predictions divided by the total number of samples in the test set.

  • Area Under Curve: one of the most widely used metrics for evaluating binary classification performance. It equals the probability that the classifier ranks a randomly chosen positive example higher than a randomly chosen negative example.

  • Queries Per Second: the number of cases processed per second. We use this metric to evaluate model inference speed.
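Both quality metrics follow directly from their definitions above; this is an illustrative sketch (AUC via the rank statistic, with ties counted as half), not the paper's evaluation code:

```python
def accuracy(labels, predictions):
    """Fraction of correct binary predictions."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def auc(labels, scores):
    """AUC as the probability that a random positive example is scored above
    a random negative one (ties count 0.5), matching the rank interpretation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```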

5.3 Baselines

We compare our model with several strong baseline models to verify the effectiveness of our proposed approach. All the baseline methods are based on the BERT pre-trained model:


  • Original BERT: We use the fine-tuned BERT base model as a strong baseline; it consists of 12 transformer layers, a hidden size of 768, and 12 attention heads. Several fine-tuned BERT base models are trained using different hyper-parameters.

  • Single Student Model: A 3-layer BERT base model is selected as the student architecture, with parameters initialized from the BERT base weights. Different from MKDM, this student learns from a single teacher model using knowledge distillation; the teacher is the best model from the first baseline, i.e. the Original BERT model.

  • Student Model Ensemble: For each fine-tuned BERT base model from the first baseline, knowledge distillation is used to train a 3-layer BERT student model. We train 3 student models using 3 different teacher models, then ensemble these students by simply averaging their output scores.

Model Inference Speed (QPS) Parameters ACC AUC
Original BERT Model 52 427,721 80.89 88.72
Single Student Model 217 178,506 76.29 84.12
Student Model Ensemble (3) 217 / 3 178,506 * 3 76.77 84.43
MKDM 217 178,512 77.18 85.14
Table 4: Model Comparison Between our Methods and Baseline Methods. ACC, AUC denote accuracy and area under curve respectively (all AUC/ACC metrics in the table are percentage numbers with % omitted).

5.4 Parameter Settings

For all baselines and MKDM, we implement on top of the PyTorch implementation of BERT. We optimize MKDM with a learning rate of and a batch size of 256. In all cases, the hidden size is set to 768, the number of self-attention heads to 12, and the feed-forward/filter size to 3072.

To compress the original BERT model, we set the number of transformer layers to 3. The teacher models in MKDM are identical to those of the student ensemble model. Neither the baselines nor MKDM are trained from scratch; we fine-tune the student models from the pre-trained BERT model weights.

5.5 Comparison Against Baselines

In this section, we compare MKDM with the baselines along three dimensions: inference speed, parameter size and performance. From the results shown in Table 4, we make the following observations:


  • It is not surprising that the original BERT model shows the best performance due to its sheer number of parameters, but its inference speed is very slow and its memory consumption is large.

  • The single student model obtains good results in terms of inference speed and memory capacity, but there are still gaps from the original BERT model in ACC and AUC.

  • The student model ensemble performs better than the single student model. However, its inference cost and memory consumption grow in proportion to the number of student models ensembled.

  • Compared with the single student model and the student ensemble, our MKDM achieves the best trade-off across all three dimensions. Compared to the single student model, MKDM needs only a small amount of additional memory, since the majority of the parameters are shared across the different tasks.

To conclude, on the DeepQA dataset MKDM outperforms the two strong knowledge-distillation baselines (the single student model and the student ensemble) along all three dimensions, and further narrows the performance gap with the original BERT model, which verifies the effectiveness of MKDM.

5.6 Effective Training of the MKDM Model

In this section, we further analyze how to train MKDM more effectively.

5.6.1 The Impact of Pre-training Weights

BERT shows excellent results on plenty of NLP tasks by leveraging a large amount of unsupervised data for pre-training to obtain better contextual representations. In the MKDM model, our best practice is to initialize the first three layers with BERT pre-training weights.

The results in Table 5 show the performance comparison between initializing with pre-training weights and random initializing.

Strategy ACC AUC
Random Initializing Weights 67.32 77.18
Load Pre-training Weights 77.18 85.14
Table 5: The Impact of BERT Pre-training Weights.

From the results, we can see that the model initialized from pre-training weights outperforms training from scratch; the relative improvements in ACC and AUC are around 14.6% and 10.3%, which is significant. Meanwhile, the pre-training weights make the model converge faster during training.

Figure 2: The Overall Architecture of Our Two Stage Model.

5.6.2 The Impact of Different Transformer Layer Number

The most important architectural hyper-parameter of BERT is the number of transformer layers. In MKDM, this number is set to 3. Here we investigate the impact of different depths: we compare the performance of MKDM with 1, 3, 5, 7 and 9 transformer layers, and the results are shown in Table 6.

Layer Count Inference Speed (QPS) ACC AUC
1 511 70.02 75.75
3 217 77.18 85.14
5 141 78.51 86.65
7 96 79.82 87.84
9 66 80.57 88.31
Table 6: The Comparison for Different Number of Transformer Layer.

From the results, we can draw the following observations:


  • As the number of transformer layers increases, the AUC and ACC metrics increase as well, but the inference speed decreases. This is easy to understand: more transformer layers bring a larger parameter size, which benefits feature representation but greatly hurts inference efficiency.

  • As the number of transformer layers increases, the performance gain between two consecutive trials decreases. That is, when the layer count increases from 1 to 3, the gains in ACC and AUC are 7.16 and 9.39 points, a very large improvement; when it increases from 3 to 5, 5 to 7 and 7 to 9, the gains shrink to around one point or less. We conjecture that once the transformer reaches a certain depth, the model's representation capability is sufficient and adding more layers brings no significant extra value.

Based on these results, we set the number of transformer layers to 3 for MKDM, since this setting has the highest performance/computation-cost ratio, which better meets the requirements of web-scale applications.

More interestingly, in a real business scenario, as the data scale increases, the 3-layer MKDM also shows the potential to achieve results comparable to the original teacher models, which is introduced in Section 5.7.

5.6.3 The Impact of Loss Weighted Ratio

Here we investigate the impact of the loss weighted ratio α defined in Section 4.2, where α ∈ (0, 1]. In particular, when the ratio is set to 1.0, we use only the soft label headers to calculate the final output. Table 7 shows the performance of MKDM against different values of α.

Loss Weighted Ratio ACC AUC
0.1 75.48 83.39
0.3 75.73 83.75
0.5 76.25 84.10
0.7 76.56 84.38
0.9 77.18 85.14
1.0 76.30 84.33
Table 7: The Impact of Different Loss Weighted Ratio.

From the results, we obtain the following observations:


  • The larger the ratio, the better the performance (except when α is 1.0).

  • Without the golden label (i.e. when α is 1.0), performance decreases. This is like letting a person learn only from teachers without reading any books: obviously, in this case they cannot master comprehensive knowledge.

5.7 Enhanced Student Model with Two-Stage Multi-Task Knowledge Distillation

In most real business scenarios, it is relatively easy to obtain a large amount of unlabeled data. MKDM leverages only labeled data for model training. In fact, based on the MKDM paradigm, we can leverage not only labeled data for knowledge distillation, but also a large amount of unlabeled data. Based on this idea, we further propose a Two-Stage MKDM (TS-MKDM for short) approach (as shown in Figure 2):

  1. Multi-task knowledge distillation for pre-training: at the first stage, the student model learns from the teacher models' soft labels as its only optimization objective.

  2. Multi-task knowledge distillation for fine-tuning: at the second stage, just as in the original MKDM model, the student model jointly learns the golden label and the teacher models' soft labels.
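The two-stage schedule can be sketched as follows; `pretrain_step` and `finetune_step` are caller-supplied update functions (hypothetical names, not APIs from the paper):

```python
def two_stage_training(student, unlabeled_soft_batches, labeled_batches,
                       pretrain_step, finetune_step):
    """TS-MKDM schedule sketch: stage 1 fits only the teachers' soft labels
    on unlabeled data; stage 2 runs standard MKDM (golden + soft labels)
    on labeled data."""
    for batch in unlabeled_soft_batches:   # stage 1: distillation pre-training
        student = pretrain_step(student, batch)
    for batch in labeled_batches:          # stage 2: multi-task fine-tuning
        student = finetune_step(student, batch)
    return student
```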

To verify our idea, we collect two larger commercial datasets (called CommQA-Unlabeled and CommQA-Labeled):


  • CommQA-Unlabeled: around 100 million question-passage pairs collected from a commercial search engine (without labels). First, for each question, the top 10 relevant documents returned by the search engine are selected to form question-document pairs; then passages are extracted from these documents to form question-document-passage triples; finally the question-passage pairs are used as the experimental dataset.

  • CommQA-Labeled: a human-labeled dataset, several times larger than DeepQA and with more diversified data.

CommQA-Unlabeled is used for the pre-training stage described above, while DeepQA and CommQA-Labeled are each used in the fine-tuning stage to evaluate the performance of TS-MKDM. Table 8 shows the comparison between MKDM and TS-MKDM. From the results, we observe the following:


  • On both datasets, TS-MKDM outperforms MKDM by a large margin, which shows that incorporating multi-task knowledge distillation into pre-training can further boost model performance.

  • Interestingly, by leveraging multi-task knowledge distillation on a super-large-scale dataset for pre-training, the evaluation results on the CommQA-Labeled dataset show that the TS-MKDM model even exceeds the performance of the teacher model (AUC 87.50 vs. 86.50, ACC 79.22 vs. 77.00). This further verifies TS-MKDM's effectiveness.

Dataset DeepQA (ACC / AUC) CommQA-Labeled (ACC / AUC)
Original BERT 80.89 88.72 77.00 86.50
MKDM 77.18 85.14 77.32 85.71
TS-MKDM 78.47 86.36 79.22 87.50
Table 8: The Performance comparison between MKDM and TS-MKDM.

6 Conclusion and Future Work

In this paper, we propose a novel Multi-task Knowledge Distillation Model (MKDM) for model compression. A new multi-task paradigm is designed to jointly learn from multiple teacher models, so our student model can learn more generalized knowledge from different teachers. Results show that our proposed method outperforms the baseline methods by a great margin, along with a significant speedup of model inference. We further perform extensive experiments to explore a Two-Stage Multi-task Knowledge Distillation Model (TS-MKDM) based on MKDM. The results show that in a real industry scenario with super-large-scale data, TS-MKDM even outperforms the original teacher model.

In the future, on the one hand, we will investigate heterogeneous student models (not transformer-based) to evaluate our multi-task knowledge distillation approach and further improve model agility. On the other hand, we will extend our method to more tasks, such as sentence classification and machine reading comprehension.


  • Bansal et al. (2018) Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. 2018. Zero-shot object detection. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pages 397–414.
  • Cao et al. (2018) Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao, and Shengping Liu. 2018. Adversarial transfer learning for Chinese named entity recognition with self-attention mechanism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 182–192.
  • Cimiano et al. (2014) Philipp Cimiano, Christina Unger, and John McCrae. 2014. Ontology-based interpretation of natural language. Synthesis Lectures on Human Language Technologies, 7(2):1–178.
  • Denton et al. (2014) Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1269–1277.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Hassibi and Stork (1993) Babak Hassibi and David G. Stork. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 164–171. Morgan-Kaufmann.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  • He et al. (2017) Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1398–1406.
  • Hinton et al. (2015) Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. arXiv: Machine Learning.
  • Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 328–339.
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry P. Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013, pages 2333–2338.
  • Jaderberg et al. (2014) Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. 2014. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.
  • LeCun et al. (1989) Yann LeCun, John S. Denker, and Sara A. Solla. 1989. Optimal brain damage. In Advances in Neural Information Processing Systems 2, [NIPS Conference, Denver, Colorado, USA, November 27-30, 1989], pages 598–605.
  • Li et al. (2016) Mu Li, Wangmeng Zuo, and David Zhang. 2016. Deep identity-aware transfer of facial attributes. CoRR, abs/1610.05586.
  • Lv et al. (2018) Jianming Lv, Weihang Chen, Qing Li, and Can Yang. 2018. Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 7948–7956.
  • Murez et al. (2018) Zak Murez, Soheil Kolouri, David J. Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. 2018. Image to image translation for domain adaptation. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 4500–4509.
  • Palangi et al. (2014) Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab K. Ward. 2014. Semantic modelling with long-short-term memory for information retrieval. CoRR, abs/1412.6629.
  • Peng et al. (2016) Peixi Peng, Tao Xiang, Yaowei Wang, Massimiliano Pontil, Shaogang Gong, Tiejun Huang, and Yonghong Tian. 2016. Unsupervised cross-dataset transfer learning for person re-identification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 1306–1315.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.
  • Polino et al. (2018) Antonio Polino, Razvan Pascanu, and Dan Alistarh. 2018. Model compression via distillation and quantization. CoRR, abs/1802.05668.
  • Radford (2018) Alec Radford. 2018. Improving language understanding by generative pre-training.
  • Robertson et al. (1999) Stephen E Robertson, Steve Walker, Micheline Beaulieu, and Peter Willett. 1999. Okapi at trec-7: automatic ad hoc, filtering, vlc and interactive track. Nist Special Publication SP, (500):253–264.
  • Salton et al. (1975) Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.
  • Shen et al. (2014) Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, Shanghai, China, November 3-7, 2014, pages 101–110.
  • Xue et al. (2008) Xiaobing Xue, Jiwoon Jeon, and W Bruce Croft. 2008. Retrieval models for question and answer archives. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 475–482. ACM.
  • Zhang et al. (2015) Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. 2015. Efficient and accurate approximations of nonlinear convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1984–1992.