Ensembled CTR Prediction via Knowledge Distillation

by Jieming Zhu, et al.

Recently, deep learning-based models have been widely studied for click-through rate (CTR) prediction and have led to improved prediction accuracy in many industrial applications. However, current research focuses primarily on building complex network architectures to better capture sophisticated feature interactions and dynamic user behaviors. The increased model complexity may slow down online inference and hinder adoption in real-time applications. Instead, our work targets a new model training strategy based on knowledge distillation (KD). KD is a teacher-student learning framework that transfers knowledge learned by a teacher model to a student model. The KD strategy not only allows us to simplify the student model to a vanilla DNN but also achieves significant accuracy improvements over the state-of-the-art teacher models. These benefits motivate us to further explore the use of a powerful ensemble of teachers for more accurate student model training. We also propose novel techniques to facilitate ensembled CTR prediction, including teacher gating and early stopping by distillation loss. We conduct comprehensive experiments against 12 existing models across three industrial datasets. Both offline and online A/B testing results show the effectiveness of our KD-based training strategy.
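
The abstract names three ingredients (a distillation loss, teacher gating, and early stopping by distillation loss) without giving formulas. The sketch below shows one common way such pieces could fit together for a single training sample; it is an illustration under stated assumptions, not the authors' implementation. The soft-label cross-entropy form, the softmax gate, the weight `gamma`, the `patience` rule, and every function name are hypothetical choices made for this example.

```python
import math


def sigmoid(z):
    """Map a logit to a click probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(target, prob, eps=1e-7):
    """Binary cross-entropy; `target` may be a hard label or a soft probability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_teacher_prob(teacher_probs, gate_scores):
    """Assumed gating: softmax-weighted average of the teachers' predictions."""
    weights = softmax(gate_scores)
    return sum(w * p for w, p in zip(weights, teacher_probs))


def kd_loss(label, student_logit, teacher_probs, gate_scores, gamma=1.0):
    """Return (total loss, distillation term) for one sample.

    total = BCE(label, student) + gamma * BCE(gated teacher, student)
    This combination is an assumption for illustration.
    """
    student_prob = sigmoid(student_logit)
    soft_target = gated_teacher_prob(teacher_probs, gate_scores)
    hard = bce(label, student_prob)
    distill = bce(soft_target, student_prob)
    return hard + gamma * distill, distill


def should_stop(distill_history, patience=2):
    """Assumed rule: stop once the validation distillation loss has not
    improved on its earlier best for `patience` consecutive epochs."""
    if len(distill_history) <= patience:
        return False
    best_before = min(distill_history[:-patience])
    return min(distill_history[-patience:]) >= best_before


if __name__ == "__main__":
    total, distill = kd_loss(label=1.0, student_logit=0.3,
                             teacher_probs=[0.7, 0.9, 0.6],
                             gate_scores=[0.1, 0.5, -0.2])
    print(f"total={total:.4f} distill={distill:.4f}")
```

In a real training loop the gate scores would typically be produced by a small learned network and the losses averaged over a mini-batch; both are left out here to keep the example short.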


Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution

Knowledge distillation has been used to transfer knowledge learned by a ...

Directed Acyclic Graph Factorization Machines for CTR Prediction via Knowledge Distillation

With the growth of high-dimensional sparse data in web-scale recommender...

BD-KD: Balancing the Divergences for Online Knowledge Distillation

Knowledge distillation (KD) has gained a lot of attention in the field o...

Towards domain generalisation in ASR with elitist sampling and ensemble knowledge distillation

Knowledge distillation has widely been used for model compression and do...

Distill2Vec: Dynamic Graph Representation Learning with Knowledge Distillation

Dynamic graph representation learning strategies are based on different ...

Unsupervised Deep Digital Staining For Microscopic Cell Images Via Knowledge Distillation

Staining is critical to cell imaging and medical diagnosis, which is exp...

How many Observations are Enough? Knowledge Distillation for Trajectory Forecasting

Accurate prediction of future human positions is an essential task for m...
