Structural Knowledge Distillation

10/10/2020
by Xinyu Wang, et al.

Knowledge distillation is a critical technique for transferring knowledge between models, typically from a large model (the teacher) to a smaller one (the student). The objective function of knowledge distillation is typically the cross-entropy between the teacher's and the student's output distributions. However, for structured prediction problems, the output space is exponential in size; therefore, the cross-entropy objective becomes intractable to compute and optimize directly. In this paper, we derive a factorized form of the knowledge distillation objective for structured prediction, which is tractable for many typical choices of the teacher and student models. In particular, we show the tractability and empirical effectiveness of structural knowledge distillation between sequence labeling and dependency parsing models under four different scenarios: 1) the teacher and student share the same factorization form of the output structure scoring function; 2) the student factorization produces smaller substructures than the teacher factorization; 3) the teacher factorization produces smaller substructures than the student factorization; 4) the factorization forms of the teacher and the student are incompatible.
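
As a rough sketch of the factorization the abstract refers to (the notation here is ours, not necessarily the paper's): write $P_t$ and $P_s$ for the teacher's and the student's distributions over structures $y \in \mathcal{Y}(x)$. The distillation objective is the cross-entropy

\[
\mathcal{L}_{\mathrm{KD}}(x) \;=\; -\sum_{y \in \mathcal{Y}(x)} P_t(y \mid x)\, \log P_s(y \mid x),
\]

whose sum ranges over exponentially many structures. If the student's score factorizes over substructures $u$ with global normalization, i.e. $\log P_s(y \mid x) = \sum_{u \in y} s_s(u, x) - \log Z_s(x)$, then swapping the order of summation gives

\[
\mathcal{L}_{\mathrm{KD}}(x) \;=\; -\sum_{u \in \mathcal{U}(x)} P_t(u \mid x)\, s_s(u, x) \;+\; \log Z_s(x),
\]

where $P_t(u \mid x) = \sum_{y \,:\, u \in y} P_t(y \mid x)$ is the teacher's marginal probability of substructure $u$ and $\mathcal{U}(x)$ is the set of all candidate substructures. Under these assumptions the objective is tractable whenever the teacher's substructure marginals and the student's partition function can be computed efficiently, e.g. via forward-backward for linear-chain models or inside-outside / Matrix-Tree computations for dependency structures.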


