Understanding the Overfitting of the Episodic Meta-training

06/29/2023
by   Siqi Hui, et al.

Despite the success of two-stage few-shot classification methods, models suffer from severe overfitting during the episodic meta-training stage. We hypothesize that this is caused by over-discrimination: the model learns to over-rely on superficial features that fit base-class discrimination while suppressing novel-class generalization. To penalize over-discrimination, we introduce knowledge distillation techniques that preserve novel-class generalization knowledge from a teacher model during training. Specifically, we select as the teacher the model with the best validation accuracy during meta-training and constrain the symmetric Kullback-Leibler (SKL) divergence between the output distribution of the teacher's linear classifier and that of the student's. This simple approach outperforms the standard meta-training process. We further propose the Nearest Neighbor Symmetric Kullback-Leibler (NNSKL) divergence for meta-training to push the limits of knowledge distillation. NNSKL takes few-shot tasks as input and penalizes the output of the nearest-neighbor classifier, which influences the relationships between query embeddings and support centers. By combining SKL and NNSKL in meta-training, the model achieves even better performance and surpasses state-of-the-art results on several benchmarks.
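The abstract does not give the exact loss formulations, so the following is a minimal PyTorch sketch of what an SKL distillation term and an episodic NNSKL term might look like. The function names (skl_divergence, nn_logits, nnskl_loss), the temperature parameters, and the choice of negative squared Euclidean distance as nearest-neighbor logits are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def skl_divergence(student_logits, teacher_logits, tau=4.0):
    """Symmetric KL divergence between the softened output distributions
    of the student and teacher classifiers (tau is an assumed temperature)."""
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    log_p_t = F.log_softmax(teacher_logits / tau, dim=-1)
    kl_st = F.kl_div(log_p_s, log_p_t, reduction="batchmean", log_target=True)
    kl_ts = F.kl_div(log_p_t, log_p_s, reduction="batchmean", log_target=True)
    return 0.5 * (kl_st + kl_ts)


def nn_logits(query_emb, support_centers, temperature=1.0):
    """Nearest-neighbor classifier logits for an episode: negative squared
    Euclidean distance from each query embedding to each support center."""
    dists = torch.cdist(query_emb, support_centers, p=2) ** 2
    return -dists / temperature


def nnskl_loss(student_queries, teacher_queries,
               student_centers, teacher_centers):
    """NNSKL sketch: symmetric KL between the nearest-neighbor classifier
    outputs of the student and the frozen teacher on the same few-shot task,
    penalizing drift in query-to-support-center relationships."""
    s_logits = nn_logits(student_queries, student_centers)
    t_logits = nn_logits(teacher_queries, teacher_centers)
    return skl_divergence(s_logits, t_logits)
```

In this sketch the teacher (selected by best validation accuracy) would be kept frozen, and the SKL and NNSKL terms would be added to the standard episodic meta-training loss with suitable weights.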

