KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer

02/22/2023
by Kaikai Zhao, et al.

Scaled dot-product attention applies a softmax function to the scaled dot product of queries and keys to compute attention weights, and then multiplies the weights by the values. In this work, we study how to improve the learning of scaled dot-product attention so as to improve the accuracy of DETR. Our method is based on two observations: using the ground-truth foreground-background mask (GT Fg-Bg Mask) as an additional cue when learning the weights/values enables much better weights/values to be learned; and with better weights/values, better values/weights can in turn be learned. We propose a triple-attention module in which the first attention is a plain scaled dot-product attention, while the second/third attention generates high-quality weights/values (with the assistance of the GT Fg-Bg Mask) and shares its values/weights with the first attention to improve the quality of the latter's values/weights. The second and third attentions are removed during inference. We call our method knowledge-sharing DETR (KS-DETR). It extends knowledge distillation (KD) in that the improved weights and values of the teachers (the second and third attentions) are directly shared by, rather than mimicked by, the student (the first attention), enabling more efficient knowledge transfer from the teachers to the student. Experiments on various DETR-like methods show consistent improvements over the baselines on the MS COCO benchmark. Code is available at https://github.com/edocanonymous/KS-DETR.
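
The abstract describes the triple-attention module only at a high level. Below is a minimal single-head PyTorch sketch of the idea, under two assumptions that the abstract does not confirm: that the GT Fg-Bg Mask is injected as an additive bias on the attention logits in the second attention, and as a multiplicative gate on the values in the third. Names such as `TripleAttention` and `fg_bias` are hypothetical; the actual KS-DETR implementation may differ (see the linked repository).

```python
# Minimal sketch of the triple-attention idea, single-head for brevity.
# Assumption: the GT Fg-Bg mask enters as a logit bias (second attention)
# and as a value gate (third attention); both teacher branches are dropped
# at inference, leaving only the plain (student) attention.
import math
import torch
import torch.nn as nn


class TripleAttention(nn.Module):
    def __init__(self, d_model, fg_bias=2.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.fg_bias = fg_bias  # how strongly foreground keys are favored

    def forward(self, x, fg_mask=None):
        # x: (batch, tokens, d_model); fg_mask: (batch, tokens) in {0, 1}
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scale = 1.0 / math.sqrt(q.size(-1))
        logits = torch.bmm(q, k.transpose(1, 2)) * scale
        weights = logits.softmax(dim=-1)      # plain (student) weights
        out_student = torch.bmm(weights, v)   # first attention

        if self.training and fg_mask is not None:
            # Second attention: mask-improved weights, shared student values.
            biased_logits = logits + self.fg_bias * fg_mask.unsqueeze(1)
            weights_teacher = biased_logits.softmax(dim=-1)
            out_w_teacher = torch.bmm(weights_teacher, v)

            # Third attention: mask-improved values, shared student weights.
            v_teacher = v * (1.0 + fg_mask.unsqueeze(-1))
            out_v_teacher = torch.bmm(weights, v_teacher)

            # All three outputs would be supervised by the detection loss;
            # gradients through the shared values/weights improve the student.
            return out_student, out_w_teacher, out_v_teacher

        # Inference: only the first attention remains.
        return out_student
```

Because the teacher branches reuse the student's value and weight tensors directly, any improvement they induce propagates to the student through shared parameters rather than through a mimicry loss, which is the "sharing instead of distilling" point the abstract makes.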

