XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

04/15/2022
by Chan-Jan Hsu, et al.

Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders. Our framework is inspired by the success of cross-modal encoders in visual-language tasks, but we alter the learning objective to cater to the language-heavy characteristics of NLU. After a small number of extra adaptation steps followed by finetuning, the proposed XDBERT (cross-modal distilled BERT) outperforms pretrained BERT on the General Language Understanding Evaluation (GLUE) benchmark, Situations With Adversarial Generations (SWAG), and readability benchmarks. Our analysis of XDBERT's performance on GLUE suggests that the improvement is likely visually grounded.
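The abstract does not spell out the distillation recipe, so the following is only a minimal, hypothetical sketch of the general idea: adapting a pretrained BERT by pulling its [CLS] representation toward the text features of a pretrained cross-modal encoder. The choice of a CLIP text encoder as teacher, the MSE alignment loss, the linear projection, and all hyperparameters are illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn as nn
from transformers import (BertModel, BertTokenizer,
                          CLIPTextModel, CLIPTokenizer)

# Student: a pretrained language encoder (BERT).
student = BertModel.from_pretrained("bert-base-uncased")
s_tok = BertTokenizer.from_pretrained("bert-base-uncased")

# Teacher (assumption): a frozen CLIP text encoder standing in for the
# text stream of a pretrained multimodal transformer.
teacher = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()
t_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
for p in teacher.parameters():
    p.requires_grad_(False)

# Bridge BERT's hidden size (768) to the teacher's width (512 here).
proj = nn.Linear(student.config.hidden_size, teacher.config.hidden_size)
opt = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-5)

def adaptation_step(sentences):
    """One distillation step: align BERT's [CLS] embedding with the
    teacher's (visually grounded) pooled text embedding via MSE."""
    s_in = s_tok(sentences, return_tensors="pt", padding=True, truncation=True)
    t_in = t_tok(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        target = teacher(**t_in).pooler_output
    cls = student(**s_in).last_hidden_state[:, 0]   # [CLS] token
    loss = nn.functional.mse_loss(proj(cls), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(adaptation_step(["A dog catches a frisbee in the park."]))
```

After this brief adaptation phase, the student would be finetuned on downstream NLU tasks (e.g., the GLUE tasks) in the usual way.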
