VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

by Estelle Aflalo, et al.

Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems. However, although visualization and interpretability tools have become available for NLP models, the internal mechanisms of vision and multimodal transformers remain largely opaque. With the success of these transformers, it is increasingly critical to understand their inner workings, as unraveling these black boxes will lead to more capable and trustworthy models. To contribute to this quest, we propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers. VL-InterpreT is a task-agnostic, integrated tool that (1) tracks a variety of statistics in attention heads throughout all layers for both vision and language components, (2) visualizes cross-modal and intra-modal attentions through easily readable heatmaps, and (3) plots the hidden representations of vision and language tokens as they pass through the transformer layers. In this paper, we demonstrate the functionalities of VL-InterpreT through the analysis of KD-VLP, an end-to-end pretrained vision-language multimodal transformer, on Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks. We also present a few interesting findings about multimodal transformer behaviors that we learned through our tool.
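
As a rough illustration of the kind of per-head attention statistic mentioned in (1), the sketch below computes how much attention the language tokens of a single head place on vision tokens. It is a minimal example under stated assumptions, not VL-InterpreT's actual implementation: the function name `cross_modal_attention_share`, the choice of metric, and the toy attention matrix are all hypothetical.

```python
from typing import Sequence


def cross_modal_attention_share(
    attention: Sequence[Sequence[float]],
    is_vision: Sequence[bool],
) -> float:
    """Mean fraction of attention that language-token queries place on vision-token keys.

    attention[q][k] is the weight query token q assigns to key token k in one head.
    is_vision[i] is True when token i is a vision token (hypothetical input format).
    """
    n = len(is_vision)
    if len(attention) != n or any(len(row) != n for row in attention):
        raise ValueError("attention must be square and match is_vision in length")

    shares = []
    for q, row in enumerate(attention):
        if is_vision[q]:
            continue  # only language-token queries are counted
        total = sum(row)
        if total == 0:
            continue  # skip rows with no attention mass (e.g. fully masked tokens)
        to_vision = sum(w for w, v in zip(row, is_vision) if v)
        shares.append(to_vision / total)

    # Return 0.0 when there are no language-token queries to average over.
    return sum(shares) / len(shares) if shares else 0.0


# Toy head (assumed values): tokens 0-1 are language tokens, tokens 2-3 are vision tokens.
toy_head = [
    [0.5, 0.1, 0.2, 0.2],
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.1, 0.4, 0.4],
    [0.0, 0.0, 0.5, 0.5],
]
toy_mask = [False, False, True, True]
share = cross_modal_attention_share(toy_head, toy_mask)
```

Computing this value for every head in every layer would give a layers-by-heads grid of language-to-vision attention, which is one plausible input for a summary heatmap of the sort described above.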



XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

Transformer-based models are widely used in natural language understandi...

TxT: Crossmodal End-to-End Learning with Transformers

Reasoning over multiple modalities, e.g. in Visual Question Answering (V...

Interpreting Transformer's Attention Dynamic Memory and Visualizing the Semantic Information Flow of GPT

Recent advances in interpretability suggest we can project weights and h...

Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP

Mechanistic interpretability seeks to understand the neural mechanisms t...

exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformers Models

Large language models can produce powerful contextual representations th...

Latent Attention Networks

Deep neural networks are able to solve tasks across a variety of domains...

Parameter Efficient Multimodal Transformers for Video Representation Learning

The recent success of Transformers in the language domain has motivated ...
