DeViL: Decoding Vision features into Language

09/04/2023
by Meghal Dani, et al.

Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks. In this work, we provide natural language descriptions of what different layers of a vision backbone have learned. Our DeViL method decodes vision features into language, not only highlighting attribution locations but also generating textual descriptions of visual features at different layers of the network. We train a transformer network to translate individual image features of any vision layer into a prompt that a separate, off-the-shelf language model decodes into natural language. By employing dropout both per-layer and per-spatial-location, our model generalizes from training on image-text pairs to generating localized explanations. Because it uses a pre-trained language model, our approach is fast to train, can be applied to any vision backbone, and produces textual descriptions at different layers of the vision network. Moreover, DeViL can create open-vocabulary attribution maps for words or phrases even outside the training scope of the vision model. We demonstrate that DeViL generates textual descriptions relevant to the image content on CC3M, surpassing previous lightweight captioning models, and produces attribution maps that uncover the learned concepts of the vision backbone. Finally, we show that DeViL also outperforms the current state of the art on neuron-wise descriptions from the MILANNOTATIONS dataset. Code is available at https://github.com/ExplainableML/DeViL
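The abstract's core mechanism — pooling vision features into a prompt for a frozen language model, with dropout applied both per-layer and per-spatial-location — can be sketched roughly as follows. This is a minimal numpy illustration of the idea only, not the released implementation; all names (`translate_features`, `dropout_mask`, the dimensions) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(shape, p, rng):
    """Binary keep-mask: each entry is kept with probability 1 - p."""
    return (rng.random(shape) >= p).astype(np.float32)

def translate_features(layer_feats, w_proj, p_layer=0.5, p_spatial=0.5, rng=rng):
    """Map per-layer vision features to a prompt matrix for a language model.

    layer_feats: list of arrays, one per vision layer, each (num_tokens, D)
    w_proj: (D, prompt_dim) projection into the LM's embedding space
    """
    pooled = []
    for feats in layer_feats:
        # per-spatial-location dropout: mask individual spatial tokens,
        # so the translator learns to describe localized features
        keep = dropout_mask((feats.shape[0], 1), p_spatial, rng)
        masked = feats * keep
        denom = max(keep.sum(), 1.0)
        pooled.append(masked.sum(axis=0) / denom)  # mean over kept locations
    pooled = np.stack(pooled)                      # (num_layers, D)
    # per-layer dropout: drop whole layers, so any single layer's features
    # can later be decoded into a description on their own
    layer_keep = dropout_mask((pooled.shape[0], 1), p_layer, rng)
    if layer_keep.sum() == 0:                      # keep at least one layer
        layer_keep[rng.integers(pooled.shape[0])] = 1.0
    return (pooled * layer_keep) @ w_proj          # (num_layers, prompt_dim)
```

In the paper's setup the resulting prompt vectors would be fed as prefix embeddings to a pre-trained language model that stays frozen; the dropout ensures that at test time the translator can describe any single layer or spatial location in isolation.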

Related research:

- NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks (03/09/2022): Natural language explanation (NLE) models aim at explaining the decision...
- Leveraging Multiple Descriptive Features for Robust Few-shot Image Learning (07/10/2023): Modern image classification is based upon directly predicting model clas...
- Learned Visual Features to Textual Explanations (09/01/2023): Interpreting the learned features of vision models has posed a longstand...
- Natural Language Descriptions of Deep Visual Features (01/26/2022): Some neurons in deep networks specialize in recognizing highly specific ...
- ESPRIT: Explaining Solutions to Physical Reasoning Tasks (05/02/2020): Neural networks lack the ability to reason about qualitative physics and...
- VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking (04/13/2023): The lack of interpretability of the Vision Transformer may hinder its us...
- Phrase-based Image Captioning (02/12/2015): Generating a novel textual description of an image is an interesting pro...
