Finding Structural Knowledge in Multimodal-BERT

by Victor Milewski et al.

In this work, we investigate the knowledge learned in the embeddings of multimodal-BERT models. More specifically, we probe their capability to store the grammatical structure of linguistic data and the structure learned over objects in visual data. To reach that goal, we first make the inherent structure of language and visuals explicit, via a dependency parse of the sentences that describe the image and via the dependencies between the object regions in the image, respectively. We call this explicit visual structure the scene tree, which is based on the dependency tree of the language description. Extensive probing experiments show that the multimodal-BERT models do not encode these scene trees. Code available at <>.
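The probing approach described above is commonly realized as a structural distance probe: a linear map is trained so that squared L2 distances between projected token embeddings approximate pairwise distances in the gold dependency tree. The sketch below is a minimal, self-contained illustration of that idea on toy data; the embeddings, dimensions, and the chain-shaped "gold" tree are all illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Minimal sketch of a structural (distance) probe: learn a linear map B
# such that ||B(h_i - h_j)||^2 approximates the gold tree distance
# between tokens i and j. Toy embeddings stand in for BERT outputs.
rng = np.random.default_rng(0)

def probe_distances(H, B):
    """Pairwise squared distances ||B(h_i - h_j)||^2 for embeddings H (n, d)."""
    P = H @ B.T                               # project into probe space (n, k)
    diff = P[:, None, :] - P[None, :, :]      # (n, n, k)
    return (diff ** 2).sum(-1)                # (n, n)

n, d, k = 5, 16, 8                            # tokens, embedding dim, probe rank
H = rng.normal(size=(n, d))                   # toy "contextual embeddings"
# Gold pairwise tree distances for a chain-shaped dependency tree (assumption)
gold = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]).astype(float)

B = rng.normal(scale=0.1, size=(k, d))
init_loss = np.abs(probe_distances(H, B) - gold).mean()

lr = 1e-3
for _ in range(1500):                         # plain gradient descent on squared error
    P = H @ B.T
    diff = P[:, None, :] - P[None, :, :]
    err = probe_distances(H, B) - gold        # (n, n)
    # analytic gradient of sum_{i,j} err_ij^2 w.r.t. the projections P
    gradP = 4 * (err[:, :, None] * diff).sum(1)   # (n, k)
    B -= lr * (gradP.T @ H)                       # chain rule back to B

final_loss = np.abs(probe_distances(H, B) - gold).mean()
```

If the embeddings encode the tree, the probe's distances converge toward the gold distances; for the visual modality, the same recipe applies with object-region embeddings and scene-tree distances in place of tokens and dependency distances.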
