Finding Structural Knowledge in Multimodal-BERT

by Victor Milewski, et al.

In this work, we investigate the knowledge learned in the embeddings of multimodal-BERT models. More specifically, we probe their capabilities of storing the grammatical structure of linguistic data and the structure learned over objects in visual data. To reach that goal, we first make the inherent structure of language and visuals explicit by a dependency parse of the sentences that describe the image and by the dependencies between the object regions in the image, respectively. We call this explicit visual structure the scene tree, which is based on the dependency tree of the language description. Extensive probing experiments show that the multimodal-BERT models do not encode these scene trees. Code available at <>.



1 Introduction

In recent years, contextualized embeddings have become increasingly important. Embeddings created by the BERT model and its variants have been used to get state-of-the-art performance in many tasks (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019; Radford and Narasimhan, 2018; Radford et al., 2019; Brown et al., 2020). Several multimodal-BERT models have been developed that learn multimodal contextual embeddings through training jointly on linguistic data and visual data (Lu et al., 2019; Su et al., 2019; Li et al., 2019; Chen et al., 2020). They achieve state-of-the-art results across many tasks and benchmarks, such as Visual Question Answering (Goyal et al., 2017), image and text retrieval (Lin et al., 2014), and Visual Commonsense Reasoning (Suhr et al., 2019).[1]

[1] From here on we refer to the text-only BERT models as 'BERT' and the multimodal-BERT models as 'multimodal-BERTs'.

BERT and multimodal-BERTs are blackbox models that are not easily interpretable. It is not trivial to know what knowledge is encoded in the models and their embeddings. A common method for getting insight into the embeddings of both textual and visual content is probing.

Language utterances have an inherent grammatical structure that contributes to their meaning. Natural images have a characteristic spatial structure that likewise allows humans to interpret their meaning. In this paper we hypothesize that the textual and visual embeddings learned from images that are paired with their descriptions encode structural knowledge of both the language and the visual data. Our goal is to reveal this structural knowledge with the use of probing. More specifically, in order to perform this probing, we first make the inherent structure of language and visuals explicit through a mapping between a dependency parse of the sentences that describe the image and the dependencies between the object regions in the image. Because the language truthfully describes the image, and inspired by Draschkow and Võ (2017), we define a visual structure that correlates with the dependency tree structure and that arranges object regions in the image in a tree structure. We call this visual dependency tree the scene tree. An example of this mapping to the scene tree is visualized in Figure 1.

Figure 1: Example of the mapping from the linguistic dependency tree to the visual tree. The borders of the regions in the image have the same color as the phrase they are attached to. The rows below the image are the textual tree depth (in black), the visual tree depth (in red), the phrase index, and the words in the sentence.

The aligned dependency tree and scene tree allow us to conduct a large set of experiments aimed at discovering encoded structures in neural representations obtained from multimodal-BERTs. By making use of the structural probes proposed by Hewitt and Manning (2019), we compare the dependency trees learned by models with or without provided image features. Furthermore, we investigate if scene trees are learned in the object region embeddings.

Research Questions

In this study, we aim to answer the following research questions.

  • RQ 1: Do the textual embeddings trained with a multimodal-BERT retain their structural knowledge?
    Sub-RQ 1.1: To what extent does the joint training in a multimodal-BERT influence the structures learned in the textual embeddings?

  • RQ 2: Do the visual embeddings trained with a multimodal-BERT learn to encode a scene tree?

In a broader framework this study might contribute to better representation learning inspired by how humans acquire language in a perceptual context. It stimulates the learning of representations that are compositional in nature and are jointly influenced by the structure of language and the corresponding structure of objects in visuals.

2 Related Work

Probing studies

Several studies have been performed that aim at analyzing BERT and multimodal-BERTs. For BERT, probes are designed that explore gender bias (Bhardwaj et al., 2021), relational knowledge (Wallat et al., 2020), linguistic knowledge for downstream tasks (Liu et al., 2019), part-of-speech knowledge (Hewitt and Liang, 2019; Hewitt et al., 2021), and for sentence and dependency structures (Tenney et al., 2019; Hewitt and Manning, 2019). These studies have shown that BERT latently learns to encode linguistic structures in its textual embeddings. Basaj et al. (2021) made a first attempt at converting the probes to the visual modality and evaluated the information stored in the features created by visual models trained with self-supervision.

For multimodal-BERTs, one study by Parcalabescu et al. (2021) investigates how well these models learn to count objects in images and how well they generalize to new quantities. They found that the multimodal-BERTs overfit the dataset bias and fail to generalize to out-of-distribution quantities. Frank et al. (2021) found that, in multimodal models, visual information is used much more for textual tasks than textual information is used for visual tasks. These findings suggest that more research is needed into other capabilities of, and knowledge in, multimodal-BERT embeddings. We build on this line of work but aim to discover structures encoded in the textual and visual embeddings learned with multimodal-BERTs. This is a first step towards finding an aligned structure between text and images. Future work could exploit this to make textual information more useful for visual tasks.

Structures in visual data

There is considerable research interest in identifying structural properties of images, e.g., the scene graph annotation of the Visual Genome dataset (Krishna et al., 2016). In the field of psychology, research on scene grammars (Draschkow and Võ, 2017) provides evidence that humans assign certain grammatical structures to the visual world. Furthermore, some studies investigate the grounding of textual structures in images, such as syntax learners (Shi et al., 2019) and visually grounded grammar inducers (Zhao and Titov, 2020). Here the complete image is used, without considering object regions and their composing structure, to aid in predicting linguistic structures.

Closer to our work, Elliott and Keller (2013) introduced visual dependency relations (VDR), where spatial relations are created between objects in the image. The VDR can also be created by locating the object and subject in a caption and matching them with object annotations in the image (Elliott and de Vries, 2015). Our scene tree differs, since it makes use of the entire dependency tree of the caption to create the visual structure.

3 Background


Many variations of the BERT model implement a transformer architecture to process both visual and linguistic data, e.g., images and sentences. These multimodal-BERTs can be categorized into two groups: single-stream and dual-stream encoders. In the former, a regular BERT architecture processes the concatenated input of the textual description and the image through a transformer stack. This allows for an "unconstrained fusion of cross-modal features" (Bugliarello et al., 2021). Some examples of these models are VL-BERT (Su et al., 2019), VisualBERT (Li et al., 2019), and UNITER (Chen et al., 2020).

In the dual-stream models, the visual and linguistic features are first processed separately by different transformer stacks, followed by several transformer layers with alternating intra-modal and inter-modal interactions. For the inter-modal interactions, the query-key-value matrices modeling the multi-head self-attention are computed, and then the key-value matrices are exchanged between the modalities. This limits the interactions between the modalities but increases the expressive power with separate parameters. Examples of such dual-stream models are ViLBERT (Lu et al., 2019), LXMERT (Tan and Bansal, 2019), and ERNIE-ViL (Yu et al., 2021).[2]

[2] The ERNIE-ViL model is trained with scene graphs of the Visual Genome dataset. We do not probe this model as there is an overlap between the training data of ERNIE-ViL and our evaluation data.

4 Method

4.1 Tree Structures

In the probing experiments we assume that the structural knowledge of a sentence is made explicit by its dependency tree structure and that likewise the structural knowledge of an image is represented by a tree featuring the dependencies between object regions. Further, we assume that the nodes of a tree (words in the dependency tree of the sentence, phrase labels in the region dependency tree of the image) are represented as embeddings obtained from a layer in BERT or in a multimodal-BERT.

To generate the depth and distance values from the tree, we use properties of the embedding representation space (Mikolov et al., 2013). For example, similar types of relations between embeddings have a similar distance between them, such as countries and their capital cities. The properties we use are that the length (the norm) of a vector describes the depth in a tree, and that the distance between nodes can be translated into the distance between vectors.

Generating distance values

For the distance labels, a matrix $D \in \mathbb{N}^{n \times n}$ is required, with each entry $D_{ij}$ describing the distance between nodes $i$ and $j$, where $n$ is the number of nodes in the tree. To fill the matrix, we iterate over all possible pairs of nodes. For nodes $i$ and $j$, $D_{ij}$ is computed by starting at node $i$ in the tree and traversing it until node $j$ is reached, while ensuring a minimum distance. This is achieved by using the breadth-first search algorithm.
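The breadth-first computation of the pairwise distance labels can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `tree_distances` and the parent-array encoding of the tree (`parents[i]` is the parent of node `i`, with `-1` for the root) are assumptions made for this example.

```python
from collections import deque


def tree_distances(parents):
    """Return the n x n matrix of path lengths between all pairs of tree nodes.

    parents[i] is the index of the parent of node i, or -1 for the root
    (a hypothetical encoding chosen for this sketch).
    """
    n = len(parents)
    # Build an undirected adjacency list from the parent pointers.
    adjacency = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            adjacency[child].append(parent)
            adjacency[parent].append(child)

    distances = [[0] * n for _ in range(n)]
    for start in range(n):
        # Breadth-first search reaches every node along a shortest path.
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, dist = queue.popleft()
            distances[start][node] = dist
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, dist + 1))
    return distances
```

In a tree there is exactly one path between any two nodes, so the breadth-first search simply recovers the length of that path.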

Generating depth values

For the depth labels, we generate a vector $d \in \mathbb{N}^{n}$, with $n$ the number of nodes in the tree. There is a single node that is the root of the tree, to which we assign a depth of zero. The depth increases by one at every level below.
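A minimal sketch of the depth labels, under the same assumed parent-array encoding (`-1` marks the root). The function name `tree_depths` is hypothetical.

```python
def tree_depths(parents):
    """Return the depth of every node; the root (parent -1) has depth 0."""
    n = len(parents)
    depths = [None] * n
    for node in range(n):
        # Walk upwards until we hit the root or a node whose depth is known.
        path = []
        current = node
        while depths[current] is None and parents[current] >= 0:
            path.append(current)
            current = parents[current]
        if depths[current] is None:
            depths[current] = 0  # current is the root
        base = depths[current]
        # Assign depths back down the path that was just walked.
        for offset, visited in enumerate(reversed(path), start=1):
            depths[visited] = base + offset
    return depths
```

Caching already-computed depths keeps the total work linear in the number of nodes.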

4.2 Constructing the Trees

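Require:  Language dependency tree $T_t = (V_t, E_t)$, with $V_t$ the set of text IDs for words in a sentence and $E_t$ the set of edges such that each $(v_k, v_l) \in E_t$, where $v_l$ is a child node of $v_k$
Require:  Set of phrases $P$, each $p \in P$ describes one or more regions and covers multiple words
Require:  Image $I$
Ensure:  Scene tree $T_s = (V_s, E_s)$
1:  $V_s \leftarrow \emptyset$, set of Nodes in Scene Tree
2:  $E_s \leftarrow \emptyset$, set of Edges in Scene Tree
3:  $V_s \leftarrow V_s \cup \{I\}$, set Image as root node
4:  $d_I \leftarrow 0$, set root node depth as 0
8:  for $p \in P$ do
12:  for $p \in P$ ordered by depth $d_p$ do
14:     while True do
18:        if the current ancestor maps to a phrase in $V_s$ then
21:           break while loop
22:        else
24:  return $T_s$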
Algorithm 1

Language dependency tree

We use the dependency tree as the linguistic structure. The tree annotations follow the Stanford dependency guidelines (De Marneffe and Manning, 2008). They can either be provided as gold-standard annotations in the dataset, or generated using the spaCy dependency parser (Honnibal et al., 2020).

Scene tree

Draschkow and Võ (2017) found that there are commonalities between words in language and objects in scenes, which allows the construction of a scene grammar. Furthermore, Zhao and Titov (2020) have shown that an image provides clues that improve grammar induction. In line with these works, we want a visual structure that aligns with a linguistic representation like the dependency tree.

As visual structure, a scene graph could be used for the relations between regions (Krishna et al., 2016). However, the unconstrained graph is difficult to align with the dependency tree. Therefore, we propose a novel visual structure, the scene tree, that is created by mapping a textual dependency tree to the object regions of an image. An example of such a mapping for an image-sentence pair is given in Figure 1. This process requires a tree for the sentence and paired data for images and sentences.

Each node in the scene tree directly matches one or more visual regions. The node description is a phrase that covers multiple words in the sentence (or nodes in the dependency tree). The output of this method is a tree that contains the phrase trees that directly correspond to the regions. The algorithm is completely described as pseudo-code in Algorithm 1.

The algorithm starts by initializing the scene tree. We set the full image as the root node. For each phrase that describes an image region, we select the dependency tree node (or word with a text ID) that is closest to the root and assign it a phrase ID. This creates a mapping from the phrases (Phrase IDs) to the dependency tree nodes (Text IDs), and its reverse. We assign each phrase an initial depth, based on the word it maps to in the dependency tree. On line 12, the loop over the phrases that describe the object regions starts, to find the direct parent for each phrase so it can be added to the new scene tree. For each phrase, we select the matching dependency tree node from the phrase-to-text mapping. From this node we follow the chain of parent nodes, until an ancestor is found that points back (using the reverse mapping) to a phrase that is already a member of the scene tree. The current phrase is added to the tree as a child of that phrase. The completed tree of phrases is our scene tree.
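The procedure described above can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: the name `build_scene_tree`, the parent-array encoding of the dependency tree, the `"image"` root label, and the fallback of attaching a phrase to the image root when no ancestor maps to a phrase are all hypothetical choices.

```python
def build_scene_tree(parents, phrases):
    """Map a dependency tree onto phrases and return each phrase's parent.

    parents[t]  : parent text ID of word t, -1 for the sentence root (assumed encoding)
    phrases     : dict mapping a phrase ID to the list of text IDs it covers
    Returns a dict mapping every scene-tree node to its parent node.
    """
    def depth(text_id):
        steps = 0
        while parents[text_id] >= 0:
            text_id = parents[text_id]
            steps += 1
        return steps

    root = "image"  # the full image is the root of the scene tree
    phrase_to_text, text_to_phrase, phrase_depth = {}, {}, {}
    for phrase_id, text_ids in phrases.items():
        # Represent the phrase by its word closest to the dependency root.
        head = min(text_ids, key=depth)
        phrase_to_text[phrase_id] = head
        text_to_phrase[head] = phrase_id
        phrase_depth[phrase_id] = depth(head)

    scene_parent = {root: None}
    # Shallow phrases first, so ancestors are in the tree before descendants.
    for phrase_id in sorted(phrases, key=phrase_depth.get):
        node = parents[phrase_to_text[phrase_id]]
        while True:
            if node < 0:
                # Assumption: no phrase found above, attach to the image.
                scene_parent[phrase_id] = root
                break
            owner = text_to_phrase.get(node)
            if owner is not None and owner in scene_parent:
                scene_parent[phrase_id] = owner
                break
            node = parents[node]  # keep following the chain of parents
    return scene_parent
```

For the caption "A man in a hat" with Stanford-style parents `[1, -1, 1, 4, 2]` and phrases `{"p1": [0, 1], "p2": [3, 4]}`, the phrase "a hat" is attached below "A man", which in turn hangs off the image root.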

4.3 Embeddings

Textual embeddings

For each sentence $l$, every word becomes a node in the tree, such that we have a sequence of $n$ nodes. To obtain the textual embeddings $h^l_{1:n}$, we apply wordpiece tokenization (Wu et al., 2016) and pass the sentence into BERT. Depending on the requested layer, we take the output of that BERT layer as the embeddings. For nodes with multiple embeddings because of the wordpiece tokenization, we take the average of those embeddings.
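The averaging of wordpiece embeddings into one embedding per word can be sketched as below. The function name `average_wordpieces` and the `word_ids` alignment list (the word index of each wordpiece) are assumptions for this illustration; special tokens are assumed to have been removed beforehand.

```python
def average_wordpieces(piece_embeddings, word_ids):
    """Average the wordpiece vectors that belong to the same word.

    piece_embeddings : list of equal-length vectors, one per wordpiece
    word_ids         : word_ids[i] is the index of the word wordpiece i belongs to
                       (assumed contiguous, starting at 0)
    """
    n_words = max(word_ids) + 1
    dim = len(piece_embeddings[0])
    sums = [[0.0] * dim for _ in range(n_words)]
    counts = [0] * n_words
    for vector, word in zip(piece_embeddings, word_ids):
        counts[word] += 1
        for k, value in enumerate(vector):
            sums[word][k] += value
    # One averaged vector per word, in sentence order.
    return [[total / counts[w] for total in sums[w]] for w in range(n_words)]
```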

To obtain the textual embeddings for a multimodal-BERT, we use the same process but also provide visual features. When an image is present, we enter the visual features (as described in the next paragraph); otherwise, a single masked all-zero feature is entered.

Visual embeddings

For a sentence $l$ with image $I$, the sequence of nodes consists of the object regions plus the full image. The visual embeddings are obtained by passing the raw Faster R-CNN features (Ren et al., 2015) into the multimodal-BERT. Depending on the requested layer, we take the output of that multimodal-BERT layer as the embeddings.

4.4 Structural Probes

Here we briefly describe the structural probes as defined by Hewitt and Manning (2019). Although they were originally designed for text, we use these probes to map from an embedding space (either textual embeddings or visual embeddings) to depth or distance values as defined in Section 4.1.

Distance probe

Given a sequence of $n$ nodes (words or objects) and their embeddings $h^l_{1:n}$ with each $h^l_i \in \mathbb{R}^m$, where $l$ identifies the sequence and $m$ the embedding size, we predict a matrix of $n \times n$ distances. First, we define a linear transformation $B \in \mathbb{R}^{k \times m}$ with $k$ the probe rank, such that $B^T B$ is a positive semi-definite, symmetric matrix. By first transforming a vector $h$ with matrix $B$, we get its squared norm like this: $(Bh)^T (Bh)$. To get the squared distance between two nodes $i$ and $j$ in sequence $l$, we compute the difference between node embeddings $h^l_i$ and $h^l_j$ and take the squared norm following equation 1:

$$D_{ij} = \left(B(h^l_i - h^l_j)\right)^T \left(B(h^l_i - h^l_j)\right) \quad (1)$$

The only parameters of the distance probe are the entries of the transformation matrix $B$, which can easily be implemented as a fully connected linear layer. Identical to the work by Hewitt and Manning (2019), the probe is trained through stochastic gradient descent.
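The forward computation of the distance probe can be sketched as below, in plain Python so it stays dependency-free. The function names are hypothetical, and the matrix `B` is passed as a list of rows; in practice it would be a learned linear layer. The `l1_loss` helper reflects the L1 objective of Hewitt and Manning (2019), normalised by the squared sequence length.

```python
def project(B, h):
    """Apply the linear transformation B (list of rows) to vector h."""
    return [sum(b * x for b, x in zip(row, h)) for row in B]


def squared_distance(B, h_i, h_j):
    """Squared distance (B(h_i - h_j))^T (B(h_i - h_j))."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    return sum(v * v for v in project(B, diff))


def distance_matrix(B, embeddings):
    """Predicted n x n matrix of squared distances for one sequence."""
    n = len(embeddings)
    return [[squared_distance(B, embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]


def l1_loss(predicted, gold):
    """Mean absolute difference between predicted and gold distances."""
    n = len(gold)
    total = sum(abs(predicted[i][j] - gold[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)
```

With a probe rank smaller than the embedding size, `B` has fewer rows than columns, so the probe can only read structure from a low-dimensional subspace of the embeddings.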

Depth probe

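For the depth probe, we transform the embedding of each node to its norm, so we can construct the vector $d \in \mathbb{R}^{n}$ of predicted depths for the $n$ nodes. This imposes a total order on the elements and results in the depths. With $B$ the linear transformation of the probe and $h^l_i$ the embedding of node $i$ in sequence $l$, we compute the squared vector norm with the following equation:

$$d_i = \|h^l_i\|_B^2 = (B h^l_i)^T (B h^l_i) \quad (2)$$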
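A matching sketch for the depth probe, again in plain Python with hypothetical function names. The node with the smallest predicted squared norm is taken as the predicted root.

```python
def squared_norm(B, h):
    """Squared norm (B h)^T (B h) of vector h after transformation B."""
    projected = [sum(b * x for b, x in zip(row, h)) for row in B]
    return sum(v * v for v in projected)


def predicted_depths(B, embeddings):
    """One predicted (squared) depth per node in the sequence."""
    return [squared_norm(B, h) for h in embeddings]


def depth_order(depths):
    """Node indices sorted from shallowest to deepest prediction."""
    return sorted(range(len(depths)), key=depths.__getitem__)
```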


5 Experimental Setup

5.1 Data

By using a text-only dataset, we can test how the textual embeddings of the multimodal-BERTs perform compared to the BERT model, without the interference from the visual embeddings. This allows us to see how much information the multimodal-BERTs encode in the visual embeddings.

Therefore, we use the Penn Treebank (PTB3) (Marcus et al., 1999). It is commonly used for dependency parsing (also by Hewitt and Manning (2019), from whom we borrow the probes) and consists of gold-standard dependency tree annotations according to the Stanford dependency guidelines (De Marneffe and Manning, 2008). We use the default training/validation/testing split of the Wall Street Journal sentences, that is, sections 2-21 for training, 22 for validation, and 23 for testing. This provides us with 39.8k/1.7k/2.4k sentences for the splits, respectively.

The second dataset is the Flickr30k dataset (Young et al., 2014), which consists of multimodal image captioning data. It has five caption annotations for each of the 30k images. An additional benefit of this dataset is the existing extensions, specifically Flickr30k-Entities (F30E) (Plummer et al., 2015). In F30E all the phrases in the captions are annotated and matched with region annotations in the image. This paired dataset is used to create the scene trees proposed in Section 4.2.

The Flickr30k dataset does not provide gold-standard dependency trees. Therefore, the transformer-based spaCy dependency parser (Honnibal et al., 2020) is used to generate silver-standard dependency trees according to the Stanford dependency guidelines (De Marneffe and Manning, 2008). The dataset consists of 30k images, with (mostly) 5 captions each, resulting in 148.9k/5k/5k sentences for the training/validation/testing splits, respectively.

5.2 Models

We use two different multimodal-BERTs, one single-stream and one dual-stream model. As implementation for the multimodal-BERTs, we make use of the Volta library (Bugliarello et al., 2021). Here, all the models are implemented and trained under a controlled and unified setup with regard to hyperparameters and training data. Based on the performance under this unified setup on the Flickr30k image-sentence matching task, we have chosen the best performing models: UNITER (Chen et al., 2020) as single-stream model and ViLBERT (Lu et al., 2019) as dual-stream model.

When probing the textual embeddings, we also use a text-only BERT-base model (from here on referred to as BERT) (Devlin et al., 2019). Hewitt and Manning (2019) use the same model, allowing for easy comparability. The implementation used is from the HuggingFace Transformer library (Wolf et al., 2020).


For our setup and metrics, we follow the setup from Hewitt and Manning (2019). The batch size is set to 32 and we train for a maximum of 40 epochs. Early stopping is used to terminate training after no improvement on the validation L1-loss for 5 epochs.
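The stopping rule can be sketched as a small training loop. The callables `train_epoch` and `validate` are hypothetical placeholders for one pass over the training data and for computing the validation L1-loss; only the limits of 40 epochs and a patience of 5 come from the text.

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=40, patience=5):
    """Train until the validation loss has not improved for `patience` epochs.

    train_epoch : callable running one training epoch (hypothetical placeholder)
    validate    : callable returning the current validation loss
    Returns the index of the best epoch and its validation loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_epoch()
        loss = validate()
        if loss < best_loss:
            # New best model; a real setup would checkpoint the probe here.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss
```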

5.3 Metrics

Figure 2: Comparison for the depth probe on the PTB3 test set, with textual embeddings. Panels: (a) BERT.
Figure 3: Comparison for the distance probe on the PTB3 test set, with textual embeddings. Panels: (a) BERT.
Figure 4: Comparison for the depth probe on the Flickr30k test set, with textual embeddings. Panels: (a) BERT.
Figure 5: Comparison for the distance probe on the Flickr30k test set, with textual embeddings. Panels: (a) BERT.
Figure 6: Ablation comparison for the depth probe on the Flickr30k test set while just providing textual embeddings to the multimodal-BERTs. Panels: (a) BERT, (b) UNITER - only text, (c) ViLBERT - only text.
Figure 7: Ablation comparison for the distance probe on the Flickr30k test set while just providing textual embeddings to the multimodal-BERTs. Panels: (a) BERT, (b) UNITER - only text, (c) ViLBERT - only text.

The main metric used for both the distance and the depth probes is the Spearman rank correlation coefficient. This indicates whether the predicted depth vector of the nodes, or the predicted distance matrix of the nodes, correlates with the gold-standard (or silver-standard) depths and distances generated according to the method in Section 4.1. The Spearman correlation is computed for each sequence length separately. We take the average over the scores of the lengths between 5 and 50 and call this the Distance Spearman (DSpr.) for the distance probe and the Norm Spearman (NSpr.) for the depth probe.[3]

[3] Just as done by Hewitt and Manning (2019).
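The length-wise averaging can be sketched for the depth probe (NSpr.) as below. This is a stdlib-only illustration with hypothetical function names; it assumes that no sequence has constant depths (which would make the correlation undefined). The distance variant would apply the same averaging to correlations computed over rows of the distance matrix.

```python
from collections import defaultdict


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def norm_spearman(examples, min_len=5, max_len=50):
    """Average per-length mean correlations over lengths min_len..max_len.

    examples: iterable of (predicted_depths, gold_depths) pairs, one per sequence.
    """
    by_length = defaultdict(list)
    for predicted, gold in examples:
        by_length[len(gold)].append(spearman(predicted, gold))
    means = [sum(s) / len(s) for n, s in by_length.items() if min_len <= n <= max_len]
    return sum(means) / len(means)
```

Averaging per length first prevents the many short sequences in a corpus from dominating the score.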

For the depth probes, we also use the root accuracy (root_acc). This computes the accuracy of predicting the root of the sequence. This metric is only applicable for the textual embeddings, due to our method of generating the visual tree, where the root is always the full image at the start of the sequence.
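Root accuracy can be sketched as follows, assuming the predicted root is the node with the smallest predicted depth and the gold root is the node with depth zero. The function name `root_accuracy` is hypothetical.

```python
def root_accuracy(predicted_depths, gold_depths):
    """Fraction of sequences whose shallowest predicted node is the gold root.

    predicted_depths : list of per-sequence lists of predicted depths
    gold_depths      : list of per-sequence lists of gold depths (root has depth 0)
    """
    correct = 0
    for predicted, gold in zip(predicted_depths, gold_depths):
        predicted_root = min(range(len(predicted)), key=predicted.__getitem__)
        gold_root = gold.index(0)
        correct += int(predicted_root == gold_root)
    return correct / len(gold_depths)
```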

For the distance probe, we make use of the undirected unlabelled attachment score (UUAS). This directly tests how accurate the predicted tree is compared to the ground-truth (or silver) tree by computing the accuracy of predicted connections between nodes in the tree. It does not consider the label for the connection or the direction of the connection (Jurafsky and Martin, 2021).
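A per-sequence UUAS can be sketched by extracting a minimum spanning tree from the predicted distances and comparing its undirected edges with the gold tree. This is an illustration with hypothetical function names; it uses Prim's algorithm, assumes the parent-array encoding with `-1` for the root, and does not apply any punctuation filtering.

```python
def minimum_spanning_tree(distances):
    """Undirected edges of a minimum spanning tree (Prim's algorithm)."""
    n = len(distances)
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        # Cheapest edge that connects the current tree to a new node.
        _, i, j = min(
            (distances[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        in_tree.add(j)
        edges.add(frozenset((i, j)))
    return edges


def uuas(predicted_distances, gold_parents):
    """Share of gold edges recovered, ignoring edge labels and direction.

    Assumes the sequence has at least two nodes, so there is a gold edge.
    """
    gold_edges = {
        frozenset((child, parent))
        for child, parent in enumerate(gold_parents)
        if parent >= 0
    }
    predicted_edges = minimum_spanning_tree(predicted_distances)
    return len(gold_edges & predicted_edges) / len(gold_edges)
```

Representing each edge as a `frozenset` makes the comparison independent of edge direction, which is what the "undirected" in UUAS requires.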

Baseline comparisons

We design one baseline for the textual data and two for the visual data. For the textual baseline, we use the initial word piece textual embeddings (from either BERT or a multimodal-BERT) before inserting them into the transformer stack. We simply refer to it as baseline.

The first visual baseline implements the raw Faster R-CNN features (Ren et al., 2015) of each object region. However, they have a larger dimension than the BERT embeddings. We refer to it as R-CNN baseline. The second baseline uses the visual embeddings before they are fed to the transformer stack. This is a mapping from the Faster R-CNN features to the BERT embedding size. We refer to it as baseline.

5.4 Hypotheses

First, we want to determine the probe rank of the linear transformation used on the textual or the visual embeddings. Based on results by Hewitt and Manning (2019), we set the probe rank for BERT to 128. We run a comparison with several probe ranks on UNITER and ViLBERT to find the optimal setting for the textual and visual embeddings. The results are shown and discussed in Appendix A. We use a rank of 128 for all our following experiments.

RQ 1

The multimodal-BERT models are pre-trained on language data. We assume that the resulting embeddings integrate structural grammatical knowledge and hypothesize that this knowledge will not be forgotten during multimodal training.

To determine if training on multimodal data affects the quality of predicting the dependency tree when trained solely with textual data, we train the probes with BERT and both multimodal-BERTs and evaluate on the PTB3 dataset (Marcus et al., 1999).

Sub-RQ 1.1

We expect that more interaction between the regions and the text will have a stronger impact. Some dependency attachments that are hard to predict might require visual knowledge. Next to the effect on the linguistic knowledge, we also want to discover if the multimodal data helps the multimodal-BERTs in learning structural knowledge. We run the probes on the Flickr30k dataset (Young et al., 2014) with the textual embeddings for all our models. Furthermore, we compare these scores to those obtained on the PTB3 dataset (Marcus et al., 1999).

RQ 2

The Multimodal-BERTs learn highly contextualized embeddings. Therefore, we hypothesize that a model should be able to discover important interactions between object regions in the image. To see if the model has learned to encode the scene tree in the visual region embeddings, we run the probes on the Flickr30k dataset (Young et al., 2014) with the visual embeddings. Furthermore, to see if the scene tree is learned mainly through joint interaction with the textual embeddings, we compare the scores between the single-stream model UNITER (with many cross-modal interactions) and the dual-stream model ViLBERT (with limited cross-modal interactions).

6 Results and Discussion

This discussion is based on the results from the test split. The results on the validation split (see Appendix B) lead to the same observations.

RQ 1: Do the textual embeddings trained with a multimodal-BERT retain their structural knowledge?

To answer RQ 1, we report the results for both structural probes on the PTB3 dataset. Here we only use the textual embeddings, since no visual features are available. The results for the depth probe are in Figure 2, and for the distance probe in Figure 3.

The results of both multimodal-BERTs (Figures 2(c) and 3(c) for ViLBERT and Figures 2(b) and 3(b) for UNITER) are very comparable in terms of NSpr. and root accuracy, showing similar curves and scores. For both, the seventh layer is the best performing one. The shape of the curves across the layers is similar to those for the BERT model in Figures 2(a) and 3(a). However, the scores of the multimodal-BERTs drop significantly. While the multimodal-BERTs were initialized with weights from BERT, they were trained longer on additional multimodal data with a different multimodal objective. This shows that the multimodal training hampers the storing of grammatical structural knowledge in the resulting embeddings.

Sub-RQ 1.1: To what extent does the joint training in a multimodal-BERT influence the structures learned in the textual embeddings?

For this experiment, we compare the effect of having visual features present when using the structural probes on the textual embeddings. We run the probes on Flickr30k. The results for the depth probe are in Figure 4, and for the distance probe in Figure 5.

First, we see that for all models (BERT and multimodal-BERTs) the scores increase compared to the results on the PTB3 dataset (see discussion of RQ 1), but still follow a similar trend across the layers. The increase is most likely due to the complexity of the sentences and language of the PTB3 dataset; the captions are simpler. For ViLBERT, there is a drop in performance for the earlier layers. We believe this is caused by the early stopping method firing early with these settings. Another explanation is that it is more difficult for the dual-stream model to use the additional parameters.

BERT outperforms the multimodal-BERTs on PTB3; however, this is not the case on Flickr30k. For the depth probe (Figure 4) and the UUAS metric on the distance probe (Figure 5), the results obtained on these two datasets are almost equal. This can be due to the additional pretraining of the multimodal-BERTs on similar captioning sentences. Another explanation is that, during such pretraining, the models learned to store relevant information in the visual embeddings.

We run an additional experiment where we use the pretrained multimodal-BERT, but while probing we only provide the sentence to the model, and mask out the image. The results for the depth probe are in Figure 6, and for the distance probe in Figure 7. Here we can see that the results are almost identical to when we provide the model with the visual embeddings. This indicates that the model does not have any benefit from the visual data when predicting the structures for textual embeddings, and it seems that the model uses the extra parameters of the vision layers to store knowledge about the text.

RQ 2: Do the visual embeddings trained with a multimodal-BERT learn to encode a scene tree?

We aim to find the layer with the most structural knowledge learned when applied to multimodal data. The results are reported per layer for the depth probe and for the distance probe on the visual embeddings.

Regarding the results for the depth probe, the scores between layers fluctuate inconsistently. The scores do improve slightly over the baselines, indicating that the multimodal-BERT encodes some knowledge of depth in the layers.

With regard to the distance probe, the trend in the curves across the layers indicates that this is a type of knowledge that can be learned for the regions. The multimodal-BERTs seem to disregard scene trees. There is a strong downward trend across the layers. Furthermore, all the scores are much lower than the baseline and the R-CNN baseline scores. This lack of learning of the scene tree can be caused by the chosen training objective of the multimodal-BERTs. These objectives require an abstract type of information, where only basic features are needed to predict the masked items.

For the distance probe, there is a noticeable difference between the single-stream model (UNITER) and the dual-stream model (ViLBERT), where single-stream models benefit from the multimodal interactions to retain structural knowledge. For UNITER, the scores in the first layers are very close to the baseline, showing that the single-stream interaction benefits the memorizing of the scene tree structure.

7 Conclusion and Future Work

We made a first attempt at investigating whether the current Multimodal-BERT models encode structural grammatical knowledge in their textual embeddings, in a similar way as text-only BERT models encode this knowledge. Furthermore, we were the first to investigate the existence of encoded structural compositional knowledge of the object regions in image embeddings. For this purpose, we created a novel scene tree structure that is mapped from the textual dependency tree of the paired caption. We discovered that the multimodal-BERTs encode less structural grammatical knowledge than BERT. However, with image features present, it is still possible to achieve similar results. The cause for this requires more research.

While tree depths from the scene tree are not natively present in the features, we found that this could be a potential method of finding connections and distances between regions, already decently predicted with the Faster R-CNN features. The Multimodal-BERT models are currently trained with an objective that does not enforce the learning or storing of these types of structural information. Hence we assume that the models learn to encode more abstract knowledge in their features.

Our work opens possibilities to further research on scene trees as a joint representation of object compositions in an image and the grammatical structure of its caption. Furthermore, we recommend investigating the training of multimodal-BERTs with objectives that enforce the encoding of structural knowledge.


We would like to thank Desmond Elliott, Djamé Seddah, and Liesbeth Allein for feedback on the paper. Victor Milewski and Marie-Francine Moens were funded by the European Research Council (ERC) Advanced Grant CALCULUS (grant agreement No. 788506). Miryam de Lhoneux was funded by the Swedish Research Council (grant 2020-00437).


  • D. Basaj, W. Oleszkiewicz, I. Sieradzki, M. Górszczak, B. Rychalska, T. Trzcinski, and B. Zielinski (2021) Explaining self-supervised image representations with visual probing. In International Joint Conference on Artificial Intelligence, Cited by: §2.
  • R. Bhardwaj, N. Majumder, and S. Poria (2021) Investigating gender bias in bert. Cognitive Computation, pp. 1–11. Cited by: §2.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901. Cited by: §1.
  • E. Bugliarello, R. Cotterell, N. Okazaki, and D. Elliott (2021) Multimodal pretraining unmasked: a meta-analysis and a unified framework of vision-and-language berts. Transactions of the Association for Computational Linguistics 9, pp. 978–994. Cited by: §3, §5.2.
  • Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020) Uniter: universal image-text representation learning. In European conference on computer vision, pp. 104–120. Cited by: §1, §3, §5.2.
  • M. De Marneffe and C. D. Manning (2008) The stanford typed dependencies representation. In Coling 2008: proceedings of the workshop on cross-framework and cross-domain parser evaluation, pp. 1–8. Cited by: §4.2, §5.1, §5.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1, §5.2.
  • D. Draschkow and M. L. Võ (2017) Scene grammar shapes the way we interact with objects, strengthens memories, and speeds search. Scientific reports 7 (1), pp. 1–12. Cited by: §1, §2, §4.2.
  • D. Elliott and A. P. de Vries (2015) Describing images using inferred visual dependency representations. In ACL, Cited by: §2.
  • D. Elliott and F. Keller (2013) Image description using visual dependency representations. In EMNLP, Cited by: §2.
  • S. Frank, E. Bugliarello, and D. Elliott (2021) Vision-and-language or vision-for-language? on cross-modal influence in multimodal transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, pp. (to appear). External Links: Link Cited by: §2.
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • J. Hewitt, K. Ethayarajh, P. Liang, and C. Manning (2021) Conditional probing: measuring usable information beyond a baseline. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 1626–1639. External Links: Link Cited by: §2.
  • J. Hewitt and P. Liang (2019) Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2733–2743. Cited by: §2.
  • J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4129–4138. External Links: Link, Document Cited by: §1, §2, §4.4, §4.4, §5.1, §5.2, §5.2, §5.4, footnote 3.
  • M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd (2020) spaCy: Industrial-strength Natural Language Processing in Python. Zenodo. External Links: Document, Link Cited by: §4.2, §5.1.
  • D. Jurafsky and J. H. Martin (2021) Speech and language processing (3rd (draft) ed.). Stanford Univ. Cited by: §5.3.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, pp. 32–73. Cited by: §2, §4.2.
  • L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §1, §3.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1.
  • N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith (2019) Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1073–1094. External Links: Link, Document Cited by: §2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, Cited by: §1, §3, §5.2.
  • M. P. Marcus, B. Santorini, M. A. Marcinkiewicz, and A. Taylor (1999) Treebank-3. Linguistic Data Consortium, Philadelphia 14. Cited by: §5.1, §5.4, §5.4.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §4.1.
  • L. Parcalabescu, A. Gatt, A. Frank, and I. Calixto (2021) Seeing past words: testing the cross-modal capabilities of pretrained V&L models on counting tasks. In Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR), Groningen, Netherlands (Online), pp. 32–44. External Links: Link Cited by: §2.
  • B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pp. 2641–2649. Cited by: §5.1.
  • A. Radford and K. Narasimhan (2018) Improving language understanding by generative pre-training. Cited by: §1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28, pp. 91–99. Cited by: §4.3, §5.3.
  • H. Shi, J. Mao, K. Gimpel, and K. Livescu (2019) Visually Grounded Neural Syntax Acquisition. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1842–1861. External Links: Link, Document Cited by: §2.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §1, §3.
  • A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, and Y. Artzi (2019) A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6418–6428. External Links: Link, Document Cited by: §1.
  • H. H. Tan and M. Bansal (2019) LXMERT: learning cross-modality encoder representations from transformers. In EMNLP, Cited by: §3.
  • I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. Van Durme, S. R. Bowman, D. Das, et al. (2019) What do you learn from context? probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316. Cited by: §2.
  • J. Wallat, J. Singh, and A. Anand (2020) BERTnesia: investigating the capture and forgetting of knowledge in BERT. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Online, pp. 174–183. External Links: Link, Document Cited by: §2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Link Cited by: §5.2.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §4.3.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32. Cited by: §1.
  • P. Young, A. Lai, M. Hodosh, and J. Hockenmaier (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, pp. 67–78. Cited by: §5.1, §5.4, §5.4.
  • F. Yu, J. Tang, W. Yin, Y. Sun, H. Tian, H. Wu, and H. Wang (2021) ERNIE-vil: knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 3208–3216. Cited by: §3.
  • Y. Zhao and I. Titov (2020) Visually Grounded Compound PCFGs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 4369–4379. External Links: Link, Document Cited by: §2, §4.2.

Appendix A Tuning Probe Rank

To find the dimensionality needed for the multimodal-BERTs, we compared probes of several ranks. The results for the textual embeddings are in Figures LABEL:fig:rank_flickr_dep and 11. Here we see that the probe rank has no significant effect on the performance of the models. Therefore, we follow the optimal rank found for the BERT model: 128.

The results for the visual embeddings are in Figures LABEL:fig:rank_flickr_visdep and LABEL:fig:rank_flickr_visdist. Here we also see only very small changes. Therefore, we also keep the probe rank at 128 for the visual features.
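To make the role of the probe rank concrete, the following is a minimal sketch, assuming the probes take the form proposed by Hewitt and Manning (2019): a linear map B with `rank` rows is applied to the embeddings, the squared L2 distance between two projected embeddings estimates their tree distance, and the squared L2 norm of one projected embedding estimates its tree depth. The function names, the random initialization, and the hidden size of 768 are illustrative assumptions and are not taken from the released code; training of B is omitted.

```python
import random


def make_probe(rank, dim, seed=0):
    """Return a rank x dim matrix B with small random entries (untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(rank)]


def project(B, h):
    """Compute the matrix-vector product B h for one embedding h."""
    return [sum(b * x for b, x in zip(row, h)) for row in B]


def probe_distance(B, h_i, h_j):
    """Squared L2 distance between two embeddings after projection by B."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    return sum(v * v for v in project(B, diff))


def probe_depth(B, h):
    """Squared L2 norm of a projected embedding, read as a depth estimate."""
    return sum(v * v for v in project(B, h))


# Hypothetical setup: hidden size 768 (assumed) and probe rank 128.
hidden_size, probe_rank = 768, 128
B = make_probe(probe_rank, hidden_size)
rng = random.Random(1)
h_1 = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
h_2 = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
print(probe_distance(B, h_1, h_2), probe_depth(B, h_1))
```

In this sketch the rank only sets the number of rows of B, that is, the dimensionality of the subspace in which distances and depths are measured. Comparing ranks then amounts to training one such matrix per candidate rank and comparing the resulting scores.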

Appendix B Results on Validation Split

This appendix shows the same graphs as for the experiments discussed in Section 6, using the validation set instead of the test set. The graphs created for the test set are very similar to those for the validation set, and the results lead to an identical conclusion. One difference is the performance of the ViLBERT model. On the textual features, the score for the earlier layers is again comparable with the other models. This indicates that the early stopping indeed fired too early.

Furthermore, ViLBERT is less capable of predicting the scene trees, which confirms the hypothesis that inter-modal interaction is needed to learn the structural knowledge that is implicitly present in the image and its captions.

(a) BERT
Figure 14: Comparison for the depth probe on the PTB3 validation set, with textual embeddings.
(a) BERT
Figure 15: Comparison for the distance probe on the PTB3 validation set, with textual embeddings.
(a) BERT
Figure 16: Comparison for the depth probe on the Flickr30k validation set, with textual embeddings.
(a) BERT
Figure 17: Comparison for the distance probe on the Flickr30k validation set, with textual embeddings.
(a) BERT
(b) UNITER - only text
(c) ViLBERT - only text
Figure 18: Ablation comparison for the depth probe on the Flickr30k validation set while just providing textual embeddings to the multimodal-BERTs.
(a) BERT
(b) UNITER - only text
(c) ViLBERT - only text
Figure 19: Ablation comparison for the distance probe on the Flickr30k validation set while just providing textual embeddings to the multimodal-BERTs.