Motivated by pre-trained models such as BERT [devlin2018bert] and ERNIE [sun2019ernie], which have significantly improved performance on many NLP tasks, researchers ([sun2019videobert] [lu2019vilbert] [li2019unicoder] [su2019vl] [zhou2019unified] [qi2020imagebert] [li2019visualbert] [chen2019uniter]) have recognized the importance of pre-training for vision-language tasks, e.g., Visual Question Answering (VQA) [antol2015vqa] and Visual Commonsense Reasoning (VCR) [zellers2019recognition].
Existing vision-language pre-training methods attempt to learn joint representations through visual grounding tasks on large image-text datasets, including Masked Language Modelling based on randomly masked sub-words and Image-Text Matching at the whole image/text level. However, because they randomly mask and predict sub-words, current models do not distinguish common words from words describing the detailed semantics [johnson2015image], e.g., objects ("man", "boat"), attributes of objects ("boat is white"), and relationships between objects ("man standing on boat").
These methods neglect the importance of constructing detailed semantic alignments across vision and language, so the trained models cannot represent well the fine-grained semantics required by some real-world scenes. As shown in Figure 1, the detailed semantics are essential for distinguishing the listed scenes, which differ in objects, attributes, and relationships. Hence, better joint vision-language representations should characterize detailed semantic alignments across the modalities.
Inspired by the knowledge masking strategy of ERNIE 1.0 [sun2019ernie], which aims at learning more structured knowledge by masking phrases and named entities rather than individual sub-words, we propose ERNIE-ViL, which incorporates knowledge from scene graphs [johnson2015image] to construct better and more robust representations for vision-language joint modelling. By constructing Scene Graph Prediction pre-training tasks, ERNIE-ViL puts more emphasis on detailed semantic alignments across vision and language. Concretely, we implement these pre-training tasks by predicting different types of nodes in the scene graph parsed from the sentence. The key insight is that, during the pre-training phase, these tasks force the model to accurately predict the masked scene elements from the observed context of both modalities, so that the model learns the connections across the modalities. Through the Scene Graph Prediction tasks, ERNIE-ViL learns the detailed semantic alignments across the vision and language modalities.
We pre-train ERNIE-ViL on two large, commonly used out-of-domain image-text datasets, Conceptual Captions [sharma2018conceptual] and SBU Captions [ordonez2011im2text]. To evaluate the performance of ERNIE-ViL, we conduct experiments on various vision-language tasks: (1) visual question answering (VQA 2.0 [antol2015vqa]), (2) visual commonsense reasoning (VCR [zellers2019recognition]), (3) region-to-phrase grounding (RefCOCO+ [kazemzadeh2014referitgame]), and (4) image-text/text-image retrieval (Flickr30K [young2014image]). On all these tasks, ERNIE-ViL obtains significant improvements over methods pre-trained on out-of-domain datasets. For a fair comparison with models pre-trained on both out-of-domain and in-domain datasets, we further pre-train ERNIE-ViL on MS-COCO [lin2014microsoft] and Visual-Genome [krishna2017visual] (in-domain datasets for the downstream tasks). It achieves state-of-the-art performance on all downstream tasks. On the region-to-phrase grounding task, which needs detailed semantic alignments, we achieve an improvement of over 2.0% on both test sets. We also obtain the best single-model performance and rank first on the VCR leaderboard, with an absolute improvement of 3.7% on the Q->AR task over the former best model. Our code and pre-trained models will be publicly available.
Overall, our proposed method makes three contributions:
To the best of our knowledge, ERNIE-ViL is the first work that introduces structured knowledge to enhance vision-language pre-training.
ERNIE-ViL constructs Scene Graph Prediction tasks during the pre-training of vision-language joint representations, putting more emphasis on the alignment of detailed semantics across modalities.
ERNIE-ViL achieves state-of-the-art performance on 5 downstream cross-modal tasks and ranks first on the VCR leaderboard.
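Figure 1: caption pairs for similar scenes that differ in objects, attributes, and relationships (images omitted).
| Objects | Attributes | Relationships |
| A tan dog and a little girl kiss. | A black dog playing with a purple toy. | A dog chasing another dog by a lake. |
| The little girl is kissing the brown cat. | A black dog playing with a green toy. | Two dogs standing in a lake. |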
2 Related Works
2.1 Cross-modal Pre-training
Inspired by text pre-training models [devlin2018bert], many cross-modal pre-training models for vision-language have been proposed. These works focus mainly on three aspects: model architecture, pre-training tasks, and pre-training data.
These latest works are based on different variants of Transformers. Most of them ([li2019unicoder] [su2019vl] [zhou2019unified] [qi2020imagebert] [li2019visualbert] [huang2020pixel]) use a uniform cross-modal Transformer that models both image and text representations, while others, such as ViLBERT [lu2019vilbert] and LXMERT [tan2019lxmert], are based on two-stream cross-modal Transformers, which yield more specific representations for image and text.
Inspired by the pre-training tasks of text pre-training models, Masked Language Modelling and the analogous Masked Region Prediction task [lu2019vilbert] are used in cross-modal pre-training. Similar to Next-Sentence Prediction, the Image-Text Matching task [lu2019vilbert][su2019vl][chen2019uniter] is also widely used. However, because they only randomly mask and predict sub-words, these methods do not distinguish common words from words describing the detailed semantics. As a result, the fine-grained semantic alignments across modalities are not well characterized in the learned joint representations.
Unlike text pre-training models, which can leverage tremendous amounts of natural language data, vision-language tasks require high-quality aligned text-image data that are hard to obtain. Conceptual Captions [sharma2018conceptual] and SBU Captions [ordonez2011im2text] are the most widely used datasets for image-text pre-training, with 3.0M and 1.0M image descriptions respectively. These two datasets are out-of-domain for vision-language downstream tasks, while some existing works [chen2019uniter][huang2020pixel] incorporate in-domain datasets, such as MS-COCO and Visual-Genome, which are highly correlated with the downstream tasks.
2.2 Scene Graph
The scene graph contains structured knowledge of visual scenes, including the present objects, attributes of objects, and relationships between objects. As beneficial prior knowledge describing the detailed semantics of the image and caption of a visual scene, scene graphs have led to many state-of-the-art models in image captioning [yang2019auto], image retrieval [wu2019unified], VQA [zhang2019empirical], and image generation [johnson2018image].
Various methods have been proposed to parse scene graphs from images [zellers2018neural] [xu2017scene] and texts [schuster2015generating][anderson2016spice] [wang2018scene]. Scene graphs automatically parsed from text have benefited several image-text multi-modal tasks. SPICE [anderson2016spice] proposed a new evaluation metric for image captioning that uses scene graphs parsed from captions. UniVSE [wu2019unified] improved robustness against text-domain adversarial attacks for cross-domain tasks. SGAE [yang2019auto] used scene graphs as an internal structure to bridge the gap between the image and language modalities and improve image captioning performance.
In this section, we first introduce the model architecture of ERNIE-ViL. We then describe our newly proposed Scene Graph Prediction pre-training tasks. Finally, we introduce pre-training with Scene Graph Prediction tasks in ERNIE-ViL.
3.1 Model Architecture
The vision-language model aims at learning joint representations that integrate information from both modalities and the alignments across them. The inputs to the joint model are usually a sentence and an image.
3.1.1 Input Embedding
Given a sequence of words and an image, we first introduce the methods to embed the inputs to the feature space.
We adopt a word pre-processing method similar to that of BERT [devlin2018bert]. The input sentence is tokenized into sub-word tokens using the WordPiece approach. Special tokens such as [CLS] and [SEP] are added to the tokenized text to form the text sequence $\{[CLS], w_1, \ldots, w_T, [SEP]\}$. The final embedding of each sub-word token is obtained by combining its word embedding, segment embedding, and sequence position embedding.
For the image, we first use a pre-trained object detector to detect the salient regions. The pooled features before the multi-class classification layer are used as the region features. We also encode the location of each region as a 5-dimensional vector $(\frac{x_1}{W}, \frac{y_1}{H}, \frac{x_2}{W}, \frac{y_2}{H}, \frac{(y_2-y_1)(x_2-x_1)}{WH})$, where $(x_1, y_1)$ and $(x_2, y_2)$ denote the coordinates of the bottom-left and top-right corners, and $W$ and $H$ are the width and height of the input image. We also add a special feature [IMG] that represents the entire image (i.e., mean-pooled visual features with a spatial encoding corresponding to the entire image) to form the final region sequence $\{[IMG], v_1, \ldots, v_I\}$.
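As a minimal sketch of the 5-dimensional location encoding, the function below assumes the layout commonly used in two-stream models such as ViLBERT: the four corner coordinates normalized by the image size, followed by the fraction of the image area covered by the box. The function name and argument order are illustrative and are not taken from the released code.

```python
from typing import List


def location_feature(x1: float, y1: float, x2: float, y2: float,
                     width: float, height: float) -> List[float]:
    """Return a 5-d location vector for one region box.

    Assumed layout: normalized corner coordinates followed by the
    fraction of the image area covered by the box.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")
    area_fraction = ((x2 - x1) * (y2 - y1)) / (width * height)
    return [x1 / width, y1 / height, x2 / width, y2 / height, area_fraction]


# The [IMG] feature uses the spatial encoding of the whole image.
whole_image = location_feature(0, 0, 640, 480, 640, 480)
```

Under this assumption, the whole-image box maps to corners (0, 0) and (1, 1) with an area fraction of 1.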
3.1.2 Vision-Language Encoder
Given the embeddings of the image regions $\mathbf{v}$ and of the words of the sentence $\mathbf{w}$, we use two-stream cross-modal Transformers to jointly model the intra-modal and inter-modal representations. Following ViLBERT [lu2019vilbert], ERNIE-ViL consists of two parallel BERT-style models operating over image regions and text segments.
The model outputs an embedding for each input of both the image and the text: $h_{[IMG]}, h_{v_1}, \ldots, h_{v_I}$ and $h_{[CLS]}, h_{w_1}, \ldots, h_{w_T}, h_{[SEP]}$. We take $h_{[IMG]}$ and $h_{[CLS]}$ as the holistic image and text representations.
3.2 Scene Graph Prediction
Conditioned on both the sentence and the image, we can accurately reconstruct the objects (cat), attributes (white), and relationships (on top of) even if these elements are missing from the sentence. Given only the sentence, however, we can only reconstruct elements of the same type as the original tokens, without aligning them with the image. When the objects, attributes, or relationships are masked in the sentence, the model cannot accurately reconstruct them without the help of the image.
A scene graph encodes structured knowledge of visual scenes, including the present objects, attributes of objects, and relationships between objects, which are essential for differentiating scenes. Therefore, we construct Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction, and Relationship Prediction. These tasks force the model to construct alignments across vision and language on more detailed semantics. Concretely, as shown in Figure 2, based on the scene graph parsed from the text, we construct three prediction tasks according to the different node types in the scene graph, i.e., objects, attributes, and relationships.
Scene graph parsing
Given the text sentence $\mathbf{w}$, we parse it into a scene graph [anderson2016spice], denoted as $G(\mathbf{w}) = \langle O(\mathbf{w}), E(\mathbf{w}), K(\mathbf{w}) \rangle$, where $O(\mathbf{w})$ is the set of objects mentioned in $\mathbf{w}$, $E(\mathbf{w}) \subseteq O(\mathbf{w}) \times R(\mathbf{w}) \times O(\mathbf{w})$ is the set of hyper-edges representing relationships between object nodes, and $R(\mathbf{w})$ is the set of relationships mentioned in $\mathbf{w}$. $K(\mathbf{w}) \subseteq O(\mathbf{w}) \times A(\mathbf{w})$ is the set of attribute nodes associated with object nodes, where $A(\mathbf{w})$ is the set of attributes mentioned in $\mathbf{w}$. The scene graph describes the objects in more detail through their associated attributes and the relationships between them. Integrating scene graph knowledge can therefore benefit the learning of more detailed joint vision-language representations. In this paper, we adopt the Scene Graph Parser provided by [anderson2016spice] to parse the text into a scene graph. For a more intuitive understanding, Table 1 illustrates the scene graph parsed from an example sentence.
| sentence: | A woman in a blue dress is putting her little white cat on top of a brown car in front of her house. |
| objects: | dress, woman, cat, car, house |
| relationships: | woman in dress, woman putting cat, cat on-top-of car, car in-front-of house |
| attributes: | blue dress, white cat, little cat, brown car |
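The parsed example can be held in a small data structure with one collection per node type. The sketch below is illustrative only: the `SceneGraph` class and its method are hypothetical names, not the output format of the Scene Graph Parser.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    """Container for the three node types of a parsed scene graph."""
    objects: List[str] = field(default_factory=list)
    # Each relationship is a (subject, relation, object) triplet.
    relationships: List[Tuple[str, str, str]] = field(default_factory=list)
    # Each attribute is an (object, attribute) pair.
    attributes: List[Tuple[str, str]] = field(default_factory=list)

    def attributes_of(self, obj: str) -> List[str]:
        """Return the attributes attached to one object node."""
        return [attr for owner, attr in self.attributes if owner == obj]


# The example sentence from the table above.
graph = SceneGraph(
    objects=["dress", "woman", "cat", "car", "house"],
    relationships=[
        ("woman", "in", "dress"),
        ("woman", "putting", "cat"),
        ("cat", "on-top-of", "car"),
        ("car", "in-front-of", "house"),
    ],
    attributes=[("dress", "blue"), ("cat", "white"),
                ("cat", "little"), ("car", "brown")],
)
```

Keeping the owning object next to each attribute and both endpoints next to each relationship mirrors the prediction tasks, which mask an attribute or relationship while leaving the associated objects visible.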
3.2.1 Object Prediction
Objects are the dominant elements of the visual scenes, thus playing an important role in constructing the representation of semantics. Predicting the objects forces the model to build the vision-language connections at object level.
First, we randomly select 30% of the object nodes in the scene graph to mask. Each selected object node is replaced with the special token [MASK] with a probability of 80%, replaced with a random token with a probability of 10%, and kept unchanged with a probability of 10%. Note that objects correspond to sub-sequences of the sentence, so object masking is implemented by masking the corresponding sub-sequences in the text.
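The masking procedure can be sketched as follows. This is a minimal illustration under stated assumptions: each node is selected independently with probability 0.3 (rather than sampling exactly 30% of nodes), the 80/10/10 choice is drawn once per node and applied to its whole token span, and `mask_nodes`, `node_spans`, and `vocab` are hypothetical names.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"


def mask_nodes(tokens: Sequence[str],
               node_spans: Sequence[Tuple[int, int]],
               vocab: Sequence[str],
               select_prob: float = 0.3,
               rng: Optional[random.Random] = None):
    """Mask scene graph nodes given as (start, end) token spans, end exclusive.

    Returns the corrupted tokens and per-position labels (None = not predicted).
    """
    rng = rng or random.Random()
    masked: List[str] = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for start, end in node_spans:
        if rng.random() >= select_prob:
            continue  # node not selected for prediction
        roll = rng.random()
        for i in range(start, end):
            labels[i] = tokens[i]  # the model must recover the original token
            if roll < 0.8:
                masked[i] = MASK
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)
            # otherwise keep the original token unchanged
    return masked, labels


tokens = "a woman in a blue dress".split()
masked, labels = mask_nodes(tokens, [(1, 2), (5, 6)], vocab=["cat", "car"],
                            select_prob=1.0, rng=random.Random(0))
```

The same routine applies to attribute and relationship nodes by passing their spans instead of the object spans.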
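For Object Prediction, ERNIE-ViL recovers the masked object tokens, denoted as $\mathbf{w}_{o_i}$, based on the observation of their surrounding words $\mathbf{w}_{\backslash \mathbf{w}_{o_i}}$ and all image regions $\mathbf{v}$, by minimizing the negative log-likelihood over image-text pairs $(\mathbf{w}, \mathbf{v})$ from the training set $D$:

$$\mathcal{L}_{obj}(\theta) = -E_{(\mathbf{w}, \mathbf{v}) \sim D} \log P(\mathbf{w}_{o_i} \mid \mathbf{w}_{\backslash \mathbf{w}_{o_i}}, \mathbf{v})$$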
3.2.2 Attribute Prediction
Attributes characterize detailed information about the visual objects, such as their color or shape, and therefore encode the details of the visual scene from another aspect.
Similarly, we randomly select 30% of the attribute nodes in the scene graph, and the masking strategy is the same as in Object Prediction. Since the attribute nodes in the scene graph are attached to objects, we keep the associated object while masking out the attribute node in each selected attribute pair $\langle w_{o_i}, w_{a_i} \rangle$.
Given the object words $\mathbf{w}_{o_i}$ in an attribute pair $\langle w_{o_i}, w_{a_i} \rangle$, Attribute Prediction recovers the masked tokens $\mathbf{w}_{a_i}$ of the attribute nodes, predicting the probability of each masked attribute word. Based on the observation of the object tokens $\mathbf{w}_{o_i}$, the other surrounding words $\mathbf{w}_{\backslash \mathbf{w}_{a_i}}$, and all image regions $\mathbf{v}$, Attribute Prediction minimizes the negative log-likelihood:

$$\mathcal{L}_{attr}(\theta) = -E_{(\mathbf{w}, \mathbf{v}) \sim D} \log P(\mathbf{w}_{a_i} \mid \mathbf{w}_{o_i}, \mathbf{w}_{\backslash \mathbf{w}_{a_i}}, \mathbf{v})$$
3.2.3 Relationship Prediction
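Relationships describe the actions (semantic) or relative positions (geometric) between the objects of a visual scene, which helps distinguish scenes that contain the same objects but different relationships.
When masking a selected relationship triplet $\langle w_{o_{i1}}, w_{r_i}, w_{o_{i2}} \rangle$, we keep the objects and mask out the relationship node $w_{r_i}$. ERNIE-ViL thus constructs the Relationship Prediction task to learn the connections for relationships across the vision and language modalities. Specifically, given the object tokens $\mathbf{w}_{o_{i1}}, \mathbf{w}_{o_{i2}}$ of a relationship triplet, this task recovers the masked relationship tokens, predicting the probability of each masked relationship token $\mathbf{w}_{r_i}$. The context for the prediction is the given object tokens $\mathbf{w}_{o_{i1}}, \mathbf{w}_{o_{i2}}$, the other surrounding words $\mathbf{w}_{\backslash \mathbf{w}_{r_i}}$ from the text, and all image regions $\mathbf{v}$:

$$\mathcal{L}_{rel}(\theta) = -E_{(\mathbf{w}, \mathbf{v}) \sim D} \log P(\mathbf{w}_{r_i} \mid \mathbf{w}_{o_{i1}}, \mathbf{w}_{o_{i2}}, \mathbf{w}_{\backslash \mathbf{w}_{r_i}}, \mathbf{v})$$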
3.3 Pre-training with Scene Graph Prediction
Similar to ViLBERT [lu2019vilbert], ERNIE-ViL also adopts Masked Language Modelling (MLM) to capture the syntactic and lexical information in the text. Moreover, Masked Region Prediction and Image-Text Matching are used for the visual modality and for cross-modal alignment, respectively. The losses of all these tasks are summed during pre-training.
4.1 Training ERNIE-ViL
We use the Conceptual Captions (CC) dataset [sharma2018conceptual] and the SBU dataset [ordonez2011im2text] as pre-training data. CC is a collection of 3.3 million image-caption pairs automatically scraped from alt-text enabled web images, and SBU is a similar vision-language dataset with 1.0 million image-caption pairs. Since some links had become broken by the time we downloaded the data, we obtained only about 3.0 million pairs for CC and 0.8 million for SBU. Note that CC and SBU consist of image-caption pairs automatically collected from the web and have no intersection with the downstream task datasets, so they act as out-of-domain datasets for training vision-language models.
For each image-text pair in the training data, pre-processing is performed as follows. For the image, we adopt Faster R-CNN [ren2015faster] (with a ResNet-101 [he2016deep] backbone) pre-trained on the Visual-Genome dataset to select salient image regions and extract region features. More specifically, regions whose class detection probability exceeds a confidence threshold of 0.2 are selected, and 10 to 36 boxes are kept. For each kept region, the mean-pooled convolutional representation is used as its feature. For the text, we parse the scene graphs from the sentences using the Scene Graph Parser and tokenize the sentences with WordPiece, following BERT.
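The region selection rule can be sketched as below. The text states only the 0.2 threshold and the 10-to-36 range; how the range is enforced is an assumption here: the highest-scoring boxes are taken to fill up to the minimum when too few pass the threshold, and the list is truncated at the maximum. `select_regions` is a hypothetical helper name.

```python
from typing import List, Sequence


def select_regions(scores: Sequence[float],
                   threshold: float = 0.2,
                   min_boxes: int = 10,
                   max_boxes: int = 36) -> List[int]:
    """Return indices of kept boxes, ordered by descending detection score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_confident = sum(1 for s in scores if s > threshold)
    # Clamp the number of kept boxes to the [min_boxes, max_boxes] range.
    n_keep = max(min_boxes, min(max_boxes, n_confident))
    return order[:n_keep]  # returns every box if fewer than n_keep exist


kept = select_regions([0.9] * 5 + [0.1] * 20)
```

With 5 confident detections out of 25, the sketch keeps 10 boxes, the 5 confident ones first.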
For the masking strategies, we randomly mask 15% of the tokens, 30% of the scene graph nodes, and 15% of the image regions. For the token and region prediction tasks, only the items in positive pairs are predicted.
We train ERNIE-ViL in two model scale settings, ERNIE-ViL-base and ERNIE-ViL-large, which mainly differ in the depth of the text stream. The detailed settings are shown in Table 2. We initialize the text stream with the parameters of the ERNIE 2.0 model [sun2019ernie2] and train ERNIE-ViL with a total batch size of 512 for at least 500k steps on 8 V100 GPUs. We use the Adam optimizer with an initial learning rate of 1e-4 and a linear learning rate decay schedule.
Table 2: settings of the Text Stream, Image Stream, and Cross Stream (table values not preserved).
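The linear decay schedule used for pre-training can be sketched as a function of the step count. The initial rate of 1e-4 and the 500k steps come from the text; the absence of warm-up and the decay all the way to zero are assumptions made for illustration.

```python
def linear_decay_lr(step: int,
                    total_steps: int = 500_000,
                    base_lr: float = 1e-4) -> float:
    """Learning rate that falls linearly from base_lr to 0 over total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the rate never goes negative or above base_lr.
    progress = min(max(step, 0), total_steps) / total_steps
    return base_lr * (1.0 - progress)


halfway = linear_decay_lr(250_000)
```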
4.2 Downstream Tasks
4.2.1 Visual Commonsense Reasoning (VCR)
The Visual Commonsense Reasoning (VCR) [zellers2019recognition] task contains two sub-tasks, visual question answering (Q->A) and answer justification (QA->R), both of which are multiple-choice problems. The holistic setting (Q->AR) requires both the chosen answer and the chosen rationale to be correct. The VCR dataset consists of 290k multiple-choice QA problems derived from 110k movie scenes. In the visual question answering (Q->A) task, we concatenate the question and each candidate answer for the language modality and keep the image for the visual modality. We take the dot product of the final hidden states $h_{[CLS]}$ and $h_{[IMG]}$ and, with an additional FC layer, predict the matching score for each candidate answer. For the answer justification (QA->R) task, we use the same setting as for the Q->A task.
Similar to UNITER [chen2019uniter], we apply a second stage of pre-training on the VCR dataset. We then fine-tune the VCR model for 6 epochs with a batch size of 64 and an initial learning rate of 2e-5, which decays by 0.1 at the 2nd and 4th epochs.
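The step-wise decay used during fine-tuning can be written as a small helper. This is a sketch under one assumption: a milestone `m` means the rate is multiplied by 0.1 once `m` epochs have been completed; the text does not state whether the decay happens at the start or the end of the named epoch.

```python
from typing import Sequence


def step_decay_lr(epoch: int,
                  base_lr: float,
                  milestones: Sequence[int],
                  gamma: float = 0.1) -> float:
    """Learning rate after `epoch` completed epochs with step decay."""
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** n_decays)


# VCR fine-tuning: 6 epochs, initial rate 2e-5, decays at epochs 2 and 4.
vcr_schedule = [step_decay_lr(e, 2e-5, (2, 4)) for e in range(6)]
```

The same helper covers the other fine-tuning schedules in this section by changing `base_lr` and `milestones`.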
4.2.2 Visual Question Answering (VQA)
The VQA task requires answering natural language questions about images. The VQA 2.0 dataset [antol2015vqa] contains 204k images and 1.1M questions about these images. Following [anderson2018bottom], we treat VQA as a multi-label classification task, assigning a soft target score to each answer based on its relevance to the 10 human answer responses. We take the dot product of the final hidden states $h_{[CLS]}$ and $h_{[IMG]}$ and map this representation to the 3,129 possible answers with an additional two-layer MLP. The model is optimized with a binary cross-entropy loss on the soft target scores. We fine-tune the VQA model for 12 epochs with a batch size of 256 and an initial learning rate of 4e-5, which decays by 0.1 at the end of epochs 6 and 9. At inference, we simply take a softmax.
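A minimal sketch of the soft-target formulation follows. The scoring rule min(1, count / 3) is the standard VQA accuracy convention and is an assumption here, since the text does not spell out how the soft scores are derived from the 10 human answers; the function names are illustrative.

```python
import math
from collections import Counter
from typing import List, Sequence


def soft_targets(human_answers: Sequence[str],
                 answer_vocab: Sequence[str]) -> List[float]:
    """Soft score per candidate answer: min(1, #matching humans / 3)."""
    counts = Counter(human_answers)
    return [min(1.0, counts.get(a, 0) / 3.0) for a in answer_vocab]


def binary_cross_entropy(logits: Sequence[float],
                         targets: Sequence[float]) -> float:
    """Mean binary cross-entropy on logits, in a numerically stable form."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Equivalent to -t*log(sigmoid(z)) - (1-t)*log(1-sigmoid(z)).
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


targets = soft_targets(["red"] * 2 + ["blue"] * 8, ["red", "blue", "green"])
loss = binary_cross_entropy([0.0, 0.0, 0.0], targets)
```

Because the targets are soft scores rather than one-hot labels, several answers to the same question can receive credit at once.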
4.2.3 Grounding Referring Expressions
The referring expression task is to localize an image region given a natural language reference. We evaluate this task on the RefCOCO+ dataset [kazemzadeh2014referitgame]. In this paper, we use the bounding box proposals provided by [yu2018mattnet], pre-trained on the COCO dataset. We make a prediction for each region from its final hidden state with an additional FC layer, and each region is labelled by computing its IoU with the ground truth box, with a threshold of 0.5. We use a binary cross-entropy loss on the target label of each region and fine-tune the RefCOCO+ model for 20 epochs with a batch size of 256 and an initial learning rate of 4e-5, which decays by 0.1 at the end of epochs 12 and 16. At inference, we take the highest-scoring region as the prediction. The predicted bounding box is regarded as correct if its IoU with the ground truth box is higher than 0.5.
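The IoU-based labelling of region proposals can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and a proposal is labelled positive when its IoU is strictly greater than the threshold; the text gives the 0.5 threshold but not whether the comparison is strict.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def label_regions(proposals: Sequence[Box], gt_box: Box,
                  threshold: float = 0.5) -> List[int]:
    """Binary training label for each proposal against the ground truth box."""
    return [1 if iou(p, gt_box) > threshold else 0 for p in proposals]


region_labels = label_regions([(0, 0, 2, 2), (1, 0, 3, 2), (5, 5, 6, 6)],
                              gt_box=(0, 0, 2, 2))
```

The same `iou` function serves at evaluation time, where the top-scoring region counts as correct if its IoU with the ground truth exceeds 0.5.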
4.2.4 Image Retrieval & Text Retrieval
Caption-based image retrieval is the task of identifying an image from a pool given a caption describing its content. Flickr30K [young2014image] contains 31,000 images collected from the Flickr website, with 5 captions available for each image. Following the split in [lee2018stacked], we use 1,000 images for validation, 1,000 images for testing, and the rest for training.
We take the dot product of the final hidden states $h_{[CLS]}$ and $h_{[IMG]}$ and, with an additional FC layer, predict the matching score for each image-text pair. We utilize circle loss [sun2020circle] with random negative samples for each image-text pair. We set for all settings. We train for 40 epochs on the Flickr30K dataset with an initial learning rate of 5e-6, which decays at the end of epochs 24 and 32.
We compare our pre-trained ERNIE-ViL model against other cross-modal pre-training models. As shown in Table 3, enhanced with scene graph knowledge, ERNIE-ViL achieves state-of-the-art results on all downstream tasks.
Pre-trained on the same out-of-domain datasets (CC, SBU), ERNIE-ViL obtains significant improvements on VCR, Image Retrieval, and Text Retrieval compared to Unicoder-VL [li2019unicoder]. Specifically, its absolute improvements of 3.60% in R@1 for Image Retrieval and 2.50% in R@1 for Text Retrieval on Flickr30K demonstrate the effectiveness of detailed semantic alignments across vision and language. Compared to ViLBERT, which uses the same two-stream cross-modal Transformer architecture, ERNIE-ViL obtains better results on all downstream tasks. Note that ERNIE-ViL is pre-trained only on out-of-domain datasets. For a fair comparison with models pre-trained on both out-of-domain and in-domain datasets, we therefore further pre-train ERNIE-ViL on in-domain datasets (Visual-Genome, MS-COCO). As illustrated in Table 3, ERNIE-ViL-large achieves better downstream performance on 5 tasks compared to UNITER, 12-in-1 [Lu_2020_CVPR], OSCAR [li2020oscar], and VILLA [gan2020large].
| Pre-training data | Model | VCR Q->A val (test) | VCR QA->R val (test) | VCR Q->AR val (test) | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB |
| Out-of-domain | ViLBERT | 72.42 (73.3) | 74.47 (74.6) | 54.04 (54.8) | 72.34 | 78.52 | 62.61 |
| Out-of-domain | Unicoder-VL | 72.6 (73.4) | 74.5 (74.4) | 54.4 (54.9) | - | - | - |
| Out-of-domain | VLBERT-base | 73.8 (-) | 74.4 (-) | 55.2 (-) | 71.60 | 77.72 | 60.99 |
| Out-of-domain | VLBERT-large | 75.5 (75.8) | 77.9 (78.4) | 58.9 (59.7) | 72.59 | 78.57 | 62.30 |
| Out-of-domain + in-domain | UNITER-base | 74.56 (75.0) | 77.03 (77.2) | 57.76 (58.2) | 75.31 | 81.30 | 65.58 |
| Out-of-domain + in-domain | VILLA-base | 75.54 (76.4) | 78.78 (79.1) | 59.75 (60.6) | 76.05 | 81.65 | 65.70 |
| Out-of-domain + in-domain | UNITER-large | 77.22 (77.3) | 80.49 (80.8) | 62.59 (62.8) | 75.90 | 81.45 | 66.70 |
| Out-of-domain + in-domain | VILLA-large | 78.45 (78.9) | 82.57 (82.8) | 65.18 (65.7) | 76.17 | 81.54 | 66.84 |
| Out-of-domain | ERNIE-ViL-base | 76.37 (77.0) | 79.65 (80.3) | 61.24 (62.1) | 76.66 | 82.83 | 67.24 |
| Out-of-domain | ERNIE-ViL-large | 78.52 (79.2) | 83.37 (83.5) | 65.81 (66.3) | 76.59 | 82.95 | 68.39 |
| Out-of-domain + in-domain | ERNIE-ViL-large | 78.62 (-) | 83.42 (-) | 65.95 (-) | 77.99 | 84.00 | 68.84 |
| Case | Text |
| 1 | a black dog about to catch a flying disc . |
| 2 | here is a picture of an army standing in one row and on an island . |
| 3 | a man with a white shirt and red tie is talking to another man in a kitchen . |
| 4 | two teen boys in school clothes are walking with something in a garbage bag . |
| 5 | five children are on a carnival ride under a clown face . |
| 6 | two dolphins jumping into the water . |
| 7 | a man in a brown shirt is cutting a piece of cake . |
| 8 | a brown dog walks towards another animal hiding in the grass . |
To validate the effectiveness of incorporating knowledge from scene graphs, we conduct a language cloze test conditioned on the visual modality.
In the cloze test, language tokens representing detailed semantics (objects, attributes, and relationships) are masked in the text, and the model is required to infer them from the context of both the text and the image. To build the dataset, we sampled 15,000 image-text pairs from the Flickr30K dataset and selected 5,000 tokens each for objects, attributes, and relationships. For the prediction, acc@1 and acc@5 are adopted as the evaluation metrics. Table 4 compares the prediction results of the baseline model, which is pre-trained without the Scene Graph Prediction tasks, with those of the proposed ERNIE-ViL. Absolute acc@1 improvements of 2.12% for objects, 1.31% for relationships, and 6.00% for attributes demonstrate that ERNIE-ViL learns better alignments of detailed semantics across modalities.
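The acc@k metric used in the cloze test can be computed as below. This is a generic sketch: `ranked_predictions` holds, for each masked position, the model's candidate tokens sorted from most to least probable, and the toy data are invented purely to exercise the function.

```python
from typing import Sequence


def accuracy_at_k(ranked_predictions: Sequence[Sequence[str]],
                  gold_tokens: Sequence[str],
                  k: int) -> float:
    """Fraction of masked positions whose gold token is in the top-k predictions."""
    if len(ranked_predictions) != len(gold_tokens):
        raise ValueError("predictions and gold tokens must align")
    hits = sum(1 for preds, gold in zip(ranked_predictions, gold_tokens)
               if gold in preds[:k])
    return hits / len(gold_tokens)


# Toy example with three masked tokens (illustrative data only).
preds = [["dog", "cat", "horse"], ["red", "blue", "green"], ["on", "in", "under"]]
gold = ["dog", "blue", "under"]
acc1 = accuracy_at_k(preds, gold, k=1)
```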
Moreover, we illustrate some cases in Table 5, with the top predictions shown in the right columns. In cases 1-5, the baseline model cannot make the right predictions, because it did not learn accurate alignments of detailed semantics: it does not distinguish common words from detailed semantics during pre-training. In case 6, the baseline model predicts reasonable tokens but with lower confidence. However, ERNIE-ViL may also predict incorrect tokens, as in cases 7-8, because the detailed semantics ("yellow" and "brown" in case 7, "dog" and "animal" in case 8) are quite similar in the visual space.
We proposed the ERNIE-ViL approach to learn joint representations of vision and language. In addition to conventional MLM for cross-modal pre-training, we introduce Scene Graph Prediction tasks to characterize the detailed semantic alignments across vision and language. Experimental results on various downstream tasks demonstrate the improvements gained by incorporating knowledge from scene graphs during cross-modal pre-training. For future work, scene graphs extracted from images could also be incorporated into cross-modal pre-training. Moreover, Graph Neural Networks could be considered for integrating more structured knowledge.