Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining

04/25/2023
by Giacomo Nebbia, et al.

Named entities are ubiquitous in the text that naturally accompanies images, especially in domains such as news or Wikipedia articles. In previous work, named entities have been identified as a likely cause of the low performance of image-text retrieval models pretrained on Wikipedia and evaluated on named entity-free benchmark datasets. Because each named entity is rarely mentioned, named entities can be challenging to model. They also represent missed learning opportunities for self-supervised models: the model may miss the link between a named entity and the corresponding object in the image, whereas it likely would not if the object were referred to by a more common term. In this work, we investigate hypernymization, replacing a named entity with a more generic term (its hypernym), as a way to handle named entities when pretraining grounding-based multi-modal models and when fine-tuning for open-vocabulary detection. We propose two ways to perform hypernymization: (1) a “manual” pipeline relying on a comprehensive ontology of concepts, and (2) a “learned” approach in which we train a language model to perform hypernymization. We run experiments on data from Wikipedia and The New York Times. We report improved pretraining performance on objects of interest following hypernymization, and we show the promise of hypernymization for open-vocabulary detection, specifically on classes not seen during training.
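To make the idea concrete, below is a minimal sketch of the “manual” pipeline described in the abstract: detect named entities in a caption, then replace each one with a hypernym drawn from an ontology. The sketch assumes spaCy for entity recognition and WordNet (via NLTK) as the ontology; the paper's actual ontology and pipeline details may differ, and the hypernymize function below is illustrative.

```python
# Hypothetical sketch of ontology-based caption hypernymization.
# Assumptions: spaCy for NER, WordNet (via NLTK) as the ontology.
#   pip install spacy nltk && python -m spacy download en_core_web_sm
#   (WordNet data: nltk.download("wordnet"))
import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_sm")

def hypernymize(caption: str) -> str:
    """Replace each detected named entity with a more common hypernym."""
    doc = nlp(caption)
    pieces, last = [], 0
    for ent in doc.ents:
        # Look the entity up in WordNet (multi-word lemmas use underscores).
        synsets = wn.synsets(ent.text.replace(" ", "_"), pos=wn.NOUN)
        if not synsets:
            continue  # entity not in the ontology: keep it unchanged
        # Instance synsets (e.g. specific landmarks) expose their class
        # through instance_hypernyms() rather than hypernyms().
        hypernyms = synsets[0].hypernyms() or synsets[0].instance_hypernyms()
        if not hypernyms:
            continue
        replacement = hypernyms[0].lemma_names()[0].replace("_", " ")
        pieces.append(caption[last:ent.start_char])
        pieces.append(replacement)
        last = ent.end_char
    pieces.append(caption[last:])
    return "".join(pieces)

if __name__ == "__main__":
    # Exact output depends on spaCy's NER and WordNet coverage;
    # e.g. "Eiffel Tower" may become "tower".
    print(hypernymize("Tourists gather near the Eiffel Tower at sunset."))
```

The “learned” approach would instead have a language model produce the hypernymized text directly, e.g. by fine-tuning a sequence-to-sequence model on pairs of original and hypernymized captions, which could generalize beyond the ontology's coverage.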


Related research

11/22/2021
Namesakes: Ambiguously Named Entities from Wikipedia and News
We present Namesakes, a dataset of ambiguously named entities obtained f...

07/05/2023
Named Entity Inclusion in Abstractive Text Summarization
We address the named entity omission – the drawback of many current abst...

06/08/2023
Multi-Modal Classifiers for Open-Vocabulary Object Detection
The goal of this paper is open-vocabulary object detection (OVOD) – ...

05/13/2022
ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation
We present ViT5, a pretrained Transformer-based encoder-decoder model fo...

02/22/2023
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibi...

05/14/2016
Occurrence Statistics of Entities, Relations and Types on the Web
The problem of collecting reliable estimates of occurrence of entities o...

06/21/2022
Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching
With the increased accessibility of web and online encyclopedias, the am...
