Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching

06/21/2022
by Nicola Messina, et al.

As the web and online encyclopedias become more accessible, the amount of data to manage keeps growing. Wikipedia, for example, hosts millions of pages written in multiple languages. These pages contain images that often lack textual context, which leaves them conceptually floating and therefore harder to find and manage. In this work, we present the system we designed for the Wikipedia Image-Caption Matching challenge on Kaggle, whose objective is to use the data associated with an image (its URL and visual data) to find the correct caption among a large pool of available ones. A system able to perform this task would improve the accessibility and completeness of multimedia content on large online encyclopedias. Specifically, we propose a cascade of two models, both based on the Transformer architecture, that efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experimentation that the proposed two-model approach is an effective way to handle a large pool of images and captions while keeping the overall computational complexity at inference time bounded. Our approach obtains a normalized Discounted Cumulative Gain (nDCG) value of 0.53 on the private leaderboard of the Kaggle challenge.
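The propose-then-re-rank idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`propose`, `rerank`, `ndcg_at_k`), the dot-product first stage, the toy score table standing in for the second model, and the binary-relevance form of nDCG with a cutoff `k` are all assumptions made for illustration. The point it shows is that the expensive scorer runs only on the `k` candidates kept by the cheap stage, which is what keeps inference cost bounded.

```python
import math
from typing import Callable, List, Sequence, Set


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product, used here as a cheap similarity (an assumption)."""
    return sum(a * b for a, b in zip(u, v))


def propose(query_vec: Sequence[float],
            caption_vecs: Sequence[Sequence[float]],
            k: int) -> List[int]:
    """Stage 1: score every caption cheaply and keep the indices of the top k."""
    scores = [dot(query_vec, c) for c in caption_vecs]
    order = sorted(range(len(caption_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def rerank(candidates: Sequence[int],
           pair_score: Callable[[int], float]) -> List[int]:
    """Stage 2: apply the expensive scorer to the proposed candidates only."""
    return sorted(candidates, key=pair_score, reverse=True)


def ndcg_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Binary-relevance nDCG@k: a relevant item at 0-based position p gains 1/log2(p + 2)."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, idx in enumerate(ranked[:k]) if idx in relevant)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Toy data (hypothetical): one query embedding and four caption embeddings.
query_vec = [1.0, 0.0]
caption_vecs = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.7, 0.0]]

# Hypothetical stand-in for the second, more expensive model.
toy_pair_scores = {0: 0.2, 1: 0.9, 3: 0.5}

candidates = propose(query_vec, caption_vecs, k=3)
final = rerank(candidates, lambda i: toy_pair_scores[i])

correct = {1}  # index of the ground-truth caption in this toy example
print("proposed:", candidates, "nDCG:", ndcg_at_k(candidates, correct, k=5))
print("re-ranked:", final, "nDCG:", ndcg_at_k(final, correct, k=5))
```

In this toy run the first stage places the correct caption second, and the second stage moves it to the top, so the nDCG rises to 1.0. The cost of the second stage depends on `k` rather than on the size of the full caption pool.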
