Mr. Right: Multimodal Retrieval on Representation of ImaGe witH Text

09/28/2022
by Cheng-An Hsieh, et al.

Multimodal learning is a recent challenge that extends unimodal learning by generalizing its domain to diverse modalities, such as text, images, or speech. This extension requires models to process and relate information from multiple modalities. In Information Retrieval, traditional retrieval tasks focus on the similarity between unimodal documents and queries, while image-text retrieval assumes that most texts contain the scene context from images. This separation ignores the fact that real-world queries may involve text content, image captions, or both. To address this, we introduce Multimodal Retrieval on Representation of ImaGe witH Text (Mr. Right), a novel and comprehensive dataset for multimodal retrieval. We use the Wikipedia dataset, which has rich text-image examples, and generate three types of text-based queries with different modality information: text-related, image-related, and mixed. To validate the effectiveness of our dataset, we provide a multimodal training paradigm and evaluate previous text retrieval and image retrieval frameworks. The results show that the proposed multimodal retrieval can improve retrieval performance, but creating a well-unified document representation from text and images is still a challenge. We hope Mr. Right helps broaden current retrieval systems and accelerates the advancement of multimodal learning in Information Retrieval.
