
Caption Enriched Samples for Improving Hateful Memes Detection

09/22/2021
by Efrat Blaier, et al.

The recently introduced hateful meme challenge demonstrates the difficulty of determining whether a meme is hateful or not: both unimodal language models and multimodal vision-language models fall short of human-level performance. Motivated by the need to model the contrast between the image content and the overlaid text, we suggest applying an off-the-shelf image captioning tool to capture the former. We demonstrate that incorporating such automatic captions during fine-tuning improves the results of various unimodal and multimodal models. Moreover, in the unimodal case, continuing the pre-training of language models on augmented and original caption pairs is highly beneficial to classification accuracy.
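The sketch below illustrates the general idea of caption enrichment: generate an automatic caption describing the meme image with an off-the-shelf captioning model, pair it with the meme's overlaid text, and feed both to a text classifier. The specific checkpoints (a BLIP captioner and a BERT classifier) and helper names are assumptions for illustration, not necessarily the tools used in the paper.

```python
# Illustrative sketch of caption-enriched hateful meme classification.
# Assumed (hypothetical) checkpoints: BLIP for captioning, BERT for classification.
from PIL import Image
import torch
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    AutoTokenizer,
    AutoModelForSequenceClassification,
)

# Off-the-shelf image captioning model (assumed checkpoint).
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Unimodal text classifier to be fine-tuned for hateful / not-hateful (assumed checkpoint).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
classifier = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)


def caption_image(image_path: str) -> str:
    """Generate an automatic caption describing the image content."""
    image = Image.open(image_path).convert("RGB")
    inputs = cap_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        output_ids = cap_model.generate(**inputs, max_new_tokens=30)
    return cap_processor.decode(output_ids[0], skip_special_tokens=True)


def classify_meme(image_path: str, overlay_text: str) -> torch.Tensor:
    """Pair the meme's overlaid text with the generated caption and classify."""
    caption = caption_image(image_path)
    # The caption stands in for the image content; the overlay text is the meme text.
    inputs = tokenizer(overlay_text, caption, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = classifier(**inputs).logits
    return logits.softmax(dim=-1)  # [p(not hateful), p(hateful)]
```

In this sketch the caption and overlay text are passed as a sentence pair, so the classifier can model the contrast between what the image shows and what the text says; the same caption-enriched samples could also be used for continued masked-language-model pre-training before fine-tuning.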

