
Caption Enriched Samples for Improving Hateful Memes Detection

by Efrat Blaier et al.

The recently introduced Hateful Memes challenge demonstrates the difficulty of determining whether a meme is hateful: both unimodal language models and multimodal vision-language models fall short of human-level performance. Motivated by the need to model the contrast between the image content and the overlaid text, we propose applying an off-the-shelf image captioning tool to capture the former. We demonstrate that incorporating such automatic captions during fine-tuning improves the results for various unimodal and multimodal models. Moreover, in the unimodal case, continuing the pre-training of language models on pairs of original and augmented captions is highly beneficial to classification accuracy.
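The enrichment step described above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the `[SEP]`-joined pairing format, the function name, and the example strings are all assumptions about how overlaid meme text and an automatic caption might be combined into a single input for a unimodal language model.

```python
def enrich_sample(overlaid_text: str, image_caption: str, sep: str = " [SEP] ") -> str:
    """Combine a meme's overlaid text with an automatically generated image
    caption into one text input, so a unimodal language model can see the
    contrast between what the text says and what the image shows.

    The separator token is a hypothetical choice; the actual format would
    depend on the tokenizer of the model being fine-tuned."""
    return overlaid_text.strip() + sep + image_caption.strip()


# Hypothetical example: a benign-sounding caption placed next to the
# overlaid text lets the classifier judge whether the text-image contrast
# is hateful or harmless.
sample = enrich_sample(
    "look how many people love you",
    "a tumbleweed rolling across an empty desert road",
)
print(sample)
```

In practice, the `image_caption` argument would come from an off-the-shelf captioning model run over the meme image, and the enriched strings would replace the raw overlaid text during fine-tuning.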
