Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions

11/04/2022
by Gaurav Verma, et al.

As multimodal learning finds applications in a wide variety of high-stakes societal tasks, investigating the robustness of these models becomes important. Existing work has focused on understanding the robustness of vision-and-language models to imperceptible variations on benchmark tasks. In this work, we investigate the robustness of multimodal classifiers to cross-modal dilutions, a plausible variation. We develop a model that, given a multimodal (image + text) input, generates additional dilution text that (a) maintains relevance and topical coherence with the image and existing text, and (b) when added to the original text, leads to misclassification of the multimodal input. Via experiments on Crisis Humanitarianism and Sentiment Detection tasks, we find that the performance of task-specific fusion-based multimodal classifiers drops by 23.3% and 22.5%, respectively, in the presence of the generated dilutions. Metric-based comparisons with several baselines and human evaluations indicate that our dilutions show higher relevance and topical coherence, while simultaneously being more effective at demonstrating the brittleness of the multimodal classifiers. Our work aims to highlight and encourage further research on the robustness of deep multimodal models to realistic variations, especially in human-facing societal applications. The code and other resources are available at https://claws-lab.github.io/multimodal-robustness/.
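For readers who want to probe a fusion-based classifier in the same spirit, the sketch below shows one way to measure the accuracy drop caused by cross-modal text dilutions: dilution text is appended to the original caption while the image is left untouched, and predictions on the clean and diluted inputs are compared. This is a minimal illustration, not the authors' implementation; `classify` and `generate_dilution` are hypothetical stand-ins for a task-specific fusion model and the dilution generator described in the abstract.

```python
# Minimal sketch of evaluating robustness to cross-modal content dilutions.
# The fusion classifier and the dilution generator are assumed, hypothetical
# callables; only the evaluation loop itself is shown here.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MultimodalExample:
    image_path: str  # path to the image of the (image + text) pair
    text: str        # original caption / post text
    label: int       # ground-truth class (e.g., humanitarian category)


def accuracy_under_dilution(
    examples: List[MultimodalExample],
    classify: Callable[[str, str], int],           # (image_path, text) -> predicted class
    generate_dilution: Callable[[str, str], str],  # (image_path, text) -> dilution text
) -> Tuple[float, float]:
    """Return (clean accuracy, diluted accuracy) over the given examples."""
    clean_correct, diluted_correct = 0, 0
    for ex in examples:
        # Prediction on the unmodified (image, text) pair.
        if classify(ex.image_path, ex.text) == ex.label:
            clean_correct += 1

        # Append topically coherent dilution text to the original text only;
        # the image is left untouched (a cross-modal *text* dilution).
        diluted_text = ex.text + " " + generate_dilution(ex.image_path, ex.text)
        if classify(ex.image_path, diluted_text) == ex.label:
            diluted_correct += 1

    n = max(len(examples), 1)
    return clean_correct / n, diluted_correct / n
```

The two accuracies returned by this helper are the quantities from which a performance drop, such as the figures reported above, can be computed for a given task and classifier.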

