Mass-Producing Failures of Multimodal Systems with Language Models

06/21/2023
by Shengbang Tong, et al.

Deployed multimodal systems can fail in ways that evaluators did not anticipate. To find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures: generalizable, natural-language descriptions of patterns of model failures. To uncover systematic failures, MultiMon scrapes a corpus for examples of erroneous agreement: inputs that produce the same output but should not. It then prompts a language model (e.g., GPT-4) to find systematic patterns of failure and describe them in natural language. We use MultiMon to find 14 systematic failures (e.g., "ignores quantifiers") of the CLIP text-encoder, each comprising hundreds of distinct inputs (e.g., "a shelf with a few/many books"). Because CLIP is the backbone for most state-of-the-art multimodal systems, these inputs produce failures in Midjourney 5.1, DALL-E, VideoFusion, and others. MultiMon can also steer towards failures relevant to specific use cases, such as self-driving cars. We see MultiMon as a step towards evaluation that autonomously explores the long tail of potential system failures. Code for MultiMon is available at https://github.com/tsb0601/MultiMon.
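The pipeline the abstract describes reduces to two steps: flag pairs of inputs that the CLIP text encoder maps to nearly identical embeddings even though their meanings differ (erroneous agreement), then ask a language model to generalize those pairs into natural-language failure descriptions. The Python sketch below illustrates those two steps under our own assumptions; the checkpoint name, similarity threshold, example corpus, and prompt wording are ours, not the released MultiMon code.

```python
# Minimal sketch of the two-step pipeline described in the abstract.
# Assumptions (not from the paper's code): model checkpoint, similarity
# threshold, example corpus, and prompt wording.
import itertools

import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

MODEL = "openai/clip-vit-base-patch32"  # assumed stand-in for the CLIP text encoder
tokenizer = CLIPTokenizer.from_pretrained(MODEL)
text_encoder = CLIPTextModelWithProjection.from_pretrained(MODEL)

def embed(sentences):
    """Return L2-normalized CLIP text embeddings for a list of sentences."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        embeds = text_encoder(**inputs).text_embeds
    return embeds / embeds.norm(dim=-1, keepdim=True)

def erroneous_agreement_candidates(corpus, threshold=0.9):
    """Step 1: yield sentence pairs whose embeddings nearly coincide.
    High similarity alone also captures true paraphrases, so a
    meaning-difference check (e.g., an LLM judge) should filter these
    candidates before step 2."""
    embeds = embed(corpus)
    sims = embeds @ embeds.T
    for i, j in itertools.combinations(range(len(corpus)), 2):
        if sims[i, j] >= threshold:
            yield corpus[i], corpus[j], sims[i, j].item()

def failure_description_prompt(pairs):
    """Step 2: build a prompt asking a language model (the paper uses
    GPT-4) to generalize the pairs into systematic failure descriptions."""
    listing = "\n".join(f"- {a!r} vs. {b!r} (cosine {s:.3f})" for a, b, s in pairs)
    return (
        "The following sentence pairs receive nearly identical CLIP text "
        "embeddings even though their meanings differ:\n"
        f"{listing}\n"
        "Describe, in natural language, the systematic failure modes these "
        "pairs illustrate."
    )

corpus = [
    "a shelf with a few books",
    "a shelf with many books",
    "a dog chasing a cat",
    "a cat chasing a dog",
]
pairs = list(erroneous_agreement_candidates(corpus))
print(failure_description_prompt(pairs))
```

In a full run, the candidate pairs would be filtered to exclude true paraphrases and the prompt sent to GPT-4; descriptions like the abstract's "ignores quantifiers" are the kind of output this second step returns.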


