Diagnosing and Rectifying Vision Models using Language

02/08/2023
by   Yuhui Zhang, et al.

Recent multi-modal contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers, by leveraging the rich information in large-scale image-caption datasets. Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language. The traditional process of diagnosing model behaviors in deployment settings involves labor-intensive data acquisition and annotation. Our proposed method can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors, without requiring any visual data. Through a combination of theoretical explanation and empirical verification, we present conditions under which classifiers trained on embeddings from one modality can be equivalently applied to embeddings from another modality. On a range of image datasets with known error slices, we demonstrate that our method can effectively identify the error slices and influential attributes, and can further use language to rectify failure modes of the classifier.
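The cross-modal transfer idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the embeddings are synthetic, and the sketch assumes an idealized shared space in which "image" and "text" embeddings of a class both cluster around the same prototype. All names (`make_embeddings`, `fit_linear_classifier`, `predict`) are hypothetical. A linear classifier is fit on the image-side embeddings only and then evaluated on text-side embeddings, which is the sense in which language could stand in for visual data during diagnosis.

```python
import numpy as np


def make_embeddings(prototypes, labels, noise, rng):
    """Sample unit-norm embeddings scattered around per-class prototypes."""
    dim = prototypes.shape[1]
    x = prototypes[labels] + noise * rng.normal(size=(len(labels), dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def fit_linear_classifier(x, y, num_classes, ridge=1e-3):
    """Ridge-regularized least-squares fit to one-hot targets (no bias term)."""
    targets = np.eye(num_classes)[y]
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def predict(weights, x):
    """Return the class with the highest linear score for each row of x."""
    return np.argmax(x @ weights, axis=1)


rng = np.random.default_rng(0)
num_classes, dim = 3, 16

# Assumption: one shared prototype per class for both modalities.
prototypes = rng.normal(size=(num_classes, dim))

# Stand-in "image" embeddings: noisier, used only for training.
image_labels = rng.integers(0, num_classes, size=300)
image_emb = make_embeddings(prototypes, image_labels, noise=0.3, rng=rng)
weights = fit_linear_classifier(image_emb, image_labels, num_classes)

# Stand-in "text" embeddings: never seen in training, used only for probing.
text_labels = np.repeat(np.arange(num_classes), 20)
text_emb = make_embeddings(prototypes, text_labels, noise=0.1, rng=rng)

text_acc = float(np.mean(predict(weights, text_emb) == text_labels))
print(f"accuracy of the image-trained classifier on text embeddings: {text_acc:.2f}")
```

In a real contrastive model the two modalities are not guaranteed to coincide this cleanly, so the sketch shows only the mechanics of probing an image-trained classifier with text embeddings, not the conditions the paper establishes.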

research
08/22/2023

Ceci n'est pas une pomme: Adversarial Illusions in Multi-Modal Embeddings

Multi-modal encoders map images, sounds, texts, videos, etc. into a sing...
research
11/21/2022

Unifying Vision-Language Representation Space with Single-tower Transformer

Contrastive learning is a form of distance learning that aims to learn i...
research
06/08/2023

Multi-Modal Classifiers for Open-Vocabulary Object Detection

The goal of this paper is open-vocabulary object detection (OVOD) – ...
research
07/01/2023

ProbVLM: Probabilistic Adapter for Frozen Vison-Language Models

Large-scale vision-language models (VLMs) like CLIP successfully find co...
research
09/01/2022

Universal Multi-Modality Retrieval with One Unified Embedding Space

This paper presents Vision-Language Universal Search (VL-UnivSearch), wh...
research
04/23/2019

Multi-modal 3D Shape Reconstruction Under Calibration Uncertainty using Parametric Level Set Methods

We consider the problem of 3D shape reconstruction from multi-modal data...
research
07/26/2023

Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models

The rapid growth and increasing popularity of incorporating additional m...
