AnoVL: Adapting Vision-Language Models for Unified Zero-shot Anomaly Localization

08/30/2023
by Hanqiu Deng, et al.

Contrastive Language-Image Pre-training (CLIP) models have shown promising performance on zero-shot visual recognition tasks by learning visual representations under natural language supervision. Recent studies attempt to use CLIP for zero-shot anomaly detection by matching images with normal and abnormal state prompts. However, because CLIP is trained to align paired text prompts with global image-level representations, the lack of patch-level vision-to-text alignment limits its precision in visual anomaly localization. In this work, we introduce a training-free adaptation (TFA) framework that adapts CLIP for zero-shot anomaly localization. In the visual encoder, we design a training-free value-wise attention mechanism that extracts CLIP's intrinsic local tokens for patch-level description. On the text-supervision side, we design a unified, domain-aware contrastive state prompting template. On top of the proposed TFA, we further introduce a test-time adaptation (TTA) mechanism that refines the localization results: a single layer of trainable adapter parameters is optimized using TFA's pseudo-labels together with synthetic noise-corrupted tokens. With both TFA and TTA, we substantially unlock the potential of CLIP for zero-shot anomaly localization and demonstrate the effectiveness of the proposed methods on various datasets.
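To make the two adaptation stages concrete, the following is a minimal PyTorch sketch of the ideas the abstract describes, assuming a CLIP-style encoder whose patch tokens have already been extracted. The function names (value_wise_attention, anomaly_map, tta_refine), the exact prompt wording, the noise scale, and the optimization details are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def value_wise_attention(v, proj):
        # Replace query-key attention with value-to-value attention so each
        # patch token aggregates visually similar patches, preserving local
        # semantics instead of collapsing into the global image representation.
        attn = torch.softmax(v @ v.transpose(-2, -1) / v.shape[-1] ** 0.5, dim=-1)
        return (attn @ v) @ proj  # project aggregated values back to token space

    def anomaly_map(patch_tokens, text_normal, text_abnormal):
        # Score each patch by contrasting its cosine similarity to the normal
        # and abnormal state prompt embeddings (a two-class softmax, as in CLIP).
        patch = F.normalize(patch_tokens, dim=-1)                      # (N, D)
        text = F.normalize(torch.stack([text_normal, text_abnormal]), dim=-1)
        logits = 100.0 * patch @ text.t()                              # (N, 2)
        return logits.softmax(dim=-1)[:, 1]                           # P(abnormal) per patch

    # A domain-aware contrastive state prompt pair (wording is an assumption):
    normal_prompt = "a photo of a flawless {} for industrial inspection"
    abnormal_prompt = "a photo of a damaged {} for industrial inspection"

    def tta_refine(adapter, tokens, pseudo_labels, steps=5, lr=1e-3, noise=0.1):
        # Test-time adaptation: fit one trainable layer, e.g.
        # adapter = torch.nn.Linear(D, 1), to TFA's pseudo-labels, using
        # noise-corrupted tokens as synthetic anomalies; the noise scale and
        # step count here are illustrative choices.
        opt = torch.optim.Adam(adapter.parameters(), lr=lr)
        noisy = tokens + noise * torch.randn_like(tokens)
        x = torch.cat([tokens, noisy])
        y = torch.cat([pseudo_labels, torch.ones_like(pseudo_labels)])
        for _ in range(steps):
            loss = F.binary_cross_entropy_with_logits(adapter(x).squeeze(-1), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

In this sketch only the single adapter layer is updated at test time while the CLIP backbone stays frozen, which matches the training-free spirit of TFA; the refined per-patch logits would then replace or be fused with the TFA anomaly map.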


