Audio-free Prompt Tuning for Language-Audio Models

09/15/2023
by Yiming Li, et al.

Contrastive Language-Audio Pretraining (CLAP) is pre-trained to associate audio features with human language, making it a natural zero-shot classifier for recognizing unseen sound categories. To adapt CLAP to downstream tasks, prior works inevitably require labeled in-domain audio, which limits their scalability under data scarcity and deprives them of the original CLAP's ability to detect novel classes. In this work, by leveraging the modality alignment in CLAP, we propose an efficient audio-free prompt tuning scheme that optimizes a few prompt tokens from text instead of audio, which also regularizes the model space to avoid overfitting to the seen classes. Building on this, we further explore a multi-grained prompt design that fuses global and local information. Experiments on several tasks demonstrate that our approach improves CLAP and outperforms other training methods in both model performance and training efficiency. When conducting zero-shot inference on unseen categories, it still shows better transferability than the vanilla CLAP. Moreover, our method is flexible enough to be applied even when only the downstream class names are known. The code will be released soon.
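To make the mechanism concrete, below is a minimal PyTorch sketch of what such audio-free prompt tuning could look like. This is not the authors' released code: the interfaces (`clap_text_encoder` accepting pre-computed token embeddings, `token_embedder`, `frozen_encode_text`) and the use of text embeddings of class descriptions as surrogates for audio features are illustrative assumptions, motivated by the abstract's point that CLAP's modality alignment lets text stand in for audio.

```python
# Hypothetical sketch of audio-free prompt tuning for a CLAP-like model.
# Assumption: because CLAP aligns audio and text in one embedding space,
# text embeddings of sound descriptions can substitute for audio features,
# so a few prompt tokens can be tuned without any labeled audio.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioFreePromptTuner(nn.Module):
    def __init__(self, clap_text_encoder, token_embedder, class_token_ids,
                 n_prompt_tokens=8, embed_dim=512):
        super().__init__()
        self.text_encoder = clap_text_encoder    # frozen CLAP text tower
        self.token_embedder = token_embedder     # frozen token embedding layer
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        # The only trainable parameters: a few shared prompt tokens.
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)
        self.class_token_ids = class_token_ids   # [num_classes, seq_len]

    def class_embeddings(self):
        # Prepend the learned prompt tokens to each class name's token
        # embeddings, then encode with the frozen text tower (assumed here
        # to accept pre-embedded sequences and return pooled features).
        name_emb = self.token_embedder(self.class_token_ids)              # [C, L, D]
        prompt = self.prompt.unsqueeze(0).expand(name_emb.size(0), -1, -1)
        feats = self.text_encoder(torch.cat([prompt, name_emb], dim=1))   # [C, D]
        return F.normalize(feats, dim=-1)

def prompt_tuning_step(tuner, frozen_encode_text, desc_token_ids, labels,
                       optimizer, tau=0.07):
    """One training step using text descriptions as audio surrogates.
    desc_token_ids: tokenized natural-language descriptions of sounds;
    labels: the class index each description belongs to."""
    with torch.no_grad():
        # Surrogate "audio" features: frozen text embeddings of descriptions.
        surrogate = F.normalize(frozen_encode_text(desc_token_ids), dim=-1)
    logits = surrogate @ tuner.class_embeddings().t() / tau
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since the encoders stay frozen, the optimizer would wrap only the prompt tokens, e.g. `torch.optim.Adam([tuner.prompt], lr=1e-3)`, which is what keeps tuning cheap and helps avoid overfitting to the seen classes.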


