AudioCLIP: Extending CLIP to Image, Text and Audio

06/24/2021
by   Andrey Guzhov, et al.

The rapidly evolving field of sound classification has long benefited from methods borrowed from other domains. Today, the trend is to fuse domain-specific tasks and approaches, which provides the community with strong new models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset. This combination enables the proposed model to perform bimodal and unimodal classification and querying while keeping CLIP's ability to generalize to unseen datasets in a zero-shot fashion. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07%. Further, it sets new baselines in the zero-shot ESC task on the same datasets (68.78%). Finally, we also assess the cross-modal querying performance of the proposed model, as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
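The zero-shot classification mentioned in the abstract can be illustrated with a minimal sketch in the general CLIP style: embed the audio clip and a text prompt for each candidate label, then pick the label whose text embedding has the highest cosine similarity to the audio embedding. This is not the authors' code. The function names, the toy embeddings, and the `scale` value are assumptions made for illustration; real embeddings would come from the trained audio and text encoders.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_classify(audio_emb, text_emb, labels, scale=100.0):
    """For each audio clip, pick the label whose text embedding is most similar.

    `scale` is an assumed temperature-like factor applied before the softmax.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = scale * (a @ t.T)                     # shape: (n_clips, n_labels)
    logits -= logits.max(axis=1, keepdims=True)    # subtract row max for stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax over the labels
    predicted = [labels[i] for i in probs.argmax(axis=1)]
    return predicted, probs


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
labels = ["dog bark", "siren", "rain"]
text_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
audio_emb = np.array([[0.9, 0.1, 0.0],
                      [0.1, 0.2, 0.95]])

predicted, probs = zero_shot_classify(audio_emb, text_emb, labels)
print(predicted)
```

Because no classifier is trained on the target labels, the label set can be swapped at inference time simply by embedding a different list of text prompts.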

research
05/03/2023

Unsupervised Improvement of Audio-Text Cross-Modal Representations

Recent advances in using language models to obtain cross-modal audio-tex...
research
06/21/2023

A Multimodal Prototypical Approach for Unsupervised Sound Classification

In the context of environmental sound classification, the adaptability o...
research
04/15/2020

ESResNet: Environmental Sound Classification Based on Visual Domain Models

Environmental Sound Classification (ESC) is an active research area in t...
research
04/23/2021

ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio

Environmental Sound Classification (ESC) is a rapidly evolving field tha...
research
09/17/2023

Zero- and Few-shot Sound Event Localization and Detection

Sound event localization and detection (SELD) systems estimate direction...
research
04/05/2022

MetaAudio: A Few-Shot Audio Classification Benchmark

Currently available benchmarks for few-shot learning (machine learning w...
research
10/22/2020

Urban Sound Classification : striving towards a fair comparison

Urban sound classification has been achieving remarkable progress and is...
