Zero-Shot Audio Captioning via Audibility Guidance

09/07/2023
by   Tal Shaharabany, et al.
0

The task of audio captioning is similar in essence to tasks such as image and video captioning. However, it has received much less attention. We propose three desiderata for captioning audio – (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and the somewhat related (iii) audibility, which is the quality of being able to be perceived based only on audio. Our method is a zero-shot method, i.e., we do not learn to perform captioning. Instead, captioning occurs as an inference process that involves three networks that correspond to the three desired qualities: (i) A Large Language Model, in our case, for reasons of convenience, GPT-2, (ii) A model that provides a matching score between an audio file and a text, for which we use a multimodal matching network called ImageBind, and (iii) A text classifier, trained using a dataset we collected automatically by instructing GPT-4 with prompts designed to direct the generation of both audible and inaudible sentences. We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline, which lacks this objective.

READ FULL TEXT

page 1

page 4

research
07/22/2022

Zero-Shot Video Captioning with Evolving Pseudo-Tokens

We introduce a zero-shot video captioning method that employs two frozen...
research
12/14/2020

Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

The goal of audio captioning is to translate input audio into its descri...
research
03/30/2023

Prefix tuning for automated audio captioning

Audio captioning aims to generate text descriptions from environmental s...
research
09/06/2023

Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation

There has been significant research on developing pretrained transformer...
research
03/11/2023

ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

Natural Language Generation (NLG) accepts input data in the form of imag...
research
04/06/2023

Efficient Audio Captioning Transformer with Patchout and Text Guidance

Automated audio captioning is multi-modal translation task that aim to g...
research
07/05/2023

Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment

Dense video captioning, a task of localizing meaningful moments and gene...

Please sign up or login with your details

Forgot password? Click here to reset