Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

08/23/2023
by Daiki Takeuchi, et al.

We propose Audio Difference Captioning (ADC), a new extension of audio captioning that describes the semantic differences between a pair of similar but slightly different audio clips. ADC addresses a limitation of conventional audio captioning, which sometimes generates similar captions for similar audio clips and thus fails to describe how their content differs. We also propose a cross-attention-concentrated transformer encoder, which extracts differences by comparing the two clips, and a similarity-discrepancy disentanglement, which emphasizes the difference in the latent space. To evaluate the proposed methods, we built the AudioDiffCaps dataset, which consists of pairs of similar but slightly different audio clips with human-annotated descriptions of their differences. Experiments on the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively, and visualization of the attention weights in the transformer encoder showed that they improve the attention weights for extracting the difference.
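The abstract does not say how the cross-attention-concentrated encoder is built. One plausible reading is that, when the two clips are concatenated into a single sequence, attention is restricted so that each frame attends only to frames of the other clip. The NumPy sketch below illustrates that reading under stated assumptions: the function names `cross_only_mask` and `masked_attention`, the single-head formulation, and the absence of learned projections are all hypothetical simplifications, not the authors' implementation.

```python
import numpy as np


def cross_only_mask(len_a: int, len_b: int) -> np.ndarray:
    """Boolean mask over the concatenated sequence [clip A | clip B].

    True means attention is allowed. Only cross-clip pairs are allowed,
    so a frame never attends to frames of its own clip (an assumption
    made for this illustration).
    """
    n = len_a + len_b
    is_a = np.arange(n) < len_a           # True for positions from clip A
    return is_a[:, None] != is_a[None, :]  # allowed only when clips differ


def masked_attention(x: np.ndarray, mask: np.ndarray):
    """Single-head scaled dot-product self-attention without learned weights.

    x    : (n, d) frame embeddings of the concatenated clips
    mask : (n, n) boolean, True where attention is allowed
    Returns the attended output (n, d) and the attention weights (n, n).
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)      # block same-clip pairs
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


# Toy input: 3 frames from clip A followed by 2 frames from clip B.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
mask = cross_only_mask(3, 2)
out, w = masked_attention(x, mask)
```

In this toy setup the within-clip blocks of `w` are exactly zero, so every output frame is a weighted mixture of frames from the other clip. In a framework such as PyTorch the same idea would typically be expressed through an attention mask passed to the encoder layers.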

research
09/15/2023

Audio Difference Learning for Audio Captioning

This study introduces a novel training paradigm, audio difference learni...
research
07/20/2022

Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval

The amount of audio data available on public websites is growing rapidly...
research
07/09/2020

Multi-task Regularization Based on Infrequent Classes for Audio Captioning

Audio captioning is a multi-modal task, focusing on using natural langua...
research
09/18/2023

RECAP: Retrieval-Augmented Audio Captioning

We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and eff...
research
09/06/2023

Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation

There has been significant research on developing pretrained transformer...
research
05/31/2019

What does a Car-ssette tape tell?

Captioning has attracted much attention in image and video understanding...
research
05/14/2018

The Spot the Difference corpus: a multi-modal corpus of spontaneous task oriented spoken interactions

This paper describes the Spot the Difference Corpus which contains 54 in...
