Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

09/16/2023
by   Kaiyi Luo, et al.
0

Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, has posed a great challenge due to the difficulty to uncover discriminative features from audio clips and texts. Existing studies are restricted in the following two ways: 1) Most researchers utilize contrastive learning to construct a common subspace where similarities among data can be measured. However, they considers only cross-modal transformation, neglecting the intra-modal separability. Besides, the temperature parameter is not adaptively adjusted along with semantic guidance, which degrades the performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, the latent representation reconstruction modules are embedded into the CMR framework, which improves modal interaction. Experiments in comparison with some state-of-the-art methods on two audio-text datasets have validated the superiority of CLSR.

READ FULL TEXT
research
09/18/2019

Deep Latent Space Learning for Cross-modal Mapping of Audio and Visual Signals

We propose a novel deep training algorithm for joint representation of a...
research
07/28/2023

Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions

Most existing audio-text retrieval (ATR) methods focus on constructing c...
research
07/11/2022

LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval

Video-text retrieval is a class of cross-modal representation learning p...
research
05/02/2023

TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis

In this paper, we present TMR, a simple yet effective approach for text ...
research
11/21/2022

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

Most video-and-language representation learning approaches employ contra...
research
05/18/2023

CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training

Improving text representation has attracted much attention to achieve ex...
research
09/05/2022

Design of the topology for contrastive visual-textual alignment

Pre-training weakly related image-text pairs in the contrastive style sh...

Please sign up or login with your details

Forgot password? Click here to reset