M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

11/02/2022
by Layne Berry, et al.

This work investigates the use of large-scale, pre-trained models (CLIP and HuBERT) for multilingual speech-image retrieval. For non-English speech-image retrieval, we outperform the current state of the art by a wide margin when training separate models for each language, and show that a single model that processes speech in all three languages still achieves retrieval scores comparable to the prior state of the art. We identify key differences in model behavior and performance between English and non-English settings, presumably attributable to the English-only pre-training of CLIP and HuBERT. Finally, we show that our models can be used for mono- and cross-lingual speech-text retrieval and cross-lingual speech-speech retrieval, despite never having seen any parallel speech-text or speech-speech data during training.


research
06/02/2023

DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model

Multilingual self-supervised speech representation models have greatly e...
research
01/10/2022

Why-So-Deep: Towards Boosting Previously Trained Models for Visual Place Recognition

Deep learning-based image retrieval techniques for the loop closure dete...
research
09/04/2023

NLLB-CLIP – train performant multilingual image retrieval model on a budget

Today, the exponential rise of large models developed by academic and in...
research
10/03/2022

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

Data-driven speech processing models usually perform well with a large a...
research
03/28/2022

Large-scale Bilingual Language-Image Contrastive Learning

This paper is a technical report to share our experience and findings bu...
research
05/04/2022

Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models

The success of multilingual pre-trained models is underpinned by their a...
research
04/09/2018

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech

In this paper, we explore the learning of neural network embeddings for ...
