IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining

03/06/2023
by Chihaya Matsuhira, et al.

Recently, large-scale Vision and Language (V&L) pretraining has become the standard backbone of many multimedia systems. While it shows remarkable performance even in unseen situations, it often behaves in ways that are not intuitive to humans. In particular, such models usually do not consider the pronunciation of their input, which humans rely on to understand language, especially when encountering unknown words. This paper therefore inserts a phonetic prior into Contrastive Language-Image Pretraining (CLIP), one of the V&L pretrained models, so that it accounts for pronunciation similarity among its pronunciation inputs. To achieve this, we first propose a phoneme embedding that uses the phoneme relationships defined by the International Phonetic Alphabet (IPA) chart as a phonetic prior. Next, by distilling the frozen CLIP text encoder, we train a pronunciation encoder that employs this IPA-based embedding. The proposed model, named IPA-CLIP, comprises the pronunciation encoder together with the original CLIP image and text encoders. Quantitative evaluation shows that the phoneme distribution in the embedding space represents phonetic relationships more accurately when the proposed phoneme embedding is used. Furthermore, on several multimodal retrieval tasks, we confirm that the proposed pronunciation encoder improves on the performance of the text encoder and handles nonsense words in a more phonetic manner than the text encoder. Finally, qualitative evaluation verifies that the pronunciation encoder correlates with human perception of pronunciation similarity.
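The abstract describes the training recipe only at a high level: a phoneme-sequence encoder is trained to mimic a frozen CLIP text encoder. The PyTorch sketch below is not the authors' implementation; the PronunciationEncoder class, its hyperparameters, the plain learnable phoneme table (which the paper instead initializes from IPA-chart relationships), and the cosine-distance distillation loss are all assumptions made for illustration. Only the idea of distilling a frozen CLIP text encoder into a pronunciation encoder comes from the abstract; it assumes OpenAI's `clip` package is installed.

```python
# Minimal sketch (not the authors' code): distill a frozen CLIP text encoder
# into a pronunciation encoder that consumes IPA phoneme-ID sequences.
# The phoneme inventory size, architecture, and loss are placeholder assumptions.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()                              # teacher stays frozen
for p in clip_model.parameters():
    p.requires_grad_(False)

class PronunciationEncoder(nn.Module):
    """Maps padded batches of phoneme-ID sequences into CLIP's joint embedding space."""
    def __init__(self, n_phonemes=128, dim=512, n_layers=4, n_heads=8, max_len=64):
        super().__init__()
        # IPA-CLIP would initialize this table from IPA-chart (articulatory) features;
        # here it is an ordinary learnable embedding with index 0 reserved for padding.
        self.phoneme_emb = nn.Embedding(n_phonemes, dim, padding_idx=0)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(dim, dim)

    def forward(self, phoneme_ids):            # phoneme_ids: (B, L) int64, 0 = padding
        x = self.phoneme_emb(phoneme_ids) + self.pos_emb[: phoneme_ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=(phoneme_ids == 0))
        return self.proj(x.mean(dim=1))        # (B, dim) pronunciation embedding

student = PronunciationEncoder().to(device)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(texts, phoneme_ids):
    """One distillation step: align the student's output with the frozen teacher's."""
    with torch.no_grad():
        target = clip_model.encode_text(clip.tokenize(texts).to(device)).float()
        target = target / target.norm(dim=-1, keepdim=True)
    pred = student(phoneme_ids.to(device))
    pred = pred / pred.norm(dim=-1, keepdim=True)
    loss = (1.0 - (pred * target).sum(dim=-1)).mean()   # cosine-distance distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher is frozen, the student learns to place phonetically transcribed words at the same points in the joint embedding space that CLIP assigns to their written forms, which is what lets the resulting pronunciation encoder be used alongside the unchanged CLIP image and text encoders.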
