Contrastive Visual Semantic Pretraining Magnifies the Semantics of Natural Language Representations

03/14/2022
by Robert Wolfe, et al.

We examine the effects of contrastive visual semantic pretraining by comparing the geometry and semantic properties of contextualized English language representations formed by GPT-2 and CLIP, a zero-shot multimodal image classifier which adapts the GPT-2 architecture to encode image captions. We find that contrastive visual semantic pretraining significantly mitigates the anisotropy found in contextualized word embeddings from GPT-2, such that the intra-layer self-similarity (mean pairwise cosine similarity) of CLIP word embeddings is under .25 in all layers, compared to greater than .95 in the top layer of GPT-2. CLIP word embeddings outperform GPT-2 on word-level semantic intrinsic evaluation tasks, and achieve a new corpus-based state of the art for the RG65 evaluation, at .88. CLIP also forms fine-grained semantic representations of sentences, and obtains Spearman's rho = .73 on the SemEval-2017 Semantic Textual Similarity Benchmark with no fine-tuning, compared to no greater than rho = .45 in any layer of GPT-2. Finally, intra-layer self-similarity of CLIP sentence embeddings decreases as the layer index increases, finishing at .25 in the top layer, while the self-similarity of GPT-2 sentence embeddings formed using the EOS token increases layer-over-layer and never falls below .97. Our results indicate that high anisotropy is not an inevitable consequence of contextualization, and that visual semantic pretraining is beneficial not only for ordering visual representations, but also for encoding useful semantic representations of language, both on the word level and the sentence level.
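As a rough illustration of the self-similarity metric described above, the sketch below computes the mean pairwise cosine similarity of contextualized token embeddings in each layer of GPT-2 using Hugging Face transformers. This is not the authors' code: the checkpoint, sample sentences, and the suggestion to swap in CLIP's caption encoder are illustrative assumptions.

# Minimal sketch (assumption: not the paper's code) of intra-layer
# self-similarity: the mean pairwise cosine similarity of contextualized
# embeddings within each layer. Checkpoint and sentences are illustrative.
import torch
from transformers import GPT2Tokenizer, GPT2Model

def mean_pairwise_cosine(vectors: torch.Tensor) -> float:
    """Mean cosine similarity over all pairs of distinct row vectors."""
    normed = torch.nn.functional.normalize(vectors, dim=-1)
    sims = normed @ normed.T                  # (n, n) cosine-similarity matrix
    n = sims.shape[0]
    off_diagonal = sims.sum() - sims.trace()  # drop the n self-similarities
    return (off_diagonal / (n * (n - 1))).item()

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

sentences = [
    "A dog runs across the field.",    # illustrative inputs, not the study corpus
    "The market closed higher today.",
]

per_layer = None
with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        hidden = model(**inputs).hidden_states  # tuple of (1, seq_len, dim), one entry per layer
        if per_layer is None:
            per_layer = [[] for _ in hidden]
        for i, layer_states in enumerate(hidden):
            per_layer[i].append(layer_states.squeeze(0))

for i, chunks in enumerate(per_layer):
    tokens = torch.cat(chunks, dim=0)            # all token embeddings for this layer
    print(f"layer {i}: self-similarity = {mean_pairwise_cosine(tokens):.3f}")

The same loop can be pointed at CLIP's caption encoder (for instance, transformers.CLIPTextModel with CLIPTokenizer) to compare the two models; for the sentence-level version of the metric, replace the per-token embeddings with the vector at each sentence's EOS position.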

Related research

03/13/2018
Enhanced Word Representations for Bridging Anaphora Resolution
Most current models of word representations (e.g., GloVe) have successfull...

03/27/2019
Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences oft...

10/06/2020
Using Sentences as Semantic Representations in Large Scale Zero-Shot Learning
Zero-shot learning aims to recognize instances of unseen classes, for wh...

06/16/2023
CMLM-CSE: Based on Conditional MLM Contrastive Learning for Sentence Embeddings
Traditional comparative learning sentence embedding directly uses the en...

04/11/2019
Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations
We propose the Unified Visual-Semantic Embeddings (Unified VSE) for lear...

03/14/2022
VAST: The Valence-Assessing Semantics Test for Contextualizing Language Models
VAST, the Valence-Assessing Semantics Test, is a novel intrinsic evaluat...

05/23/2022
Markedness in Visual Semantic AI
We evaluate the state-of-the-art multimodal "visual semantic" model CLIP...
