LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation

09/04/2021
by   Mohammad Abuzar Shaikh, et al.
0

Pre-training visual and textual representations from large-scale image-text pairs is becoming a standard approach for many downstream vision-language tasks. The transformer-based models learn inter and intra-modal attention through a list of self-supervised learning tasks. This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. The main module, Visual Textual Alignment (VTA) will be assisted by two auxiliary tasks, GAN-based image synthesis and Image Captioning. We also propose a new evaluation metric measuring the similarity between the learnt visual and textual embedding. The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment in the joint feature embedding space

READ FULL TEXT

page 1

page 8

page 11

page 12

page 13

research
03/17/2022

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

Vision-Language Pre-training (VLP) has achieved impressive performance o...
research
01/11/2020

MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding

Visual-semantic embedding enables various tasks such as image-text retri...
research
12/01/2019

Integrate Image Representation to Text Model on Sentence Level: a Semi-supervised Framework

Integrating visual features has been proved useful in language represent...
research
10/11/2022

Analyzing Text Representations under Tight Annotation Budgets: Measuring Structural Alignment

Annotating large collections of textual data can be time consuming and e...
research
12/14/2021

CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations

We propose CLIP-Lite, an information efficient method for visual represe...
research
12/09/2021

Self-Supervised Image-to-Text and Text-to-Image Synthesis

A comprehensive understanding of vision and language and their interrela...
research
04/04/2020

Evaluating Multimodal Representations on Visual Semantic Textual Similarity

The combination of visual and textual representations has produced excel...

Please sign up or login with your details

Forgot password? Click here to reset