Self-Supervised Image-to-Text and Text-to-Image Synthesis

12/09/2021
by   Anindya Sundar Das, et al.
0

A comprehensive understanding of vision and language and their interrelation are crucial to realize the underlying similarities and differences between these modalities and to learn more generalized, meaningful representations. In recent years, most of the works related to Text-to-Image synthesis and Image-to-Text generation, focused on supervised generative deep architectures to solve the problems, where very little interest was placed on learning the similarities between the embedding spaces across modalities. In this paper, we propose a novel self-supervised deep learning based approach towards learning the cross-modal embedding spaces; for both image to text and text to image generations. In our approach, we first obtain dense vector representations of images using StackGAN-based autoencoder model and also dense vector representations on sentence-level utilizing LSTM based text-autoencoder; then we study the mapping from embedding space of one modality to embedding space of the other modality utilizing GAN and maximum mean discrepancy based generative networks. We, also demonstrate that our model learns to generate textual description from image data as well as images from textual data both qualitatively and quantitatively.

READ FULL TEXT

page 9

page 10

research
03/25/2021

Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting

Self-supervised learning has gained prominence due to its efficacy at le...
research
06/10/2021

Cross-Modal Discrete Representation Learning

Recent advances in representation learning have demonstrated an ability ...
research
09/04/2021

LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation

Pre-training visual and textual representations from large-scale image-t...
research
03/29/2016

Learning a Predictable and Generative Vector Representation for Objects

What is a good vector representation of an object? We believe that it sh...
research
06/27/2023

Semi-supervised Multimodal Representation Learning through a Global Workspace

Recent deep learning models can efficiently combine inputs from differen...
research
10/15/2019

Target-Oriented Deformation of Visual-Semantic Embedding Space

Multimodal embedding is a crucial research topic for cross-modal underst...
research
11/28/2022

CLIP2GAN: Towards Bridging Text with the Latent Space of GANs

In this work, we are dedicated to text-guided image generation and propo...

Please sign up or login with your details

Forgot password? Click here to reset