T-VSE: Transformer-Based Visual Semantic Embedding

05/17/2020
by Muhammet Bastan, et al.

Transformer models have recently achieved impressive performance on NLP tasks, owing to new algorithms for self-supervised pre-training on very large text corpora. In contrast, recent literature suggests that simple word-average models outperform more complicated language models, such as RNNs and Transformers, on cross-modal image/text search tasks on standard benchmarks like MS COCO. In this paper, we show that dataset scale and training strategy are critical, and we demonstrate that transformer-based cross-modal embeddings outperform word-average and RNN-based embeddings by a large margin when trained on a large dataset of e-commerce product image-title pairs.
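
For readers unfamiliar with the word-average baseline mentioned above, the following is a minimal sketch of the general idea, not the paper's implementation. It assumes text and images have already been mapped into a shared vector space; the toy vectors, the function names (`average_word_embedding`, `cosine`, `rank_images`), and the two-dimensional embeddings are all hypothetical and chosen only for illustration.

```python
import math


def average_word_embedding(tokens, word_vectors):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    dim = len(next(iter(word_vectors.values())))
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_images(query_tokens, word_vectors, image_embeddings):
    """Return image names sorted by similarity to the averaged query, best first."""
    query = average_word_embedding(query_tokens, word_vectors)
    return sorted(
        image_embeddings,
        key=lambda name: cosine(query, image_embeddings[name]),
        reverse=True,
    )


# Hypothetical toy data: 2-d word vectors and precomputed image embeddings.
WORD_VECTORS = {"red": [1.0, 0.0], "shoe": [0.0, 1.0], "blue": [-1.0, 0.0]}
IMAGE_EMBEDDINGS = {
    "img_red_shoe": [0.7, 0.7],
    "img_blue_shoe": [-0.7, 0.7],
}

ranking = rank_images(["red", "shoe"], WORD_VECTORS, IMAGE_EMBEDDINGS)
print(ranking)
```

In a transformer-based variant, `average_word_embedding` would be replaced by a learned text encoder, while the similarity-based ranking step would stay the same in spirit.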


