Dual-Path Convolutional Image-Text Embedding with Instance Loss

11/15/2017
by Zhedong Zheng, et al.

Matching images and sentences demands a fine-grained understanding of both modalities. In this paper, we propose a new system that discriminatively embeds images and text into a shared visual-textual space. Most existing works in this field apply a ranking loss to pull positive image/text pairs close together and push negative pairs apart. However, directly deploying the ranking loss makes network learning hard, since it must build the inter-modal relationship starting from two heterogeneous features. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on the unsupervised assumption that each image/text group can be viewed as a class, so the network can learn fine-grained differences from every image/text group. Experiments show that the instance loss offers a better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. In addition, existing works usually apply off-the-shelf features, i.e., word2vec and fixed visual features. As a minor contribution, this paper therefore constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to learn directly from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
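The two objectives named in the abstract can be illustrated with a small sketch. This is not the authors' released code: the function names, the classifier weights shared by both modalities, the cosine similarity, the margin of 0.2, and the summation over all in-batch negatives are assumptions chosen for illustration. The instance loss is written as a softmax classification in which every image/text group is its own class, and the ranking loss as a bidirectional hinge over matching and non-matching pairs.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity (assumed similarity measure for the ranking loss)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def instance_loss(image_embs, text_embs, labels, class_weights):
    """Classify each embedding into its own image/text group.

    labels[i] is the group index of pair i. Both modalities are scored
    against the same class_weights (an assumption of this sketch).
    """
    total = 0.0
    for img, txt, y in zip(image_embs, text_embs, labels):
        img_logits = [dot(w, img) for w in class_weights]
        txt_logits = [dot(w, txt) for w in class_weights]
        total += softmax_cross_entropy(img_logits, y)
        total += softmax_cross_entropy(txt_logits, y)
    return total / len(labels)


def ranking_loss(image_embs, text_embs, margin=0.2):
    """Bidirectional hinge loss; pair i of the two lists is the positive pair."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        pos = cosine(image_embs[i], text_embs[i])
        for j in range(n):
            if j == i:
                continue
            # Image i against a non-matching text j.
            total += max(0.0, margin - pos + cosine(image_embs[i], text_embs[j]))
            # Text i against a non-matching image j.
            total += max(0.0, margin - pos + cosine(image_embs[j], text_embs[i]))
    return total / n


# Toy batch: two image/text groups with 2-dimensional embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]
labels = [0, 1]
weights = [[1.0, 0.0], [0.0, 1.0]]

print("instance loss:", instance_loss(images, texts, labels, weights))
print("ranking loss:", ranking_loss(images, texts))
```

In this toy batch the matching pairs are already well separated, so the ranking loss is zero, while the instance loss still gives a nonzero classification signal within each modality. A training schedule in the spirit of the abstract would optimize the instance loss first and add the ranking loss afterwards; the exact schedule is not specified in this section.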


