Multi-Modal Retrieval using Graph Neural Networks
Most real-world applications of image retrieval, such as Adobe Stock (a marketplace for stock photography and illustrations), need a way for users to find images that are both visually (i.e., aesthetically) and conceptually (i.e., containing the same salient objects) similar to a query image. Learning visual-semantic representations from images is a well-studied problem for image retrieval. Filtering based on image concepts or attributes is traditionally achieved with index-based filtering (e.g., on textual tags) or by re-ranking after an initial visual-embedding-based retrieval. In this paper, we learn a joint vision and concept embedding in the same high-dimensional space. This joint model gives the user fine-grained control over the semantics of the result set, allowing them to explore the catalog of images more rapidly. We model the visual and concept relationships as a graph structure, which captures rich information through node neighborhoods. This graph structure helps us learn multi-modal node embeddings using Graph Neural Networks. We also introduce a novel inference-time control, based on selective neighborhood connectivity, that gives the user control over the retrieval algorithm. We evaluate these multi-modal embeddings quantitatively on the downstream relevance task of image retrieval on the MS-COCO dataset, and qualitatively on MS-COCO and an Adobe Stock dataset.
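As a rough illustration of the approach the abstract describes, the following is a minimal sketch, not the authors' implementation: GraphSAGE-style mean-aggregation message passing over a bipartite image-concept graph, with a boolean edge mask standing in for the inference-time selective-neighborhood control. All class names, dimensions, and the toy graph are assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's code): multi-modal node embeddings
# from a bipartite image-concept graph via mean-aggregation message passing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphSAGELayer(nn.Module):
    """One mean-aggregation message-passing layer over a dense adjacency."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)

    def forward(self, x, adj, edge_mask=None):
        # adj: [N, N] 0/1 adjacency; edge_mask lets the caller drop edges
        # at inference time (selective neighborhood connectivity stand-in).
        if edge_mask is not None:
            adj = adj * edge_mask
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = adj @ x / deg                       # mean over neighbors
        return F.relu(self.lin(torch.cat([x, neigh], dim=1)))

class MultiModalGNN(nn.Module):
    """Embeds image nodes and concept nodes into one joint space."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.l1 = GraphSAGELayer(in_dim, hidden_dim)
        self.l2 = GraphSAGELayer(hidden_dim, out_dim)

    def forward(self, x, adj, edge_mask=None):
        h = self.l1(x, adj, edge_mask)
        z = self.l2(h, adj, edge_mask)
        return F.normalize(z, dim=1)                # unit-norm embeddings

# Toy graph: nodes 0-3 are images, nodes 4-6 are concepts (random features).
N, D = 7, 16
x = torch.randn(N, D)
adj = torch.zeros(N, N)
for i, c in [(0, 4), (1, 4), (1, 5), (2, 5), (3, 6)]:  # image-concept edges
    adj[i, c] = adj[c, i] = 1.0

model = MultiModalGNN(D, 32, 8)
z = model(x, adj)                                   # joint embeddings

# Retrieval: rank catalog images by cosine similarity to a query image node.
query, catalog = z[0], z[1:4]
print(catalog @ query)                              # cosine scores (unit norm)

# Inference-time control: drop all edges incident to concept node 5, so
# retrieval ignores that concept's neighborhood without retraining.
mask = torch.ones(N, N)
mask[5, :] = 0.0
mask[:, 5] = 0.0
z_controlled = model(x, adj, edge_mask=mask)
```

Masking edges at inference time changes which neighborhoods contribute to each embedding without retraining, which is the spirit of the selective-connectivity control the abstract describes.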