Mixer: Image to Multi-Modal Retrieval Learning for Industrial Application

05/06/2023
by Zida Cheng, et al.

Cross-modal retrieval, where the query is an image and the doc is an item with both an image and a text description, is ubiquitous in e-commerce platforms and content-sharing social media, yet this important application has received little research attention. The task is challenging for several reasons: 1) a domain gap exists between query and doc images; 2) the doc's modalities must be aligned and fused; 3) training data collected from user behaviors is skewed and noisily labeled; 4) the system must serve a huge number of queries with timely responses over a large-scale candidate corpus. To this end, we propose Mixer, a novel scalable and efficient image-query-to-multi-modal retrieval learning paradigm that adaptively integrates multi-modal data, mines skewed and noisy data more effectively, and scales to high traffic. Mixer consists of three key ingredients. First, for the query and doc images, a shared encoder network followed by separate transformation networks accounts for their domain gap. Second, because the images and text in a multi-modal doc are not equally informative, we design a concept-aware modality fusion module that extracts high-level concepts from the text via a text-to-image attention mechanism. Last, and most importantly, we adopt a new data organization and training paradigm for single-modal-to-multi-modal retrieval: large-scale classification learning, which treats the single-modal query and the multi-modal doc as equivalent samples of certain classes. The data organization follows a weakly supervised manner, which handles the skewed data and noisy labels inherent in industrial systems. Learning such a large number of categories on real-world multi-modal data is non-trivial, and we design a specific learning strategy for it. Mixer achieves state-of-the-art performance on public datasets from industrial retrieval systems.
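The abstract's three ingredients can be made concrete with a short sketch. The PyTorch code below is a minimal illustration only, not the authors' released implementation: all module names, dimensions, the pooling choices, and the use of multi-head attention with text tokens as queries over image tokens are assumptions inferred from the abstract's description.

```python
# A minimal PyTorch sketch of the three components described in the
# abstract. All names, shapes, and design details are assumptions for
# illustration; this is not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderWithDomainTransforms(nn.Module):
    """Shared image encoder followed by separate query/doc transforms,
    accounting for the domain gap between query and doc images."""
    def __init__(self, backbone: nn.Module, dim: int = 512):
        super().__init__()
        self.backbone = backbone                    # shared CNN/ViT encoder
        self.query_transform = nn.Linear(dim, dim)  # query-side head
        self.doc_transform = nn.Linear(dim, dim)    # doc-side head

    def forward(self, image: torch.Tensor, is_query: bool) -> torch.Tensor:
        feat = self.backbone(image)                 # (B, dim)
        head = self.query_transform if is_query else self.doc_transform
        return F.normalize(head(feat), dim=-1)

class ConceptAwareFusion(nn.Module):
    """Concept-aware modality fusion: text tokens attend to image tokens
    (text-to-image attention) to extract high-level concepts, which are
    then combined with the pooled image representation."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, img_tokens: torch.Tensor,
                txt_tokens: torch.Tensor) -> torch.Tensor:
        # Text tokens as queries, image tokens as keys/values.
        concepts, _ = self.attn(txt_tokens, img_tokens, img_tokens)
        concept_vec = concepts.mean(dim=1)          # pooled concepts (B, dim)
        img_vec = img_tokens.mean(dim=1)            # pooled image (B, dim)
        fused = self.proj(torch.cat([img_vec, concept_vec], dim=-1))
        return F.normalize(fused, dim=-1)

class LargeScaleClassifier(nn.Module):
    """Large-scale classification view of retrieval: the single-modal
    query embedding and the multi-modal doc embedding are trained as
    equivalent samples of the same class."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim) * 0.01)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        logits = emb @ F.normalize(self.weight, dim=-1).t()
        return logits  # train with cross-entropy against class labels
```

In this sketch the query tower uses only the shared encoder with its query transform, while the doc tower fuses its image and text through the attention module; both embeddings then feed the same classifier during training, so query and doc samples of the same item share a class label.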


Related research

11/10/2021
SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval
We tackle the cross-modal retrieval problem, where the training is only ...

04/03/2017
AMC: Attention guided Multi-modal Correlation Learning for Image Search
Given a user's query, traditional image search systems rank images accor...

08/20/2021
Knowledge Perceived Multi-modal Pretraining in E-commerce
In this paper, we address multi-modal pretraining of product data in the...

07/30/2021
Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining
Nowadays, customers' demands for E-commerce are more diversified, which ...

08/06/2023
E-CLIP: Towards Label-efficient Event-based Open-world Understanding by CLIP
Contrastive Language-Image Pretraining (CLIP) has recently shown promisin...

09/19/2022
Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising
The advancement of communication technology and the popularity of th...

03/19/2017
Weakly-supervised DCNN for RGB-D Object Recognition in Real-World Applications Which Lack Large-scale Annotated Training Data
This paper addresses the problem of RGBD object recognition in real-worl...
