FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo Embeddings

08/17/2023
by Yulin Su, et al.

Logo embedding plays a crucial role in e-commerce applications such as intellectual property protection and product search by facilitating image retrieval and recognition. However, current methods treat logo embedding as a purely visual problem, which may limit their performance in real-world scenarios; in particular, the textual knowledge embedded in logo images has not been adequately explored. We therefore propose a novel approach that leverages textual knowledge as auxiliary information to improve the robustness of logo embeddings. Emerging Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in both visual and textual understanding and could serve as valuable visual assistants for understanding logo images. Inspired by this observation, our proposed method, FashionLOGO, utilizes MLLMs to enhance fashion logo embedding. We explore how MLLMs can improve logo embeddings by prompting them, in a zero-shot setting, to generate explicit textual knowledge through three types of prompts: image OCR, brief caption, and detailed description. We adopt a cross-attention transformer that enables image embedding queries to automatically learn supplementary knowledge from the textual embeddings. To reduce computational costs, we use only the image embedding model at inference time, matching traditional inference pipelines. Extensive experiments on three real-world datasets demonstrate that FashionLOGO learns generalized and robust logo embeddings, achieving state-of-the-art performance on all benchmark datasets. We also conduct comprehensive ablation studies to quantify the performance improvements brought by the introduction of MLLMs.
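
As a rough illustration of the prompting step described in the abstract, the sketch below shows how the three kinds of textual knowledge (OCR, brief caption, detailed description) might be collected from an MLLM in a zero-shot setting. The prompt wording and the `mllm.generate` interface are illustrative assumptions, not the paper's actual templates or API.

```python
# Hypothetical prompt templates for the three zero-shot queries; the exact
# wording used by FashionLOGO is not given here.
PROMPTS = {
    "ocr": "Read and transcribe any text visible in this logo image.",
    "brief_caption": "Describe this logo image in one short sentence.",
    "detailed_description": (
        "Describe this logo image in detail, including its shapes, "
        "colors, and any text it contains."
    ),
}

def query_mllm(mllm, image, prompts=PROMPTS):
    """Collect the three kinds of textual knowledge from an MLLM
    in a zero-shot setting (mllm.generate is an assumed interface)."""
    return {name: mllm.generate(image=image, prompt=text)
            for name, text in prompts.items()}
```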
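
The cross-attention fusion can likewise be sketched in PyTorch: tokens from the image encoder act as queries over the MLLM text embeddings, letting the image branch absorb supplementary textual knowledge. The embedding dimension, head count, and the residual-plus-LayerNorm arrangement here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Minimal sketch: image-embedding queries attend to MLLM text embeddings."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_tokens):
        # img_tokens:  (B, N_img, dim) from the image encoder
        # text_tokens: (B, N_txt, dim) from the MLLM outputs
        #              (OCR, brief caption, detailed description)
        fused, _ = self.attn(query=img_tokens, key=text_tokens, value=text_tokens)
        return self.norm(img_tokens + fused)  # residual connection

# Example shapes:
#   fusion = CrossAttentionFusion()
#   out = fusion(torch.randn(2, 197, 768), torch.randn(2, 64, 768))  # (2, 197, 768)
```

Consistent with the abstract, a module like this would be used only during training; at inference time only the image embedding model is kept, so the fusion adds no deployment cost.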

Related research

09/19/2023 · Language as the Medium: Multimodal Video Classification through text only
Despite an exciting new wave of multimodal machine learning models, curr...

09/02/2023 · Zero-Shot Recommendations with Pre-Trained Large Language Models for Multimodal Nudging
We present a method for zero-shot recommendation of multimodal non-stati...

03/30/2016 · Unsupervised Visual Sense Disambiguation for Verbs using Multimodal Embeddings
We introduce a new task, visual sense disambiguation for verbs: given an...

05/23/2023 · EDIS: Entity-Driven Image Search over Multimodal Web Content
Making image retrieval methods practical for real-world search applicati...

06/15/2019 · Joint Visual-Textual Embedding for Multimodal Style Search
We introduce a multimodal visual-textual search refinement method for fa...

05/10/2023 · Combo of Thinking and Observing for Outside-Knowledge VQA
Outside-knowledge visual question answering is a challenging task that r...

05/24/2023 · SETI: Systematicity Evaluation of Textual Inference
We propose SETI (Systematicity Evaluation of Textual Inference), a novel...
