PV2TEA: Patching Visual Modality to Textual-Established Information Extraction

06/01/2023
by Hejie Cui, et al.

Information extraction, e.g., attribute value extraction, has been extensively studied and formulated based only on text. However, many attributes, such as color, shape, and pattern, can benefit from image-based extraction. The visual modality has long been underutilized, mainly due to the difficulty of multimodal annotation. In this paper, we aim to patch the visual modality onto a textual-established attribute information extractor. This cross-modality integration faces several unique challenges: (C1) images and textual descriptions are loosely paired, both within and across samples; (C2) images usually contain rich backgrounds that can mislead prediction; (C3) weakly supervised labels from textual-established extractors are biased for multimodal training. We present PV2TEA, an encoder-decoder architecture equipped with three bias-reduction schemes: (S1) augmented label-smoothed contrast, which improves cross-modality alignment for loosely paired images and text; (S2) attention pruning, which adaptively distinguishes the visual foreground; (S3) two-level neighborhood regularization, which mitigates textual label bias via reliability estimation. Empirical results on real-world e-Commerce datasets demonstrate an increase of up to 11.74 over unimodal baselines.
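To make scheme (S1) concrete, the sketch below shows a label-smoothed symmetric contrastive loss in the CLIP style: instead of forcing a hard one-to-one match between each image and its paired text, the one-hot targets are softened so that loosely paired samples are not over-penalized. This is an illustrative sketch only; the function name, hyperparameters, and exact formulation are assumptions, not the paper's implementation.

```python
import numpy as np

def smoothed_contrastive_loss(img_emb, txt_emb, temperature=0.07, smoothing=0.1):
    """Symmetric (image->text and text->image) InfoNCE loss with
    label smoothing over the in-batch pairing targets.

    Illustrative of scheme S1 only; names and defaults are assumptions.
    img_emb, txt_emb: (N, D) arrays of paired image/text embeddings.
    """
    # L2-normalize embeddings so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    n = logits.shape[0]
    # Smoothed targets: (1 - smoothing) on the matched pair (diagonal),
    # the remaining mass spread uniformly over non-matched pairs.
    targets = np.full((n, n), smoothing / (n - 1))
    np.fill_diagonal(targets, 1.0 - smoothing)

    def xent(lg):
        # Numerically stable cross-entropy against the smoothed targets
        m = lg.max(axis=1, keepdims=True)
        lse = np.log(np.exp(lg - m).sum(axis=1, keepdims=True)) + m
        log_probs = lg - lse
        return -(targets * log_probs).sum(axis=1).mean()

    # Average the two retrieval directions, as in CLIP-style training
    return 0.5 * (xent(logits) + xent(logits.T))
```

With orthonormal embeddings, correctly matched pairs yield a lower loss than deliberately mismatched ones, while the smoothing term keeps the penalty on non-matched pairs bounded.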


