M3PT: A Multi-Modal Model for POI Tagging

06/16/2023
by   Jingsong Yang, et al.

POI tagging aims to annotate a point of interest (POI) with informative tags, which facilitates many POI-related services such as search and recommendation. Most existing solutions neglect the significance of POI images and seldom fuse the textual and visual features of POIs, resulting in suboptimal tagging performance. In this paper, we propose a novel Multi-Modal Model for POI Tagging, namely M3PT, which achieves enhanced POI tagging by fusing the target POI's textual and visual features and precisely matching the resulting multi-modal representations. Specifically, we first devise a domain-adaptive image encoder (DIE) to obtain image embeddings aligned with the semantics of their gold tags. Then, in M3PT's text-image fusion module (TIF), the textual and visual representations are fully fused into the POIs' content embeddings for the subsequent matching. In addition, we adopt a contrastive learning strategy to further bridge the gap between the representations of different modalities. To evaluate tagging performance, we constructed two high-quality POI tagging datasets from the real-world business scenario of Ali Fliggy. On these datasets, we conducted extensive experiments to demonstrate our model's advantage over both uni-modal and multi-modal baselines, and to verify the effectiveness of M3PT's key components, including DIE, TIF, and the contrastive learning strategy.
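The abstract does not spell out the contrastive objective, but strategies of this kind are commonly instantiated as a symmetric InfoNCE-style loss that pulls a POI's text embedding toward its paired image embedding and pushes it away from other images in the batch. The sketch below is an illustration under that assumption, not the paper's actual formulation; the function name `info_nce` and all tensor shapes are hypothetical.

```python
import numpy as np

def info_nce(text_emb: np.ndarray, image_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired text/image embeddings.

    text_emb, image_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    This is a generic cross-modal contrastive loss, assumed for illustration only.
    """
    # L2-normalize each modality so the dot product is a cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(logits.shape[0])     # matched pairs lie on the diagonal

    def cross_entropy(l: np.ndarray) -> float:
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[labels, labels].mean())

    # average the text-to-image and image-to-text directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimizing this loss drives matched text/image pairs together in the shared embedding space, which is one plausible way to "bridge the gap between the representations of different modalities" as the abstract describes.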

Related research

01/30/2023 · Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
Vision-language alignment learning for video-text retrieval arouses a lo...

09/30/2021 · Multi-Modal Sarcasm Detection Based on Contrastive Attention Mechanism
In the past decade, sarcasm detection has been intensively conducted in ...

01/16/2023 · Learning Aligned Cross-modal Representations for Referring Image Segmentation
Referring image segmentation aims to segment the image region of interes...

12/13/2021 · ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition
Recently, Multi-modal Named Entity Recognition (MNER) has attracted a lo...

04/20/2019 · Saliency-Guided Attention Network for Image-Sentence Matching
This paper studies the task of matching image and sentence, where learni...

11/30/2021 · Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features
Linguistic knowledge has brought great benefits to scene text recognitio...

08/02/2021 · Multimodal Feature Fusion for Video Advertisements Tagging Via Stacking Ensemble
Automated tagging of video advertisements has been a critical yet challe...
