Cross-Modal Alignment with Mixture Experts Neural Network for Intral-City Retail Recommendation

09/17/2020
by   Po Li, et al.
14

In this paper, we introduce Cross-modal Alignment with mixture experts Neural Network (CameNN) recommendation model for intral-city retail industry, which aims to provide fresh foods and groceries retailing within 5 hours delivery service arising for the outbreak of Coronavirus disease (COVID-19) pandemic around the world. We propose CameNN, which is a multi-task model with three tasks including Image to Text Alignment (ITA) task, Text to Image Alignment (TIA) task and CVR prediction task. We use pre-trained BERT to generate the text embedding and pre-trained InceptionV4 to generate image patch embedding (each image is split into small patches with the same pixels and treat each patch as an image token). Softmax gating networks follow to learn the weight of each transformer expert output and choose only a subset of experts conditioned on the input. Then transformer encoder is applied as the share-bottom layer to learn all input features' shared interaction. Next, mixture of transformer experts (MoE) layer is implemented to model different aspects of tasks. At top of the MoE layer, we deploy a transformer layer for each task as task tower to learn task-specific information. On the real word intra-city dataset, experiments demonstrate CameNN outperform baselines and achieve significant improvements on the image and text representation. In practice, we applied CameNN on CVR prediction in our intra-city recommender system which is one of the leading intra-city platforms operated in China.

READ FULL TEXT
08/16/2019

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

We propose Unicoder-VL, a universal encoder that aims to learn joint rep...
03/03/2020

XGPT: Cross-modal Generative Pre-Training for Image Captioning

While many BERT-based cross-modal pre-trained models produce excellent r...
07/21/2021

DRDF: Determining the Importance of Different Multimodal Information with Dual-Router Dynamic Framework

In multimodal tasks, we find that the importance of text and image modal...
05/17/2020

T-VSE: Transformer-Based Visual Semantic Embedding

Transformer models have recently achieved impressive performance on NLP ...
12/14/2021

ACE-BERT: Adversarial Cross-modal Enhanced BERT for E-commerce Retrieval

Nowadays on E-commerce platforms, products are presented to the customer...
04/15/2021

Integration of Pre-trained Networks with Continuous Token Interface for End-to-End Spoken Language Understanding

Most End-to-End (E2E) SLU networks leverage the pre-trained ASR networks...
09/09/2021

Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

Employing large-scale pre-trained model CLIP to conduct video-text retri...