S-Omninet: Structured Data Enhanced Universal Multimodal Learning Architecture

07/01/2023
by Ye Xue, et al.

Multimodal multitask learning has attracted increasing interest in recent years. Single-modal models have advanced rapidly and achieved remarkable results on various tasks across multiple domains. Multimodal learning offers opportunities for further improvement by integrating data from multiple modalities. Many methods have been proposed to learn from a specific type of multimodal data, such as vision-and-language data, but only a few are designed to handle several modalities and tasks at a time. In this work, we extend and improve Omninet, an architecture capable of handling multiple modalities and tasks simultaneously, by introducing cross-cache attention, integrating patch embeddings for vision inputs, and adding support for structured data. The proposed Structured-data-enhanced Omninet (S-Omninet) is a universal model that learns effectively from structured data of various dimensions together with unstructured data via cross-cache attention, which enables interactions among spatial, temporal, and structured features. We also enhance the spatial representations in the spatial cache with patch embeddings. We evaluate the proposed model on several multimodal datasets and demonstrate a significant improvement over the baseline, Omninet.
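The abstract does not specify the exact formulation, but cross-cache attention as described can be read as standard scaled dot-product attention in which queries come from one cache (e.g. structured features) while keys and values come from another (e.g. the spatial cache), letting features from different caches interact. A minimal NumPy sketch under that assumption; all names, shapes, and the single-head simplification are illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_cache_attention(query_cache, key_cache):
    """Scaled dot-product attention where queries come from one cache
    and keys/values from another, so the two feature sets interact."""
    d_k = query_cache.shape[-1]
    scores = query_cache @ key_cache.T / np.sqrt(d_k)  # (n_q, n_k)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    return weights @ key_cache                         # (n_q, d)

# Hypothetical caches: 3 structured-feature tokens attending
# over 5 spatial-cache tokens, embedding dimension 8.
rng = np.random.default_rng(0)
structured = rng.normal(size=(3, 8))
spatial = rng.normal(size=(5, 8))
out = cross_cache_attention(structured, spatial)
print(out.shape)  # (3, 8)
```

In a full model the queries, keys, and values would pass through learned projection matrices and multiple heads; this sketch keeps only the cache-to-cache attention pattern itself.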


