On the Benefits of Early Fusion in Multimodal Representation Learning

11/14/2020
by George Barnum, et al.

Intelligently reasoning about the world often requires integrating data from multiple modalities, as any individual modality may contain unreliable or incomplete information. Prior work in multimodal learning fuses input modalities only after significant independent processing; the brain, by contrast, performs multimodal processing almost immediately. This divide between conventional multimodal learning and neuroscience suggests that a detailed study of early multimodal fusion could improve artificial multimodal representations. To facilitate this study, we create a convolutional LSTM (C-LSTM) network architecture that processes audio and visual inputs simultaneously and lets us select the layer at which the two modalities are combined. Our experiments demonstrate that fusing audio and visual inputs immediately, in the first C-LSTM layer, yields higher-performing networks that are more robust to the addition of white noise to both the audio and visual inputs.
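The core mechanism is a stack of C-LSTM layers in which the fusion depth is a configurable choice. The PyTorch sketch below is a rough illustration of that idea, not the authors' implementation: it runs separate C-LSTM cells per modality up to a configurable fusion_layer, concatenates the two hidden states channel-wise, and continues with a single shared stack. The names ConvLSTMCell and FusionAVNet, the default channel counts, and the assumption that the audio spectrogram is resized to the video frame resolution are all illustrative choices.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # A single convolutional LSTM cell: one convolution produces all four
    # gate pre-activations at once (input, forget, output, candidate).
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        h = o.sigmoid() * c.tanh()
        return h, c

class FusionAVNet(nn.Module):
    # Stack of C-LSTM layers over an audio stream and a visual stream.
    # fusion_layer selects the depth at which the streams are concatenated;
    # fusion_layer=0 corresponds to immediate (early) fusion.
    def __init__(self, a_ch=1, v_ch=3, hid=32, n_layers=4, fusion_layer=0):
        super().__init__()
        assert 0 <= fusion_layer < n_layers
        self.hid = hid
        self.pre_a = nn.ModuleList()  # audio-only cells before fusion
        self.pre_v = nn.ModuleList()  # visual-only cells before fusion
        self.post = nn.ModuleList()   # shared cells from fusion onward
        for l in range(n_layers):
            if l < fusion_layer:
                self.pre_a.append(ConvLSTMCell(a_ch if l == 0 else hid, hid))
                self.pre_v.append(ConvLSTMCell(v_ch if l == 0 else hid, hid))
            elif l == fusion_layer:
                in_ch = (a_ch + v_ch) if l == 0 else 2 * hid
                self.post.append(ConvLSTMCell(in_ch, hid))
            else:
                self.post.append(ConvLSTMCell(hid, hid))

    def forward(self, audio, video):
        # audio, video: (batch, time, channels, H, W); the audio is assumed
        # to be a spectrogram resized to the video frame resolution.
        B, T, _, H, W = video.shape
        zeros = lambda: torch.zeros(B, self.hid, H, W, device=video.device)
        sa = [(zeros(), zeros()) for _ in self.pre_a]
        sv = [(zeros(), zeros()) for _ in self.pre_v]
        sp = [(zeros(), zeros()) for _ in self.post]
        for t in range(T):
            xa, xv = audio[:, t], video[:, t]
            for l in range(len(self.pre_a)):       # unimodal layers
                sa[l] = self.pre_a[l](xa, *sa[l]); xa = sa[l][0]
                sv[l] = self.pre_v[l](xv, *sv[l]); xv = sv[l][0]
            x = torch.cat([xa, xv], dim=1)         # channel-wise fusion
            for l, cell in enumerate(self.post):   # shared layers
                sp[l] = cell(x, *sp[l]); x = sp[l][0]
        return x  # final hidden state, to be fed to a task-specific head

A quick smoke test with toy tensors: FusionAVNet(fusion_layer=0) fuses immediately, while fusion_layer=3 defers fusion to the last layer; net(torch.randn(2, 8, 1, 32, 32), torch.randn(2, 8, 3, 32, 32)) returns a (2, 32, 32, 32) feature map. Channel-wise concatenation is used as the fusion operation here purely for simplicity; the study concerns where fusion happens, not this particular fusion operator.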
