Towards Learning Cross-Modal Perception-Trace Models

10/18/2019
by Achim Rettinger, et al.

Representation learning is a key element of state-of-the-art deep learning approaches. It makes it possible to transform raw data into structured vector-space embeddings. Such embeddings capture the distributional semantics of their context, e.g., via word windows over natural language sentences, graph walks on knowledge graphs, or convolutions on images. So far, this context has been defined manually, resulting in heuristics that are optimized solely for computational performance on certain tasks such as link prediction. However, such heuristic models of context are fundamentally different from how humans capture information. For instance, when reading a multi-modal webpage, (i) humans do not perceive all parts of a document equally: some words and parts of images are skipped, others are revisited several times, which makes the perception trace highly non-sequential; and (ii) humans construct meaning from a document's content by shifting their attention between text and image, guided by, among other things, layout and design elements. In this paper, we empirically investigate the difference between human perception and the context heuristics of basic embedding models. We conduct eye-tracking experiments to capture the underlying characteristics of human perception of media documents containing a mixture of text and images. Based on these findings, we devise a prototypical computational perception-trace model, called CMPM. We evaluate empirically how CMPM can improve a basic skip-gram embedding approach. Our results suggest that even with a basic human-inspired computational perception model, there is great potential for improving embeddings, since such a model inherently captures multiple modalities as well as layout and design elements.
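To make the contrast concrete, below is a minimal Python sketch, not the authors' CMPM implementation: it compares the skip-gram (target, context) pairs produced by the standard fixed-window heuristic with pairs produced along a hypothetical eye-tracking fixation sequence in which tokens can be skipped or revisited. The fixation order and all function names are illustrative assumptions.

# Minimal sketch (illustrative, not the authors' CMPM): skip-gram contexts
# from a fixed word window vs. from a perception trace of fixations.

def window_pairs(tokens, window=2):
    """(target, context) pairs from a symmetric word window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def trace_pairs(tokens, fixation_indices, window=2):
    """(target, context) pairs along a perception trace: the tokens are
    re-ordered by fixations, so skips and revisits reshape the contexts."""
    trace = [tokens[i] for i in fixation_indices]
    return window_pairs(trace, window)

sentence = ["the", "cat", "sat", "on", "the", "mat"]
fixations = [0, 1, 2, 5, 1]   # hypothetical trace: "on" skipped, "cat" revisited
print(window_pairs(sentence))            # heuristic, strictly sequential
print(trace_pairs(sentence, fixations))  # human-inspired, non-sequential

The same idea would extend to the multi-modal case by letting the fixation indices point into image regions as well as words, which is where layout and design elements would enter the trace.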

Related research:

- Knowledge Fusion via Embeddings from Text, Knowledge Graphs, and Images (04/20/2017)
  We present a baseline approach for cross-modal knowledge fusion. Differe...

- Learning Social Image Embedding with Deep Multimodal Attention Networks (10/18/2017)
  Learning social media data embedding by deep models has attracted extens...

- Joint Representations of Text and Knowledge Graphs for Retrieval and Evaluation (02/28/2023)
  A key feature of neural models is that they can produce semantic vector ...

- Image search using multilingual texts: a cross-modal learning approach between image and text (03/27/2019)
  Multilingual (or cross-lingual) embeddings represent several languages i...

- Coherence-Based Distributed Document Representation Learning for Scientific Documents (01/08/2022)
  Distributed document representation is one of the basic problems in natu...

- Grounding Symbols in Multi-Modal Instructions (06/01/2017)
  As robots begin to cohabit with humans in semi-structured environments, ...
