Evaluating Multimodal Representations on Visual Semantic Textual Similarity

04/04/2020
by Oier Lopez de Lacalle, et al.

The combination of visual and textual representations has produced excellent results in tasks such as image captioning and visual question answering, but the inference capabilities of multimodal representations remain largely untested. For textual representations, inference tasks such as Textual Entailment and Semantic Textual Similarity have often been used to benchmark representation quality. The long-term goal of our research is to devise multimodal representation techniques that improve current inference capabilities. We thus present a novel task, Visual Semantic Textual Similarity (vSTS), where such inference ability can be tested directly. Given two items, each comprising an image and its accompanying caption, vSTS systems need to assess the degree to which the captions in context are semantically equivalent to each other. Our experiments with simple multimodal representations show that adding image representations produces better inference than text-only representations. The improvement is observed both when directly computing the similarity between the representations of the two items and when learning a siamese network from vSTS training data. Our work shows, for the first time, the successful contribution of visual information to textual inference, with ample room for benchmarking more complex multimodal representation options.
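As a concrete illustration of the direct-similarity setting mentioned in the abstract, the sketch below represents each item by concatenating its text and image embeddings and scores the pair with cosine similarity. This is only a minimal sketch of that general idea: the encoder outputs, the embedding dimensions (a 768-d sentence vector and a 2048-d image feature), and the helper names are illustrative assumptions, not details taken from the paper.

import numpy as np

def multimodal_embedding(text_vec, image_vec):
    # Concatenate L2-normalized text and image embeddings into one item vector.
    # In practice these would come from a sentence encoder and an image encoder;
    # here they are just arrays (assumption for illustration).
    t = text_vec / np.linalg.norm(text_vec)
    v = image_vec / np.linalg.norm(image_vec)
    return np.concatenate([t, v])

def cosine_similarity(a, b):
    # Cosine similarity between two item vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example with random stand-ins for real encoder outputs.
rng = np.random.default_rng(0)
item1 = multimodal_embedding(rng.standard_normal(768), rng.standard_normal(2048))
item2 = multimodal_embedding(rng.standard_normal(768), rng.standard_normal(2048))
print(f"predicted similarity (cosine): {cosine_similarity(item1, item2):.3f}")

The siamese-network variant described in the abstract would replace the fixed cosine score with a learned scoring head trained on vSTS similarity labels, while still encoding both items with shared weights.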

Related research

09/11/2018
Evaluating Multimodal Representations on Sentence Similarity: vSTS, Visual Semantic Textual Similarity Dataset
In this paper we introduce vSTS, a new dataset for measuring textual sim...

06/10/2019
Multimodal Logical Inference System for Visual-Textual Entailment
A large amount of research about multimodal inference across text and vi...

09/04/2021
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
Pre-training visual and textual representations from large-scale image-t...

11/19/2015
Order-Embeddings of Images and Language
Hypernymy, textual entailment, and image captioning can be seen as speci...

05/22/2022
The Case for Perspective in Multimodal Datasets
This paper argues in favor of the adoption of annotation practices for m...

04/18/2018
Quantifying the visual concreteness of words and topics in multimodal datasets
Multimodal machine learning algorithms aim to learn visual-textual corre...

09/12/2016
Examining Representational Similarity in ConvNets and the Primate Visual Cortex
We compare several ConvNets with different depth and regularization tech...
