Words are all you need? Capturing human sensory similarity with textual descriptors

06/08/2022
by Raja Marjieh et al.

Recent advances in multimodal training use textual descriptions to significantly enhance machine understanding of images and videos. Yet, it remains unclear to what extent language can fully capture sensory experiences across different modalities. A well-established approach for characterizing sensory experiences relies on similarity judgments, namely, the degree to which people perceive two distinct stimuli as similar. We explore the relation between human similarity judgments and language in a series of large-scale behavioral studies (N=1,823 participants) across three modalities (images, audio, and video) and two types of text descriptors: simple word tags and free-text captions. In doing so, we introduce a novel adaptive pipeline for tag mining that is both efficient and domain-general. We show that our prediction pipeline based on text descriptors exhibits excellent performance, and we compare it against a comprehensive array of 611 baseline models based on vision-, audio-, and video-processing architectures. We further show that the degree to which textual descriptors and models predict human similarity varies across and within modalities. Taken together, these studies illustrate the value of integrating machine learning and cognitive science approaches to better understand the similarities and differences between human and machine representations. We present an interactive visualization at https://words-are-all-you-need.s3.amazonaws.com/index.html for exploring the similarity between stimuli as judged by humans and as predicted by the different methods reported in the paper.
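To make the caption-based approach concrete, here is a minimal sketch of one way such a prediction could work: embed each stimulus's free-text caption with an off-the-shelf sentence encoder, take pairwise cosine similarities as the predicted similarity matrix, and correlate it with human judgments. The encoder choice, helper functions, and toy data below are illustrative assumptions, not the pipeline reported in the paper.

```python
# Minimal sketch (not the authors' pipeline): predict pairwise stimulus
# similarity from free-text captions and correlate it with human judgments.
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer  # illustrative choice of text encoder


def predict_similarity_from_captions(captions):
    """Embed captions and return a pairwise cosine-similarity matrix."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would work here
    emb = encoder.encode(captions, convert_to_numpy=True, normalize_embeddings=True)
    return emb @ emb.T  # cosine similarity, since embeddings are unit-normalized


def correlate_with_human_judgments(predicted_sim, human_sim):
    """Spearman correlation over the unique stimulus pairs (upper triangle)."""
    iu = np.triu_indices_from(predicted_sim, k=1)
    return spearmanr(predicted_sim[iu], human_sim[iu]).correlation


# Toy usage with made-up captions and similarity judgments:
captions = ["a dog running on grass",
            "a puppy playing in a park",
            "a city street at night"]
human_sim = np.array([[1.0, 0.8, 0.1],
                      [0.8, 1.0, 0.1],
                      [0.1, 0.1, 1.0]])
print(correlate_with_human_judgments(predict_similarity_from_captions(captions), human_sim))
```

The same recipe extends to word tags by concatenating or averaging tag embeddings per stimulus before computing pairwise similarities.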

Related research

02/09/2022
Predicting Human Similarity Judgments Using Large Language Models
Similarity judgments provide a well-established method for accessing men...

10/19/2021
Inter-Sense: An Investigation of Sensory Blending in Fiction
This study reports on the semantic organization of English sensory descr...

06/16/2020
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Current methods for learning visually grounded language from videos ofte...

05/29/2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Vision and text have been fully explored in contemporary video-text foun...

02/17/2022
Word Embeddings for Automatic Equalization in Audio Mixing
In recent years, machine learning has been widely adopted to automate th...

10/19/2021
Exploring the Sensory Spaces of English Perceptual Verbs in Natural Language Data
In this study, we explore how language captures the meaning of words, in...

03/13/2021
A Survey on Multimodal Disinformation Detection
Recent years have witnessed the proliferation of fake news, propaganda, ...
