Semantic sentence similarity: size does not always matter

by Danny Merkx et al.

This study addresses the question of whether visually grounded speech (VGS) models learn to capture sentence semantics without access to any prior linguistic knowledge. We produce synthetic and natural spoken versions of a well-known semantic textual similarity database and show that our VGS model produces embeddings that correlate well with human semantic similarity judgements. Our results show that a model trained on a small image-caption database outperforms two models trained on much larger databases, indicating that database size is not all that matters. We also investigate the importance of having multiple captions per image and find that this is indeed helpful even if the total number of images is lower, suggesting that paraphrasing is a valuable learning signal. While the general trend in the field is to create ever larger datasets to train models on, our findings indicate that other characteristics of the database can be just as important.
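The evaluation described above — comparing model similarity scores against human judgements on a semantic textual similarity benchmark — typically works by embedding each sentence pair, scoring the pair with cosine similarity, and correlating those scores with the human ratings. A minimal sketch of that protocol, using random stand-in embeddings rather than the paper's actual VGS model:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in data: in the paper these would be VGS sentence embeddings
# and the benchmark's human similarity ratings.
rng = np.random.default_rng(0)
n_pairs, dim = 100, 64
emb_a = rng.normal(size=(n_pairs, dim))   # embeddings of first sentences
emb_b = rng.normal(size=(n_pairs, dim))   # embeddings of second sentences
human = rng.uniform(0, 5, size=n_pairs)   # human similarity ratings (0-5 scale)

# Score each pair and correlate with the human judgements.
model_scores = [cosine(a, b) for a, b in zip(emb_a, emb_b)]
rho, _ = spearmanr(model_scores, human)
print(f"Spearman correlation with human judgements: {rho:.3f}")
```

With random embeddings the correlation hovers near zero; a model that has learned sentence semantics should score substantially higher.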








Code Repositories


Neural network implementation of a speech to image system. Networks are trained to embed images and corresponding captions to the same vector space.
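Such speech-to-image systems are commonly trained with a margin-based ranking objective that pulls matching image and caption embeddings together while pushing mismatched pairs apart. A hypothetical numpy sketch of that loss (not the repository's actual code; matching pairs are assumed to sit on the diagonal of the batch similarity matrix):

```python
import numpy as np

def batch_hinge_loss(img_emb, cap_emb, margin=0.2):
    """Margin-based ranking loss over a batch of image and caption
    embeddings in a shared vector space. Row i of img_emb matches
    row i of cap_emb; all other pairings are negatives."""
    # L2-normalise so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    cap = cap_emb / np.linalg.norm(cap_emb, axis=1, keepdims=True)
    sims = img @ cap.T                      # (batch, batch) similarity matrix
    pos = np.diag(sims)                     # similarities of matching pairs
    # Penalise mismatched pairs scoring within `margin` of the true match.
    cost_cap = np.maximum(0.0, margin + sims - pos[:, None])  # wrong captions
    cost_img = np.maximum(0.0, margin + sims - pos[None, :])  # wrong images
    np.fill_diagonal(cost_cap, 0.0)         # matching pairs incur no cost
    np.fill_diagonal(cost_img, 0.0)
    return float(cost_cap.sum() + cost_img.sum())
```

When every caption embedding coincides with its image embedding and mismatched pairs are orthogonal, the loss is zero; the gradient only flows through negatives that violate the margin.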
