
A Corpus for Reasoning About Natural Language Grounded in Photographs

by Alane Suhr, et al.

We introduce a new dataset for joint reasoning about language and vision. The data contains 107,296 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a photograph. We present an approach for finding visually complex images and crowdsourcing linguistically diverse captions. Qualitative analysis shows the data requires complex reasoning about quantities, comparisons, and relationships between objects. Evaluation of state-of-the-art visual reasoning methods shows the data is a challenge for current methods.


