Today, large language models (LLMs) are taught to use new tools by provi...
Existing image editing tools, while powerful, typically disregard the
un...
Many pixelwise dense prediction tasks --- depth estimation and semantic
segm...
In the last year alone, a surge of new benchmarks to measure composition...
Diligently gathered human demonstrations serve as the unsung heroes
empo...
Recent accelerations in multi-modal applications have been made possible...
Compositional reasoning is a hallmark of human visual intelligence; yet
...
Deploying large language models (LLMs) is challenging because they are m...
Large multimodal datasets have been instrumental in recent breakthroughs...
We introduce VOCALExplore, a system designed to support users in buildin...
We introduce EQUI-VOCAL: a new system that automatically synthesizes que...
A fundamental characteristic common to both human vision and natural lan...
Prior work has identified a resilient phenomenon that threatens the
perf...
Modern multi-agent reinforcement learning frameworks rely on centralized...
Recent video question answering benchmarks indicate that state-of-the-ar...
Prior benchmarks have analyzed models' answers to questions about videos...
Over the last decade, Computer Vision, the branch of Artificial Intellig...
Active learning promises to alleviate the massive data needs of supervis...
Visual events are a composition of temporal actions involving actors
spa...
Datasets extracted from social networks and online forums are often pron...
With the emergence of conversational artificial intelligence (AI) agents...
Action recognition has typically treated actions and activities as monol...
Typical active learning strategies are designed for tasks, such as
class...
Scene graph prediction --- classifying the set of objects and predicates...
Visual knowledge bases such as Visual Genome power numerous applications...
Generative models often use human evaluations to measure the perceived
q...
Generative models often use human evaluations to determine and justify
p...
Though image-to-sequence generation models have become overwhelmingly po...
Images are not simply sets of objects: each image represents a web of
in...
Most natural videos contain numerous events. For example, in a video of ...
Recent progress on image captioning has made it possible to generate nov...
Microtask crowdsourcing is increasingly critical to the creation of extr...
Visual relationships capture a wide variety of interactions between pair...
Despite progress in perceptual tasks such as image classification, compu...
Microtask crowdsourcing has enabled dataset advances in social science a...