We introduce EgoSchema, a very long-form video question-answering datase...
The growing size of datasets and deep learning models has made faster an...
There has been a longstanding belief that generation can facilitate a tr...
The recent emergence of Large Language Models based on the Transformer a...
Temporal action localization (TAL) requires long-form reasoning to predi...
This technical report describes the SViT approach for the Ego4D Point of...
Recent action recognition models have achieved impressive results by int...
The recently proposed Conformer model has become the de facto backbone m...
While today's video recognition systems parse snapshots or short clips a...
Generative Adversarial Networks (GANs) are a class of generative models ...
In this paper, we study Multiscale Vision Transformers (MViT) as a unifi...
Evidence from cognitive psychology suggests that understanding spatio-te...
We present Multiscale Vision Transformers (MViT) for video and image rec...
Human trajectory forecasting is an inherently multi-modal problem. Uncer...
Human movement is goal-directed and influenced by the spatial layout of ...
We tackle the problem of Human Locomotion Forecasting, a task for jointl...
We study the use of knowledge distillation to compress the U-net archite...
We investigate the effect and usefulness of spontaneity in speech (i.e. ...
We present a new task that predicts future locations of people observed ...
Cellular Automata (CA) theory is a discrete model that represents the st...