Large-scale visual language models are widely used as pre-trained models...
The core problem in zero-shot open vocabulary detection is how to align
...
Attention-based models are appealing for multimodal processing because i...
We aim to learn to temporally localize object state changes and the
corr...
Building models that can be rapidly adapted to numerous tasks using only...
Human actions often induce changes of object states such as "cutting an
...
Real-world data is high-dimensional: a book, image, or musical performan...
The ability to learn universal audio representations that can solve dive...
The recently-proposed Perceiver model obtains good results on several do...
Whilst there are perhaps only a few scientific methods, there seem to be...
We present a multimodal framework to learn general audio representations...
Beam search is the go-to method for decoding auto-regressive machine
tra...
Most successful self-supervised learning methods are trained to align th...
Our objective is language-based search of large-scale image and video
da...
Self-supervised pretraining has been shown to yield powerful representat...
Recently multimodal transformer models have gained popularity because th...
This paper introduces a manually annotated video dataset of unusual acti...
Videos are a rich source of multi-modal supervision. In this work, we le...
We apply a generative segmental model of task structure, guided by narra...
There are thousands of actively spoken languages on Earth, but a single
...
Annotating videos is cumbersome, expensive and not scalable. Yet, many s...
The objective of this paper is to be able to separate a video into its
n...
Learning text-video embeddings usually requires a dataset of video clips...
Recent work has uncovered the interesting (and somewhat surprising) find...
In this paper we investigate learning visual models for the steps of ord...
True video understanding requires making sense of non-lambertian scenes ...
Automatic generation of textual video descriptions that are time-aligned...
Spatio-temporal action detection in videos is typically addressed in a
f...
Discriminative clustering has been successfully applied to a number of
w...
We propose SEARNN, a novel training algorithm for recurrent neural netwo...
Many human activities involve object manipulations aiming to modify the
...
In this paper, we propose several improvements on the block-coordinate
F...
We address the problem of automatically learning the main steps to compl...