While recent supervised methods for reference-based object counting cont...
Our objective is open-world object counting in images, where the target ...
The objective of this paper is an automatic Audio Description (AD) model...
Large-scale, weakly-supervised speech recognition models, such as Whispe...
Large-scale pretrained models, especially those trained from vision-lang...
The objective of this paper is an efficient training method for video ta...
Building models that can be rapidly adapted to numerous tasks using only...
The objective of this paper is a temporal alignment network that ingests...
Visual-language pre-training has shown great success for learning joint ...
The objective of this paper is visual-only self-supervised video represe...
The objective of this paper is self-supervised learning from video, in p...
The objective of this paper is self-supervised learning of spatio-tempor...
For effective human-robot interaction, it is important that a robotic as...
Human pose forecasting is an important problem in computer vision with a...