This report presents the technical details of our submission on the EGO4...
Vision-language models are growing in popularity and public visibility t...
The objective of this paper is an automatic Audio Description (AD) model...
Large-scale, weakly-supervised speech recognition models, such as Whispe...
Our goal in this paper is the adaptation of image-text models for long v...
Vision-language models can encode societal biases and stereotypes, but t...
Our objective in this work is video-text retrieval - in particular a joi...
Our objective in this work is the long range understanding of the narrat...
The goal of this paper is to label all the animal individuals present in...