We propose a novel multimodal video benchmark - the Perception Test - to...
Attention-based models are appealing for multimodal processing because i...
Generic motion understanding from video involves not only tracking objec...
The ability to learn universal audio representations that can solve dive...
We introduce a human-compatible reinforcement-learning approach to a coo...
We describe the 2020 edition of the DeepMind Kinetics human action datas...
Videos are a rich source of multi-modal supervision. In this work, we le...
There are thousands of actively spoken languages on Earth, but a single...
Annotating videos is cumbersome, expensive and not scalable. Yet, many s...