
-
Perceiver: General Perception with Iterative Attention
Biological systems understand the world by simultaneously processing hig...
-
Betrayed by Motion: Camouflaged Object Discovery via Motion Segmentation
The objective of this paper is to design a computational architecture th...
-
QuerYD: A video dataset with high-quality textual and audio narrations
We introduce QuerYD, a new large-scale dataset for retrieval and event l...
-
A Short Note on the Kinetics-700-2020 Human Action Dataset
We describe the 2020 edition of the DeepMind Kinetics human action datas...
-
Self-supervised Co-training for Video Representation Learning
The objective of this paper is visual-only self-supervised video represe...
-
Watch, read and lookup: learning to spot signs from multiple supervisors
The focus of this work is sign spotting - given a video of an isolated s...
-
Layered Neural Rendering for Retiming People in Video
We present a method for retiming people in an ordinary, natural video—ma...
-
Adaptive Text Recognition through Visual Matching
In this work, our objective is to address the problems of generalization...
-
Seeing wake words: Audio-visual Keyword Spotting
The goal of this work is to automatically determine whether and when a w...
-
Inducing Predictive Uncertainty Estimation for Face Recognition
Knowing when an output can be trusted is critical for reliably using fac...
-
Self-Supervised Learning of Audio-Visual Objects from Video
Our objective is to transform a video into a set of discrete audio-visua...
-
Memory-augmented Dense Predictive Coding for Video Representation Learning
The objective of this paper is self-supervised learning from video, in p...
-
RareAct: A video dataset of unusual interactions
This paper introduces a manually annotated video dataset of unusual acti...
-
BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues
Recent progress in fine-grained gesture and action classification, and m...
-
CrossTransformers: spatially-aware few-shot transfer
Given new tasks with very little data–such as new classes in a classific...
-
D2D: Learning to find good correspondences for image matching and manipulation
We propose a new approach to determining correspondences between image p...
-
Spot the conversation: speaker diarisation in the wild
The goal of this paper is speaker diarisation of videos collected 'in th...
-
Self-Supervised MultiModal Versatile Networks
Videos are a rich source of multi-modal supervision. In this work, we le...
-
Counting Out Time: Class Agnostic Video Repetition Counting in the Wild
We present an approach for estimating the period with which an action is...
-
LSD-C: Linearly Separable Deep Clusters
We present LSD-C, a novel method to identify clusters in an unlabeled da...
-
Condensed Movies: Story Based Retrieval with Contextual Embeddings
Our objective in this work is the long range understanding of the narrat...
-
The AVA-Kinetics Localized Human Actions Video Dataset
This paper describes the AVA-Kinetics localized human actions video data...
-
VGGSound: A Large-scale Audio-Visual Dataset
Our goal is to collect a large-scale audio-visual dataset with low label...
-
Monocular Depth Estimation with Self-supervised Instance Adaptation
Recent advances in self-supervised learning have demonstrated that it is ...
-
Speech2Action: Cross-modal Supervision for Action Recognition
Is it possible to guess human action from dialogue alone? In this work w...
-
Compact Deep Aggregation for Set Retrieval
The objective of this work is to learn a compact embedding of a set of d...
-
Visual Grounding in Video for Unsupervised Word Translation
There are thousands of actively spoken languages on Earth, but a single ...
-
Disentangled Speech Embeddings using Cross-modal Self-supervision
The objective of this paper is to learn representations of speaker ident...
-
Automatically Discovering and Learning New Visual Categories with Ranking Statistics
We tackle the problem of discovering novel classes in an image collectio...
-
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Annotating videos is cumbersome, expensive and not scalable. Yet, many s...
-
Synthetic Humans for Action Recognition from Unseen Viewpoints
Our goal in this work is to improve the performance of human action reco...
-
VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge
The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well...
-
ASR is all you need: cross-modal distillation for lip reading
The goal of this work is to train strong models for visual speech recogn...
-
Self-supervised learning of class embeddings from video
This work explores how to use self-supervised learning on videos to lear...
-
Controllable Attention for Structured Layered Video Decomposition
The objective of this paper is to be able to separate a video into its n...
-
Count, Crop and Recognise: Fine-Grained Recognition in the Wild
The goal of this paper is to label all the animal individuals present in...
-
Video Representation Learning by Dense Predictive Coding
The objective of this paper is self-supervised learning of spatio-tempor...
-
Geometry-Aware Video Object Detection for Static Cameras
In this paper we propose a geometry-aware model for video object detecti...
-
Learning to Discover Novel Visual Categories via Deep Transfer Clustering
We consider the problem of discovering novel object categories in an ima...
-
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
We focus on multi-modal fusion for egocentric action recognition, and pr...
-
AutoCorrect: Deep Inductive Alignment of Noisy Geometric Annotations
We propose AutoCorrect, a method to automatically learn object-annotatio...
-
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
The rapid growth of video on the internet has made searching for video c...
-
A Short Note on the Kinetics-700 Human Action Dataset
We describe an extension of the DeepMind Kinetics human action dataset f...
-
My lips are concealed: Audio-visual speech enhancement through obstructions
Our objective is an audio-visual model for separating a single speaker f...
-
Sim2real transfer learning for 3D pose estimation: motion to the rescue
Simulation is an anonymous, low-bias source of data where annotation can...
-
Unsupervised Learning of Object Keypoints for Perception and Control
The study of object representations in computer vision has primarily foc...
-
Training Neural Networks for and by Interpolation
The majority of modern deep learning models are able to interpolate the ...
-
LAEO-Net: revisiting people Looking At Each Other in videos
Capturing the 'mutual gaze' of people is essential for understanding and...
-
A Hierarchical Probabilistic U-Net for Modeling Multi-Scale Ambiguities
Medical imaging only indirectly measures the molecular identity of the t...
-
Object Discovery with a Copy-Pasting GAN
We tackle the problem of object discovery, where objects are segmented f...