
-
Self-Supervised Pretraining of 3D Features on any Point-Cloud
Pretraining on large labeled datasets is a prerequisite to achieve good ...
-
Forward Prediction for Physical Reasoning
Physical reasoning requires forward prediction: the ability to forecast ...
-
Video Understanding as Machine Translation
With the advent of large-scale multimodal video datasets, especially seq...
-
Are we asking the right questions in MovieQA?
Joint vision and language tasks like visual question answering are fasci...
-
CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
Computer vision has undergone a dramatic revolution in performance, driv...
-
MetaPix: Few-Shot Video Retargeting
We address the task of unsupervised retargeting of human actions from on...
-
DistInit: Learning Video Representations without a Single Labeled Video
Video recognition models have progressed significantly over the past few...
-
Video Action Transformer Network
We introduce the Action Transformer model for recognizing and localizing...
-
A Better Baseline for AVA
We introduce a simple baseline for action localization on the AVA datase...
-
Binge Watching: Scaling Affordance Learning from Sitcoms
In recent years, there has been a renewed interest in jointly modeling p...
-
Detect-and-Track: Efficient Pose Estimation in Videos
This paper addresses the problem of estimating and tracking human body k...
-
Attentional Pooling for Action Recognition
We introduce a simple yet surprisingly powerful model to incorporate att...
-
ActionVLAD: Learning spatio-temporal aggregation for action classification
In this work, we introduce a new video representation for action classif...
-
Learning a Predictable and Generative Vector Representation for Objects
What is a good vector representation of an object? We believe that it sh...