We study visually grounded VideoQA in response to the emerging trends of...
We present HiFiHR, a high-fidelity hand reconstruction approach that uti...
Human actions in egocentric videos are often hand-object interactions co...
One promising use case of AI assistants is to help with complex procedur...
Direct mesh fitting for 3D hand shape reconstruction is highly accurate....
Video super-resolution commonly uses a frame-wise alignment to support t...
We propose to perform video question answering (VideoQA) in a Contrastiv...
In human and hand pose estimation, heatmaps are a crucial intermediate r...
In computer vision, it is often observed that formulating regression pro...
Temporal action segmentation tags action labels for every frame in an in...
We propose a novel framework for 3D hand shape reconstruction and hand-o...
Temporal action segmentation from videos aims at the dense labeling of v...
In image super-resolution, both pixel-wise accuracy and perceptual fidel...
In temporal action segmentation, Timestamp supervision requires only a h...
Local counts, or the number of objects in a local area, is a continuous ...
We present a semi-supervised learning approach to the temporal action se...
Multi-exit architectures consist of a backbone and branch classifiers th...
We propose a novel approach to generate temporally coherent UV coordinat...
Video deblurring has achieved remarkable progress thanks to the success ...
Assembly101 is a new procedural activity dataset featuring 4321 videos o...
A standard hardware bottleneck when training deep neural networks is GPU...
This paper addresses the 3D point cloud reconstruction and 3D pose estim...
Video question answering requires the models to understand and reason ab...
Temporal action segmentation classifies the action of each frame in (lon...
Dense anticipation aims to forecast future actions and their durations f...
Over the past few years, the success in action recognition on short trim...
Along with predictive performance and runtime speed, reliability is a ke...
We propose an efficient inference framework for semi-supervised video ob...
Deep Neural Networks (DNNs) are generated by sequentially performing lin...
Modeling the visual changes that an action brings to a scene is critical...
Can we teach a robot to recognize and make predictions for activities th...
This technical report extends our work presented in [9] with more experi...
Convolutional neural networks (CNNs) are highly successful for super-res...
Temporal convolutional networks (TCNs) are a commonly used architecture ...
We introduce NExT-QA, a rigorously designed video question answering (Vi...
Segmenting objects of interest in an image is an essential building bloc...
In current interactive instance segmentation works, the user is granted ...
In this paper, we show that ImageNet-Pretrained standard deep CNN models...
Future prediction requires reasoning from current and past observations ...
Wavelets are well known for data compression, yet have rarely been appli...
In this work, we study how well different types of approaches generalise ...
The key prerequisite for accessing the huge potential of current machine...
We present a method for recovering the dense 3D surface of the hand by r...
Fourier methods have a long and proven track record as an excellent t...
We propose a novel single-image super-resolution approach based on the g...
When judging style, a key question that often arises is whether or not a...
In interactive instance segmentation, users give feedback to iteratively...
How can we teach a robot to predict what will happen next for an activit...
Hand image synthesis and pose estimation from RGB images are both highly...
This report outlines the proceedings of the Fourth International Worksho...