We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an...
Building joint representations across images and text is an essential st...
Image-language learning has made unprecedented progress in visual unders...
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a...
The development of language models has moved from encoder-decoder to de...
We present a simple approach which can turn a ViT encoder into an effici...
We present F-VLM, a simple open-vocabulary object detection method built...
Effective scaling and a flexible task interface enable large language mo...
We present a pre-training approach for vision and language transformer m...
Video question answering is a challenging task that requires understandi...
Stacking increases storage efficiency in shelves, but the lack of visibi...
We present Answer-Me, a task-aware multi-task framework which unifies a ...
We propose FindIt, a simple and versatile framework that unifies a varie...
Shelves are common in homes, warehouses, and commercial settings due to ...
We present 4D-Net, a 3D object detection approach, which utilizes 3D Poi...
3D perception of object shapes from RGB image input is fundamental towar...
Object proposals have become an integral preprocessing step of many vis...
In this paper we address the problem of automatically discovering atomic...
In this paper, we introduce a novel visual representation learning which...
In this paper we address the problem of automatically discovering atomic...
We present SMURF, a method for unsupervised learning of optical flow tha...
A common strategy for video understanding is to incorporate spatial and m...
We propose a vision-based architecture search algorithm for robot manipu...
Efficiently finding an occluded object with lateral access arises in man...
We present a method for jointly training the estimation of depth, ego-mo...
We create a family of powerful video models which are able to: (i) learn...
In this paper we propose an adversarial generative grammar model for fut...
Object recognition has seen significant progress in the image domain, wi...
Convolutional operations have two limitations: (1) they do not explicitly mod...
We systematically compare and analyze a set of key components in unsuper...
Mapping and localization, preferably from a small number of observations...
It has been recognized that the joint training of computer vision tasks ...
For applications in e-commerce, warehouses, healthcare, and home service...
We leverage unsupervised learning of depth, egomotion, and camera intrin...
We present a new method to learn video representations from large-scale ...
We introduce a new high resolution, high frame rate stereo video dataset...
Estimating the 3D pose of desktop objects is crucial for applications su...
Video understanding is a challenging problem with great impact on the ab...
We present an approach which takes advantage of both structure and seman...
We present a new method to learn video representations from unlabeled da...
Learning to represent videos is a very challenging task both algorithmic...
We present a novel method for simultaneous learning of depth, egomotion,...
Instance segmentation aims to detect and segment individual objects in a...
This paper proposes a novel algorithm which learns a formal regular gram...
Predicting the future to anticipate the outcome of events and actions is...
In this paper, we present a new method for evolving video CNN models to ...
Learning to predict scene depth from RGB inputs is a challenging task bo...
We present a novel approach for unsupervised learning of depth and ego-m...
We consider the problem of retrieving objects from image data and learni...
We approach structured output prediction by optimizing a deep value netw...