We introduce the Segment Anything (SA) project: a new task, model, and d...
We present a sim-to-real learning-based approach for real-world humanoid...
In this work, we explore self-supervised visual pre-training on images f...
This paper shows that self-supervised visual pre-training from real-worl...
Vision transformer (ViT) models exhibit substandard optimizability. In p...
Recent self-supervised contrastive methods have been able to produce imp...
Metric learning seeks perceptual embeddings where visually similar insta...
Human action is naturally compositional: humans can easily recognize and...
Objects are entities we act upon, where the functionality of an object i...
Modern CNN-based object detectors rely on bounding box regression and no...
Humans recognize the visual world at multiple levels: we effortlessly ca...
We study the problem of grounding distributional representations of text...
Human detection has witnessed impressive progress in recent years. Howev...
Detecting individual pedestrians in a crowd remains a challenging proble...
The improvements in recent CNN-based object detection works, from R-CNN ...
Aggregating extra features has been considered as an effective approach ...