Massive web datasets play a key role in the success of large vision-lang...
Natural language processing and 2D vision models have attained remarkabl...
Large multimodal datasets have been instrumental in recent breakthroughs...
Open-vocabulary models like CLIP achieve high accuracy across many image...
Articulated objects are abundant in daily life. Discovering their parts,...
We propose Continuous Scene Representations (CSR), a scene representatio...
Households across the world contain arbitrary objects: from mate gourds ...
The conventional recipe for maximizing model accuracy is to (1) train
mu...
People often use physical intuition when manipulating articulated object...