3D visual grounding is a critical skill for household robots, enabling t...
Recent years have seen an increasing amount of work on embodied AI agent...
Human communication is multimodal in nature; it is through multiple moda...
Question answering biases in video QA datasets can mislead multimodal mo...
Outlier detection is a key field of machine learning for identifying abn...