We introduce the Qwen-VL series, a set of large-scale vision-language mo...
Generalist models, which are capable of performing diverse multi-modal t...
Synthesizing high-fidelity videos from real-world multi-view input is
ch...
Embodied agents are expected to perform more complicated tasks in an
int...
Building a deep learning model for a Question-Answering (QA) task requir...
In the Vision-and-Language Navigation task, the embodied agent follows
l...
In this paper, we propose a novel Knowledge-based Embodied Question Answ...
Embodiment is an important characteristic for all intelligent agents
(cr...