The canonical approach to video captioning dictates a caption generation...
In recent years, we have witnessed a significant performance boost in the ...
In this paper, we propose UNICORN, a vision-language (VL) model that uni...
Automated visual understanding of our diverse and open world demands com...
In this paper, we propose a single UniFied transfOrmer (UFO), which is c...
Knowledge-based visual question answering (VQA) involves answering quest...