Vision-language models (VLMs) have shown impressive performance in
subst...
Generating visually grounded image captions with specific linguistic sty...
Current captioning approaches tend to generate correct but "generic"
des...
In recent years, vision and language pre-training (VLP) models have adva...
In this paper, we consider a novel research problem, music-to-text
synae...
Relation detection is a core step in many natural language processing
appli...