Composed image retrieval (CIR) is a new and flexible image retrieval par...
Multimodal Sarcasm Explanation (MuSE) is a new yet challenging task, whi...
Textual response generation is an essential task for multimodal task-ori...
The composed image retrieval (CIR) task aims to retrieve the desired tar...
Existing data-to-text generation efforts mainly focus on generating a co...
Fake news often involves multimedia information such as text and image t...
Existing studies on multimodal sentiment analysis heavily rely on textua...
Text response generation for multimodal task-oriented dialog systems, wh...
Scene Graph Generation, which generally follows a regular encoder-decode...
Logical reasoning is of vital importance to natural language understandi...
Recommender systems can automatically recommend users items that they pr...
Temporal Moment Localization (TML) in untrimmed videos is a challenging ...
This paper focuses on tackling the problem of temporal language localiza...
Visual attention in Visual Question Answering (VQA) aims at locating ...
With the rising incidence of some diseases, such as obesity and diabetes...
In this paper, we investigate the research problem of unsupervised multi...
Recently, the booming fashion sector and its huge potential benefits hav...