While current visual captioning models have achieved impressive performa...
There has been a growing interest in using end-to-end acoustic models fo...
Stylized visual captioning aims to generate image or video descriptions ...
Temporal video grounding (TVG) aims to retrieve the time interval of a l...
Image captioning aims to describe visual content in natural language. As...
To help the visually impaired enjoy movies, automatic movie narrating sy...
Automatically narrating a video with natural language can assist people ...
Automatic image captioning evaluation is critical for benchmarking and p...
Live video commenting is popular on video media platforms, as it can cre...
Image-text retrieval, as a fundamental and important branch of informati...
Singing voice synthesis (SVS), as a specific task for generating the voc...
Multimodal processing has attracted much attention lately, especially wit...
Automatically generating textual descriptions for massive unlabeled imag...
In this paper, we provide the technical report of Ego4D natural language ...
Dense video captioning aims to generate corresponding text descriptions ...
Text-Video retrieval is a task of great practical value and has received...
Image retrieval with hybrid-modality queries, also known as composing te...
Deep learning based singing voice synthesis (SVS) systems have been demo...
In this paper, we briefly introduce our submission to the Valence-Arousa...
The Image Difference Captioning (IDC) task aims to describe the visual d...
Multimodal emotion recognition study is hindered by the lack of labelled...
Inspired by the success of transformer-based pre-training methods on nat...
Translating e-commerce product descriptions, a.k.a. product-oriented ma...
For an image with multiple scene texts, different people may be interest...
Most current image captioning systems focus on describing general image ...
Emotion recognition in conversation (ERC) is a crucial component in affe...
Entities Object Localization (EOL) aims to evaluate how grounded or fait...
Video paragraph captioning aims to describe multiple events in untrimmed...
The neural network (NN) based singing voice synthesis (SVS) systems requ...
Automatic emotion recognition is an active research topic with wide rang...
Mispronunciation detection is an essential component of the Computer-Ass...
Detecting meaningful events in an untrimmed video is essential for dense...
The goal of the YouMakeup VQA Challenge 2020 is to provide a common benc...
The sequence-level learning objective has been widely used in captioning tas...
Cross-modal retrieval between videos and texts has attracted growing att...
Humans are able to describe image contents with coarse to fine details a...
A storyboard is a sequence of images to illustrate a story containing mu...
This notebook paper presents our model in the VATEX video captioning cha...
Generating image descriptions in different languages is essential to sat...
Contextual reasoning is essential to understand events in long untrimmed...
The neural machine translation model has suffered from the lack of large...
Bilingual lexicon induction, translating words from the source language ...
This notebook paper presents our system in the ActivityNet Dense Caption...
Continuous dimensional emotion prediction is a challenging task where th...
The topic diversity of open-domain videos leads to various vocabularies ...
Generating video descriptions in natural language (a.k.a. video captioni...
This paper describes our winning entry in the ImageCLEF 2015 image sente...
Not all tags are relevant to an image, and the number of relevant tags i...