
-
Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss
The neural network (NN) based singing voice synthesis (SVS) systems requ...
-
Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching
Automatic emotion recognition is an active research topic with wide rang...
-
Context-aware Goodness of Pronunciation for Computer-Assisted Pronunciation Training
Mispronunciation detection is an essential component of the Computer-Ass...
-
Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning
Detecting meaningful events in an untrimmed video is essential for dense...
-
YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos
The goal of the YouMakeup VQA Challenge 2020 is to provide a common benc...
-
Better Captioning with Sequence-Level Exploration
Sequence-level learning objective has been widely used in captioning tas...
-
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Cross-modal retrieval between videos and texts has attracted growing att...
-
Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs
Humans are able to describe image contents with coarse to fine details a...
-
Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences
A storyboard is a sequence of images to illustrate a story containing mu...
-
Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019
This notebook paper presents our model in the VATEX video captioning cha...
-
Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards
Generating image descriptions in different languages is essential to sat...
-
Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos
Contextual reasoning is essential to understand events in long untrimmed...
-
From Words to Sentences: A Progressive Learning Approach for Zero-resource Machine Translation with Visual Pivots
The neural machine translation model has suffered from the lack of large...
-
Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data
Bilingual lexicon induction, translating words from the source language ...
-
RUC+CMU: System Report for Dense Captioning Events in Videos
This notebook paper presents our system in the ActivityNet Dense Caption...
-
Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction
Continuous dimensional emotion prediction is a challenging task where th...
-
Video Captioning with Guidance of Multimodal Latent Topics
The topic diversity of open-domain videos leads to various vocabularies ...
-
Generating Video Descriptions with Topic Guidance
Generating video descriptions in natural language (a.k.a. video captioni...
-
Improving Image Captioning by Concept-based Sentence Reranking
This paper describes our winning entry in the ImageCLEF 2015 image sente...
-
Adaptive Tag Selection for Image Annotation
Not all tags are relevant to an image, and the number of relevant tags i...