While current visual captioning models have achieved impressive performa...
There has been a growing interest in using end-to-end acoustic models fo...
Stylized visual captioning aims to generate image or video descriptions ...
Temporal video grounding (TVG) aims to retrieve the time interval of a l...
Image captioning aims to describe visual content in natural language. As...
To help the visually impaired enjoy movies, automatic movie narrating sy...
Automatically narrating a video with natural language can assist people ...
Automatic image captioning evaluation is critical for benchmarking and p...
Live video commenting is popular on video media platforms, as it can cre...
Image-text retrieval, as a fundamental and important branch of informati...
Singing voice synthesis (SVS), as a specific task for generating the voc...
Multimodal processing has attracted much attention lately, especially wit...
Automatically generating textual descriptions for massive unlabeled imag...
In this paper, we provide the technical report of Ego4D natural language ...
Dense video captioning aims to generate corresponding text descriptions ...
Text-Video retrieval is a task of great practical value and has received...
Image retrieval with hybrid-modality queries, also known as composing te...
Deep learning based singing voice synthesis (SVS) systems have been demo...
In this paper, we briefly introduce our submission to the Valence-Arousa...
The Image Difference Captioning (IDC) task aims to describe the visual d...
Multimodal emotion recognition study is hindered by the lack of labelled...
Inspired by the success of transformer-based pre-training methods on nat...
Translating e-commerce product descriptions, a.k.a. product-oriented ma...
For an image with multiple scene texts, different people may be interest...
Most current image captioning systems focus on describing general image ...
Emotion recognition in conversation (ERC) is a crucial component in affe...
Entities Object Localization (EOL) aims to evaluate how grounded or fait...
Video paragraph captioning aims to describe multiple events in untrimmed...
The neural network (NN) based singing voice synthesis (SVS) systems requ...
Automatic emotion recognition is an active research topic with wide rang...
Mispronunciation detection is an essential component of the Computer-Ass...
Detecting meaningful events in an untrimmed video is essential for dense...
The goal of the YouMakeup VQA Challenge 2020 is to provide a common benc...
The sequence-level learning objective has been widely used in captioning tas...
Cross-modal retrieval between videos and texts has attracted growing att...
Humans are able to describe image contents with coarse to fine details a...
A storyboard is a sequence of images to illustrate a story containing mu...
This notebook paper presents our model in the VATEX video captioning cha...
Generating image descriptions in different languages is essential to sat...
Contextual reasoning is essential to understand events in long untrimmed...
The neural machine translation model has suffered from the lack of large...
Bilingual lexicon induction, translating words from the source language ...
This notebook paper presents our system in the ActivityNet Dense Caption...
Continuous dimensional emotion prediction is a challenging task where th...
The topic diversity of open-domain videos leads to various vocabularies ...
Generating video descriptions in natural language (a.k.a. video captioni...
This paper describes our winning entry in the ImageCLEF 2015 image sente...
Not all tags are relevant to an image, and the number of relevant tags i...