Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning

11/07/2018
by Xin Wang, et al.

Although promising results have been achieved in video captioning, existing models are limited to the fixed inventory of activities in the training corpus and do not generalize to open-vocabulary scenarios. Here we introduce a novel task, zero-shot video captioning, which aims to describe out-of-domain videos of unseen activities. Videos of different activities usually require different captioning strategies in many aspects, e.g. word selection, semantic construction, and style of expression, which makes it challenging to depict novel activities without paired training data. Meanwhile, however, similar activities share some of those aspects in common. We therefore propose a principled Topic-Aware Mixture of Experts (TAMoE) model for zero-shot video captioning, which learns to compose different experts based on different topic embeddings, implicitly transferring knowledge learned from seen activities to unseen ones. In addition, we leverage an external topic-related text corpus to construct the topic embedding for each activity, which embodies the most relevant semantic vectors within the topic. Empirical results not only validate the effectiveness of our method in utilizing semantic knowledge for video captioning but also show its strong generalization ability when describing novel activities.
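To make the core idea concrete, here is a minimal, illustrative sketch (not the authors' implementation) of topic-aware expert composition: a topic embedding is built by averaging word vectors of topic-related terms, and the mixture weights over a shared expert bank are derived from that topic embedding rather than from the input, so an unseen activity with a similar topic embedding reuses experts learned from seen activities. All names, dimensions, and the averaging scheme are simplifying assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def topic_embedding(word_vectors):
    # Simplified stand-in for the paper's idea: distill topic-related
    # terms from an external corpus into one per-activity semantic vector.
    return np.mean(word_vectors, axis=0)

class TopicAwareMoE:
    """Toy mixture-of-experts layer whose combination weights are
    composed from the topic embedding, not from the input itself."""

    def __init__(self, n_experts, d_in, d_out, d_topic):
        # Each expert is a simple linear map; the gate scores experts
        # against the topic embedding.
        self.experts = [rng.normal(0, 0.1, (d_out, d_in)) for _ in range(n_experts)]
        self.gate = rng.normal(0, 0.1, (n_experts, d_topic))

    def __call__(self, x, topic_emb):
        logits = self.gate @ topic_emb
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()  # softmax over experts
        # Compose one topic-specific transform from the shared expert bank.
        composed = sum(w * E for w, E in zip(weights, self.experts))
        return composed @ x, weights

# Usage with hypothetical word vectors for one activity topic:
words = rng.normal(size=(5, 16))   # stand-in topic-related word vectors
t = topic_embedding(words)
layer = TopicAwareMoE(n_experts=4, d_in=32, d_out=8, d_topic=16)
y, w = layer(rng.normal(size=32), t)
```

Because the gate sees only the topic embedding, two activities with similar topic embeddings produce similar expert mixtures, which is the mechanism by which knowledge transfers to unseen activities in this sketch.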


Related research

07/05/2023 · Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment
01/22/2022 · Visual Information Guided Zero-Shot Paraphrase Generation
03/12/2020 · ZSTAD: Zero-Shot Temporal Activity Detection
06/06/2021 · Learning Video Models from Text: Zero-Shot Anticipation for Procedural Actions
06/01/2023 · Divide, Conquer, and Combine: Mixture of Semantic-Independent Experts for Zero-Shot Dialogue State Tracking
08/31/2017 · Video Captioning with Guidance of Multimodal Latent Topics
07/24/2022 · SAVCHOI: Detecting Suspicious Activities using Dense Video Captioning with Human Object Interactions
