End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering

10/10/2016
by Youngjae Yu, et al.

We propose a high-level concept word detector that can be integrated with any video-to-language model. It takes a video as input and generates a list of concept words that serve as semantic priors for language generation models. The proposed word detector has two important properties. First, it does not require any external knowledge sources for training. Second, it is trainable end-to-end, jointly with any video-to-language model. To make the most of the detected words, we also develop a semantic attention mechanism that selectively focuses on the detected concept words and fuses them with the word encoding and decoding in the language model. To demonstrate that the proposed approach improves the performance of multiple video-to-language tasks, we participate in four tasks of LSMDC 2016. Our approach achieves the best accuracies in three of them: fill-in-the-blank, multiple-choice test, and movie retrieval. We also attain comparable performance on the fourth task, movie description.
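
As a rough illustration of what "selectively focusing" on detected concept words can mean, the sketch below applies generic dot-product attention to a handful of concept word vectors. It is not the formulation used in the paper: the function names, the toy two-dimensional embeddings, and the choice of dot-product scoring are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_to_concepts(state, concept_vectors):
    """Generic dot-product attention (an assumed stand-in, not the paper's model).

    state: language-model state vector (list of floats).
    concept_vectors: one embedding per detected concept word.
    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [sum(s * c for s, c in zip(state, vec)) for vec in concept_vectors]
    weights = softmax(scores)
    context = [
        sum(w * vec[i] for w, vec in zip(weights, concept_vectors))
        for i in range(len(state))
    ]
    return weights, context


# Hypothetical detected concept words with toy embeddings.
concepts = {"car": [1.0, 0.0], "road": [0.0, 1.0], "drive": [1.0, 1.0]}
state = [2.0, 0.0]  # hypothetical decoder state

weights, context = attend_to_concepts(state, list(concepts.values()))
for word, weight in zip(concepts, weights):
    print(f"{word}: {weight:.3f}")
```

In this toy setup the state aligns with "car" and "drive", so those two words receive equal and larger weights than "road". The resulting context vector is what a language model could combine with its word inputs or outputs.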

research
03/20/2023

On-the-fly Text Retrieval for End-to-End ASR Adaptation

End-to-end speech recognition models are improved by incorporating exter...
research
12/26/2018

Hierarchical LSTMs with Adaptive Attention for Visual Captioning

Recent progress has been made in using attention based encoder-decoder f...
research
04/04/2023

Unsupervised Improvement of Factual Knowledge in Language Models

Masked language modeling (MLM) plays a key role in pretraining large lan...
research
09/26/2016

Learning Language-Visual Embedding for Movie Understanding with Natural-Language

Learning a joint language-visual embedding has a number of very appealin...
research
09/05/2019

A Better Way to Attend: Attention with Trees for Video Question Answering

We propose a new attention model for video question answering. The main ...
research
08/07/2018

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

We present an approach named JSFusion (Joint Sequence Fusion) that can m...
research
06/03/2015

What value do explicit high level concepts have in vision to language problems?

Much of the recent progress in Vision-to-Language (V2L) problems has bee...
