Models See Hallucinations: Evaluating the Factuality in Video Captioning

03/06/2023
by   Hui Liu, et al.

Video captioning aims to describe events in a video with natural language. In recent years, many works have focused on improving captioning models' performance. However, like other text generation tasks, video captioning risks introducing factual errors that are not supported by the input video. These factual errors can seriously affect the quality of the generated text, sometimes making it completely unusable. Although factual consistency has received much research attention in text-to-text tasks (e.g., summarization), it is less studied in the context of vision-based text generation. In this work, we conduct a detailed human evaluation of factuality in video captioning and collect two annotated factuality datasets. We find that 57.0% of the model-generated sentences have factual errors, indicating that this is a severe problem in the field. However, existing evaluation metrics are mainly based on n-gram matching and show little correlation with human factuality annotations. We further propose a weakly-supervised, model-based factuality metric, FactVC, which outperforms previous metrics on factuality evaluation of video captioning. The datasets and metrics will be released to promote future research on video captioning.
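To illustrate why metrics based on n-gram matching can miss factual errors, the following is a minimal sketch of a clipped n-gram precision score. It is not FactVC and not any specific published metric; the function names and the example captions are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate caption against one reference.

    Simplified illustration (whitespace tokenization, single reference),
    not a full implementation of any standard captioning metric.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical captions: the first candidate names the wrong object,
# the second is factually correct but worded differently.
reference = "a man is slicing a tomato in the kitchen"
wrong_caption = "a man is slicing a potato in the kitchen"
correct_paraphrase = "someone cuts up a tomato"

wrong_score = ngram_precision(wrong_caption, reference)
correct_score = ngram_precision(correct_paraphrase, reference)
print(f"factually wrong caption:   {wrong_score:.3f}")
print(f"factually correct caption: {correct_score:.3f}")
```

In this toy case the caption with the wrong object shares 8 of its 9 unigrams with the reference, while the correct paraphrase shares only 2 of 5, so surface overlap rewards the factually wrong sentence. This is only an intuition for the abstract's claim; the paper's finding rests on correlation with human annotations, not on examples like this one.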


