Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction

05/23/2023
by Vaishnavi Himakunthala, et al.

Despite constituting 65% of all internet traffic in 2023, video content is underrepresented in generative AI research. Meanwhile, recent large language models (LLMs) have become increasingly integrated with capabilities in the visual modality. Integrating video with LLMs is a natural next step, so how can this gap be bridged? To advance video reasoning, we propose a new research direction of VideoCOT on video keyframes, which leverages the multimodal generative abilities of vision-language models to enhance video reasoning while reducing the computational complexity of processing hundreds or thousands of frames. We introduce VIP, an inference-time dataset that can be used to evaluate VideoCOT, containing 1) a variety of real-life videos with keyframes and corresponding unstructured and structured scene descriptions, and 2) two new video reasoning tasks: video infilling and scene prediction. We benchmark various vision-language models on VIP, demonstrating the potential to use vision-language models and LLMs to enhance video chain of thought reasoning.
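As a rough illustration of the two tasks named in the abstract, the sketch below shows one way a keyframe sequence might be split into inputs and targets for video infilling and scene prediction. It is a minimal sketch under stated assumptions, not the VIP data format: the `Keyframe` class, its field names, and the functions `infilling_task` and `prediction_task` are all hypothetical, and the abstract does not specify how many context or target keyframes are used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Keyframe:
    """Hypothetical container for one keyframe's text descriptions."""
    index: int                      # position of the keyframe in the video
    caption: str                    # unstructured scene description
    scene: Dict[str, str] = field(default_factory=dict)  # structured description (keys are assumed)


def infilling_task(
    frames: List[Keyframe], n_before: int, n_masked: int, n_after: int
) -> Tuple[List[Keyframe], List[Keyframe], List[Keyframe]]:
    """Split frames into (preceding context, masked targets, following context)."""
    needed = n_before + n_masked + n_after
    if len(frames) < needed:
        raise ValueError(f"need at least {needed} keyframes, got {len(frames)}")
    before = frames[:n_before]
    targets = frames[n_before:n_before + n_masked]
    after = frames[n_before + n_masked:needed]
    return before, targets, after


def prediction_task(
    frames: List[Keyframe], n_context: int, n_future: int
) -> Tuple[List[Keyframe], List[Keyframe]]:
    """Split frames into (observed context, future targets to predict)."""
    needed = n_context + n_future
    if len(frames) < needed:
        raise ValueError(f"need at least {needed} keyframes, got {len(frames)}")
    return frames[:n_context], frames[n_context:needed]


# Toy sequence of five keyframes with placeholder descriptions.
frames = [Keyframe(i, f"placeholder caption {i}") for i in range(5)]

before, masked, after = infilling_task(frames, n_before=2, n_masked=1, n_after=2)
context, future = prediction_task(frames, n_context=3, n_future=2)
```

In this framing, infilling conditions on keyframes from both sides of a gap, while prediction conditions only on the past; a model under evaluation would be given the context descriptions and asked to generate the held-out ones.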


