VideoChat: Chat-Centric Video Understanding

05/10/2023
by Kunchang Li, et al.
In this study, we initiate an exploration into video understanding by introducing VideoChat, an end-to-end chat-centric video understanding system. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we propose a video-centric instruction dataset composed of thousands of videos matched with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and causal relationships, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system's potential across a broad spectrum of video applications and set the standard for future research. Code and data are available at https://github.com/OpenGVLab/Ask-Anything

research
06/08/2023

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Conversation agents fueled by Large Language Models (LLMs) are providing...
research
05/30/2022

From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering

Video understanding has achieved great success in representation learnin...
research
07/13/2023

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

This paper introduces InternVid, a large-scale video-centric multimodal ...
research
06/12/2023

Valley: Video Assistant with Large Language model Enhanced abilitY

Recently, several multi-modal models have been developed for joint image...
research
11/17/2022

InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges

In this report, we present our champion solutions to five tracks at Ego4...
research
04/27/2023

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

Existing deep video models are limited by specific tasks, fixed input-ou...
research
05/06/2021

Object-centric Video Prediction without Annotation

In order to interact with the world, agents must be able to predict the ...
