Language as the Medium: Multimodal Video Classification through text only

09/19/2023
by Laura Hanu, et al.

Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that capture multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for "sight" or "hearing" and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show that these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
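
To make the described pipeline concrete, here is a minimal Python sketch of the prompt-assembly step: per-frame captions and an audio transcript (stand-ins for BLIP-2 and Whisper/ImageBind outputs) are merged into one textual description and handed to an LLM for zero-shot label selection. The example captions, transcript, label set, and the query_llm stub are illustrative assumptions rather than the authors' released code; swapping the stub for a real GPT-3.5 or Llama2 call reproduces the idea.

# Minimal sketch (not the authors' code) of the described pipeline:
# 1) turn a video's frames and audio into text (captions, transcript),
# 2) assemble them into a single textual description,
# 3) ask an LLM to pick the most likely action label zero-shot.
# The caption/transcript inputs and query_llm are hypothetical stand-ins
# for BLIP-2, Whisper/ImageBind outputs and GPT-3.5 or Llama2.

from typing import List


def build_prompt(frame_captions: List[str], transcript: str, labels: List[str]) -> str:
    """Merge per-frame captions and the audio transcript into one prompt."""
    caption_block = "\n".join(
        f"- t={i}s: {caption}" for i, caption in enumerate(frame_captions)
    )
    label_block = ", ".join(labels)
    return (
        "You are given textual descriptions of a video.\n"
        f"Visual captions (one per sampled frame):\n{caption_block}\n"
        f"Audio transcript: {transcript or '(no speech detected)'}\n\n"
        f"Which of the following actions best describes the video: {label_block}?\n"
        "Answer with the single most likely label."
    )


def query_llm(prompt: str) -> str:
    """Placeholder for a chat LLM call (e.g. GPT-3.5 or Llama2).

    Swap in your preferred client here; returning a fixed string keeps
    the sketch runnable without API credentials.
    """
    return "playing guitar"


if __name__ == "__main__":
    # Hypothetical outputs of an image captioner and a speech recognizer.
    captions = [
        "a man sitting on a stool holding an acoustic guitar",
        "close-up of fingers strumming guitar strings",
        "a small audience clapping in a dim room",
    ]
    transcript = "thank you, this next song is one of my favourites"
    labels = ["playing guitar", "playing violin", "clapping", "singing"]

    prompt = build_prompt(captions, transcript, labels)
    print(prompt)
    print("Predicted label:", query_llm(prompt))

The key design choice the sketch reflects is that all modality fusion happens in text: once the visual and aural streams are described in words, the LLM performs the classification in-context with no video-text finetuning.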

