Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

08/31/2023
by Riley Tavassoli, et al.

Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning by combining visual representations with the abstract skills that large language models (LLMs) acquire during pretraining. Vision, while the most popular modality with which to augment LLMs, is only one representation of a scene, and in human-robot interaction scenarios accurate scene understanding by the robot often requires more. In this paper, we define and demonstrate a method of aligning the embedding spaces of additional modalities (here, inertial measurement unit (IMU) data) to the vision embedding space through a combination of supervised and contrastive training, enabling the VLM to understand and reason about these modalities without retraining. We opt to feed the model IMU embeddings directly, rather than inserting the output of a separate human activity recognition model into the prompt, so that nonlinear interactions between the query, image, and IMU signal are preserved instead of being collapsed into a discrete activity label. We evaluate this methodology through experiments on human activity recognition using IMU data and visual inputs. Our results show that using multiple modalities as input improves the VLM's scene understanding and overall performance across tasks, paving the way for more versatile and capable language models in multi-modal contexts.
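The alignment idea described in the abstract can be illustrated with a minimal sketch. The snippet below assumes a PyTorch setup in which paired image/IMU windows are available and the frozen VLM's image features serve as alignment targets; the encoder architecture, the names IMUEncoder and alignment_loss, and the loss weighting are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (hypothetical names): align a small IMU encoder to a frozen
# vision embedding space with a contrastive (InfoNCE) objective plus a
# supervised activity-classification term, as outlined in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IMUEncoder(nn.Module):
    """Maps a window of 6-axis IMU samples to the frozen vision embedding dim."""
    def __init__(self, imu_channels=6, embed_dim=768, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(imu_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )
        self.proj = nn.Linear(128, embed_dim)             # project into the vision space
        self.cls_head = nn.Linear(embed_dim, n_classes)   # supervised activity head

    def forward(self, imu):                               # imu: (B, C, T)
        h = self.backbone(imu)
        z = self.proj(h)
        return z, self.cls_head(z)

def alignment_loss(imu_emb, vision_emb, labels, logits, temperature=0.07, alpha=0.5):
    """Symmetric InfoNCE between paired IMU/vision embeddings plus cross-entropy."""
    imu_emb = F.normalize(imu_emb, dim=-1)
    vision_emb = F.normalize(vision_emb, dim=-1)          # frozen VLM image features
    sim = imu_emb @ vision_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)
    contrastive = 0.5 * (F.cross_entropy(sim, targets) +
                         F.cross_entropy(sim.t(), targets))
    supervised = F.cross_entropy(logits, labels)
    return alpha * contrastive + (1 - alpha) * supervised

# Usage sketch: vision_emb comes from the frozen VLM's image encoder.
# encoder = IMUEncoder()
# imu_emb, logits = encoder(imu_batch)                    # imu_batch: (B, 6, T)
# loss = alignment_loss(imu_emb, vision_emb, labels, logits)
# loss.backward()                                         # only the IMU encoder gets gradients
```

In this sketch only the IMU encoder is trained; the vision embeddings are produced by the frozen VLM, so neither the image encoder nor the language model is retrained, consistent with the abstract's claim.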

