Ethosight: A Reasoning-Guided Iterative Learning System for Nuanced Perception based on Joint-Embedding Contextual Label Affinity

by Hugo Latapie, et al.

Traditional computer vision models often require extensive manual effort for data acquisition, annotation, and validation, particularly when detecting subtle behavioral nuances or events. Distinguishing routine behaviors from potential risks in real-world applications, such as telling routine shopping apart from potential shoplifting, complicates the process further. Moreover, these models may show high false positive rates and imprecise event detection when real-world scenarios differ significantly from the conditions of the training data. To overcome these hurdles, we present Ethosight, a novel zero-shot computer vision system. Ethosight starts from a clean slate defined by user requirements and the semantic knowledge of interest. Using localized label affinity calculations and a reasoning-guided iterative learning loop, Ethosight infers scene details and iteratively refines the label set. Reasoning mechanisms can be derived from large language models like GPT-4, symbolic reasoners like OpenNARS<cit.><cit.>, or hybrid systems. Our evaluations demonstrate Ethosight's efficacy across 40 complex use cases, spanning domains such as health, safety, and security. Detailed results and case studies in the main body of this paper and in an appendix point to a promising trajectory towards more adaptable and resilient computer vision models for detecting and extracting subtle, nuanced behaviors.
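
The abstract describes the loop only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes label affinity can be approximated as a softmax over cosine similarities between an image embedding and label embeddings, and that the reasoner is any callable that proposes more specific labels from the current top-scoring ones. All names here (`label_affinities`, `refine_labels`, `TOY_EMBEDDINGS`, `toy_reasoner`) and all vectors are hypothetical; a real system would obtain embeddings from a joint-embedding model and proposals from an LLM or symbolic reasoner.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def label_affinities(image_vec, label_vecs, temperature=0.1):
    """Softmax over cosine similarities (an assumed stand-in for label affinity)."""
    sims = {label: cosine(image_vec, vec) for label, vec in label_vecs.items()}
    peak = max(sims.values())  # subtract the max for numerical stability
    weights = {label: math.exp((s - peak) / temperature) for label, s in sims.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

def refine_labels(image_vec, embed, reasoner, labels, rounds=3, top_k=1):
    """Score labels, ask the reasoner for new candidates, and repeat."""
    labels = list(labels)
    for _ in range(rounds):
        scores = label_affinities(image_vec, {l: embed(l) for l in labels})
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        new = [l for l in reasoner(top) if l not in labels]
        if not new:  # the reasoner has nothing further to add
            break
        labels.extend(new)
    # Final affinities over the refined label set.
    return label_affinities(image_vec, {l: embed(l) for l in labels})

# Hypothetical hand-made embeddings standing in for a joint-embedding model.
TOY_EMBEDDINGS = {
    "shopping": (1.0, 0.2, 0.0),
    "cooking": (0.0, 0.1, 1.0),
    "browsing shelves": (0.9, 0.1, 0.0),
    "concealing an item": (0.6, 0.8, 0.0),
}

# Hypothetical reasoner: a lookup table standing in for an LLM or symbolic reasoner.
REFINEMENTS = {"shopping": ["browsing shelves", "concealing an item"]}

def toy_reasoner(top_labels):
    return [fine for coarse in top_labels for fine in REFINEMENTS.get(coarse, [])]

image_vec = (0.5, 0.85, 0.0)  # made-up embedding of a single frame
scores = refine_labels(image_vec, TOY_EMBEDDINGS.__getitem__, toy_reasoner,
                       ["shopping", "cooking"])
best_label = max(scores, key=scores.get)
print(best_label)
```

In this toy run the initial label set cannot separate routine shopping from a potential risk; the reasoner expands "shopping" into finer labels, and the second scoring pass ranks the more specific label highest. The temperature, `top_k`, and stopping rule are illustrative choices rather than values taken from the paper.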




