Jointly Optimizing Sensing Pipelines for Multimodal Mixed Reality Interaction

10/13/2020
by Darshana Rathnayake et al.

Natural human interactions for Mixed Reality applications are overwhelmingly multimodal: humans communicate intent and instructions via a combination of visual, aural and gestural cues. However, supporting low-latency and accurate comprehension of such multimodal instructions (MMI) on resource-constrained wearable devices remains an open challenge, especially as the state-of-the-art comprehension techniques for each individual modality increasingly utilize complex Deep Neural Network models. We demonstrate the possibility of overcoming the core latency-vs.-accuracy tradeoff by exploiting cross-modal dependencies, i.e., by compensating for the inferior performance of one model with the increased accuracy of a more complex model applied to a different modality. We present a sensor fusion architecture that performs MMI comprehension in a quasi-synchronous fashion by fusing visual, speech and gestural input. The architecture is reconfigurable and supports dynamic modification of the complexity of the data processing pipeline for each individual modality in response to contextual changes. Using a representative "classroom" context and a set of four common interaction primitives, we then demonstrate how the choices between low- and high-complexity models for each individual modality are coupled. In particular, we show that (a) a judicious combination of low- and high-complexity models across modalities can offer a dramatic 3-fold decrease in comprehension latency together with a 10-15% increase in accuracy, and (b) the right choice of model combination is highly context dependent, with the performance of some model combinations being significantly more sensitive to changes in scene context or choice of interaction.
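The abstract only outlines the reconfigurable fusion architecture; as a rough illustration of the idea, the following minimal Python sketch shows how a per-modality choice between low- and high-complexity models might be selected from context and then run as one comprehension step. All names (PipelineConfig, comprehend, reconfigure, the stand-in model functions, and their simulated latencies and confidences) are hypothetical and are not taken from the paper; a real system would invoke actual DNN pipelines, likely concurrently.

import time
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-ins for per-modality comprehension models.
# Each returns (label, confidence); sleep() emulates relative inference cost.
def vision_low(frame):   time.sleep(0.02); return ("cube", 0.71)
def vision_high(frame):  time.sleep(0.15); return ("cube", 0.93)
def speech_low(audio):   time.sleep(0.03); return ("move that there", 0.68)
def speech_high(audio):  time.sleep(0.20); return ("move that there", 0.95)
def gesture_low(imu):    time.sleep(0.01); return ("point", 0.74)
def gesture_high(imu):   time.sleep(0.10); return ("point", 0.91)

MODELS: Dict[str, Dict[str, Callable]] = {
    "vision":  {"low": vision_low,  "high": vision_high},
    "speech":  {"low": speech_low,  "high": speech_high},
    "gesture": {"low": gesture_low, "high": gesture_high},
}

@dataclass
class PipelineConfig:
    vision: str = "low"
    speech: str = "high"
    gesture: str = "low"

def comprehend(config: PipelineConfig, frame, audio, imu):
    """Run each modality at its configured complexity and fuse the outputs
    into a single multimodal instruction (naive late fusion for illustration)."""
    start = time.perf_counter()
    obj, p_obj = MODELS["vision"][config.vision](frame)
    cmd, p_cmd = MODELS["speech"][config.speech](audio)
    ges, p_ges = MODELS["gesture"][config.gesture](imu)
    latency = time.perf_counter() - start
    fused = {"object": obj, "command": cmd, "gesture": ges,
             "confidence": min(p_obj, p_cmd, p_ges)}
    return fused, latency

def reconfigure(scene_is_cluttered: bool) -> PipelineConfig:
    """Context-driven reconfiguration: spend the complexity budget on the
    modality that needs it most, keeping the others cheap."""
    if scene_is_cluttered:
        return PipelineConfig(vision="high", speech="low", gesture="low")
    return PipelineConfig(vision="low", speech="high", gesture="low")

if __name__ == "__main__":
    cfg = reconfigure(scene_is_cluttered=True)
    result, latency = comprehend(cfg, frame=None, audio=None, imu=None)
    print(result, f"latency={latency*1000:.0f} ms")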


Related Research

02/15/2022
Multimodal Driver Referencing: A Comparison of Pointing to Objects Inside and Outside the Vehicle
Advanced in-cabin sensing technologies, especially vision based approach...

06/04/2020
MHVAE: a Human-Inspired Deep Hierarchical Generative Model for Multimodal Representation Learning
Humans are able to create rich representations of their external reality...

02/04/2019
Exploring Temporal Dependencies in Multimodal Referring Expressions with Mixed Reality
In collaborative tasks, people rely both on verbal and non-verbal cues s...

02/03/2018
Multi-attention Recurrent Network for Human Communication Comprehension
Human face-to-face communication is a complex multimodal signal. We use ...

05/02/2018
Investigating Audio, Visual, and Text Fusion Methods for End-to-End Automatic Personality Prediction
We propose a tri-modal architecture to predict Big Five personality trai...

04/08/2021
Multimodal Fusion of EMG and Vision for Human Grasp Intent Inference in Prosthetic Hand Control
For lower arm amputees, robotic prosthetic hands offer the promise to re...

03/14/2017
A computational investigation of sources of variability in sentence comprehension difficulty in aphasia
We present a computational evaluation of three hypotheses about sources ...