Jointly Optimizing Sensing Pipelines for Multimodal Mixed Reality Interaction

by Darshana Rathnayake et al.

Natural human interactions for Mixed Reality applications are overwhelmingly multimodal: humans communicate intent and instructions via a combination of visual, aural, and gestural cues. However, supporting low-latency and accurate comprehension of such multimodal instructions (MMI) on resource-constrained wearable devices remains an open challenge, especially as state-of-the-art comprehension techniques for each individual modality increasingly utilize complex Deep Neural Network models. We demonstrate the possibility of overcoming the core latency-vs.-accuracy tradeoff by exploiting cross-modal dependencies, i.e., by compensating for the inferior performance of one model with the increased accuracy of a more complex model of a different modality. We present a sensor fusion architecture that performs MMI comprehension in a quasi-synchronous fashion by fusing visual, speech, and gestural input. The architecture is reconfigurable and supports dynamic modification of the complexity of the data processing pipeline for each individual modality in response to contextual changes. Using a representative "classroom" context and a set of four common interaction primitives, we then demonstrate how the choices between low- and high-complexity models for each individual modality are coupled. In particular, we show that (a) a judicious combination of low- and high-complexity models across modalities can offer a dramatic 3-fold decrease in comprehension latency together with a 10-15% increase in accuracy, and (b) the ideal model combination is context dependent, with the performance of some model combinations being significantly more sensitive to changes in scene context or choice of interaction.
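The abstract's core idea, trading off per-modality model complexity against an overall latency budget, can be sketched as a small search over low/high pipeline configurations. This is a minimal illustrative sketch, not the paper's actual architecture; the modality names, latency, and accuracy numbers in `PROFILES` are invented placeholders.

```python
from itertools import product

LOW, HIGH = "low", "high"

# Hypothetical per-model profiles: (latency in ms, accuracy).
# Values are illustrative only, not measurements from the paper.
PROFILES = {
    ("vision", LOW): (40, 0.78),  ("vision", HIGH): (220, 0.93),
    ("speech", LOW): (30, 0.82),  ("speech", HIGH): (150, 0.95),
    ("gesture", LOW): (15, 0.75), ("gesture", HIGH): (90, 0.91),
}

MODALITIES = ["vision", "speech", "gesture"]

def best_combination(latency_budget_ms):
    """Pick the per-modality complexity levels that maximize mean
    accuracy while keeping total pipeline latency within the budget."""
    best, best_acc = None, -1.0
    for levels in product([LOW, HIGH], repeat=len(MODALITIES)):
        lat = sum(PROFILES[(m, lv)][0] for m, lv in zip(MODALITIES, levels))
        acc = sum(PROFILES[(m, lv)][1] for m, lv in zip(MODALITIES, levels)) / len(MODALITIES)
        if lat <= latency_budget_ms and acc > best_acc:
            best, best_acc = dict(zip(MODALITIES, levels)), acc
    return best, best_acc
```

Under a tight budget the search picks a mixed combination, e.g. a low-complexity vision model compensated by high-complexity speech and gesture models, which mirrors the paper's observation that cross-modal compensation beats uniformly cheap pipelines.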




