
Vision-based Engagement Detection in Virtual Reality

User engagement modeling for manipulating actions in vision-based interfaces is one of the most important case studies of user mental state detection. In a Virtual Reality environment that employs camera sensors to recognize human activities, we have to know when the user intends to perform an action and when not. Without a proper algorithm for recognizing engagement status, any kind of activity could be interpreted as a manipulating action, which is known as the "Midas Touch" problem. The baseline approach for solving this problem is to activate the gesture recognition system using focus gestures such as waving or raising a hand. However, a desirable natural user interface should be able to understand the user's mental status automatically. In this paper, a novel multi-modal model for engagement detection, DAIA, is presented. Using DAIA, the spectrum of mental status for performing an action is quantized into a finite number of engagement states. For this purpose, a Finite State Transducer (FST) is designed. This engagement framework shows how to integrate multi-modal information from user biometric data streams such as 2D and 3D imaging. The FST is employed to make state transitions smooth using a combination of several boolean expressions. Our FST true detection rate is 92.3%. Results also show that the FST can segment user hand gestures more robustly.





1 Introduction

User mental status detection has a variety of applications in human activity recognition systems. Without a proper algorithm for detecting human intention to interact, the vision-based system is always on; therefore any kind of activity may be interpreted as an interaction [1]. Group meetings, which are frequent business events, are modeled as a case study. In this case study, among all available data streams, tracked 3D skeleton data is used for user engagement detection in meetings. Multiple binary classifiers are implemented to detect the user's intention to perform an action. The outputs of these binary classifiers are used to create the transition and guard conditions of the FST. Characteristics of engagement will be discussed, and biometric data that can be used for this purpose will be introduced. 3D skeleton tracking will be introduced as one of the channels of biometric information for engagement detection. Although we use only this one channel of biometric data, experimental results show that we can still predict engagement with high accuracy.

In addition, DAIA, the FST for engagement detection, helps the system move between states smoothly. The FST is a predefined structure based on our knowledge of human activities that helps the system predict the engagement state more accurately. Furthermore, the FST algorithm is computationally efficient, which makes on-line, real-time performance achievable.

1.1 Related Works

Engagement has been investigated in various fields such as education, organizational behavior, work, and media. Engagement is defined as the value that a participant in an interaction attributes to the goal of being together with the other participant(s) and continuing the interaction [2, 3]. It is also defined as the process by which two or more participants establish, maintain, and end their perceived connection, which is directly related to attention [3, 4]. Other definitions of engagement include "effort without distress" [5], "a meaningful involvement" [4], and a state enabled through vigor, dedication, and absorption [6].

Body posture gives important information about engagement. Various approaches based on body language analysis have been investigated to improve human-computer interaction. Some of these studies address the intention to engage with an agent, e.g. a robot [7, 8], or with an interactive display [9]. Measuring the engagement intent is used in service robots to distinguish relevant gestures from irrelevant ones, which is known as the Midas Touch problem [7, 10]. Another research interest is learner engagement with robotic companions or interfaces [11]. In addition, the intention to engage with a display for improving user identification is addressed by Schwarz et al. [9].

The role of body pose and motion in detecting user interest with body tracking systems such as Kinect has been addressed in several studies [12, 13, 14, 15].

A variety of studies strive for a multi-modal approach, using features of facial expression, body motion, voice, or seat pressure to elucidate mental states. Benkaouar et al. [16] discuss gaze and upper body posture for engagement detection. Schwarz et al. [9] used a combination of gaze, upper body, and arm position to detect the intention to engage. Vaufreydaz et al. [7] used gaze and proxemics, and Salam et al. [3] employed human state observation for engagement detection. Engell et al. [17] used gaze and facial expression, and Balaban et al. [18] employed weight, head, and upper body motion; Scherer et al. [19] and Dael et al. [6] discuss voice, face, and posture. Using a Finite State Machine (FSM) for multi-modal system modeling is addressed in multiple studies [20, 21, 22, 23].

Mota and Picard [24] used neural networks for posture detection together with Hidden Markov Models to detect engagement at an overall accuracy of 77%, which requires expensive computation. Michalowski et al. offered a spatial model combined with gaze tracking to detect user engagement with a robot receptionist [25].

All of the studies mentioned in this section discuss a very small set of potential classifiers. They also do not make full use of the qualitative research on non-verbal body language and its indication of mental states. In this research, we address these features for engagement modeling and detection.

2 Engagement Modeling and Metrics

Frank et al. [26] have proposed a multi-modal engagement user state detection process. In our intended scenario, multiple people are within the operating range of the sensor, e.g. the field of view of the camera. This module identifies the state of the participants on an engagement scale ranging from disengaged up to involved action. Using the biometric information, the module identifies the person who exhibits a specific combination of classifiers of all categories. The classifiers, e.g. body posture components, are chosen based on research relating them to attentiveness and engagement. The analysis occurs on a frame-by-frame basis. Each frame is analyzed with regard to all classifiers, e.g. whether the person 1) is raising a hand, 2) is facing a display, 3) is leaning forward, 4) is leaning backward, 5) is uttering feedback, 6) is slouching, or 7) has changed position in the last 60 frames. Each classifier is evaluated as a binary value: exhibited or not exhibited in a specific frame.

Among all the biometric information available for extending the proposed framework to detect the user's engagement state, such as gaze, voice, and gesture, we use only the 3D joint information provided by a depth camera and the NiTE SDK by Primesense. Although using more biometric channels would make the system more accurate, our experimental results show that even this single channel can yield a high-performance user engagement detection system.

Upper body joints play an important role in engagement detection. Multiple classifiers are designed to detect and classify upper body direction. In addition, hand movements such as raising a hand, pointing, swiping, pushing, or pulling are used for manipulation in vision-based interfaces. Therefore, different classifiers are designed to detect various hand movements, such as raising a hand above the waist, as well as different levels of hand speed. These classifiers help to detect the user's intention to perform an action.

An action video with $T$ frames and $N$ joints in each frame can be represented as a sequence of sets of 3D points, written as $V = \{p_t^i \mid t = 1, \dots, T;\ i = 1, \dots, N\}$. The 3D sensor provides fifteen joints, and $T$ varies for different sequences. In our system, however, $N = 10$, because our classifiers use only ten upper body joints from this set: head, left and right shoulders, left and right elbows, left and right hands, torso, and left and right hips. For each point, the three-dimensional position is obtained.

The first step of feature extraction is to compute the basic feature for each frame, which describes the pose of each of these ten joints in a single frame. The second step is to calculate the left and right hand speeds. This feature is obtained by calculating the 3D distance that each hand moves between two consecutive frames.
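The hand speed feature described above can be sketched in a few lines of Python. This is only an illustration: the frame layout (a dictionary from joint name to an `(x, y, z)` tuple) and the joint name `right_hand` are assumptions made for the example, not the data structures of the original C++ implementation.

```python
import math

# Hypothetical frame layout: joint name -> (x, y, z) position reported by the tracker.
Frame = dict[str, tuple[float, float, float]]


def hand_speeds(frames: list[Frame], joint: str = "right_hand") -> list[float]:
    """Return the 3D distance the joint moves between each pair of consecutive frames."""
    return [
        math.dist(previous[joint], current[joint])
        for previous, current in zip(frames, frames[1:])
    ]


# Three frames of a single tracked hand.
frames = [
    {"right_hand": (0.0, 0.0, 0.0)},
    {"right_hand": (3.0, 4.0, 0.0)},
    {"right_hand": (3.0, 4.0, 12.0)},
]
speeds = hand_speeds(frames)
print(speeds)
```

Because the sensor delivers frames at a fixed rate, the per-frame displacement is proportional to speed, so no explicit division by the frame interval is needed for thresholding.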

The binary values of the individual binary classifiers are weighted based on their relative influence in the training data and summed up to an engagement score. The engagement score thereby assumes a value between 0 and 1.

Figure 1 shows how our system extracts features from users and calculates the engagement level of each user.

Figure 1: System Approach

The engagement score is calculated using $E = \mathbf{w}^{\top}\mathbf{c} = \sum_i w_i c_i$, where $\mathbf{w}$ is the vector of weights and $\mathbf{c}$ is the vector of binary classifiers, such that the $c_i$ are 0 or 1 based on the output of the binary classifiers.
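As a minimal sketch of this weighted sum, the following Python function combines binary classifier outputs with a weight vector. The weight values are hypothetical and chosen to sum to 1, which is one simple way (an assumption here, not something the text states) to keep the score between 0 and 1.

```python
def engagement_score(weights: list[float], classifier_outputs: list[int]) -> float:
    """Weighted sum of binary classifier outputs (each output must be 0 or 1)."""
    if len(weights) != len(classifier_outputs):
        raise ValueError("weights and classifier outputs must have the same length")
    if any(output not in (0, 1) for output in classifier_outputs):
        raise ValueError("classifier outputs must be binary")
    return sum(w * c for w, c in zip(weights, classifier_outputs))


# Hypothetical weights for three classifiers; they sum to 1 so the score stays in [0, 1].
weights = [0.5, 0.25, 0.25]
outputs = [1, 0, 1]
score = engagement_score(weights, outputs)
print(score)
```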

Table 1 lists the binary classifiers designed for this purpose. For each posture status, multiple binary classifiers are designed. The 0 or 1 outputs of these 37 classifiers are used to build our feature vector for the intention to action classifier.

| Posture Status | Binary Classifiers |
|---|---|
| Hand Horizontal | Right of Body, Close to Body, Left of Body |
| Hand Vertical | Below Hip, Below Torso, Below Shoulder, Below Head |
| Hand Depth | Back of Body, Close to Body, Front of Body |
| Hand Speed | Stopped, Slow, Fast, Too Fast |
| Body Direction | Facing Sensor |
| Leaning | Lean Back, No Lean, Lean Forward |
| Special Postures | Hands Folded, Hands on Head, Hands on Torso |

Table 1: Binary classifiers for each posture status
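To illustrate how one row of Table 1 could turn a continuous measurement into binary classifier outputs, the sketch below bins a per-frame hand displacement into the four Hand Speed classifiers. The threshold values, and the assumption that exactly one of the four classifiers fires per frame, are hypothetical; the source does not give them.

```python
# Hypothetical per-frame displacement thresholds in metres (not taken from the paper).
SLOW_MIN = 0.005
FAST_MIN = 0.03
TOO_FAST_MIN = 0.15


def hand_speed_classifiers(displacement: float) -> dict[str, int]:
    """Map one hand's per-frame displacement to four binary Hand Speed outputs."""
    if displacement < 0:
        raise ValueError("displacement must be non-negative")
    return {
        "Stopped": int(displacement < SLOW_MIN),
        "Slow": int(SLOW_MIN <= displacement < FAST_MIN),
        "Fast": int(FAST_MIN <= displacement < TOO_FAST_MIN),
        "Too Fast": int(displacement >= TOO_FAST_MIN),
    }


print(hand_speed_classifiers(0.01))
```

Concatenating the outputs of all such classifiers, for both hands and the body, gives the binary feature vector for a frame.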

These classifiers are mostly designed based on heuristic information extracted from the 3D locations of the joints. For instance, the body direction classifier is built using the normal vector of the plane containing the right shoulder, left shoulder, and torso joints.
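The body direction heuristic can be sketched as follows. The coordinate convention (x to the sensor's right, y up, z increasing away from the sensor), the 30 degree tolerance, and the function names are assumptions made for this example; real skeleton SDKs differ in axis and mirroring conventions, so the sign of the normal may need to be flipped.

```python
import math

Vec3 = tuple[float, float, float]


def subtract(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def is_facing_sensor(left_shoulder: Vec3, right_shoulder: Vec3, torso: Vec3,
                     max_angle_deg: float = 30.0) -> bool:
    """Return True when the shoulder-torso plane normal points towards the sensor."""
    # With the assumed axes, this ordering makes the normal point along -z
    # (towards the sensor) when the user faces the sensor.
    normal = cross(subtract(right_shoulder, torso), subtract(left_shoulder, torso))
    length = math.sqrt(sum(component * component for component in normal))
    if length == 0.0:
        return False  # degenerate plane: the three joints are collinear
    cos_angle = -normal[2] / length  # cosine of the angle to the direction (0, 0, -1)
    return cos_angle >= math.cos(math.radians(max_angle_deg))


# A user two metres away, squarely facing the sensor.
frontal = is_facing_sensor((0.2, 0.3, 2.0), (-0.2, 0.3, 2.0), (0.0, 0.0, 2.0))
# The same user turned sideways, with the shoulders aligned along the depth axis.
sideways = is_facing_sensor((0.0, 0.3, 2.2), (0.0, 0.3, 1.8), (0.0, 0.0, 2.0))
print(frontal, sideways)
```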

3 DAIA: FST for Engagement Detection

DAIA is a frame-based engagement detection system using an FST. State machines describe the life cycle of a system. They describe the different states of the lifeline, the events that influence it, and what the system does when a particular event is detected in any state, which serves as the transition condition for a particular state change. They offer a complete specification of the dynamic behavior of the system. Figure 2 shows the outline of this framework.

Figure 2: DAIA: FST for Engagement Detection

In order to increase the efficiency and accuracy of our algorithm, we implemented a Finite State Transducer (FST). A state is a description of the mental state or engagement of the user that is anticipated to change over time. A transition is initiated by a change in condition that results in a change of state. For example, when a gesture recognition system is used to find meaningful user gestures, swiping or pointing is accepted in some states, such as the Action state, while similar gestures in the Attention state are ignored or interpreted differently. In this research, we model engagement as a finite state transducer with four different states:

  • Disengagement: The user is disengaged from the screen or the target person/object.
  • Attention: The user is paying attention but has no intention to perform any action.
  • Intention: The user intends to perform some action but is not yet performing it.
  • Action: The user is performing an action.

A finite state transducer is a sextuple $(\Sigma, \Gamma, S, s_0, \delta, \omega)$, where:

  • $\Sigma$ is the input alphabet (a finite, non-empty set of symbols).
  • $\Gamma$ is the output alphabet (a finite, non-empty set of symbols).
  • $S$ is a finite, non-empty set of states.
  • $s_0$ is the initial state, an element of $S$.
  • $\delta$ is the state-transition function: $\delta : S \times \Sigma \to S$.
  • $\omega$ is the output function.

If the output function is a function of a state and the input alphabet ($\omega : S \times \Sigma \to \Gamma$), that definition corresponds to the Mealy model, and the transducer can be modeled as a Mealy machine.
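The Mealy model can be made concrete with a small generic implementation in which the transition function and the output function are stored together in one lookup table. The two-state example alphabet below (`face`, `away`, and the outputs `enter`, `stay`, `leave`) is purely hypothetical and much simpler than DAIA itself.

```python
from typing import Hashable, Iterable


class MealyMachine:
    """Finite state transducer whose output depends on the current state and the input."""

    def __init__(self, initial_state: Hashable,
                 table: dict[tuple[Hashable, Hashable], tuple[Hashable, Hashable]]):
        # table maps (state, input symbol) -> (next state, output symbol),
        # i.e. it stores the transition and output functions side by side.
        self.state = initial_state
        self.table = table

    def step(self, symbol: Hashable) -> Hashable:
        next_state, output = self.table[(self.state, symbol)]
        self.state = next_state
        return output

    def run(self, symbols: Iterable[Hashable]) -> list[Hashable]:
        return [self.step(symbol) for symbol in symbols]


# Hypothetical toy transducer with two states and two input symbols.
table = {
    ("idle", "face"): ("active", "enter"),
    ("idle", "away"): ("idle", "stay"),
    ("active", "face"): ("active", "stay"),
    ("active", "away"): ("idle", "leave"),
}
machine = MealyMachine("idle", table)
outputs = machine.run(["face", "face", "away", "away"])
print(outputs, machine.state)
```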

Some hypotheses are considered in designing this FST:

  • Engagement states change in a specific order. This property describes the FST design. The FST starts in Disengagement (the initial state). Every state can transition to Disengagement, but the other engagement states are always reached through an ordered chain of state transitions.
  • We may modify the detected state based on the conditions of the FST. States should transition smoothly, and the FST smooths state transitions, which helps gesture segmentation. We do not change state based only on the Intention to Act or disengagement classifier. Because human activities are continuous, we change states smoothly using protections called guard conditions. Furthermore, the engagement state classifier is memoryless: it may report the wrong engagement state based on the current biometric properties alone. The FST, however, keeps a record of engagement states and helps relabel the frames more accurately.

Table 2 describes the transition conditions, $T_{ij}$, for these state changes, where $i$ is the current state and $j$ is the target state, with the states numbered 1 (Disengagement), 2 (Attention), 3 (Intention), and 4 (Action). Each $T_{ij}$ is the combination of the event triggering the transition, the target state, the guard, and the actions, as follows:

| From \ To | Disengagement | Attention | Intention | Action |
|---|---|---|---|---|
| Disengagement | $T_{11}$ | $T_{12}$ | - | - |
| Attention | $T_{21}$ | $T_{22}$ | $T_{23}$ | - |
| Intention | $T_{31}$ | $T_{32}$ | $T_{33}$ | $T_{34}$ |
| Action | $T_{41}$ | - | $T_{43}$ | $T_{44}$ |

Table 2: Transition table

  • $T_{11}$, $T_{21}$, $T_{31}$, $T_{41}$: Body Direction is not facing the sensor, or a Special Posture such as Hands Folded exists.
  • $T_{22}$, $T_{33}$, $T_{44}$: The outputs of the binary classifiers and of the Intention to Act classifier do not change.
  • $T_{12}$: Body Direction is facing the sensor.
  • $T_{23}$: The Intention to Act classifier is triggered and its output is 1, but both Hand Speed classifiers report stopped in at least one frame of each window of a predefined number of consecutive frames. This window is a guard that protects the state from transitions caused by small hand movements that are not actions.
  • $T_{32}$: The Intention to Act classifier changes from 1 to 0 for more than a predefined number of consecutive frames. This window of frames guards against transitions caused by small position changes that briefly set the Intention to Act classifier to zero.
  • $T_{34}$: The Intention to Act classifier is triggered and its output is 1, and at least one of the Hand Speed classifiers for detecting slow or fast movement is 1 for a predefined number of consecutive frames.
  • $T_{43}$: The Intention to Act classifier is triggered and its output is 1, but both Hand Speed classifiers for detecting stopped are 1 for a predefined number of consecutive frames.

After each transition, if the FST recognizes that the labels assigned to some frames are wrong, it can reconsider and modify them by relabeling. In addition, based on an analysis of the hand speed signal, the FST relabels the frames from the point where the user starts moving to raise a hand. This gives our gesture recognition system a correctly segmented gesture.
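The guarded transitions described above can be sketched as a small state machine. This is a simplified reading of the conditions, not the authors' implementation: the `FrameInput` fields, the single shared guard length `guard_frames`, and the choice to restart the guard window after every transition are assumptions made for the example, and the relabeling of earlier frames is omitted.

```python
from collections import deque
from dataclasses import dataclass

DISENGAGEMENT, ATTENTION, INTENTION, ACTION = (
    "Disengagement", "Attention", "Intention", "Action")


@dataclass
class FrameInput:
    facing: bool           # Body Direction classifier: facing the sensor
    special_posture: bool  # a Special Posture such as hands folded is present
    intention: int         # Intention to Act classifier output (0 or 1)
    hands_moving: bool     # at least one hand is classified as slow or fast
    hands_stopped: bool    # both hands are classified as stopped


class EngagementFST:
    def __init__(self, guard_frames: int = 3):
        self.state = DISENGAGEMENT
        self.guard = guard_frames
        self.recent = deque(maxlen=guard_frames)  # guard window, newest frame last

    def _held(self, predicate) -> bool:
        """True when the predicate holds for a full guard window of frames."""
        return len(self.recent) == self.guard and all(predicate(f) for f in self.recent)

    def step(self, frame: FrameInput) -> str:
        self.recent.append(frame)
        previous = self.state
        if not frame.facing or frame.special_posture:
            self.state = DISENGAGEMENT  # reachable from every state
        elif self.state == DISENGAGEMENT:
            self.state = ATTENTION
        elif self.state == ATTENTION:
            if frame.intention == 1 and any(f.hands_stopped for f in self.recent):
                self.state = INTENTION
        elif self.state == INTENTION:
            if self._held(lambda f: f.intention == 0):
                self.state = ATTENTION
            elif self._held(lambda f: f.intention == 1 and f.hands_moving):
                self.state = ACTION
        elif self.state == ACTION:
            if self._held(lambda f: f.intention == 1 and f.hands_stopped):
                self.state = INTENTION
        if self.state != previous:
            self.recent.clear()  # restart the guard window after a transition
        return self.state


away = FrameInput(False, False, 0, False, True)
idle = FrameInput(True, False, 0, False, True)
ready = FrameInput(True, False, 1, False, True)
moving = FrameInput(True, False, 1, True, False)

fst = EngagementFST(guard_frames=2)
sequence = [away, idle, ready, moving, moving, ready, ready, idle, idle, away]
states = [fst.step(frame) for frame in sequence]
print(states)
```

With a guard of two frames, a single moving frame is not enough to leave Intention; the state changes to Action only once the movement has persisted for the whole window, which is the smoothing effect the guards are meant to provide.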

4 Experiment Results

The DAIA framework was implemented in C++ on a Windows workstation. We used an ASUS Xtion Pro to capture depth images and tracked skeletons using the Primesense NiTE SDK. The system ran at 30 frames per second. As shown in Table 1, we created 37 binary classifiers. For each posture status, multiple binary classifiers are designed. The 0 or 1 outputs of these 37 classifiers are used to build our feature vector, $\mathbf{c}$, for the intention to action classifier. Furthermore, we need to define $\mathbf{w}$, the vector of weights, to calculate the engagement score, and afterwards we need to define a threshold to classify the frame as intention to act or disengagement, similar to the procedure proposed in [9]. Calculating these weights requires extensive research on different body postures. Furthermore, using constant weight values for the different classifiers may result in wrong classifications for complex body postures. Therefore, instead of defining constant values for the weight vector, we used an SVM [27] with a linear kernel to train our intention to act classifier. We used $\mathbf{c}$ as the feature vector for training and testing our SVM.

To train the classifier, a simple "Catch the Box!" game was designed. In this game, we used the hand tracking algorithm implemented in the NiTE middleware by Primesense to move the cursor on the screen. A solid rectangle appears at a random position on the screen, and the user must move the cursor onto the rectangle to receive points. The game has three stages: "Getting Ready", "Play", and "Stop". The binary classifier outputs of Table 1 in the "Play" stage of the game are combined into a series of 0s and 1s and used as "Intention to Act" feature vectors to train a binary SVM with a linear kernel. The classifier outputs in the "Getting Ready" and "Stop" stages of the game provide the feature vectors for the SVM when "Intention to Act" is not present.

We captured and labeled 23,210 frames from 5 different subjects who played the game separately. 5,000 frames were used for training and the remaining 18,210 for testing the classifier. The performance of this classifier was 86.38%. Each frame is classified as Intention to Act or not. The FST then helps to relabel the frame based on the current state properties and guards. In our experiment, we asked 30 different users to listen to commands given in random order from a list of actions, such as "raising hand" or "swiping right to left", and to perform them in front of a screen.

Afterwards, each recorded frame was labeled by an expert with one of the four engagement states, from Disengagement to Action, and used as ground truth. Our FST performance is calculated by comparing the states reported by the FST after relabeling the frames with the manually labeled ground truth. In total, 165,422 frames were labeled with engagement states. The results are gathered in Table 3. Figure 3 shows engagement state detection using the FST and combinations of boolean operations for raising and putting down the right hand.

| State | FST Performance |
|---|---|
| Disengagement | 97.3% |
| Attention | 87.2% |
| Intention | 90.8% |
| Action | 94.2% |
| Total | 92.3% |

Table 3: Performance of FST
Figure 3: Engagement state detection using FST for raising and putting down right hand in 600 frames: a) Facing classifier b) Right hand speed value c) Intention to Act Classifier d) Engagement states for raising and putting down right hand
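A per-state evaluation of this kind can be sketched as below. The sketch assumes that the reported performance is the fraction of ground-truth frames of each state that received the same label from the FST, and that the total is the fraction of all frames labeled correctly; the source does not spell out the exact metric, and the labels in the example are made up.

```python
from collections import Counter


def per_state_accuracy(predicted: list[str], ground_truth: list[str]):
    """Return ({state: fraction of its ground-truth frames labeled correctly}, overall fraction)."""
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("label sequences must be non-empty and of equal length")
    totals = Counter(ground_truth)
    correct = Counter(truth for guess, truth in zip(predicted, ground_truth) if guess == truth)
    per_state = {state: correct[state] / totals[state] for state in totals}
    overall = sum(correct.values()) / len(ground_truth)
    return per_state, overall


# Made-up labels for eight frames.
truth = ["Disengagement", "Disengagement", "Attention", "Attention",
         "Intention", "Action", "Action", "Action"]
guess = ["Disengagement", "Disengagement", "Attention", "Intention",
         "Intention", "Action", "Action", "Intention"]
per_state, overall = per_state_accuracy(guess, truth)
print(per_state, overall)
```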

5 Discussions and Conclusions

In this paper, a novel multi-modal model for engagement is introduced. Based on this model, a combination of tracked 3D gesture data is employed for user engagement detection. The spectrum of mental status for performing an action is thereby quantized into a finite number of engagement states. Furthermore, a finite state transducer (FST) with the following engagement states is proposed: Disengagement, Attention, Intention, and Action. Results show that our Intention to Act performance is 86.38%. In addition, the FST relabels some of those labels based on the history of engagement states and the guard conditions. The performance of our FST in labeling the frames correctly is 92.3%. The processing time for each frame is less than 10 ms, which indicates that our algorithm is usable in real time.

In future research, we expect to use other channels of biometric information, such as voice and facial data such as gaze. We may reach even higher true detection rates using these extra channels of information. In addition, by using multiple cameras and calculating the engagement state of each member of the audience in a meeting room, we will be able to detect the main operator and give control of the vision-based interface to that participant.


  • [1] Rick Kjeldsen, Anthony Levas, and Claudio Pinhanez, “Dynamically reconfigurable vision-based user interfaces,” in Computer Vision Systems, pp. 323–332. Springer, 2003.
  • [2] Rodie Cowie, Ursula Hess, Shlomo Hareli, Maria Francesca O’Connor, Laurel D Riek, Louis-Philippe Morency, Jonathan Aigrain, Severine Dubuisson, Marcin Detyniecki, Mohamed Chetouani, et al., “Cbar 2015: Context based affect recognition,” 2015.
  • [3] Hanan Salam and Mohamed Chetouani, “A multi-level context-based modeling of engagement in human-robot interaction,” in Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on. IEEE, 2015, vol. 3, pp. 1–6.
  • [4] Candace L Sidner, Cory D Kidd, Christopher Lee, and Neal Lesh, “Where to look: a study of human-robot engagement,” in Proceedings of the 9th international conference on Intelligent user interfaces. ACM, 2004, pp. 78–84.
  • [5] Kelli I Stajduhar, Sally E Thorne, Liza McGuinness, and Charmaine Kim-Sing, “Patient perceptions of helpful communication in the context of advanced cancer,” Journal of clinical nursing, vol. 19, no. 13-14, pp. 2039–2047, 2010.
  • [6] Nele Dael, Marcello Mortillaro, and Klaus R Scherer, “Emotion expression in body action and posture.,” Emotion, vol. 12, no. 5, pp. 1085, 2012.
  • [7] Dominique Vaufreydaz, Wafa Johal, and Claudine Combe, “Starting engagement detection towards a companion robot using multimodal features,” Robotics and Autonomous Systems, 2015.
  • [8] David Klotz, Johannes Wienke, Julia Peltason, Britta Wrede, Sebastian Wrede, Vasil Khalidov, and Jean-Marc Odobez, “Engagement-based multi-party dialog with a humanoid robot,” in Proceedings of the SIGDIAL 2011 Conference. Association for Computational Linguistics, 2011, pp. 341–343.
  • [9] Julia Schwarz, Charles Claudius Marais, Tommer Leyvand, Scott E Hudson, and Jennifer Mankoff, “Combining body pose, gaze, and gesture to determine intention to interact in vision-based interfaces,” in Proceedings of the 32nd annual ACM conference on Human factors in computing systems. ACM, 2014, pp. 3443–3452.
  • [10] Ross Mead, Amin Atrash, and Maja J Matarić, “Proxemic feature recognition for interactive robots: automating metrics from the social sciences,” in International conference on social robotics. Springer, 2011, pp. 52–61.
  • [11] Jyotirmay Sanghvi, Ginevra Castellano, Iolanda Leite, André Pereira, Peter W McOwan, and Ana Paiva, “Automatic analysis of affective postures and body motion to detect engagement with a game companion,” in 2011 6th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2011, pp. 305–311.
  • [12] Nadia Bianchi-Berthouze, “Understanding the role of body movement in player engagement,” Human–Computer Interaction, vol. 28, no. 1, pp. 40–75, 2013.
  • [13] Andrea Kleinsmith and Nadia Bianchi-Berthouze, “Affective body expression perception and recognition: A survey,” Affective Computing, IEEE Transactions on, vol. 4, no. 1, pp. 15–33, 2013.
  • [14] N Bianchi-Berthouze, “What can body movement tell us about players’ engagement,” Measuring Behavior’12, pp. 94–97, 2012.
  • [15] Stylianos Asteriadis, Kostas Karpouzis, and Stefanos Kollias, “Feature extraction and selection for inferring user engagement in an hci environment,” in Human-Computer Interaction. New Trends, pp. 22–29. Springer, 2009.
  • [16] Wafa Benkaouar and Dominique Vaufreydaz, “Multi-sensors engagement detection with a robot companion in a home environment,” in Workshop on Assistance and Service robotics in a human environment at IEEE International Conference on Intelligent Robots and Systems (IROS2012), 2012, pp. 45–52.
  • [17] Andrew D Engell and James V Haxby, “Facial expression and gaze-direction in human superior temporal sulcus,” Neuropsychologia, vol. 45, no. 14, pp. 3234–3241, 2007.
  • [18] Carey D Balaban, Joseph Cohn, Mark S Redfern, Jarad Prinkey, Roy Stripling, and Michael Hoffer, “Postural control as a probe for cognitive state: Exploiting human information processing to enhance performance,” International Journal of Human-Computer Interaction, vol. 17, no. 2, pp. 275–286, 2004.
  • [19] Stefan Scherer, Michael Glodek, Georg Layher, Martin Schels, Miriam Schmidt, Tobias Brosch, Stephan Tschechne, Friedhelm Schwenker, Heiko Neumann, and Günther Palm, “A generic framework for the inference of user states in human computer interaction,” Journal on Multimodal User Interfaces, vol. 6, no. 3-4, pp. 117–141, 2012.
  • [20] Michael Johnston, Srinivas Bangalore, Gunaranjan Vasireddy, Amanda Stent, Patrick Ehlen, Marilyn Walker, Steve Whittaker, and Preetam Maloor, “Match: An architecture for multimodal dialogue systems,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 376–383.
  • [21] Michael Johnston and Srinivas Bangalore, “Finite-state multimodal parsing and understanding,” in Proceedings of the 18th conference on Computational linguistics-Volume 1. Association for Computational Linguistics, 2000, pp. 369–375.
  • [22] Bradley A Singletary and Thad E Starner, “Learning visual models of social engagement,” in Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, 2001. Proceedings. IEEE ICCV Workshop on. IEEE, 2001, pp. 141–148.
  • [23] Marie-Luce Bourguet, “Designing and prototyping multimodal commands.,” in INTERACT. Citeseer, 2003, vol. 3, pp. 717–720.
  • [24] Selene Mota and Rosalind W Picard, “Automated posture analysis for detecting learner’s interest level,” in Computer Vision and Pattern Recognition Workshop, 2003. CVPRW’03. Conference on. IEEE, 2003, vol. 5, pp. 49–49.
  • [25] Marek P Michalowski, Selma Sabanovic, and Reid Simmons, “A spatial model of engagement for a social robot,” in Advanced Motion Control, 2006. 9th IEEE International Workshop on. IEEE, 2006, pp. 762–767.
  • [26] Maria Frank, Ghassem Tofighi, Haisong Gu, and Renate Fruchter, “Engagement detection in meetings,” arXiv preprint arXiv:1608.08711, 2016.
  • [27] Chih-Chung Chang and Chih-Jen Lin, “Libsvm: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 27, 2011.