Facial Electromyography-based Adaptive Virtual Reality Gaming for Cognitive Training

by Lorcan Reidy, et al.
University of Cambridge

As life expectancy rises, age-related diseases causing dementia become more prevalent. In line with this, the health economic impact of dementia is escalating to unsustainable levels, with estimates that by 2050 dementia care will cost an annual $1 trillion in the US alone. The development of interventions capable of improving cognition therefore represents an issue of the highest priority for healthcare. There has been considerable focus on cognitive training (CT) in particular, but work to date has been limited by two main factors, namely (i) the lack of transferability of CT gains to real life activities, and (ii) the lack of adherence to CT programmes. This paper outlines a new CT paradigm designed to offset these two limitations. This is achieved by combining the benefits of gamification, virtual reality (VR), and affective adaptation in the development of an engaging, ecologically valid CT task. Additionally, it incorporates facial electromyography (EMG) as a means of determining user emotional state while engaged in the CT task. This information is then utilised to dynamically adjust the game's difficulty in real time as users play, with the aim of leading them into a state of flow. Emotion recognition rates of 64.1% were achieved by classifying a DWT-Haar approximation of the input signal using kNN. The affect-aware VR cognitive training intervention was then evaluated with a control group of older adults. The results obtained substantiated the notion that adaptation techniques can lead to greater feelings of competence, thereby increasing intrinsic motivation for the activity, and a more appropriate challenge of the user's skills.





I Introduction

Dementia occurs as a consequence of severe brain damage, often due to neurodegenerative disorders such as Alzheimer’s disease, and is defined as a deterioration in memory and thinking to the point where activities of daily living are impaired and functional independence is lost. As dementia prevalence rises with the ageing population, attention is turning increasingly towards interventions that may ameliorate cognitive decline in patients in earlier stages of disease, prior to the onset of dementia.

To this end, cognitive training (CT) has garnered considerable attention. The potential value of CT builds on extensive observations that engagement in intellectually stimulating lifestyle activities helps maintain cognitive function into later life, via enhancement of cognitive reserve [Stern2012]. The aim of CT therefore is to deliver such benefits to cognition as a targeted discrete intervention. However, to date the numerous studies and CT products, often commercially marketed as “brain training” programmes or apps, have not been found on rigorous systematic review (such as that conducted by the Cochrane Collaboration [GatesEtAl2019]) to produce clear evidence of benefit to cognition. This failure has been attributed to three main factors. The first problem is that of adherence to the CT programme; participant engagement often drops off after a period because participants become bored or frustrated with the training programme. This negative effect, and the resultant lack of engagement, is usually a result of the training not being sufficiently challenging, being too difficult, or simply not being sufficiently compelling to begin with [Simons2016]. The second problem is that the skills developed and the cognitive improvements made during these training programmes do not transfer to daily life. While the user might get better at playing the games, there is little evidence that these skills will generalise beyond that [Simons2016]. The third problem relates to the lack in many studies of an appropriate “active control” against which the CT intervention can be compared, thus compromising interpretation of study results.

The work presented in this paper introduces a novel CT programme designed to address all three problems. In keeping with other approaches to CT, a game-based paradigm is used to enhance participant enjoyment and thus increase adherence. The use of VR to create simulated real-world environments within which the CT game is enacted helps overcome the transfer problem and facilitates extension of CT-generated gains to real life activities. Additionally, the design of the CT task incorporates an active control arm. Finally, an affective feedback loop is incorporated that enables dynamic difficulty adjustment (DDA) of the game, in real time, based on the player’s affective state [Liu2009] and their in-game performance.

This work aims to answer the following two preliminary research questions, crucial to future large-scale trialling and implementation in older adults and patients with mild cognitive impairment at risk of developing dementia:

  • Does the use of VR, gamification, and affective adaptation lead to an engaging cognitive training paradigm for a target population of older adults with little, or no, game playing experience? and

  • Do facial EMG signals serve as a reliable, and unobtrusive, source of affective data for emotion classification in a VR environment for cognitive training?

To investigate these research questions, following the development of the VR cognitive training game, two user studies were conducted with a combined 24 participants. The first study focused on the acquisition of annotated EMG signals during gameplay sessions with an early, non-adaptive prototype of the VR game. The data collected here facilitated the development of a machine learning model that is capable of detecting players’ affective states. This model was integral in controlling the affective feedback loop used to adjust the difficulty of tasks in the adaptive VR game. This adaptive version of the game was then evaluated, with an older control group, in the second study, which showed that it led to greater feelings of competence and a more appropriate challenge of the user’s skills.

II Related Work

II-A Cognition, Memory and Affect

Memory can largely be understood as information gained from past experience that can be used in the service of current, or future, adaptive behaviour. Prominent literature in the field has led away from the unsubstantiated perception that memory functions as a unitary process towards the concept that memory can be categorised into subsystems (operating across different regions of the brain), and that its functions are underpinned by the interactions between these subsystems [schacter1994a]. These subsystems are classified according to their content and function. The bisection of memory into short-term memory (STM) and long-term memory (LTM), first proposed by Hebb [hebb1949a], represents the core underlying division imposed in the taxonomy of cognitive systems. STM is limited in both capacity [miller1956a] and duration (storage is fragile, and information can easily be lost with distraction or the passage of time), and stores information in a task-dependent manner (e.g. in terms of the physical qualities of the experience, such as what we see, do, hear etc.) [baddeleysquire1992a].

These memory systems can be further divided into a more granular taxonomy. The unitary notion of STM can be partitioned to recognise a distinct cognitive system, working memory (WM), that is responsible for temporarily storing task-relevant information available for manipulation. WM is commonly described in terms of three components: (i) the central executive, which directs attention and coordinates the other two components; (ii) the visuospatial sketch pad, which stores, and manipulates, visual and spatial information; and (iii) the phonological loop, which maintains auditory information through rehearsal. LTM can be divided into procedural (non-conscious) memory, which underpins skills, habits and conditioning, and declarative (consciously accessed) memory for storing facts and events [squire2004a]. Declarative memory can be further separated into semantic memory (SM), a store of facts about the world, and episodic memory (EM), the facility for re-experiencing events in the context in which they originally occurred [tulving1983a]. EM is widely considered as a cognitive competence unique to humans. Unlike SM, it explicitly encodes spatial, contextual and temporal information.

While neurological diseases that cause dementia are more commonly associated with long-term memory impairments, numerous studies have shown deficits in short-term memory and coordination of multiple tasks resulting from these diseases (e.g. see [baddeley1991a]). Therefore, the VR cognitive training intervention developed for this work incorporates both WM and EM training tasks.

WM capacity can be tested through a variety of tasks. These tasks typically come in the form of a dual-task paradigm that combines a measure of memory span (an STM test that involves immediate recollection of an ordered list of items) with a simultaneous processing task (e.g. see [daneman1980a]). It has also been argued that WM reflects the ability to maintain multiple task-relevant pieces of information in the face of distracting irrelevant information [engle1999a]. The WM training task implemented in this work draws on both of these ideas. Due to the slightly broader definition of EM, and the numerous competing theoretical and empirical perspectives on it, a wide variety of methods have been developed to assess EM capacity (not all of which produce consistent results [cheke2013a]). For the purpose of this work, the implemented EM training primarily focuses on spatial memory tasks, as they engage the same regions of the brain that EM requires [burgess2002a]. Spatial memory tasks also proved to be a good fit for the VR paradigm, where ecologically valid scenarios could be presented to the user in an engaging manner.

Bennion et al. [bennion2013a], in their study of the effect of emotion on memory, suggest there is strong evidence to support the following hypotheses: emotion usually enhances memory; when it does not, its effect can be understood by the magnitude of elicited arousal (with arousal benefiting memory up to a point, but then having a detrimental influence); and when emotion facilitates the processing of information, it also facilitates the retention of that information. The general notion that arousal will enhance memory, up to a point, was reinforced by Yeh et al. in their research of the effects of negative affect on WM capacity [yeh2015a]. They promote the idea of a game that appropriately challenges users in order to activate their attention, while avoiding negative emotional responses. This phenomenon has also been observed in a physiological context. Suriya-Prakash et al. found, in their study on the influence of visuospatial WM tasks on heart rate variability (HRV), that HRV was lower among poor WM performers compared to good performers [suriya-prakash2015a].

II-B Affect and Gaming

Two domains, particularly relevant to this work, have shown promise in recent literature for the application of affective computing methods. The first is the domain of cognitive training, which is motivated by the strong relationship between emotions and cognitive performance [gabana2017a]. The second is the domain of video games, where the interaction has been noted as a predominantly emotional one [yee2007a] and therefore susceptible to affective adaptation (dynamically changing the gameplay experience based on affective signals read from the player).

Video games, perhaps more than any other entertainment medium, engage users in a diverse range of experiences. Players’ motivations for engaging with video games vary. They have been categorised into three overarching motives [yee2007a]: players seeking game mastery and competition (achievement), players who want to interact with others and develop in-game relationships (social), and players who pursue escapism by engaging with a game’s story (immersion). These motivations indicate that the activity of playing video games is a predominantly emotional one. This suggests that game development is a domain that would benefit from the application of affective computing techniques. Affective gaming can be realised through the use of biofeedback techniques. However, for a game to be considered affective (and not simply a biofeedback game), it must exploit this biological information to propagate affective feedback [bersak2001a]. That is, the game is an intelligent participant in the biofeedback loop. What distinguishes affective feedback from biofeedback, is that the player is not deliberately controlling their physiological responses in order to influence gameplay.

There are numerous novel possibilities for emotive twists on conventional gameplay experiences. For example, Reynolds and Picard developed AffQuake [reynolds2004a], a modification to id Software’s Quake II that incorporated affective signals to alter gameplay in a variety of ways (e.g. in StartleQuake, when a player becomes startled, their avatar also becomes startled and jumps back). Valve Corporation have also experimented with similar modifications to their games Half-Life 2 and Left 4 Dead 2 [bouchard2012a]. In these modifications, the player’s stress level, measured as the electrical response of their skin, determines the pace of the gameplay.

II-C Therapeutic Applications of Virtual Reality

The use of VR in video games induces a greater degree of engagement and immersion in players. This heightened immersion, named presence (or the feeling of being there in the virtual world), has been reported to directly impact the affective states experienced by the player during gameplay (high levels of presence induce more intense and vivid emotions [riva2007a]). This perceived link with affective states has led to VR being referred to as an ‘affective medium’, or as an ‘empathy machine’ (technology with the capability to make sensible to oneself the emotional experience of another [bollmer2017a]). The emotive nature of VR experiences naturally leads to a synergistic relationship with affective computing techniques [shumailov2017a].

The key advantage offered by VR in the neuroscience field is the ability to place patients in ecologically valid, safe and controlled environments that provide multisensory stimulation [bohil2011a]. Successful therapeutic applications of VR have ranged from desensitisation treatments for PTSD sufferers, to pain distraction, to forms of exposure therapy for people suffering from various phobias and anxiety disorders [denmark2017a, pedroli2018a, gabana2017a, sanchez-vives2005a].

A subset of literature in this domain, that has recently seen an upsurge in attention, has been on the use of VR in creating ecologically valid experiments. In a systematic review of computerised cognitive training (CCT) literature, Hill et al. concluded that CCT is a viable intervention for enhancing cognition in people with MCI [hill2017a]. Interestingly, they found that for individuals with dementia, the only clinically meaningful effect sizes were found in studies that utilised immersive technology such as VR or the Nintendo Wii (which has motion controls similar to VR interaction methods) [hill2017a]. Teo et al., in their literature review of VR as a platform for neurorehabilitation, reported that established evidence supports the efficacy of VR, but suggested that the combination of VR and conventional therapies is likely to be more efficacious compared to using either alone [teo2016a].

Recent literature has reported on successful applications of gamification with older adults, with focus groups made up of older adults expressing an acute awareness of the need to strengthen their cognitive skills and regarding games as a means to do so [kayali-a]. A number of design considerations have been outlined that should be taken into account when targeting this population [kayali-a]. These considerations emphasise the accessibility of the game (e.g. limiting game speed, use of strong contrasting colours etc.), suggest the use of adaptive difficulty schemes, and encourage careful selection of physical interaction methods [miesenberger2008a]. Additionally, Whitlock et al. recommend the provision of training support prior to getting started, and during early gameplay sessions [whitlock2010a].

II-D Emotion Sensing from Facial EMG

The face is widely considered to be the most reliable source of affective information. It conveys information about a person’s age, sex, background and identity, what they are feeling, and what they are thinking [ekman2005a]. There are two prominent strategies for measuring facial expressions. The first, and most prevalent, method relies on computer vision and image processing techniques [bartlett1999a, sebe2007a, koelstra2010a, littlewort2016a]. It typically involves three steps: face detection in an image or video stream frame, facial feature extraction, and facial expression (or Action Unit [ekman1978a]) classification. While this strategy is generally preferred (due to its unobtrusiveness), it is not appropriate in a VR context as the user’s face will be mostly occluded by the VR head-mounted display (HMD). The second method (employed in this work) utilises facial electromyogram (EMG) measurements, recorded from surface electrodes placed over regions of the user’s face. This strategy benefits from high sensitivity, enabling the detection of slight muscular movements that may not be evident to the human eye. However, a significant drawback is that the application of facial surface electrodes is obtrusive to the user, and makes the user aware of, and self-conscious about, their facial expressions and the measurement thereof [ekman1992a]. This work aims to negate this drawback through the use of a novel affective human-computer interaction (HCI) device, Faceteq [mavridou2017a].

Jerritta et al. investigated the application of higher order statistics (HOS), an efficient feature extraction method, to derive a set of facial EMG features for classifying Ekman’s six basic emotional states [ekman1992b]. They used audio-visual (video clips) stimuli to induce emotional responses in participants, and employed a kNN classifier with PCA as a dimensionality reduction technique. They found that the HOS features outperformed commonly used statistical features (mean absolute value (MAV), standard deviation, etc.), though it is worth noting that this did not include commonly reported informative temporal features such as root mean square (RMS) and integrated EMG (IEMG) [hamedi2012a]. Their results showed that the use of PCA prior to classification improved the classifier’s accuracy, achieving an average classification rate of 69.5% across the six basic emotions (using a 70/30 training/validation split of the data for CV).
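The temporal features named above have simple standard definitions, which may be easier to compare when written out. The following is a minimal sketch using only the Python standard library; the function names and the dictionary layout are illustrative assumptions and are not taken from any of the cited works.

```python
import math


def mean_absolute_value(window):
    """MAV: average of the rectified (absolute) samples."""
    return sum(abs(v) for v in window) / len(window)


def root_mean_square(window):
    """RMS: square root of the mean squared amplitude."""
    return math.sqrt(sum(v * v for v in window) / len(window))


def integrated_emg(window):
    """IEMG: sum of the rectified samples over the window."""
    return sum(abs(v) for v in window)


def standard_deviation(window):
    """Sample standard deviation (n - 1 denominator); 0.0 for fewer than 2 samples."""
    if len(window) < 2:
        return 0.0
    mean = sum(window) / len(window)
    return math.sqrt(sum((v - mean) ** 2 for v in window) / (len(window) - 1))


def temporal_features(window):
    """Collect the features into one record (hypothetical layout)."""
    return {
        "mav": mean_absolute_value(window),
        "rms": root_mean_square(window),
        "iemg": integrated_emg(window),
        "std": standard_deviation(window),
    }
```

Note that MAV and IEMG differ only by the window length, so for fixed-length windows they carry the same information up to a scale factor.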

Perusquia-Hernandez et al. examined the relative classification rates for spontaneous and posed smiles, using spatial and temporal patterns of facial EMG. Spontaneous smiles were elicited through audio-visual stimuli, while posed smiles were requested by the experimenter (with participants informed of the purpose, that being to classify EMG signals). Due to the unbalanced nature of the collected data, they undersampled the majority class to match the minority class samples (as in [shumailov2017a]). The best results for distinguishing between posed and spontaneous smiles were obtained using spatial-temporal features with a Gaussian kernel SVM. Classification rates ranged from 85.23% to 96.43% (across participants), using a 70/30 training/validation split of the data for CV.

Soon et al. developed a novel application for speech recognition based on facial EMG. Their study was conducted with 3 participants, with each asked to say a series of numeric (spoken in Malay and English) and command words (spoken in English). Temporal features (similar to those mentioned above) were then extracted from a DWT-Haar approximation of the input signal. Four different classifiers were evaluated: Random Forest (RF), LDA, Naive Bayes and multilayer perceptron (MLP). Classification results were obtained through a CV scheme with a 66/34 training/validation split of the data. RF produced the best overall performance with temporal features achieving 64.7%, 49%, and 41.8% in Malay, English, and command words respectively.

III Study Design

III-A Game Design

Two separate virtual environments were developed, a virtual supermarket and a virtual multi-room museum (see Fig. 2 and Fig. 3). These locales provided the setting for the WM and EM tasks respectively, and were selected to promote the ecological validity of the intervention, i.e., both environments are likely to be familiar to the older target population and, in the case of the supermarket, to reflect a daily activity. The underlying hypothesis was that by setting the tasks in highly immersive virtual re-creations of real-world environments and having users perform practical tasks (e.g. collecting products from a shopping list and interacting with displays in a museum) the acquired cognitive skills would better generalise to daily life.

In this initial phase of the work, a fixed difficulty framework was implemented for both the WM and EM tasks, consisting of three difficulty levels (easy, medium and hard). This was designed with the purpose of eliciting a range of emotional responses from study participants and, thus, generating a balanced dataset. These difficulty levels differed in the cognitive load required from the participant (e.g. shorter/longer shopping lists in the supermarket, fewer/more display locations to remember in the museum).

Fig. 1: The Affective Slider, Betella and Verschure’s digital self-assessment scale for the measurement of human emotions [betella2016a]. The top slider indicates level of arousal, the bottom slider indicates valence of emotion.

Both tasks employed the same annotation and EMG logging scheme. EMG data is recorded from when the task starts. After every 45 seconds of gameplay (a time period arrived at through pilot tests) the recording is paused and written out to a log file (created for that segment). When this occurs, the game environment fades to black and Betella and Verschure’s affective slider (implemented in VR to mitigate gameplay disruption) is displayed to the player. Players interact with the sliders using a VR laser pointer and, when they are happy with their selection (which should best describe the average affect experienced by the player during the preceding 45 seconds of gameplay), press a confirmation button to append the arousal/valence values to the associated EMG log. Gameplay and EMG recording then resume. This process repeats until the timer runs out.
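The logging scheme described above amounts to cutting a recording into 45-second segments and attaching one arousal/valence pair to each. A minimal sketch of that bookkeeping follows. The sampling rate, the assumption that slider values lie in [0, 1], the 0.5 midpoint, and all names are hypothetical choices made for illustration; the text specifies only the 45-second segment length and the arousal/valence labels.

```python
from dataclasses import dataclass

SEGMENT_SECONDS = 45  # segment length stated in the text


@dataclass
class LabelledSegment:
    samples: list    # raw EMG samples for one 45-second segment
    arousal: float   # slider value, assumed here to lie in [0, 1]
    valence: float   # slider value, assumed here to lie in [0, 1]


def split_into_segments(samples, sampling_rate_hz, segment_seconds=SEGMENT_SECONDS):
    """Cut a recording into consecutive full-length segments (any remainder is dropped)."""
    n = int(sampling_rate_hz * segment_seconds)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]


def label_segments(segments, ratings):
    """Pair each segment with the (arousal, valence) rating given after it."""
    if len(segments) != len(ratings):
        raise ValueError("expected exactly one rating per segment")
    return [LabelledSegment(s, a, v) for s, (a, v) in zip(segments, ratings)]


def quadrant(arousal, valence, midpoint=0.5):
    """Map a rating to a circumplex quadrant name (the midpoint is an assumption)."""
    energy = "energetic" if arousal >= midpoint else "calm"
    tone = "positive" if valence >= midpoint else "negative"
    return f"{energy}-{tone}"
```

Keeping the raw slider values alongside the samples, rather than storing only the quadrant, leaves the choice of class boundaries open until the classifier is trained.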

The decision to collect the labels during gameplay was motivated by the hypothesis that these labels will better represent the range of emotions experienced at different points during gameplay (while the experience is still fresh), with the added benefit of the elimination of potentially time consuming post-game annotation sessions. Score tracking and a leaderboard (staples of gamification) were included for both tasks. These mechanics have been shown to improve engagement [hamari2014a] and serve as progress indicators, guiding and enhancing player performance [mekler2013a].

A heads-up-display (HUD) was included that enabled the player to track their current score, time left and other task specific information. Together, the inclusion of these elements was intended to draw the player’s focus away from the novelty of VR (thereby mitigating the expected positive bias in the dataset) onto their task performance. To further associate player performance and emotional response, audio-visual stimuli were added in response to correct (bell ringing sound and confetti explosion) and incorrect (buzzing sound and red X) answers.

III-A1 Working Memory Task

Fig. 2: The custom virtual supermarket environment, in which WM tasks were carried out, and player interaction with this environment.

The goal of the WM task is to find a (randomly generated) array of products in a virtual supermarket (see Fig. 2), placing each product in a shopping basket. The products are specified to the player at the start of each round through the HUD. Each product on the shopping list is displayed on the HUD (as an image and text description) for 1 second, with a 500 ms interval. Players are challenged to remember the remaining items on the shopping list (stored in WM) while they actively search for each product. The number of products to be collected is determined by the difficulty level. The medium difficulty tasks users with finding 7 products; this is intended to be the most engaging and balanced difficulty for most users (based on Miller’s magic number seven, plus or minus two [miller1956a]). The easy and hard difficulties task users with finding 2 and 12 products respectively. These difficulties were designed to increase the likelihood of inducing negative affect in users, i.e., calm-negative on easy (bored due to insufficient challenge) and energetic-negative on hard (frustrated due to excessive challenge). The specific number of products for each difficulty level was determined through pilot testing and feedback.

For each correct item collected, 5 points are added to the player’s score. Collecting an item that was not on the shopping list reduces the player’s score by 4. Therefore, while the player is incentivised to carry out the task quickly, the priority is to ensure that no mistakes are made. The random generation of shopping lists promotes the task’s replayability, maintaining the emphasis on short-term WM (rather than remembering the shopping lists from previous attempts) on repeated playthroughs.
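The list sizes and scoring rule described above can be captured in a few lines. The sketch below is illustrative only: the product catalogue is hypothetical, and the handling of an item collected twice (counted once) is an assumption, since the text does not say how duplicates are treated.

```python
import random

# List sizes and point values as stated in the text.
PRODUCTS_PER_DIFFICULTY = {"easy": 2, "medium": 7, "hard": 12}
CORRECT_ITEM_POINTS = 5
WRONG_ITEM_PENALTY = 4


def generate_shopping_list(catalogue, difficulty, rng=None):
    """Draw a random list of distinct products for the given difficulty."""
    rng = rng or random.Random()
    return rng.sample(catalogue, PRODUCTS_PER_DIFFICULTY[difficulty])


def score_basket(shopping_list, basket):
    """Score collected items: +5 per listed item, -4 per unlisted item."""
    wanted = set(shopping_list)
    counted = set()
    score = 0
    for item in basket:
        if item in wanted:
            if item not in counted:  # assumption: a repeated item scores once
                score += CORRECT_ITEM_POINTS
                counted.add(item)
        else:
            score -= WRONG_ITEM_PENALTY
    return score
```

With these values a single mistake cancels most of the gain from one correct item, which is consistent with the stated priority of accuracy over speed.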

III-A2 Episodic Memory Task

Fig. 3: The custom, multi-room, virtual museum environment, in which EM tasks are carried out. More rooms are unlocked as the difficulty level increases. Green arrow markers indicate which displays are to be remembered during the encoding phase. Bottom right: visual feedback for a correct answer during the retrieval phase.

The EM task takes place in a multi-room virtual museum environment (see Fig. 3). The core task is divided into two consecutive phases: encoding (storage of information, such that it can be distinguished from other distinct pieces of information) and retrieval (recognition of previously stored information) [wang2012a]. In the encoding phase, players are asked to search for one or more displays at randomly generated locations in the museum. Players interact with marked displays in this phase using a laser pointer. On doing so, the age of the display is shown to the player. This interaction can take place at a distance, allowing a greater degree of spatial context to be encoded. After all the marked displays have been interacted with, the game transitions to the retrieval phase removing the marked displays from the museum, and teleporting the player back to the museum entrance.

In the retrieval phase, players are tasked with placing a subset of the displays they interacted with during the encoding phase back in their original positions. The player uses the laser pointer to indicate where in the environment (from a selection of highlighted zones) they think it was located. On completion of the retrieval phase, a short bonus phase is initiated. Players are shown three displays they have interacted with and are asked which of them is the oldest/youngest. This textual (age) recall is not randomised, and players who can efficiently store the information in their LTM should perform better over repeated sessions.

III-B Data Acquisition

After the study procedure and protocol were approved by the relevant ethics committee, 18 participants (5 female and 13 male, ranging in age from 20 to 37) volunteered to engage in the EMG data acquisition study. Six of these participants engaged in a preliminary pilot study, while the data collected from the remaining 12 formed the final annotated EMG dataset.

Fig. 4: Prototype Faceteq sensing HMD foam insert and how it is placed in the HTC Vive.

First, a pilot study with 6 participants was run to investigate the efficacy of the Faceteq EMG sensor. This study also provided an opportunity to refine the study methodology prior to data collection.

The participants were first acquainted with the research goals of the study through an information sheet and verbal introduction. They were introduced to the meaning of arousal and valence, and shown the affective slider annotation scheme (see Fig. 1). There were six EMG recording sessions in total (one per difficulty level, for both the WM and EM tasks), each lasting 3 minutes and 45 seconds. Half of the participants played through the difficulty levels in reverse order (hard-to-easy) to reduce the likelihood of the collected data being skewed positive (as in [shumailov2017a]). The hypothesis here is based on the concept of the difficulty curve, the idea that, for an optimal experience, a game’s difficulty should progress in a manner consistent with real-world skill acquisition (easy challenges during the cognitive stage, moderate challenges during the associative stage, and more difficult challenges during the autonomous stage) [b2009a]. By delivering challenges to the player in a reversed order, it is expected that they will experience negative affect more frequently (e.g. frustration early on, and boredom towards the end of the session).
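The session structure described above (six sessions per participant, with half of the participants playing hard-to-easy) can be sketched as a small scheduling helper. How participants were assigned to the reversed order and whether the WM task preceded the EM task are not stated in the text; alternating by participant index and running WM first are assumptions made purely for illustration.

```python
DIFFICULTIES = ["easy", "medium", "hard"]
TASKS = ["WM", "EM"]            # task order is an assumption
SESSION_SECONDS = 3 * 60 + 45   # 3 minutes 45 seconds, as stated
SEGMENT_SECONDS = 45            # annotation interval, as stated


def difficulty_order(participant_index):
    """Even-indexed participants play easy-to-hard, odd-indexed hard-to-easy (assumed rule)."""
    if participant_index % 2 == 0:
        return list(DIFFICULTIES)
    return list(reversed(DIFFICULTIES))


def session_plan(participant_index):
    """Return the six (task, difficulty) sessions for one participant."""
    order = difficulty_order(participant_index)
    return [(task, difficulty) for task in TASKS for difficulty in order]


def segments_per_session():
    """Number of 45-second annotation segments that fit in one session."""
    return SESSION_SECONDS // SEGMENT_SECONDS
```

Under these figures each session divides evenly into five 45-second segments, so the annotation prompt and the end of the session coincide.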

Participants were given a two minute break between gameplay sessions, allowing them to return to a neutral affective state. During these breaks, participants were asked to give an affective label that best summed up that session (using Russell’s circumplex model [russell1980a]). A short informal interview was conducted after the EMG data collection was finished with the following questions.

  • Which of the two environments did you prefer spending time in?

  • Which of the two tasks did you find more engaging?

  • Did you experience any discomfort during the session and, if you have prior experience of VR, was the addition of the Faceteq sensor off-putting in any way?

  • To what extent, if any, did the annotation scheme affect your gameplay experience?

III-C Findings

The majority of participants (10/12) expressed a preference for the museum environment over the supermarket, with many responses indicating that the supermarket felt more mundane as it is an environment they are overly familiar with in the real world. This may point to a trade-off between ecological validity and engagement in the choice of cognitive training environment. Responses were evenly split when it came to task preference, with many stating they preferred the EM task as it had more gameplay variety, while others appreciated the more naturalistic interactions in the WM task. The response to the Faceteq sensor was positive. No participants indicated that they experienced any motion sickness or that the sensor was off-putting. Of the 9 participants with prior VR experience, 5 responded that they weren’t aware of the sensor once they started playing, 2 responded that they were aware of the sensor but it had no significant impact on their engagement, and 2 (self-identified regular VR users) stated that the ADC box attached to the back of the HMD (see Fig. 4) served as a counterweight to the front-heavy HTC Vive. Most participants (7/12) stated that the in-game annotation scheme was mildly disruptive to the gameplay experience, while others either found it didn’t affect their experience (3/12) or found it very disruptive (2/12).

IV EMG Feature Extraction

IV-A Pre-processing

EMG signals possess highly complex time and frequency domain characteristics. Wavelet transforms are therefore well suited to their analysis, as they can handle the non-stationary (time domain) characteristics of EMG signals [zhang2010a]. Wavelets are generated from a single ’mother’ wavelet through a process of scaling and translation.

An accurate, reliable pipeline for EMG signal analysis can be constructed using the discrete wavelet transform (DWT). Surface EMG (sEMG) signals are recorded through electrodes placed on a person’s skin, where they capture the electrical signals emitted by the person’s muscles. The amplitude of these signals is quite small (normally between 0.1 and 5.0 mV). The most useful information is usually located in the 50-150 Hz range [jiang2005a].

IV-B Feature Extraction and Selection

Sousa and Tavares noted, in their review of EMG normalisation methods [sousa2012a], that the voltage potential of surface EMG depends on several factors, varying between individuals and also over time within an individual. Baseline normalisation (removal) is a viable strategy to respond to these issues, having seen use in numerous studies on a variety of physiological signals (e.g. see [m2013a]).

The user’s baseline was read during the first 45 seconds of gameplay as at this point the novelty of VR had diminished to some extent, and the activities during the early-game are typically less arousing (e.g. reading shopping list, and looking at museum displays).

For the purposes of this study, a simple baseline division was employed. The input signal was then processed further using DWT [zhang2010a]. The choice of mother wavelet for signal approximation was informed by the work of Phinyomark et al. [phinyomark2010a]. They found that, for the purpose of denoising, coif5, Haar (db1), bior1.1 and rbio1.1 are the most suitable. A preliminary evaluation with our dataset showed that each of these wavelets resulted in very similar classification improvements (around +5% to +7% accuracy depending on the classifier). Therefore, going forward, the presented results are based on the DWT-Haar approximation of the EMG signal (because of its efficient computation).
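The two pre-processing steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the text does not specify how the baseline reference is computed or how many decomposition levels are used, so the mean absolute amplitude of the baseline window and the `levels` parameter are assumptions, as are the function names.

```python
import math


def baseline_divide(signal, baseline):
    """Divide every sample by a reference amplitude taken from the baseline window.

    Assumption: the reference is the mean absolute amplitude of the baseline.
    """
    ref = sum(abs(s) for s in baseline) / len(baseline)
    if ref == 0:
        raise ValueError("baseline window has zero amplitude")
    return [s / ref for s in signal]


def haar_approximation(signal, levels=1):
    """Return the Haar (db1) DWT approximation coefficients after `levels` steps."""
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2:
            break
        if len(approx) % 2:
            # Pad odd-length input by repeating the last sample (an assumption).
            approx.append(approx[-1])
        # Each approximation coefficient is the scaled sum of a sample pair.
        approx = [(approx[i] + approx[i + 1]) / math.sqrt(2)
                  for i in range(0, len(approx), 2)]
    return approx


# Hypothetical usage: normalise one channel, then smooth it with two Haar levels.
channel = [0.2, 0.4, -0.1, 0.3, 0.5, -0.2, 0.1, 0.0]
normalised = baseline_divide(channel, baseline=channel[:4])
smoothed = haar_approximation(normalised, levels=2)
```

Each Haar level halves the number of samples, which is one reason the approximation is cheap to compute.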

A significant number of studies in the domain of facial EMG classification have shown temporal features to be the most informative [jerritta2014a, perusqu2017a, hamedi2018a, soon2017a]. Based on these findings, the time and time-frequency domain features shown in Fig. 5 were extracted from the DWT approximation of the signal (see [tkach2010a] for mathematical definitions).

Fig. 5: List of time and time-frequency domain features extracted from the DWT approximation of the signal.
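For illustration, four of the time-domain features named in this paper (RMS, MMAV1, ZC and SSC) can be computed as below. This sketch follows the commonly used definitions of these features; the threshold parameters, their default of zero, and the function names are assumptions, and the authors' exact parameterisation is not given in the text.

```python
import math


def rms(x):
    """Root mean square amplitude of the window."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mmav1(x):
    """Modified mean absolute value, type 1.

    Samples in the middle half of the window (1-based index between
    0.25 N and 0.75 N) get weight 1; the remaining samples get weight 0.5.
    """
    n = len(x)
    total = 0.0
    for i, v in enumerate(x):
        k = i + 1  # 1-based sample index
        weight = 1.0 if 0.25 * n <= k <= 0.75 * n else 0.5
        total += weight * abs(v)
    return total / n


def zero_crossings(x, threshold=0.0):
    """Count sign changes between consecutive samples that differ by at least `threshold`."""
    return sum(1 for a, b in zip(x, x[1:])
               if a * b < 0 and abs(a - b) >= threshold)


def slope_sign_changes(x, threshold=0.0):
    """Count interior samples that are strict local peaks or troughs."""
    count = 0
    for a, b, c in zip(x, x[1:], x[2:]):
        product = (b - a) * (b - c)
        if product > 0 and product >= threshold:
            count += 1
    return count


# Hypothetical window of DWT approximation coefficients.
window = [1.0, -1.0, 2.0, -2.0]
features = [rms(window), mmav1(window),
            zero_crossings(window), slope_sign_changes(window)]
```

A non-zero threshold is typically used with ZC and SSC so that low-amplitude noise around zero is not counted.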

Extracting this many features, from eight EMG channels, results in a high-dimensional (8 * 14 = 112) dataset. Therefore, it was considered pertinent to include a feature selection step prior to classification. The strategy employed here was inspired by the work of Clerico et al. [clerico2016a], who utilised the minimal-redundancy-maximal-relevance criterion (mRMR) [peng2005a] to select the best features in an EMG affective gaming context. mRMR attempts to find optimal features, based on mutual information, through forward selection.
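The forward selection performed by mRMR can be sketched as below, using the difference form of the criterion (relevance minus mean redundancy). This is an illustrative sketch only: it assumes the features have already been discretised into bins, which the text does not describe, and the feature names and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in count_ab.items())


def mrmr(features, labels, k):
    """Greedy mRMR selection.

    `features` maps a feature name to its discretised column. At each step
    the feature with the highest (relevance - mean redundancy) is added.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # Sorting makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: f1_copy duplicates f1, so it adds no new information.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
toy = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 1],
    "f1_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "f2": [0, 0, 1, 1, 0, 0, 1, 1],
}
chosen = mrmr(toy, labels, k=2)
```

In the toy example the duplicate feature is penalised by its redundancy with `f1`, so the complementary feature `f2` is preferred for the second slot.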

The greatest classification improvement was achieved by selecting the best 30 features, identified by mRMR, from the feature set (around +3% to +5% accuracy depending on the classifier). These features were distributed among different muscle groups with the top 30 ranking made up of 10 features from users’ eyes, 9 from their mouth, 7 from their eyebrows, and 3 from their corrugator supercilii.

Fig. 6 shows the frequency of different statistical features in the top 30 ranking.

Fig. 6: Frequency of different statistical features in the top 30 ranking.

IV-C Findings

SSC was the most common feature in the ranking, being extracted from all muscle groups bar the corrugator supercilii. SSC, extracted from the right eye sensor, was also computed to be the second most informative feature in the ranking. This was accompanied by MMAV1 extracted from the right mouth sensor (1st) and ZC extracted from the left mouth sensor (3rd), in a top three that scored significantly higher (by a factor of at least 3) than the remaining 27 features. Interestingly, the extracted RMS features were found to be relatively uninformative, despite RMS being regularly cited as an informative feature in facial EMG emotion recognition research (e.g. [jerritta2014a, hamedi2018a]). Deviations from expected results such as this may be attributed to the fact that this is the first study to utilise facial EMG in a VR context, and physical expressions of affect in people’s faces are likely to be impacted by the VR headset.

V Classification

First, the original (-1 to +1 continuous) valence/arousal labels were mapped to one of four emotion labels: energetic-positive (high valence, high arousal); calm-positive (high valence, low arousal); energetic-negative (low valence, high arousal); calm-negative (low valence, low arousal). This allows emotion recognition to be framed as a 4-class classification problem.
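The label mapping described above amounts to a quadrant lookup on the valence/arousal plane. In the sketch below, the cut-off at the midpoint (0) and the treatment of values exactly at the cut-off as low are assumptions; the text does not state where the boundary was placed.

```python
def quadrant_label(valence, arousal, threshold=0.0):
    """Map continuous valence/arousal values in [-1, 1] to one of four emotion labels.

    Assumption: values strictly above `threshold` count as high.
    """
    energy = "energetic" if arousal > threshold else "calm"
    tone = "positive" if valence > threshold else "negative"
    return f"{energy}-{tone}"


# Hypothetical annotations given as (valence, arousal) pairs.
annotations = [(0.6, 0.7), (0.4, -0.3), (-0.5, 0.8), (-0.2, -0.6)]
labels = [quadrant_label(v, a) for v, a in annotations]
```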

Three classifiers that have been utilised with varying degrees of success in the existing facial EMG literature [jerritta2014a, perusqu2017a, hamedi2018a, soon2017a] were evaluated: SVMs (with Gaussian kernel), kNN (with various values for k), and LDA. All three were evaluated using the (subject-independent) leave-one-subject-out (LOSO) cross-validation strategy. The kNN (k = 4) offered the best arousal classification rate (76.2%), and the best classification rate on the combined valence/arousal 4-class classification problem (68.8%). LDA yielded the best valence classification rate (64.1%), but fell behind kNN on the 4-class classification problem (65.9%). The Gaussian SVM generally underperformed, though this could be improved in future work by investigating different kernels and further tuning hyperparameters.

The best classification results for each can be seen in Fig. 7 below.

Fig. 7: Classification accuracies for kNN, SVM and LDA for valence and arousal when formulated as a 2-class classification problem (positive valence vs. negative valence and high arousal vs. low arousal). Results are obtained using leave-one-subject-out cross-validation strategy.
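The LOSO strategy used for these results can be illustrated with a small kNN example. This is a sketch under stated assumptions: Euclidean distance, majority voting, and averaging of per-subject accuracies are choices made here for illustration, and the toy samples are invented, not drawn from the study.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=4):
    """Predict the majority label among the k nearest training samples."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loso_accuracy(samples, k=4):
    """Leave-one-subject-out accuracy.

    `samples` is a list of (subject_id, feature_vector, label) tuples.
    Each subject is held out in turn, so no subject's data appears in
    both the training and the test set of a fold.
    """
    subjects = sorted({subject for subject, _, _ in samples})
    fold_accuracies = []
    for held_out in subjects:
        train_x = [x for s, x, _ in samples if s != held_out]
        train_y = [y for s, _, y in samples if s != held_out]
        test = [(x, y) for s, x, y in samples if s == held_out]
        correct = sum(knn_predict(train_x, train_y, x, k) == y for x, y in test)
        fold_accuracies.append(correct / len(test))
    return sum(fold_accuracies) / len(fold_accuracies)


# Invented, well-separated toy data for three hypothetical subjects.
toy_samples = []
for subject in ("p1", "p2", "p3"):
    toy_samples += [
        (subject, [0.0, 0.0], "low"), (subject, [0.0, 1.0], "low"),
        (subject, [10.0, 10.0], "high"), (subject, [10.0, 9.0], "high"),
    ]
accuracy = loso_accuracy(toy_samples, k=3)
```

Holding out whole subjects, rather than random samples, is what makes the resulting estimate subject-independent.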

V-A Findings

The most noticeable discrepancy between the labels acquired in-game and those acquired post-game, was a consistently lower annotated value for arousal in the post-game interview. This manifested as a significantly lower classification accuracy for arousal when using the post-game labels as the four classes (around -15% to -19% depending on the classifier), while valence classification remained comparable. This suggests that the primary benefit of acquiring affective labels in-game is a more reliable estimate of the intensity of emotions. Placing participants in a reversed difficulty group had the desired effect of inducing negative affect more frequently (resulting in a more balanced dataset), with that group accounting for approximately 61% of energetic-negative and 67% of calm-negative annotations.

A notable difference between our classification results and those discussed in similar works (section 3.3) is that ours were computed using the (subject-independent) leave-one-subject-out (LOSO) cross-validation strategy. This methodology has been shown to avoid non-independence issues that lead to unrealistic estimates of the generalisability of the model [esterman2010a].

VI Intervention Evaluation

The goal of this evaluation was to gather qualitative feedback from the target population (older adults) that would help define future research steps, and to examine the potential benefits of utilising affective adaptation in this domain.

VI-A Affective Feedback Loop Integration

The system integrated here relies on both affect sensing and player performance as data points. Performance is included because of its demonstrated value in existing DDA solutions and to offset the (classification) inaccuracies in the model (i.e. to avoid fixing what is not broken [hunicke2005a]). The number of difficulty levels in this new adaptive version of the game was increased from three (easy-medium-hard) to ten (10 point scale). This allows for more subtle transitions in difficulty, to avoid the player becoming overly conscious of the adaptation (and potentially feeling ‘cheated’ by it [hunicke2005a]). Every 45 seconds (in place of the annotation interface from the previous study) the player’s affect will be classified in real-time based on incoming EMG signals.

The following DDA rules (encompassing both player affect and performance) will then govern how the difficulty is adapted:

  • [Calm-negative + perfect score]: increment difficulty by 2;

  • [Calm-negative + imperfect score] or [positive valence + perfect score]: increment difficulty by 1;

  • [Positive valence + imperfect score]: no change in difficulty;

  • [Negative score] or [energetic-negative + imperfect score]: decrement difficulty by 1;

  • [Energetic-negative + negative score]: decrement difficulty by 2.
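The rule set above can be expressed as a small lookup function. The sketch below makes several assumptions that should be read as such: the score categories are passed in as the strings 'perfect', 'imperfect' and 'negative' (the text does not define how they are derived from gameplay), the most specific rule is checked first where rules overlap, the difficulty scale is taken to run from 1 to 10, and the one combination the rules do not mention (energetic-negative with a perfect score) is treated as no change.

```python
POSITIVE = ("energetic-positive", "calm-positive")


def difficulty_delta(emotion, score):
    """Return the change in difficulty for one 45-second adaptation step."""
    positive = emotion in POSITIVE
    # The most specific rule is checked first, since it overlaps with [Negative score].
    if emotion == "energetic-negative" and score == "negative":
        return -2
    if score == "negative" or (emotion == "energetic-negative" and score == "imperfect"):
        return -1
    if emotion == "calm-negative" and score == "perfect":
        return 2
    if (emotion == "calm-negative" and score == "imperfect") or (positive and score == "perfect"):
        return 1
    if positive and score == "imperfect":
        return 0
    # Assumption: energetic-negative + perfect is not covered by the rules above.
    return 0


def adapt(level, emotion, score, lowest=1, highest=10):
    """Apply the rule set and clamp the result to the difficulty scale."""
    return max(lowest, min(highest, level + difficulty_delta(emotion, score)))


# Hypothetical sequence of classified states for one player, starting at level 5.
level = 5
for emotion, score in [("calm-negative", "perfect"),
                       ("calm-positive", "imperfect"),
                       ("energetic-negative", "negative")]:
    level = adapt(level, emotion, score)
```

Clamping keeps the level within the scale when a player is already at the easiest or hardest setting.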

This rule set was arrived at following a short testing period with pilot participants. In the best-case execution, the adaptation is intended to lead players to a flow state [csikszentmihalyi2014a], where they are faced with tasks that they have a chance of completing through application of their skills. In addition to being a signifier of high engagement, being in a state of flow has been shown to improve cognitive performance [gabana2017a].

VI-B Evaluation Study

The protocol for the evaluation study was largely similar to that of the previous study, with a few notable exceptions. Six participants (4 female and 2 male, ranging in age from 60 to 100), with no history of cognitive impairment, volunteered to engage in this evaluation study. The recruitment of this older control group was facilitated by researchers from The University of Cambridge’s Department of Neuroscience. None of the participants played video games with any degree of regularity and only 1 had prior experience of VR. Participants started by completing a battery of standardised cognitive tests [mioshi2006a, nelson1982a, osterrieth1944a, crockett2008a, tombaugh2004a, chan2016a, wechsler1958a] (administered by the Clinical Neuroscience researchers). This enabled accurate characterisation of the sample and the investigation of correlations between the standardised tests and the new VR paradigm. Administering these tests took about 1 hour on average.

Participants were given an explanation of the task goals and time to practise in the VR environment prior to starting the session proper. They played both an adaptive and a non-adaptive (linearly increasing difficulty) version of the game (without being told which version was which), in two fifteen-minute gameplay sessions (7 minutes and 30 seconds for each of the WM and EM tasks). To mitigate any order-effect bias in the evaluation of the adaptive and non-adaptive versions, half of the participants played the adaptive version first, while the other half played the non-adaptive version first. Participants’ subjective experience (i.e. immersion, engagement, and flow) with the VR training intervention was evaluated using the in-game and post-game components of the game experience questionnaire (GEQ) [ijsselsteijn2007a]. Each session was concluded with an informal interview. Participation in the study lasted for about 2 hours and 15 minutes on average (including breaks). Immediately after playing each version (adaptive/non-adaptive) of the game, participants reported on their feelings of competence, sensory and imaginative immersion, flow, tension, challenge, negative affect, and positive affect by completing the in-game module of the GEQ. Each of these categories is represented by a series of sub-components in the questionnaire; a numeric value is then computed for each category by averaging across its sub-component values. The accumulated responses can be found in Fig. 8.

VI-C Findings

Looking at Fig. 8, while the response to both versions of the game can generally be described as positive, there are a few noteworthy differences. The two standout differences are the increased feeling of competence and the decreased feeling of challenge while playing the adaptive version of the game. The significant increase in competence is particularly encouraging as it relates to one of the key deficiencies identified with existing cognitive training interventions, the drop-off in user engagement [Simons2016]. In their research on intrinsic motivation, Deci and Ryan argue that structures that enable feelings of competence during action can enhance intrinsic motivation for that action [deci1985a]. This increased feeling of competence, brought about through affective adaptation, highlights the potential of adaptive techniques in motivating users to engage with cognitive training interventions. While the drop in challenge is not unequivocally positive, the decrease to a more neutral value in the adaptive version, along with the slight increase in flow (which describes a state of high engagement), suggests that participants are being met with more appropriate challenges that they can overcome using their skills [csikszentmihalyi2014a] (potentially explaining the slight increase in positive affect, and decrease in negative affect, while playing the adaptive version). Finally, when both gameplay sessions were complete, the participants filled out the post-game module of the GEQ [ijsselsteijn2007a]. This gave participants the opportunity to think and reflect on the experience as a whole. The accumulated results (similarly calculated by averaging across their sub-components in the questionnaire) can be found in Fig. 8.

Fig. 8: Post-game GEQ [ijsselsteijn2007a] module responses for each participant (P). Values are on a scale of 0 (not at all) to 4 (extremely), and were calculated by averaging across their respective components.

The largely positive responses here, in conjunction with those recorded by the in-game GEQ module, are promising indicators that gamification and VR can play a role in increasing engagement with cognitive interventions in older adults.

Another area of interest for this evaluation was to what extent the WM and EM tasks, implemented in VR, engaged the intended cognitive abilities of the participants. This was examined by looking for correlations between how participants performed (relative to each other) in the standardised tests and the VR tasks. Positive correlations, calculated using Spearman’s rank-order correlation (rho), were found between the performance rankings of participants in the VR paradigm and closely related standardized tests. Most notably, strong positive correlations were found between the Trail Making Test [tombaugh2004a], which examines executive functioning (a superset of WM), and the WM task in VR (rho=+0.60), and between the 4 Mountains test (a short SM test) and the EM task in VR (rho=+0.74).
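For reference, Spearman's rho is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below shows the calculation on invented scores for six hypothetical participants; the numbers are not the study's data.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank-order correlation between two equally long sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (spread_x * spread_y)


# Invented scores: a standardised test and a VR task for six participants.
standard_test = [1, 2, 3, 4, 5, 6]
vr_task = [2, 1, 4, 3, 6, 5]
rho = spearman_rho(standard_test, vr_task)
```

With only six participants, a correlation of this kind is best read as an indication rather than a precise estimate.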

In the post-session interview, all 6 participants responded that they felt no motion sickness during the study. 4 out of the 6 participants indicated that they felt no facial discomfort, while 2 participants, who wore glasses throughout the study, felt a bit of pressure on their face towards the end. This was likely a result of the slightly thicker face cushion used with the Faceteq prototype (compared to the default HTC Vive cushion).

All participants responded that the weight of the HMD, and attached ADC box, did not bother them (1 participant added that it took time to get used to it, and another that they would like regular breaks if they were to use it for longer durations).

The participants were then encouraged to give more open-ended feedback. The blurriness of the low resolution VR lenses was reported by 2 participants as having a negative impact on their performance and immersion. All 6 participants stated their preference for the museum environment (finding the supermarket more mundane).

The overall consensus was, however, that the environments they would prefer would match those in the real-world. 5 participants preferred the EM task, stating that while they found it to be more complex, the greater variety it offered was a motivating factor for them to return to it and improve. This may point to task variety being an important factor in maintaining user engagement in cognitive training.

VII Conclusion and Future Work

This work investigated the application of a variety of techniques and technologies in the development of more engaging cognitive training schemes. It has detailed the steps in the process for production of an affect-aware VR game for cognitive training. The novel use of facial EMG signals for emotion classification in a VR context was showcased. Across 18 participants (ranging in age from 20 to 100), our choice of sensor (Faceteq prototype) was universally considered to be unobtrusive (obtrusiveness being a much cited flaw in facial EMG applications). It was demonstrated that this physiological modality can be relied on, with a relatively small data source (12 participants, with 29 labeled 45 second EMG segments per participant), to moderate success. Classification rates of 64.1% and 76.2%, for valence and arousal respectively, were achieved through a combination of baseline normalisation, DWT-Haar filtering, temporal feature extraction, feature selection based on the mRMR criterion, and kNN classification. The combined valence/arousal 4-class classification problem served as a reasonably accurate driver (in unison with performance-driven adaptation) for affective adaptation, with an emotion recognition rate of 68.8% (as determined by subject-independent LOSO CV).

The promise of DDA in the development of more engaging cognitive training was substantiated through a small-scale user study with older adults. The qualitative feedback garnered over the course of the study pointed to a notable increase in feelings of competency (associated with intrinsic motivation and engagement [deci1985a]), and participants being more appropriately challenged (moderate feelings of challenge, and a slight increase in flow). Strong positive correlations were found between the participant rankings obtained in the baseline neuropsychology tests and those obtained in the VR paradigm, suggesting that performance in the VR training tasks may be a good indicator of cognitive capacity. Participant feedback relating to both the adaptive and non-adaptive versions of the game (as assessed through in-game and post-game modules of the GEQ, and informal interviews) was largely positive, with very few expressions of negative affect. This response lends credence to the notion that gamification and VR are viable tools for improving engagement in cognitive training with older adults. The findings here should be qualified by reiterating an inherent limitation of the study. Six participants is a small user group for an evaluation study (necessitated by the constrained timescale for the work) and should be expanded in future research to fully determine the veracity of these findings.

While the response and feedback to this study have been encouraging, it is important to qualify these results by reiterating the small sample size used and the one-off nature of the study. Nevertheless, these are promising indicators which suggest that this is a domain that warrants further research.