Measuring Cognitive Workload Using Multimodal Sensors

by   Niraj Hirachan, et al.
University of Canberra

This study aims to identify a set of indicators to estimate cognitive workload using a multimodal sensing approach and machine learning. A set of three cognitive tests was conducted to induce cognitive workload in twelve participants at two levels of task difficulty (Easy and Hard). Four sensors were used to measure the participants' physiological changes: electrocardiogram (ECG), electrodermal activity (EDA), respiration (RESP), and blood oxygen saturation (SpO2). To understand the perceived cognitive workload, NASA-TLX was administered after each test and analysed using a Chi-Square test. Three well-known classifiers (LDA, SVM, and DT) were trained and tested independently using the physiological data. The statistical analysis showed that participants' perceived cognitive workload was significantly different (p<0.001) between the tests, which demonstrated the validity of the experimental conditions to induce different cognitive levels. Classification results showed that a fusion of ECG and EDA presented good discriminating power (acc=0.74) for cognitive workload detection. This study provides preliminary results in the identification of a possible set of indicators of cognitive workload. Future work needs to be carried out to validate the indicators using more realistic scenarios and with a larger population.




I Introduction

Humans have a limited amount of cognitive resources. At any instant, we can process only a certain amount of information, and maintaining healthy cognitive function is a challenge. For instance, continuous high cognitive workload (i.e., overload) for extended periods has a negative impact on performance, and sub-optimal decisions, human errors, or accidents might occur [1]. Similarly, low levels of cognitive load (i.e., underload) can affect performance due to lack of concentration, boredom, or loss of motivation. Thus, it is important to evaluate the cognitive load imposed by different tasks in objective terms and in real time. This measurement would enable us to create assistive systems that help the user make optimal decisions when faced with overload or underload conditions, and that help maintain psychological well-being in the long run.

In the literature, two main methods to measure cognitive load have been proposed: subjective and objective measures [1]. Subjective measures are based on self-reporting scales, questionnaires, or interviews. A popular metric is the NASA Task Load Index (NASA-TLX) questionnaire, since it is a well-established and reliable tool to measure workload [2]. However, a disadvantage of this technique is that it cannot measure cognitive load objectively or in real time, as the questionnaire is completed at the end of the task. On the other hand, objective physiological measures are based on collecting physiological signals from sensors. This approach is based on the principle that cognitive load is reflected in physiological signals controlled by the autonomic nervous system [3].

Applications of physiological sensors to measure cognitive load are found in the literature. For instance, Nourbakhsh et al. [4] used EDA and eye blinks to assess cognitive workload during an arithmetic task. In another example, Tsunoda et al. [5] employed heart rate variability (HRV) to examine cognitive load in the advanced trail making test. Heeman et al. [6] used pupillary diameter to estimate cognitive load during a dialog task. While research has been done on measuring cognitive load using physiological sensors, each of these studies was limited to a single type of activity. There is therefore a need to investigate multiple physiological correlates across multiple cognitive tasks.

In this study, we designed an experiment in which we used four different physiological indicators to measure cognitive workload in twelve participants while they performed multiple tasks. The three cognitive tasks are the Raven test [7], a Numerical test, and a Video Game; all tests were run at two difficulty levels (Easy and Hard). Given the role of cognitive workload in human performance, mental fatigue, and mental health, measuring cognitive workload objectively is imperative in managing mental resources and maintaining overall well-being.

II Methodology

II-A Participants

Twelve participants (7M/5F) took part in the experiment. Their ages ranged from 20 to 40 years (mean age 25, std 5.5). Written informed consent was obtained before the experiment, and participants were informed that no personal information would be recorded. The experimental procedures involving human subjects described in this paper were approved by the Institutional Review Board. Prior to the start of the experiment, the protocol was clearly explained.

II-B Experimental Protocol

All experiments were conducted at the Human-Machine Interface Laboratory at the University of Canberra, Australia. During the experiment, the participants were seated on a fixed chair in front of a computer screen placed on a desk. Four sensors (Biosignal plux, Lisbon, Portugal) were used to record physiological data: Electrocardiography (ECG), Electrodermal Activity (EDA), Respiration (RESP), and Oxygen Saturation (SpO2). Figure 1 presents an example of the experimental setup.

Fig. 1: Experiment setup with subject wearing the physiological sensors

The experiment consisted of 3 different cognitive activities (Game, Numerical Test, and Raven’s Test) to induce cognitive load. Each activity was divided into two difficulty levels, Easy and Hard; in total there were 6 activities (3 tests × 2 levels). In the Game, the participants played a bouncing ball video game in which they had to navigate through a maze by clicking a mouse button; the difficulty level was raised by increasing the speed of the game. Raven’s test consisted of 10 visual items depicting a matrix of colored geometric shapes arranged in a 1x6 layout; each matrix contained one empty cell and options for the participant to select from to complete the matrix. The two difficulty levels were taken from [8]. The numerical test included 10 multiple-choice questions for the participants to solve mentally; the Easy level included operations with 2 variables, while the Hard level included 4 variables.

The whole experiment was divided into two parts following a randomized orthogonal experimental design. At the start of the experiment, one minute was given for baseline recording. After that, the activities were presented to the participants, each lasting two minutes. After completing each activity, a computer-based NASA-TLX assessment was administered to the participants. The NASA-TLX is used to measure the subjective mental load of the participants (the ground truth of the experiment) [1]. The NASA-TLX measures 6 different aspects: Mental Demand, Physical Demand, Temporal Demand, Own Performance, Effort, and Frustration level.

II-C Data Analysis

II-C1 Validation of Experimental Conditions

In order to validate the design of this experiment and the experimental conditions, the responses to the NASA-TLX questionnaire were analysed. The null hypothesis is that the perceived workload is not affected by the experimental task. A Chi-Square test of contingencies (with ) was used to assess whether the NASA-TLX items were related to the different experimental conditions (Easy and Hard levels) across all the participants. The assumptions of independence and minimum expected frequencies for the statistical test were met. Post-hoc tests were undertaken using a Bonferroni correction for pairwise comparisons between tests.
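The test described above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the counts in `counts`, the significance level `ALPHA = 0.05`, and the number of comparisons are assumptions made for the example and are not values reported in this study; the function names are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    # Closed form for df = 1 or df = 2, then raise df in steps of 2
    # using Q(x; n + 2) = Q(x; n) + (x/2)^(n/2) * exp(-x/2) / Gamma(n/2 + 1).
    n = 2 if df % 2 == 0 else 1
    q = math.exp(-x / 2) if n == 2 else math.erfc(math.sqrt(x / 2))
    while n < df:
        q += (x / 2) ** (n / 2) * math.exp(-x / 2) / math.gamma(n / 2 + 1)
        n += 2
    return q

def chi2_contingency_test(table):
    """Pearson chi-square test of independence on a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_sf(stat, df)

def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / n_comparisons

# Hypothetical counts: rows = condition (Easy, Hard),
# columns = binned NASA-TLX rating (low, high).
counts = [[20, 10], [10, 20]]
ALPHA = 0.05  # assumed for illustration only
stat, df, p_value = chi2_contingency_test(counts)
significant = p_value < bonferroni_alpha(ALPHA, n_comparisons=6)
```

How the NASA-TLX ratings were binned into a contingency table is not stated in the text, so the two-column layout here is purely a placeholder.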

II-C2 Pre-processing

All sensors collected physiological data with a sampling rate of . The data were processed using 10 s windows, as this window size performed best among the sizes tested (5 s to 50 s). Then, each physiological signal was treated separately to remove noise. For the ECG signal, a band-pass filter () was used to remove high-frequency oscillations and powerline interference. Similarly, the respiration data were filtered with a band-pass filter (). For the EDA data, a low-pass filter () was applied to remove line noise.
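As a rough sketch of the windowing and smoothing steps, the following standard-library Python splits a signal into non-overlapping 10 s windows and applies a simple single-pole low-pass filter. The single-pole filter is a stand-in chosen for brevity, not the filter design used in this study, and the sampling rate and cutoff passed by a caller are assumptions; whether the original windows overlapped is also not stated.

```python
import math

def split_into_windows(signal, fs_hz, window_s=10.0):
    """Cut a signal into consecutive, non-overlapping windows of window_s seconds."""
    size = int(round(fs_hz * window_s))
    # Any trailing samples that do not fill a whole window are dropped.
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def single_pole_low_pass(signal, fs_hz, cutoff_hz):
    """First-order (RC-style) low-pass filter; a simple stand-in filter."""
    if not signal:
        return []
    dt = 1.0 / fs_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)  # smoothing factor in (0, 1)
    filtered = []
    previous = signal[0]
    for sample in signal:
        previous = previous + alpha * (sample - previous)
        filtered.append(previous)
    return filtered
```

For example, `split_into_windows(samples, fs_hz=100, window_s=10)` would yield windows of 1000 samples each under an assumed 100 Hz sampling rate.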

II-C3 Feature Extraction

Physiological features that potentially correlate with cognitive workload were extracted from the four sensors. Statistical features (n=10 per sensor) were obtained from each of the EDA and SpO2 sensors: Mean, Standard Deviation, Amplitude, Trough, Range, Skew, Kurtosis, first and third quartiles (Q1, Q3), and interquartile range (IQR). From the ECG signals, morphological features (n=10) were calculated: R-peaks, beats per minute, inter-beat interval (IBI), standard deviation of RR intervals (stdI), standard deviation of successive differences (stdD), root mean square of successive differences, proportion of successive differences above 20ms (PNN20), proportion of successive differences above 50ms (PNN50), median absolute deviation of RR intervals (MAD), and breathing rate. From the respiration signal (n=11), inhalation, exhalation, range estimation top and bottom percentile of peaks, trough respiratory values, offset level, slope, breath-to-breath interval (BBI), standard deviation, Q1, Q3, and IQR were calculated. In total, we extracted 41 features (10 + 10 + 10 + 11).
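A few of the features listed above can be sketched with the Python standard library as follows. The exact definitions used in this study (for example population versus sample standard deviation, the quantile method, or how Amplitude and Trough were computed) are not given in the text, so the choices below are assumptions, and the function names are hypothetical.

```python
import math
import statistics

def window_statistics(window):
    """Statistical features of one signal window (e.g. EDA or SpO2)."""
    n = len(window)
    mean = statistics.fmean(window)
    # Central moments (population form, an assumed convention).
    m2 = sum((x - mean) ** 2 for x in window) / n
    m3 = sum((x - mean) ** 3 for x in window) / n
    m4 = sum((x - mean) ** 4 for x in window) / n
    q1, _, q3 = statistics.quantiles(window, n=4)
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "max": max(window),
        "min": min(window),
        "range": max(window) - min(window),
        "skew": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0,  # excess kurtosis
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

def hrv_features(rr_ms):
    """Heart rate variability features from RR intervals in milliseconds."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    median_rr = statistics.median(rr_ms)
    return {
        "bpm": 60000.0 / statistics.fmean(rr_ms),
        "ibi": statistics.fmean(rr_ms),
        "stdI": statistics.stdev(rr_ms),   # sample std of RR intervals
        "stdD": statistics.stdev(diffs),   # sample std of successive differences
        "rmssd": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "pnn20": sum(1 for d in diffs if abs(d) > 20) / len(diffs),
        "pnn50": sum(1 for d in diffs if abs(d) > 50) / len(diffs),
        "mad": statistics.median([abs(x - median_rr) for x in rr_ms]),
    }
```

R-peak detection, which produces the RR intervals, is assumed to have been done beforehand and is outside the scope of this sketch.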

Fig. 2: An example of the physiological data captured by the EDA, SpO2, ECG, and Respiration sensors during each of the experimental conditions.

II-C4 Feature Selection

For all the sensor data, feature selection was conducted to retain the most informative features and to build a more accurate model. We chose Joint Mutual Information (JMI) for feature selection, as it offers a better trade-off between accuracy, stability, and flexibility than other ranking methods [9]. An early fusion approach was followed to concatenate all computed features before the classification task.
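To illustrate the idea behind JMI ranking, the sketch below implements a greedy forward selection over discrete features: the first feature is the one with the highest mutual information with the label, and each subsequent feature maximises the sum of joint mutual information terms I(X_k, X_j; Y) over the already selected features X_j. It assumes the features have been discretised beforehand (the binning scheme is not specified in this study), and the feature names and toy data are hypothetical. Under early fusion, the `features` dictionary would simply hold the features of all sensors together.

```python
from collections import Counter
import math

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def jmi_rank(features, labels, k):
    """Greedy JMI ranking; `features` maps a name to a discrete sequence."""
    names = list(features)
    # Step 1: start from the single most informative feature.
    first = max(names, key=lambda name: mutual_information(features[name], labels))
    selected = [first]
    remaining = [name for name in names if name != first]
    # Step 2: add the feature with the largest summed joint information.
    while remaining and len(selected) < k:
        def jmi_score(name):
            return sum(
                mutual_information(list(zip(features[name], features[s])), labels)
                for s in selected
            )
        best = max(remaining, key=jmi_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy, already-discretised data (hypothetical).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "f_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "f_b": [1, 1, 0, 0, 1, 1, 1, 1],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],
}
ranking = jmi_rank(toy_features, labels, k=3)
```

In this toy case `f_b` is ranked ahead of `noise` because, combined with `f_a`, it resolves the one sample that `f_a` alone misclassifies.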

II-C5 Classification

The objective of the classification problem was to identify the difficulty level (Easy vs Hard) from the physiological data. Three classification methods were compared to establish the most appropriate recognition model: Linear Discriminant Analysis (LDA), Support Vector Machine (SVM, with a radial basis kernel), and Decision Trees (DT). For all the classifiers, the data were split into 70% for training and the remaining 30% for testing. A 10-fold cross-validation was performed during the training process.
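The evaluation protocol (a 70/30 split followed by 10-fold cross-validation on the training portion) can be sketched on sample indices with standard-library Python. This is only an assumed reading of the protocol: the text does not say whether samples were shuffled, stratified, or grouped by participant, and the function names and the seed are hypothetical.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle sample indices and split them into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_splits(indices, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

train_idx, test_idx = train_test_split_indices(100)
cv_splits = list(k_fold_splits(train_idx, k=10))
```

Each classifier would then be fitted on the `training` indices of a fold, scored on its `validation` indices, and finally evaluated once on `test_idx`. Note that a purely random split of windows can place windows from the same participant in both sets; a participant-wise split is a stricter alternative.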

III Results

Figure 2 presents an example of the recorded data from all four sensors in each experimental condition.

III-A Validation of Experimental Conditions

The experimental assumption is that in Hard experimental conditions, the participant’s perceived workload will be significantly different than in Easy conditions. Following a Chi-Square test, the Game- test was non-significant , . On the other hand, the Raven’s- test was statistically significant , . Similarly, the Numerical- test was also statistically significant , . Pairwise comparisons reported that the differences between Easy and Hard tasks were significant for Raven’s Frustration, , ; Mental Demand, , ; Performance, , ; and Temporal Demand, , . Similarly, significant differences in Numerical test sub-scales were found in Performance , and Mental Demand, , .

Fig. 3: Pairwise Comparisons Among Tasks and NASA-TLX sub-scales. *

III-B Evaluation of Physiological Indicators

III-B1 Baseline Results

The results on the test set using LDA, SVM, and DT are presented in Table I. The classification models were applied to data from each sensor individually and then to all sensors combined; the number of features used in each model appears in parentheses. When analysed separately, the Respiration and SpO2 sensor data yielded the weakest results, with best accuracies of 0.60 (LDA and SVM) and 0.56 (DT), respectively. On the other hand, the ECG and EDA sensor data obtained favorable results (acc = 0.68) using DT. Overall, the best results were obtained using data from all four sensors with DT (acc = 0.70).

Classifier   ECG (10)   RESP (11)   SpO2 (10)   EDA (10)   All (41)
LDA          0.63       0.60        0.53        0.61       0.67
SVM          0.66       0.60        0.51        0.62       0.66
DT           0.68       0.57        0.56        0.68       0.70
TABLE I: Baseline results (classification accuracy); the number of features appears in parentheses.

III-B2 Feature Selection

After using JMI to find feature significance, all features were ranked and used to train and test the classifiers. The identification of the best performing model was systematically tested with a different number of features based on their feature importance (ranking). Figure 4 presents the results of this systematic search. It is clear that DT obtained the best results (acc = 0.77) with 27 features. Lower results were obtained with the LDA (acc = 0.70) and the SVM (acc = 0.68) classifiers using 30 and 12 features, respectively. However, by using the top-10 most important features, the DT classifier obtained acceptable results (acc = 0.74), which represents a clear improvement from the baseline results (acc = 0.70) using the 41 features from all sensors. These top-10 features are (in order of importance): , , , , , , , , .
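The systematic search over the number of ranked features can be sketched as a simple loop: train and score a model on the top-1, top-2, ..., top-n features and keep the count with the best accuracy. In the sketch below, `evaluate` is a hypothetical callback standing in for training and testing a classifier on a feature subset; the feature names and accuracies in the toy example are made up for illustration and are not results from this study.

```python
def search_feature_count(ranked_features, evaluate):
    """Evaluate the top-k ranked features for every k and return the best k."""
    curve = []
    best_k, best_accuracy = 0, float("-inf")
    for k in range(1, len(ranked_features) + 1):
        accuracy = evaluate(ranked_features[:k])
        curve.append((k, accuracy))
        # Strict comparison keeps the smallest k among equally good models.
        if accuracy > best_accuracy:
            best_k, best_accuracy = k, accuracy
    return best_k, best_accuracy, curve

# Toy stand-in for "train a classifier and return its test accuracy".
toy_scores = {1: 0.60, 2: 0.70, 3: 0.65, 4: 0.70}
best_k, best_accuracy, curve = search_feature_count(
    ["feat_1", "feat_2", "feat_3", "feat_4"],
    evaluate=lambda subset: toy_scores[len(subset)],
)
```

The returned `curve` corresponds to the kind of accuracy-versus-feature-count plot shown in Figure 4.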

Fig. 4: Classification results using ranked features.

IV Discussions

The subjective workload assessment using NASA-TLX responses was evaluated to determine whether the experimental conditions induced different levels of cognitive load. Based on the statistical analysis, the null hypothesis was rejected; thus, it was found that in Hard experimental conditions, the participants’ perceived workload was significantly different than in Easy conditions. Pairwise comparison showed that not all experiments induced different levels of cognitive load. The statistical significance for the Raven’s and Numerical tests indicated that both were able to detect differences between metrics of the NASA-TLX; however, the Game test did not contribute to differentiating Hard and Easy levels of cognitive load. It can be argued that there are no statistical differences in the items of the Game test because the majority of the participants were university students, who are familiar with video games. This test was considered less difficult than expected due to the experience of the users. Hence, in our future work we will increase the difficulty level to make the game more challenging for the participants.

The statistical tests also showed significant differences in the NASA-TLX sub-scales. For instance, most of the sub-scales for the Raven’s test were significantly different, with the perceived Frustration, Mental Demand, Performance, and Temporal Demand sub-scales higher in the Hard condition than in the Easy condition. Similarly, the Numerical test exhibited significant differences in the perceived Mental Demand and Performance sub-scales. In these two tests, the perceived Mental Demand was the highest sub-scale during the Hard (high difficulty) condition, while self-reported Performance was higher in the Easy (low difficulty) condition. These results are in line with similar studies where NASA-TLX has been employed to assess workload variations in different conditions [10].

Physiological data were used to investigate the automated identification of cognitive workload. Using data from each sensor separately, both the ECG and EDA outperformed (acc = 0.68) the RESP (acc = 0.60) and SpO2 (acc = 0.56) sensors. The performance of the ECG and EDA sensor data was further confirmed after the feature selection process, in which the top-10 features were composed only of features from ECG and EDA data (acc = 0.74). The obtained results are comparable with published results in the literature. For instance, Ding et al. [11] found that a fusion of ECG and EDA reached satisfactory results (acc = ) for the classification of mental workload using neural networks. In another similar study [12], a combination of ECG, GSR, SpO2, electroencephalography, and electromyography was successfully used to discriminate different cognitive tasks using SVM (acc = ). These results suggest that ECG and EDA are possible indicators of cognitive workload.

Finally, the contributions of this study can be summarised as follows: 1) it offers an exploratory study that aims to compare different physiological metrics for the objective assessment of cognitive workload, and 2) it presents ten physiological indicators from both ECG and EDA as potential indicators for the objective assessment of cognitive workload.

V Conclusions

This study aimed to determine a set of indicators to estimate cognitive workload using physiological sensors and machine learning. Ten features were identified as possible indicators of cognitive workload. Fused ECG and EDA data were more informative than RESP and SpO2 data for discriminating between cognitive workload levels; fusion of multiple sensors usually improves cognitive workload assessment [13, 14]. In our future work, these indicators will be tested and validated using other cognitive tasks, more realistic scenarios, and a larger population.


  • [1] R. Fernandez Rojas, E. Debie, J. Fidock, M. Barlow, K. Kasmarik, S. Anavatti, M. Garratt, and H. Abbass, “Electroencephalographic workload indicators during teleoperation of an unmanned aerial vehicle shepherding a swarm of unmanned ground vehicles in contested environments,” Frontiers in neuroscience, vol. 14, p. 40, 2020.
  • [2] E. Galy, J. Paxion, and C. Berthelon, “Measuring mental workload with the NASA-TLX needs to examine each dimension rather than relying on the global score: an example with driving,” Ergonomics, vol. 61, no. 4, pp. 517–527, 2018.
  • [3] R. L. Charles and J. Nixon, “Measuring mental workload using physiological measures: A systematic review,” Applied ergonomics, vol. 74, pp. 221–232, 2019.
  • [4] N. Nourbakhsh, Y. Wang, and F. Chen, “GSR and blink features for cognitive load classification,” in IFIP conference on human-computer interaction. Springer, 2013, pp. 159–166.
  • [5] K. Tsunoda, A. Chiba, H. Chigira, T. Ura, and O. Mizunq, “Estimating changes in a cognitive performance using heart rate variability,” in 2015 IEEE 15th International Conference on Bioinformatics and Bioengineering (BIBE).   IEEE, 2015, pp. 1–6.
  • [6] P. A. Heeman, T. Meshorer, A. L. Kun, O. Palinko, and Z. Medenica, “Estimating cognitive load using pupil diameter during a spoken dialogue task,” in Proceedings of the 5th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, 2013, pp. 242–245.
  • [7] P. A. Carpenter, M. A. Just, and P. Shell, “What one intelligence test measures: a theoretical account of the processing in the Raven progressive matrices test,” Psychological review, vol. 97, no. 3, p. 404, 1990.
  • [8] J. Raven et al., “Raven progressive matrices,” in Handbook of nonverbal assessment.   Springer, 2003, pp. 223–237.
  • [9] R. F. Rojas, X. Huang, and K.-L. Ou, “A machine learning approach for the identification of a biomarker of human pain using fNIRS,” Scientific reports, vol. 9, no. 1, pp. 1–12, 2019.
  • [10] B. R. Lowndes, K. L. Forsyth, R. C. Blocker, P. G. Dean, M. J. Truty, S. F. Heller, S. Blackmon, M. S. Hallbeck, and H. Nelson, “NASA-TLX assessment of surgeon workload variation across specialties,” Annals of surgery, vol. 271, no. 4, pp. 686–692, 2020.
  • [11] Y. Ding, Y. Cao, V. G. Duffy, Y. Wang, and X. Zhang, “Measurement and identification of mental workload during simulated computer tasks with multimodal methods and machine learning,” Ergonomics, vol. 63, no. 7, pp. 896–908, 2020.
  • [12] Q. Xu, T. L. Nwe, and C. Guan, “Cluster-based analysis for personalized stress evaluation using physiological signals,” IEEE journal of biomedical and health informatics, vol. 19, no. 1, pp. 275–281, 2014.
  • [13] E. Debie, R. F. Rojas, J. Fidock, M. Barlow, K. Kasmarik, S. Anavatti, M. Garratt, and H. A. Abbass, “Multimodal fusion for objective assessment of cognitive workload: a review,” IEEE transactions on cybernetics, vol. 51, no. 3, pp. 1542–1555, 2019.
  • [14] M. Webber and R. F. Rojas, “Human activity recognition with accelerometer and gyroscope: a data fusion approach,” IEEE Sensors Journal, 2021.