Affect recognition is an important domain in artificial intelligence due to its significant health, safety and entertainment potentials[mase2020evaluating, richman2005positive]. For example, Richman et al [richman2005positive] suggest that negative emotions have a high likelihood of causing chronic stress and cardiovascular diseases, as well as a significant reduction in work performance, while Mase et al [mase2020evaluating] suggest that the emotional states of drivers, both positive and negative, have a significant impact on their driving performance.
Images with facial expressions are one of the main data sources for affect recognition in modern studies [kim2017multi, jain2018hybrid]. Facial expressions are a form of nonverbal communication that helps humans and systems to understand emotional states from facial movements. The main categories of affective states identified by facial expressions are: happy, neutral, sad, surprised, angry, disgusted and anxious [breuer2017deep]. These affective states are intuitive and simple (i.e. easy to observe and describe), but they fail to represent complex levels of emotions [sun2014deep]. A more realistic and comprehensive representation of affective states uses a two-dimensional continuous description of emotions (i.e. valence and arousal dimensions) [ringeval2013introducing] as shown in Fig. 1. The valence dimension varies from positive emotions, such as happy and energetic, to negative feelings, such as angry and sad. The arousal dimension ranges from excited to calm moods.
Deep Learning (DL) methods have been successfully used to detect the affective state of facial expressions [kim2017multi, jain2018hybrid]. However, the affect recognition systems proposed in previous studies process images in a non-federated manner, which poses serious privacy concerns if the images in a central database are accessed by malicious users or organisations. For instance, a hacker could obtain and assume users’ identities [rathor2013social], or a malicious organisation may use the demographic information in the imagery for discriminatory purposes [zerr2012privacy]. In this work, we propose a two-level privacy-preserving strategy. The first level extracts Action Units (AUs) from a database of raw images, discards the images and processes only the AUs, since AUs safeguard the identities and demographic information of users in case of unauthorised access or misuse of data. The second level employs a Federated Learning (FL) approach in which raw images are processed on users’ local machines and the locally trained models are sent to the main processing machine for aggregation. The main contributions of this study are:
Employing FL in affect recognition using images.
Comparing the prediction, efficiency and privacy performance of non-federated processing of raw images, non-federated processing of anonymised facial features (AUs) and FL of raw images using different variations of Recurrent Neural Networks (RNNs).
In this section, we first provide an overview of the importance of privacy in affect recognition. Secondly, we review the literature on processing images and AUs using non-federated deep learning strategies, and present the privacy limitations of these strategies. Finally, we outline the works that explore an FL approach to preserve user privacy and detect emotions.
II-A Privacy motivation
Affective AI technologies are becoming increasingly prevalent in the early 2020s [mcstay2020emotional]. Such technologies can be deployed in various contexts for different purposes, ranging from assistive technology [bishop2015supporting] to more personalised user experiences, to behaviour manipulation [poria2017review]. However, in some cases, the use of affective AI raises privacy-related concerns. For example, the ubiquitous use of facial emotion recognition systems by retailers poses significant privacy challenges [retailers]. As emotion recognition systems rely on automated facial recognition, retailers are able to identify customers, track their emotions and infer different psychological states, as well as gather and process their personal data, resulting in big datasets for consumer profiling and bias, and in some cases, for price and service discrimination [wachter2020affinity]. In addition, McStay [mcstay2016empathic] noted that data derived from someone’s emotional state may be “intimate” and sensitive information and can be easily linked back to the person using their image.
On this account, individual identity should be protected to prevent further exploitation by malicious individuals and companies.
II-B Non-federated deep learning methods
Different non-federated deep learning architectures have been successfully used to process raw images and predict valence and arousal states [tzirakis2017end, chao2015long, khorrami2016deep, lee2018spatiotemporal]. For example, Tzirakis et al [tzirakis2017end] reported the best valence and arousal recognition performance after training Convolutional Neural Networks (CNNs) coupled with Long Short Term Memory networks (LSTMs) on images from the RECOLA database, while Lee et al [lee2018spatiotemporal] combined features extracted from RECOLA images using 3D CNNs with spatio-temporal features extracted using Convolutional LSTMs to predict the valence score. These non-federated CNN approaches require the developer to maintain a database of images for the automatic extraction of non-linear features and, as such, are susceptible to privacy issues if the images are accessed by malicious users or organisations.
To protect participants’ privacy and still accurately identify human emotions, researchers have explored AUs extracted from facial expressions in images [chao2015long, ortega2019multimodal, valstar2016avec, han2017strength]. AUs represent human-observable facial muscle movements, and their intensity is estimated using facial landmarks. For example, AUs 12 (raising lip corners), 15 (lowering lip corners) and 20 (lip stretch) can be estimated using the facial landmarks on the lips of the avatar face in Fig. 2. We also observe from Fig. 2 that the entire face (thousands of pixels) is reduced to just 98 points, which contain far less identifiable information and are further processed to obtain the AUs. In typical non-federated AU approaches, developers extract AUs from the images, create a database of AUs, discard the images and use the AU database to train their models. Such methods [chao2015long, ortega2019multimodal, valstar2016avec, han2017strength] have been shown to be more privacy-preserving but less accurate than processing the images. In addition, AUs are not completely private, as a recent study by Fan et al [fan2021demographic] demonstrates relationships between AUs and demographic factors such as race, gender and age.
II-C Federated deep learning methods
In order to provide complete user privacy and avoid misuse of users’ identities, one possible solution is to employ an FL approach [yang2019federated]. FL is a technique that allows a machine learning (ML) model to be trained without collecting data centrally. This is done by collaboratively training multiple ML models on users’ local machines (local models), where their personal data resides, and sending the trained models (i.e. model weights) back to the developer’s machine (central model) for aggregation. Depending on the problem, different methods can be employed to aggregate the model weights, such as the mean, median or weighted average. The central model updates its weights using the aggregated weights and sends the updated weights back to the local models. This process keeps the local training data private and confidential. FL methods have been employed in speech emotion recognition [latif2020federated] and detection of depression using mobile health data [xu2021fedmood] but not in Facial Emotion Recognition (FER). To the best of our knowledge, the only study that mentions FL for FER is Chhikara et al [chhikara2020federated]. However, that study neither implements FL nor presents any FL results; it simply mentions FL as a privacy solution for its multi-modal affect recognition approach.
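The aggregation rules mentioned above (mean, median and weighted average) can be illustrated with a minimal sketch. It assumes that each local model is represented as a flat list of weights and that the weighted average uses the number of local training samples as weights; the function name `aggregate` and the parameter `sample_counts` are hypothetical and not taken from any library.

```python
from statistics import mean, median


def aggregate(client_weights, method="mean", sample_counts=None):
    """Element-wise aggregation of flat weight vectors, one vector per client.

    Assumption: every client sends a list of the same length.
    """
    # Group the i-th weight of every client together.
    columns = list(zip(*client_weights))
    if method == "mean":
        return [mean(col) for col in columns]
    if method == "median":
        return [median(col) for col in columns]
    if method == "weighted":
        # Assumption: clients are weighted by their number of training samples.
        total = sum(sample_counts)
        return [
            sum(w * n for w, n in zip(col, sample_counts)) / total
            for col in columns
        ]
    raise ValueError(f"unknown aggregation method: {method}")


# Three hypothetical clients, each holding a two-weight model.
clients = [[0.0, 1.0], [1.0, 2.0], [5.0, 3.0]]
mean_w = aggregate(clients, "mean")
median_w = aggregate(clients, "median")
weighted_w = aggregate(clients, "weighted", sample_counts=[1, 1, 2])
```

In this toy example the median is less affected by the outlying first weight of the third client than the mean, which is one reason a robust aggregation rule may be preferred for some problems.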
In this paper, we implement a two-level privacy preserving strategy consisting of non-federated AUs and FL of images using deep neural networks. We assess their effectiveness in terms of accuracy, efficiency and privacy by comparing their performance with a non-federated image processing strategy. We test the approaches on RECOLA, a comprehensive multimodal affect database with continuous valence and arousal states.
In this section, we introduce the privacy-preserving schemes proposed, including (1) non-federated processing of extracted facial features (AUs), and (2) FL using the raw images. The processing modules for those schemes use RNNs to learn temporal relationships between facial features and emotional states [jain2018hybrid, zhu2017dependency].
III-A Level 1: Non-federated processing of action units
Fig. 3 shows the first level of our privacy-preserving architecture, which processes AUs in a centralised manner. AUs are extracted from a database of images using facial landmark detectors (e.g. OpenFace AU [baltrusaitis2018openface]) and the images are discarded to protect users’ identities and demographic information, as the AUs are free of human faces and demographic information. Next, the AUs are pre-processed by transforming them to similar scales (e.g. by normalisation) for better performance. The pre-processed AUs serve as the input to the processing module. We utilise RNNs in the processing module to capture the temporal relationships among the sequential AUs and feed the temporal features to fully-connected neurons. The fully-connected neurons learn the non-linear relationships between the temporal features and the continuous affective states.
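As an illustration of the pre-processing step, the following minimal sketch scales each AU to a common range and cuts a recording into fixed-length sequences for the RNN. It assumes min-max scaling per AU and non-overlapping windows, neither of which is specified in the text; the function names are hypothetical.

```python
def min_max_scale(frames):
    """Scale every AU (column) to [0, 1] across all frames (rows)."""
    columns = list(zip(*frames))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            # A constant AU carries no information and is mapped to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(frame, lows, highs)
        ]
        for frame in frames
    ]


def to_sequences(frames, seq_len):
    """Cut a recording into consecutive, non-overlapping windows of seq_len frames."""
    last_start = len(frames) - seq_len
    return [frames[i:i + seq_len] for i in range(0, last_start + 1, seq_len)]


# Four hypothetical frames with two AU intensities each.
raw = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0], [1.0, 10.0]]
scaled = min_max_scale(raw)
sequences = to_sequences(scaled, seq_len=2)
```

When such scaling is combined with cross-validation, the minimum and maximum would normally be computed on the training participants only and then applied to the held-out participants.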
III-B Level 2: Federated learning using images
Level 2 of our architecture uses an FL approach that processes users’ images on their local machines and sends the trained models to the central processing module for aggregation, as shown in Fig. 4. The local and central processing machines should have the same model implementation to enable easy aggregation of the trained models. For simplicity, we adopt a mean aggregation strategy in which the weights of the trained models are averaged to produce the aggregated weights (global weights). It is important to note that other aggregation strategies can be explored to merge the weights, e.g. the central processing module could maintain n sets of global weights for the n local machines, where each set is a different weighted average of the local models. We implement CNNs coupled with RNNs to process the images at the local machines. The CNNs and RNNs are trained together end-to-end.
Training occurs simultaneously across the machines. After each training iteration, the locally trained models (i.e. model weights) are sent to the central processing module for aggregation. The centrally aggregated weights are sent back to the local machines to update their weights for the next training iteration. The local training, central aggregation and local weight update processes are repeated until the training process is completed.
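The repeated cycle of local training, central aggregation and local weight update can be sketched on a toy problem. This is not the CNN-RNN model of the paper: it assumes a one-weight linear model y = w * x, plain gradient descent on the MSE, and a configurable number of local steps per round; all names and data below are hypothetical.

```python
def local_train(w, data, lr=0.05, steps=5):
    """Run a few gradient-descent steps on the MSE of the toy model y_hat = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_training(client_data, rounds=20, w_init=0.0):
    """Mean-aggregation federated training; only weights leave the clients."""
    global_w = w_init
    for _ in range(rounds):
        # 1. Each client starts from the current global weight and trains
        #    on its own data, which never leaves the local machine.
        local_ws = [local_train(global_w, data) for data in client_data]
        # 2. The central module averages the returned weights.
        global_w = sum(local_ws) / len(local_ws)
        # 3. The averaged weight is what the clients receive for the next round.
    return global_w


# Three hypothetical clients whose private data all follow y = 3 * x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5)],
    [(1.0, 3.0), (3.0, 9.0)],
]
w = federated_training(clients)
```

In this sketch the local updates are computed one after another for simplicity; in a deployment they would run in parallel on the users’ machines, which is the situation the synchronous training description above refers to.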
IV Experimental design
This section first describes the RECOLA database and presents the deep learning methods selected for the processing modules of our architecture along with their hyper-parameter configurations. Furthermore, it defines the Concordance Correlation Coefficient (CCC) metric used for evaluating the accuracy of the models and concludes with the evaluation protocols for the experiments.
IV-A RECOLA database
The RECOLA (Remote COLlaborative and Affective interactions) database [ringeval2013introducing] is one of the most popular and comprehensive affective datasets with continuous response variables (i.e. valence and arousal). The database consists of image, AU, audio, ECG and EDA datasets for 23 participants. The data was collected during spontaneous and naturalistic interactions between the participants while they performed collaborative tasks. The database also contains the ground truth continuous labels (valence and arousal), which range from -1 to +1 with a step size of 0.01. The annotations were carried out by six annotators. In this study, we explore the image and AU datasets of RECOLA. The AU dataset consists of 40 AUs.
IV-B Model selection and hyper-parameter configuration
We explore three state-of-the-art RNN models to detect valence and arousal: simple RNNs [jain2018hybrid], Bi-directional Gated Recurrent Units (BiGRUs) [chung2015gated], and Bi-directional Long Short Term Memory networks (BiLSTMs) [tzirakis2021end]. We choose these models due to their remarkable performance in time series or sequential analysis [mase2020benchmarking]. To process the raw images in level 2 of our architecture, we employ shallow residual convolutional networks (i.e. ResNet18) [bengio1994learning] due to their remarkable training efficiency (fewer layers compared to other state-of-the-art CNNs) and prediction performance [mase2020benchmarking]. The ResNets are pre-trained on the ImageNet dataset [russakovsky2015imagenet] to take advantage of its large size (transfer learning). We then remove the fully connected layers of the networks and use their output feature maps as inputs to the RNNs. The performance of our privacy-preserving strategies is compared with the non-federated processing of raw images, where a database of user images is maintained. For this, we utilise a deep learning method similar to that employed for processing the images in FL, i.e. ResNets coupled with RNNs, trained together end-to-end as shown in Fig. 5.
When training the networks, we minimise the Mean Squared Error (MSE) between the predicted and annotated affective states. We use the Adam optimiser, a fast stochastic gradient-based optimisation algorithm for deep neural networks, to minimise this loss. The RNNs have the following hyper-parameters: learning rate, hidden size, sequence length, number of recurrent layers and number of fully connected layers. The learning rate controls how the weights are updated with respect to the estimated error. If the learning rate is very low, the learning process will be slow as the updates will be very small, and if the learning rate is very high, the weight updates will be very large, which can lead to divergence. We train the models using three popular learning rates: 0.001, 0.0001 and 0.00001. The hidden size represents the number of hidden units within each recurrent memory cell. We explored hidden sizes of 8, 16, 64, 128, 256 and 512. We also explored AU sequence lengths of 50, 100, 200, 400, 600, 800, 1000 and 2000, and image sequence lengths of 4, 8, 16 and 32. The following numbers of recurrent layers were evaluated: 1, 2, 4, 6 and 8. Lastly, the networks use one fully connected layer of 10 neurons, followed by 2 output neurons for the valence and arousal affective states. Table I presents the optimal hyper-parameter configurations of the architectures after evaluating the validation loss using the hyper-parameters selected above.
|Method||Networks||Learning rate||Sequence length||Hidden size||Number of layers|
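To make the size of the search space concrete, the sketch below enumerates the AU hyper-parameter values listed above and selects the configuration with the lowest validation loss. It assumes an exhaustive grid search, which the text does not state explicitly, and `validation_loss` is a hypothetical callable that would train a model with the given configuration and return its validation loss.

```python
from itertools import product

# Values listed in the text for the AU experiments.
learning_rates = [0.001, 0.0001, 0.00001]
hidden_sizes = [8, 16, 64, 128, 256, 512]
au_sequence_lengths = [50, 100, 200, 400, 600, 800, 1000, 2000]
recurrent_layers = [1, 2, 4, 6, 8]


def au_grid():
    """Yield every combination of the AU hyper-parameter values."""
    for lr, hidden, seq_len, layers in product(
        learning_rates, hidden_sizes, au_sequence_lengths, recurrent_layers
    ):
        yield {
            "lr": lr,
            "hidden_size": hidden,
            "seq_len": seq_len,
            "num_layers": layers,
        }


def select_best(configs, validation_loss):
    """Return the configuration with the lowest validation loss.

    validation_loss is a hypothetical callable: config -> float.
    """
    return min(configs, key=validation_loss)


configs = list(au_grid())
```

Under this assumption the AU grid contains 3 x 6 x 8 x 5 = 720 configurations per RNN variant, which indicates why the choice of candidate values has a direct effect on the total training cost.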
IV-C Evaluation metrics
For performance evaluation, we use the Concordance Correlation Coefficient (CCC). CCC measures how well pairs of observations from two variables fall on the 45-degree line through the origin. Similarly to Pearson’s correlation coefficient, CCC measures how closely two variables are linearly related, but it also quantifies the degree of correspondence (agreement) between them by measuring how well they fit the line passing through the origin with a slope of 1. It is said to be more robust than Pearson’s correlation as it measures both covariation and correspondence. Fig. 6 shows two plots (orange and green), both with a Pearson’s correlation coefficient of 1; the orange plot has a CCC of 1, while the green plot has a CCC of 0.403 due to its disagreement with the 45-degree line. CCC ranges from -1 to 1, with perfect concordance at 1 and perfect discordance at -1.
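CCC is calculated as follows:
$$\rho_c = \frac{2\rho\,\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$$
where $\mu_x$ and $\mu_y$ are the means for the two variables, $\sigma_x^2$ and $\sigma_y^2$ are their corresponding variances, and $\rho$ is the correlation coefficient between the two variables.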
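A minimal sketch of the metric follows, using the standard definition of CCC: twice the covariance of the two variables divided by the sum of their variances plus the squared difference of their means. It assumes population (divide-by-n) variances and covariance; the function names are illustrative.

```python
from math import sqrt


def _moments(x, y):
    """Return means, population variances and population covariance of x and y."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return mu_x, mu_y, var_x, var_y, cov_xy


def pearson(x, y):
    """Pearson's correlation coefficient."""
    _, _, var_x, var_y, cov_xy = _moments(x, y)
    return cov_xy / sqrt(var_x * var_y)


def ccc(x, y):
    """Concordance Correlation Coefficient."""
    mu_x, mu_y, var_x, var_y, cov_xy = _moments(x, y)
    # The correlation times both standard deviations equals the covariance,
    # so the numerator reduces to 2 * cov_xy.
    return 2 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2)


labels = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [v + 1.0 for v in labels]  # perfectly correlated but offset by 1
```

For these hypothetical values, `pearson(labels, shifted)` is 1 while `ccc(labels, shifted)` is 0.8, which mirrors the behaviour described for the two plots: a constant offset from the 45-degree line leaves Pearson’s correlation unchanged but lowers the CCC.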
IV-D Evaluation protocol
First, the ground truth valence and arousal values are obtained by averaging the annotations from the six annotators. Secondly, we employ k-fold cross validation to evaluate the models. The dataset is split by participant, so that no participant contributes to both training and evaluation, to prevent overfitting. In our experiments, we select k = 8, i.e. the data is split into 8 folds, with each fold consisting of data for 2-3 participants depending on the split. The training process is repeated k times to produce k trained models; during each training process, one fold is left out for evaluating the model and the remaining folds are used to train the model. The average CCC amongst the k evaluated models gives the overall performance of the method across the entire dataset. A higher value of k makes the training process more computationally expensive but gives a more robust and accurate estimate of the model’s performance. For a more realistic implementation of FL, we use each participant as a local machine and divide the total training time by the number of participants to represent synchronous local processing. For example, using 8-fold cross validation on 23 participants, we have data for 20 or 21 participants (local machines) for training, and data for the remaining 2 or 3 participants is kept aside for evaluating the global model. All experiments are executed on a Graphics Processing Unit (GPU) using 4 CPU cores and 6GB RAM. Our code is implemented in PyTorch, and each experiment is trained for 100 epochs.
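The participant-wise split can be sketched as follows. The sketch assumes a simple round-robin assignment of participants to folds (the actual assignment used in the experiments is not given) and uses hypothetical participant identifiers.

```python
def participant_folds(participants, k):
    """Assign participants to k folds whose sizes differ by at most one."""
    return [participants[i::k] for i in range(k)]


def cross_validation_splits(participants, k):
    """Yield (training participants, held-out participants) for each fold."""
    for held_out in participant_folds(participants, k):
        train = [p for p in participants if p not in held_out]
        yield train, held_out


# Hypothetical identifiers for the 23 participants.
participants = [f"P{i:02d}" for i in range(1, 24)]
splits = list(cross_validation_splits(participants, k=8))

# Number of participants used for training and evaluation in each fold.
fold_sizes = [(len(train), len(held_out)) for train, held_out in splits]
```

With 23 participants and k = 8, seven folds hold out 3 participants and one fold holds out 2, so each model is trained on 20 or 21 participants; in the FL setting each of those training participants plays the role of one local machine.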
|Affect Recognition Method||Deep Neural Network||Average of Valence CCC||Average of Arousal CCC|
|of Raw Images||CNN-RNN||0.471||0.486|
|of Action Units||RNN||0.183||0.228|
|of Raw Images||CNN-RNN||0.304||0.197|
|Affect recognition method||Data Privacy Level||Average of Valence CCC||Average of Arousal CCC||Training time (mins)||Model size (MB)||Inference time for 100 images (secs)||Inference time for 500 images (secs)|
|Affect Recognition Method||Type of Machine Learning Model||Valence CCC||Arousal CCC|
|Non-federated Processing of Raw Images||CNN + LSTM [tzirakis2017end]||0.620||0.435|
|Non-federated Processing of Raw Images||CNN + LSTM [chao2015long]||0.538||0.336|
|Non-federated Processing of Raw Images||CNN + RNN [khorrami2016deep]||0.474||-|
|Non-federated Processing of Raw Images||2D CNN + ConvLSTM + 3D CNN [lee2018spatiotemporal]||0.546||-|
|Non-federated Processing of Raw Images||Our CNN + BiLSTM||0.476||0.514|
|Non-federated Processing of Action Units or Landmarks||SVM [valstar2016avec]||0.507||0.272|
|Non-federated Processing of Action Units or Landmarks||BiLSTM + SVM [han2017strength]||0.394||0.265|
|Federated Processing of Raw Images||Our CNN + BiLSTM||0.426||0.390|
Note: A dash is inserted if the results were not reported in the original papers.
V Results and Discussion
V-A Comparison of proposed methods
We implemented three state-of-the-art RNN models (i.e., RNN, BiGRU, and BiLSTM) for each method and evaluated their performance using CCC coupled with cross-validation on the RECOLA image and AU datasets. Table II shows the average CCC for valence and arousal after evaluating the models using the best hyper-parameters shown in Table I. The bold values represent the best model performance for the valence and arousal affective states. Overall, the non-federated processing of raw images shows the best valence and arousal predictions, followed by the federated processing of raw images. The strategies that process raw images outperform the processing of AUs due to the loss of spatial information in the AUs. CNNs coupled with BiLSTMs show the best performance for non-federated processing of images, with an average CCC of 0.476 for valence and 0.514 for arousal. The processing of AUs shows arousal prediction performance similar to that of the federated processing of images. In addition, we observe that LSTMs outperform GRUs when processing the raw images, similar to results from other studies that analyse raw images [mase2020benchmarking]. However, for AU processing, GRUs show better performance than LSTMs. This is due to the efficiency of GRUs in processing smaller datasets or feature sets compared to LSTMs, as only 40 AUs are extracted by the facial landmark extractor while 512 features are extracted by the convolutional networks.
Table III presents the efficiency results of the best performing models in terms of training time, inference time and model size. We observe that processing AUs has the lowest training and inference times due to a smaller feature set (which reduces the complexity of the network) and the absence of the convolutional feature extraction layers. This makes the AU processing modules more suitable for real-time affect recognition systems, such as real-time monitoring of patients’ affective states for early intervention and assistance [song2020spectral]. However, the predictive accuracy of processing AUs is lower than that of the non-federated processing of raw images for both valence and arousal. The non-federated processing of raw images shows better accuracy in predicting valence and arousal than AUs and FL, to the detriment of users’ privacy and inference time. FL best preserves users’ privacy and identity compared to the other methods, as data is maintained on users’ local machines. However, its training time is significantly higher and can increase further if the processing at the local machines is not done synchronously. Lastly, FL’s CCC results are inferior to those of the non-federated processing of images due to limited data at the local machines.
V-B Comparison with other studies
In Table IV, we compare the performance of our models with other studies that employ machine learning methods on the RECOLA image and AU datasets for affect recognition. For the non-federated processing of raw images, we observe that [tzirakis2017end, ortega2019multimodal, chao2015long, lee2018spatiotemporal] show better valence recognition results than our model, with Tzirakis et al [tzirakis2017end] having the best valence CCC (0.620). However, our model shows the best arousal accuracy, with a CCC value of 0.514. Those studies also explored different architectures of CNNs coupled with LSTMs, but they are limited by their model evaluation strategy (i.e. a train-test split), which prevents a comprehensive exploration of the data and may lead to less accurate or biased results.
Furthermore, the processing of AUs and facial landmarks by previous studies [chao2015long, valstar2016avec, han2017strength] shows better CCC results in predicting the valence dimension. Valstar et al [valstar2016avec] presented the best valence CCC result of 0.507 using support vector machines. Our model shows the opposite pattern: its arousal predictions are better than its valence predictions, and its arousal CCC (0.401) outperforms that of the other studies. This is due to the remarkable performance of GRUs in processing small feature sets. Storing the anonymised AUs is also more secure in terms of privacy than storing raw images, since maintaining a database of images with facial expressions requires appropriate security levels and systems to safeguard the data, which can be challenging to implement. The trade-off between efficiency and privacy is therefore at a crossroads. Consequently, from a privacy-compliance perspective, it could be argued that storing raw images may not be necessary if alternative methods are available, and that extracting AUs could be considered both a data anonymisation technique for images, protecting the identities and demographic information of users, and an alternative method for affect recognition.
Lastly, we could not find any study in the literature that explores FL to process raw images for FER. As a result, we present the performance of our FL architecture, which uses CNNs coupled with BiLSTMs, as a benchmark for future research on FL and privacy-preserving deep learning techniques for FER. Given the privacy benefits of FL as well as its promising results (i.e. valence = 0.426 and arousal = 0.390), further research is warranted to advance FL for affect recognition.
In this paper, we prioritised privacy in facial emotion recognition by presenting a two-level privacy-preserving architecture consisting of: (1) a deep learning model for processing anonymised facial features (Action Units) from facial expressions to preserve users’ identities, and (2) a federated deep learning approach that aggregates locally trained models on raw images. We implemented three variations of RNNs and compared the models’ performance including the non-federated processing of images on the RECOLA databases. Our results show state-of-the-art performance of for valence and for arousal using Concordance Correlation Coefficient evaluation metric using the privacy-preserving architecture.
For future work, we plan to improve the performance of these models by combining and fusing data from other modalities, while still maintaining the privacy-proactive nature of our system and promoting responsible technology in the sense of “data protection by design and by default”. For example, we could extract acoustic features and combine them with the AUs, or use a federated learning approach to aggregate locally trained models on audio-video data. We also intend to optimise the models and explore spiking neural networks to reduce model complexity for efficient use in real-time emotion detection systems.