Multimodal Data Fusion based on the Global Workspace Theory

01/26/2020 ∙ by Cong Bao, et al. ∙ UCL

We propose a novel neural network architecture, named the Global Workspace Network (GWN), that addresses the challenge of dynamic uncertainties in multimodal data fusion. The GWN is inspired by the well-established Global Workspace Theory from cognitive science. We implement it as a model of attention, between multiple modalities, that evolves through time. The GWN achieved an F1 score of 0.92, averaged over two classes, for the discrimination between patients and healthy participants, based on the multimodal EmoPain dataset captured from people with chronic pain and healthy people performing different types of exercise movements in unconstrained settings. In this task, the GWN significantly outperformed a vanilla architecture. It additionally outperformed the vanilla model in the further classification of three pain levels for a patient (average F1 score = 0.75) based on the EmoPain dataset. We further provide extensive analysis of the behaviour of the GWN and its ability to deal with uncertainty in multimodal data.




1 Introduction

Reasoning about and interpreting multimodal data is an important task in machine learning research because life involves streaming of data from multiple modalities (Baltrusaitis et al., 2017). Multimodal data fusion, which leverages the combination of multiple modalities, is a valuable strategy (Atrey et al., 2010; Calhoun and Sui, 2016; Hori et al., 2017; Liu et al., 2018). Its benefits include complementarity of information, higher prediction performance, and robustness (Baltrusaitis et al., 2017). However, multimodal fusion comes with challenges; Lahat et al. (2015) group them under two categories: (1) challenges at the data observation and acquisition level, and (2) challenges due to uncertainty in the data (such as noise, missing values, and conflicting information). Challenges at the observation level can be managed by data pre-processing, e.g. with data resampling to deal with different temporal resolutions across modalities (Aung et al., 2016). However, challenges due to uncertainty require the design of models that can exploit complementarity or discrepancy across modalities (Lahat et al., 2015), an area which is less explored. Findings in previous work on multimodal fusion have highlighted the effectiveness of weighting different modalities based on some "importance" metric (Wilderjans et al., 2011; Şimşekli et al., 2013; Liberman et al., 2014; Kumar et al., 2007; Acar et al., 2011), which is the basis of the use of attention mechanisms in machine learning (Bahdanau et al., 2015). Despite the fact that uncertainty evolves through time in multimodal, sequential data (Lahat et al., 2015), relevant studies have not sufficiently explored mechanisms for attention that is simultaneously cross-modal and temporal. For example, the architecture proposed by Beard et al. (2018) captures variations in importance along the time axis separately for the different modalities in the data; a drawback of this approach is that these variations are not simultaneously fused over the modalities.

To address this gap in multimodal data fusion, we propose the Global Workspace Network (GWN), which integrates variations in importance simultaneously through time and across modalities. Our GWN is inspired by the Global Workspace Theory (GWT) (Baars, 1997, 2002), which is a well-developed framework in cognitive science, originally proposed as a model of human consciousness (Baars, 1988). The GWT states that concomitant cognitive processes compete for the opportunity to broadcast their current state to their peers (Fountas et al., 2011). At each iteration, the winner (a single process or a coalition of processes) earns the privilege of contributing current information to a global workspace which can be accessed by all processes, including the winner (Shanahan, 2008). This competition and broadcast cycle is believed to be ubiquitous in the perceptual regions of the brain (Baars, 1988). Although the literature contains architectures of biologically-realistic spiking neural networks based on the GWT (Shanahan, 2008; Fountas et al., 2011), so far, to our knowledge, there has been no relevant implementation in machine learning. Mechanistically, the main concept of this theory can be implemented as the combination of a compete-and-broadcast procedure and an external memory structure. In contrast to the global workspace, which can be seen as a communication module, external memory here is used as the means to store information for later application (Shanahan, 2006). Taking the processing of each modality in multimodal data as analogous to specialised processes in the brain, the similarity between the compete-and-broadcast cycle and a typical cross-modality attention mechanism is obvious. The repetition of the cycle allows the pattern of attention to evolve over time and, given the external memory module, to be used in the primary prediction task of the network.

In order to simulate the two main components of the GWT (compete-and-broadcast and external memory), we employed two widely-tested algorithms: the transformer (Vaswani et al., 2017) and the Long Short-Term Memory (LSTM) neural network (Hochreiter and Schmidhuber, 1997; Gers et al., 1999), respectively. There are three key elements of the transformer that we leverage in the GWN. First is its self-attention mechanism (Cheng et al., 2016; Paulus et al., 2017), which we use as the GWN's compete-and-broadcast procedure, where each modality independently scores all modalities and integrates the data from them based on the resulting weights. The second valuable component of the transformer is its memory-based attention mechanism (Weston et al., 2015; Sukhbaatar et al., 2015). Drawing from its traditional application in Natural Language Processing (NLP) question answering tasks (Sukhbaatar et al., 2015; Miller et al., 2016), this unit further maps the feature vector into query, key, and value spaces to increase the weighting depth and robustness (Hu, 2018). This additionally allows the competition and broadcast computations to be separated: the query and key forms can be used for the competition, while the broadcast is performed on the value form, which can carry more expressive information that is not needed for the competition. The third merit of the transformer is its bagging approach, i.e. the use of multiple heads in which multiple attention patterns are learnt in parallel, which has the advantage of increasing robustness. We used the LSTM as the basis for the complementary external memory because it has been shown to be effective for learning long-term dependencies (Lipton, 2015).
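The compete-and-broadcast cycle described above can be sketched in a few lines. This is a toy illustration only: the `compete_and_broadcast` function, the scalar salience scores, and the leaky workspace update are our own simplifying assumptions, not the transformer-and-LSTM mechanism the GWN actually uses.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def compete_and_broadcast(states, scores, workspace, decay=0.5):
    """One GWT-style cycle: modality states compete via a softmax over
    salience scores; the winning coalition (a convex combination of the
    states) is broadcast into a shared workspace that accumulates over time."""
    weights = softmax(scores)
    # Broadcast: weighted combination of the competing modality states.
    broadcast = [sum(w * s[i] for w, s in zip(weights, states))
                 for i in range(len(states[0]))]
    # Workspace acts as a simple external memory (leaky accumulator).
    return [decay * ws + (1 - decay) * b
            for ws, b in zip(workspace, broadcast)]
```

Repeating this cycle over time steps lets the attention pattern evolve, which is the behaviour the GWN formalises with a transformer layer and an LSTM memory.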

The contributions of this paper are as follows:

  • The GWN architecture, a novel approach to fusion of sequential data from multiple modalities.

  • Evaluation of the proposed GWN architecture on the EmoPain dataset (Aung et al., 2016), which consists of motion capture and electromyography (EMG) data collected from patients with chronic lower back pain and healthy control participants while they performed exercise movements. This dataset is representative of real-life data with continuously-streamed multiple modalities, each with varying degrees of uncertainty.

  • Analysis of the GWN's outputs demonstrating its effectiveness in handling uncertainty in data.

The paper is organized as follows. We discuss the state of the art in attention approaches in Section 2. We then describe the proposed GWN architecture in Section 3 and present both validation and analysis of the network in Section 4. Section 5 concludes the paper.

2 Related Work

Attention in the temporal dimension

In the literature on neural networks for multimodal data, attention performed on the time axis is usually separated by modality, and the resulting context vectors from each modality are fused as non-temporal features. A representative case of this approach is the Recursive Recurrent Neural Network (RRNN) architecture proposed by Beard et al. (2018). In their work, different modalities (video, audio, and subtitles) extracted from a subtitled audiovisual dataset were divided into segments of uttered sentences and each segment was used as an input to the network. For each modality in a segment, a bi-directional LSTM layer was used to extract features. At a given time step, attention computation is performed for each modality separately and the outputs are concatenated over all modalities together with the current state of a shared memory, which the authors implemented with a GRU cell (Cho et al., 2014). The outcome is then used to update the state of the shared memory. An advantage of this work is that since each modality is encoded separately, the modalities do not have to follow a common time axis, which allows each to optimally exploit its inherent temporal properties. However, as this method cannot account for attention between modalities, different modalities (some potentially more noisy than others) affect the final prediction equally, and thus the problem of cross-modal uncertainty variation remains unsolved.

Attention across modalities Several related studies have considered the relation between modalities in fusing them. The typical approach (Wilderjans et al., 2011; Şimşekli et al., 2013; Liberman et al., 2014) is the use of modality weighting, although not particularly based on attention mechanisms (Bahdanau et al., 2015). One study that does explicitly use an attention mechanism is the work of Hori et al. (2017) on the modelling of video description. Their approach leverages attention between different modalities using an encoder-decoder architecture (Bahdanau et al., 2015) with separate encoders for each modality and a single decoder. Features of each modality are encoded separately and the decoder weights them to generate a context vector as an output. A similar study (Caglayan et al., 2016) applies multimodal attention in neural machine translation, where images are leveraged in translating description texts from one language to another. The image and text modalities were first encoded using a pre-trained ResNet-50 (He et al., 2015) and a bi-directional GRU (Cho et al., 2014), respectively. Then, attention scores were computed for these encodings. The common approach of encoding the temporal data before computing attention is appropriate for obtaining modality-specific feature representations; however, it does not allow in-depth capture of the complex interactions between modalities through time. In addition, it is not suitable for online processing of live-streamed data.

The GWN architecture we propose addresses these limitations by considering both the interaction of multiple modalities and the temporal variations in this interaction. It is indeed a more intuitive approach to processing a stream of multimodal data, by weighting multiple modalities at each timestep.

3 Global Workspace Network (GWN)

Figure 1: The architecture of the GWN. Here the intermediate matrices $Q_t$, $K_t$, $V_t$, and $C_t$ have the same dimensionality as $X_t$.

The architecture of the GWN is shown in Figure 1. The network consists of five components: an input unit, a mapping block, an attention module, an external memory, and a prediction block. These components are described in detail below.

3.1 Mapping Inputs to a Common Feature Space

Consider $M$ modalities that have an identical sampling rate, i.e. for each data instance, each modality $m$ in that instance can be written as $x^{(m)}_{1:T}$, where $T$ denotes the common temporal length (common across modalities) of the data instance. The dimensionality at a given time may nevertheless be different across these modalities, i.e. $x^{(m)}_t \in \mathbb{R}^{d_m}$ with possibly $d_m \neq d_{m'}$. The attention mechanism of the GWN requires identical dimensions across modalities, and so it is necessary to have a module for mapping the modalities into the same dimensions.

Inspired by the work of Akbari et al. (2018) and Bollegala and Bao (2018), we take the approach of using multiple autoencoders (Vincent et al., 2008) that each learn a common feature space for multiple modalities. Assuming that the common feature space has a dimensionality of $d$, the mapping function in the encoder for each autoencoder outputs a vector with dimensionality of $d$. This function can be designed as a feed-forward network with one hidden layer activated with the rectified linear unit (ReLU) (Nair and Hinton, 2010) non-linearity, i.e.

$$e^{(m)}_t = W^{(m)}_2\,\mathrm{ReLU}\big(W^{(m)}_1 x^{(m)}_t + b^{(m)}_1\big) + b^{(m)}_2$$

where $x^{(m)}_t$ is the data instance sampled at modality $m$ and time $t$; and $W^{(m)}_1$, $b^{(m)}_1$, $W^{(m)}_2$, and $b^{(m)}_2$ are trainable parameters of the function. The findings of Cybenko (1989) suggest that such an encoding should be capable of mapping different modalities into a common feature space. The shared code $z_t$ can then be obtained by summing the outputs across the encoders:

$$z_t = \sum_{m=1}^{M} e^{(m)}_t$$

This is based on previous work in Bollegala and Bao (2018). The decoders have the same form as the encoders, i.e.

$$\hat{x}^{(m)}_t = V^{(m)}_2\,\mathrm{ReLU}\big(V^{(m)}_1 z_t + c^{(m)}_1\big) + c^{(m)}_2$$

where $\hat{x}^{(m)}_t$ is the reconstruction of the data instance sampled at modality $m$ and time $t$; and $V^{(m)}_1$, $c^{(m)}_1$, $V^{(m)}_2$, and $c^{(m)}_2$ are trainable parameters of the decoder. A sum of the mean squared error losses of the autoencoders can be used to train the full mapping module.

Figure 2: An illustration of the mapping module with two different modalities and two autoencoders.

Figure 2 provides an illustration with an example of two modalities mapped into a common feature space and then reconstructed, based on two autoencoders. After pre-training the autoencoders, the encoders are used directly as the mapping function in the GWN. The pre-trained parameters in the encoders then serve as initial values for the mapping block in the GWN. Though this approach introduces more learnable parameters, the findings of Hinton et al. (2006) suggest that unsupervised pre-training on shallow layers can improve the performance of a deep network.

For the subsequent attention module, the output vectors from each modality's mapping are merged by stacking, to form a matrix $X_t \in \mathbb{R}^{M \times d}$.
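As a concrete illustration of the mapping module, the sketch below implements one ReLU-hidden-layer encoder per modality and sums the encoder outputs to form the shared code. The tiny weight matrices used in testing are hypothetical; dimensionalities, the decoders, and the reconstruction-loss training are omitted.

```python
def encoder(x, W1, b1, W2, b2):
    """One modality's encoder: input dim -> hidden (ReLU) -> common dim d.
    Weights are plain nested lists; rows of W1/W2 are output units."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(W2, b2)]

def shared_code(xs, params):
    """z_t: sum of the per-modality encoder outputs in the common space.
    xs is a list of per-modality vectors; params holds each encoder's
    (W1, b1, W2, b2) tuple."""
    outs = [encoder(x, *p) for x, p in zip(xs, params)]
    d = len(outs[0])
    return [sum(o[i] for o in outs) for i in range(d)]
```

In the GWN the per-modality encoder outputs would also be stacked row-wise into the matrix fed to the attention module, rather than only summed.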

3.2 The Attention Module

The attention module is a single layer of the transformer encoder described in (Vaswani et al., 2017), with the difference that, in the GWN, the input $X_t$ is a set of $M$ different modalities for a number of data instances at a specific time $t$, rather than data sequences (i.e. multiple time steps and instances) based on a single modality. Since the input is already in matrix form, the following multi-head attention calculation can be performed:

$$\mathrm{MultiHead}(X_t) = \mathrm{Concat}\big(C^{(1)}_t, \dots, C^{(h)}_t\big)\,W^O$$

where $h$ is the number of heads and $W^O$ is a trainable matrix. Each context matrix $C^{(i)}_t$ for a specific head $i$ is calculated as

$$C^{(i)}_t = \mathrm{softmax}\!\left(\frac{Q^{(i)}_t {K^{(i)}_t}^{\top}}{\sqrt{d_k}}\right) V^{(i)}_t$$

The query, key, and value matrices of a specific head $i$ at time $t$ are calculated as:

$$Q^{(i)}_t = X_t W^Q_i, \qquad K^{(i)}_t = X_t W^K_i, \qquad V^{(i)}_t = X_t W^V_i$$

Here, the query, key, and value are variations of the input $X_t$, based on the idea of the memory-based attention mechanism (Miller et al., 2016). Note that the trainable matrices $W^Q_i$, $W^K_i$, and $W^V_i$ are reused at different time steps but are independent for different heads $i$.

As shown in Figure 1, there are two residual connections (He et al., 2015) in the attention module. Each of the residual connections is followed by a layer normalisation (Lei Ba et al., 2016). The first residual connection can be represented as:

$$R_t = \mathrm{LayerNorm}\big(X_t + \mathrm{MultiHead}(X_t)\big)$$

Here, the assumption of identical dimensionality for the residual connection is satisfied as $X_t \in \mathbb{R}^{M \times d}$ and $\mathrm{MultiHead}(X_t) \in \mathbb{R}^{M \times d}$. The subsequent feed-forward layer and the final output of the attention module, respectively, are:

$$F_t = \mathrm{ReLU}(R_t W_1 + b_1)\,W_2 + b_2, \qquad Z_t = \mathrm{LayerNorm}(R_t + F_t)$$

both in $\mathbb{R}^{M \times d}$.
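A minimal sketch of a single attention head over the stacked modality matrix may help make the compete-and-broadcast reading concrete. This is plain scaled dot-product self-attention in the style of Vaswani et al. (2017); the helper names and the identity test weights are illustrative assumptions, and the multi-head concatenation with the output projection is omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention_head(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the M modality rows of X.
    Each row of the returned attention matrix says how much one modality
    attends to every modality (the 'competition'); the weighted sum of
    value rows is the 'broadcast'."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    dk = len(K[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(dk)
               for kr in K] for qr in Q]
    A = [softmax(row) for row in scores]
    return matmul(A, V)
```

With identity projections and a one-hot row per modality, each output row is exactly the attention distribution that modality places over all modalities.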

3.3 External Memory

The external memory is implemented as an LSTM cell (Hochreiter and Schmidhuber, 1997) with updates:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where the input vector $x_t$ is the flattened form of $Z_t$, and $\sigma$ is the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

$\tanh$ is the hyperbolic tangent function

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

and $\odot$ denotes the Hadamard product (i.e. element-wise product). $s_t = (c_t, h_t)$ is the recurrent state at time step $t$, and consists of the memory cell $c_t$ and the output $h_t$ at that time step, with the dimensionality $d_s$ as a hyperparameter that indicates the size of the external memory. The initial state $s_0$ is set to zeros. $f_t$, $i_t$, and $o_t$ represent the forget, input, and output gates respectively (Hochreiter and Schmidhuber, 1997; Gers et al., 1999). All the gates have the same dimensionality $d_s$. The output vector $h_T$ in the last recurrent state is used by the final prediction component.
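The LSTM update can be sketched for a single scalar unit as follows; the dictionary-of-tuples parameterisation is purely illustrative, and the real external memory is vector-valued with dimensionality $d_s$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One standard LSTM update for a 1-unit cell. W maps gate name ->
    (input weight, recurrent weight, bias); 'g' is the candidate cell."""
    f = sigmoid(W['f'][0] * x + W['f'][1] * h + W['f'][2])  # forget gate
    i = sigmoid(W['i'][0] * x + W['i'][1] * h + W['i'][2])  # input gate
    o = sigmoid(W['o'][0] * x + W['o'][1] * h + W['o'][2])  # output gate
    g = math.tanh(W['g'][0] * x + W['g'][1] * h + W['g'][2])
    c_new = f * c + i * g          # gated memory update
    h_new = o * math.tanh(c_new)   # gated output
    return h_new, c_new
```

In the GWN, `x` at each step would be the flattened attention output, so broadcasts accumulate in the cell state across the sequence.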

3.4 Prediction

The final prediction module consists of a feed-forward layer with one hidden layer activated with a ReLU, followed by a softmax function. The layer serves as a simple non-linear transformation from the external memory and can be applied at any time step, making it suitable for online prediction with streaming data. The equations are given as

$$u = \mathrm{ReLU}(W_1 h_T + b_1)$$
$$\hat{y} = \mathrm{softmax}(W_2 u + b_2)$$

where $\hat{y}$ is the prediction result mapped into a distribution over the class labels. Both $\hat{y}$ and the label $y$ have the same dimensionality, the size of the label set.
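A sketch of this prediction head, assuming hypothetical small weight matrices: a ReLU hidden layer applied to the final LSTM output, followed by a softmax over the logits.

```python
import math

def predict(h, W1, b1, W2, b2):
    """Feed-forward prediction head: ReLU hidden layer, then softmax.
    h is the last LSTM output; rows of W1/W2 are output units."""
    u = [max(0.0, sum(w * hi for w, hi in zip(row, h)) + b)
         for row, b in zip(W1, b1)]
    logits = [sum(w * ui for w, ui in zip(row, u)) + b
              for row, b in zip(W2, b2)]
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]
```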

4 Experiments

To evaluate the proposed GWN architecture, we conducted experiments on the multimodal EmoPain dataset (Aung et al., 2016). In Section 4.1, the dataset, data preprocessing, and experiment tasks are introduced. Section 4.2 describes the methods and metrics used for evaluation against a baseline model. Finally, Section 4.3 presents the performance and empirical analyses of the GWN.

4.1 Data

4.1.1 The EmoPain Dataset

The EmoPain dataset (Aung et al., 2016) is suitable given that it consists of sequential data from multiple modalities, captured in unconstrained settings where there are bound to be uncertainties in the data (e.g. in the form of sensor noise), in varying degrees over time. The data was collected from 22 patients with chronic low back pain and 28 healthy control participants and includes motion capture (MC) and muscle activity data based on surface electromyography (EMG). The data for each participant was acquired while they performed physical exercises that put demands on the lower back. For each exercise, there were two levels of difficulty. The normal trial comprised 7 types of exercise: (1) balancing on the preferred leg, (2) sitting still, (3) reaching forward, (4) standing still, (5) sitting to standing and standing to sitting, (6) bending down, and (7) walking. The difficult trial modified four of these exercise types to increase the level of physical demand, i.e. (8) balancing on each leg, (9) reaching forward while standing holding a 2 kg dumbbell, (10) sitting to standing and return to sitting initiated upon instruction, and (11) walking with a 2 kg weight in each hand, starting by bending down to pick up the weights, with exercises (2) and (4) repeated without modification. The data was acquired so as to build automatic detection models for pain and related cognitive and affective states, and so after each exercise type, patients self-reported the level of pain they experienced on a scale of 0 to 10 (0 for no pain and 10 for extreme pain) (Jensen and Karoly, 1992). In this paper, we used the subset of the EmoPain dataset with the self-reported pain labels available and where consent was given for further use of the data. This subset consists of 14 patients with chronic pain and 8 healthy control participants, resulting in a total of 200 exercise instances.

4.1.2 Evaluation Experiment Tasks

The proposed GWN architecture was evaluated on two classification tasks based on the multimodal EmoPain dataset:

Pain Level Detection Task
Figure 3: Number of exercise instances per class for (a) the Healthy-vs-Patient Discrimination Task and (b) the Pain Level Detection Task.

The aim of this task is to detect the pain level of a person with chronic pain. The motivation for creating such a system is to endow technology with the capability to support physical rehabilitation by providing timely feedback or prompts, and personalised recommendations tailored to the pain level of a person with chronic pain. For example, a person with low-level pain may be reminded to take breaks at appropriate times and not overdo it, whilst a person with high pain may be reminded to breathe to reduce tension which may further increase pain levels (Olugbade et al., 2019).

A formal description of the task is as follows. Given $M$ and $E$, denoting MC and EMG data, for an unseen subject known to have chronic pain (i.e. the event $C = 1$), infer the probability $P(L \mid M, E, C = 1)$ that the data corresponds to one of three levels of pain. A random variable $L$ represents the level of chronic pain, with $L \in \{0, 1, 2\}$. In this paper, 0 represents zero-level pain, i.e. pain self-report $= 0$; 1 represents low-level pain, i.e. $0 <$ pain self-report $< 5$; and 2 represents high-level pain, i.e. pain self-report $\geq 5$.
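Assuming the thresholds as stated (a self-report of exactly 0 is level 0, below 5 is low, 5 and above is high), the label mapping can be sketched as:

```python
def pain_level(self_report):
    """Map a 0-10 pain self-report to the three task labels:
    0 = zero-level pain, 1 = low (0 < report < 5), 2 = high (report >= 5)."""
    if self_report == 0:
        return 0
    return 1 if self_report < 5 else 2
```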

Healthy-vs-Patient Discrimination Task

The healthy control participants were assumed to have no pain. However, patients with chronic pain who reported pain as 0 were not considered to be in the same class as these participants. Hence, a separate model may be needed to first distinguish a person with chronic pain from healthy participants.

The formal definition of the task is as follows. Given $M$ and $E$, infer the probability $P(C = 1 \mid M, E)$ that the data belongs to a person with chronic pain. A random variable $C$ represents the event that an unseen subject has chronic pain, with $C \in \{0, 1\}$: 0 for healthy and 1 for a person with chronic pain.

Figure 3 shows the number of exercise instances for each class, for the Healthy-vs-Patient Discrimination Task and Pain Level Detection Task respectively.

4.1.3 Data Preprocessing

Here, we describe the preprocessing performed to prepare the data for the evaluation experiments.

Dealing with A High Sampling Rate

The EMG data of the EmoPain dataset had been downsampled from 1000 Hz to 60 Hz for consistency with the MC data. However, 60 Hz results in high dimensionality, whereas preliminary experiments suggested that 10 Hz may be sufficient for the Healthy-vs-Patient Discrimination Task. Thus, we downsampled both MC and EMG data further, to 10 Hz, for this task. The original 60 Hz was found to be more appropriate for the Pain Level Detection Task.
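The further downsampling step can be sketched as naive decimation (keeping every sixth frame for 60 Hz to 10 Hz); the paper does not specify the exact resampling method, so this is an illustrative assumption.

```python
def downsample(sequence, src_hz=60, dst_hz=10):
    """Decimate a frame sequence by keeping every (src_hz // dst_hz)-th
    frame. A sketch only: a production pipeline might low-pass filter
    first to avoid aliasing."""
    if src_hz % dst_hz != 0:
        raise ValueError("source rate must be a multiple of target rate")
    step = src_hz // dst_hz
    return sequence[::step]
```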

Padding for Uniform Sequence Lengths

Based on the findings in Dwarampudi and Reddy (2019) and Wang et al. (2019), we used pre-padding rather than post-padding to obtain uniform time sequence lengths for different data instances. Further, we used zero padding, which is the common approach used in modelling when assuming no prior knowledge about the input data (Shi et al., 2015).
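The zero pre-padding described above can be sketched as follows; `pre_pad` and its argument names are our own.

```python
def pre_pad(instances, target_len, feat_dim):
    """Zero pre-padding: prepend zero frames so every instance reaches
    target_len, keeping the real data at the end of the sequence.
    Assumes no instance is longer than target_len."""
    zero = [0.0] * feat_dim
    return [[zero] * (target_len - len(seq)) + list(seq)
            for seq in instances]
```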

Dealing with Imbalanced Data

As can be seen in Figure 3, the class distribution of the data is skewed for both pain classification tasks. To reduce bias toward the majority class, we randomly over-sampled data instances of the minority class (Kotsiantis et al., 2005).
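Random over-sampling of the minority classes can be sketched as below; the function and its seeding are illustrative, not the authors' exact procedure.

```python
import random

def oversample(instances, labels, seed=0):
    """Randomly duplicate minority-class instances until every class
    matches the majority-class count."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        xs = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs)
        out_y.extend([y] * target)
    return out_x, out_y
```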

Data Augmentation

The total number of exercise instances available for training and evaluation was 200, which is a limited amount for training a neural network. To address this problem, we employed data augmentation, specifically creating new instances from the original by rotating them. Preliminary experiments that we performed showed that rotation about the y-axis, which lies along the cranial-caudal axis, outperforms the mirror reflection augmentation used in Olugbade et al. (2018). This augmentation approach used four angles, 0°, 90°, 180°, and 270°, and resulted in four times the original data size. For each newly created instance, only the MC data was changed by the rotation; the original EMG data was used unchanged, as it is not affected by orientation.
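The y-axis rotation used for augmentation can be sketched as follows. The rotation convention (the sign of the sine terms) and the point-tuple representation are assumptions; in the paper's pipeline only the MC joint positions would pass through this function.

```python
import math

def rotate_y(points, angle_deg):
    """Rotate 3D motion-capture points about the vertical (y) axis.
    EMG channels are left untouched elsewhere, since rotation does not
    affect them."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a + z * sin_a, y, -x * sin_a + z * cos_a)
            for x, y, z in points]
```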

4.2 Evaluation Methods

4.2.1 Baseline Model

A simple concatenation (CONCATN) architecture, which is representative of the traditional multimodal data fusion approach, was used as the baseline network against which we evaluated our GWN architecture. This baseline allows evaluation of the contribution of the GWN's mapping and attention components to its performance. The CONCATN has external memory and prediction units identical to the GWN's. Hence, it can be seen as a network that does not pay particular attention to different modalities over time, but rather treats them equally through time.

In the CONCATN, multiple modalities are concatenated along the feature axis and fed into an LSTM network. The feed-forward equations are

$$x_t = \mathrm{Concat}\big(x^{(1)}_t, \dots, x^{(M)}_t\big)$$

followed by the LSTM updates given in Section 3.3, where $M$ is the number of modalities, $c_t$ is the memory cell and $h_t$ is the hidden state. The initial states $c_0$ and $h_0$ have values of zero. Assuming the dimensionality of each modality input at a specific time $t$ is $d_m$, the dimensionality of the concatenated vector is $\sum_{m=1}^{M} d_m$. The dimensionalities of $c_t$ and $h_t$ have the same values as in the GWN model. The prediction module is also identical to the GWN model, i.e. the last LSTM output is fed into a feed-forward network with one hidden layer activated with the ReLU (Nair and Hinton, 2010) non-linearity.

Task | Validation | Model | ACC | MCC | F1(0) | F1(1) | F1(2) | F1(avg) | r | p
Healthy-vs-Patient Discrimination Task | LOSOCV | CONCATN | 0.765 | 0.489 | 0.662 | 0.820 | - | 0.745 | 0.628 | 0.003
 | | GWN | 0.920 | 0.831 | 0.887 | 0.938 | - | 0.915 | |
 | CV | CONCATN | 0.587 | 0.110 | 0.434 | 0.675 | - | 0.555 | 0.768 | 0.015
 | | GWN | 0.648 | 0.225 | 0.482 | 0.733 | - | 0.613 | |
Pain Level Detection Task | LOSOCV | CONCATN | 0.653 | 0.465 | 0.464 | 0.667 | 0.756 | 0.629 | 0.487 | 0.068
 | | GWN | 0.766 | 0.645 | 0.581 | 0.800 | 0.857 | 0.748 | |
 | CV | CONCATN | 0.395 | 0.075 | 0.249 | 0.438 | 0.441 | 0.379 | 0.596 | 0.059
 | | GWN | 0.448 | 0.151 | 0.309 | 0.474 | 0.503 | 0.430 | |
Table 1: Evaluation experiment results comparing the GWN with the baseline CONCATN. F1(0), F1(1), F1(2), and F1(avg) are the per-class and class-averaged F1 scores; r is the effect size and p the p-value of a Wilcoxon Signed-Rank test comparing the two models. p < 0.05 indicates that the model performance of the GWN is significantly higher (significance level = 0.05); p slightly above 0.05 indicates that it is marginally significantly higher.

4.2.2 Validation Technique

In the experiments carried out, we used leave-one-subject-out cross-validation (LOSOCV), where the data of a single subject is left out for testing in each fold, as is the standard approach for evaluating the generalisation capability of a model to unseen subjects. However, for statistical tests comparing the proposed GWN with the baseline CONCATN, the LOSOCV has the limitation of overlapping training sets across folds, which carries a risk of high Type I error (Dietterich, 1998). Thus, in this work, we additionally performed CV (i.e. 5 random replications of 2-fold CV), which has a lower risk of Type I errors (Dietterich, 1998), for the purpose of model comparison. The advantage of the 2-fold CV is that there is no overlap between training sets.

For both the LOSOCV and the CV, we performed a Wilcoxon signed-rank test (Wilcoxon, 1945) to compare the proposed GWN and the baseline CONCATN.

4.3 Results and Discussion

4.3.1 Comparison with the Baseline

Both the GWN and the CONCATN baseline model were trained with the same optimisation algorithm (Adam (Kingma and Ba, 2014)), learning rate (0.001), and batch size (32), which were chosen by grid search. The dimensionality of the LSTM cell, which is the shared hyperparameter of the two models, was also kept the same, i.e. 64. The performance of the GWN can be seen in Table 1 in comparison with the CONCATN baseline model, based on accuracy (ACC), the Matthews Correlation Coefficient (MCC) (Matthews, 1975), and F1 scores.

Our results show that the GWN significantly outperforms the baseline for the Healthy-vs-Patient Discrimination Task (significance level = 0.05), with an F1 score of 0.913 based on the LOSOCV, averaged over the two classes. The effect size is r = 0.768 for the CV and r = 0.628 for the LOSOCV. As expected, due to the smaller training data size in the CV, it gives lower performance estimates than the LOSOCV for both the baseline CONCATN and the GWN. Although only marginally significant in this case, the GWN also outperforms the baseline CONCATN in the Pain Level Detection Task, with effect size r = 0.596 for the CV.

4.3.2 Attention Patterns

An additional advantage of the proposed GWN model is that the attention patterns obtained in modelling can provide insight into the relevance of each modality through time. In our experiments, we found 5 attention patterns (see Figure 4 for further specification of each pattern):


Favours-Itself-Always (FIA)

The given modality always pays attention to itself and never switches attention to the other modality.

Favours-Other-Sometimes (FOS)

The given modality mostly pays attention to itself but sometimes switches its attention to the other modality.

Favours-Itself-and-Other-in-Balance (FIOB)

The given modality pays balanced attention to itself and the other modality.

Favours-Itself-Sometimes (FIS)

The given modality mostly pays attention to the other modality but sometimes switches attention to itself.

Favours-Other-Always (FOA)

The given modality always pays attention to the other modality and never to itself.

Figure 4: The percentage of attention a modality pays to itself in each of the five patterns. The thresholds of 40% and 60% used in these definitions were chosen heuristically as a ±10% interval around 50%.
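Our reading of the five definitions can be sketched as a small classifier over a modality's per-step self-attention fractions. The exact decision rules are specified in Figure 4, which we could not fully recover, so the conditions below (thresholding the per-step fractions for the "always"/"never" cases and the mean for the intermediate cases) are a hedged interpretation using the stated 40%/60% thresholds.

```python
def attention_pattern(self_attention_fracs, low=0.4, high=0.6):
    """Label a modality's attention trace with one of the five patterns,
    given the fraction of attention the modality pays to itself at each
    time step."""
    mean_self = sum(self_attention_fracs) / len(self_attention_fracs)
    if all(f >= high for f in self_attention_fracs):
        return "FIA"   # Favours-Itself-Always
    if all(f <= low for f in self_attention_fracs):
        return "FOA"   # Favours-Other-Always
    if mean_self > high:
        return "FOS"   # mostly itself, sometimes the other
    if mean_self < low:
        return "FIS"   # mostly the other, sometimes itself
    return "FIOB"      # balanced
```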
Figure 5: An example of attention distribution of one exercise instance. Head 0 means the first attention head. Modality 0 (M0) represents MC and modality 1 (M1) represents EMG.
Noise | FIA | FOS | FIOB | FIS | FOA | mean of switch # | std. of switch #
None | 0.51 / 0.40 | 0.04 / 0.29 | 0.03 / 0.05 | 0.05 / 0.15 | 0.37 / 0.11 | 0.40 / 14.3 | 1.32 / 30.9
In MC | 0.31 / 0.43 | 0.08 / 0.36 | 0.02 / 0.05 | 0.11 / 0.10 | 0.48 / 0.07 | 6.92 / 14.6 | 25.0 / 30.4
In EMG | 0.50 / 0.46 | 0.02 / 0.27 | 0.02 / 0.05 | 0.06 / 0.09 | 0.41 / 0.13 | 0.35 / 12.6 | 1.52 / 30.7
Table 2: Relative frequency of the five attention patterns (and the mean and standard deviation of the number of attention switches) for the Pain Level Detection Task, with or without noise added in the data. Each cell reports MC / EMG values.

Figure 5 gives an example of the FOS pattern. In this case, modality 1 (EMG) pays attention to itself most of the time (98.54%), with a few switches (6 times) to modality 0 (MC).

The frequency of occurrence of each of the five attention patterns is shown in Table 2 (the 'None' row). It can be seen that MC tends to always pay attention either only to itself or mostly to the EMG (higher FIA and FOA frequencies), whereas the EMG balances its attention (higher FOS, FIOB, and FIS frequencies). One possible explanation is that, since the dimensionality of the EMG (4) is much lower than the dimensionality of the MC data (78), the EMG is always trying to balance the difference in information. In contrast, the MC modality is rich in information, and so can afford to pay 100 percent attention to itself.

4.3.3 Evaluating How The GWN Deals with Uncertainty in Data

Noise | ACC | MCC | F1(0) | F1(1) | F1(2) | F1(avg)
None | 0.766 | 0.645 | 0.581 | 0.800 | 0.857 | 0.748
In MC | 0.734 | 0.594 | 0.557 | 0.763 | 0.822 | 0.715
In EMG | 0.734 | 0.599 | 0.590 | 0.747 | 0.813 | 0.721
Table 3: Results of the Pain Level Detection Task with and without noise added to each of the MC and EMG data.

In order to further examine the behaviour of the GWN model with respect to uncertainties in the data, noise was added to one modality at a time. We experimented with different levels of noise. We expected that if the GWN manages uncertainty in data, the modality without added noise would pay less attention to the noisy modality.

The noise was sampled from a Gaussian distribution with zero mean and standard deviation $\sigma$ equal to 10% of the standard deviation in the original data for the given modality. For instance, as the standard deviation of the MC data in the Pain Level Detection Task is 105.4, in this case $\sigma = 10$ (rounded to the nearest ten). Similarly, $\sigma$ for the EMG recordings of the same dataset was set to 10% of the EMG data's standard deviation.
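The noise injection can be sketched as follows for a single-channel sequence; `add_noise` is our own helper, using the population standard deviation of the sequence itself as the reference scale.

```python
import random
import statistics

def add_noise(sequence, fraction=0.1, seed=0):
    """Add zero-mean Gaussian noise with sigma equal to `fraction`
    (10% in the paper) of the sequence's own standard deviation."""
    rng = random.Random(seed)
    sigma = fraction * statistics.pstdev(sequence)
    return [x + rng.gauss(0.0, sigma) for x in sequence]
```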

Table 3 presents the results of adding noise. A Wilcoxon Signed-Rank test showed no significant (significance level = 0.05) difference between the accuracy of the GWN model with and without noise in the MC data, based on the LOSOCV, or with and without noise in the EMG data, also based on the LOSOCV. This suggests that the proposed GWN may be tolerant to this level of noise.

Table 2 shows the GWN's behaviour with the noisy input (the 'In MC' row for noisy MC and the 'In EMG' row for noisy EMG), separated based on the detected attention patterns. Compared with the frequencies of the five attention patterns without added noise, with noisy MC data the frequency of FIA for the MC decreases while its frequencies of FOS, FIS, and FOA increase. This indicates that the MC modality is able to recognise noise in itself and rely more on the other modality (EMG). This is also evident in the increase in its mean switch frequency.

In contrast, having noisy EMG (see the 'In EMG' row in Table 2) does not result in the same behaviour. Compared with the frequencies of the five attention patterns without added noise, the frequency of the EMG's FIA with noisy EMG unexpectedly increases. The frequencies of FOS and FIS also do not increase. Only the FOA frequency shows the expected, albeit slight, increase. In addition, the mean switch frequency shows no increase. These results suggest that the EMG modality is less sensitive to its own noisiness. One explanation is that the amount of noise added to the EMG was not sufficient to influence the feature representation. Another possible reason is that the system is sensitive to the precise amount of information being lost per modality. Since the dimensionalities of MC and EMG are different, 78 and 4 respectively, the noise added to the MC corrupts more information than when added to the EMG, leading to a more sensitive MC in the former case.

5 Conclusion

Here we proposed the GWN, a novel neural network architecture for multimodal fusion in temporal data. At each time step, multiple modalities compete to broadcast their information, and each broadcast is accumulated over time. We find that the GWN outperforms a baseline that fuses modalities by concatenation, for pain level detection based on the EmoPain dataset. Our analysis further highlights the selectivity of the different modalities in this dataset. Moreover, modality-specific noise manipulations revealed the ability of the GWN to deal with changes in uncertainty over time. We believe that our system presents a promising direction for future research in multimodal neural networks, while maintaining a close connection with cognitive neuroscience. Such interdisciplinary links can be fruitful for both communities and help each propel the other forward.


References

  • E. Acar, D. M. Dunlavy, T. G. Kolda, and M. Mørup (2011) Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems 106, pp. 41–56.
  • H. Akbari, S. Karaman, S. Bhargava, B. Chen, C. Vondrick, and S. Chang (2018) Multi-level multimodal common semantic space for image-phrase grounding. CoRR abs/1811.11683.
  • P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16 (6), pp. 345–379.
  • M. S. H. Aung, S. Kaltwang, B. Romera-Paredes, B. Martinez, A. Singh, M. Cella, M. Valstar, H. Meng, A. Kemp, M. Shafizadeh, A. C. Elkins, N. Kanakam, A. de Rothschild, N. Tyler, P. J. Watson, A. C. d. C. Williams, M. Pantic, and N. Bianchi-Berthouze (2016) The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal EmoPain dataset. IEEE Transactions on Affective Computing 7 (4), pp. 435–451.
  • B. J. Baars (1988) A cognitive theory of consciousness. Cambridge University Press, Cambridge, MA.
  • B. J. Baars (1997) In the theater of consciousness. Oxford University Press, New York, NY.
  • B. J. Baars (2002) The conscious access hypothesis: origins and recent evidence. Trends in Cognitive Sciences 6 (1), pp. 47–52.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
  • T. Baltrusaitis, C. Ahuja, and L. Morency (2017) Multimodal machine learning: a survey and taxonomy. CoRR abs/1705.09406.
  • R. Beard, R. Das, R. W. M. Ng, P. G. K. Gopalakrishnan, L. Eerens, P. Swietojanski, and O. Miksik (2018) Multi-modal sequence fusion via recursive attention for emotion recognition. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, pp. 251–259.
  • D. Bollegala and C. Bao (2018) Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1650–1661.
  • O. Caglayan, L. Barrault, and F. Bougares (2016) Multimodal attention for neural machine translation. CoRR abs/1609.03976.
  • V. D. Calhoun and J. Sui (2016) Multimodal fusion of brain imaging data: a key to finding the missing link(s) in complex mental illness. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging 1 (3), pp. 230–244.
  • J. Cheng, L. Dong, and M. Lapata (2016) Long short-term memory-networks for machine reading. CoRR abs/1601.06733.
  • K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. CoRR abs/1409.1259.
  • G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, pp. 303–314.
  • T. G. Dietterich (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10 (7), pp. 1895–1923.
  • M. Dwarampudi and N. V. S. Reddy (2019) Effects of padding on LSTMs and CNNs. ArXiv abs/1903.07288.
  • Z. Fountas, D. Gamez, and A. K. Fidjeland (2011) A neuronal global workspace for human-like control of a computer game character. In 2011 IEEE Conference on Computational Intelligence and Games (CIG'11), pp. 350–357.
  • F. A. Gers, J. Schmidhuber, and F. Cummins (1999) Learning to forget: continual prediction with LSTM. In Ninth International Conference on Artificial Neural Networks (ICANN 99), Vol. 2, pp. 850–855.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385.
  • G. E. Hinton, S. Osindero, and Y. Teh (2006) A fast learning algorithm for deep belief nets. Neural Computation 18 (7), pp. 1527–1554.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • C. Hori, T. Hori, T. Lee, K. Sumi, J. R. Hershey, and T. K. Marks (2017) Attention-based multimodal fusion for video description. CoRR abs/1701.03126.
  • D. Hu (2018) An introductory survey on attention mechanisms in NLP problems. CoRR abs/1811.05544.
  • M. P. Jensen and P. Karoly (1992) Self-report scales and procedures for assessing pain in adults. Handbook of Pain Assessment, pp. 135–151.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980. Published as a conference paper at the 3rd International Conference on Learning Representations, San Diego, 2015.
  • S. Kotsiantis, D. Kanellopoulos, and P. Pintelas (2005) Handling imbalanced datasets: a review. GESTS International Transactions on Computer Science and Engineering 30, pp. 25–36.
  • M. Kumar, D. P. Garg, and R. A. Zachery (2007) A method for judicious fusion of inconsistent multiple sensor data. IEEE Sensors Journal 7 (5), pp. 723–733.
  • D. Lahat, T. Adali, and C. Jutten (2015) Multimodal data fusion: an overview of methods, challenges, and prospects. Proceedings of the IEEE 103 (9), pp. 1449–1477.
  • J. Lei Ba, J. Ryan Kiros, and G. E. Hinton (2016) Layer normalization. arXiv abs/1607.06450.
  • Y. Liberman, R. Samuels, P. Alpert, and H. Messer (2014) New algorithm for integration between wireless microwave sensor network and radar for improved rainfall measurement and mapping. Atmospheric Measurement Techniques 7 (10), pp. 3549–3563.
  • Z. C. Lipton (2015) A critical review of recurrent neural networks for sequence learning. CoRR abs/1506.00019.
  • H. Liu, Y. Wu, F. Sun, B. Fang, and D. Guo (2018) Weakly paired multimodal fusion for object recognition. IEEE Transactions on Automation Science and Engineering 15 (2), pp. 784–795.
  • B. W. Matthews (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (2), pp. 442–451.
  • A. H. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston (2016) Key-value memory networks for directly reading documents. CoRR abs/1606.03126.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML'10), pp. 807–814.
  • T. A. Olugbade, J. W. Newbold, R. M. G. Johnson, E. Volta, P. Alborno, R. Niewiadomski, M. Dillon, G. Volpe, and N. Bianchi-Berthouze (2018) Automatic detection of reflective thinking in mathematical problem solving based on unconstrained bodily exploration. CoRR abs/1812.07941.
  • T. A. Olugbade, A. Singh, N. Bianchi-Berthouze, N. Marquardt, M. S. H. Aung, and A. C. D. C. Williams (2019) How can affect be detected and represented in technological support for physical rehabilitation?. ACM Transactions on Computer-Human Interaction 26 (1), pp. 1:1–1:29.
  • R. Paulus, C. Xiong, and R. Socher (2017) A deep reinforced model for abstractive summarization. CoRR abs/1705.04304.
  • M. Shanahan (2006) A cognitive architecture that combines internal simulation with a global workspace. Consciousness and Cognition 15 (2), pp. 433–449.
  • M. Shanahan (2008) A spiking neuron model of cortical broadcast and competition. Consciousness and Cognition 17 (1), pp. 288–303.
  • X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. CoRR abs/1506.04214.
  • U. Şimşekli, B. Ermiş, A. T. Cemgil, and E. Acar (2013) Optimal weight learning for coupled tensor factorization with mixed divergences. In 21st European Signal Processing Conference (EUSIPCO 2013), pp. 1–5.
  • S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus (2015) Weakly supervised memory networks. CoRR abs/1503.08895.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), New York, NY, USA, pp. 1096–1103.
  • C. Wang, T. A. Olugbade, A. Mathur, A. C. de C. Williams, N. D. Lane, and N. Bianchi-Berthouze (2019) Automatic detection of protective behavior in chronic pain physical rehabilitation: a recurrent neural network approach. CoRR abs/1902.08990.
  • J. Weston, S. Chopra, and A. Bordes (2015) Memory networks. CoRR abs/1410.3916.
  • F. Wilcoxon (1945) Individual comparisons by ranking methods. Biometrics Bulletin 1 (6), pp. 80–83.
  • T. Wilderjans, E. Ceulemans, I. Van Mechelen, and R. van den Berg (2011) Simultaneous analysis of coupled data matrices subject to different amounts of noise. The British Journal of Mathematical and Statistical Psychology 64, pp. 277–290.