Reasoning about and interpreting multimodal data is an important task in machine learning research because real life involves continuous streams of data from multiple modalities (Baltrusaitis et al., 2017). Multimodal data fusion, which leverages the combination of multiple modalities, is a valuable strategy (Atrey et al., 2010; Calhoun and Sui, 2016; Hori et al., 2017; Liu et al., 2018). Its benefits include complementarity of information, higher prediction performance, and robustness (Baltrusaitis et al., 2017). However, multimodal fusion comes with challenges; Lahat et al. (2015) group them into two categories: (1) challenges at the data observation and acquisition level, and (2) challenges due to uncertainty in the data (such as noise, missing values, and conflicting information). Challenges at the observation level can be managed by data pre-processing, e.g. resampling to deal with different temporal resolutions across modalities (Aung et al., 2016). Challenges due to uncertainty, however, require the design of models that can exploit complementarity or discrepancy across modalities (Lahat et al., 2015), an area that is less explored. Findings in previous work on multimodal fusion have highlighted the effectiveness of weighting different modalities based on some “importance” metric (Wilderjans et al., 2011; Şimşekli et al., 2013; Liberman et al., 2014; Kumar et al., 2007; Acar et al., 2011), which is the basis of the use of attention mechanisms in machine learning (Bahdanau et al., 2015). Despite the fact that uncertainty evolves through time in multimodal, sequential data (Lahat et al., 2015), relevant studies have not sufficiently explored mechanisms that combine cross-modal and temporal attention. For example, the architecture proposed by Beard et al. (2018) captures variations in importance along the time axis separately for the different modalities in the data.
A drawback of their approach is that these variations are not simultaneously fused over the modalities.
To address this gap in multimodal data fusion, we propose the Global Workspace Network (GWN), which integrates variations in importance simultaneously through time and across modalities. Our GWN is inspired by the Global Workspace Theory (GWT) (Baars, 1997, 2002), a well-developed framework in cognitive science originally proposed as a model of human consciousness (Baars, 1988). The GWT states that concomitant cognitive processes compete for the opportunity to broadcast their current state to their peers (Fountas et al., 2011). At each iteration, the winner (a single process or a coalition of processes) earns the privilege of contributing its current information to a global workspace that can be accessed by all processes, including the winner (Shanahan, 2008). This competition-and-broadcast cycle is believed to be ubiquitous in the perceptual regions of the brain (Baars, 1988). Although the literature contains architectures of biologically-realistic spiking neural networks based on the GWT (Shanahan, 2008; Fountas et al., 2011), to our knowledge there has so far been no relevant implementation in machine learning. Mechanistically, the main concept of this theory can be implemented as the combination of a compete-and-broadcast procedure and an external memory structure. In contrast to the global workspace, which can be seen as a communication module, the external memory here is used as the means to store information for later application (Shanahan, 2006). Taking the processing of each modality in multimodal data as analogous to the specialised processes in the brain, the similarity between the compete-and-broadcast cycle and a typical cross-modal attention mechanism becomes apparent. The repetitiveness of the cycle allows the pattern of attention to evolve over time and, given the external memory module, to be used in the primary prediction task of the network.
In order to simulate the two main components of the GWT (compete-and-broadcast and external memory), we employed two widely-tested algorithms: the transformer (Vaswani et al., 2017) and the Long Short-Term Memory (LSTM) neural network (Hochreiter and Schmidhuber, 1997; Gers et al., 1999), respectively. There are three key elements of the transformer that we leverage in the GWN. The first is its self-attention mechanism (Cheng et al., 2016; Paulus et al., 2017), which we use as the GWN’s compete-and-broadcast procedure: each modality independently scores all modalities and integrates the data from them based on the resulting weights. The second valuable component of the transformer is its memory-based attention mechanism (Weston et al., 2015; Sukhbaatar et al., 2015). Drawing from its traditional application in Natural Language Processing (NLP) question answering tasks (Sukhbaatar et al., 2015; Miller et al., 2016), this unit maps the feature vector into query, key, and value spaces to increase the weighting depth and robustness (Hu, 2018). This additionally separates the competition from the broadcast computation: the query and key forms are used for the competition, while the broadcast is performed on the value form, which can carry more expressive information that is not needed for the competition. The third merit of the transformer is its bagging approach, i.e. the use of multiple heads in which multiple attention patterns are learnt in parallel, which increases robustness. We used the LSTM as the basis for the complementary external memory because it has been shown to be effective for learning long-term dependencies (Lipton, 2015).
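The split between competition (query/key scoring) and broadcast (value mixing) described above can be sketched as scaled dot-product attention computed across modalities rather than time; the random weight matrices and dimensions below are illustrative assumptions, not the paper's trained parameters:

```python
import numpy as np

def compete_and_broadcast(H, Wq, Wk, Wv):
    """One compete-and-broadcast step over M modality vectors.

    H: (M, d) matrix, one row per modality at the current time step.
    Each modality scores all modalities (competition) via query/key
    dot products, then integrates their value projections (broadcast).
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # (M, M) competition
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over modalities
    return weights @ V                                # broadcast: weighted mix

rng = np.random.default_rng(0)
d = 8
H = rng.standard_normal((3, d))                       # 3 hypothetical modalities
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = compete_and_broadcast(H, Wq, Wk, Wv)
print(out.shape)  # (3, 8)
```

Each row of `weights` sums to one, so every modality distributes its attention across all modalities (including itself) at every step.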
The contributions of this paper are as follows:
- The GWN architecture, a novel approach to the fusion of sequential data from multiple modalities.
- An evaluation of the proposed GWN architecture on the EmoPain dataset (Aung et al., 2016), which consists of motion capture and electromyography (EMG) data collected from patients with chronic lower back pain and healthy control participants while they performed exercise movements. This dataset is representative of real-life data with continuously-streamed multiple modalities, each with varying degrees of uncertainty.
- An analysis of the GWN’s outputs demonstrating its effectiveness in handling uncertainty in data.
2 Related Work
Attention in the temporal dimension
In the literature on neural networks for multimodal data, attention performed on the time axis is usually separated by modality, and the resulting context vectors from each modality are fused as non-temporal features. A representative case of this approach is the Recursive Recurrent Neural Network (RRNN) architecture proposed by Beard et al. (2018). In their work, different modalities (video, audio, and subtitles) extracted from a subtitled audiovisual dataset were divided into segments of uttered sentences, and each segment was used as an input to the network. For each modality in a segment, a bi-directional LSTM layer was used to extract features. At a given time step, attention is computed for each modality separately and the outputs are concatenated over all modalities, together with the current state of a shared memory, which the authors implemented with a GRU cell (Cho et al., 2014). The outcome is then used to update the state of the shared memory. An advantage of this design is that, since each modality is encoded separately, the modalities do not have to follow a common time axis, which allows each to optimally exploit its inherent temporal properties. However, as this method cannot account for attention between modalities, different modalities (some potentially noisier than others) affect the final prediction equally, and thus the problem of cross-modal uncertainty variation remains unsolved.
Attention across modalities Several related studies have considered the relation between modalities when fusing them. The typical approach (Wilderjans et al., 2011; Şimşekli et al., 2013; Liberman et al., 2014) is modality weighting, although not usually based on attention mechanisms (Bahdanau et al., 2015). One study that does explicitly use an attention mechanism is the work of Hori et al. (2017) on the modelling of video description. Their approach leverages attention between different modalities using an encoder-decoder architecture (Bahdanau et al., 2015) with separate encoders for each modality and a single decoder. Features of each modality are encoded separately, and the decoder weights them to generate a context vector as an output. A similar study by Caglayan et al. (2016) applies multimodal attention in neural machine translation, where images are leveraged in translating description texts from one language to another. The image and text modalities were first encoded using a pre-trained ResNet-50 (He et al., 2015) and a bi-directional GRU (Cho et al., 2014) respectively; attention scores were then computed for these encodings. This common approach of encoding the temporal data before computing attention is appropriate for obtaining modality-specific feature representations; however, it does not allow in-depth capture of the complex interactions between modalities through time. In addition, it is not suitable for online processing of live-streamed data.
The GWN architecture we propose addresses these limitations by considering both the interaction of multiple modalities and the temporal variations in this interaction. It is indeed a more intuitive approach to processing a stream of multimodal data, by weighting multiple modalities at each timestep.
3 Global Workspace Network (GWN)
The architecture of the GWN is shown in Figure 1. The network consists of five components: an input unit, a mapping block, an attention module, an external memory, and a prediction block. These components are described in detail below.
3.1 Mapping Inputs to a Common Feature Space
Consider $M$ modalities with an identical sampling rate, so that for each data instance, modality $m$ can be written as a sequence $X^{(m)} = (\mathbf{x}_1^{(m)}, \dots, \mathbf{x}_T^{(m)})$, where $T$ denotes the temporal length common across modalities. The dimensionality at a given time step may nevertheless differ across modalities, i.e. $\mathbf{x}_t^{(m)} \in \mathbb{R}^{d_m}$ with $d_m$ not necessarily equal to $d_{m'}$. The attention mechanism of the GWN requires an identical dimensionality across modalities, so a module is needed to map the modalities into a common feature space.
To this end, we take the approach of using multiple autoencoders (Vincent et al., 2008) that together learn a common feature space for the modalities. Assuming the common feature space has dimensionality $d$, the mapping function in the encoder of each autoencoder outputs a $d$-dimensional vector using a ReLU (Nair and Hinton, 2010) non-linearity, i.e.

$$f_m(\mathbf{x}_t^{(m)}) = \mathrm{ReLU}\big(W_m^{(2)}\,\mathrm{ReLU}(W_m^{(1)}\mathbf{x}_t^{(m)} + \mathbf{b}_m^{(1)}) + \mathbf{b}_m^{(2)}\big),$$

where $\mathbf{x}_t^{(m)}$ is the data instance sampled at modality $m$ and time $t$, and $W_m^{(1)}$, $\mathbf{b}_m^{(1)}$, $W_m^{(2)}$, and $\mathbf{b}_m^{(2)}$ are the trainable parameters of the encoding function. The findings of Cybenko (1989) suggest that such an encoding should be capable of mapping different modalities into a common feature space. The common representation $\mathbf{z}_t$ can then be obtained by summing the outputs across the encoders,

$$\mathbf{z}_t = \sum_{m=1}^{M} f_m(\mathbf{x}_t^{(m)}),$$

following previous work by Bollegala and Bao (2018). The decoders have the same form as the encoders, i.e.

$$\hat{\mathbf{x}}_t^{(m)} = g_m(\mathbf{z}_t) = \mathrm{ReLU}\big(V_m^{(2)}\,\mathrm{ReLU}(V_m^{(1)}\mathbf{z}_t + \mathbf{c}_m^{(1)}) + \mathbf{c}_m^{(2)}\big),$$

where $\hat{\mathbf{x}}_t^{(m)}$ is the reconstruction of the data instance sampled at modality $m$ and time $t$, and $V_m^{(1)}$, $\mathbf{c}_m^{(1)}$, $V_m^{(2)}$, and $\mathbf{c}_m^{(2)}$ are the trainable parameters of the decoder. The sum of the mean squared error losses over the autoencoders is used to train the full mapping module.
Figure 2 provides an illustration with an example of two modalities mapped into a common feature space and then reconstructed, based on two autoencoders. After pre-training the autoencoders, the encoders are used directly as the mapping function in the GWN. The pre-trained parameters in the encoders then serve as initial values for the mapping block in the GWN. Though this approach introduces more learnable parameters, the findings of Hinton et al. (2006) suggest that unsupervised pre-training on shallow layers can improve the performance of a deep network.
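The mapping block can be sketched in NumPy as follows; the two-layer encoder form, layer sizes, and random initialisation are illustrative assumptions. Each modality gets its own encoder into a shared $d$-dimensional space; summing the outputs gives the pre-training target, while stacking them feeds the GWN:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

class ModalityEncoder:
    """Two-layer ReLU encoder mapping one modality into a shared
    d-dimensional feature space (a sketch; layer sizes are assumptions)."""
    def __init__(self, d_in, d):
        self.W1 = rng.standard_normal((d_in, d)) * 0.1
        self.b1 = np.zeros(d)
        self.W2 = rng.standard_normal((d, d)) * 0.1
        self.b2 = np.zeros(d)

    def __call__(self, x):
        return relu(relu(x @ self.W1 + self.b1) @ self.W2 + self.b2)

d = 16
enc_mc, enc_emg = ModalityEncoder(78, d), ModalityEncoder(4, d)   # EmoPain dims
x_mc, x_emg = rng.standard_normal(78), rng.standard_normal(4)

# Autoencoder pre-training target: a common representation by summation.
z = enc_mc(x_mc) + enc_emg(x_emg)
# In the GWN itself, the per-modality outputs are stacked instead.
H = np.stack([enc_mc(x_mc), enc_emg(x_emg)])                      # (2, d)
print(z.shape, H.shape)
```

In practice the encoders would be pre-trained with their decoders and a reconstruction loss before being reused as the GWN's mapping block.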
For the subsequent attention module, the output vectors from the modality mappings are merged by stacking, forming a matrix $H_t \in \mathbb{R}^{M \times d}$ whose $m$-th row is $f_m(\mathbf{x}_t^{(m)})$.
3.2 The Attention Module
The attention module is a single layer of the transformer encoder described in Vaswani et al. (2017), with the difference that, in the GWN, the input is the set of modality representations $H_t$ at a specific time $t$, rather than a data sequence over multiple time steps from a single modality. Since the input is already in matrix form, the following multi-head attention calculation can be performed:

$$\mathrm{MultiHead}(H_t) = \mathrm{Concat}(C_t^{1}, \dots, C_t^{h})\, W^{O},$$

where $h$ is the number of heads and $W^{O}$ is a trainable matrix. Each context matrix $C_t^{i}$ for a specific head $i$ is calculated as

$$C_t^{i} = \mathrm{softmax}\!\left(\frac{Q_t^{i} (K_t^{i})^{\top}}{\sqrt{d_k}}\right) V_t^{i}.$$

The query, key, and value matrices of a specific head $i$ at time $t$ are calculated as

$$Q_t^{i} = H_t W_i^{Q}, \qquad K_t^{i} = H_t W_i^{K}, \qquad V_t^{i} = H_t W_i^{V}.$$

Here, the query, key, and value are variations of the input $H_t$, based on the idea of memory-based attention mechanisms (Miller et al., 2016). Note that the trainable matrices $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are reused at different time steps but are independent for different heads $i$.
As shown in Figure 1, there are two residual connections (He et al., 2015) in the attention module, each followed by layer normalisation (Lei Ba et al., 2016). The first residual connection can be represented as

$$A_t = \mathrm{LayerNorm}\big(H_t + \mathrm{MultiHead}(H_t)\big).$$

Here, the assumption of identical dimensionality for the residual connection is satisfied, as $H_t \in \mathbb{R}^{M \times d}$ and $\mathrm{MultiHead}(H_t) \in \mathbb{R}^{M \times d}$. The subsequent feed-forward layer and the final output of the attention module are, respectively,

$$F_t = \mathrm{ReLU}(A_t W_1 + \mathbf{b}_1)\, W_2 + \mathbf{b}_2, \qquad O_t = \mathrm{LayerNorm}(A_t + F_t).$$
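A compact NumPy sketch of one such attention-module pass over the stacked modality matrix follows; the head count, dimensions, and random weights are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_module(H, heads, Wo, W1, b1, W2, b2):
    """One transformer-encoder layer over the (M, d) modality matrix H.
    `heads` is a list of (Wq, Wk, Wv) triples; shapes are assumptions."""
    contexts = []
    for Wq, Wk, Wv in heads:
        Q, K, V = H @ Wq, H @ Wk, H @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[1]))   # modality-vs-modality
        contexts.append(A @ V)
    multi = np.concatenate(contexts, axis=-1) @ Wo   # multi-head projection
    a = layer_norm(H + multi)                        # residual 1 + norm
    f = np.maximum(a @ W1 + b1, 0.0) @ W2 + b2       # position-wise FFN
    return layer_norm(a + f)                         # residual 2 + norm

rng = np.random.default_rng(2)
M, d, h, dk = 2, 16, 4, 8
H = rng.standard_normal((M, d))
heads = [tuple(rng.standard_normal((d, dk)) for _ in range(3)) for _ in range(h)]
Wo = rng.standard_normal((h * dk, d))
W1, b1 = rng.standard_normal((d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)), np.zeros(d)
O = attention_module(H, heads, Wo, W1, b1, W2, b2)
print(O.shape)  # (2, 16)
```

The output keeps the $(M, d)$ shape, so the module can be applied at every time step before the result is passed to the external memory.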
3.3 External Memory
The external memory is implemented as an LSTM cell (Hochreiter and Schmidhuber, 1997) with updates

$$\begin{aligned}
\mathbf{f}_t &= \sigma(W_f \mathbf{o}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\
\mathbf{i}_t &= \sigma(W_i \mathbf{o}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\
\mathbf{g}_t &= \sigma(W_g \mathbf{o}_t + U_g \mathbf{h}_{t-1} + \mathbf{b}_g),\\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(W_c \mathbf{o}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\
\mathbf{h}_t &= \mathbf{g}_t \odot \tanh(\mathbf{c}_t),
\end{aligned}$$

where the input vector $\mathbf{o}_t$ is the flattened form of $O_t$, $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, and $\odot$ denotes the Hadamard (element-wise) product. The recurrent state at time step $t$ consists of the memory cell $\mathbf{c}_t$ and the output $\mathbf{h}_t$ at that time step, with $\mathbf{c}_t, \mathbf{h}_t \in \mathbb{R}^{k}$, where $k$ is a hyperparameter that indicates the size of the external memory. The initial state is set to zeros. $\mathbf{f}_t$, $\mathbf{i}_t$, and $\mathbf{g}_t$ represent the forget, input, and output gates respectively (Hochreiter and Schmidhuber, 1997; Gers et al., 1999). All the gates have the same dimensionality $k$. The output vector of the last recurrent state is used by the final prediction component.
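The external-memory update can be sketched as a standard LSTM cell step; packing the four gate parameter blocks into single matrices is an implementation assumption for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(o_t, h_prev, c_prev, W, U, b):
    """One external-memory update (standard LSTM cell; stacking the four
    gate blocks into 4k-row matrices is an implementation assumption).

    o_t: flattened attention output; h_prev, c_prev: previous state."""
    k = h_prev.shape[0]
    z = W @ o_t + U @ h_prev + b                # (4k,) pre-activations
    f = sigmoid(z[0 * k:1 * k])                 # forget gate
    i = sigmoid(z[1 * k:2 * k])                 # input gate
    g = sigmoid(z[2 * k:3 * k])                 # output gate
    cand = np.tanh(z[3 * k:4 * k])              # candidate cell content
    c = f * c_prev + i * cand                   # Hadamard products
    h = g * np.tanh(c)
    return h, c

rng = np.random.default_rng(3)
k, d_in = 64, 32                                # memory size, input size
W = rng.standard_normal((4 * k, d_in)) * 0.1
U = rng.standard_normal((4 * k, k)) * 0.1
b = np.zeros(4 * k)
h, c = np.zeros(k), np.zeros(k)                 # zero initial state
for t in range(5):                              # accumulate 5 broadcasts
    h, c = lstm_step(rng.standard_normal(d_in), h, c, W, U, b)
print(h.shape)  # (64,)
```

Iterating the step accumulates the broadcast history, and the final `h` is what the prediction module reads.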
The final prediction module consists of a feed-forward layer with one hidden layer activated with a ReLU, followed by a softmax function. The layer serves as a simple non-linear transformation of the external memory state and can be applied at any time step, making it suitable for online prediction with streaming data. The equations are given as

$$\mathbf{y} = \mathrm{softmax}\big(W^{(2)}\,\mathrm{ReLU}(W^{(1)} \mathbf{h}_t + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}\big),$$

where $\mathbf{y}$ is the prediction result mapped into a probability distribution over the classes; both the softmax input and $\mathbf{y}$ have the same dimensionality, the size of the label set.
To evaluate the proposed GWN architecture, we conducted experiments on the multimodal EmoPain dataset (Aung et al., 2016). In Section 4.1, the dataset, data preprocessing, and experiment tasks are introduced. Section 4.2 describes the methods and metrics used for evaluation against a baseline model. Finally, Section 4.3 presents the performance and empirical analyses of the GWN.
4.1.1 The EmoPain Dataset
The EmoPain dataset (Aung et al., 2016) is suitable given that it consists of sequential data from multiple modalities, collected in unconstrained settings where there are bound to be uncertainties in the data (e.g. in the form of sensor noise), in degrees that vary over time. The data was collected from 22 patients with chronic low back pain and 28 healthy control participants, and includes motion capture (MC) and muscle activity data based on surface electromyography (EMG). The data for each participant was acquired while they performed physical exercises that put demands on the lower back. For each exercise, there were two levels of difficulty. The normal trial covered 7 types of exercise: (1) balancing on the preferred leg, (2) sitting still, (3) reaching forward, (4) standing still, (5) sitting to standing and standing to sitting, (6) bending down, and (7) walking. The difficult trial modified four of these exercise types to increase the level of physical demand, i.e. (8) balancing on each leg, (9) reaching forward while standing holding a 2 kg dumbbell, (10) sitting to standing and returning to sitting initiated upon instruction, and (11) walking with a 2 kg weight in each hand, starting by bending down to pick up the weights; exercises (2) and (4) were repeated without modification. The data was acquired in order to build automatic detection models for pain and related cognitive and affective states, and so after each exercise type, patients self-reported the level of pain they experienced on a scale of 0 to 10 (0 for no pain and 10 for extreme pain) (Jensen and Karoly, 1992). In this paper, we used the subset of the EmoPain dataset with the self-reported pain labels available and where consent was given for further use of the data. This subset consists of 14 patients with chronic pain and 8 healthy control participants, giving a total of 200 exercise instances.
4.1.2 Evaluation Experiment Tasks
The proposed GWN architecture was evaluated on two classification tasks based on the multimodal EmoPain dataset:
Pain Level Detection Task
The aim of this task is to detect the pain level of a person with chronic pain. The motivation for creating such a system is to endow technology with the capability to support physical rehabilitation by providing timely feedback or prompts, and personalised recommendations tailored to the pain level of a person with chronic pain. For example, a person with low-level pain may be reminded to take breaks at appropriate times and not overdo it, whilst a person with high-level pain may be reminded to breathe to reduce tension, which may otherwise further increase pain levels (Olugbade et al., 2019).
A formal description of the task is as follows. Given $M$ and $E$, denoting MC and EMG data, for an unseen subject known to have chronic pain (i.e. the event $P = 1$), infer the probability $\Pr(L \mid M, E, P = 1)$ that the data corresponds to one of three levels of pain. A random variable $L \in \{0, 1, 2\}$ represents the level of chronic pain. In this paper, 0 represents zero-level pain (pain self-report $= 0$), 1 represents low-level pain ($0 <$ pain self-report $< 5$), and 2 represents high-level pain (pain self-report $\geq 5$).
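The mapping from a 0-10 self-report to the three task classes can be written directly; the boundary handling follows the thresholds described in the task definition:

```python
def pain_level(self_report):
    """Map a 0-10 pain self-report to the three task classes
    (thresholds as described in the task definition)."""
    if self_report == 0:
        return 0            # zero-level pain
    if self_report < 5:
        return 1            # low-level pain
    return 2                # high-level pain

print([pain_level(s) for s in (0, 3, 5, 9)])  # [0, 1, 2, 2]
```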
Healthy-vs-Patient Discrimination Task
The healthy control participants were assumed to have no pain. However, patients with chronic pain who reported pain as 0 were not considered to be in the same class as these participants. Hence, a separate model may be needed to first distinguish a person with chronic pain from healthy participants.
The formal definition of the task is as follows. Given $M$ and $E$, infer the probability $\Pr(P \mid M, E)$ that the data belongs to a person with chronic pain. A random variable $P \in \{0, 1\}$ represents the event that an unseen subject has chronic pain, with 0 for healthy and 1 for a person with chronic pain.
Figure 3 shows the number of exercise instances for each class, for the Healthy-vs-Patient Discrimination Task and Pain Level Detection Task respectively.
4.1.3 Data Preprocessing
Here, we describe the preprocessing performed to prepare the data for the evaluation experiments.
Dealing with A High Sampling Rate
The EMG data of the EmoPain dataset had been downsampled from 1000 Hz to 60 Hz for consistency with the MC data. However, 60 Hz results in high dimensionality, whereas preliminary experiments suggested that 10 Hz may be sufficient for the Healthy-vs-Patient Discrimination Task. Thus, we downsampled both the MC and EMG data further, to 10 Hz, for the Healthy-vs-Patient Discrimination Task. The original 60 Hz rate was found to be more appropriate for the Pain Level Detection Task.
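As a minimal sketch of the rate reduction (plain frame subsampling; the exact resampling method used on the dataset is not specified here, so this is an assumption):

```python
import numpy as np

def downsample(x, factor):
    """Decimate a (T, d) sequence by keeping every `factor`-th frame
    (simple subsampling; an illustrative assumption, not the paper's
    exact resampling procedure)."""
    return x[::factor]

x_60hz = np.arange(600 * 3, dtype=float).reshape(600, 3)   # 10 s at 60 Hz
x_10hz = downsample(x_60hz, 6)                             # 60 Hz -> 10 Hz
print(x_10hz.shape)  # (100, 3)
```

A low-pass filter before decimation would avoid aliasing; simple subsampling is shown only for clarity.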
Padding for Uniform Sequence Lengths
To obtain uniform sequence lengths across data instances, we used pre-padding rather than post-padding. Further, we used zero padding, which is the common approach in modelling when assuming no prior knowledge about the input data (Shi et al., 2015).
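Zero pre-padding can be sketched as follows (array shapes are illustrative):

```python
import numpy as np

def pre_pad(sequences, length):
    """Zero pre-padding: each (T_i, d) sequence is left-padded with zeros
    to a common length (a sketch of the described preprocessing)."""
    d = sequences[0].shape[1]
    out = np.zeros((len(sequences), length, d))
    for i, s in enumerate(sequences):
        out[i, length - len(s):] = s            # data at the end, zeros first
    return out

seqs = [np.ones((3, 2)), np.ones((5, 2))]
batch = pre_pad(seqs, 5)
print(batch.shape)            # (2, 5, 2)
print(batch[0, :2].sum())     # 0.0 (leading zeros)
```

Pre-padding keeps the informative frames adjacent to the end of the sequence, which is where a recurrent memory reads its final state.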
Dealing with Imbalanced Data
The total number of exercise instances available for training and evaluation was 200, which is a limited amount for training a neural network. To address this, we employed data augmentation, specifically creating new instances from the originals by rotating them. Preliminary experiments showed that rotation about the y-axis, which runs along the cranial-caudal direction, outperforms the mirror-reflection augmentation used in Olugbade et al. (2018). This augmentation used four angles, 0°, 90°, 180°, and 270°, resulting in four times the original data size. For each newly created instance, only the original MC data was changed by the rotation; the original EMG data was used unchanged, as it is not affected by orientation.
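The rotation augmentation can be sketched as follows, rotating joint positions about the vertical y-axis (the coordinate convention with y as the cranial-caudal axis is an assumption):

```python
import numpy as np

def rotate_y(joints, degrees):
    """Rotate (J, 3) joint positions about the vertical (cranial-caudal)
    y-axis; used here to create augmented motion-capture instances."""
    a = np.deg2rad(degrees)
    R = np.array([[np.cos(a), 0.0, np.sin(a)],
                  [0.0,       1.0, 0.0],
                  [-np.sin(a), 0.0, np.cos(a)]])
    return joints @ R.T

joints = np.array([[1.0, 2.0, 0.0]])
augmented = [rotate_y(joints, a) for a in (0, 90, 180, 270)]
print(np.round(augmented[1], 6))  # y component unchanged
```

The corresponding EMG sequences would be copied unchanged into each augmented instance, mirroring the procedure described above.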
4.2 Evaluation Methods
4.2.1 Baseline Model
A simple concatenation (CONCATN) architecture, representative of the traditional multimodal data fusion approach, was used as the baseline network against which we evaluated our GWN architecture. This baseline allows evaluation of the contribution of the GWN’s mapping and attention components to its performance. The CONCATN has external memory and prediction units identical to those of the GWN. Hence, it can be seen as a network that does not pay particular attention to different modalities over time, but rather treats them equally through time.
In the CONCATN, the modalities are concatenated along the feature axis and fed into an LSTM network. The network follows the standard LSTM updates with input

$$\mathbf{x}_t = \big[\mathbf{x}_t^{(1)}; \dots; \mathbf{x}_t^{(M)}\big],$$

where $M$ is the number of modalities, $\mathbf{c}_t$ is the memory cell, and $\mathbf{h}_t$ is the hidden state; the initial states $\mathbf{c}_0$ and $\mathbf{h}_0$ are set to zeros. Assuming the dimensionality of modality $m$ at a specific time is $d_m$, the dimensionality of the concatenated vector is $\sum_{m=1}^{M} d_m$. The dimensionalities of $\mathbf{c}_t$ and $\mathbf{h}_t$ have the same values as in the GWN model. The prediction module is also identical to that of the GWN, i.e. the last LSTM output is fed into a feed-forward network with one hidden layer activated with a ReLU (Nair and Hinton, 2010) non-linearity.
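The baseline's fusion step amounts to a feature-axis concatenation at every time step; the dimensions below follow the EmoPain modalities, and the data is random placeholder input:

```python
import numpy as np

# CONCATN fusion: modalities are joined along the feature axis at every
# time step, so the LSTM sees one fixed (T, sum of d_m) stream and cannot
# re-weight modalities over time.
rng = np.random.default_rng(5)
T = 50
mc = rng.standard_normal((T, 78))      # motion-capture features
emg = rng.standard_normal((T, 4))      # EMG features
fused = np.concatenate([mc, emg], axis=1)
print(fused.shape)  # (50, 82)
```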
4.2.2 Validation Technique
In the experiments carried out, we used leave-one-subject-out cross-validation (LOSOCV), where the data of a single subject is held out for testing in each fold, as is the standard approach for evaluating the generalisation capability of a model to unseen subjects. However, for statistical tests comparing the proposed GWN with the baseline CONCATN, LOSOCV has the limitation of overlapping training sets across folds, which carries a risk of high Type I error (Dietterich, 1998). Thus, in this work, we additionally performed 5×2 CV (i.e. 5 random replications of 2-fold CV), which has a lower risk of Type I errors (Dietterich, 1998), for the purpose of model comparison. The advantage of the 2-fold CV is that there is no overlap between training sets.
For both validation schemes, we performed the Wilcoxon signed-rank test (Wilcoxon, 1945) to compare the proposed GWN and the baseline CONCATN.
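A sketch of the fold-wise comparison using SciPy's Wilcoxon signed-rank test; the per-fold scores below are hypothetical placeholders, not the paper's results:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold F1 scores over the 10 folds of a 5 x 2 CV;
# the real values would come from the actual experiments.
gwn_scores = np.array([0.91, 0.89, 0.92, 0.90, 0.88,
                       0.93, 0.90, 0.91, 0.89, 0.92])
concatn_scores = np.array([0.84, 0.86, 0.85, 0.83, 0.87,
                           0.85, 0.84, 0.86, 0.82, 0.85])

# Paired, non-parametric test over matched folds.
stat, p = wilcoxon(gwn_scores, concatn_scores)
print(p < 0.05)
```

Because the two models are evaluated on identical folds, the paired test is the appropriate choice over an unpaired alternative.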
4.3 Results and Discussion
4.3.1 Comparison with the Baseline
Both the GWN and the CONCATN baseline model were trained with the same optimisation algorithm (Adam (Kingma and Ba, 2014)), learning rate (0.001), and batch size (32), chosen by grid search. The dimensionality of the LSTM cell, the hyperparameter shared by the two models, was also kept the same, at 64. Table 1 shows the performance of the GWN in comparison with the CONCATN baseline model, based on accuracy (ACC), Matthews Correlation Coefficient (MCC) (Matthews, 1975), and F1 scores.
Our results show that the GWN significantly outperforms the baseline for the Healthy-vs-Patient Discrimination Task (significance level = 0.05), with an F1 score of 0.913 based on LOSOCV, averaged over the two classes. The effect size is 0.768 for the 5×2 CV and 0.628 for the LOSOCV. As expected, due to the smaller training set size, the 5×2 CV gives lower performance estimates than the LOSOCV for both the baseline CONCATN and the GWN. Although only marginally significant in this case, the GWN also outperforms the baseline CONCATN in the Pain Level Detection Task, with an effect size of 0.596 for the 5×2 CV.
4.3.2 Attention Patterns
An additional advantage of the proposed GWN model is that the attention patterns obtained in modelling can provide insight into the relevance of each modality through time. In our experiments, we found 5 attention patterns (see Figure 4 for further specification of each pattern):
- Favours-Itself-Always (FIA)
The given modality always pays attention to itself and never switches attention to the other modality.
- Favours-Other-Sometimes (FOS)
The given modality mostly pays attention to itself but sometimes switches its attention to the other modality.
- Favours-Itself-and-Other-in-Balance (FIOB)
The given modality pays balanced attention to itself and the other modality.
- Favours-Itself-Sometimes (FIS)
The given modality mostly pays attention to the other modality but sometimes switches attention to itself.
- Favours-Other-Always (FOA)
The given modality always pays attention to the other modality and never to itself.
Figure 5 gives an example of the FOS pattern. In this case, modality 1 (EMG) pays attention to itself most of the time (98.54%), with a few switches (6 times) to modality 0 (MC).
The frequency of occurrence of each of the five attention cases is shown in Table 2 (row 3). It can be seen that the MC tends to pay attention either only to itself or mostly to the EMG (higher FIA and FOA frequencies), whereas the EMG balances its attention (higher FOS, FIOB, and FIS frequencies). One possible explanation is that, since the dimensionality of the EMG data (4) is much lower than that of the MC data (78), the EMG is always trying to compensate for the difference in information. In contrast, the MC modality is rich in information and so can afford to pay full attention to itself.
4.3.3 Evaluating How The GWN Deals with Uncertainty in Data
In order to further examine the behaviour of the GWN model with respect to uncertainties in the data, noise was added to one modality at a time. We experimented with different levels of noise. We expected that if the GWN manages uncertainty in data, the modality without added noise would pay less attention to the noisy modality.
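The noise manipulation can be sketched as additive Gaussian noise applied to a single modality (the noise model and scale are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

def add_noise(x, scale=0.5):
    """Corrupt one modality with additive Gaussian noise scaled to the
    signal's standard deviation (the noise model here is an assumption)."""
    return x + rng.standard_normal(x.shape) * x.std() * scale

mc = rng.standard_normal((100, 78))     # motion-capture stream
emg = rng.standard_normal((100, 4))     # EMG stream
noisy_mc = add_noise(mc)                # perturb only one modality at a time
print(noisy_mc.shape)
```

Repeating the experiment with `add_noise(emg)` instead gives the complementary condition, with the MC stream left clean.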
Table 3 presents the results of adding noise. A Wilcoxon signed-rank test showed no significant difference (significance level = 0.05) between the accuracy of the GWN model with and without noise in the MC data, based on the LOSOCV, or with and without noise in the EMG data, also based on the LOSOCV. This suggests that the proposed GWN may be tolerant to this level of noise.
Table 2 shows the GWN’s behaviour with the noisy input (row 4 for noisy MC and row 5 for noisy EMG), separated based on the detected attention patterns. Compared with frequencies of the 5 attention cases without added noise, with the noisy MC data, the frequency of FIA for the MC decreases while its frequencies of FOS, FIS, and FOA increase. This indicates that the MC modality is able to recognise noise in itself and rely more on the other modality (EMG). This is also evident in the increase in mean switch frequency.
In contrast, a noisy EMG (see row 5 in Table 2) does not result in the same behaviour. Compared with the frequencies of the 5 attention cases without added noise (see row 3), the frequency of the EMG’s FIA with noisy EMG unexpectedly increases, and the frequencies of FOS and FIS do not increase. Only the FOA frequency shows the expected, albeit slight, increase. In addition, the mean switch frequency shows no increment. These results suggest that the EMG modality is less sensitive to its own noisiness. One explanation is that the amount of noise added to the EMG is not sufficient to influence its feature representation. Another possible reason is that the system is sensitive to the amount of information lost per modality: since the dimensionalities of MC and EMG differ (78 and 4 respectively), noise added to the MC corrupts more information than noise added to the EMG, making the model more sensitive in the former case.
Here we proposed the GWN, a novel neural network architecture for multimodal fusion in temporal data. At each time step, multiple modalities compete for broadcasting information, and each broadcast is accumulated over time. We find that the GWN outperforms baseline multimodal fusion by concatenation, for pain level detection based on the EmoPain dataset. Our analysis further highlights the selectivity of the different modalities in this dataset. Moreover, modality-specific noise manipulations revealed the ability of GWN to deal with changes in uncertainty over time. We believe that our system presents a promising direction for future research in multimodal neural networks, while promoting a close connection with cognitive neuroscience. Such interdisciplinary links can be fruitful for both communities and help to propel each other forward.
- Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems 106, pp. 41–56.
- Multi-level multimodal common semantic space for image-phrase grounding. CoRR abs/1811.11683.
- Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16 (6), pp. 345–379.
- The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal EmoPain dataset. IEEE Transactions on Affective Computing 7 (4), pp. 435–451.
- A cognitive theory of consciousness. Cambridge University Press, Cambridge, MA.
- In the theater of consciousness. Oxford University Press, New York, NY.
- The conscious access hypothesis: origins and recent evidence. Trends in Cognitive Sciences 6 (1), pp. 47–52.
- Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
- Multimodal machine learning: a survey and taxonomy. CoRR abs/1705.09406.
- Multi-modal sequence fusion via recursive attention for emotion recognition. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, pp. 251–259.
- Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1650–1661.
- Multimodal attention for neural machine translation. CoRR abs/1609.03976.
- Multimodal fusion of brain imaging data: a key to finding the missing link(s) in complex mental illness. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging 1 (3), pp. 230–244.
- Long short-term memory-networks for machine reading. CoRR abs/1601.06733.
- On the properties of neural machine translation: encoder-decoder approaches. CoRR abs/1409.1259.
- Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, pp. 303–314.
- Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10 (7), pp. 1895–1923.
- Effects of padding on LSTMs and CNNs. arXiv abs/1903.07288.
- A neuronal global workspace for human-like control of a computer game character. In 2011 IEEE Conference on Computational Intelligence and Games (CIG'11), pp. 350–357.
- Learning to forget: continual prediction with LSTM. In 1999 Ninth International Conference on Artificial Neural Networks (ICANN 99), Vol. 2, pp. 850–855.
- Deep residual learning for image recognition. CoRR abs/1512.03385.
- A fast learning algorithm for deep belief nets. Neural Computation 18 (7), pp. 1527–1554.
- Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Attention-based multimodal fusion for video description. CoRR abs/1701.03126.
- An introductory survey on attention mechanisms in NLP problems. CoRR abs/1811.05544.
- Self-report scales and procedures for assessing pain in adults. Handbook of Pain Assessment, pp. 135–151.
- Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
- Handling imbalanced datasets: a review. GESTS International Transactions on Computer Science and Engineering 30, pp. 25–36.
- A method for judicious fusion of inconsistent multiple sensor data. IEEE Sensors Journal 7 (5), pp. 723–733.
- Multimodal data fusion: an overview of methods, challenges, and prospects. Proceedings of the IEEE 103 (9), pp. 1449–1477.
- Layer normalization. arXiv abs/1607.06450.
- New algorithm for integration between wireless microwave sensor network and radar for improved rainfall measurement and mapping. Atmospheric Measurement Techniques 7 (10), pp. 3549–3563.
- A critical review of recurrent neural networks for sequence learning. CoRR abs/1506.00019.
- Weakly paired multimodal fusion for object recognition. IEEE Transactions on Automation Science and Engineering 15 (2), pp. 784–795.
- Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (2), pp. 442–451.
- Key-value memory networks for directly reading documents. CoRR abs/1606.03126. External Links: Cited by: §1, §3.2.
Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, USA, pp. 807–814. External Links: Cited by: §3.1, §4.2.1.
- Automatic detection of reflective thinking in mathematical problem solving based on unconstrained bodily exploration. CoRR abs/1812.07941. External Links: Cited by: §4.1.3.
- How can affect be detected and represented in technological support for physical rehabilitation?. ACM Trans. Comput.-Hum. Interact. 26 (1), pp. 1:1–1:29. External Links: Cited by: §4.1.2.
- A deep reinforced model for abstractive summarization. CoRR abs/1705.04304. External Links: Cited by: §1.
- A cognitive architecture that combines internal simulation with a global workspace. Consciousness and Cognition 15 (2), pp. 433–449. External Links: Cited by: §1.
- A spiking neuron model of cortical broadcast and competition. Consciousness and Cognition 17 (1), pp. 288 – 303. External Links: Cited by: §1.
- Convolutional LSTM network: A machine learning approach for precipitation nowcasting. CoRR abs/1506.04214. External Links: Cited by: §4.1.3.
- Optimal weight learning for coupled tensor factorization with mixed divergences. In 21st European Signal Processing Conference (EUSIPCO 2013), Vol. , pp. 1–5. External Links: Cited by: §1, §2.
- Weakly supervised memory networks. CoRR abs/1503.08895. External Links: Cited by: §1.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. External Links: Cited by: §1, §3.2.
Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, pp. 1096–1103. External Links: Cited by: §3.1.
- Automatic detection of protective behavior in chronic pain physical rehabilitation: A recurrent neural network approach. CoRR abs/1902.08990. External Links: Cited by: §4.1.3.
- Memory networks. CoRR abs/1410.3916. Cited by: §1.
- Individual comparisons by ranking methods. Biometrics Bulletin 1 (6), pp. 80–83. External Links: Cited by: §4.2.2.
- Simultaneous analysis of coupled data matrices subject to different amounts of noise. The British journal of mathematical and statistical psychology 64, pp. 277–90. External Links: Cited by: §1, §2.