1 Introduction
In general terms, a modality refers to the way in which something happens or is experienced [2]. Many existing works have demonstrated that neural networks can achieve excellent results in single-modality processing such as image classification [23] and speech synthesis [13, 26]. In the field of data, "multimodal" denotes different forms of data, or different formats of the same form, and generally covers text, images, audio and video [11, 22]. Multimodal data processing has therefore attracted wide attention from academia, especially multimodal fusion, one of the original topics in multimodal machine learning [2]. Neural networks are expected to tackle the multimodal fusion problem [18] and, since the earliest investigations of AVSR [20], have been used extensively to fuse information for text, image and audio [14, 15], gesture recognition [17], and video or image description generation [21, 27]. However, almost all of these studies focus on text, images or speech rather than multimodal timeseries, which are a critical ingredient across many domains, so how to process multimodal data effectively still needs further study. Many methods have been proposed for simple single-mode timeseries data [1, 29] and achieve the best results in their respective fields, but they cannot directly use multimodal timeseries data such as multisensor data or medical timeseries. Feature fusion is particularly challenging for multimodal irregular timeseries [4]. For example, in clinical data, a patient's electronic health records can be abstracted into thousands of interrelated medical events with temporal information, including complex allergy history, family genetic history, drug lists, hospitalization records and other historical records. Different events are recorded at very different frequencies: a patient may be hospitalized only once in several years, but take medication many times a year. Moreover, events of the same type can differ significantly in nature; for example, drug-taking events have attributes such as drug type and dose, while test events include specific indicators and comparisons with normal-range values. In order to fuse the features of these events, we must describe these dependencies.
To solve the above problems in multimodal irregular timeseries events, this paper makes the following contributions: (1) We propose a new feature fusion method for multimodal data, in which the features of complex data are fused into a common feature subspace; the method can be applied to different kinds of multimodal data. (2) We explore different encoding methods for temporal features and find a method to embed the temporal features into the nontemporal features, which allows us to better deal with timeseries data. (3) We propose a model called FGLSTM, developed from recurrent architectures such as Phased LSTM [16], to deal with irregular timeseries data; it filters the input features through a feature gate while recording the complex temporal relationships between different features. (4) We compare with other models, and experimental results on real data demonstrate that the prediction performance of our model is significantly improved.
2 Related Work
2.1 Multimodal Fusion problem
Multimodal fusion refers to the comprehensive processing of multimodal data by computer, which fuses the information of each mode to perform target prediction [22]. Tensor Fusion Network (TFN) [28] fuses the feature vectors of three modes (such as text, image and audio) directly through matrix operations. However, since TFN computes the correlations between elements of different modes through a tensor outer product, it greatly increases the dimension of the feature tensor and results in a model too large to train easily. Low-rank Multimodal Fusion [14] decomposes the weight with a low-rank matrix, turning the TFN process into a single linear transformation of each mode; the resulting multidimensional point can be regarded as the sum of multiple low-rank vectors, which reduces the number of parameters in the model. Although Low-rank Multimodal Fusion is an upgrade of TFN, its parameters can still explode once the features become too long. Multimodal Adversarial Representation Network [10] adds a dual-discriminator adversarial network on top of ordinary attention fusion, capturing dynamic commonness and invariance respectively. Multimodal Bottleneck Transformer [15] shares tokens between two Transformers, so that these tokens become a communication bottleneck between modes and save attention computation; multimodal interaction is thus limited to a few shared tokens. Compared with the above research, we pay more attention to multimodal timeseries events, and the settings above can also be regarded as special cases of multimodal timeseries events.

2.2 Timeseries Forecasting
A recurrent neural network (RNN) is a neural network for processing sequence data. In theory, an RNN can store long-term memory and update its previous state with the current input at any time, but in practice this is very difficult: RNNs struggle to solve the problem of long-term dependence [5]. LSTM [6] is a special RNN designed mainly to solve the vanishing- and exploding-gradient problems in long-sequence training. Compared with an ordinary RNN, LSTM performs better on long sequences, but it can only maintain long-term dependencies within about 50 time steps. Phased LSTM [16] addresses LSTM's inability to process irregular input sequences: by integrating different sampling frequencies or irregularly sampled data through a phase gate, it can remember signals with different periods, and its state can propagate for a long time. When the processed sequence reaches thousands of steps, LSTM is almost unusable while Phased LSTM still performs well; however, Phased LSTM is not suited to modeling complex event sequences with thousands of event types. HELSTM [12] was proposed to handle heterogeneous temporal events with long-term dependence, but it can only extract event types, and the feature relationships of events cannot be obtained. The Transformer [26] is a powerful architecture that achieves excellent performance on a variety of sequential learning tasks; it performs no recursion over the sequence but processes the whole sequence simultaneously in a feedforward model. Recent research shows that the Transformer has the potential to improve prediction ability [24]. However, it has serious problems that prevent it from being applied directly to multimodal irregular timeseries data, such as quadratic time complexity, high memory utilization and the inherent limitations of the encoder-decoder architecture [29]. Beyond these problems, the Transformer's biggest weakness is that the model contains no recurrence and no convolution, so the input tensor cannot effectively carry the temporal relationships of the input sequence [3].
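The periodic phase gate of Phased LSTM mentioned above can be sketched as follows. This is a simplified PyTorch version for illustration: the function name and default parameter values are our assumptions, not taken from the original implementation.

```python
import torch

def phased_time_gate(t, tau=100.0, shift=0.0, r_on=0.05, alpha=1e-3):
    """Simplified Phased LSTM time gate [16]: each timestamp is mapped to a
    phase within a period tau; the gate opens linearly, closes linearly,
    and otherwise passes only a small leak."""
    phi = torch.remainder(t - shift, tau) / tau      # phase in [0, 1)
    up = 2.0 * phi / r_on                            # first half of open phase
    down = 2.0 - 2.0 * phi / r_on                    # second half of open phase
    leak = alpha * phi                               # closed phase: small leak
    return torch.where(phi < 0.5 * r_on, up,
                       torch.where(phi < r_on, down, leak))
```

Because the gate is nearly closed most of the period, different neurons with different periods sample the irregular input at different rates, which is what lets the state propagate over very long sequences.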
3 Methodology
For a timeseries S in a given scene, the features of the sequence consist of n dynamic events, and each event occurs at the same or a different time. We arrange the events in chronological order; among events that occur at the same time, the ones recorded earlier are placed first. Each timeseries S corresponds to a discrete label, which indicates the state of the object at a certain time in the future. For example, in the clinical endpoint prediction task, 0 or 1 indicates the patient's status (death or not) at a certain future point in time. Predicting this discrete label for a timeseries S is therefore defined as the classification of S.
3.1 Features fusion
Each timeseries S contains many types of events, and each event has its own time of occurrence. We use 𝒮 to denote the feature space where S lies, 𝒱 the space of nontemporal information (that is, the feature space of the events), and 𝒯 the space of temporal information. Formally, a sequence S = {s_1, …, s_n} defines each element as s_i = (v_i, t_i), with v_i the nontemporal features at step i and t_i the temporal feature, i.e., the interval between the occurrence time of this event and that of the first event of the timeseries. The feature vectors are defined over the joint space 𝒮 = 𝒱 × 𝒯, and the resulting permutation-invariant set is {s_1, …, s_n}. For each event we define v = (k, a), where k is the type of the event, with feature space 𝒦, and a is the attribute of the event, with feature space 𝒜. The feature space of an event, 𝒱 = 𝒦 × 𝒜, is obviously a joint space, where 𝒦 is the discrete feature space and 𝒜 is the continuous feature space. Similarly, an attribute consists of two parts, a = (a_k, a_v), where a_k is the type of the attribute and a_v is its specific value. The attribute feature vectors are defined over the joint space 𝒜 = 𝒜_K × 𝒜_V, where 𝒜_K is the discrete feature space of a_k and 𝒜_V the continuous feature space of a_v.
Each type of event can contain multiple types of attributes, and different types of events may share attribute types or have entirely different ones. It is therefore difficult to find the feature space of events directly, and we need to characterize the complex relationships between different events. Our nontemporal feature fusion method, shown in Fig. 1 (where d is the encoded dimension), proceeds as follows: (1) select the first three attributes of the event, filling missing ones with 0, encode a_k and a_v as e_k and e_v respectively, and combine them to obtain a new three-channel feature; (2) use a convolution kernel to increase the dimension of the feature obtained in the previous step, then, after the activation function, use a convolution kernel to reduce it back to a one-dimensional feature; (3) stack the feature obtained in the previous step with the encoded event type, use a convolution kernel to increase the dimension, and after the activation function use a convolution kernel to reduce it, obtaining the one-dimensional nontemporal feature. For a_v from the continuous feature space, we do not simply encode it with convolution or fully connected layers; instead we encode it with the help of a_k from the discrete feature space, as shown in the formula:
e_v = W_k · a_v + b_k   (1)

where W_k and b_k are the tensors obtained by embedding a_k, and e_v is the result of encoding a_v.
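The three fusion steps above can be sketched in PyTorch as follows. The module name, embedding sizes and channel widths are illustrative assumptions (the paper does not publish exact dimensions), and gating a value projection with the attribute-type embedding is one plausible reading of Eq. (1):

```python
import torch
import torch.nn as nn

class NonTemporalFusion(nn.Module):
    """Sketch of the two-stage 1x1-convolution fusion of event type and
    attributes; all sizes and names here are assumptions."""
    def __init__(self, d=256, n_event_types=1000, n_attr_types=100, width=4):
        super().__init__()
        self.event_emb = nn.Embedding(n_event_types, d)   # event type k
        self.attr_emb = nn.Embedding(n_attr_types, d)     # attribute type a_k
        self.value_proj = nn.Linear(1, d)                 # attribute value a_v
        # step (2): lift the 3 stacked attribute channels, squeeze back to 1
        self.attr_up = nn.Conv1d(3, width, kernel_size=1)
        self.attr_down = nn.Conv1d(width, 1, kernel_size=1)
        # step (3): fuse the attribute feature with the event-type embedding
        self.evt_up = nn.Conv1d(2, width, kernel_size=1)
        self.evt_down = nn.Conv1d(width, 1, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, event_type, attr_type, attr_value):
        # event_type: (B,) long; attr_type: (B, 3) long, zero-padded;
        # attr_value: (B, 3) float, zero-padded
        e_k = self.attr_emb(attr_type)                          # (B, 3, d)
        # one reading of Eq. (1): encode a_v guided by the a_k embedding
        attr = e_k * self.value_proj(attr_value.unsqueeze(-1))  # (B, 3, d)
        attr = self.attr_down(self.act(self.attr_up(attr)))     # (B, 1, d)
        evt = self.event_emb(event_type).unsqueeze(1)           # (B, 1, d)
        fused = torch.cat([attr, evt], dim=1)                   # (B, 2, d)
        return self.evt_down(self.act(self.evt_up(fused))).squeeze(1)  # (B, d)
```

The 1×1 convolutions mix only across the stacked channels, so the per-dimension structure of the d-dimensional embeddings is preserved while the attribute and event-type information is blended.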
For the fusion of temporal and nontemporal features, many studies directly adopt addition, as in the famous Transformer architecture [26]. However, the fusion of temporal and nontemporal features is not a simple additive relationship, so we propose the method shown in Fig. 3: first stack the temporal and nontemporal features, then increase the dimension of the stacked features with a convolution structure; after the activation function, we fuse the features into a one-dimensional tensor with another convolution structure.
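This stack-and-convolve fusion can be sketched as follows; the channel width and module name are assumptions:

```python
import torch
import torch.nn as nn

class ConvAddFusion(nn.Module):
    """Sketch: fuse a time encoding and a nontemporal feature by stacking
    them as two channels and mixing with 1x1 convolutions."""
    def __init__(self, width=4):
        super().__init__()
        self.up = nn.Conv1d(2, width, kernel_size=1)    # raise dimension
        self.down = nn.Conv1d(width, 1, kernel_size=1)  # fuse back to 1 channel
        self.act = nn.ReLU()

    def forward(self, x, te):
        # x, te: (B, d) nontemporal feature and time encoding of one event
        stacked = torch.stack([x, te], dim=1)           # (B, 2, d)
        return self.down(self.act(self.up(stacked))).squeeze(1)  # (B, d)
```

Unlike plain addition, the learned channel mixing lets the model weigh temporal against nontemporal information differently in each embedding dimension.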
Because the time intervals between events are unequal, and the occurrence time of each event is a very important feature that cannot be ignored, we add a "time encoding" to the input embeddings and use two methods to encode time:
3.1.1 Function Encoding
We use sine and cosine functions of different frequencies just as “positional encoding” [26]:
TE(t, 2i) = sin(t / 10000^{2i/d})   (2)
TE(t, 2i+1) = cos(t / 10000^{2i/d})   (3)
where i is the dimension index. That is, each dimension of the time encoding corresponds to a sinusoid. We chose this function because, for any fixed time offset Δ, TE(t + Δ) can be represented as a linear function of TE(t). The time encoding has the same dimension d as the embeddings, so the two can be summed.
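A sketch of this encoding in PyTorch, differing from positional encoding only in that the elapsed time t indexes the sinusoids instead of the sequence position (function name and tensor layout are ours):

```python
import math
import torch

def time_encoding(t, d_model=256):
    """Sinusoidal time encoding in the style of Eqs. (2)-(3):
    even dimensions hold sines, odd dimensions hold cosines."""
    # t: (B, L) float tensor of elapsed times since the first event
    i = torch.arange(0, d_model, 2, dtype=torch.float32)   # even dims
    freq = torch.exp(-math.log(10000.0) * i / d_model)     # 1 / 10000^(2i/d)
    ang = t.unsqueeze(-1) * freq                           # (B, L, d/2)
    te = torch.zeros(*t.shape, d_model)
    te[..., 0::2] = torch.sin(ang)
    te[..., 1::2] = torch.cos(ang)
    return te                                              # (B, L, d_model)
```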
3.1.2 Convolution Encoding
We use the convolution structure to learn time encoding:
T_1 = φ(t W_1)   (4)
TE = φ(T_1 W_2)   (5)
For a temporal feature of length L and dimension 1, that is, of size L × 1, a 1 × 1 convolution kernel learns the matrix W_1 of size 1 × k and turns the temporal feature into a matrix of size L × k. After the activation function φ, another 1 × 1 convolution kernel learns the matrix W_2 of size k × d and turns it into a matrix of size L × d; the temporal encoding of size L × d is finally obtained after the activation function.
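A minimal sketch of the learned encoding, assuming ReLU as the activation φ and an arbitrary intermediate width k (neither is specified in the text):

```python
import torch
import torch.nn as nn

class ConvTimeEncoding(nn.Module):
    """Learned time encoding in the style of Eqs. (4)-(5): two 1x1
    convolutions lift the scalar elapsed time to width k, then to d_model."""
    def __init__(self, d_model=256, k=32):
        super().__init__()
        self.conv1 = nn.Conv1d(1, k, kernel_size=1)        # learns W_1 (1 x k)
        self.conv2 = nn.Conv1d(k, d_model, kernel_size=1)  # learns W_2 (k x d)
        self.act = nn.ReLU()

    def forward(self, t):
        # t: (B, L) elapsed times -> (B, L, d_model)
        h = self.act(self.conv1(t.unsqueeze(1)))           # (B, k, L)
        return self.act(self.conv2(h)).transpose(1, 2)     # (B, L, d_model)
```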
3.2 Model Architecture
Long short-term memory (LSTM) [6] is an important ingredient of modern deep RNN architectures. FGLSTM extends the LSTM model by adding a new feature gate k_t, as shown in Fig. 3. Here x_t is the input feature at time t, and the other components are basically consistent with an ordinary LSTM. The feature gate has two factors: a feature filter and a time gate.
The combination of the feature filter and the time gate only allows the features of certain kinds of events into a neuron, and opens the neuron only during a specific cycle. This ensures that each neuron captures and samples only the features of specific types of events, which alleviates the poor training caused by the complexity and diversity of timing in long event sequences.
The opening and closing of the feature gate are controlled by the features and the time; updates to the cell state c_t and hidden state h_t are permitted only when the gate is open. We propose a particularly successful formulation of the feature gate as follows:
k_t = σ(V tanh(W x_t + b)) ⊙ τ_t   (6)

where W, V and b are the parameters to be learned (n is the hidden size and m the output size), tanh and σ are the activation functions, x_t is the tensor input at time t, and τ_t is the time gate [16].
Compared with traditional RNNs and other excellent RNN variants [9], FGLSTM can choose to update its learned parameters at irregularly sampled time points. This allows FGLSTM to work with asynchronously sampled irregular timeseries data. We can then rewrite the regular LSTM cell update equations for c_t and h_t, using the proposed cell updates c̃_t and h̃_t mediated by the feature gate k_t:
i_t = σ(W_{xi} x_t + W_{hi} h_{t−1} + b_i)   (7)
f_t = σ(W_{xf} x_t + W_{hf} h_{t−1} + b_f)   (8)
c̃_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t−1} + b_c)   (9)
c_t = k_t ⊙ c̃_t + (1 − k_t) ⊙ c_{t−1}   (10)
o_t = σ(W_{xo} x_t + W_{ho} h_{t−1} + b_o)   (11)
h̃_t = o_t ⊙ tanh(c̃_t)   (12)
h_t = k_t ⊙ h̃_t + (1 − k_t) ⊙ h_{t−1}   (13)
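One FGLSTM step can be sketched as a standard LSTM cell whose state updates are leak-gated by a feature gate. The feature-filter form and the hard open/closed time gate below are simplified assumptions in the spirit of Phased LSTM [16], not the exact formulation:

```python
import torch
import torch.nn as nn

class FGLSTMCell(nn.Module):
    """Sketch of an FGLSTM step: the proposed LSTM updates are mixed with the
    previous state through a feature gate k_t (feature filter x time gate)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTMCell(input_size, hidden_size)      # proposed updates
        self.feature_filter = nn.Linear(input_size, hidden_size)

    def time_gate(self, t, tau=100.0, r_on=0.1):
        # crude open/closed phase gate (assumed); t: (B, 1) timestamps
        phase = torch.remainder(t, tau) / tau
        return (phase < r_on).float()

    def forward(self, x, t, state):
        h, c = state
        h_hat, c_hat = self.lstm(x, (h, c))                   # like Eqs. (7)-(9), (11)-(12)
        # feature gate: per-unit feature filter modulated by the time gate
        k = torch.sigmoid(self.feature_filter(x)) * self.time_gate(t)
        c_new = k * c_hat + (1 - k) * c                       # like Eq. (10)
        h_new = k * h_hat + (1 - k) * h                       # like Eq. (13)
        return h_new, c_new
```

When the gate is closed, the cell keeps its previous state untouched, which is exactly the "perfect memory" property discussed below.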
To sum up, a neuron is updated only when the input satisfies the type condition of its feature gate and falls within its sampling period. The neuron can therefore be considered to represent the state of a certain type of features in a certain sampling period, because the feature gate k_t can be seen as a binary classifier that chooses the cluster of feature types each neuron is responsible for. In addition, neurons do not update any information while the gate is closed and maintain a perfect memory of past information, i.e. c_t = c_{t−1} for the steps at which the gate is closed. Other neurons that track other features can thus use this information directly, even when the events are far apart in the sequence. Because of this special mechanism, FGLSTM has a much more diverse and longer memory for modeling the dependencies among multiple features. We use a Softmax layer, consisting of two linear transformations with an activation φ in between, to predict the label ŷ from the learned feature tensor z of the sequence at the given decision times:

ŷ = Softmax(W_2 φ(W_1 z + b_1) + b_2)   (14)
We use cross-entropy to calculate the classification loss between the prediction and the true label of each sample:

L = − Σ_c y_c log ŷ_c   (15)
We sum the losses of all samples in one minibatch to obtain the total loss for backpropagation.
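A minimal sketch of the head and loss of Eqs. (14) and (15); the layer sizes and ReLU activation are assumptions, and PyTorch's `CrossEntropyLoss` applies the softmax internally:

```python
import torch
import torch.nn as nn

# two linear transformations with an activation in between (sizes assumed)
head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()     # log-softmax + NLL, averaged over batch

z = torch.randn(8, 256)               # learned sequence representations
labels = torch.randint(0, 2, (8,))    # binary endpoint labels
loss = criterion(head(z), labels)
loss.backward()                       # backpropagate the minibatch loss
```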
4 Experiments
Table 1: Data distribution of the dataset
Dataset          Target 0   Target 1   Total
training set     475291     59328      534619
validation set   61698      6540       68238
evaluation set   143373     19622      162995
The dataset used in this experiment is generated from the intensive care unit medical records (MIMIC-III) of Beth Israel Deaconess Medical Center in the United States [7]. More than 20,000 patient samples were extracted from MIMIC-III, covering more than 4,000 kinds of events and more than 20 million multimodal irregular timeseries records in total. The dataset is divided into training, validation and evaluation sets with a ratio of approximately 7 : 1 : 2. Table 1 shows the data distribution of the dataset, which is divided into two classes. All experiments were implemented in PyTorch [19] and optimized with the Adam algorithm [8], with a learning rate of 0.0001 and the other optimizer parameters left at their defaults. We set the random seed to 1 to ensure the repeatability of the experimental results. Unless otherwise specified, d (the dimension after feature encoding) is 256 and the batch size is 128; the detailed parameter settings of the individual experiments are described below. All experiments were conducted on a single Nvidia RTX 3090 GPU (24 GB memory), which is sufficient for all the baselines.

4.1 Evaluating Metrics
AUC (the area under the receiver operating characteristic curve) and AP (average precision) [25] are used in this paper. AUC is the area between the ROC curve and the x-axis, and AP is the area between the precision-recall curve (PRC) and the x-axis; both are robust to imbalance between positive and negative samples.
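Both metrics are standard in scikit-learn; a toy sketch, assuming that library is available (the scores below are for illustrative data, not from the paper):

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# toy predictions: y_score is the predicted probability of Target 1
y_true = [0, 0, 0, 1, 0, 1]
y_score = [0.1, 0.3, 0.2, 0.8, 0.4, 0.6]

auc = roc_auc_score(y_true, y_score)            # area under ROC curve
ap = average_precision_score(y_true, y_score)   # area under PR curve
```

Because both metrics rank predictions rather than threshold them, the heavy class imbalance in Table 1 (roughly 11% positives) does not distort them the way accuracy would be distorted.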
4.2 Comparing Methods
Because the proposed FGLSTM is a variant of the classical LSTM [6], we choose the classical LSTM and three other excellent variants: BiLSTM, Phased LSTM [16] and HELSTM [12]. Recently, the Transformer architecture has achieved the best performance on many problems, so we also examine the ability of Transformer-related architectures to deal with multimodal irregular timeseries. We choose the vanilla Transformer [26] and one of its excellent variants, Informer [29]. Because our experiments do not involve generation, only the encoder part of the Transformer architecture is used, and the final output is obtained directly through a fully connected feedforward network. The LSTM-related architectures use only one layer. For Informer, we keep the number of encoder layers from the original author's open-source code. For the Transformer, in order to compare fairly with Informer, we select the same number of encoder layers; in addition, the model dimension is changed to 256, consistent with the LSTM architectures, and the remaining parameters of the Transformer-related architectures are unchanged.

4.3 Experimental Result
4.3.1 Nontemporal Features fusion methods
Table 2: AUC and AP of the two nontemporal feature fusion methods under different models
                      LSTM   BiLSTM  Phased LSTM  HELSTM  Transformer  Informer  FGLSTM  Count
Our Method     AUC    75.63  75.59   72.52        74.21   75.69        76.05     78.85   12
               AP     34.96  34.92   30.45        32.44   34.31        34.93     38.90
Other Method   AUC    68.59  70.36   68.95        76.35   64.23        75.91     76.37   2
               AP     25.94  26.85   26.63        34.80   21.71        32.85     36.42
Count                 0      0       0            0       0            0         4
In many previous studies, features from different feature spaces are processed by simple addition. Following this idea, we construct a comparative method:

e = e_k + e_{a_k} + e_{a_v}   (16)

where e_k, e_{a_k} and e_{a_v} are the tensors encoded from k, a_k and a_v respectively. In this experiment, the encoding does not involve the temporal features; we uniformly choose the function encoding of Eqs. (2) and (3) proposed above. For the fusion of temporal and nontemporal features, the direct addition method is selected; the remaining settings are discussed in detail below.
Table 2 shows the AUC and AP on the Table 1 dataset for different model architectures and different nontemporal feature fusion methods. Clearly, compared with the common additive method, the proposed nontemporal feature fusion method performs better: except for HELSTM, where the additive method performs best, our method has the advantage in every framework. The most obvious improvement is in the Transformer framework, with gains of 11.46 in AUC and 12.60 in AP. However, for the excellent Informer framework proposed for single-modal timeseries, the improvement is less pronounced: AUC and AP increase by only 0.14 and 2.08 respectively, which shows that Informer is not very sensitive to the feature fusion method. If feature fusion is not a major concern, Informer is indeed a good choice. Our proposed FGLSTM obtains the best performance of all models under both nontemporal feature fusion methods, with AUC and AP improved by 2.48 and 2.48 respectively. Although this improvement is modest, it demonstrates the superiority of the proposed model itself. In general, the feature fusion method has a large impact on the performance of different models, but excellent models are not particularly sensitive to it.
4.3.2 Temporal and nontemporal features fusion methods
The advantage of the proposed nontemporal feature fusion method has been demonstrated above. In this experiment, we therefore verify our proposed fusion of temporal and nontemporal features. To explore whether the dimension-raising step is necessary, we also set up a control experiment that fuses the two stacked features directly with a convolution kernel, without raising the dimension.
Table 3: AUC and AP of different time encodings and temporal/nontemporal fusion methods
                     FE                                             CE
             add          convadd       w/o up-dim    add           convadd       w/o up-dim
Model        AUC    AP    AUC    AP     AUC    AP     AUC    AP     AUC    AP     AUC    AP
LSTM         75.63  34.96 76.88  36.38  76.87  36.35  79.47  39.50  79.09  38.56  79.18  38.53
BiLSTM       75.59  34.92 78.28  37.67  76.98  37.46  79.04  38.56  77.95  36.06  79.64  39.43
Phased LSTM  72.52  30.45 74.82  33.57  72.83  30.69  76.34  34.03  74.54  32.90  78.01  36.89
HELSTM       74.21  32.44 76.78  34.73  75.46  33.24  77.06  35.86  76.79  34.72  75.47  34.37
Transformer  75.69  34.31 76.14  34.88  75.85  34.87  76.41  35.19  75.65  34.22  76.26  35.13
Informer     76.05  34.93 76.19  35.66  75.96  34.86  78.11  35.29  75.99  34.90  76.05  34.12
FGLSTM       78.85  38.90 80.67  41.94  78.47  38.01  79.33  39.09  81.20  42.69  80.89  41.59
Count        0            0             0             8             2             4
Table 3 shows our experimental results, where FE is function encoding, CE is convolution encoding, add is the direct addition method, convadd is our method, and the third method (w/o up-dim) is the comparison without raising the dimension. For FE, which has no learned parameters, the convadd method achieves the best results across models, while the method without dimension raising is inferior to convadd but still better than direct addition for many models. For the learned CE, CE outperforms FE regardless of the temporal/nontemporal fusion method, but none of the three fusion methods is best across all models. Because our dimension-raising convadd method also has parameters to learn, we believe that on larger datasets it can still outperform the other methods across different models. Finally, across the different time encodings and feature fusion methods, our FGLSTM is better than the other models, which demonstrates its robustness; it also shows that LSTM variants are not necessarily inferior to Transformer-series models.
4.3.3 Experimental comparison of different length timeseries
To verify that the proposed model captures the temporal dependencies between features better than other models, in this experiment we feed the models timeseries of different lengths, ranging from 100 to 800. For the time encoding we choose CE; for the nontemporal feature fusion we use our own method; for the temporal/nontemporal fusion we choose the method without dimension raising. Fig. 4(a) and Fig. 4(b) show the results. For the Transformer, training fails with out-of-memory at sequence lengths of 600 and 800, so we reduce the batch size to 64 so that the 24 GB of memory suffices. From the experimental results, we can draw the following conclusions:
First, the timeseries information is effective for prediction: when the input length is below 400, most models improve as the input sequence grows. Second, compared with the other models, FGLSTM is better at capturing temporal dependencies: as the sequence length grows beyond 400, its AUC improves little but its AP still improves steadily, whereas the other models cannot capture temporal dependencies in ultra-long sequences and improve little or even degrade. Finally, the classical LSTM outperforms the Transformer and its variant Informer, which suggests that the timeseries information extraction of Transformer-series models is still somewhat insufficient.
5 Conclusion
This paper proposes a feature fusion framework and the FGLSTM model, an extension of LSTM that handles multimodal irregular timeseries data well. We also explore how to better encode temporal features and how to better integrate temporal and nontemporal features, which is particularly important for irregular timeseries data. First, through the temporal feature encoding method and the feature fusion framework, the representation tensor produced by the model fuses the features and temporal dependencies among different nontemporal information, effectively capturing temporal dependencies in ultra-long sequences as well as the feature information of minority events. Then, the representation tensor of the timeseries is fed into FGLSTM; thanks to the feature gates, the model automatically adapts to the multi-scale sampling frequencies of multi-source complex data and asynchronously tracks the temporal and feature information of different events. Finally, the experiments demonstrate that the proposed method outperforms other typical methods on real datasets. The method is promising to extend and popularize, and can be migrated to diverse fields, especially multi-source asynchronously sampled sensor data and behavior-recording data.
References
 [1] Armandpour, M., Kidd, B., Du, Y., Huang, J.Z.: Deep personalized glucose level forecasting using attentionbased recurrent neural networks. arXiv preprint arXiv:2106.00884 (2021)
 [2] Baltrušaitis, T., Ahuja, C., Morency, L.P.: Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence 41(2), 423–443 (2018)

 [3] Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q.V., Salakhutdinov, R.: Transformer-xl: Attentive language models beyond a fixed-length context. In: ACL (1) (2019)
 [4] Fu, Y., Cao, L., Guo, G., Huang, T.S.: Multiple feature fusion by subspace learning. In: Proceedings of the 2008 international conference on Contentbased image and video retrieval. pp. 127–134 (2008)
 [5] Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J., et al.: Gradient flow in recurrent nets: the difficulty of learning longterm dependencies (2001)
 [6] Hochreiter, S., Schmidhuber, J.: Long shortterm memory. Neural computation 9(8), 1735–1780 (1997)
 [7] Johnson, A.E., Pollard, T.J., Shen, L., LiWei, H.L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L.A., Mark, R.G.: Mimiciii, a freely accessible critical care database. Scientific data 3(1), 1–9 (2016)
 [8] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. Computer Science (2014)
 [9] Koutnik, J., Greff, K., Gomez, F., Schmidhuber, J.: A clockwork rnn. In: International Conference on Machine Learning. pp. 1863–1871. PMLR (2014)
 [10] Li, X., Wang, C., Tan, J., Zeng, X., Ou, D., Ou, D., Zheng, B.: Adversarial multimodal representation learning for clickthrough rate prediction. In: Proceedings of The Web Conference 2020. pp. 827–836 (2020)

 [11] Liu, J., Li, T., Xie, P., Du, S., Teng, F., Yang, X.: Urban big data fusion based on deep learning: An overview. Information Fusion 53, 123–133 (2020)
 [12] Liu, L., Shen, J., Zhang, M., Wang, Z., Liu, Z.: Deep learning based patient representation learning framework of heterogeneous temporal events data. Big Data Research 5(1), 2019003 (2019)
 [13] Liu, P., Cao, Y., Liu, S., Hu, N., Li, G., Weng, C., Su, D.: Varatts: Nonautoregressive texttospeech synthesis based on very deep vae with residual attention. arXiv preprint arXiv:2102.06431 (2021)
 [14] Liu, Z., Shen, Y., Lakshminarasimhan, V.B., Liang, P.P., Zadeh, A.B., Morency, L.P.: Efficient lowrank multimodal fusion with modalityspecific factors. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2018)
 [15] Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., Sun, C.: Attention bottlenecks for multimodal fusion. Advances in Neural Information Processing Systems 34 (2021)
 [16] Neil, D., Pfeiffer, M., Liu, S.C.: Phased lstm: Accelerating recurrent network training for long or eventbased sequences. In: NIPS (2016)
 [17] Neverova, N., Wolf, C., Taylor, G., Nebout, F.: Moddrop: adaptive multimodal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(8), 1692–1706 (2015)
 [18] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: ICML (2011)
 [19] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, highperformance deep learning library. Advances in neural information processing systems 32, 8026–8037 (2019)
 [20] Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W.: Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91(9), 1306–1326 (2003)
 [21] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zeroshot texttoimage generation. arXiv preprint arXiv:2102.12092 (2021)
 [22] REN, Z., WANG, Z., KE, Z., LI, Z., Wushour·Silamu: Survey of multimodal data fusion. Computer Engineering and Applications 57(18), 16 (2021)

 [23] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. pp. 6105–6114. PMLR (2019)
 [24] Tsai, Y.H.H., Bai, S., Yamada, M., Morency, L.P., Salakhutdinov, R.: Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP). pp. 4344–4353 (2019)
 [25] Turpin, A., Scholer, F.: User performance versus precision measures for simple search tasks. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. pp. 11–18 (2006)
 [26] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
 [27] Wu, C., Liang, J., Ji, L., Yang, F., Fang, Y., Jiang, D., Duan, N.: Nüwa: Visual synthesis pretraining for neural visual world creation. arXiv preprint arXiv:2111.12417 (2021)
 [28] Zadeh, A., Chen, M., Poria, S., Cambria, E., Morency, L.P.: Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250 (2017)
 [29] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Zhang, W.: Informer: Beyond efficient transformer for long sequence timeseries forecasting. In: Proceedings of AAAI (2021)