Introduction
Multivariate time series (MTS) data have extensive practical applications in industry, health care, geoscience, biology, etc. However, MTS universally carries missing observations across diverse data contexts. It has been noted that these missing entries themselves provide informative features of the original data sources (Rubin, 1976). As a result, the discriminative characteristics of missing data must be considered when dealing with MTS-related tasks.
In the literature, MTS-related tasks encountering missing data are often solved by the procedure shown in the upper part of Figure 1. Under this two-phase pipeline, missing-value imputation is treated as a data-cleaning stage, and independent models oriented towards multiple downstream tasks are then built individually in the application stage. Many valuable works have contributed to missing data imputation in MTS under this framework, with methodologies covering tensor decomposition, RNN-based prediction, autoencoders, and generative adversarial theory, among others. However, this pipeline has several drawbacks in practice. The traditional treatments produce a complete dataset as an intermediate product, which inevitably introduces imputation bias into the missing entries. Besides, the feature-extraction procedures are developed repeatedly in both phases, and multiple downstream models have to be trained independently, which results in significant redundancy.
Moreover, missing data in MTS has unique missing patterns that impair the concurrent treatments mentioned above. Apart from the common patterns listed in Rubin's work (Rubin, 1976), MTS missing data also exhibits some intractable patterns, often referred to as "line missing" and "block missing" (Cini et al., 2021), which arise when measurements across an integrated sensor network are disconnected during network no-response or downtime. These patterns destroy the time-continuity of the MTS, since many entries become missing jointly and simultaneously. As a result, many conventional imputation methods mentioned above become invalid even under a relatively small missing ratio.
Based on these considerations and inspired by the pretraining idea in the NLP and CV fields, we propose a generalized MTS pretraining model called DBT-DMAE. Our model adopts the typical pretraining procedure depicted in the lower part of Figure 1. To avoid the training redundancy caused by producing a complete dataset in the conventional pipeline, DBT-DMAE utilizes an autoregressive architecture under a masked learning mechanism and learns directly from unlabeled incomplete MTS data to obtain a generic encoded MTS representation. To handle the tricky missing patterns appearing in MTS, we propose a dynamic missing positional embedding (DPE) technique that gives all missing entries effective representations without introducing extra imputation bias.
For the detailed implementation, we propose a TCN-based (Bai et al., 2018) unit, called the Dynamic Bidirectional TCN (DBT), as the basic encoder unit to capture temporal correlations from a bidirectional data context in MTS. The entire DMAE model is designed to extract multi-time-scale features and perform an effective deep fusion to obtain universally applicable encoded features. For the DMAE training process, a specialized data-feed-in and loss strategy guarantees adequate training of all substructures, and a warm-up training trick is applied to accelerate and stabilize convergence. After the pretraining phase, downstream-task-oriented fine-tuning can be delivered quickly by substituting the decoder of the well-trained DMAE. In this paper, we choose multi-step prediction and MTS classification as the downstream task examples, and the fine-tuning converges rapidly within a few epochs. In general, our main contributions can be summarized as follows:

A novel MTS pretraining framework under missing data is proposed. Under this framework, the aforementioned biased-imputation problem is avoided by applying dynamic missing mask tokens derived from extensive unlabeled MTS data. At the same time, the straightforward downstream fine-tuning procedure directly solves the redundant-training problem in the conventional pipeline.

The proposed DBT-DMAE holds distinguished adaptability to dynamic, time-varying MTS input. The dynamic intrinsic nature of the DPE, DK, and ASF mechanisms enables the model to capture deep underlying temporal correlations at multiple scales and in both time directions, and yields preferable generalization performance.

The pretraining effectiveness is evaluated on two downstream tasks over six real-world open datasets whose fields range from industry and climate to wearable devices and speech recognition.
The rest of this article is organized as follows: Section Notation and Problem Statement describes the notation used in this article and the problem formulation of DBT-DMAE. Section Model Architecture introduces DBT-DMAE in detail. Section Experiments includes all the experiments, covering comparative studies, ablation experiments, and model interpretation experiments. Finally, a concise conclusion is given at the end of the paper.
Related Works
Missing Data Imputation
In the deep learning context, there are mainly three types of missing data imputation methodology: prediction-based, autoencoder-based, and GAN-based methods. For prediction-based methods, some works (Kök and Özdemir, 2020; Tkachenko et al., 2020) transform imputation into an MTS prediction problem and use RNN-based models to predict the missing values, while other works (Che et al., 2018; Cao et al., 2018; Tang et al., 2020) integrate missing-value prediction as an intermediate step in time series prediction. In terms of autoencoder-based methods, several works (Miranda et al., 2011, 2012; Pan et al., 2022) treat the missing parts as random noise and recover the missing values with the output of a delicately designed autoencoder. Moreover, with the recent advancement of generative adversarial theory, many works (Weihan, 2020; Luo et al., 2019; Yoon et al., 2018) follow the basic generative adversarial idea, using deep neural networks to train specifically structured generators and discriminators that generate the values of the missing parts.
Pretrain Models
In 2016, the Google Brain research team proposed a seq2seq pretraining model (Ramachandran et al., 2016). Subsequently, in 2018, BERT (Devlin et al., 2018) was introduced by the Google AI Language research group and GPT (Radford et al., 2018) by the OpenAI research team in the same year. GPT-v2 (Radford et al., 2019) and GPT-v3 (Brown et al., 2020) were then proposed in succession in the following years. Furthermore, in 2021, the Facebook AI Research team led by Kaiming He proposed the Masked Autoencoder (He et al., 2021) pretraining model in CV.
Notation and Problem Statement
Given an MTS $\mathbf{X} = [x_1, x_2, \ldots, x_T] \in \mathbb{R}^{T \times n}$, where $T$ is the length of the sequence, we use $x^{(k)} \in \mathbb{R}^{T}$ to denote the $k$-th attribute of length $T$ and employ $x_t \in \mathbb{R}^{n}$ to denote the attribute vector at time entry $t$. Meanwhile, because the missing phenomenon is pervasive in MTS data, we also introduce the binary missing mask matrix $\mathbf{M} \in \{0,1\}^{T \times n}$ of the same shape as $\mathbf{X}$, in which $m_t^{k} = 1$ indicates that the $k$-th attribute at time entry $t$ is a valid value, while $m_t^{k} = 0$ indicates the value is missing. Typically, in our work, DBT-DMAE implements an $n$-attribute sequential input self-mapping. Given the input $\mathbf{X}$ and mask matrix $\mathbf{M}$, the DBT-DMAE model projects the input into a hidden representation $\mathbf{H} \in \mathbb{R}^{T \times d}$ through an encoder $f_{enc}$, where $d$ represents the number of hidden states, and reconstructs the complete series $\hat{\mathbf{X}}$ by a decoder $f_{dec}$:

$\mathbf{H} = f_{enc}(\mathbf{X}, \mathbf{M})$ (1)

$\hat{\mathbf{X}} = f_{dec}(\mathbf{H})$ (2)

where $\hat{\mathbf{X}}$ denotes the reconstructed result.
Apart from the DBT-DMAE problem statement, different downstream tasks have independent problem expressions. In this paper, we only refer to two types of downstream tasks: multi-step prediction and classification. The following are the problem statements for each task.
Multi-step Prediction Problem Statement
Prediction is a common task among all MTS applications. We apply 1-step, 3-step, and 5-step prediction experiments in this work. For 1-step prediction, the exact next target value is predicted; for 3-step and 5-step prediction, the target values of the following three or five time points are estimated. To be more specific, we expect to establish the functional relationship between the target values and the previously observed data as follows:

$[\hat{x}_{t+1}, \ldots, \hat{x}_{t+\tau}] = F(x_{1:t}, \mathbf{M}_{1:t})$ (3)

where $\tau$ denotes the number of prediction steps.
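To make the prediction setup concrete, the following is a minimal sketch of how (history, mask, target) training windows could be assembled from an incomplete MTS. The function name, window length, and array layout are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def make_windows(x, m, hist_len, horizon):
    """Build (history, mask, target) triples for multi-step prediction.

    x: (T, n) MTS with missing entries already zero-filled
    m: (T, n) binary mask, 1 = observed, 0 = missing
    hist_len: length of the observed history fed to the model
    horizon:  number of future steps to predict (1, 3, or 5 here)
    """
    xs, ms, ys = [], [], []
    for t in range(hist_len, x.shape[0] - horizon + 1):
        xs.append(x[t - hist_len:t])
        ms.append(m[t - hist_len:t])
        ys.append(x[t:t + horizon])      # future target values
    return np.stack(xs), np.stack(ms), np.stack(ys)
```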
MTS Classification Problem Statement
MTS classification is a substantial task with applications in many fields, including but not limited to wearable devices, monitoring systems, and speech recognition. Here, our objective is to find a nonlinear multivariate probability distribution $g$, which takes the MTS data $\mathbf{X}$ and mask matrix $\mathbf{M}$ as input and outputs the probability that the series belongs to each of the $C$ classes:

$\hat{p} = g(\mathbf{X}, \mathbf{M}), \quad \hat{p} \in [0,1]^{C}$ (4)

$\hat{c} = \arg\max_{c} \hat{p}_{c}$ (5)
Model Architecture
The Dynamic Bidirectional Temporal Convolution Network-based Denoising Mask Autoencoder, abbreviated as DBT-DMAE, is fully discussed in this section. The fundamental DBT unit and DBT block are first introduced. Then, the specially designed DPE mechanism for missing data representation is described. Finally, the overall DMAE structure is presented together with the data-feed-in and loss strategy, and the warm-up training trick is also explicitly included.
Dynamic Bidirectional TCN Unit and Block
The Dynamic Bidirectional TCN unit (DBT) is an improved TCN unit used as the basic unit of the DBT block, which is introduced later in this section. The original TCN is a residual architecture with two sequentially stacked Causal Dilated Convolutional (CDC) layers and a ReLU nonlinear activation. Based on this structure, we make two main modifications to the original TCN; the interior structure is shown in Figure 2.
First, since our pretraining model is made to learn the underlying temporal relations within MTS, it is reasonable to integrate sequential information from both time directions. As a result, we apply the "time flipping" trick to the input MTS. As shown in Figure 2, we adopt two independent CDC networks corresponding to time-forward and time-backward convolution. Afterward, a fully-connected layer merges the information from the opposite directions and maintains the shape required by the next CDC layer.
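Before turning to the second modification, a minimal sketch of this time-flipping path is given below; it is an illustration of the idea only, not the authors' exact implementation, and the channel sizes, residual wiring, and BatchNorm placement are assumptions left out for brevity.

```python
import torch
import torch.nn as nn

class CausalDilatedConv(nn.Module):
    """1-D causal dilated convolution over (batch, channels, time)."""
    def __init__(self, c_in, c_out, k, dilation):
        super().__init__()
        self.pad = (k - 1) * dilation          # left padding keeps causality
        self.conv = nn.Conv1d(c_in, c_out, k, dilation=dilation)

    def forward(self, x):
        x = nn.functional.pad(x, (self.pad, 0))
        return self.conv(x)

class BidirectionalTCNUnit(nn.Module):
    """Two CDC paths (forward time and flipped time) merged by a linear layer."""
    def __init__(self, c_in, c_out, k=3, dilation=1):
        super().__init__()
        self.fwd = CausalDilatedConv(c_in, c_out, k, dilation)
        self.bwd = CausalDilatedConv(c_in, c_out, k, dilation)
        self.merge = nn.Linear(2 * c_out, c_out)

    def forward(self, x):                      # x: (batch, c_in, time)
        h_fwd = self.fwd(x)
        h_bwd = self.bwd(torch.flip(x, dims=[-1]))           # "time flipping"
        h_bwd = torch.flip(h_bwd, dims=[-1])                  # flip back to align in time
        h = torch.cat([h_fwd, h_bwd], dim=1).transpose(1, 2)  # (batch, time, 2*c_out)
        return torch.relu(self.merge(h)).transpose(1, 2)      # (batch, c_out, time)
```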
Second, the multivariable characteristic of MTS leads to a wide range of input fields with varying conditions, and therefore a fixed combination of receptive kernels can be insufficient. Based on this consideration, we propose the dynamic kernel (DK) to replace the convolution kernel in the CDC layers. The detail is depicted on the right side of Figure 2. Inspired by the work of Chen et al. (2020), we use an input attention mechanism to fuse multiple learnable kernel groups into one. Specifically, we apply two average-pooling operations to the input, squeezing the input length and the variable dimension, to obtain global vector entries. Then a linear projector and a softmax layer are employed on the global entries to get the attention weights. At last, the output kernel parameters are aggregated from the candidate kernel groups with these weights. The calculation process can be expressed as Eqs. (6)-(8), where a unit vector with all entries equal to 1 is used in the pooling, each weight corresponds to one candidate kernel group, and a penalty factor is introduced to alleviate the one-hot phenomenon in the softmax. The dimensions of all the parameters are coherent with the input and are easy to obtain.

(6)
(7)  
(8) 
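Since Eqs. (6)-(8) are only outlined above, the following sketch shows one plausible reading of the dynamic kernel: per-sample attention weights, computed from globally pooled input statistics and softened by a temperature-like penalty factor, aggregate several candidate kernels into a single convolution kernel. The pooling scheme, candidate count, and the grouped-convolution trick used to apply one kernel per sample are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicKernel(nn.Module):
    """Fuse several candidate convolution kernels with input-dependent attention weights."""
    def __init__(self, c_in, c_out, k_size, n_candidates=4, temperature=4.0):
        super().__init__()
        # K learnable kernel groups, each shaped like a Conv1d weight (c_out, c_in, k)
        self.kernels = nn.Parameter(torch.randn(n_candidates, c_out, c_in, k_size) * 0.02)
        self.attn = nn.Linear(c_in, n_candidates)
        self.temperature = temperature      # penalty factor against near-one-hot softmax
        self.pad = k_size // 2

    def forward(self, x):                   # x: (batch, c_in, time)
        g = x.mean(dim=-1)                  # global average pooling over time
        w = torch.softmax(self.attn(g) / self.temperature, dim=-1)    # (batch, K)
        agg = torch.einsum('bk,koil->boil', w, self.kernels)          # one kernel per sample
        b, c_in, t = x.shape
        c_out = agg.shape[1]
        out = F.conv1d(x.reshape(1, b * c_in, t),
                       agg.reshape(b * c_out, c_in, -1),
                       groups=b, padding=self.pad)
        return out.reshape(b, c_out, -1)
```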
The DBT block functions as an independent computing module in the overall model. One DBT block consists of three parallel DBT units with BatchNorm and an Attention Scale Fusion (ASF) layer incorporating all temporal features. The exact structure is depicted in Figure 3. The three parallel DBT units have different kernel sizes to capture features at multiple sequence scales. After the three feed-forward paths, an ASF layer is attached to integrate the features under the different scales. First, we apply an average pooling to the concatenated parallel outputs, maintaining only the hidden-feature dimension and squeezing the other dimensions to 1. Then, the resulting average vector is used as a reference entry against each of the three parallel unit outputs. The following attention function assigns each output a weight, and the final output is computed as the inner product between the weight vector and the stacked unit outputs. The attention function we use here is calculated as follows:
(9)  
(10)  
(11) 
where a global feature projection operator maps the pooled reference entry into the attention hidden space, a local feature projection operator maps each parallel output, the penalty factor again moderates the softmax, and the result represents the final output of the DBT block at time t.
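Because Eqs. (9)-(11) are given only in outline here, the sketch below shows one way the ASF step could be realized: a global projection of the pooled reference vector is compared with a local projection of each scale's features, and softmax weights (softened by the penalty factor) blend the three parallel outputs. The dimension names and exact pooling are assumptions.

```python
import torch
import torch.nn as nn

class AttentionScaleFusion(nn.Module):
    """Blend the three parallel DBT-unit outputs with attention weights."""
    def __init__(self, d_hidden, d_attn=16, temperature=4.0):
        super().__init__()
        self.global_proj = nn.Linear(d_hidden, d_attn)  # projects the pooled reference vector
        self.local_proj = nn.Linear(d_hidden, d_attn)   # projects each scale's features
        self.temperature = temperature                  # penalty factor against one-hot softmax

    def forward(self, feats):              # feats: (batch, time, n_scales, d_hidden)
        ref = feats.mean(dim=(1, 2))       # global average, keeping only the hidden dimension
        q = self.global_proj(ref)[:, None, None, :]     # (batch, 1, 1, d_attn)
        k = self.local_proj(feats)                       # (batch, time, n_scales, d_attn)
        score = (q * k).sum(dim=-1) / self.temperature   # (batch, time, n_scales)
        w = torch.softmax(score, dim=-1)                 # weights over the parallel scales
        return (w.unsqueeze(-1) * feats).sum(dim=2)      # (batch, time, d_hidden)
```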
Dynamic Positional Embedding
Without a fully imputed MTS as input, the original MTS must be well represented in the proposed pipeline. Learning from the masked language modeling (Liu et al., 2021) behind numerous famous NLP pretrained models (Devlin et al., 2018; Liu et al., 2019), we apply a random mask to the original MTS input and treat the masked entries as equivalent to missing parts that are required to be restored. Unlike the mask-token embedding method using hard-coded or fully learnable embeddings in NLP pretrained models, we consider that the missing-token representation should combine time-varying and variable-varying characteristics, meaning that missing data appearing at different times or in different variables should have distinct token representations. As a result, we adopt a specialized end-to-end missing positional embedding technique, called dynamic positional embedding (DPE), to generate the time- and variable-varying missing tokens.
Specifically, DPE is implemented by a random masking (RM) procedure and a DBT block. Apart from the originally missing entries, a fraction of the observations determined by the masking ratio is artificially masked, and both the masked and missing entries are embedded by the bidirectional scan of a DBT block over the masked data. Furthermore, since the DBT unit is an end-to-end convolutional architecture, it has the representational ability to generate multiple positional embeddings in a single pass and can theoretically handle any missing pattern, including the "line missing" and "block missing" patterns mentioned above. The whole DPE process can be formulated as Eqs. (12)-(15):
(12)  
(13)  
(14)  
(15) 
where a random noise term, with its own scale parameter, strengthens the robustness of the model; the artificial masking function randomly sets additional zeros within the original missing mask matrix; and an element-wise multiplication applies the resulting mask to the input.
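A rough sketch of the RM step in Eqs. (12)-(15) is given below, under the assumptions that missing entries are zero-filled and that noise is injected before masking; the order of these operations and the noise form in the paper may differ.

```python
import torch

def random_mask(x, m, mask_ratio=0.2, noise_std=0.1):
    """Artificially mask a fraction of observed entries on top of the original mask.

    x: (batch, time, n) MTS with zeros at missing entries
    m: (batch, time, n) binary mask, 1 = observed, 0 = originally missing
    Returns the corrupted input and the post-RM mask.
    """
    keep = (torch.rand_like(m) > mask_ratio).float()      # randomly drop a fraction of entries
    m_rm = m * keep                                        # masked entries now look "missing"
    x_rm = (x + noise_std * torch.randn_like(x)) * m_rm   # add noise, zero out masked/missing
    return x_rm, m_rm

# The DPE tokens are then produced by one bidirectional scan of a DBT block, e.g.
#   tokens = dbt_block(torch.cat([x_rm, m_rm], dim=-1))
# so each missing or masked position receives a time- and variable-dependent embedding.
```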
AutoEncoder Architecture
The entire DBT-DMAE architecture is designed to establish an autoregressive projection with valid representations of missing entries under the missing data problem in MTS. A DBT-based encoder and a feed-forward decoder constitute the holistic pretraining model.
The encoder consists of sequential DBT blocks. Within each block, all DBT units share the same dilation size, while the stacked blocks have gradually enlarged dilation sizes to integrate local and global temporal features in MTS. In practice, we stack three consecutive DBT blocks, whose dilation sizes grow exponentially with base two. In terms of the decoder, a more complicated decoder might yield better autoregressive reconstruction. However, in DBT-DMAE we want to obtain a generalized feature representation of the input MTS, so a simple decoder with low expressiveness is preferable because it forces the encoder to carry the representational burden; thus a simple fully-connected feed-forward network is introduced as the decoder in our work. The overall architecture of DBT-DMAE is depicted in Figure 4.
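Reusing the BidirectionalTCNUnit sketch above as a stand-in for a full DBT block, the encoder-decoder skeleton could look roughly as follows. The hidden width and the two-layer decoder are assumptions; the base-two dilation growth and the deliberately simple decoder follow the description above.

```python
import torch
import torch.nn as nn

class DMAEEncoder(nn.Module):
    """Three stacked DBT-style blocks with exponentially growing dilation (1, 2, 4)."""
    def __init__(self, n_vars, d_hidden=64):
        super().__init__()
        self.blocks = nn.ModuleList([
            BidirectionalTCNUnit(n_vars if i == 0 else d_hidden, d_hidden,
                                 k=3, dilation=2 ** i)
            for i in range(3)
        ])

    def forward(self, x):                     # x: (batch, n_vars, time)
        for blk in self.blocks:
            x = blk(x)
        return x                              # (batch, d_hidden, time)

class DMAEDecoder(nn.Module):
    """Deliberately simple fully-connected decoder reconstructing the input."""
    def __init__(self, d_hidden, n_vars):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, n_vars))

    def forward(self, h):                     # h: (batch, d_hidden, time)
        return self.net(h.transpose(1, 2)).transpose(1, 2)   # (batch, n_vars, time)
```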
Training Strategy
It is worth mentioning that both the original MTS and the one after RM are fed into the model. The red and blue lines in Figure 4 denote the two forward paths, and a reconstructed result is produced for each path. The objective function is given in Eq. (16), in which the original missing mask matrix, the mask matrix after RM, and the random masking ratio all appear. The first part of the function focuses on the manually masked entries, while the second part pays attention to reconstructing all observed data. This combination is critical because it forces DBT-DMAE to learn both the original data distribution and the missing representation at the same entries. Additionally, the two losses are combined with weights tied to the masking ratio, which aims to balance how often the model sees the artificially masked missing embedding versus the corresponding original observed data. Furthermore, our ablation experiments show that the pretraining performance degrades by a remarkable margin if the first part is removed.
(16)  
Regarding further details of the training process, given the heavy use of the softmax operation in the DBT unit and the sequentially stacked DMAE structure, directly training the entire model can lead to oscillation and slow convergence. Therefore, we propose a so-called "warm-up" training trick to give the network a better initialization. In the first several epochs, we artificially replace all softmax outputs with uniform weights to neutralize the underfitting problem brought about by the nearly one-hot output of the softmax. In this process, all the model structures, especially all candidate kernels in DK, are sufficiently initialized to relatively good positions. After several warm-up epochs, the softmax layers then come into play. Also, the penalty factors used in Eq. (7) and Eq. (10) are set to a relatively large positive number to avert a sudden jump within the model once the softmax resumes work.
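As Eq. (16) is only summarized in prose here, the following is a hedged sketch of the two-part objective; the exact weighting between the two terms and the squared-error form are assumptions, not the paper's verbatim loss.

```python
def dmae_loss(x, x_hat, x_hat_rm, m, m_rm, mask_ratio):
    """Two-part reconstruction objective in the spirit of Eq. (16); expects torch tensors.

    x        : original (incomplete) input, zeros at missing entries
    x_hat    : reconstruction from the un-masked forward path
    x_hat_rm : reconstruction from the randomly masked forward path
    m        : original missing mask (1 = observed)
    m_rm     : mask after random masking (1 = still observed)
    """
    rm_only = m * (1 - m_rm)                 # entries that were masked artificially
    loss_masked = ((x - x_hat_rm) ** 2 * rm_only).sum() / rm_only.sum().clamp(min=1)
    loss_obs = ((x - x_hat) ** 2 * m).sum() / m.sum().clamp(min=1)
    # Balance how often the model sees masked embeddings vs. observed data;
    # the exact weights used in the paper are not reproduced here.
    return (1 - mask_ratio) * loss_masked + mask_ratio * loss_obs

# Warm-up trick (sketch): for the first few epochs, the softmax weights inside the
# DK and ASF modules can be replaced by uniform weights, e.g. via a very large
# temperature, before restoring the (still relatively large) penalty factor.
```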
Downstream Fine-tuning
As for the fine-tuning details, in this paper we include multi-step prediction and classification fine-tuning. For both tasks, DBT-DMAE's original decoder is replaced with a new feed-forward network. In the prediction task, the last layer's output dimension is set to the target number of prediction steps, while in the classification task the dimension is set to the number of categories and a softmax layer is attached at the end of the feed-forward network. The fine-tuning loss settings are also simple: MSE loss is used for the prediction task, and cross-entropy loss is used for the classification task.
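A minimal sketch of the head replacement is shown below; the hidden width and layer count are assumptions. Note that when PyTorch's CrossEntropyLoss is used, the softmax is usually omitted from the head because that loss expects raw logits.

```python
import torch.nn as nn

def build_finetune_head(d_hidden, task, out_dim):
    """Replace the pretraining decoder with a task-specific feed-forward head.

    task    : 'prediction' (out_dim = number of future steps)
              or 'classification' (out_dim = number of classes)
    """
    layers = [nn.Linear(d_hidden, d_hidden), nn.ReLU(), nn.Linear(d_hidden, out_dim)]
    if task == 'classification':
        layers.append(nn.Softmax(dim=-1))     # class probabilities
    return nn.Sequential(*layers)

# Fine-tuning losses: nn.MSELoss() for prediction, nn.CrossEntropyLoss() for
# classification (applied to logits, i.e. without the softmax layer above).
```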
Experiments
This section presents various experiments, including downstream task comparisons, model hyper-parameter selection, and ablation studies. All the experiments are conducted on an AMAX workstation with an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 256GB RAM, and an RTX 3090 GPU.
Dataset Settings
We use a total of six real-world open-access datasets to develop a downstream prediction task and an MTS classification task. The first three, namely Appliances Energy Prediction Data (AEPD), Beijing Multi-Site Air-Quality Data (BMAD), and SML2010 (SML), are MTS datasets archived in the UCI Machine Learning Repository (Asuncion and Newman, 2007). These three datasets are used for the prediction-related task. For the classification task, subsets of FaceDetection, JapaneseVowels, and SpokenArabicDigits are chosen from the UEA multivariate time series classification archive (Bagnall et al., 2018). For training and validation purposes, we randomly split each dataset with a ratio of 9:1. We also normalize the numerical values for all tasks using the Z-Score method for regular training and fair comparability. We artificially generate missing values at missing ratios of 5%, 10%, and 20%.
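As a concrete (and simplified) illustration of this preprocessing, the sketch below Z-Score normalizes a series and injects pointwise artificial missingness at a given ratio; the block-wise missing patterns discussed in the Introduction would require a different mask generator.

```python
import numpy as np

def preprocess(x, missing_ratio, seed=0):
    """Z-Score normalize an MTS and inject artificial missing values.

    x: (T, n) raw multivariate series. Returns (x_norm, mask) where mask is 1
    at observed entries and 0 at the artificially removed ones.
    """
    rng = np.random.default_rng(seed)
    x_norm = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    mask = (rng.random(x.shape) >= missing_ratio).astype(np.float32)
    x_norm = x_norm * mask                    # zero out the "missing" entries
    return x_norm, mask
```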
Table 1: Comparative results of the downstream prediction (MAE) and classification (precision) tasks under different missing ratios (MR).

Method | MR | AEPD 1s | AEPD 3s | AEPD 5s | BMAD 1s | BMAD 3s | BMAD 5s | SML 1s | SML 3s | SML 5s | UEAF | UESS | UEAJ
LSTNet | 0.05 | 0.136 | 0.132 | 0.137 | 0.112 | 0.106 | 0.108 | 0.127 | 0.148 | 0.153 | 72% | 11% | 1%
LSTNet | 0.10 | 0.158 | 0.170 | 0.180 | 0.120 | 0.135 | 0.141 | 0.132 | 0.157 | 0.176 | 70% | 11% | 1%
LSTNet | 0.20 | 0.197 | 0.208 | 0.216 | 0.197 | 0.221 | 0.232 | 0.177 | 0.183 | 0.194 | 65% | 10% | 1%
TPA-RNN | 0.05 | 0.185 | 0.189 | 0.219 | 0.185 | 0.215 | 0.221 | 0.108 | 0.130 | 0.135 | 69% | 97% | 1%
TPA-RNN | 0.10 | 0.197 | 0.214 | 0.225 | 0.177 | 0.225 | 0.239 | 0.120 | 0.145 | 0.155 | 69% | 96% | 1%
TPA-RNN | 0.20 | 0.215 | 0.232 | 0.248 | 0.243 | 0.264 | 0.278 | 0.149 | 0.160 | 0.182 | 63% | 93% | 1%
DA-RNN | 0.05 | 0.129 | 0.136 | 0.129 | 0.067 | 0.102 | 0.105 | 0.123 | 0.127 | 0.134 | 66% | 64% | 13%
DA-RNN | 0.10 | 0.147 | 0.158 | 0.166 | 0.087 | 0.118 | 0.124 | 0.129 | 0.137 | 0.141 | 64% | 63% | 10%
DA-RNN | 0.20 | 0.158 | 0.184 | 0.197 | 0.094 | 0.136 | 0.144 | 0.187 | 0.161 | 0.149 | 61% | 56% | 9%
DBT-DMAE | 0.05 | 0.075 | 0.081 | 0.078 | 0.074 | 0.094 | 0.081 | 0.099 | 0.100 | 0.123 | 76% | 100% | 100%
DBT-DMAE | 0.10 | 0.090 | 0.071 | 0.071 | 0.081 | 0.088 | 0.085 | 0.106 | 0.123 | 0.130 | 75% | 99% | 100%
DBT-DMAE | 0.20 | 0.087 | 0.088 | 0.100 | 0.085 | 0.097 | 0.094 | 0.119 | 0.109 | 0.132 | 76% | 100% | 100%
Performance Indexes
There are three types of performance indexes to be specified.
Pretrain Index
In this paper, we apply the Mean Square Error (MSE) to evaluate the pretraining model's autoregressive reconstruction performance. In particular, we evaluate the missing and observed values separately with two MSE indexes, one restricted to the missing entries and one to the observed entries.
(17)  
(18) 
where the summation runs over all samples and is normalized by the total sample number, the reconstructed input is compared against the original one, and the 1-norm operator appears in the index definitions.
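Since Eqs. (17) and (18) are only outlined above, the following is a small sketch of how the two reconstruction indexes could be computed, restricting the squared error to the observed or the (artificially) missing entries; variable names are illustrative.

```python
import numpy as np

def masked_mse(x_true, x_hat, mask):
    """Mean squared reconstruction error over the entries selected by a binary mask."""
    sel = mask.astype(bool)
    diff = (x_true - x_hat)[sel]
    return float(np.mean(diff ** 2)) if diff.size else 0.0

# Observed-entry index:  masked_mse(x, x_hat, m)
# Missing-entry index:   masked_mse(x_ground_truth, x_hat, 1 - m)   # needs held-out truth
```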
Prediction Index
In terms of the prediction-related tasks, we adopt the MAE indicator to evaluate the prediction error between the output of the downstream fine-tuned model and the ground-truth values. The MAE index for $\tau$-step prediction is as follows:
(19) 
Classification Index
We use the precision indicator for the MTS classification task, which measures the proportion of correctly categorized samples. The indicator is as follows:

$\mathrm{Precision} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\hat{c}_i = c_i)$ (20)

where $\mathbb{I}(\cdot)$ is the indicator function that returns 1 if the condition is true and 0 otherwise, $c_i$ is the ground-truth class, and $\hat{c}_i$ is the predicted class of the $i$-th sample.
Baselines Settings
In terms of prediction baselines, we choose three state-of-the-art models, namely LSTNet (Lai et al., 2018), TPA-RNN (Shih et al., 2019), and DA-RNN (Qin et al., 2017). In recent literature, these three models are well-recognized deep learning methods for sequential prediction and estimation. For the classification task, we append a softmax layer at the rear of the three models above, so that the output is the probability that the MTS input belongs to each category.
Comparative Study
This subsection compares DBT-DMAE's fine-tuned prediction and classification performance with the three baselines above. The missing ratio (MR) is artificially set to 5%, 10%, and 20% for each dataset. In Table 1, the results on the first three datasets report the prediction performance computed as in Eq. (19), where 1s, 3s, and 5s denote 1-step, 3-step, and 5-step prediction, respectively. The last three columns show the comparative results on the classification task, with precision indexes computed as in Eq. (20). According to the results, the fine-tuned DBT-DMAE outperforms the three compared baselines, especially when the missing ratio is high.
Hyper-Parameter Selection
In this subsection, we conduct experiments to discuss the effect of DBT-DMAE's hyper-parameters on model convergence and performance. Three crucial parameters are considered: the number of hidden states, the random masking ratio, and the kernel size combination. We develop a grid search over the hyper-parameter fields: the number of hidden states is selected from [16, 32, 64, 128]; the masking ratio is chosen from [0.05, 0.1, 0.2, 0.3]; and the kernel size combination is chosen from [{3, 5, 7}, {2, 3, 5}, {5, 7, 11}, {2, 3, 7}]. For brevity, we only present the performance on the SML dataset here.
In Figure 5, we visualize the grid search results containing the training and validation MSE indexes under different hyper-parameter combinations. Specifically, the restoration performances at missing and observed entries are plotted separately, with the labels marked (v) and (m), respectively. It is worth mentioning that there is a rapid loss descent after epoch 10, which directly reflects the effect of the warm-up training trick. In conclusion, balancing the model's performance, convergence speed, and model size, we finally choose one fixed hyper-parameter combination for the remaining experiments.
Ablation Studies
Our algorithm has four critical substructures: DPE, RM, DK, and ASF. The ablation studies are conducted in four independent parts to further verify the efficiency and brevity of our model:

w/o DPE: substitute DPE with a hard-coded embedding.

w/o RM: remove the random masking process and the first part of the loss function.

w/o DK: substitute the DBT unit with the original TCN unit.

w/o ASF: substitute ASF with linear concatenation.
All the ablation results below are obtained on the AEPD dataset with the missing ratio equal to 0.2.
Table 2: Ablation study results (AEPD dataset, missing ratio 0.2).

Case | MSE (observed) | MSE (missing) | 3-step MAE
DBT-DMAE | 0.016 | 0.021 | 0.088
w/o DPE | 0.023 | 0.032 | 0.096
w/o RM | 0.028 | 0.160 | 0.124
w/o DK | 0.019 | 0.025 | 0.101
w/o ASF | 0.021 | 0.024 | 0.092
Table 2 shows the ablation experiment results. We evaluate the pretraining performance at both observed and missing locations, and the downstream 3-step prediction MAE index is also included for support.
The results of the w/o DPE and w/o RM cases show that the dynamic missing embedding is essential to the whole algorithm. When the random masking process is removed from the missing-representation procedure, the reconstruction performance on the missing part drops dramatically, whereas the performance on the observed parts remains largely unaffected. This unbalanced degradation can be viewed as a failure of the missing-value representation.
In terms of the w/o DK and w/o ASF cases, the ablation results confirm these substructures' dynamic feature-extraction ability. We also visualize the dynamic kernel parameters and the dynamic missing positional embedding values in Figure 6, which further confirms the validity of the proposed DBT structure.
Conclusion
In this paper, we propose a universal MTS pretraining model that obtains downstream-generalized encoded representations while overcoming the missing data problem. Our model utilizes the proposed DBT as its basic unit and adopts the dynamic positional embedding and mask-learning mechanisms to construct the autoregressive DMAE. With simple fine-tuning, the proposed model is suitable for various downstream tasks, including prediction, classification, condition estimation, etc. Various experiments on open real-world datasets demonstrate the superiority of the proposed DBT-DMAE when confronting the missing data problem in the MTS context.
References
Asuncion and Newman (2007). UCI Machine Learning Repository. Irvine, CA, USA.
Bagnall et al. (2018). The UEA multivariate time series classification archive, 2018. arXiv preprint arXiv:1811.00075.
Bai et al. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
Brown et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
Cao et al. (2018). BRITS: bidirectional recurrent imputation for time series. Advances in Neural Information Processing Systems 31.
Che et al. (2018). Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8(1), pp. 1–12.
Chen et al. (2020). Dynamic convolution: attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11030–11039.
Cini et al. (2021). Filling the G_ap_s: multivariate time series imputation by graph neural networks. In International Conference on Learning Representations.
Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
He et al. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.
Kök and Özdemir (2020). DeepMDP: a novel deep-learning-based missing data prediction protocol for IoT. IEEE Internet of Things Journal 8(1), pp. 232–243.
Lai et al. (2018). Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104.
Liu et al. (2021). Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.
Liu et al. (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Luo et al. (2019). E2GAN: end-to-end generative adversarial network for multivariate time series imputation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3094–3100.
Miranda et al. (2011). Reconstructing missing data in state estimation with autoencoders. IEEE Transactions on Power Systems 27(2), pp. 604–611.
Miranda et al. (2012). Reconstructing missing data in state estimation with autoencoders. IEEE Transactions on Power Systems 27(2), pp. 604–611.
Pan et al. (2022). Imputation of missing values in time series using an adaptively-learned median-filled deep autoencoder. IEEE Transactions on Cybernetics, pp. 1–12.
Qin et al. (2017). A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971.
Radford et al. (2018). Improving language understanding by generative pre-training.
Radford et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog 1(8), p. 9.
Ramachandran et al. (2016). Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683.
Rubin (1976). Inference and missing data. Biometrika 63(3), pp. 581–592.
Shih et al. (2019). Temporal pattern attention for multivariate time series forecasting. Machine Learning 108(8), pp. 1421–1441.
Tang et al. (2020). Joint modeling of local and global temporal dynamics for multivariate time series forecasting with missing values. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 5956–5963.
Tkachenko et al. (2020). An approach towards increasing prediction accuracy for the recovery of missing IoT data based on the GRNN-SGTM ensemble. Sensors 20(9), p. 2625.
Weihan (2020). MAGAN: a masked autoencoder generative adversarial network for processing missing IoT sequence data. Pattern Recognition Letters 138, pp. 211–216.
Yoon et al. (2018). GAIN: missing data imputation using generative adversarial nets. In International Conference on Machine Learning, pp. 5689–5698.