
DBT-DMAE: An Effective Multivariate Time Series Pre-Train Model under Missing Data

by   Kai Zhang, et al.

Multivariate time series (MTS) is a universal data type found in many practical applications. However, MTS suffers from missing data, which degrades or even breaks downstream tasks such as prediction and classification. Current missing-data handling procedures inevitably introduce biased estimation and redundant training when multiple downstream tasks are involved. This paper presents a universally applicable MTS pre-train model, DBT-DMAE, to overcome these obstacles. First, a missing representation module is designed, introducing dynamic positional embedding and random masking to characterize the missing pattern. Second, we propose an auto-encoder structure that produces a generalized encoded MTS representation, using an improved TCN structure called the dynamic bidirectional TCN as its basic unit; it integrates a dynamic kernel and a time-flipping trick to extract temporal features effectively. Finally, an overall feed-in and loss strategy is established to ensure adequate training of the whole model. Comparative experiments show that DBT-DMAE outperforms other state-of-the-art methods on six real-world datasets and two different downstream tasks. Moreover, ablation and interpretability experiments verify the validity of DBT-DMAE's substructures.




Figure 1:

In conventional procedures, missing data imputation is treated as data pre-processing, so multiple downstream task models have to be trained individually. In our proposed pre-train design, the fully encoded generalized feature can easily be fine-tuned and oriented to diverse downstream tasks after introducing the missing embedding and masked self-mapping pre-training.

Multivariate time series (MTS) data have extensive practical applications in industry, health care, geoscience, biology, and beyond. However, MTS universally carries missing observations across diverse data contexts. It has been noted that these missing entries themselves provide informative features of the original data sources Rubin (1976). As a result, the discriminative characteristics of missing data must be considered when dealing with MTS-related tasks.

In the literature, MTS-related tasks with missing data are often solved by the procedure shown in the upper part of Figure 1. Under this two-phase pipeline, missing-data imputation is treated as a data-cleaning stage, and then independent models oriented towards multiple downstream tasks are built individually in the application stage. Many valuable works have contributed to missing data imputation in MTS under this framework, covering tensor decomposition, RNN-based prediction, auto-encoders, and generative adversarial theory, among others. However, several drawbacks remain in practical applications. The traditional treatments produce a complete dataset as an intermediate product, which inevitably introduces imputation bias at missing entries. Besides, feature-extraction procedures are developed repeatedly in both phases, and multiple downstream models have to be trained independently, which always results in significant redundancy.

Moreover, missing data in MTS exhibits unique patterns that impair the treatments mentioned above. Apart from the common patterns listed in Rubin's work Rubin (1976), MTS missing data also confronts some intractable patterns, often referred to as "line missing" and "block missing" Cini et al. (2021), arising from the disconnection of measurements within a global sensory integration during network no-response or downtime. These patterns deprive MTS of its time-continuous property, with many entries missing jointly and simultaneously. As a result, many conventional imputation methods become invalid even under a relatively small missing ratio.

Based on these considerations, and inspired by the pre-train idea in the NLP and CV fields, we propose a novel generalized MTS pre-train model called DBT-DMAE. Our model adopts the typical pre-train procedure depicted in the lower part of Figure 1. To avoid the training redundancy caused by producing a complete dataset in the conventional pipeline, DBT-DMAE utilizes an auto-regressive architecture under a masked learning mechanism and learns directly from unlabeled incomplete MTS data to obtain a generic encoded MTS representation. To handle the tricky missing patterns appearing in MTS, we propose a dynamic positional embedding (DPE) technique that gives all missing entries effective representations without introducing extra imputation bias.

For the detailed implementation, we propose a TCN-based Bai et al. (2018) unit, called the Dynamic Bidirectional TCN (DBT), as the basic encoder unit to capture temporal correlations from a bidirectional data context in MTS. The entire DMAE model is designed to extract multi-time-scale features and perform effective deep fusion to obtain universally applicable encoded features. For DMAE training, a specialized data-feed-in and loss strategy guarantees adequate training of all substructures, and a warm-up training trick is applied to accelerate and stabilize convergence. After the pre-train phase, downstream-task-oriented fine-tuning can be delivered quickly by substituting the decoder of the well-trained DMAE. In this paper, we choose multi-step prediction and MTS classification as example downstream tasks, and the fine-tuning converges rapidly within a few epochs. In general, our main contributions can be summarized as follows:

  1. A novel MTS pre-train framework under missing data is proposed. Under this framework, the biased-imputation problem is avoided by applying dynamic missing mask tokens derived from extensive unlabeled MTS data. At the same time, the straightforward downstream-task fine-tuning procedure directly solves the redundant-training problem of the conventional pipeline.

  2. The proposed DBT-DMAE is highly adaptable to dynamic, time-varying MTS input. The dynamic nature of the DPE, DK, and ASF mechanisms enables the model to capture deep underlying temporal correlations at multiple scales and in both time directions, yielding preferable generalization performance.

  3. The pre-train effectiveness is evaluated on two downstream tasks over six real-world open datasets spanning industry, climate, wearable devices, and speech recognition.

The rest of this article is organized as follows: Section Notation and Problem Statement describes the notation used in this article and the problem formulation of DBT-DMAE. Section Model Architecture introduces the DBT-DMAE in detail. Section Experiments includes all the experiments, including comparative studies, ablation experiments, and model interpretation experiments. Finally, a concise conclusion is made at the end of the paper.

Related Works

Missing Data Imputation

Against a deep-learning background, there are mainly three types of missing data imputation methodology: prediction-based, auto-encoder-based, and GAN-based methods. For prediction-based methods, some works Kök and Özdemir (2020); Tkachenko et al. (2020) transform imputation into an MTS prediction problem, using RNN-based models to predict the missing values, while others Che et al. (2018); Cao et al. (2018); Tang et al. (2020) integrate missing-value prediction as an intermediate step in time series prediction. In terms of auto-encoder-based methods, some works Miranda et al. (2011, 2012); Pan et al. (2022) treat the missing parts as random noise and recover the missing values from the output of a carefully designed auto-encoder. Moreover, with the recent advancement of generative adversarial theory, many works Weihan (2020); Luo et al. (2019); Yoon et al. (2018) follow the basic generative adversarial idea, training specifically structured generators and discriminators with deep neural networks to generate values for the missing parts.

Pre-train Models

In 2016, the Google Brain research team proposed the seq2seq pre-train model Ramachandran et al. (2016). In 2018, BERT Devlin et al. (2018) was introduced by the Google AI Language group, and GPT Radford et al. (2018) by the OpenAI team in the same year. GPT-2 Radford et al. (2019) and GPT-3 Brown et al. (2020) were proposed in succession over the following years. Furthermore, in 2021, the Facebook AI Research team led by Kaiming He proposed the Masked Autoencoder He et al. (2021) pre-train model in CV.

Notation and Problem Statement

Given an MTS $X = [x_1, x_2, \ldots, x_T] \in \mathbb{R}^{T \times n}$, where $T$ is the length of the sequence, we use $x^k \in \mathbb{R}^{T}$ to denote the $k$-th attribute of length $T$ and employ $x_t \in \mathbb{R}^{n}$ to denote the attribute vector at time-entry $t$. Meanwhile, due to the missing phenomenon prevalent in MTS data, we also introduce a binary missing mask matrix $M$ of the same shape as $X$, in which $M_{t,k} = 1$ indicates that the $k$-th attribute at time-entry $t$ is a valid value, while $M_{t,k} = 0$ indicates that the value is missing.

Typically, in our work, DBT-DMAE implements an $n$-attribute sequential-input self-mapping. Given the input $X$ and mask matrix $M$, the DBT model projects the input $X$ into a hidden representation $H \in \mathbb{R}^{T \times h}$ through an encoder $f_e$, where $h$ represents the number of hidden states, and reconstructs the complete $\hat{X}$ through a decoder $f_d$:

$$H = f_e(X, M), \qquad \hat{X} = f_d(H),$$

where $\hat{X}$ denotes the reconstructed result.
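To make the notation concrete, here is a minimal sketch of constructing the binary missing mask matrix from an MTS; coding gaps as NaN is our assumption, not a detail from the paper.

```python
import numpy as np

def missing_mask(X):
    """Return M with M[t, k] = 1 where X[t, k] is observed, 0 where missing (NaN)."""
    return (~np.isnan(X)).astype(np.float32)

# Toy MTS with T = 4 time-entries and n = 3 attributes; NaN marks a missing value.
X = np.array([[1.0, np.nan, 0.5],
              [0.2, 0.3, np.nan],
              [np.nan, np.nan, 0.1],
              [0.7, 0.8, 0.9]])
M = missing_mask(X)
```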

Apart from the DBT-DMAE problem statement, different downstream tasks have independent problem formulations. In this paper, we refer to only two types of downstream tasks: multi-step prediction and classification. The following are the problem statements for each task.

Multi-step Prediction Problem Statement

Prediction is a common task among MTS applications. We conduct 1-step, 3-step, and 5-step prediction experiments in this work. For 1-step prediction, exactly the next target value is predicted; for 3-step and 5-step prediction, the target values of the following three or five time points are estimated. More specifically, we expect to establish the functional relationship between the target values and the previously observed data as follows:

$$\hat{y}_{t+1}, \ldots, \hat{y}_{t+\tau} = F(x_{1:t}, M_{1:t}),$$

where $\tau$ denotes the number of prediction steps.

MTS Classification Problem Statement

MTS classification is a substantial task with applications in many fields, including but not limited to wearable devices, monitoring systems, and speech recognition. Here, our objective is to find a non-linear multivariate probability distribution $P(c \mid X, M)$, which takes the MTS data $X$ and the mask matrix $M$ as input and outputs the probability that the series belongs to each class $c$.

Model Architecture

The Dynamic Bidirectional Temporal Convolution Network-based Denoising Mask Auto-encoder, abbreviated DBT-DMAE, is fully discussed in this section. The fundamental DBT unit and DBT block are introduced first. Then, the specially designed DPE mechanism for missing data representation is described. Finally, the overall DMAE structure is presented together with the data-feed-in and loss strategy, and the warm-up training trick is also explicitly included.

Dynamic Bidirectional TCN Unit and Block

Figure 2: DBT: two-directional features are fused between each two adjacent CDC layers, and the dynamic kernel is adaptive to mutative input.
Figure 3: DBT block: an independent computing module that consists of three parallel DBT Units and an Attention Scale Fusion(ASF) Layer to fuse multi-scale features.

The dynamic bidirectional TCN (DBT) unit is an improved TCN unit used as the basic building block of the DBT block, which is introduced later in this section. The original TCN is a residual architecture with two sequentially stacked Causal Dilated Convolutional (CDC) layers and a ReLU non-linear activation. Based on this structure, we make two main modifications to the original TCN, whose interior structure is shown in Figure 2.


First, since our pre-train model is meant to learn the underlying temporal relations within MTS, it is reasonable to integrate sequential information from both time directions. We therefore apply a "time flipping" trick to the input MTS. As shown in Figure 2, we adopt two independent CDC networks corresponding to time-forward and time-backward convolution. Afterward, a fully-connected layer merges information from the opposite directions and maintains the shape required by the next CDC layer.
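As a toy illustration of the time-flipping idea, the sketch below runs a causal dilated convolution forward in time and again on the time-reversed series; it assumes a single-channel series and hypothetical names, and is not the paper's implementation.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation=1):
    """Causal dilated 1D convolution: output at t depends only on x[<= t]."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad to preserve causality
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

def bidirectional_features(x, w_fwd, w_bwd, dilation=1):
    fwd = causal_dilated_conv(x, w_fwd, dilation)              # time-forward path
    bwd = causal_dilated_conv(x[::-1], w_bwd, dilation)[::-1]  # time-backward path
    # In DBT a fully-connected layer would merge these; here we just stack them.
    return np.stack([fwd, bwd], axis=-1)

x = np.arange(6, dtype=float)
# Identity-like kernels (w = [1, 0]) so both paths reproduce the input.
feats = bidirectional_features(x, w_fwd=np.array([1.0, 0.0]), w_bwd=np.array([1.0, 0.0]))
```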

Second, the multi-variable character of MTS leads to a wide range of input fields with mutative conditions, so a fixed combination of reception kernels can be insufficient. Based on this consideration, we propose the dynamic kernel (DK) to replace the convolution kernel in CDC layers; the detail is depicted on the right side of Figure 2. Inspired by the work of Chen et al. Chen et al. (2020), we use an input attention mechanism to fuse multiple learnable kernel groups into one. Specifically, we apply two average poolings to the input, squeezing the input length and variable dimensions to obtain global vector entries. A linear projector and a softmax layer are then applied to the global entries to obtain the attention weights, and the output kernel parameters are aggregated from the candidate kernel groups with these weights. The calculation process can be expressed as Eq. (6)-(8), where $\mathbf{1}$ indicates the unit vector with all entries equal to 1; $W$ denotes the output kernel arguments; $\pi_i$ is the weight of the $i$-th candidate kernel group; and $\tau$ is the penalty factor used to alleviate the one-hot phenomenon in the softmax. The dimensions of all parameters are coherent with the input and are easy to obtain.
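The kernel-fusion step can be sketched as follows; the shapes, the single time-axis pooling, and all names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(z, tau=1.0):
    """Softmax with temperature tau; larger tau softens the one-hot tendency."""
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_kernel(x, kernels, W_proj, tau=4.0):
    """Fuse K candidate kernel groups into one, weighted by input attention.

    x: (T, n) input window; kernels: (K, k) candidate kernels; W_proj: (K, n).
    """
    g = x.mean(axis=0)               # average-pool over the time dimension
    pi = softmax(W_proj @ g, tau)    # K input-dependent attention weights
    return pi @ kernels, pi          # aggregated kernel of length k

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
kernels = rng.normal(size=(4, 5))    # K = 4 candidate groups, kernel size k = 5
W_proj = rng.normal(size=(4, 3))
fused, pi = dynamic_kernel(x, kernels, W_proj)
```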


The DBT block functions as an independent computing module in the overall model. One DBT block consists of three parallel DBT units with BatchNorm and an Attention Scale Fusion (ASF) layer that incorporates all temporal features; the exact structure is depicted in Figure 3. The three parallel DBT units have different kernel sizes to capture features at multiple sequence scales. After the three feedforward paths, the ASF layer integrates the features across scales. First, we apply average pooling to the concatenated parallel outputs, maintaining only the hidden-feature dimension and squeezing the other dimensions to 1. The resulting average vector is used as a reference entry against each scale output $h_i$ ($i \le 3$). The attention function then assigns each $h_i$ a weight $\alpha_i$, and the final output is computed as the inner product between the weight vector and the $h_i$. The attention function we use here is calculated as follows:

where $W_g$ is the global feature projection operator ($d_a$ is the number of attention hidden states); $\tau$ is again the penalty factor; $W_l$ is the local feature projection operator; $\alpha$ is the attention vector; and $o_t$ represents the final output of the DBT block at time $t$.
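The ASF fusion at a single time step can be sketched as below; the additive-attention form, shapes, and names are assumptions for illustration.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def asf(H, W_g, W_l, v, tau=4.0):
    """Attention Scale Fusion over H: (3, d) features from three parallel units.

    The pooled average serves as the global reference; each scale gets one
    weight, and the fused output is the weighted sum of the scale features.
    """
    q = H.mean(axis=0)                                        # reference entry
    scores = np.array([v @ np.tanh(W_g @ q + W_l @ h) for h in H])
    alpha = softmax(scores, tau)                              # one weight per scale
    return alpha @ H, alpha                                   # fused d-dim output

rng = np.random.default_rng(1)
d, d_a = 6, 4                       # feature dim and attention hidden states
H = rng.normal(size=(3, d))
W_g = rng.normal(size=(d_a, d))
W_l = rng.normal(size=(d_a, d))
v = rng.normal(size=d_a)
out, alpha = asf(H, W_g, W_l, v)
```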

Dynamic Positional Embedding

Without a fully imputed MTS as input, the original MTS must be well represented in the proposed pipeline. Learning from the masked language modeling Liu et al. (2021) behind numerous famous NLP pre-train models Devlin et al. (2018); Liu et al. (2019), we apply a random mask to the original MTS input and treat the masked entries as equivalent to missing parts that must be restored. Unlike the mask-token embedding methods in NLP pre-train models, which use hard-coded or fully learnable embeddings, we consider that the missing-token representation should combine time-varying and variable-varying characteristics, meaning that missing data appearing at different times or in different variables should have distinct token representations. We therefore adopt a specialized end-to-end missing positional embedding technique, called dynamic positional embedding (DPE), to generate time- and variable-varying missing tokens.

Specifically, DPE is implemented by a random masking (RM) procedure and a DBT block. In addition to the initially missing entries, a fraction of the observations, given by the masking ratio, is artificially masked, and both the masked and missing entries are embedded by the bidirectional scanning of a DBT block over the masked data. Furthermore, because the DBT unit is an end-to-end convolutional architecture, it can generate multiple positional embeddings in a single pass and can, in principle, process any missing pattern, including the "line missing" and "block missing" patterns mentioned above. The whole DPE process can be formulated as Eq. (12)-(15),


where $\epsilon$ is a random noise term strengthening the robustness of the model, with $\sigma$ as its parameter; $\mathrm{RM}(\cdot)$ denotes the artificial masking function, which randomly sets additional zeros within the original missing mask matrix; and $\odot$ is the element-wise multiplication.
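The RM step can be sketched as follows; this is a hypothetical implementation in which `gamma` is the masking ratio and the uniform selection of observed entries is our assumption.

```python
import numpy as np

def random_mask(M, gamma, rng):
    """Zero out a fraction gamma of the currently observed entries in mask M."""
    Mr = M.copy()
    obs = np.argwhere(M == 1)                    # positions of observed entries
    n_mask = int(round(gamma * len(obs)))
    picked = obs[rng.choice(len(obs), size=n_mask, replace=False)]
    Mr[picked[:, 0], picked[:, 1]] = 0           # artificially masked entries
    return Mr

rng = np.random.default_rng(0)
M = np.ones((10, 4))                             # fully observed toy mask
Mr = random_mask(M, 0.2, rng)                    # 20% of entries masked
```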

Auto-Encoder Architecture

Figure 4: The overall architecture of DBT DMAE.

The entire DBT-DMAE architecture is designed to establish an auto-regressive projection with valid missing-entry representation under the missing data problem in MTS. A DBT-based encoder and a feedforward decoder constitute the holistic pre-train model.

The encoder consists of sequential DBT blocks. Within each block, all DBT units share the same dilation size, while the stacked blocks have gradually enlarged dilation sizes to integrate local and global temporal features in MTS. In practice, we stack three consecutive DBT blocks whose dilation sizes grow exponentially with base two. As for the decoder, a more complex decoder might yield better auto-regressive reconstruction; however, in DBT-DMAE we desire a generalized feature representation of the input MTS. A simple decoder is therefore preferable for its low expressiveness, and a plain fully-connected feedforward network is used as the decoder in our work. The overall architecture of DBT-DMAE is depicted in Figure 4.

Training Strategy

It is worth mentioning that both the original MTS and the one after RM are fed into the model. The red and blue lines in Figure 4 denote the two forward paths, and the corresponding reconstruction results are denoted accordingly. The objective function is given in Eq. (16), in which the missing mask matrix, the mask matrix after RM, and the random masking ratio appear. The first part of the function focuses on the manually masked entries, while the second part attends to reconstructing all observed data. This combination is critical because it forces DBT to learn both the original data distribution and the missing representation at the same missing entry. Additionally, the two losses are weighted according to the masking ratio, which balances the probability of the model seeing the artificially masked missing embedding against that of seeing the corresponding original observed data. Furthermore, our ablation experiments show that the pre-train performance degrades remarkably if the first part is removed.
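A rough sketch of the two-part objective described above; the exact weighting by the masking ratio `gamma` and all names are assumptions, and Eq. (16) is only paraphrased, not reproduced.

```python
import numpy as np

def dmae_loss(X, X_hat_m, X_hat_o, M, M_rm, gamma):
    """Two-part reconstruction loss.

    X_hat_m / X_hat_o: reconstructions from the masked and original paths;
    M: missing mask; M_rm: mask after RM; gamma: random masking ratio.
    """
    # First term: error at artificially masked (but originally observed) entries.
    mask_art = M * (1 - M_rm)
    l_mask = (((X_hat_m - X) ** 2) * mask_art).sum() / max(mask_art.sum(), 1.0)
    # Second term: error at all observed entries.
    l_obs = (((X_hat_o - X) ** 2) * M).sum() / max(M.sum(), 1.0)
    # Weighted combination by the masking ratio (our assumption).
    return (1.0 - gamma) * l_mask + gamma * l_obs

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
M = np.ones_like(X)
M_rm = M.copy(); M_rm[0, 0] = 0            # one artificially masked entry
perfect = dmae_loss(X, X, X, M, M_rm, 0.2) # perfect reconstruction -> zero loss
```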


For more detail on the training process: given the heavy use of the softmax operation in the DBT unit and the sequentially stacked DMAE structure, direct training of the entire model can lead to oscillation and slow convergence. We therefore propose a "warm-up" training trick to give the network a better initialization. In the first few epochs, we artificially replace all softmax outputs with uniform weights to neutralize the underfitting problem caused by the nearly one-hot output of the softmax. In this process, all model structures, especially the candidate kernels in DK, are initialized to relatively good positions. After several warm-up epochs, the softmax layers come into play. Also, the penalty factors used in Eq. (7) and Eq. (10) are set to relatively large positive numbers to avert sudden jumps within the model once the softmax resumes work.
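The warm-up substitution can be sketched as a simple switch; the warm-up length, the temperature value, and the function name are illustrative assumptions.

```python
import numpy as np

def attention_weights(scores, epoch, warmup_epochs=10, tau=8.0):
    """Uniform weights during warm-up, temperature-softened softmax afterwards."""
    scores = np.asarray(scores, dtype=float)
    if epoch < warmup_epochs:
        # Warm-up: bypass softmax so every candidate receives gradient.
        return np.full(scores.shape, 1.0 / scores.size)
    # After warm-up: large tau keeps weights far from one-hot at first.
    z = scores / tau
    e = np.exp(z - z.max())
    return e / e.sum()

w_warm = attention_weights([1.0, 2.0, 3.0], epoch=0)
w_late = attention_weights([1.0, 2.0, 3.0], epoch=20)
```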

Downstream Fine-tuning

As for the fine-tuning details, this paper covers multi-step prediction and classification fine-tuning. For both tasks, DBT-DMAE's original decoder is replaced with a new feedforward network. For the prediction task, the output dimension of the last layer is set to the target prediction step; for the classification task, the dimension equals the total number of categories, and a softmax layer is attached at the end of the feedforward network. The fine-tuning loss settings are also simple: MSE loss for the prediction task and cross-entropy loss for the classification task.


Experiments

This section conducts various experiments, including downstream task comparison, model hyper-parameter selection, and ablation studies. All experiments are conducted on an AMAX workstation with an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 256GB RAM, and an RTX3090 GPU.

Dataset Settings

We use a total of six real-world open-access datasets to develop a downstream prediction task and an MTS classification task. The first three, namely Appliances Energy Prediction Data (AEPD), Beijing Multi-Site Air-Quality Data (BMAD), and SML2010 (SML), are MTS datasets archived in the UCI Machine Learning Repository Asuncion and Newman (2007); these three are used for the prediction-related task. For the classification task, subsets of FaceDetection, JapaneseVowels, and SpokenArabicDigits are chosen from the UEA multivariate time series classification archive Bagnall et al. (2018). For training and validation, we randomly split each dataset with a ratio of 9:1. We also normalize all numerical values using the Z-score method for stable training and fair comparability. Missing values are generated artificially at missing ratios of 5%, 10%, and 20%.
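The preprocessing described here can be sketched as follows; function names are illustrative, and coding injected missing values as NaN is our assumption.

```python
import numpy as np

def zscore(X):
    """Z-score normalization per attribute (column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def inject_missing(X, ratio, rng):
    """Artificially set a fraction `ratio` of entries to NaN (missing)."""
    Xc = X.copy()
    Xc[rng.random(X.shape) < ratio] = np.nan
    return Xc

rng = np.random.default_rng(42)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 5))  # toy MTS, T = 100, n = 5
Xn = zscore(X)
Xm = inject_missing(Xn, 0.10, rng)                 # 10% missing ratio
```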

Model     | MR   | AEPD (1s/3s/5s)     | BMAD (1s/3s/5s)     | SML (1s/3s/5s)      | FD  | JV   | SAD
LSTNet    | 0.05 | 0.136/0.132/0.137   | 0.112/0.106/0.108   | 0.127/0.148/0.153   | 72% | 11%  | 1%
LSTNet    | 0.10 | 0.158/0.170/0.180   | 0.120/0.135/0.141   | 0.132/0.157/0.176   | 70% | 11%  | 1%
LSTNet    | 0.20 | 0.197/0.208/0.216   | 0.197/0.221/0.232   | 0.177/0.183/0.194   | 65% | 10%  | 1%
TPA-RNN   | 0.05 | 0.185/0.189/0.219   | 0.185/0.215/0.221   | 0.108/0.130/0.135   | 69% | 97%  | 1%
TPA-RNN   | 0.10 | 0.197/0.214/0.225   | 0.177/0.225/0.239   | 0.120/0.145/0.155   | 69% | 96%  | 1%
TPA-RNN   | 0.20 | 0.215/0.232/0.248   | 0.243/0.264/0.278   | 0.149/0.160/0.182   | 63% | 93%  | 1%
DARNN     | 0.05 | 0.129/0.136/0.129   | 0.067/0.102/0.105   | 0.123/0.127/0.134   | 66% | 64%  | 13%
DARNN     | 0.10 | 0.147/0.158/0.166   | 0.087/0.118/0.124   | 0.129/0.137/0.141   | 64% | 63%  | 10%
DARNN     | 0.20 | 0.158/0.184/0.197   | 0.094/0.136/0.144   | 0.187/0.161/0.149   | 61% | 56%  | 9%
DBT-DMAE  | 0.05 | 0.075/0.081/0.078   | 0.074/0.094/0.081   | 0.099/0.100/0.123   | 76% | 100% | 100%
DBT-DMAE  | 0.10 | 0.090/0.071/0.071   | 0.081/0.088/0.085   | 0.106/0.123/0.130   | 75% | 99%  | 100%
DBT-DMAE  | 0.20 | 0.087/0.088/0.100   | 0.085/0.097/0.094   | 0.119/0.109/0.132   | 76% | 100% | 100%
Table 1: MAE results on the prediction task and precision indicators on the classification task

Performance Indexes

Three types of performance indexes need to be specified.

Pre-train Index

In this paper, we apply the Mean Squared Error (MSE) to evaluate the pre-train model's auto-regressive reconstruction performance. We evaluate the missing and observed values separately, using performance indexes with the -m and -o suffixes, respectively,

where $i$ denotes the sample index and $N$ is the total sample number; $\hat{X}$ denotes the reconstruction of the input $X$; and $\|\cdot\|_1$ is the 1-norm operator.
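As a small illustration (not the paper's code), the two reconstruction indexes can be computed over entries selected by the mask, assuming ground truth is available at the artificially removed entries.

```python
import numpy as np

def mse_split(X_true, X_hat, M):
    """MSE over observed entries (M == 1) and over missing entries (M == 0)."""
    err = (X_hat - X_true) ** 2
    mse_o = (err * M).sum() / M.sum()
    mse_m = (err * (1 - M)).sum() / (1 - M).sum()
    return mse_o, mse_m

# Toy check: perfect at observed entries, error of 1 at the single missing one.
X_true = np.ones((2, 2))
X_hat = np.array([[1.0, 2.0], [1.0, 1.0]])
M = np.array([[1.0, 0.0], [1.0, 1.0]])
mse_o, mse_m = mse_split(X_true, X_hat, M)
```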

Prediction Index

For prediction-related tasks, we adopt the MAE indicator to evaluate the prediction error between the output of the downstream fine-tuning and the ground-truth values. The MAE index for $\tau$-step prediction is as follows:


Classification Index

We use the precision indicator for the MTS classification task, which measures the proportion of correctly categorized samples. The indicator is as follows:

where $\mathbb{1}(\cdot)$ is the indicator function that returns 1 if the condition is true and 0 otherwise.
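A minimal sketch of this indicator (names are illustrative): the predicted class is the argmax of the output probabilities, and precision is the fraction of matches with the ground truth.

```python
import numpy as np

def precision_indicator(probs, labels):
    """Fraction of samples whose argmax-predicted class equals the label.

    probs: (N, C) class probabilities; labels: (N,) ground-truth classes.
    """
    pred = probs.argmax(axis=1)
    return (pred == labels).mean()

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7]])
labels = np.array([0, 1, 1, 1])
acc = precision_indicator(probs, labels)  # 3 of 4 samples correct
```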

Baselines Settings

For prediction baselines, we choose three state-of-the-art models, namely LSTNet Lai et al. (2018), TPA-RNN Shih et al. (2019), and DARNN Qin et al. (2017). In recent literature, these three models are well-recognized deep learning methods for sequential prediction and estimation. For the classification task, we append a softmax layer to the rear of each of the three models above, so the output is the probability that the MTS input belongs to each category.

Comparative Study

This subsection compares DBT-DMAE's fine-tuned prediction and classification performance with the three baselines above. The missing ratio (MR) is artificially set to 5%, 10%, and 20% for each dataset. The results on the first three datasets show the prediction performance computed as in Eq. (19), where 1s, 3s, and 5s denote 1-step, 3-step, and 5-step prediction, respectively. The last three columns show the comparative results on the classification task, with precision indexes computed as in Eq. (20). The best performance in each case is shown in bold. According to the results, fine-tuned DBT-DMAE outperforms the three baselines, especially when the missing ratio is high.

Hyper-Parameter Selection

In this subsection, we examine the effect of DBT-DMAE's hyper-parameters on model convergence and performance. Three crucial parameters are considered: the number of hidden states $h$, the random masking ratio $\gamma$, and the kernel-size combination $\kappa$. We run a grid search over the hyper-parameter fields: $h$ is selected from [16, 32, 64, 128]; $\gamma$ is chosen from [0.05, 0.1, 0.2, 0.3]; and $\kappa$ is chosen from [{3, 5, 7}, {2, 3, 5}, {5, 7, 11}, {2, 3, 7}]. For brevity, we present only the performance on the SML dataset here.

Figure 5: MSE loss curves at both observed and missing entries under different hyper-parameter combinations; panels (a)-(c) show the loss curves when varying the hidden-state number, the random masking ratio, and the kernel-size combination, respectively.

In Figure 5, we visualize the grid-search results, containing the training and validation MSE indexes under different hyper-parameter combinations. Specifically, the restoration performance at missing and observed entries is plotted separately, with labels marked (m) and (v), respectively. It is worth mentioning that there is a rapid loss descent after epoch 10, which accurately reflects the effect of the warm-up training trick. In conclusion, to balance the model's performance, convergence speed, and model size, we finally choose the hyper-parameter combination that best trades off these factors.

Ablation Studies

Our algorithm has four critical substructures: DPE, RM, DBT, and ASF. Ablation studies are conducted in four independent parts to further verify the efficiency and parsimony of our model:

  1. w/o DPE: substitute DPE with a hard-coded embedding.

  2. w/o RM: remove the random masking process and the first part of the loss function.

  3. w/o DK: substitute the DBT unit with the original TCN.

  4. w/o ASF: substitute ASF with linear concatenation.

All the ablation results below are obtained on the AEPD dataset with a missing ratio of 0.2.

Case      | MSE (observed) | MSE (missing) | 3-step MAE
DBT-DMAE  | 0.016          | 0.021         | 0.088
w/o DPE   | 0.023          | 0.032         | 0.096
w/o RM    | 0.028          | 0.160         | 0.124
w/o DK    | 0.019          | 0.025         | 0.101
w/o ASF   | 0.021          | 0.024         | 0.092
Table 2: Ablation study results

Table 2 shows the ablation results. We evaluate the pre-train performance at both observed and missing locations, and the downstream 3-step prediction index is also included for support.

The results of the w/o DPE and w/o RM cases show that the dynamic missing embedding is exceedingly meaningful to the whole algorithm. When the random masking process is removed from the missing-representation procedure, reconstruction performance on the missing part drops dramatically, while performance on the observed parts remains unaffected. This unbalanced degradation can be viewed as a failure signal in missing-value representation.

For the w/o DK and w/o ASF cases, the ablation results confirm these substructures' dynamic feature-extraction ability. We also visualize the dynamic kernel parameters and the dynamic missing positional embedding values in Figure 6, which further confirms the validity of the proposed DBT structure.

(a) DPE along with the time axis
(b) DK parameters along with the time axis
Figure 6: Visualization of DBT-DMAE’s dynamic characteristics.


Conclusion

In this paper, we propose a universal MTS pre-train model that obtains downstream-generalized encoded representations while overcoming the missing data problem. Our model utilizes the proposed DBT as its basic unit and adopts the dynamic positional embedding and mask-learning mechanisms to construct the auto-regressive DMAE. With simple fine-tuning, the proposed model suits various downstream tasks, including prediction, classification, and condition estimation. Extensive experiments on open real-world datasets demonstrate the superiority of the proposed DBT-DMAE when confronting missing data in the MTS context.


References

  • A. Asuncion and D. Newman (2007) UCI machine learning repository. Irvine, CA, USA. Cited by: Dataset Settings.
  • A. Bagnall, H. A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom, P. Southam, and E. Keogh (2018) The uea multivariate time series classification archive, 2018. arXiv preprint arXiv:1811.00075. Cited by: Dataset Settings.
  • S. Bai, J. Z. Kolter, and V. Koltun (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: Introduction.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, and A. Askell (2020) Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901. Cited by: Pre-train Models.
  • W. Cao, D. Wang, J. Li, H. Zhou, L. Li, and Y. Li (2018) Brits: bidirectional recurrent imputation for time series. Advances in neural information processing systems 31. Cited by: Missing Data Imputation.
  • Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu (2018) Recurrent neural networks for multivariate time series with missing values. Scientific reports 8 (1), pp. 1–12. Cited by: Missing Data Imputation.
  • Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, and Z. Liu (2020) Dynamic convolution: attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11030–11039. Cited by: Dynamic Bidirectional TCN Unit and Block.
  • A. Cini, I. Marisca, and C. Alippi (2021) Filling the g_ap_s: multivariate time series imputation by graph neural networks. In International Conference on Learning Representations, Cited by: Introduction.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Pre-train Models, Dynamic Positional Embedding.
  • K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2021) Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377. Cited by: Pre-train Models.
  • İ. Kök and S. Özdemir (2020) DeepMDP: a novel deep-learning-based missing data prediction protocol for IoT. IEEE Internet of Things Journal 8 (1), pp. 232–243. Cited by: Missing Data Imputation.
  • G. Lai, W. Chang, Y. Yang, and H. Liu (2018) Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104. Cited by: Baselines Settings.
  • P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig (2021) Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586. Cited by: Dynamic Positional Embedding.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: Dynamic Positional Embedding.
  • Y. Luo, Y. Zhang, X. Cai, and X. Yuan (2019) E2GAN: end-to-end generative adversarial network for multivariate time series imputation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3094–3100. Cited by: Missing Data Imputation.
  • V. Miranda, J. Krstulovic, H. Keko, C. Moreira, and J. Pereira (2012) Reconstructing missing data in state estimation with autoencoders. IEEE Transactions on Power Systems 27 (2), pp. 604–611. Cited by: Missing Data Imputation.
  • Z. Pan, Y. Wang, K. Wang, H. Chen, C. Yang, and W. Gui (2022) Imputation of missing values in time series using an adaptive-learned median-filled deep autoencoder. IEEE Transactions on Cybernetics, pp. 1–12. Cited by: Missing Data Imputation.
  • Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. Cottrell (2017) A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971. Cited by: Baselines Settings.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: Pre-train Models.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: Pre-train Models.
  • P. Ramachandran, P. J. Liu, and Q. V. Le (2016) Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683. Cited by: Pre-train Models.
  • D. B. Rubin (1976) Inference and missing data. Biometrika 63 (3), pp. 581–592. Cited by: Introduction, Introduction.
  • S. Shih, F. Sun, and H. Lee (2019) Temporal pattern attention for multivariate time series forecasting. Machine Learning 108 (8), pp. 1421–1441. Cited by: Baselines Settings.
  • X. Tang, H. Yao, Y. Sun, C. Aggarwal, P. Mitra, and S. Wang (2020) Joint modeling of local and global temporal dynamics for multivariate time series forecasting with missing values. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 5956–5963. Cited by: Missing Data Imputation.
  • R. Tkachenko, I. Izonin, N. Kryvinska, I. Dronyuk, and K. Zub (2020) An approach towards increasing prediction accuracy for the recovery of missing IoT data based on the GRNN-SGTM ensemble. Sensors 20 (9), pp. 2625. Cited by: Missing Data Imputation.
  • W. Weihan (2020) MAGAN: a masked autoencoder generative adversarial network for processing missing IoT sequence data. Pattern Recognition Letters 138, pp. 211–216. Cited by: Missing Data Imputation.
  • J. Yoon, J. Jordon, and M. van der Schaar (2018) GAIN: missing data imputation using generative adversarial nets. In International Conference on Machine Learning, pp. 5689–5698. Cited by: Missing Data Imputation.