With the fast growth of sensor applications, the amount of sensor data has increased tremendously in recent years. Missing values have been a major impediment to sensor data analysis. They can be caused by various factors, including sensor failure, data loss, irregular sampling, manual recording errors, sensor maintenance, and debugging. Missing values are widespread and difficult to avoid, and they complicate subsequent data analysis for researchers and engineers.
Currently, there are primarily two types of methods for dealing with missing values, namely, deletion and imputation. Deletion directly removes the data that contain missing values; this not only discards specific information but also leaves the time series incomplete, thereby affecting subsequent data analysis. Imputation methods are divided into traditional machine learning methods and deep learning methods. Traditional machine learning methods include neighbor-based methods, constraint-based methods, regression-based methods, statistical-based methods, matrix factorization-based methods, expectation maximization-based methods, and multivariate imputation by chained equations-based methods [38, 1]. These methods are better suited to situations with a small amount of data and a low missing rate. Deep learning methods include fully connected neural network (FCNN)-based methods, convolutional neural network (CNN)-based methods, recurrent neural network (RNN)-based methods [40, 21, 18, 6, 3, 35, 20, 28, 24], generative adversarial network (GAN)-based methods [22, 12, 25, 16], attention-based methods [36, 23], and transformer-based methods. Deep learning methods are better suited to situations with a large amount of data and a high missing rate.
Although deep learning methods have made significant progress in the problem of missing value imputation, their assumptions and settings are quite different. In practice, the characteristics of the missing values for sensor data are complicated, with factors such as missing rate, missing location, and maximum missing length, all of which will influence the difficulty of the imputation task. Making many assumptions in advance may result in unexpected outcomes that cannot be used in practice. The current deep learning imputation methods have three types of issues.
First, the measurement indexes of missing value imputation warrant further study. To measure the imputation effect point to point, the current practice is to use indicators from the mean squared error (MSE) family, namely MSE, mean absolute error (MAE), root MAE (RMAE), and root MSE (RMSE). They are supervised indexes: a subset of the values is first removed from the original data, the model imputes this subset, and RMSE or MAE then measures the gap between the removed and imputed values. MSE and MAE are the indicators mainly used in imputation tasks. However, they emphasize the difference between the dropped and imputed values while neglecting the difference between the truly missing and imputed values. Furthermore, when a part of the data is removed in advance, the data cannot be used in full for model training and imputation. This issue is especially prominent when the data has a high missing rate.
Second, when a large amount of time-series data is processed in deep learning, the data is divided into small segments, each of which is a subsequence of the original time series, also known as a sample. As a result of this step, the missing rate splits into two quantities: the global missing rate, which is the missing rate of all the data, and the local missing rate, which is the missing rate of each short segment. In essence, the model should account for the local missing rate. The local missing rate of different segments can range from 0% to 100%, depending on the missing rate of the data and the length of the segment. By contrast, the global missing rate is fixed. However, most of the existing deep learning imputation methods [39, 11, 40, 21, 18, 6, 3, 35, 12, 25, 23, 22, 20, 31, 36, 16, 28, 24] do not distinguish between these two concepts. At this point, two subproblems arise.
i) The imputation task and the model’s target are not aligned. The task that needs to be accomplished is to fill in the missing values in the data. However, it is easy to transform the problem of imputing fixed global missing rates into imputing fixed local missing rates during evaluation. Missing data in actual sensor data easily cluster together; hence, the local missing rate easily becomes zero.
ii) The choice of using disguised missing data is unrealistic. Most of the aforementioned studies only consider missing value imputation on complete or nearly complete data. Removing values at different missing rates makes it easier to train and evaluate the model on more complete data. However, when the sample data itself contains missing values, it is unclear how to remove values and calculate the loss in the training phase; in the testing phase, the selection of test data and the choice of indicators for measuring the model’s quality are also delicate. In practice, the data to be imputed contains missing values; how to train and evaluate on this type of data is a topic worth discussing.
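The distinction between the global and local missing rates can be made concrete with a short sketch. It assumes that missing values are encoded as NaN and that segments do not overlap; the function names and the toy series are illustrative only and do not come from the original work.

```python
import numpy as np

def global_missing_rate(x):
    """Fraction of missing (NaN) values over the whole series."""
    return float(np.isnan(x).mean())

def local_missing_rates(x, seg_len):
    """Missing rate of each non-overlapping segment of length seg_len.

    A trailing remainder shorter than seg_len is ignored for brevity.
    """
    n_full = len(x) // seg_len
    segments = x[: n_full * seg_len].reshape(n_full, seg_len)
    return np.isnan(segments).mean(axis=1)

# Toy series: 40 points with one clustered gap of 8 missing values.
x = np.arange(40, dtype=float)
x[10:18] = np.nan

g = global_missing_rate(x)        # 8 / 40 = 0.2
loc = local_missing_rates(x, 10)  # second segment: 0.8, all others: 0.0
```

With a clustered gap, a global missing rate of 20% corresponds to local rates of 0% in three segments and 80% in one, which is the mismatch described above.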
Third, more adaptable imputation strategies are required for different missing value problems. The conventional approach of treating missing value imputation as a one-stage problem overlooks the fact that samples have different local missing rates. Fig. 1 shows a one-stage flowchart. Liu et al. showed that an iterative approach improves imputation performance. When data with different missing conditions is handled by the same method, the missing rate of the data should be kept constant, which can be accomplished through an iterative approach. In addition, most researchers process long sensor data with fixed-length segmentation; however, fixed-length imputation cannot deal with large missing sections appropriately. It has been shown that, for a fixed sample length, RMSE increases with the gap length, i.e., the larger the missing gap in the data, the more information the model requires.
To address the aforementioned issues, we propose a Multistage Large Segment Imputation Framework (MLSIF) in this study. The MLSIF introduces a new statistical indicator that allows the missing values in the data to participate in the entire training and evaluation process, so that the MLSIF can impute sensor data that already contains missing values. To improve the imputation effect, the framework uses an iterative multistage imputation method with a different data segment length in each stage to better handle different missing conditions. Real sensor data is used in the experiments to verify the model’s quality, and the missing pattern of real data is simulated to avoid the departure from real conditions caused by artificial settings. The deep learning model NRTSI is used as the basic imputation model of the framework. The main contributions of this study are summarized as follows.
A new statistical indicator is presented. The newly proposed indicator can include the missing values from the data itself in the entire training and testing process and can be generated based on the distribution of the data without removing the original data. Therefore, it is more consistent with real-world situations and can better measure the effect of imputation to a certain extent.
A new method of constructing missing values is adopted, i.e., using complete or nearly complete data to simulate the missing situation of real data to avoid the deviation from the actual situation caused by an artificial setting.
A new MLSIF is proposed that can handle the missing value imputation task for actual missing data. Large missing gaps are assigned to different stages and imputed stage by stage. MLSIF dynamically changes the length of the training and imputing data based on the missing situation of the current data, so that longer missing gaps are provided with longer observation data, which means more information. Hence, real-life sensor data can be handled.
The experimental results on both the benchmark and real sensor data show the effectiveness of the proposed MLSIF. The results highlight the benefit of the multistage imputation strategy and the mixture loss, and the imputation effect improves to some extent, especially for the large missing gap imputation problem.
II Problem Formulation
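This study considers the problem of missing value imputation in univariate sensor data, a class of time-series data. Let $X = \{x_1, x_2, \ldots, x_T\}$ be a sequence of sensor data with length $T$, where $x_t \in \mathbb{R}$, and some $x_t$ may be missing. To identify the missing values, define a mask sequence $M = \{m_1, m_2, \ldots, m_T\}$ that corresponds to $X$, where

$$m_t = \begin{cases} 1, & \text{if } x_t \text{ is observed} \\ 0, & \text{if } x_t \text{ is missing} \end{cases} \tag{1}$$

Denote $X_{s,l} = \{x_s, x_{s+1}, \ldots, x_{s+l-1}\}$ as a sample, which is a subsequence of $X$, where $s, l \in \mathbb{N}^+$ and $\mathbb{N}^+$ is the set of positive integers. The problem we need to solve is how to impute the missing part.
Consequently, $X$ can be decomposed into multiple non-overlapping subsequences of length $l$. Define a set $D = \{X^{(1)}, X^{(2)}, \ldots, X^{(n)}\}$ that contains all the subsequences of $X$, where $n$ represents the number of samples. When the data cannot be segmented evenly, the last few data points form the final sample $X^{(n)}$ together with the data immediately preceding them. Thus $n$ can be calculated using Equation (2):

$$n = \left\lceil \frac{T}{l} \right\rceil \tag{2}$$

where $\lceil \cdot \rceil$ is the least integer (ceiling) function.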
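The segmentation rule described above can be sketched as follows. This is a minimal illustration, assuming that the final sample is simply the last block of the series (so it overlaps its predecessor when the length is not a multiple of the sample length); the names `segment`, `T`, and `seg_len` are chosen here for illustration.

```python
import math
import numpy as np

def segment(x, seg_len):
    """Split x into ceil(T / seg_len) samples of equal length.

    If T is not a multiple of seg_len, the last sample is taken as the final
    seg_len points, so it shares some points with the previous sample.
    """
    T = len(x)
    if seg_len <= 0 or seg_len > T:
        raise ValueError("seg_len must be in 1..len(x)")
    n = math.ceil(T / seg_len)
    starts = [min(i * seg_len, T - seg_len) for i in range(n)]
    return np.stack([x[s:s + seg_len] for s in starts])

samples = segment(np.arange(10), 4)
# Three samples: [0..3], [4..7], and the backfilled last sample [6..9].
```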
III Related Works
III-A Deep Learning Models
Recently, several studies have proposed deep learning models for missing value imputation. Because sensor data is a type of time-series data, time-series models can be applied to it.
FCNN, RNN, and CNN were the three basic structures in the early days of neural network development. As the first proposed network structure, FCNN can achieve good results on several tasks, including missing value imputation. However, because it is insensitive to time and has a large number of parameters, it is gradually being replaced by RNN for time-series data and by CNN for image data. As a network structure designed for time-series data, RNN is frequently used in missing value imputation tasks, in the form of unidirectional RNN [35, 18, 32], bidirectional RNN [3, 6, 44], and variant RNN [40, 20]. CNN has achieved outstanding results in image processing; however, it has not shown unique advantages on time-series data. Guo attempted to use CNN for missing value imputation and obtained acceptable results.
The breadth and depth of neural networks have steadily increased with the exponential growth of data and processing power throughout this century. New network structures, such as GAN and attention, have gradually emerged. GAN has shown promising results in image generation; thus, some researchers have applied it to missing value generation [21, 22, 12, 25, 16]. Attention was designed for language sequence data and can also be used for missing value imputation in time-series data. GLIMA is built on the view that the information in both the parts and the whole of the data should be fully considered; therefore, it constructs a structure that extracts local information while also taking the whole sequence into account. Ma used attention to extract information from data to realize missing value imputation.
Building on attention, the transformer structure has begun to show better results on a wide range of tasks. It can be regarded as a compound neural network structure based on attention. In the field of missing value imputation, NRTSI uses a nonregression method to impute the missing values. It reinterprets a time series as a set of (time, data) tuples and proposes a time-series imputation method based on a permutation-equivariant model, achieving excellent results in time-series imputation experiments.
III-B Imputation Frameworks
Compared with imputation models that use processed sequences and samples, a more integral and comprehensive approach is to use a strategic framework to complete the imputation, which refers to the entire imputation process from missing data to complete data.
Some researchers have attempted to solve the problem of missing value imputation using a framework. Farhangfar provided a framework that can be applied to almost any imputation method; it generates weights representing the quality of each estimate in order to perform boosting. Applying it to an imputation method can, on average, significantly improve the imputation accuracy while maintaining the same asymptotic computational complexity. Rahman subsequently proposed a framework for imputing missing values based on co-appearance, correlation, and similarity analysis, which exploits existing dataset patterns such as the co-occurrence of attribute values, correlations between attributes, and the similarity of attribute values.
Although some general frameworks exist, there are some flaws for specific problems, such as their inability to distinguish between local and global missing rates and their poor performance on the large gap missing problem.
III-C Performance Metrics
Metrics are the criteria by which a model is measured, evaluated, and selected in the task of imputing missing values. There are three commonly used types of metrics: MSE-like evaluation (MSE, MAE, RMSE, and RMAE) and two statistical metrics. Assume two sequences of time-series data of equal length: the observed values that are removed artificially and the imputed values at the corresponding positions. MSE and MAE can be defined using Equations (3) and (4).
RMSE and RMAE are the roots of MSE and MAE, respectively. The two statistical metrics are defined in Equations (5) and (6), which use the means and standard deviations of the two sequences.
To use the above indicators, missing values must be artificially created in nonmissing value data. Specifically, first, remove a portion of the values from the data and subsequently impute the removed values. Furthermore, the imputation effect is measured by comparing the difference between the removed values and the imputed values.
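The supervised evaluation procedure described above can be sketched as follows. The sketch uses the standard definitions of MSE and MAE and takes RMSE and RMAE as their square roots, as stated in the text. The linear interpolation imputer is only a stand-in so that the example runs; the function names and the toy signal are assumptions made for illustration.

```python
import numpy as np

def drop_values(x, drop_rate, rng):
    """Hide a random subset of the observed values.

    Returns the corrupted copy and a boolean mask of the hidden positions.
    """
    observed = np.flatnonzero(~np.isnan(x))
    n_drop = int(round(drop_rate * observed.size))
    dropped = rng.choice(observed, size=n_drop, replace=False)
    x_in = x.copy()
    x_in[dropped] = np.nan
    drop_mask = np.zeros(x.shape, dtype=bool)
    drop_mask[dropped] = True
    return x_in, drop_mask

def point_metrics(x_true, x_imputed, drop_mask):
    """Point-to-point errors, computed only on the artificially removed values."""
    err = x_imputed[drop_mask] - x_true[drop_mask]
    mse = float(np.mean(err ** 2))
    mae = float(np.mean(np.abs(err)))
    return {"MSE": mse, "MAE": mae, "RMSE": mse ** 0.5, "RMAE": mae ** 0.5}

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 6, 50))           # complete toy signal
x_in, drop_mask = drop_values(x, 0.2, rng)  # remove 20% of the values

# Stand-in imputer (an assumption): linear interpolation over hidden points.
idx = np.arange(x.size)
x_hat = x_in.copy()
x_hat[drop_mask] = np.interp(idx[drop_mask], idx[~drop_mask], x_in[~drop_mask])

scores = point_metrics(x, x_hat, drop_mask)
```

Note that the metrics are only defined on positions whose true values are known, which is why values must be removed before they can be scored.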
MSE-type metrics have a higher tolerance for results that are close to the mean; however, they are sensitive to extreme values, resulting in less attention being given to the overall data distribution. The two statistical metrics are commonly used in statistical analysis and focus on the difference between the total data and the mean, which is a measure of dispersion. More importantly, all of the above metrics require the corresponding true values before they can be calculated, i.e., one must first know the original value before measuring the error. This requirement is unrealistic in practice. When the missing rate of the data is low, these metrics can be used to assess how well the imputation performs. However, when the missing rate of the data is high, the artificially constructed missing values jeopardize the data’s integrity and reduce the amount of valid information in the data.
III-D Imputation Losses
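In practice, the loss directs the model training. However, it is difficult to find directly computable targets for most tasks; hence, approximations are used to achieve the goal. The goal of the missing value imputation task is to minimize the expectation of the loss between the missing and imputed values, as in Equation (7):

$$\min_{\theta} \; \mathbb{E}\left[\mathcal{L}\left(X \odot (1 - M),\; f_{\theta}(X \odot M) \odot (1 - M)\right)\right] \tag{7}$$

where $\mathcal{L}$ represents a loss function, $X$ is the complete time-series data, $M$ is the mask sequence (1 for observed and 0 for missing values), $\odot$ represents element-wise multiplication, and $f_{\theta}$ is an imputation model with parameter $\theta$. $X \odot M$ denotes the observed portion of $X$, and $X \odot (1 - M)$ denotes the missing portion of $X$.
This objective cannot be computed directly, because $X \odot (1 - M)$ is not known. Therefore, the traditional regression method uses the approximate loss shown in Equation (8):

$$\mathcal{L}_{reg} = \left\| X \odot M - f_{\theta}(X \odot M) \odot M \right\|_2 \tag{8}$$

where $\|\cdot\|_2$ represents the two-norm.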
However, T.M. Choi emphasized that the loss used by the traditional regression method does not correspond to the task to be completed. Equation (8) calculates the difference between the regressed and observed values. This loss carries the implicit assumption that when the imputed values are close enough to the real values, the imputation result is satisfactory. Although this loss is simple to calculate, it differs significantly from Equation (7): it calculates the regression loss rather than the imputation loss, which does not match the target well. Therefore, training imputation networks with Equation (8) can be called implicit training. For these reasons, T.M. Choi proposed a training method for explicit training based on random drop imputation with self-training (RDIS). Random drop data are generated by randomly removing existing values in the time-series data. The random drop data is denoted as and , where
The loss function of RDIS can be expressed as follows:
where represents the dropped part of the observations, and represents the remaining part of the observations after dropping.
Compared with Equation (8), Equation (10) is more advanced: it adds a second term to the loss function, making it more similar to Equation (7). It removes part of the data and thereby changes the information that is used as input to the model. This approach is useful for model training when the sample missing rate is low. When the sample missing rate is high, the information in the input data is further destroyed.
Compared to the regression loss, the nonregression method [20, 31] only outputs the missing position values. Therefore, the loss can only be calculated by artificially removing certain observed data. The loss function is shown in Equation (11).
The loss in Equation (11) is a step closer to the objective in Equation (7) than Equation (10): it eliminates the influence of the observed data on the loss and guides model training by directly calculating the loss on the missing data. However, all of these losses face two problems. First, when the sample missing rate is high, adding missing values degrades the original data. Second, when an MSE-like loss is used to guide model training, the imputation result tends to lie very close to the mean because of the loss’s sensitivity to extreme values.
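The difference between a loss computed on all observed positions and a loss computed only on artificially dropped positions can be illustrated with a small sketch. It is written from the prose descriptions of the losses rather than from the original equations, so the exact form (a masked two-norm), the function name, and the toy values are assumptions.

```python
import numpy as np

def masked_two_norm(target, output, positions):
    """Two-norm of (target - output) restricted to the given boolean positions."""
    diff = np.where(positions, target - output, 0.0)
    return float(np.sqrt(np.sum(diff ** 2)))

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])   # NaN marks a truly missing value
observed = ~np.isnan(x)
# One observed value is hidden on purpose before it is fed to the model.
dropped = np.array([False, True, False, False, False])
output = np.array([1.2, 2.5, 3.0, 4.0, 5.0])  # hypothetical model output

# Regression-style loss: every observed position contributes.
regression_style = masked_two_norm(x, output, observed)
# Drop-based loss: only the hidden position contributes.
drop_style = masked_two_norm(x, output, dropped)
```

In both cases the truly missing position (index 2) never enters the loss, which is the limitation the two problems above refer to.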
Overall, the four aspects of work mentioned above are crucial for the missing value imputation task. The model and loss determine the imputation effect, the framework determines how imputation is performed, and the metrics determine how the imputation is judged. In particular, the framework controls the input and output, the model imputes the input data and outputs the result, the loss guides the direction of model training, and the metric is the standard for measuring imputation quality. The four aspects are interconnected, yet each can be improved independently, and an improvement in any of them may benefit the missing value imputation task. This study addresses the three problems raised in Section I from the perspectives of loss function and metrics, experimental design, and framework.
IV Statistical Indexes Variation Loss and Evaluation Indicator
This section proposes a statistical indicator that can be calculated directly on the original data with missing values. This is referred to as statistical indexes variation (SIV). SIV, like the MSE-type indicator, could be used both as a loss to guide the model training as well as an evaluation index to assess the quality of imputation results.
Missing data mechanisms are commonly divided into three types: missing completely at random (MCAR), missing at random, and missing not at random. It is difficult to say which of these assumptions is appropriate for real sensor data with missing values. However, the statistical characteristics of the time-series data with and without missing values can be calculated and compared, and these characteristics should not differ significantly when a small amount of data is missing. Consequently, we begin by presenting four statistical indexes: mean, standard deviation, skewness, and kurtosis, each of which indicates distinct data distribution characteristics. Then, SIV is used to calculate the change in statistical indexes before and after data imputation.
IV-A Statistical Indexes
For any given sequence, its statistical characteristics, namely the mean ($\mu$), the standard deviation ($\sigma$), and the skewness (S) and kurtosis (K) raised to the power of the corresponding order, are calculated from the elements of the sequence using Equations (12)-(15).
The mean ($\mu$) describes the middle point of the sample set, and the standard deviation ($\sigma$) describes the root-mean-square distance between the sample points in the sample set and the mean. The skewness (S) indicates that a distribution “leans” one way or the other and has an asymmetric tail. It reflects how the data is distributed on the two sides of the distribution center. When the skewness is positive, the distribution has a longer right tail; when the skewness is negative, it has a longer left tail. Kurtosis (K) is associated with the distribution’s tail, shoulder, and peak. Generally, the smaller the kurtosis, the flatter the data distribution, and the greater the kurtosis, the more concentrated the data distribution. Just as the standard deviation is a second-order distance, skewness and kurtosis can be thought of as the third- and fourth-order distances from each sample point in the sample set to the mean. We raise S and K to the power of the corresponding order to unify the dimensions.
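The four indexes can be sketched in a few lines. The original equations are not reproduced here, so this sketch rests on stated assumptions: unstandardized central moments, population (1/n) normalization, and the cube root and fourth root as the “power of the corresponding order” that brings S and K back to the unit of the data. The function name is illustrative.

```python
import numpy as np

def statistical_indexes(x):
    """Return [mean, std, S, K] of the non-missing values of x.

    Assumption: S and K are the cube root of the third central moment and
    the fourth root of the fourth central moment, respectively.
    """
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]                     # missing values are skipped
    mu = x.mean()
    centered = x - mu
    sigma = np.sqrt(np.mean(centered ** 2))
    s = np.cbrt(np.mean(centered ** 3))     # cbrt keeps the sign of the moment
    k = np.mean(centered ** 4) ** 0.25
    return np.array([mu, sigma, s, k])

idx = statistical_indexes([1.0, 2.0, 3.0, 4.0, 5.0])
# Symmetric data: the mean is 3, the std is sqrt(2), and S is 0.
```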
Given two sequences, the SIV between them is calculated using Equation (16), whose terms are obtained by Equation (17).
Notably, SIV assesses the differences in statistical features between the two sequences. SIV places no requirement on the sequence length; it can compare two sequences of different lengths. Thus, it can be used to directly measure the difference between a sequence with missing values before and after imputation. Applied to the missing value imputation problem, SIV takes the form of Equation (18),
whose arguments are the observed sequence and the completed sequence following imputation. This equation can be applied as a loss in training objective functions as well as an evaluation index in model selection.
The SIV loss function is notable for its ability to be calculated directly on the original and imputed data. Furthermore, rather than comparing missing values point to point, SIV considers the data distribution characteristics segment by segment. However, it should be noted that these statistical indicators neglect the time information in time-series data. Moreover, when the missing rate is low and the number of imputed values is relatively small, it is simple to validate the effectiveness of SIV. When the missing rate of the data is high, it is unclear whether the statistical indicators before and after imputation will differ significantly.
SIV, as an evaluation indicator, can reflect the quality of the imputation effect to some extent. In particular, we present two SIV indexes in the experiments. The first calculates the SIV of the overall data before and after imputation using all data as an object; the second calculates and sums the SIV of each piece of data before and after imputation using each piece of data as an object. The first result is referred to as “Global SIV”, and the value obtained by the second is referred to as “Local SIV”.
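The global and local variants can be sketched as follows. Because the defining equations of SIV are not reproduced here, the sketch assumes one plausible form, the sum of absolute differences of the four statistical indexes; this form, the moment conventions, and all function names are assumptions rather than the definition used in the original work.

```python
import numpy as np

def stats(x):
    """[mean, std, root-scaled skewness, root-scaled kurtosis], NaNs skipped."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    c = x - x.mean()
    return np.array([x.mean(), np.sqrt(np.mean(c ** 2)),
                     np.cbrt(np.mean(c ** 3)), np.mean(c ** 4) ** 0.25])

def siv(a, b):
    """Assumed form: sum of absolute differences between the indexes."""
    return float(np.sum(np.abs(stats(a) - stats(b))))

def global_siv(observed, completed):
    """SIV over the whole series, before versus after imputation."""
    return siv(observed, completed)

def local_siv(observed, completed, seg_len):
    """Sum of per-segment SIV values over non-overlapping segments."""
    total = 0.0
    for start in range(0, len(observed) - seg_len + 1, seg_len):
        seg_obs = observed[start:start + seg_len]
        if np.all(np.isnan(seg_obs)):
            continue  # nothing observed in this segment to compare against
        total += siv(seg_obs, completed[start:start + seg_len])
    return total

observed = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0])
completed = np.arange(1.0, 9.0)  # the same series after imputation

g = global_siv(observed, completed)
l = local_siv(observed, completed, 4)
```

No value has to be removed from `observed` to compute either quantity, which is the property the text emphasizes.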
The SIV proposal addresses the issue of MSE being overused in the task of missing value imputation. On the one hand, SIV can be used as a loss to participate in the model’s training. SIV, on the other hand, can assess the quality of missing value imputation from a certain perspective. The most important aspect is that SIV can make the missing values in the data participate in the model’s training and evaluation.
V Multistage Imputation Framework
In this section, we present MLSIF for the missing value imputation problem. MLSIF adds a cyclic process of selecting and imputing data to the single-stage imputation method. In particular, MLSIF employs two techniques: multiple stages and a dynamic data length. The multistage design is reflected in the cyclic structure of the framework, in which each cycle is a stage. Each stage imputes the data while keeping the missing rate lower than a preset threshold. The term dynamic data length means that the data length changes at each stage. The MLSIF flowchart is shown in Fig. 2.
Previous experimental results in the literature show that the imputation accuracy decreases as the missing rate of the data increases. Furthermore, Liu demonstrated that an iterative approach improves imputation performance. Therefore, an iterative multistage process is combined with a dynamic data length. This alleviates the shortage of effective information caused by large missing gaps, which makes imputation difficult, if not impossible. Additionally, the data imputed at each stage is the data that is currently easiest to impute. The goal of the dynamic length in MLSIF is thus to keep the missing rate at a low level.
In MLSIF, a mixture loss that combines MSE and SIV is used to guide the model training. Using only the MSE loss causes the model to cluster imputed values around the mean, whereas using only SIV leaves the loss unable to capture temporal characteristics, so that the imputed values are scattered within the data distribution range with no regularity.
V-A Framework Process
In each stage in MLSIF, the missing data are imputed once, and each stage contains the four steps described below.
Step 1: Select the samples whose missing rate is lower than the threshold by Algorithm 1
The goal of this step is to select high-quality samples for subsequent training. A missing rate threshold is introduced so that the selected samples are easier to impute: only the samples with a missing rate less than the threshold are selected.
In Algorithm 1, the input is the data with missing values, and the output is a set of samples (segments of the data) with a missing rate less than the threshold. Additionally, two hyperparameters must be predetermined: the missing rate threshold and the increment of the splitting length. First, initialize the splitting length to 0 and create an empty set as a container for samples with a missing rate less than the threshold. After the splitting length is increased, the data is divided into small segments of that length, called samples. Set the variable . As the set is initially empty, this step is only relevant after the second iteration of the loop. Subsequently, iterate over all samples and add those with a missing rate less than the threshold to the selected set. The loop ends only when the selected set contains missing values.
Step 1 thus selects samples with no missing values as well as samples with a missing rate less than the threshold. These samples are safer to train on and easier to impute than all samples or the samples with a missing rate greater than the threshold.
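The selection loop of Step 1 can be sketched as follows. This is a simplified reading of the prose description, not a transcription of Algorithm 1: the names `threshold` and `step` are assumed for the two hyperparameters, missing values are encoded as NaN, and the intermediate variable update mentioned in the description is omitted because its definition is not available here.

```python
import math
import numpy as np

def select_samples(x, threshold, step):
    """Grow the segment length until a selected sample contains missing values.

    Returns the segment length and the start indexes of the selected samples
    (those whose missing rate is below the threshold).
    """
    T = len(x)
    seg_len = 0
    while seg_len < T:
        seg_len = min(seg_len + step, T)
        n = math.ceil(T / seg_len)
        starts = [min(i * seg_len, T - seg_len) for i in range(n)]
        selected = [s for s in starts
                    if np.isnan(x[s:s + seg_len]).mean() < threshold]
        # Stop once at least one selected sample has something to impute.
        if any(np.isnan(x[s:s + seg_len]).any() for s in selected):
            return seg_len, selected
    return seg_len, []  # no segment length satisfies the threshold

x = np.arange(24, dtype=float)
x[1:4] = np.nan  # one gap of three points

seg_len, starts = select_samples(x, threshold=0.2, step=6)
# Lengths 6 and 12 give the gap a local rate of 0.5 and 0.25; length 18 gives 1/6.
```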
Step 2: Train the selected samples to get the imputation model by Algorithm 2
The goal of this step is to train the imputation model using the samples selected in step 1 and the MSE + SIV loss.
An imputation model is selected. Additionally, the number of training epochs and three parameters are required: the proportion of values dropped during training, the weight between the imputed and observed values in the loss, and the weight between MSE and SIV in the model loss.
In this step, we use the model to learn the data’s characteristics. SIV and MSE are complementary; therefore, we combine them to form the loss that guides model training. The model’s loss is defined in Equation (19),
where a hyperparameter sets the weight between the MSE and SIV.
This algorithm contains two nested loops. The outer loop runs over the training iterations, and the inner loop traverses the training set. For each sample, the first operation drops a portion of the sample’s values according to the drop proportion, which splits the sample into dropped data and remaining data. As the results of previous stages are retained in subsequent stages, the dropped data can be further subdivided into two parts, depending on whether a dropped value was imputed in an earlier stage or originally observed. The model is then used to impute the missing values, and its output at the dropped positions is divided into the two corresponding parts. The purpose of splitting the output is to limit error propagation: every imputation iteration introduces some error, so in multistage imputation the original and previously imputed values should not carry the same weight. A weighting parameter controls the balance between the two parts. Finally, the model parameters are updated by minimizing the loss given in Equation (19).
The innovation of this step is primarily the selection of the loss function. We combine MSE and SIV as the model loss, allowing each to contribute its respective strengths.
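The weighting ideas in this step can be sketched numerically. The exact form of the loss is not reproduced here, so two assumptions are made and should be read as such: the error on dropped values that were imputed in an earlier stage is down-weighted by a factor `beta`, and the MSE and SIV terms are combined as a convex combination with weight `lam`. Both names and forms are illustrative.

```python
import numpy as np

def weighted_mse(target, output, dropped, was_imputed, beta):
    """Squared error on dropped positions.

    Originally observed values get weight 1; values that were themselves
    imputed in an earlier stage get the smaller weight beta.
    """
    w = np.where(was_imputed, beta, 1.0)[dropped]
    err = (output - target)[dropped]
    return float(np.sum(w * err ** 2) / np.sum(w))

def mixture_loss(mse_value, siv_value, lam):
    """Assumed combination: lam * MSE + (1 - lam) * SIV."""
    return lam * mse_value + (1.0 - lam) * siv_value

target = np.array([1.0, 2.0, 3.0, 4.0])
output = np.array([1.0, 2.5, 3.0, 5.0])             # hypothetical model output
dropped = np.array([False, True, False, True])      # hidden during training
was_imputed = np.array([False, False, False, True]) # filled in an earlier stage

mse_value = weighted_mse(target, output, dropped, was_imputed, beta=0.5)
loss = mixture_loss(mse_value, siv_value=0.2, lam=0.7)
```

Here the error of 1.0 on the previously imputed value counts half as much as the error of 0.5 on the observed value, so a possibly wrong earlier imputation pulls the model less strongly.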
Step 3: Impute the missing values of the selected sample by Algorithm 3
The goal of this step is to use the trained model to impute the missing values in the training data.
In Algorithm 3, the inputs are the trained model and the set of selected samples. The model imputes the missing values of the samples in this set, which yields training data with no missing values. Outside of the algorithm, the imputed values then replace the missing values at the corresponding positions of the original data for use in later imputation stages.
Step 4: Are all missing values imputed?
In this step, determine whether all missing values have been imputed and subsequently decide whether to continue the loop. If yes, output the result and end. If no, proceed to Step 1.
V-B Algorithm Summary and Example
MLSIF investigates flexible imputation strategies. Its entire process consists of the four steps listed above, which are all interconnected. Step 1 selects the data for Steps 2 and 3, and Step 2 trains the model only on that data. Step 3 uses the model trained on this data, and only the missing values in the training data can be imputed. In other words, the first step serves as the foundation for all subsequent steps, and both the training and imputing steps are required.
Overall, there is a strong link between the various stages. Algorithm 2 states that the values imputed at previous stages will be provided as information for subsequent stages. This is because imputation becomes more difficult as the number of missing values increases. When the number of missing values decreases, the difficulty of imputation decreases; hence, we begin with low-difficulty tasks first while also providing more information for later high-difficulty imputation tasks; these are the advantages of the multistage and dynamic lengths in MLSIF.
An example is used to explain the entire imputation process in Fig. 3. We are given sensor data with a length of 240. First, segment the data with a length of 24 to obtain 10 pieces of data, i.e., 10 samples, as shown in Fig. 3(a). Then, filter the pieces and use those that meet the condition (missing rate less than 10%) to train the model. In this example, only the second, fourth, and seventh pieces do not meet the condition; therefore, the remaining pieces (marked with * in Fig. 3(a)) are fed into the model for training. After training, impute the selected samples, as shown in Fig. 3(b). As missing values are still present, we proceed to Stage 2.
In Stage 2, after receiving the imputed result from the previous stage, repeat the operation with a longer segmentation length. The segmentation result is shown in Fig. 3(c). After the imputation of this stage is completed, the data no longer contains missing values. Hence, we obtain the final result shown in Fig. 3(d). Only the final result is shown in the experimental section.
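The stage-by-stage behavior in this example can be imitated with a small sketch. It keeps the length of 240, the initial segment length of 24, and the 10% threshold from the example, but it replaces the trained model with linear interpolation so that the code runs on its own; this stand-in, the gap positions, and all function names are assumptions and do not represent the NRTSI-based model.

```python
import math
import numpy as np

def segment_starts(T, seg_len):
    """Start indexes of ceil(T / seg_len) samples; the last one is backfilled."""
    n = math.ceil(T / seg_len)
    return [min(i * seg_len, T - seg_len) for i in range(n)]

def interpolate(seg):
    """Stand-in imputer (an assumption): linear interpolation within a segment."""
    idx = np.arange(seg.size)
    obs = ~np.isnan(seg)
    out = seg.copy()
    out[~obs] = np.interp(idx[~obs], idx[obs], seg[obs])
    return out

def multistage_impute(x, threshold=0.1, step=24):
    """Impute stage by stage; threshold is assumed to lie in (0, 1]."""
    x = x.copy()
    T = len(x)
    stage_lengths = []
    while np.isnan(x).any():
        seg_len, imputed_any = 0, False
        # Grow the segment length until some segment is below the threshold.
        while not imputed_any and seg_len < T:
            seg_len = min(seg_len + step, T)
            for s in segment_starts(T, seg_len):
                seg = x[s:s + seg_len]
                if 0 < np.isnan(seg).mean() < threshold:
                    x[s:s + seg_len] = interpolate(seg)
                    imputed_any = True
        if not imputed_any:
            break  # no segment meets the threshold even at full length
        stage_lengths.append(seg_len)
    return x, stage_lengths

truth = np.arange(240, dtype=float)
data = truth.copy()
data[30:32] = np.nan    # short gap: 2 / 24 is below 10%
data[100:106] = np.nan  # long gap: 6 / 24 and 6 / 48 are above 10%

filled, stages = multistage_impute(data)
# Stage 1 uses length 24 for the short gap; stage 2 needs length 72 for the long one.
```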
The proposed framework is evaluated in this section by comparing it with several baselines on actual sensor data. We select NRTSI as the basic imputation model for the proposed framework. The results are visualized. For the datasets, we use the University of California, Irvine (UCI) air quality dataset and four geological sensor datasets that were collected by physical sensors and uploaded to GitHub (https://github.com/BomBooooo/MLSIF/tree/main).
VI-A UCI Air Quality Dataset
This dataset includes 9358 hourly averaged responses from a set of five metal oxide chemical sensors embedded in an air quality chemical multisensor device. To narrow the gap between experiment and reality when artificially simulating missing values, this study does not remove real data at random. Instead, the missing pattern of a variable with a high missing rate is imposed on variables with a low missing rate. Thus, the imputation result reflects what is required in practice, not just what is better in theory. Specifically, this experiment selects the variables with a low missing rate ("C6H6(GT)", "PT08.S1(CO)", "PT08.S2(NMHC)", "PT08.S3(NOx)", "PT08.S4(NO2)", and "PT08.S5(O3)"), removes their values at the positions where the variable with a high missing rate ("NOx(GT)") is missing, and compares the removed values with the imputed ones.
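The masking strategy described above can be illustrated with a short Python sketch. It assumes that missing entries have already been converted to NaN; the function names `transfer_missing_mask` and `mse_on_dropped` are hypothetical and are not taken from the authors' repository.

```python
import numpy as np


def transfer_missing_mask(target, donor):
    """Remove values of `target` wherever `donor` is missing.

    Returns the corrupted copy and a boolean mask of the positions that were
    actually removed (observed in `target`, missing in `donor`).
    """
    target = np.asarray(target, dtype=float)
    donor = np.asarray(donor, dtype=float)
    if target.shape != donor.shape:
        raise ValueError("target and donor must have the same shape")
    dropped = np.isnan(donor) & ~np.isnan(target)
    corrupted = target.copy()
    corrupted[dropped] = np.nan
    return corrupted, dropped


def mse_on_dropped(truth, imputed, dropped):
    """Mean squared error evaluated only at the artificially removed positions."""
    diff = np.asarray(imputed, dtype=float)[dropped] - np.asarray(truth, dtype=float)[dropped]
    return float(np.mean(diff ** 2))


# Toy columns: `low` has one genuine gap, `high` carries the realistic pattern.
low = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
high = np.array([np.nan, 1.0, np.nan, np.nan, 1.0])
corrupted, dropped = transfer_missing_mask(low, high)
imputed = np.array([1.5, 2.0, 0.0, 4.0, 5.0])  # pretend model output
print(dropped.tolist(), mse_on_dropped(low, imputed, dropped))
```

Only positions with a known ground truth enter the error, so gaps that were already present in the low-missing-rate variable do not distort the comparison.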
First, we evaluate the effectiveness of the proposed multistage framework and the SIV indicator by comparing four settings: imputation without the multistage framework and without SIV (OFOS), imputation without the multistage framework but with SIV (OFWS), imputation with the multistage framework but without SIV (WFOS), and imputation with both the multistage framework and SIV (WFWS).
The imputation results on "C6H6(GT)" and the metrics of all imputation results are shown in Fig. 4 and Table I, respectively. Result diagrams for the other variables are available on GitHub (https://github.com/BomBooooo/MLSIF/tree/main/experiment%201). In Fig. 4, the first subfigure shows the original data, and the remaining four subfigures show the imputation results of OFOS, OFWS, WFOS, and WFWS, respectively.
As shown in Fig. 4, the most obvious phenomenon is the difference between the models with and without the multistage framework. Comparing OFOS and OFWS with WFOS and WFWS shows that, when the imputation model does not use the multistage framework, there is a clear horizontal line near the mean in the imputation result. OFOS and OFWS are rarely imputed by mean in the section of the green wireframe in Fig. 4. One likely explanation is that when the missing length of a sample is similar to or equal to the input length, the model cannot impute it because too little effective information is fed into the model, resulting in an output comparable to the input. This is why a fixed-length model is insufficient for long missing segments. Consequently, the framework's benefit is clear.
Comparing WFOS with WFWS, the model in WFOS is trained using the multistage framework and the MSE loss, and its imputation results are clustered around the mean. Conversely, the imputation results of WFWS, which is trained with the multistage framework and the mixed loss, are almost consistent with the real data distribution. Even when compared with the first subfigure (original data), the differences are difficult to distinguish with the naked eye. In terms of imputation outcomes, WFWS appears superior to the other three models, as shown in Fig. 4.
Furthermore, we examine the quality of the imputation results based on the evaluation indicators. We use MSE, MAE, , , Global SIV, and Local SIV as the comparison indicators. The result is shown in Table I. Except for and indexes, which improve as their value increases, all other indexes improve as their value decreases. The best results have been highlighted in bold. Labels at the end of the values indicate better () or worse () than OFOS (baseline).
Table I. Columns: Dataset, Case, MSE, MAE, Global SIV, Local SIV. The optimal index value is bolded. Labels at the end of the values indicate better () or worse () than OFOS (baseline).
The first four indicators (MSE, MAE, , and ) show similar trends and achieve their optimal values on the same model, except on "C6H6(GT)" and "PT08.S3(NOx)". Most of these indicators are optimal in WFWS, whereas some are optimal in OFOS. Notably, once the multistage framework is introduced, WFOS does not improve significantly over OFOS. One possible explanation is that with the framework the model imputes all missing positions, whereas without it the model imputes only some of them; that is, with the framework the model makes more attempts, and more attempts mean larger losses. Once all missing values are imputed, results similar to or even better than those of OFOS can be achieved, which may help explain the effectiveness of WFWS. We believe that the absence of temporal information in the new metrics is the reason why OFWS is not better than OFOS.
The two metrics proposed in this study are shown in the last two columns. Global SIV calculates the variable of the statistical index of the overall data before and after imputation, whereas Local SIV calculates and sums the variable of the statistical index of each piece of data during the imputation process. The introduction of the multistage framework considerably improves the imputation results on these two metrics. Among them, the most noticeable improvement is in Local SIV, and the value is reduced by dozens. More details about the relationship between the specific trend of these two indicators and other indicators as well as the relationship between these two indicators and the imputation results are further explored in Experiment 2.
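To make the distinction between the two metrics concrete, the following Python sketch computes a global and a local variant of a statistics-based score. It rests on loudly stated assumptions: the four statistics are taken to be mean, standard deviation, skewness, and kurtosis, and the score is the sum of their absolute changes before and after imputation. The paper's exact SIV definition is not restated in this section and may differ; `four_stats`, `siv`, `global_siv`, and `local_siv` are hypothetical names.

```python
import numpy as np


def four_stats(values):
    """Mean, std, skewness, kurtosis of the observed (non-NaN) values (assumed set)."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    mean, std = v.mean(), v.std()
    if std < 1e-12:  # constant or single-point piece: higher moments set to 0
        return np.array([mean, 0.0, 0.0, 0.0])
    z = (v - mean) / std
    return np.array([mean, std, (z ** 3).mean(), (z ** 4).mean()])


def siv(before, after):
    """Assumed score: summed absolute change of the four statistics."""
    return float(np.abs(four_stats(after) - four_stats(before)).sum())


def global_siv(original, imputed):
    """Compare the statistics of the whole series before and after imputation."""
    return siv(original, imputed)


def local_siv(original, imputed, length):
    """Sum the same score over consecutive pieces of the given length."""
    original = np.asarray(original, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    total = 0.0
    for start in range(0, len(original), length):
        piece = original[start:start + length]
        if np.isnan(piece).all():
            continue  # nothing observed, so there is no reference statistic
        total += siv(piece, imputed[start:start + length])
    return total


truth = np.linspace(-1.0, 1.0, 101)
original = truth.copy()
original[1::2] = np.nan  # every other point is missing
mean_fill = np.where(np.isnan(original), np.nanmean(original), original)
print(global_siv(original, mean_fill), global_siv(original, truth))
print(local_siv(original, mean_fill, 20), local_siv(original, truth, 20))
```

In this toy case, filling every gap with the mean shrinks the spread of the data, so both scores come out larger than for a fill that follows the true values, which matches the qualitative behavior described in the text.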
Experiment 1 shows that the framework resolves the insufficient imputation caused by fixed-length, one-stage models. Furthermore, using the mixed loss improves the imputation effect of the model and alleviates the problems of using MSE alone as the model loss.
In this experiment, we explore the impact of the weights between MSE and SIV in the mixed loss. Simultaneously, the relationship between SIV and MSE as metrics is examined by analyzing the imputation results. Here, we present and analyze the imputation results on "PT08.S2(NMHC)", as shown in Figs. 5 and 6, which show the model's imputation details for small and large missing gaps, respectively. Results on the other datasets can be found on GitHub (https://github.com/BomBooooo/MLSIF/tree/main/experiment%202). In Figs. 5 and 6, the blue points represent the real data, the red points represent the imputed data, and the orange points represent the removed data. The distributions of the original and imputed data are represented by the blue and red lines on the right side of each image, respectively. How each index varies with the SIV weight is shown in Fig. 7, where the dots circled in red mark the locations of the optimal values.
In Fig. 5, the most noticeable feature is a horizontal line close to the mean throughout the imputation results. This phenomenon is alleviated when the SIV weight is greater than 0.98. The same phenomenon can also be observed in the data distribution diagram. When the weight is low, the imputed data cluster around the mean, and they gradually spread out as the weight increases. When , the distribution of the imputed values is almost identical to that of the original data.
In Fig. 6, there are large missing gaps in the data. The most obvious phenomenon in the distribution map is that the imputed data become more dispersed as the SIV weight increases. When the weight is low, the imputed data tend to cluster around the mean. As the weight increases, the data become more dispersed, and when it equals 1, the data are completely dispersed, with a degree of dispersion close to that of the original data. Of the two losses, the MSE-type loss can capture certain temporal characteristics but is prone to making imputations around the mean, whereas SIV cannot capture temporal characteristics but allows the model to learn the dispersion of the data; therefore, combining the two yields better results. Figures 5 and 6 show that most models perform well with small missing gaps, whereas MLSIF also handles large missing gaps well.
In Fig. 7(a)-(d), the changing trends of these indicators are consistent. Generally, as the SIV weight increases, the metrics first improve and then deteriorate. When the weight is small, the changes in these indicators are subtle, and most optimal values are found between 0.8 and 0.98. When the weight exceeds 0.98, the indicators rapidly deteriorate. The reason is that when the weight is large, MSE plays an insignificant role in the loss and SIV dominates, so the imputed data are scattered within a certain range and these indicators take large values. One thing is certain: using SIV to form the mixed loss improves the results on these metrics; however, using only SIV makes the result worse than using only MSE.
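The trade-off between the two loss terms can be illustrated with a small numerical sketch. It assumes that the mixed loss is a convex combination `(1 - weight) * MSE + weight * SIV` and reuses an assumed SIV stand-in (summed absolute change of mean, standard deviation, skewness, and kurtosis); neither assumption is confirmed by this section, and `moments`, `siv_term`, and `mixed_loss` are hypothetical names.

```python
import numpy as np


def moments(values):
    """Mean, std, skewness, kurtosis (assumed statistics behind the SIV term)."""
    v = np.asarray(values, dtype=float)
    mean, std = v.mean(), v.std()
    if std < 1e-12:  # flat prediction: higher moments set to 0
        return np.array([mean, 0.0, 0.0, 0.0])
    z = (v - mean) / std
    return np.array([mean, std, (z ** 3).mean(), (z ** 4).mean()])


def siv_term(pred, target):
    """Distribution-level discrepancy; blind to the ordering of the points."""
    return float(np.abs(moments(pred) - moments(target)).sum())


def mixed_loss(pred, target, weight):
    """Assumed form: (1 - weight) * MSE + weight * SIV, with weight in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = float(np.mean((pred - target) ** 2))
    return (1.0 - weight) * mse + weight * siv_term(pred, target)


x = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
target = np.sin(x)
flat = np.full_like(target, target.mean())  # "imputation around the mean"
shifted = np.cos(x)                         # right distribution, wrong timing

for w in (0.0, 0.9, 1.0):
    print(w, mixed_loss(flat, target, w), mixed_loss(shifted, target, w))
```

With a weight of 0 the flat prediction has the lower loss, and with a weight of 1 the time-shifted prediction does, which mirrors the observation that MSE favors values near the mean while SIV alone ignores temporal structure.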
In Fig. 7(e) and (f), the Global SIV and Local SIV metrics show a clear downward trend. This result is consistent with expectations: as the proportion of the SIV loss increases, the value of the final imputed result measured by SIV decreases. Consequently, unlike the others, these two indicators do not change significantly at . The optimal values are all obtained at around . At this point, SIV plays a dominant role in the loss. This shows that between and , MSE and SIV can be traded off, resulting in both being relatively low.
By observing and analyzing the experimental results, we discovered that the Global SIV is more concerned with the difference between the imputed and original value. The Global SIV is not particularly large if there are few points that deviate from the discrete degree of the original data. The Local SIV calculates the changes in four statistical indicators for each small sample. Consequently, when the Global SIV is smaller, the data appear more compact. Furthermore, when the Local SIV is smaller, the data are more coherent. Therefore, the Local SIV can be used to assess the consistency between the imputed and true value.
In summary, Experiment 2 examines the mixed loss further and investigates its impact on various indicators and imputation results. The experimental results show that using the mixed loss is better than using the MSE or the SIV loss alone in all metrics. Furthermore, most of the optimal values are obtained at around . The six measurement methods correspond to three types of measurement angles. The first type (MSE, MAE) focuses on the difference between each imputed point and the corresponding removed point. The second type () focuses on the distribution between the imputed data and the mean of the dropped data. The third type (SIV) focuses on the distribution difference between the original and imputed data.
VI-B Physical Sensor Data
Due to the difficulty of acquiring and maintaining geological sensor data, there are frequently a significant number of missing values, making the subsequent analysis and research difficult. The sensitive information in the timestamps and numerical values of these data has been processed, and the data have been uploaded to GitHub (https://github.com/BomBooooo/MLSIF/tree/main/).
The selected data in this experiment, named 35717443_temp, contains more than 20,000 data points and has approximately 3% missing values. In this section, we test the framework’s performance on these data with different missing rate scenarios. In contrast to other methods for randomly constructing missing values, the strategy used in this study is to construct large segments of missing values, i.e., a random position is selected and a portion of the original data is removed near this position. We compare the following methods when confronted with such a dataset:
- ImputeTS (from the R package): we choose two of its methods, namely linear interpolation and Kalman smoothing on a structural model, named na.interpolation and na.kalman, respectively.
- BRITS and BRITS-I: methods based on recurrent neural networks for missing value imputation in time-series data.
- CSDI: a time-series imputation method that utilizes score-based diffusion models conditioned on observed data.
- NAOMI: a nonautoregressive approach to impute long-range sequences given arbitrary missing patterns.
- NRTSI: a method that reformulates time series as permutation-equivariant sets and does not impose any recurrent structure to impute missing data.
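The segment-wise removal strategy described before the list of baselines (pick a random position, then remove a portion of the data near it) can be sketched as follows. This is an illustrative sketch only: the function name `drop_segments`, the uniform draw of segment lengths, and the `max_length` parameter are assumptions rather than the authors' procedure.

```python
import numpy as np


def drop_segments(series, target_rate, max_length, seed=0):
    """Remove contiguous segments at random positions until the overall
    missing rate reaches `target_rate` (a small overshoot is possible)."""
    if not 0.0 <= target_rate < 1.0:
        raise ValueError("target_rate must lie in [0, 1)")
    rng = np.random.default_rng(seed)
    data = np.asarray(series, dtype=float).copy()
    n = len(data)
    while np.isnan(data).mean() < target_rate:
        center = int(rng.integers(0, n))               # random position
        length = int(rng.integers(1, max_length + 1))  # assumed: uniform length
        lo = max(0, center - length // 2)
        data[lo:lo + length] = np.nan                  # remove data near it
    return data


x = np.arange(1000, dtype=float)
gappy = drop_segments(x, target_rate=0.3, max_length=50, seed=1)
print(np.isnan(gappy).mean())
```

Because whole segments are removed at once, the final missing rate can exceed the target by at most `max_length / len(series)`; values that were already missing in the input count toward the rate.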
The imputation results and corresponding metrics (MSE and Global SIV) are shown in Fig. 8 and Table II, respectively. In Fig. 8, only the results of deep learning methods in data imputation are shown due to space constraints, whereas the metrics for all methods are displayed in Table II.
It can be observed from Fig. 8 that compared with other deep learning methods, MLSIF has a good imputation ability in the face of small- and large-missing segments. Especially when dealing with large-missing segments, MLSIF can still impute the missing parts according to the known data. Other deep learning methods fail when confronted with the same problem, imputing either the same or random values. The results of the metrics in Table II are consistent with the intuition. MLSIF can almost achieve the best results under both metrics when the missing rate is greater than 20%. Other methods are still competitive when the missing rate is less than 10%. It is also worth noting that, in most cases, the effect of deep learning is better than that of statistical methods in terms of indicators.
In this experiment, we apply the model to three real geological sensor datasets with missing values (45710421_x, 45710421_y, and 45710422_x). There are approximately 20,000 timestamps in each dataset used in this experiment, with a missing rate greater than 30%.
The experimental results are shown in Fig. 9. Overall, the imputation results of the statistical methods are very good when small segments are missing and are almost indistinguishable from the original data. However, traces of imputation can be seen at the positions where large segments are missing. For the other deep learning methods, a sample that contains no real values makes model training unstable, so this type of sample is removed during training. The results show that although the validation loss decreases (down to the 0.01 level), the imputation results do not improve. As in the previous experiment, when faced with a large segment of missing values, the other deep learning methods impute either the same value or random values around a certain value.
MLSIF not only produces good results for small missing gaps but also successfully imputes large missing gaps. Where small segments are missing, the MLSIF imputation results are almost consistent with the original data distribution. In the large missing gap in the middle, the imputation results closely follow the trend of the original data and adhere to its distribution.
Table III shows the Global SIV on the three geosensor datasets. As these datasets already contain many missing values, removing more data would compromise the data's integrity, which also prevents MSE, MAE, , and from being calculated. Local SIV is excluded from the table because only MLSIF can calculate this metric.
The application of MLSIF to practical problems is reflected in Experiment 4. It has been discovered that many models fail to produce good imputation results when faced with significant missing data gaps; however, MLSIF can handle this problem and produce good results.
VI-C Computational Efficiency Analysis
Unlike a single train-and-impute framework, MLSIF requires many cycles of training and imputing. Taking the six datasets in Experiment 3 as an example, they must go through 12, 59, 108, 151, 200, and 246 cycles of training and imputing, respectively, to obtain the results. Compared with a single-stage model, the running time increases with the number of cycles. In practice, the following tricks help save time:
For models that are not sensitive to sequence length, such as NRTSI , the number of training iterations can be reduced by inheriting, in each stage, the parameters of the previous stage (the measure taken in this paper);
For models that are sensitive to sequence length, such as NAOMI , one can reduce a certain number of training iterations by inheriting the same input length model parameters.
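The two parameter-inheritance tricks can be expressed as a small cache that decides which previously trained parameters a new stage starts from. The sketch below is a hypothetical illustration: the class `WarmStartCache`, its methods, and the use of a plain dictionary as a stand-in for model parameters are all assumptions and do not correspond to code in the authors' repository.

```python
from typing import Dict, Optional


class WarmStartCache:
    """Chooses the initial parameters for a stage from earlier stages.

    length_sensitive=False: always inherit the most recent parameters.
    length_sensitive=True:  inherit only from a stage with the same input length.
    """

    def __init__(self, length_sensitive: bool) -> None:
        self.length_sensitive = length_sensitive
        self._by_length: Dict[int, dict] = {}
        self._latest: Optional[dict] = None

    def initial_params(self, length: int) -> Optional[dict]:
        """Return parameters to start from, or None to train from scratch."""
        if self.length_sensitive:
            return self._by_length.get(length)
        return self._latest

    def store(self, length: int, params: dict) -> None:
        """Record the parameters obtained at the end of a stage."""
        self._by_length[length] = params
        self._latest = params


# Length-insensitive model: stage 2 (length 48) reuses stage 1 (length 24).
cache = WarmStartCache(length_sensitive=False)
cache.store(24, {"stage": 1})
print(cache.initial_params(48))

# Length-sensitive model: only a stage with the same length can be reused.
strict = WarmStartCache(length_sensitive=True)
strict.store(24, {"stage": 1})
print(strict.initial_params(48), strict.initial_params(24))
```

In either mode a return value of `None` signals that no suitable parameters exist yet and that the stage has to be trained from scratch.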
Generally, statistical methods are much more computationally efficient than deep learning methods; however, they sacrifice a certain level of accuracy. In contrast, deep learning methods trade time for precision, and MLSIF spends even more time to achieve higher precision.
This study proposes SIV and MLSIF for sensor data missing value imputation. The SIV loss improves the imputation models, and the SIV metric measures the effectiveness of the imputation. MLSIF uses a multistage imputation method, which uses the imputation results of previous stages as knowledge to facilitate the learning of the model and improve the imputation effect. During the imputation process, the framework dynamically adjusts the data length according to the unimputed data, which allows it to adapt to tasks with different degrees of missingness. In the experimental design, we moved beyond the usual assumptions by simulating the missing patterns of real data, so that the verification process does not deviate from reality. This paper only discusses one-dimensional sensor data; extending the proposed method to multidimensional data imputation should be researched in future studies. Additionally, more measurement methods based on missing values must be explored.
-  (2011) Multiple imputation by chained equations: what is it and how does it work?. International journal of methods in psychiatric research 20 (1), pp. 40–49. Cited by: §I.
-  (2017) Univariate and multivariate skewness and kurtosis for measuring nonnormality: prevalence, influence and estimation. Behavior research methods 49 (5), pp. 1716–1735. Cited by: §IV-A.
-  (2018) Brits: bidirectional recurrent imputation for time series. arXiv preprint arXiv:1805.10572. Cited by: §I, §I, §III-A, 3rd item.
-  (2014) A vision of iot: applications, challenges, and opportunities with china perspective. IEEE Internet of Things journal 1 (4), pp. 349–359. Cited by: §I.
-  (2020) RDIS: random drop imputation with self-training for incomplete time series data. arXiv preprint arXiv:2010.10075. Cited by: §III-D.
-  Sequence-to-sequence imputation of missing sensor data. Australasian Joint Conference on Artificial Intelligence, pp. 265–276. Cited by: §I, §I, §III-A.
-  (2008) On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators B: Chemical 129 (2), pp. 750–757. Cited by: §VI-A, §VI.
-  (2020) Time series data imputation: a survey on deep learning approaches. arXiv preprint arXiv:2011.11347. Cited by: §I.
-  (2007) A novel framework for imputation of missing values in databases. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 37 (5), pp. 692–709. Cited by: §III-B, §V.
-  (1994) Supervised learning from incomplete data via an em approach. In Advances in neural information processing systems, pp. 120–127. Cited by: §I.
-  (2019) A data imputation method for multivariate time series based on generative adversarial network. Neurocomputing 360, pp. 185–197. Cited by: §I, §I, §III-A.
-  (2020) Time-series imputation and prediction with bi-directional generative adversarial networks. arXiv preprint arXiv:2009.08900. Cited by: §I, §I, §III-A.
-  (2021) DLIN: deep ladder imputation network. IEEE Transactions on Cybernetics. Cited by: §IV.
-  (2008) Nearest neighbor imputation of species-level, plot-scale forest structure attributes from lidar data. Remote Sensing of Environment 112 (5), pp. 2232–2245. Cited by: §I.
-  (2004) Methods for imputation of missing values in air quality data sets. Atmospheric Environment 38 (18), pp. 2895–2907. Cited by: §I, §I, §III-C, §III-D.
-  (2019) Misgan: learning from incomplete data with generative adversarial networks. arXiv preprint arXiv:1902.09599. Cited by: §I, §I, §III-A.
-  (2020) Missing value imputation: a review and analysis of the literature (2006–2017). Artificial Intelligence Review 53 (2), pp. 1487–1509. Cited by: §I.
-  (2016) Modeling missing data in clinical time series with rnns. Machine Learning for Healthcare 56. Cited by: §I, §I, §III-A.
-  (2020) Missing value imputation for industrial iot sensor data with large gaps. IEEE Internet of Things Journal 7 (8), pp. 6855–6867. Cited by: §I, §V.
-  (2019) Naomi: non-autoregressive multiresolution sequence imputation. arXiv preprint arXiv:1901.10946. Cited by: §I, §I, §III-A, §III-D, 5th item, 2nd item.
-  (2018) Multivariate time series imputation with generative adversarial networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 1603–1614. Cited by: §I, §I, §III-A.
-  (2019) E2gan: end-to-end generative adversarial network for multivariate time series imputation. In AAAI Press, pp. 3094–3100. Cited by: §I, §I, §III-A.
-  (2019) CDSA: cross-dimensional self-attention for multivariate, geo-tagged time series imputation. arXiv preprint arXiv:1905.09904. Cited by: §I, §I, §III-A.
-  (2019) End-to-end incomplete time-series modeling from linear memory of latent variables. IEEE transactions on cybernetics 50 (12), pp. 4908–4920. Cited by: §I, §I.
-  Generative semi-supervised learning for multivariate time series imputation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 8983–8991. Cited by: §I, §I, §III-A.
-  (2017) ImputeTS: time series missing value imputation in r.. R J. 9 (1), pp. 207. Cited by: 2nd item.
-  Scalable tensor factorizations with missing data. Technical report, Sandia National Laboratories. Cited by: §I.
-  (2021) Uncertainty-aware variational-recurrent imputation network for clinical time series. IEEE Transactions on Cybernetics. Cited by: §I, §I.
-  (2014) Fimus: a framework for imputing missing values using co-appearance, correlation and similarity analysis. Knowledge-Based Systems 56, pp. 311–327. Cited by: §III-B.
-  (1976) Inference and missing data. Biometrika 63 (3), pp. 581–592. Cited by: §IV.
-  (2021) NRTSI: non-recurrent time series imputation. arXiv preprint arXiv:2102.03340. Cited by: §I, §I, §I, §III-A, §III-D, §V-A, 6th item, 1st item, §VI.
-  (2018) End-to-end time series imputation via residual short paths. In Asian conference on machine learning, pp. 248–263. Cited by: §III-A.
-  (2015) SCREEN: stream data cleaning under speed constraints. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 827–841. Cited by: §I.
-  (2014) Research directions for the internet of things. IEEE internet of things journal 1 (1), pp. 3–9. Cited by: §I.
-  (2019) Recurrent imputation for multivariate time series with missing values. In 2019 IEEE International Conference on Healthcare Informatics (ICHI), pp. 1–3. Cited by: §I, §I, §III-A.
-  (2020) GLIMA: global and local time series imputation with multi-directional attention learning. In 2020 IEEE International Conference on Big Data (Big Data), pp. 798–807. Cited by: §I, §I, §III-A.
-  (2021) CSDI: conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems 34, pp. 24804–24816. Cited by: 4th item.
-  (2011) Mice: multivariate imputation by chained equations in r. Journal of statistical software 45, pp. 1–67. Cited by: §I.
-  (2018) Gain: missing data imputation using generative adversarial nets. In International Conference on Machine Learning, pp. 5689–5698. Cited by: §I, §I, §III-A.
-  (2017) Multi-directional recurrent neural networks: a novel method for estimating missing data. In Time Series Workshop at the 34th International Conference on Machine, pp. 1–5. Cited by: §I, §I, §III-A.
-  (2005) Zoo: s3 infrastructure for regular and irregular time series. arXiv preprint math/0505527. Cited by: 1st item.
-  Time series data cleaning: from anomaly detection to anomaly repairing. Proceedings of the VLDB Endowment 10 (10), pp. 1046–1057. Cited by: §I.
-  (2016) Sequential data cleaning: a statistical approach. In Proceedings of the 2016 International Conference on Management of Data, pp. 909–924. Cited by: §I.
-  (2019) SSIM—a deep learning approach for recovering missing time series sensor data. IEEE Internet of Things Journal 6 (4), pp. 6618–6628. Cited by: §III-A.
VIII Biography Section
Jin-Sheng Yang received a bachelor’s degree from Sichuan University in 2016. Currently, he is a postgraduate student at Hainan University. His main research direction is time series data analysis.
Yuan-Hai Shao received his B.S. degree in information and computing science from College of Mathematics, Jilin University, a master’s degree in applied mathematics, and a Ph.D. degree in operations research and management in College of Science from China Agricultural University, China, in 2006, 2008 and 2011, respectively. Currently, he is a Full Professor at the School of Management, Hainan University, Haikou, China. His research interests include support vector machines, optimization methods, machine learning and data mining. He has published over 100 refereed papers on these areas, including IEEE TPAMI, IEEE TNNLS, IEEE TC, PR, and NN.
Chun-Na Li received her Master's and Ph.D. degrees from the Department of Mathematics, Harbin Institute of Technology, China, in 2009 and 2012, respectively. Currently, she is a professor at the Management School, Hainan University. Her research interests include optimization methods, machine learning and data mining.
Wen-si Wang received his Master’s and Ph.D. degrees in Microelectronics from Tyndall National Institute, Republic of Ireland, in 2008 and 2012, respectively. He was a visiting scholar with the Georgia Institute of Technology, Atlanta in 2012. From 2013 to 2015, he was with Tyndall National Institute as Post-doc and Assistant Researcher. Since 2015, he has been with the Beijing University of Technology as an Associate Professor. He is also the co-founder of a medical R&D company SuperVision with its focus on A.I. in medical applications. He has published over 30 papers and filed over 20 patents in this area.