Soft Sensing Transformer: Hundreds of Sensors are Worth a Single Word

by   Chao Zhang, et al.
Seagate Technology LLC

With the rapid development of AI technology in recent years, there have been many studies applying deep learning models in the soft sensing area. However, while the models have become more complex, the data sets remain limited: researchers are fitting million-parameter models with hundreds of data samples, which is insufficient to exercise the effectiveness of their models, and the models thus often fail to perform when implemented in industrial applications. To solve this long-lasting problem, we are providing large-scale, high-dimensional time-series manufacturing sensor data from Seagate Technology to the public. We demonstrate the challenges and effectiveness of modeling industrial big data with a Soft Sensing Transformer model on these data sets. The Transformer is used because it has outperformed state-of-the-art techniques in Natural Language Processing, and has since also performed well when applied directly to computer vision without the introduction of image-specific inductive biases. We observe the similarity of a sentence structure to the sensor readings and process the multi-variable sensor readings in a time series in a manner similar to sentences in natural language. The high-dimensional time-series data is formatted into the same shape as embedded sentences and fed into the transformer model. The results show that the transformer model outperforms the benchmark models in the soft sensing field, which are based on auto-encoder and long short-term memory (LSTM) architectures. To the best of our knowledge, we are the first team in academia or industry to benchmark the performance of the original transformer model on large-scale numerical soft sensing data.




1 Introduction

In the last decades, the development of smart sensors has attracted much attention from government, academia, and industry. The European Union's 20-20-20 goals (a 20% increase in energy efficiency, a 20% reduction of CO2 emissions, and 20% renewable energy by 2020) rely on smart metering as one of their key enablers. Smart meters usually involve real-time or near real-time sensing, notification, and monitoring. In 2013, Germany proposed the concept of Industry 4.0, the main aim of which is to develop smart factories for producing smart products. In September 2020, the US government announced that it is providing more than $1 billion towards establishing research institutes and hubs for Industry 4.0 technologies. Singapore's current five-year US$13.8 billion R&D plan is injecting more funds into expanding fields such as advanced manufacturing. The goal of China's "Made in China 2025" initiative is likewise to make manufacturing processes more intelligent. These initiatives require better sensing technologies to understand and drive our processes. Sensors have the potential to contain information about process variables which can be exploited by data-driven techniques for smarter monitoring and control of manufacturing processes. Soft sensing is the general term for the approaches and algorithms used to estimate or predict physical quantities or product quality in industrial processes based on the available sensing modalities, measurements, and knowledge.

As industrial processes have become more complicated and the size of available data has increased dramatically, there has been a growing body of research on deep learning methods with applications in the soft sensing field. A recent survey on deep learning methods for soft sensing [28] has illustrated the significance of deep learning applications and reviewed the most recent studies in this field. The deep learning models are mostly based on autoencoders, restricted Boltzmann machines, convolutional neural networks, and recurrent neural networks [23]. The applications vary from traditional factories [30] to wearable IoT devices [22, 21].

There has been a variety of novel deep learning models, such as variational autoencoder models that attempt to enhance the representation ability or augment the data [12, 13], a semi-supervised ensemble learning model that quantifies the contribution of different hidden layers in a stacked autoencoder [27], and a gated convolutional transformer neural network that combines several state-of-the-art algorithms to deal with time-series data [9]. As deep learning models become more and more complex, their capability to handle complex processes and large data sets also increases. However, in these studies researchers are still using very small data sets, such as the wastewater treatment plant and debutanizer column data [25, 7], containing low-dimensional data with only hundreds to thousands of samples. These small data sets are not sufficient to illustrate the effectiveness of such advanced deep learning models with millions of parameters. To solve this issue, we collected gigabytes of numerical sensor data from Seagate's wafer manufacturing factories in the USA and Ireland. These data sets contain high-dimensional time-series sensor data collected directly from the Seagate wafer factories with only the necessary anonymization; they are big, complex, noisy, and impossible for humans to interpret in their raw form. In this article, we evaluate a soft sensing transformer model against the methods most commonly applied to soft sensing problems, including models based on autoencoders and LSTMs [11]. The key components of the original transformer model are maintained, and the other parts of the architecture are modified to fit our data sets and tasks.

The Transformer, since its proposal in 2017 [29], together with its derivatives such as BERT [4], has been the most active research topic in the natural language processing (NLP) field as well as the top performer in many NLP tasks [17]. Due to its extraordinary representational capability, the transformer model has also shown equally good performance in the computer vision area [10]. First proposed in 2020, the vision transformer [5] and its variants have achieved state-of-the-art performance on many computer vision benchmarks such as image classification, semantic segmentation, and object detection [19, 31].

From texts in NLP, which can be regarded as categorical data, to images (two-dimensional integer values) in computer vision, a natural further extension is soft sensing data, which consists of time series of continuous floating-point numbers. While the Bayes error rate [8] in NLP and computer vision tasks is usually defined as human-level performance, our soft sensing task is impossible for a human to classify based on the hundreds of sensor values. We show in this paper that the Transformer architecture works well not only for natural language and images but also for numerical data, and that it is able to represent data that is not interpretable by humans.

The rest of this paper is organized as follows: we discuss the soft sensing transformer model in Section 2, several industrial soft sensing data sets in Section 3, the results of the soft sensing transformer on these data sets in Section 4, and discussions and conclusions in Section 5.

2 Methodology

While implementing the soft sensing model, we follow the original transformer architecture as closely as possible. The input module of the model is modified to fit the time-series sensor data, and the output module is modified for multi-task classification problems. This is the first study to provide benchmark results on these large-scale sensor data sets with deep learning methods, and also the first to apply a transformer model to large-scale numerical sensor data.

Fig. 1: Architecture of soft sensing transformer model

2.1 Soft Sensing Transformer (SST)

We illustrate the structure of the soft sensing transformer model in Fig. 1. Given that the data format of time-series sensor data differs from that of texts, we use a dense layer for the embedding at the starting point, which reduces the dimension of the high-dimensional sensor input. After this layer, the data format is the same as that of embedded sentences, so it can be fed into the transformer encoder without any modification. Right before the encoder block, a positional encoding using sine and cosine functions of different frequencies is added, as in Equation 1 and Equation 2, to encode the relative positions of different time steps:

$PE_{(pos,\,2i)} = \sin\left(pos/10000^{2i/d_{model}}\right)$ (1)

$PE_{(pos,\,2i+1)} = \cos\left(pos/10000^{2i/d_{model}}\right)$ (2)

where $PE$ stands for the positional encoding, $pos$ is the position of a time step, and $d_{model}$ is the dimension of the embedded vectors.
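The embedding-plus-positional-encoding step of Equations 1 and 2 can be sketched in a few lines of NumPy. The toy input below stands in for the output of the dense embedding layer; the shapes are illustrative only.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sine/cosine positional encoding (Equations 1 and 2)."""
    pos = np.arange(seq_len)[:, None]         # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]      # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)               # even dimensions
    pe[:, 1::2] = np.cos(angle)               # odd dimensions
    return pe

# A toy "embedded" sensor sequence: 2 time steps, d_model = 8.
x = np.random.randn(2, 8)
x_pe = x + positional_encoding(2, 8)          # ready for the encoder
```

Note that at position 0 the encoding is 0 on even dimensions and 1 on odd dimensions, since sin(0) = 0 and cos(0) = 1.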


Since the SST model requires a fixed input size, we add padding to samples with too few time steps so that each sample has the same time length. The time length is chosen as the 99th percentile of the sequence lengths in the raw data, to cover most of the data while excluding outliers. The corresponding padding masks are also applied. In the encoder, multi-head scaled dot-product attention, feed-forward layers, and residual connections are set up in the same way as in the original transformer paper [29]. The multi-head attention is described in Equation 3: the query $Q$, key $K$, and value $V$ are projected to $h$ heads with the weight matrices $W_i^Q$, $W_i^K$, $W_i^V$. Each head has a dimension of $d_k = d_{model}/h$, and a scaled dot-product attention is calculated for each head. The heads are then concatenated and projected back to the original shape with $W^O$:

$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^O,\quad \mathrm{head}_i = \mathrm{softmax}\!\left(\frac{QW_i^Q\,(KW_i^K)^T}{\sqrt{d_k}}\right)VW_i^V$ (3)
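The multi-head attention of Equation 3 can be sketched in NumPy as follows. The weight matrices here are random placeholders and padding masks are omitted for brevity; this is an illustration of the computation, not the trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, h, rng):
    """Project X into h heads, attend, concatenate, project back."""
    d_model = X.shape[-1]
    d_k = d_model // h
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        heads.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.standard_normal((d_model, d_model))
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 16))              # 2 time steps, d_model = 16
out = multi_head_attention(X, h=4, rng=rng)   # same shape as the input
```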


The Seagate data sets contain measurement pass/fail information, and the SST model is built as a classification model. After the encoder blocks, a multi-layer perceptron (MLP) classifier is attached on top, following a global average pooling. Because of the intrinsic complexity of the data, the classifier comprises a few individual binary classifiers. These binary classifiers partly share the input data and may be correlated with each other, resulting in an inter-correlated multi-task problem (further discussed in Section 3). In order to achieve the best performance in this multi-task learning setting, a weighting method based on uncertainty [14] is applied, and we define the combined loss function as Equation 4:

$\mathcal{L} = \sum_i \frac{1}{\sigma_i^2}\,\mathcal{L}_i + \log\sigma_i$ (4)

where $\mathcal{L}$ is the total loss, $\mathcal{L}_i$ is the loss of the $i$-th classification task, and $\sigma_i$ is the uncertainty of the $i$-th classification loss, which is trainable during model fitting.
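The uncertainty-based weighting can be sketched as below. This follows the standard form from Kendall et al. [14], i.e. each task loss scaled by 1/σ² plus a log σ regularizer; parameterizing by log σ (as is common in practice) keeps σ positive, though the paper's exact parameterization is an assumption here.

```python
import numpy as np

def combined_loss(task_losses, log_sigmas):
    """Uncertainty-weighted multi-task loss [14]:
    L = sum_i L_i / sigma_i^2 + log(sigma_i).
    log_sigmas would be trainable parameters in the real model."""
    task_losses = np.asarray(task_losses, dtype=float)
    log_sigmas = np.asarray(log_sigmas, dtype=float)
    sigmas2 = np.exp(2.0 * log_sigmas)        # sigma_i^2, always positive
    return float(np.sum(task_losses / sigmas2 + log_sigmas))

# Two tasks: with sigma = 1 (log sigma = 0) the combined loss
# reduces to the plain sum of the task losses.
loss = combined_loss([0.7, 0.3], [0.0, 0.0])
```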

2.2 Optimization

Data imbalance

In industrial settings, the data are highly imbalanced: only 1% to 2% of our data samples are positive. To deal with this imbalance, we experimented with both weighting methods and data sampling algorithms such as SMOTE [2]. We found that class weighting gives the best efficiency and performance in our experiments. The weight of the $i$-th task for label $j$ ($0$ or $1$) is calculated based on the number of samples:

$w_{ij} = \frac{N}{T\, n_{ij}}$ (5)

in which $N$ is the total number of samples, $T$ is the number of tasks, and $n_{ij}$ is the number of samples for label $j$ in the $i$-th task.
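The class weighting can be sketched as an inverse-frequency computation. The exact normalization constant in the paper's class-weight equation was not fully recoverable from the text, so the form N / (T · n_ij) is an assumption here; the counts in the example are taken from Table II (P1, tasks 1-2).

```python
import numpy as np

def class_weights(counts):
    """Inverse-frequency weights, one per (task, label):
    w_ij = N / (T * n_ij), with N the total sample count and
    T the number of tasks (normalization assumed)."""
    counts = np.asarray(counts, dtype=float)  # shape (T, 2): [neg, pos] per task
    N = counts.sum()
    T = counts.shape[0]
    return N / (T * counts)

# Counts from Table II (P1, tasks 1-2): columns are [negative, positive].
w = class_weights([[8328, 295], [12747, 40]])
# Rare positive labels receive much larger weights than negatives.
```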

Combined with the uncertainty-based multi-task learning of Equation 4, the final loss function of the SST model is defined as a weighted cross entropy:

$\mathcal{L} = -\sum_{i=1}^{T}\sum_{j\in\{0,1\}}\frac{1}{\sigma_{ij}^2}\sum_{k} w_{ij}\, y_{ijk}\log\hat{y}_{ijk} + \log\sigma_{ij}$ (6)

where $y_{ijk}$ and $\hat{y}_{ijk}$ are the true label and predicted probability for the $k$-th sample in task $i$ for label $j$. Note that the cross entropy loss is calculated in a multi-label classification manner, and the losses for positive and negative cases are computed separately: we take $j=1$ for positive samples and $j=0$ for negative samples. The weights for the positive and negative cases within a single binary classification task are further tuned by $\sigma_{ij}^2$, the uncertainty (or variance) of the loss for label $j$ in task $i$. In this multi-task learning setting, the positive and negative sides of each binary classification thus act as separate 'tasks'.

Activation functions

For the transformer encoder, a ReLU activation function [20] is applied in the feed-forward layer, which consists of two dense layers that project the $d_{model}$-dimensional vector to dimension $d_{ff}$ and back to dimension $d_{model}$, respectively; the ReLU activation follows the first of these two dense layers. For the MLP classifier, we apply sigmoid activation functions to all three layers, because we found this produced more stable results than ReLU in this case.


Regularization

L2 regularizers with a common regularization factor are applied to all the dense layers in the SST model. Dropout [26] is also applied to the residual layers and embedding layers, as well as to each layer in the MLP block except for the final prediction layer. All dropout ratios are kept the same, and a grid search over the values in Table I is performed to find the best dropout ratio.


Optimizer

We experimented with two kinds of optimizers: the default Adam optimizer [15] with a fixed learning rate, and a scheduled Adam optimizer similar to that in [29]. The scheduled optimizer showed more stable results, so it is kept in further experiments.

For the scheduled Adam optimizer, the parameters $\beta_1$, $\beta_2$, and $\epsilon$ are set following [29]. The learning rate is varied during training according to Equation 7, where $d_{model}$ is the embedding dimension of the SST model, $step$ is the training step, and $warmup\_steps$ controls the warm-up phase. An extra $factor$ is added to tune the overall learning rate, and a grid search for the $factor$ over the values in Table I is performed to find the optimal value:

$lr = factor \cdot d_{model}^{-0.5}\cdot\min\left(step^{-0.5},\; step\cdot warmup\_steps^{-1.5}\right)$ (7)
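The schedule of Equation 7 can be sketched as a plain function. The warmup_steps value of 4000 below is the one used in [29]; the paper's own warm-up value was not recoverable from the text and is an assumption here.

```python
import numpy as np

def scheduled_lr(step, d_model=128, warmup_steps=4000, factor=0.5):
    """Noam-style learning rate schedule from [29] with an extra
    tuning factor: the rate ramps up linearly for warmup_steps,
    then decays proportionally to step^-0.5."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup_steps ** -1.5)

# The rate increases during warm-up and decays afterwards.
lrs = [scheduled_lr(s) for s in (1, 2000, 4000, 20000)]
```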


Hyper-parameter tuning

There are a few hyper-parameters to be tuned for SST model training. As shown in Table I, seven hyper-parameters are tuned using a grid search: the number of encoder blocks ($N$), the size of the embedding layer ($d_{model}$), the size of the feed-forward layer ($d_{ff}$), the dropout ratio, the learning rate factor of Equation 7, the batch size, and the number of heads in the multi-head attention layer ($h$), plus whether or not to use the uncertainty-based weighting of Equation 4. For the grid search, a smaller data set is randomly sampled from the full data sets, containing 5000 samples for training and 3000 for validation. The best model is picked based on the validation results, evaluated by the area under the Receiver Operating Characteristic curve (ROC-AUC) [6].


Hyper-parameter                 Values
Encoder blocks ($N$)            2, 3, 4
Embedding size ($d_{model}$)    32, 128, 512
Feed-forward size ($d_{ff}$)    64, 128, 256
Dropout ratio                   0.1, 0.3, 0.5
Learning rate factor            0.1, 0.3, 0.5
Batch size                      512, 1024, 2048
Attention heads ($h$)           1, 2, 4
Uncertainty weighting           on, off
TABLE I: Hyper-parameter search space


Instead of setting a fixed epoch number, we used a Keras early-stopping callback in the training process, with a patience of 100 epochs and restore_best_weights=True. In this way, we obtain an optimized epoch number for each experiment without manual tuning.


All the models are trained on an AWS instance with an NVIDIA Tesla V100 SXM2 GPU. Each training step took around 20 ms, and the entire training took about 30 minutes. The grid search for hyper-parameters took about 36 hours. All the models are written with TensorFlow [1] version 2.2 and Keras [3].

3 Data

Fig. 2: High-level workflow of wafer manufacturing. Each wafer goes through multiple processing stages, each stage has corresponding metrology, in which a few quality control measurements are performed. The measurement results are used to decide whether the wafer is in a good shape to go to the next stage. Figure from

To fill the gap in publicly available large-scale soft sensing data sets, we queried and processed several gigabytes of data from Seagate manufacturing factories in both the US and Ireland. These data sets contain high-dimensional time-series sensor data coming from different manufacturing machines.

As shown in Fig. 2, to fabricate a slider used in hard drives, an AlTiC wafer goes through multiple processing stages, including deposition, coating, lithography, etching, and polishing. Different products follow different manufacturing lines; Fig. 2 shows a simplified, general process flow. After each processing stage, the wafer is sent to metrology tools for quality control measurements. A metrology step may involve a single measurement or multiple different ones, each of which can have a varying degree of importance.

These processes are highly complex and are sensitive to both incoming material and point-of-process effects. A significant amount of engineering and systems resources is employed to monitor and control the variability intrinsic to the factors known to affect a process.

Metrology serves a critical function of managing these complexities for early learning cycles and quality control. This, however, comes at high capital costs, increased cycle time and considerable overhead to set up correct recipes for measurements, appropriate process control and workflow mechanisms. In each processing tool, there are dozens to hundreds of onboard sensors in the processing machines to monitor the state of the tool. These sensors collect information every few seconds and all these sensing values are collected and stored along with the measurement results.

Fig. 3: Overview of the main categories of processes and the corresponding critical measurement variables per each category. Figure from

As shown in Fig. 3, one time series of sensor data is mapped to several measurements, and the same measurement can be applied to multiple processing sensor data points. Each measurement contains a few numerical values indicating the condition of the wafers, and a pass/fail decision is made based on these numbers. For the sake of simplicity, we only use the pass/fail information for each measurement, so that each sample of time-series sensor data is mapped to several binary classification labels, resulting in a multi-task classification problem. On the other hand, some measurements are linked to multiple processing stages, so the SST model can learn representations from one stage and apply them to another when trained on data covering all the stages. Given this inter-correlation, training a single multi-task SST model leads to better performance compared with training the measurement tasks individually. From the perspective of industrial application, a single model is also more maintainable and scalable than many separate ones.

The data sets in this paper cover 92 weeks of data. The first 70 weeks are taken as training data, the following 14 weeks as validation data, and the last 8 weeks as testing data. The data sets are prepared by querying the raw data and applying some necessary pre-processing steps. While the sensors collect data every few seconds, there is a lot of redundancy, so we aggregated the data into short sequences: in each processing stage, a wafer goes through a few modules, and we aggregate the data by module to obtain short time sequences. Other pre-processing steps include min-max scaling, imputation from neighbors, one-hot encoding of categorical variables, and necessary anonymization. The min-max scaler is fit only on the training data and applied to the entire data sets. Imputation is done by first filling missing values from their neighbors (a forward fill followed by a backward fill) if non-missing values exist in the same processing stage, and otherwise filling with the mode of all the data. Categorical variables, such as the processing stage, the type of wafer, and the manufacturing machine in use, are one-hot encoded and concatenated to the sensor data as model input. As for the anonymization, only confidential information such as the data headers is removed.

Using data from different time frames for training and testing reflects the application prospect of the SST model, because in this way the model can be directly deployed into factories once it performs well enough on the testing data. However, this setting also makes it harder for the model to achieve high performance, because in reality there are many uncontrollable factors in the factories, and the distributions of the training and testing data may differ from each other.

These data sets are in NumPy format and include only numerical values without any headers. Input files are rank-3 tensors with dimensions (n_sample, time_step, features), and outputs are rank-2 tensors with dimensions (n_sample, 2*n_tasks). Each binary classification task has two columns in the output file: the first column for negative cases and the second for positive cases.
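The tensor layout described above can be illustrated with a minimal NumPy sketch. The sample counts below are toy values; the feature and task counts match the P1 description (2 time steps, 817 features, 11 tasks), and the column convention follows the paragraph above.

```python
import numpy as np

# Toy shapes matching the described format: 2 time steps,
# 817 features, 11 binary tasks -> 22 output columns.
n_sample, time_step, n_features, n_tasks = 4, 2, 817, 11

X = np.zeros((n_sample, time_step, n_features), dtype=np.float32)
y = np.zeros((n_sample, 2 * n_tasks), dtype=np.float32)

# Mark sample 0 as positive for task 3 (0-indexed): column 2*3 is the
# negative indicator, column 2*3 + 1 the positive one.
task = 3
y[0, 2 * task + 1] = 1.0
y[1:, 2 * task] = 1.0    # remaining samples negative for task 3
```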

Three data sets are covered in this paper. They come from slightly different manufacturing tool families, and each has different processing stages and corresponding measurements. The number of samples for each measurement is summarized in Table II. More detailed information for each tool family is given below, and all the data are available at

Task   P1 pos  P1 neg   P2 pos  P2 neg   P3 pos  P3 neg
1      295     8328     256     6433     109     2496
2      40      12747    773     26811    335     12857
3      291     56198    2069    78844    46      1026
4      188     14697    582     27809    15      4180
5      568     40644    247     9652     300     22254
6      863     84963    884     27337    166     40811
7      2501    153970   2108    53921    875     75706
8      490     2919     2016    77473    1097    18890
9      104     29551    644     23305    537     4247
10     57      10813    270     25651    1547    129914
11     306     47219    3792    354328   –       –
TABLE II: Summary for the data sets: number of positive and negative samples for each task


3.1 P1

The sensor data are generated by a deposition tool family that includes both deposition and etching steps. There are 90 sensors installed in these tools, capturing data at a frequency of about one reading per second. The critical parameters measured for this family of tools are magnetics, thickness, composition, resistivity, density, and roughness.

After the pre-processing mentioned above, there are 194k data samples in the training data, 34k in validation, and 27k in testing. Each sample has 2 time steps, with 817 features. Some of the second time steps are missing and are replaced with zero padding. The 817 features come from the 90 sensors, one-hot encoded categorical variables (including the type of wafer, the processing stage, and the specific manufacturing tool), and a padding indicator as the last feature.

For the labels, there are 11 individual measurement tasks, each a binary classification. We set the model output dimension to 22 to have separate predictions for negative and positive probabilities, and normalize them to obtain the predicted probabilities after applying class weights for the data imbalance. As shown in Table II, the data set is highly imbalanced: only about 1.2% of the samples have positive labels.


3.2 P2

This second data set contains data generated by a family of ion milling (dry etch) equipment, which uses ions in a plasma to remove material from the surface of the wafer. There are 57 sensors in this data set, and the critical parameters measured for this family of tools are similar to those of the P1 tools, but with slightly different measurement machines.

There are 457k training samples, 80k validation samples, and 66k testing samples in this data set. There is no time-series information here, but we treat the data as having 1 time step to fit it into the same SST model. This data set is more complex in terms of categorical variables, resulting in 1484 features in total.

The number of measurement tasks is 11, with an output dimension of 22, and about 1.9% of the samples are positive, as shown in Table II. Note that these 11 tasks are not the same as those in P1.


3.3 P3

The last data set is generated by sputter deposition equipment containing multiple deposition chambers with unique targets. The number of sensors is 43, and the critical parameters measured are the same as for the other families, but measured with different machines.

There are 205k training samples, 35k for validation, and 20k for testing. The maximum time-series length is 2, with outliers filtered out and short series padded. The number of features is 498, the smallest among the three data sets.

The number of measurement tasks is 10, and the output dimension is 20. Note that these tasks are not the same as those in the P1 and P2 data. The percentage of positive cases is about 1.6%.

4 Results

The SST models have been run on the three data sets described in the last section. The hyper-parameters are tuned within the ranges shown in Table I, and the best combination for each data set is presented below.

To validate the effectiveness of SST, the results are compared with two baseline models. The first one is the variance-weighted multi-headed quality-driven autoencoder (VWMHQAE) [32], which was developed by our team in 2020. The model is based on a stacked autoencoder architecture and utilizes the output (quality-control variable) information by reconstructing both the input and the output after encoding. It adds a multi-headed structure for multi-task learning, and applies the same variance-based task weighting as the SST model (Equation 4). It has proven to work well with non-time-series data in our previous experiments on similar sensor data, and therefore serves as a good baseline for SST. Since it does not have an architecture covering the time dimension, the data are flattened before being fed into the model. We also trained a second baseline model, a bidirectional LSTM (Bi-LSTM), one of the gold-standard models for time-series data, to provide a comprehensive benchmark of SST's performance.

Due to the highly imbalanced nature of the data sets, accuracy is not a meaningful metric for evaluating the models. The metrics the industry cares about most are the True Positive Rate (TPR, also called recall or sensitivity) and the False Positive Rate (FPR, also called fall-out or false alarm ratio). However, comparing two metrics at once is not intuitive, so we chose the Receiver Operating Characteristic (ROC) curve [6] and the Area Under the Curve (AUC) as the main metric in this paper. More detailed results are covered in the Appendix.
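ROC-AUC summarizes the TPR/FPR trade-off in a single number. A minimal NumPy implementation of the rank-based (Mann-Whitney) formulation, which is mathematically equivalent to the area under the ROC curve, is sketched below.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a randomly chosen positive
    sample is scored above a randomly chosen negative one."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Count pairwise wins; ties count half.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# 3 of 4 positive/negative pairs are ranked correctly -> AUC = 0.75.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A perfect ranking (every positive above every negative) gives AUC = 1, and a random one gives AUC ≈ 0.5, which is why values below 0.5 indicate worse-than-random predictions.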


4.1 P1

For the P1 data set, the SST model is set to 3 encoder layers, both $d_{model}$ and $d_{ff}$ are 128, the dropout rate is 0.5, the batch size is 2048, $h$ is 1, the learning rate factor is 0.5, and the uncertainty-based weighting is off. The VWMHQAE model is set to three layers with hidden dimensions [512, 256, 128], and the Bi-LSTM model has a dimension equal to $d_{model}$. All models are followed by a three-layer MLP classifier whose hidden dimensions all equal $d_{model}$.

TABLE III: Result comparison with baseline models: P1
Fig. 4: ROC curve for SST and baseline models on P1 data set, 4th task

The results for the 11 tasks are summarized in Table III. SST is the best performer in 7 of the tasks, especially among the high-performing tasks with larger AUC values.

From the results we can also see that some of the tasks have poor results for all three models. A high AUC value is difficult to achieve for these tasks with any model, due to their intrinsically complex and noisy nature. Only those measurement tasks with decent results can deliver realistic value in industrial applications. This is one of the primary motivations behind our decision to release these data sets openly: researchers all around the world are welcome to use and explore this data. This will not only help us gain more understanding of the data sets, but also enrich the research field.

To further illustrate the results, the ROC curve is plotted for the task with the highest AUC in Fig. 4. SST has a higher score than the two baseline models, and its curve is smoother, indicating a more even distribution of the prediction probabilities and a finer grid in the prediction space. The source code can be found at


4.2 P2

For the P2 data set, the SST model is set to 3 encoder layers, both $d_{model}$ and $d_{ff}$ are 128, the dropout rate is 0.3, the batch size is 2048, $h$ is 1, the learning rate factor is 0.5, and the uncertainty-based weighting is on. The baseline models are the same as for P1.

TABLE IV: Result comparison with baseline models: P2
Fig. 5: ROC curve for SST and baseline models on P2 data set, 1st task

The results for the 11 tasks are summarized in Table IV. SST is the best performer in 4 of the tasks, including the two tasks with the best predictions. As in P1, some of the tasks have poor results for all three models due to the intrinsic complexity and noise in the data set, and we mostly care about the tasks with the best results. In this data set there is only one time step, so as expected the VWMHQAE model, which is not designed for time-series data, shows better results compared to P1, and it has the best performance in 5 of the 11 tasks.

The ROC curve for the task with the highest AUC, shown in Fig. 5, is very similar to the previous one: SST is slightly smoother than the baseline models, with a higher AUC.


4.3 P3

For the P3 data set, the SST model is set to 3 encoder layers, both $d_{model}$ and $d_{ff}$ are 128, the dropout rate is 0.3, the batch size is 2048, $h$ is 1, the learning rate factor is 0.3, and the uncertainty-based weighting is on. The baseline models are the same as for P1.

TABLE V: Result comparison with baseline models: P3
Fig. 6: ROC curve for SST and baseline models on P3 data set, 2nd task

The results for 7 out of the 10 tasks are summarized in Table V; the others have too few testing samples. SST is the best performer in 4 of the tasks, including the first task, which has the best prediction. Some tasks have poor results for all three models, in some cases even with an AUC below 0.5, meaning worse than a random guess. The main cause is distribution shift, and further experiments will be carried out once we have accumulated more data in the Seagate factories. The ROC curve for the task with the highest AUC, shown in Fig. 6, is very similar to the previous ones.

5 Discussion and Conclusion

We have explored the direct application of Transformers to soft sensing. To our knowledge, we are the first to provide large-scale soft sensing data sets, and the first to benchmark results with the original transformer model in the soft sensing field. This is also the first time a transformer model has gone beyond human capability in the sense that the input data is not human-interpretable. We treat the time-series data as a sequence of sensor values, take each time step as a word, and process the sentence-like data with a standard transformer encoder exactly as in NLP. This direct and intuitive strategy has shown exciting results on our data sets, outperforming our previous model and the Bi-LSTM model.

We share these data sets with the excitement of advancing interest and work in soft sensing research and applications. We invite future work on improving SST performance on the tasks that have been particularly challenging to learn in our experiments. Another future direction is the examination of appropriate time-sequence lengths for these data sets, and the exploration of better ways to handle missing data. We are working on acquiring more data with longer sequences, to better understand the impact of time-series length on quality prediction. In the meantime, we have provided three data sets covering a variety of sensors to examine the generalizability of deep learning models, and we believe these data sets can enrich the soft sensing research field and serve as a standard tool for evaluating the effectiveness of future research.


Acknowledgments

The authors would like to thank Seagate Technology for the support of this study, the Seagate Lyve Cloud team for providing the data infrastructure, and the Seagate Open Source Program Office for open-sourcing the data sets and the code. Special thanks to the Seagate Data Analytics and Reporting Systems team for inspiring discussions.




Fig. 7: ROC curves for SST and baseline models on P1 data set, all tasks
Fig. 8: ROC curves for SST and baseline models on P2 data set, all tasks
Fig. 9: ROC curves for SST and baseline models on P3 data set, all tasks
Fig. 10: TPR for SST and baseline models on P1 data set
Fig. 11: FPR for SST and baseline models on P1 data set
Fig. 12: TPR for SST and baseline models on P2 data set
Fig. 13: FPR for SST and baseline models on P2 data set
Fig. 14: TPR for SST and baseline models on P3 data set
Fig. 15: FPR for SST and baseline models on P3 data set