In recent decades, the development of smart sensors has attracted considerable attention from government, academia, and industry. The European Union’s 20-20-20 goals (a 20% increase in energy efficiency, a 20% reduction in CO2 emissions, and a 20% share of renewables by 2020) rely on smart metering as one of their key enablers. Smart meters usually involve real-time or near real-time sensors, notification, and monitoring. In 2013, Germany proposed the concept of Industry 4.0, whose main aim is to develop smart factories for producing smart products. In September 2020, the US government announced that it is providing more than $1 billion towards establishing research and hubs for Industry 4.0 technologies. Singapore’s current five-year SU$13.8 billion R&D is injecting more funds into expanding fields such as advanced manufacturing. China’s China Manufacturing 2025 goal is likewise to make the manufacturing process more intelligent. These initiatives require better sensing technologies to understand and drive our processes. Sensor data can contain information about process variables, which data-driven techniques can exploit for smarter monitoring and control of manufacturing processes. Soft sensing is the general term for the approaches and algorithms used to estimate or predict certain physical quantities or product quality in industrial processes based on the available sensing modalities, measurements, and knowledge.
As industrial processes have become more complicated and the size of available data has increased dramatically, there has been a growing body of research on deep learning methods with applications in the soft sensing field. A recent survey on deep learning methods for soft sensors has illustrated the significance of deep learning applications and reviewed the most recent studies in this field. The deep learning models are mostly based on autoencoders and recurrent neural networks [24, 16]. The applications vary from traditional factories to wearable IoT devices [22, 21].
A variety of novel deep learning models have been proposed, such as variational autoencoder models that attempt to enhance the representation ability or augment the data [12, 13], a semi-supervised ensemble learning model that quantifies the contribution of different hidden layers in a stacked autoencoder, and a gated convolutional transformer neural network that combines several state-of-the-art algorithms to deal with time-series data. As deep learning models become more complex, their capacity to handle complex processes and large data sets also increases. However, in these studies researchers still use very small data sets, such as the wastewater treatment plant and debutanizer column data sets [25, 7], which contain low-dimensional data with only hundreds to thousands of samples. Such small data sets are not sufficient to demonstrate the effectiveness of advanced deep learning models with millions of parameters. To address this issue, we collected gigabytes of numerical sensor data from Seagate’s wafer manufacturing factories in the USA and Ireland. These data sets contain high-dimensional time-series sensor data collected directly from the Seagate wafer factories with only the necessary anonymization; they are big, complex, noisy, and impossible for humans to interpret in their raw form. In this article, we evaluate a soft sensing transformer model against the methods most commonly applied to soft sensing problems, including models based on the autoencoder and LSTM. The key components of the original transformer model are maintained, and the other parts of the architecture are modified to fit our data sets and tasks.
Since its proposal in 2017, the transformer, together with its derivatives such as BERT, has been the most active research topic in the natural language processing (NLP) field as well as the top performer in many NLP tasks. Owing to its extraordinary representational capability, the transformer model has also shown equally good performance in computer vision. First proposed in 2020, the vision transformer and its variants have achieved state-of-the-art performance on many computer vision benchmarks such as image classification, semantic segmentation, and object detection [19, 31].
From text in NLP, which can be regarded as categorical data, to images (two-dimensional arrays of integer values) in computer vision, a natural further extension is soft sensing data, which is time series of continuous floating-point numbers. While the Bayes error rate in NLP and computer vision tasks is usually defined as human-level performance, our soft sensing task is impossible for a human to classify based on the hundreds of sensor values. We show in this paper that the transformer architecture works well not only for natural language and images but also for numerical data, and that it is able to represent data that is not interpretable by humans.
While implementing the soft sensing model, we follow the original transformer architecture as closely as possible. The input module of the model is modified to fit the time-series sensor data, and the output module is modified for multi-task classification problems. This is the first study to provide benchmark results on these large-scale sensor data sets with deep learning methods, and also the first to apply a transformer model to large-scale numerical sensor data.
2.1 Soft Sensing Transformer (SST)
We illustrate the structure of the soft sensing transformer model in Fig. 1. Because the format of time-series sensor data differs from text, we use a dense layer for the embedding at the starting point, which reduces the dimension of the high-dimensional sensor input. After this layer, the data has the same format as embedded sentences, so it can be fed into the transformer encoder without any modifications. Right before the encoder block, a positional encoding using sine and cosine functions of different frequencies is added, as in Equation 1 and Equation 2, to carry the relative positions of the different time steps:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{1}$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{2}$$

where $PE$ stands for positional encoding, $pos$ is the position of a time step, $i$ indexes the sine/cosine pairs of the embedding, and $d_{model}$ is the dimension of embedded vectors.
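As a hedged illustration of the positional encoding described above, the following NumPy sketch assumes the standard sinusoidal form of the original transformer paper (base 10000, sine on even columns, cosine on odd columns). The function name `positional_encoding` and the example sizes are illustrative and are not taken from the SST source code.

```python
import numpy as np

def positional_encoding(n_steps: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding of shape (n_steps, d_model)."""
    pos = np.arange(n_steps)[:, None]      # (n_steps, 1) time-step positions
    col = np.arange(d_model)[None, :]      # (1, d_model) embedding columns
    # Columns 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
    angle = pos / np.power(10000.0, (2 * (col // 2)) / d_model)
    pe = np.zeros((n_steps, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])   # even columns: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])   # odd columns: cosine
    return pe

# Example: two time steps embedded in 128 dimensions (sizes are assumptions).
pe = positional_encoding(n_steps=2, d_model=128)
```

The encoding is added element-wise to the embedded sensor vectors, so it must have the same shape as one embedded sample.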
Since the SST model requires a fixed input size, we add padding to samples with too few time steps so that every sample has the same time length. The time length is chosen as the 99th percentile of the sequence lengths in the raw data, which covers most of the data and excludes outliers. The padding masks are applied accordingly. In the encoder, multi-head scaled dot-product attention, feed-forward layers, and residual connections are set up in the same way as in the original transformer paper. The multi-head attention is described in Equation 3: the query, key, and value are projected to $h$ heads with the weight matrices $W_i^Q$, $W_i^K$, and $W_i^V$. Each head has a dimension of $d_{model}/h$, and a scaled dot-product attention is calculated for each head. Then the heads are concatenated and projected back to the original shape.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V) \tag{3}$$
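The padding step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the percentile is rounded down to an integer length, sequences longer than that length are dropped as outliers, and padding uses zeros. The function name `pad_to_percentile` is hypothetical.

```python
import numpy as np

def pad_to_percentile(sequences, q=99.0):
    """Pad variable-length sequences to the q-th percentile length.

    sequences: list of arrays, each of shape (steps_i, n_features).
    Returns (batch, mask): batch has shape (n_kept, max_len, n_features);
    mask is True on real time steps and False on padded ones.
    """
    lengths = np.array([len(s) for s in sequences])
    # Target length: q-th percentile of the raw lengths, rounded down (assumption).
    max_len = max(1, int(np.floor(np.percentile(lengths, q))))
    # Sequences longer than the target are treated as outliers and dropped.
    kept = [s for s in sequences if len(s) <= max_len]
    n_features = kept[0].shape[1]
    batch = np.zeros((len(kept), max_len, n_features))
    mask = np.zeros((len(kept), max_len), dtype=bool)
    for row, seq in enumerate(kept):
        batch[row, :len(seq)] = seq
        mask[row, :len(seq)] = True
    return batch, mask
```

The returned mask is what an attention layer would use to ignore padded time steps.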
The Seagate data sets contain measurement pass/fail information, and the SST model is built as a classification model. After the encoder blocks, global average pooling is applied and a multi-layer perceptron (MLP) classifier is attached on top. Because of the intrinsic complexity of the data, the classifier comprises several individual binary classifiers. These binary classifiers partly share the input data and may be correlated with each other, resulting in an inter-correlated multi-task problem (further discussed in Section 3). To achieve the best performance in the multi-task learning, a weighting method based on uncertainty is applied, and we define the combined loss function as Equation 4:

$$L_{total} = \sum_{i} \left( \frac{1}{\sigma_i^2} L_i + \log \sigma_i \right) \tag{4}$$

where $L_{total}$ is the total loss, $L_i$ is the loss of the $i$-th classification task, and $\sigma_i$ is the uncertainty of the $i$-th classification loss, which is trainable during the model fitting.
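The idea of uncertainty-based weighting can be sketched in a few lines. The form used here, each task loss divided by its variance plus a log-uncertainty penalty, follows the commonly used formulation from the cited multi-task uncertainty work; whether SST uses exactly this variant is an assumption. Parameterizing by `log_sigma` instead of the uncertainty itself is also an assumption, chosen because it keeps the uncertainty positive during training.

```python
import math

def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Combine per-task losses as sum_i( L_i / sigma_i^2 + log(sigma_i) ).

    task_losses: per-task loss values L_i.
    log_sigmas:  per-task log(sigma_i); in a real model these are trainable.
    """
    total = 0.0
    for loss, log_sigma in zip(task_losses, log_sigmas):
        precision = math.exp(-2.0 * log_sigma)   # equals 1 / sigma_i^2
        total += precision * loss + log_sigma    # the log term stops sigma growing unbounded
    return total
```

A task with a large learned uncertainty contributes less to the total, so noisy tasks do not dominate the gradient.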
In industrial settings, the data are highly imbalanced: only 1% to 2% of the data samples are positive. To deal with the imbalance, we experimented with both weighting methods and data sampling algorithms such as SMOTE. We found that class weighting gives the best efficiency and performance in our experiments. The weight $w_{ij}$ of the $i$-th task and label $j$ ($0$ or $1$) is calculated based on the number of samples:
in which $N$ is the total number of samples, $T$ is the number of tasks, and $n_{ij}$ is the number of samples for label $j$ in the $i$-th task.
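Class weighting can be illustrated with an inverse-frequency rule. The formula below, total sample count divided by the number of tasks times the per-label count, is an assumption built only from the three quantities named in the text; it is not confirmed to be the authors' exact formula. The function name and the example counts are hypothetical.

```python
def class_weights(label_counts, n_total):
    """Inverse-frequency weights (assumed form): w[i][j] = n_total / (n_tasks * n_ij).

    label_counts: one (n_negative, n_positive) pair per task.
    n_total:      total number of samples.
    """
    n_tasks = len(label_counts)
    weights = []
    for counts in label_counts:
        if min(counts) <= 0:
            raise ValueError("every label needs at least one sample")
        weights.append([n_total / (n_tasks * n) for n in counts])
    return weights

# Two toy tasks with 1% and 2% positives (made-up counts).
weights = class_weights([(990, 10), (980, 20)], n_total=1000)
```

With these toy counts the positive label of the first task is weighted 99 times more heavily than its negative label, which offsets the 99:1 imbalance.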
Combined with the uncertainty-based multi-task learning of Equation 4, the final loss function of the SST model is defined as a weighted cross entropy:
$y_{ijk}$ and $p_{ijk}$ are the true labels and predicted probabilities for the $k$-th sample in task $i$ for label $j$. Note that the cross entropy loss is calculated in a multi-label classification manner, and the losses for positive and negative cases are computed separately. We take $j = 1$ for positive samples and $j = 0$ for negative samples. The weights for the positive and negative cases in a single binary classification task are also further tuned by $\sigma_{ij}$, which is the uncertainty or variance of the loss for label $j$ in task $i$. In this multi-task learning setting, we have $2T$ ’tasks’ for the binary classifications.
L2 regularizers with a fixed regularization factor are applied to all the dense layers in the SST model. Dropout is also applied to the residual layers and embedding layers, and to each layer in the MLP block except for the final prediction layer. All dropout ratios are kept the same, and a grid search over {0.1, 0.3, 0.5} is performed to find the best dropout ratio.
We experimented with two kinds of optimizers: the default Adam optimizer with a fixed learning rate, and a scheduled Adam optimizer similar to the one in the original transformer paper. The optimizer with the scheduled learning rate gave more stable results, so it is kept in further experiments.
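For the scheduled Adam optimizer, the parameters $\beta_1$, $\beta_2$, and $\epsilon$ are set to fixed values. The learning rate is varied during the training process based on Equation 7:

$$lr = factor \cdot d_{model}^{-0.5} \cdot \min\left(step^{-0.5},\ step \cdot warmup^{-1.5}\right) \tag{7}$$

where $d_{model}$ is the size of the embedding layer in the SST model, $step$ is the training step, and $warmup$ is the fixed number of warm-up steps. An extra $factor$ is added to tune the overall learning rate. A grid search for the $factor$ in {0.1, 0.3, 0.5} is performed to find the optimal factor.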
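A learning rate schedule of this kind can be sketched as below. The sketch assumes the warm-up schedule of the original transformer paper multiplied by the extra factor; the default of 4000 warm-up steps is that paper's value and is an assumption here, since the value used for SST is not stated in this text. The function name is hypothetical.

```python
def scheduled_learning_rate(step, d_model, warmup_steps=4000, factor=1.0):
    """Linear warm-up followed by inverse-square-root decay, scaled by `factor`."""
    step = max(step, 1)                      # avoid 0 ** -0.5 at the very first step
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Peak learning rate for an embedding size of 128 and a factor of 0.5.
peak = scheduled_learning_rate(4000, d_model=128, warmup_steps=4000, factor=0.5)
```

The rate rises linearly until `warmup_steps` and then decays, so the factor only rescales the curve without moving its peak.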
There are a few hyper-parameters to be tuned for the SST model training. As shown in Table I, in total 7 hyper-parameters are tuned using a grid search. The hyper-parameters include the number of encoder blocks, the size of the embedding layer, the size of the feed-forward layer, the dropout ratio, the learning rate factor as in Equation 7, the batch size, the number of heads for the multi-head attention layer, and whether or not to use the uncertainty-based weighting as in Equation 4. For the grid search, a smaller subset is randomly sampled from the data sets, containing 5000 samples for training and 3000 for validation. The best model is picked based on the validation results, evaluated by the area under the Receiver Operating Characteristic curve (ROC-AUC).
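Table I
|Hyper-parameter|Values searched|
|Number of encoder blocks|2, 3, 4|
|Size of embedding layer|32, 128, 512|
|Size of feed-forward layer|64, 128, 256|
|Dropout ratio|0.1, 0.3, 0.5|
|Learning rate factor|0.1, 0.3, 0.5|
|Batch size|512, 1024, 2048|
|Number of heads|1, 2, 4|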
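The grid search over these values can be sketched with the standard library. The dictionary keys are hypothetical names, and the assignment of each row of values to a hyper-parameter assumes the rows follow the order in which the text lists them. The `evaluate` callback stands in for training a model on the sampled subset and returning its validation ROC-AUC.

```python
from itertools import product

SEARCH_SPACE = {
    "n_blocks": [2, 3, 4],
    "d_embedding": [32, 128, 512],
    "d_feed_forward": [64, 128, 256],
    "dropout": [0.1, 0.3, 0.5],
    "lr_factor": [0.1, 0.3, 0.5],
    "batch_size": [512, 1024, 2048],
    "n_heads": [1, 2, 4],
}

def grid(space):
    """Yield every combination of the search space as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def best_config(space, evaluate):
    """Return the configuration with the highest score from `evaluate(config)`."""
    return max(grid(space), key=evaluate)
```

Seven hyper-parameters with three values each give 3^7 = 2187 combinations, which is one reason to run the search on a small subset of the data.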
To fill the gap in publicly available large-scale soft sensing data sets, we queried and processed several gigabytes of data from Seagate manufacturing factories in both the US and Ireland. These data sets contain high-dimensional time-series sensor data coming from different manufacturing machines.
As shown in Fig. 2, to fabricate a slider used for hard drives, an AlTiC wafer goes through multiple processing stages including deposition, coating, lithography, etching, and polishing. Different products have different manufacturing lines; Fig. 2 shows a simplified and general process. After each processing stage, the wafer is sent to metrology tools for quality control measurements. A metrology step may comprise one or several different measurements, each of which can have a different degree of importance.
These processes are highly complex and are sensitive to both incoming as well as point of process effects. A significant amount of engineering and systems resources are employed to monitor and control the variability intrinsic to the factors that are known to affect a process.
Metrology serves a critical function of managing these complexities for early learning cycles and quality control. This, however, comes at high capital costs, increased cycle time and considerable overhead to set up correct recipes for measurements, appropriate process control and workflow mechanisms. In each processing tool, there are dozens to hundreds of onboard sensors in the processing machines to monitor the state of the tool. These sensors collect information every few seconds and all these sensing values are collected and stored along with the measurement results.
As shown in Fig. 3, one time series of sensor data is mapped to several measurements, and the same measurement can be applied to multiple processing sensor data points. Each measurement contains a few numerical values that indicate the condition of the wafer, and a pass or fail decision is made based on these numbers. For the sake of simplicity, we only cover the pass/fail information for each measurement, so that each sample of time-series sensor data is mapped to several binary classification labels, resulting in a multi-task classification problem. In addition, some of the measurements are linked to multiple processing stages, so the SST model can learn representations from one stage and apply them to another when it is trained on data covering all the stages. Given this inter-correlation, training a single multi-task SST model leads to better performance than training the measurement tasks individually. From the perspective of industrial application, a single model is also more maintainable and scalable than many separate ones.
The data sets in this paper cover 92 weeks of data. The first 70 weeks are taken as training data, the following 14 weeks as validation data, and the last 8 weeks as testing data. The data sets are prepared by querying the raw data and applying the necessary pre-processing steps. Because the sensors collect data every few seconds, there is a lot of redundancy, so we aggregate the data into short sequences: in each processing stage a wafer goes through a few modules, and we aggregate the data by module to obtain short time sequences. The other pre-processing steps include min-max scaling, imputation with neighbors, one-hot encoding of categorical variables, and the necessary anonymization. The min-max scaler is fit only on the training data and then applied to the entire data set. Imputation fills the missing values first from their neighbors (a forward fill followed by a backward fill) if non-missing values exist in the same processing stage, and otherwise with the mode of all the data. Categorical variables, such as the processing stage, the type of the wafer, and the manufacturing machine in use, are one-hot encoded and concatenated to the sensor data as model input. As for the anonymization, only confidential information such as the data headers is removed.
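The split and pre-processing steps above can be sketched with NumPy. This is an illustration under stated assumptions, not the authors' pipeline: weeks are numbered from 0, each sensor channel within a processing stage is a 1-D array with NaN marking missing values, and the global mode is passed in as `fallback`. All function names are hypothetical.

```python
import numpy as np

def split_by_week(week, n_train=70, n_val=14):
    """Boolean masks for a chronological train/validation/test split."""
    week = np.asarray(week)
    train = week < n_train
    val = (week >= n_train) & (week < n_train + n_val)
    test = week >= n_train + n_val
    return train, val, test

def fit_min_max(train):
    """Fit the scaler on training data only; returns (minimum, span) per feature."""
    lo = np.nanmin(train, axis=0)
    hi = np.nanmax(train, axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # constant features: avoid dividing by zero
    return lo, span

def apply_min_max(x, lo, span):
    """Scale any split with the statistics learned from the training split."""
    return (x - lo) / span

def fill_neighbors(x, fallback):
    """Forward fill, then backward fill; use `fallback` if nothing is observed."""
    x = np.asarray(x, dtype=float).copy()
    valid = ~np.isnan(x)
    if not valid.any():
        x[:] = fallback
        return x
    idx = np.where(valid, np.arange(len(x)), -1)
    idx = np.maximum.accumulate(idx)          # index of the latest observed value
    idx[idx < 0] = np.argmax(valid)           # leading gaps take the first observed value
    return x[idx]
```

Because the scaler only sees training data, later weeks can fall outside the range [0, 1] after scaling, which is expected under distribution shift.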
Using data from different time frames for training and testing reflects the intended application of the SST model: a model that performs well enough on the testing data can be deployed directly into factories. However, this setting also makes it harder for the model to achieve high performance, because in reality there are many uncontrollable factors in the factories, and the distributions of the training and testing data may differ from each other.
These data sets are in NumPy format and include only numerical values, without any headers. Input files are rank-3 tensors with dimensions (n_sample, time_step, features), and outputs are rank-2 tensors with dimensions (n_sample, 2*n_tasks). Each binary classification task has two columns in the output file: the first column for negative cases and the second for positive cases.
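The label layout can be unpacked as in the following sketch. It assumes that the two columns of each task are adjacent (negative, then positive) and that a row with zeros in both columns means the task has no label for that sample; both points are assumptions about the files, and the arrays here are synthetic stand-ins rather than the published data.

```python
import numpy as np

def split_label_columns(y):
    """y: (n_sample, 2 * n_tasks) -> (labels, has_label), each (n_sample, n_tasks)."""
    pairs = y.reshape(y.shape[0], -1, 2)   # last axis: [negative, positive]
    labels = pairs[:, :, 1].astype(int)    # 1 where the positive column is set
    has_label = pairs.sum(axis=2) > 0      # False where neither column is set
    return labels, has_label

# Synthetic arrays with the documented ranks.
x = np.zeros((4, 2, 817))                  # (n_sample, time_step, features)
y = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1]])               # (n_sample, 2 * n_tasks) with 2 tasks
labels, has_label = split_label_columns(y)
```

The `has_label` mask can be used to exclude unlabeled sample-task pairs from the loss and from the evaluation metrics.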
Three data sets are covered in this paper. They come from slightly different manufacturing tool families, and each has different processing stages and corresponding measurements. The number of samples for each measurement is summarized in Table II. More detailed information for each tool family is given below, and all the data are available at https://github.com/Seagate/softsensing_data.
The first data set, P1, contains sensor data generated by a deposition tool that includes both deposition and etching steps. There are 90 sensors installed in the tools, and they capture data about once per second. The critical parameters measured for this family of tools are magnetics, thickness, composition, resistivity, density, and roughness.
After the pre-processing described above, there are 194k data samples in the training data, 34k in the validation data, and 27k in the testing data. Each sample has 2 time steps with 817 features. Some of the second time steps are missing and are replaced with zero padding. The 817 features come from the 90 sensors, the one-hot encoded categorical variables (including the type of the wafer, the processing stage, and the specific manufacturing tool), and a padding indicator as the last feature.
For the labels, there are 11 individual measurement tasks, each of which is a binary classification. We set the model output dimension to 22 to have separate predictions for the negative and positive probabilities, and normalize them to get the predicted probabilities after applying class weights for the data imbalance. As shown in Table II, the data set is highly imbalanced: only about 1.2% of the samples have positive labels.
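The normalization of the paired outputs can be sketched as follows. The sketch assumes non-negative scores (for example sigmoid outputs) with the two columns of each task adjacent, negative first; the small `eps` guard and the function name are illustrative choices.

```python
import numpy as np

def positive_probability(outputs, eps=1e-12):
    """outputs: (n_sample, 2 * n_tasks) non-negative scores -> (n_sample, n_tasks)."""
    pairs = outputs.reshape(outputs.shape[0], -1, 2)   # [negative, positive] per task
    return pairs[:, :, 1] / (pairs.sum(axis=2) + eps)  # share of the positive score

# One sample, two tasks (made-up scores).
probs = positive_probability(np.array([[0.2, 0.6, 3.0, 1.0]]))
```

For an output dimension of 22 this yields 11 probabilities per sample, one per measurement task.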
The second data set, P2, contains data generated by a family of ion milling (dry etch) equipment, which utilizes ions in plasma to remove material from the surface of the wafer. There are 57 sensors for this data set, and the critical parameters measured for this family of tools are similar to those for the P1 tools, but with slightly different measurement machines.
There are 457k training samples, 80k validation samples, and 66k testing samples in the data set. For this data set, there is no time-series information, but we treat it as 1 time step to fit into the same SST model. This data set is more complex in terms of categorical variables, resulting in 1484 features in total.
The number of measurement tasks is 11, with an output dimension of 22, and about 1.9% of the samples are positive, as shown in Table II. Note that these 11 tasks are not the same as those in P1.
The last data set is generated by sputter deposition equipment containing multiple deposition chambers, with unique targets. The number of sensors is 43, and critical parameters measured are the same but with different machines.
There are 205k training data samples, 35k for validation, and 20k for testing. The maximum time-series length is 2, with outliers filtered out and short series padded. The number of features is 498, the least among these three data sets.
The number of measurement tasks is 10, and output dimension is 20. Note that these tasks are not the same as those in P1 and P2 data. The percentage of positive cases is about 1.6%.
The SST models have been run on the three data sets described in the last section. The hyper-parameters are tuned within the ranges shown in Table I, and the best combination for each data set is presented below.
To validate the effectiveness of SST, the results are compared with two baseline models. The first is the variance weighted multi-headed quality driven autoencoder (VWMHQAE), which was developed by our team in 2020. The model is based on a stacked autoencoder architecture and utilizes the output (quality-control variable) information by reconstructing both the input and the output after encoding. It adds a multi-headed structure for multi-task learning and applies a variance-based weight to the tasks, the same as in the SST model (Equation 4). It has been shown to work well with non-time-series data in our previous experiments with similar sensor data, and therefore serves as a good baseline model for SST. Since its architecture does not cover the time dimension, the data is flattened before being fed into the model. We also trained a second baseline model, a bidirectional LSTM (Bi-LSTM), which is one of the gold-standard models for time-series data, to obtain a comprehensive benchmark of the performance of SST.
Because the data sets are highly imbalanced, accuracy is not a meaningful metric for evaluating the models. The metrics that the industry cares most about are the True Positive Rate (TPR, also called recall or sensitivity) and the False Positive Rate (FPR, also called fall-out or false alarm ratio). However, comparing two metrics together is not intuitive, so we chose the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) as the main metric in this paper. More detailed results are covered in the Appendix.
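To make the choice of metric concrete: with about 1.2% positive samples, a model that always predicts "negative" reaches about 98.8% accuracy while detecting nothing. The sketch below shows how the ROC curve and its area can be computed from labels and scores in plain Python. It is an illustration of the standard definitions (threshold sweep plus trapezoidal rule), not the evaluation code used for the reported results, and it requires at least one positive and one negative sample.

```python
def roc_curve(y_true, scores):
    """Return (fpr, tpr) lists as the decision threshold sweeps from high to low."""
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for idx, (score, label) in enumerate(pairs):
        tp += label
        fp += 1 - label
        # Emit one point per distinct score so that tied scores move together.
        if idx + 1 == len(pairs) or pairs[idx + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))

fpr, tpr = roc_curve([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
area = auc(fpr, tpr)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, independent of how rare the positive class is.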
For the P1 data set, the SST model is set to 3 layers, both the embedding size and the feed-forward size are 128, the dropout rate is 0.5, the batch size is 2048, the number of heads is 1, the learning rate factor is 0.5, and the uncertainty-based weighting is off. The VWMHQAE model is set to three layers with hidden dimensions [512, 256, 128], and the Bi-LSTM model uses a single fixed hidden dimension. All models are followed by a three-layer MLP classifier in which all hidden dimensions are the same.
The results for the 11 tasks are summarized in Table III. SST is the best performer in 7 of the tasks, especially in the tasks with high AUC.
From the results we can also see that some of the tasks have poor results for all three models. It is difficult to reach a high AUC on them with any model because of the intrinsically complex and noisy nature of the data. Only the measurement tasks with decent results can deliver realistic value in industrial applications. This is one of the primary motivations behind our decision to make these data sets openly accessible: researchers all around the world are welcome to use and explore this data. This will not only help us gain more understanding of the data sets, but also enrich the research field.
To further illustrate the results, the ROC curve is plotted for the task with highest AUC as in Fig. 4. SST has a higher score than the two baseline models, and the curve is smoother, meaning a more even distribution of the prediction probabilities and a finer grid in the prediction space. The source code can be found at https://github.com/Seagate/SoftSensingTransformer.
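For the P2 data set, the SST model is set to 3 layers, both the embedding size and the feed-forward size are 128, the dropout rate is 0.3, the batch size is 2048, the number of heads is 1, the learning rate factor is 0.5, and the uncertainty-based weighting is on. The baseline models are the same as for P1.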
The results for the 11 tasks are summarized in Table IV. SST is the best performer in 4 of the tasks, including the two tasks with the best prediction. As in P1, some of the tasks have poor results for all three models because of the intrinsic complexity and noise in the data set, and we mostly care about the tasks with the best results. In this data set there is only one time step, and as expected the VWMHQAE model, which is not designed for time-series data, shows better results than on the P1 data, with the best performance in 5 of the 11 tasks.
The ROC curve for the task with highest AUC as in Fig. 5 is very similar to the previous one. SST is slightly smoother than the baseline models, with a higher AUC.
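For the last data set, the SST model is set to 3 layers, both the embedding size and the feed-forward size are 128, the dropout rate is 0.3, the batch size is 2048, the number of heads is 1, the learning rate factor is 0.3, and the uncertainty-based weighting is on. The baseline models are the same as for P1.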
The results for 7 of the 10 tasks are summarized in Table V, because the others have too few testing samples. SST is the best performer in 4 of the tasks, including the first task, which has the best prediction. Some of the tasks have poor results for all three models, in some cases with an AUC lower than 0.5, meaning worse than a random guess. The main cause is distribution shift, and further experiments will be carried out once we have accumulated more data in Seagate factories. The ROC curve for the task with the highest AUC, shown in Fig. 6, is very similar to the previous ones.
5 Discussion and Conclusion
We have explored the direct application of transformers to soft sensing. To our knowledge, we are the first to provide large-scale soft sensing data sets, and the first to benchmark results with the original transformer model in the soft sensing field. This is also the first time that a transformer model goes beyond human ability, in the sense that the input data is not human-interpretable. We treat the time-series data as a sequence of sensor values in which each time step is taken as a word, and process the sentence-like data with a standard transformer encoder exactly as in NLP. This direct and intuitive strategy has shown an exciting result on our data sets, outperforming our previous model and the Bi-LSTM model.
We share these data sets in the hope of advancing interest and work in the research and applications of soft sensing. We invite future work on improving SST performance on the tasks that have been particularly challenging to learn in our experiments. Another future direction is the examination of appropriate time sequences for these data sets, and the exploration of better ways to address missing data. We are working on acquiring more data with longer sequences, to better understand the impact of time-series length on the prediction of quality. In the meantime, we have provided three data sets that cover a variety of sensors and allow the generalizability of deep learning models to be examined. We believe these data sets can enrich the soft sensing research field and serve as one of the standard tools for evaluating the effectiveness of future research.
The authors would like to thank Seagate Technology for the support on this study, the Seagate Lyve Cloud team for providing the data infrastructure, and the Seagate Open Source Program Office for open sourcing the data sets and the code. Special thanks to the Seagate Data Analytics and Reporting Systems team for inspiring the discussions.
[1] Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. Cited by: §2.2.
[2] SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16, pp. 321–357. Cited by: §2.2.
[3] (2015) Keras. https://keras.io Cited by: §2.2.
[4] (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
[5] (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §1.
[6] (2006) An introduction to roc analysis. Pattern recognition letters 27 (8), pp. 861–874. Cited by: §2.2.
[7] (2007) Soft sensors for monitoring and control of industrial processes. Vol. 22, Springer. Cited by: §1.
[8] (1990) Introduction to statistical pattern recognition. Academic Press Inc., San Diego, CA, USA. Cited by: §1.
[9] Novel transformer based on gated convolutional neural network for dynamic soft sensor modeling of industrial processes. IEEE Transactions on Industrial Informatics. Cited by: §1.
[10] (2020) A survey on visual transformer. arXiv preprint arXiv:2012.12556. Cited by: §1.
[11] (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
[12] (2020) Reliable machine prognostic health management in the presence of missing data. Concurrency and Computation: Practice and Experience, pp. e5762. Cited by: §1.
[13] (2021) Prognostics with variational autoencoder by generative adversarial learning. IEEE Transactions on Industrial Electronics. Cited by: §1.
[14] (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7482–7491. Cited by: §2.1.
[15] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.2.
[16] (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
[17] (2021) A survey of transformers. arXiv preprint arXiv:2106.04554. Cited by: §1.
[18] (2014) Autoencoder for words. Neurocomputing 139, pp. 84–96. Cited by: §1.
[19] (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030. Cited by: §1.
[20] (2010) Rectified linear units improve restricted boltzmann machines. In Icml, Cited by: §2.2.
[21] (2020) Wearable computing with distributed deep learning hierarchy: a study of fall detection. IEEE Sensors Journal 20 (16), pp. 9408–9416. Cited by: §1.
[22] (2019) The smart insole: a pilot study of fall detection. In EAI International Conference on Body Area Networks, pp. 37–49. Cited by: §1.
[23] (1986) Learning representations by back-propagating errors. nature 323 (6088), pp. 533–536. Cited by: §1.
[24] (1986) Information processing in dynamical systems: foundations of harmony theory. Technical report Colorado Univ at Boulder Dept of Computer Science. Cited by: §1.
[25] A multilayer-perceptron based method for variable selection in soft sensor design. Journal of Process Control 23 (10), pp. 1371–1378. Cited by: §1.
[26] (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2.2.
[27] (2020) Deep learning for industrial kpi prediction: when ensemble learning meets semi-supervised data. IEEE Transactions on Industrial Informatics 17 (1), pp. 260–269. Cited by: §1.
[28] (2021) A survey on deep learning for data-driven soft sensors. IEEE Transactions on Industrial Informatics. Cited by: §1.
[29] (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.1, §2.2.
[30] (2020) Hierarchical quality-relevant feature representation for soft sensor modeling: a novel deep learning strategy. IEEE Transactions on Industrial Informatics 16 (6), pp. 3721–3730. Cited by: §1.
[31] (2021) Scaling vision transformers. arXiv preprint arXiv:2106.04560. Cited by: §1.
[32] (2021) Auto-encoder based model for high-dimensional imbalanced industrial data. Cited by: §4.