ATCN: Agile Temporal Convolutional Neural Networks for Real-time Processing of Time Series on Edge
This paper presents a scalable deep learning model called Agile Temporal Convolutional Network (ATCN) for highly accurate, fast classification and time series prediction in resource-constrained embedded systems. ATCN is primarily designed for mobile embedded systems with performance and memory constraints, such as wearable biomedical devices and real-time reliability monitoring systems. It makes fundamental improvements over mainstream temporal convolutional neural networks, including the incorporation of separable depth-wise convolution to reduce the computational complexity of the model and residual connections, acting as time attention machines, to increase the network depth and accuracy. This configurability makes ATCN a family of compact networks with formalized hyper-parameters that allow the model architecture to be configured and adjusted based on the application requirements. We demonstrate the accuracy and performance trade-off of the proposed ATCN on three embedded applications: transistor reliability monitoring, heartbeat classification of ECG signals, and digit classification. Our comparison against state-of-the-art approaches demonstrates much lower computation and memory demand, and thus faster processing, with better prediction and classification accuracy. The source code of the ATCN model is publicly available at https://github.com/TeCSAR-UNCC/ATCN.
The astonishing growth in deep learning algorithms has changed how embedded and cyber-physical systems (CPS) process the surrounding environment and has significantly improved overall CPS performance in delivering their assigned tasks. For instance, deep learning algorithms and architectures have powered embedded systems in visual sensing applications such as pedestrian and object tracking [revampt, obj_tracking_emb] and action detection [rnn_actioDetection, fast_actionDetection]. Another dimension of deep learning that has recently emerged at the edge is time series analysis and forecasting. Healthcare [ecg_tcn, health_1, health_2], device health monitoring [Deep_race, biglar_dev_health, dev_lstm_3], and machine translation [lstm_machine_trans1, gru_mt] are some examples of deep learning use in time sequence analysis.
For most deep learning practitioners, recurrent networks, and especially two elaborated models, namely LSTM [lstm_main] and GRU [chung2014empirical], are synonymous with time series analysis due to their notable success in sequence modeling problems such as machine translation, language processing, and device health monitoring. These models infer the output from current and temporal information, which is learned and captured in the hidden states and propagated through time from one cell to the next. The propagation chain of hidden states causes two significant issues [DBLP:empirical]: 1) gradient instability, such as vanishing/exploding gradients, and 2) limited parallelization due to the dependencies across cells.
Temporal Convolutional Networks (TCN) were first proposed as an adaptation of WaveNet [oord2016wavenet] and the Time-Delay Neural Network [timeDelay]. TCN orchestrates dilated convolutions in an Encoder-Decoder architecture to provide a unified framework for action segmentation. Later, Bai et al. [bai2018empirical] designed a Generic TCN (GTCN) architecture for sequence modeling, which outperforms LSTM on time-series and sequence modeling tasks. However, GTCN suffers from two main drawbacks: 1) the dilation size increases exponentially with the layer index, which prevents the designer from increasing the depth of the network, and 2) it uses two standard convolutions per layer, which is computationally expensive for resource-constrained embedded systems.
This paper proposes a novel extension of TCN called ATCN for light-weight processing of time series on embedded and edge devices. We introduce three main computational blocks that decrease the number of MAC operations and the model size of TCN, making it applicable to embedded devices while maintaining comparable or better accuracy than mainstream TCNs. These three blocks can be sequenced in different configurations to build a scalable ATCN capable of meeting design constraints. We demonstrate the capabilities of the proposed ATCN in terms of the trade-off between accuracy and model complexity on three embedded applications. In the heartbeat classification of ECG signals, selected from the health care domain, ATCN not only improved accuracy and the F1 metric by 3% and 4%, respectively, but also decreased the model size and MAC operations by 2.99× and 3.09×, respectively. To show its capability of remembering the distant past, we also trained ATCN to classify digits from the MNIST dataset. For digit classification, ATCN decreases the model size and MAC operations by 7.9× and 16.56×, respectively, with the same accuracy. We also trained ATCN to predict MOSFET transistor degradation for device health monitoring and compared it against LSTM. ATCN decreases the number of parameters and the prediction error by 4.27× and about 30%, respectively.
Overall, the key contributions of this paper are:
Proposing ATCN, which achieves higher or comparable accuracy over state-of-the-art models with significantly lower computation complexity for embedded devices.
Creating a network template supported by an automated design flow for scalable generation and training of different ATCN configurations according to the problem complexity and latency requirements.
Demonstrating the benefits of ATCN in two significant embedded problems with a need for real-time low-power embedded processing: (1) heartbeat classification of ECG signals and (2) transistor device health monitoring.
The rest of this article is organized as follows: Section 2 briefly discusses the use of time-series analysis in embedded systems and CPS. Section 3 provides background on the generic TCN and its architecture. In Section 4, we elaborate on the Temporal-Spectral block, the ATCN architecture, and its hyperparameters. Section 5 presents the experimental results, including a comparison with existing approaches, and finally Section 6 concludes this article.
Traditional convolutional neural networks are used in computer vision applications due to their success in capturing spatial features within a two-dimensional frame. Recently, research has shown that specialized CNNs can recognize patterns in data history to predict future observations. This gives researchers interested in time-series forecasting an alternative to RNNs, which the community has regarded as the established DNN for time-series prediction. In one such case, Dilated Convolutions (DC) have been shown to achieve state-of-the-art accuracy in sequence tasks. In the first use of DC, WaveNet [Wavenet] was designed to synthesize raw audio waveforms, and it outperforms LSTM. Later, Lea et al. [ActionDetection] proposed TCN, a unified network based on WaveNet DC, for video-based action segmentation. Following the same trend, gated DC was used for sequence-to-sequence learning [DBLP:convseq2seq]; the proposed approach beats deep LSTM in both execution time and accuracy.
GTCN [DBLP:empirical] is a generic architecture designed for sequence modeling. The design of GTCN is based on two main principles: 1) there should not be any information leakage from the future to the past, and 2) the network should be able to receive an input of arbitrary length, similar to an RNN. Since the fundamental component of GTCN is a variable-length DC, it brings higher parallelization and a more flexible receptive field than an RNN. Also, since the gradient flow of GTCN differs from the temporal path of an RNN, it is more resistant to gradient instability. Recent studies have taken advantage of GTCN or similar architectures. In [depthWise_Audio], a modified version of GTCN with depth-wise convolution is used to enhance speech in the time domain. DeepGLO [sen2019think] is another work that uses a global matrix factorization model regularized by a TCN to find global and local temporal patterns in high-dimensional time series.
In this paper, we revisit the structure of GTCN to improve its performance on embedded and resource-constrained hardware. Our approach is orthogonal to prior work and can be applied wherever GTCN has been used. We test this claim in Section 5 on three embedded applications: heartbeat classification of ECG signals, MOSFET transistor health monitoring, and digit classification. We show that ATCN improves or maintains the overall system accuracy in these three cases while minimizing computational complexity and model size. In the next section, we study the structure of GTCN in depth to prepare the ground for introducing ATCN in Section 4.
GTCNs are designed around two basic principles: 1) the convolutional operations are causal, i.e., predictions are made based only on current and past information; 2) the network receives an input sequence of arbitrary length and maps it to an output sequence of the same length [DBLP:empirical]. Based on principle number 2, in order to map the final output to an arbitrary size, the output of the last DC layer can be connected to a linear layer. This adds flexibility by allowing the final output length to be independent of the input length. Naive causal convolutions, which have a dilation rate of 1, are inherently inefficient because their sequence history grows only linearly with the depth of the network.
The solution incorporates dilated convolutions to scale the receptive field exponentially, as shown in Fig. 1. The first convolution, with dilation rate $d = 1$, maps the input vector to a higher dimension. GTCN then increases $d$ exponentially for the subsequent convolutions to enlarge the receptive field. The minimum output sequence length, before mapping to the linear layer, can be determined by calculating the receptive field $r_f$ [tcnfpga]:
$$r_f = 1 + \sum_{l=0}^{L-1} (k - 1)\, d_l$$
where $l \in \{0, \dots, L-1\}$ indexes the layers, $k$ is the kernel size, and $d_l$ is the dilation rate at layer $l$. This means that as the depth of the network increases, so does the receptive field. The dilated convolution $F$ on element $s$ of a sequence is given as:
$$F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i)\, x_{s - d \cdot i}$$
where $x$ is a 1-D input sequence, $*_d$ is the dilated convolution operator, $f$ is a kernel of size $k$, and $d$ is the dilation rate [DBLP:empirical, tcnfpga]. For applications requiring a very large $r_f$, it is also essential to provide stability in the later layers, which are subject to the vanishing gradient problem. A popular technique in traditional CNN architectures, the residual block [resnet], provides a “highway” free of any gated functions, allowing information to flow unhindered from the early layers to the last layers.
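To make the causal dilated convolution described above concrete, the following is a minimal pure-Python sketch. It assumes the form used by Bai et al. [DBLP:empirical], in which output element s sums kernel tap i against input element s - d*i, and it treats samples before the start of the sequence as zeros. The function name is illustrative and is not taken from the ATCN source code.

```python
from typing import List, Sequence


def causal_dilated_conv1d(x: Sequence[float], f: Sequence[float], d: int) -> List[float]:
    """Causal dilated convolution: y[s] = sum_i f[i] * x[s - d*i].

    Samples before the start of the sequence are treated as zeros
    (left padding), so the output has the same length as the input
    and y[s] never depends on x[t] for t > s.
    """
    if d < 1:
        raise ValueError("dilation rate must be >= 1")
    y = []
    for s in range(len(x)):
        acc = 0.0
        for i, w in enumerate(f):
            t = s - d * i
            if t >= 0:  # implicit zero padding on the left
                acc += w * x[t]
        y.append(acc)
    return y


x = [1.0, 2.0, 3.0, 4.0, 5.0]
f = [1.0, 1.0]  # kernel of size k = 2
print(causal_dilated_conv1d(x, f, d=1))  # each output adds x[s] and x[s-1]
print(causal_dilated_conv1d(x, f, d=2))  # each output adds x[s] and x[s-2]
```

With the same kernel, raising d from 1 to 2 makes each output reach one step further into the past without adding any weights, which is the mechanism behind the growing receptive field.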
These connections can be seen in the final GTCN architecture shown in Fig. 2. GTCN consists of hidden layers and an optional linear layer that maps the input size to an arbitrary output size. Each hidden layer has two regular dilated convolutions and two ReLU activation functions. There can also be an upsampling unit, such as a pointwise convolution, in the first hidden layer of GTCN to map the 1-D input sequence to a higher dimension, which guarantees that the element-wise addition receives tensors of the same dimension.
The design of GTCN suffers from two problems: 1) exponential growth of the dilation size, and 2) the existence of two regular convolutions per layer. The exponential growth of the dilation size, together with the requirement that the input and output of a dilated convolution have the same length, forces network designers to use excessive padding at the higher layers. Also, the implementation of two convolution blocks per layer makes GTCN costly for CPS. In the next section, we address these problems by introducing the ATCN architecture.
In this section, we introduce the architecture of ATCN. At first, we discuss the essential components, and then we elaborate on the hyper-parameters, and in the end, we present the ATCN architecture and its model builder.
Designers can build ATCN architecture by chaining three basic blocks, namely, Regular Convolution Block (RCB), Spectral-Temporal Convolution Block (STCB), and residual Linear Convolution Block (LCB). We visualized these three basic blocks of ATCN in Fig. 3. The MaxPooling layers, both in RCB and STCB, are optional. Architects can activate them when they need to downsample the temporal information to minimize computational complexity while embedding the information to the higher dimension. In the rest, we discuss the details of each basic block.
We design the ATCN architecture so that it starts with an RCB. This unit consists of a padding unit, a conventional convolution, and an optional MaxPooling layer for downsampling the input. A padding unit is added to all three main blocks to ensure that the input and output tensors of the block have the same size, satisfying principle number 2 of GTCN. To do that, $p$ zeros are added symmetrically, where $p$ is given by:
$$p = \left\lceil \frac{(o - 1)\, s + (k - 1)\, d + 1 - i}{2} \right\rceil$$
where $o$ is the output size, $i$ is the input size, $s$ is the stride, $k$ is the kernel size, and $d$ is the dilation. We can mathematically formulate the RCB block by:
where is the padding size, is the convolution operator, is the dilation rate, is the kernel vector, is the non-linear activation function, is the batch normalization, is the downsampling function, is the stride, and is the input ratio, i.e., the ratio of the input size to the output size.
The downsampling function can be max pooling or average pooling. Designers can select the non-linear activation function based on its performance on the validation loss. For ATCN, it can be set to one of these activation functions:
where is the Hadamard (element-wise) multiplication, , and . In Fig. 4(a), we depict the performance of four widely used activation functions, namely, , , , and , on the MNIST digit classification task, whose ATCN architecture is explained in Section 5.1. As we can see, performs better than the other three at decreasing the validation loss.
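The role of the padding unit can be checked numerically. The sketch below is an illustration under a stated assumption: it uses the standard 1-D convolution length relation, o = floor((i + 2p - d(k - 1) - 1) / s) + 1, with p zeros added on each side, and solves it for the smallest symmetric p that preserves the sequence length. The helper names are hypothetical.

```python
import math


def conv1d_out_len(i: int, k: int, d: int = 1, s: int = 1, p: int = 0) -> int:
    """Output length of a 1-D convolution with p zeros padded on each side."""
    return (i + 2 * p - d * (k - 1) - 1) // s + 1


def same_padding(i: int, o: int, k: int, d: int = 1, s: int = 1) -> int:
    """Smallest symmetric padding p for which the output length is at least o."""
    return max(0, math.ceil(((o - 1) * s + d * (k - 1) + 1 - i) / 2))


# Padding needed to keep a 784-sample sequence at 784 samples (stride 1).
for k, d in [(3, 1), (3, 4), (5, 8), (7, 16)]:
    p = same_padding(i=784, o=784, k=k, d=d)
    print(f"k={k} d={d} -> p={p}, out_len={conv1d_out_len(784, k, d, p=p)}")
```

At stride 1 the required padding grows with the product of dilation and kernel size, which is why large dilations at the higher layers translate directly into many padded zeros.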
The linear convolution block consists of a pointwise convolution (expansion), followed by a depthwise convolution and another pointwise convolution (projection). The task of the expansion convolution is to map the input channel size, $C_{in}$, to a higher or equal dimension, $C_{exp}$, where $C_{exp} \geq C_{in}$. In contrast, the pointwise projection embeds and maps the features extracted by the depthwise convolution to the block output, $C_{out}$. For the depthwise convolution, we set the group parameter $G$, which manages the connection between input and output, to $C_{exp}$. In this case, the convolution weight shape changes from $(C_{exp}, C_{exp}, k)$ to $(C_{exp}, 1, k)$, where $k$ is the kernel size. We designed the network builder so that if $C_{in} = C_{out}$, the skip line is automatically created from the input to the element-wise addition. The input is then added to the residual output of the pointwise projection. The residual connection helps designers increase the network's depth without worrying about the vanishing gradient problem. Similarly to RCB, we can formulate the LCB by:
The next component is the STCB. The principle is the same as in the linear block; however, we use group convolution (G-CNN) rather than depthwise convolution. This choice is based on the observation that, when downsampling the input with an activated max-pooling unit, group convolution helps to better map temporal information to a higher dimension without drastically increasing the computational complexity and the model size. The only constraint imposed by group convolution is that its output channel size, $C_{out}$, should be divisible by the number of groups, $G$. The two extreme G-CNN cases are $G = C_{in}$ and $G = 1$. In the former case, the G-CNN is a depthwise convolution, and in the latter, it is a standard convolution. Formally, the weight shape for G-CNN is $(C_{out}, C_{in}/G, k)$. We depict the effect of altering the $G$ value in Fig. 4(b) for MNIST digit classification. As we can see, reducing $G$ increases the network capacity to minimize the validation loss.
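The effect of the group setting on model size can be illustrated by counting convolution weights. The sketch below assumes the usual grouped-convolution weight shape (output channels, input channels divided by groups, kernel size) and ignores biases and batch normalization. The channel sizes are illustrative values, not the configurations used in the paper, and the function names are hypothetical.

```python
def conv1d_weights(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Number of weights in a 1-D convolution with the given group count."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel sizes must be divisible by the group count")
    return c_out * (c_in // groups) * k


def expand_conv_project_weights(c_in: int, c_exp: int, c_out: int, k: int, groups: int) -> int:
    """Pointwise expansion, grouped k-tap convolution, pointwise projection."""
    expand = conv1d_weights(c_in, c_exp, 1)
    middle = conv1d_weights(c_exp, c_exp, k, groups=groups)
    project = conv1d_weights(c_exp, c_out, 1)
    return expand + middle + project


c_in, c_exp, c_out, k = 64, 128, 64, 7
two_standard = 2 * conv1d_weights(c_in, c_out, k)                          # two regular convolutions
depthwise = expand_conv_project_weights(c_in, c_exp, c_out, k, groups=c_exp)  # depthwise middle stage
grouped = expand_conv_project_weights(c_in, c_exp, c_out, k, groups=32)       # grouped middle stage
print(two_standard, depthwise, grouped)
```

Lowering the group count moves the middle stage from the depthwise extreme toward a standard convolution, trading extra weights for capacity.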
For designing the ATCN network architecture, three knobs should be altered based on the problem complexity (sequence classification, prediction, or segmentation), the input size, the network models, and computational cost trade-off. In the rest, we fully elaborate on each of them.
For a fixed input size, if we increase the number of layers, the GTCN architecture guideline requires increasing the dilation rate exponentially. This helps the network obtain a larger receptive field; however, based on principle number 2, the features must be padded excessively to keep the input and output sizes equal. This unnecessary padding results in 1) more computation and 2) CNN performance degradation. We observed that linear dilation growth helps networks with more than six layers achieve a better feature representation. Although the dilation rate can be defined as a function of the layer number, in the experimental results we increased it after each block with activated downsampling. This decision helps in designing a deep ATCN for cases where the input size is small.
We could not observe a straightforward rule for setting the kernel size depending on the problem complexity. The value should be large enough for the receptive field to cover enough feature context; however, based on Eq. 3, it is good practice to decrease the kernel size in higher layers so that the padding size does not grow exponentially. As explained for the dilation rate, when we increase the dilation rate after blocks with downsampling, we also decrease the kernel size. This decision has two crucial benefits for embedded devices: it reduces 1) the computational complexity and 2) the model size.
Similar to the dilation rate, if we need to increase the network’s depth to increase its capacity, it is recommended to gradually decrease kernel size and have a linear growth for dilation. We can alter both after each block with downsampling units. This decision helps the final structure to have enough receptive field to cover feature context without increasing the MAC operations and model sizes.
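The trade-off between dilation growth and kernel size can be explored with a short calculation. The sketch below assumes one dilated convolution per layer and the common receptive-field form 1 + sum over layers of (kernel size - 1) times dilation. The eight-layer schedules are illustrative values chosen for this example, not the settings used in the experiments.

```python
from typing import Sequence


def receptive_field(kernels: Sequence[int], dilations: Sequence[int]) -> int:
    """Receptive field of stacked dilated convolutions, one per layer."""
    if len(kernels) != len(dilations):
        raise ValueError("kernels and dilations must have the same length")
    return 1 + sum((k - 1) * d for k, d in zip(kernels, dilations))


layers = 8
exponential = [2 ** l for l in range(layers)]   # 1, 2, 4, ..., 128
linear = [l + 1 for l in range(layers)]         # 1, 2, 3, ..., 8
fixed_kernels = [7] * layers
shrinking_kernels = [7, 7, 5, 5, 3, 3, 3, 3]

print("exponential dilation, k=7 :", receptive_field(fixed_kernels, exponential))
print("linear dilation, k=7      :", receptive_field(fixed_kernels, linear))
print("linear dilation, shrinking:", receptive_field(shrinking_kernels, linear))

# Span covered by the last layer alone; at stride 1 this is also the
# number of zeros that layer needs to keep its output length unchanged.
print("last-layer span (exponential):", (fixed_kernels[-1] - 1) * exponential[-1])
print("last-layer span (linear)     :", (shrinking_kernels[-1] - 1) * linear[-1])
```

In this example the last exponentially dilated layer alone spans several hundred samples, which exceeds a short or downsampled sequence and is mostly padding, while the linear schedule keeps the span of each layer small.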
We depict the inputs of ATCN Model Builder and the framework for its training in Fig. 5. The ATCN Model Builder receives Input Channels, Kernel Sizes, Dilation Rates, Input Ratios, and finally, the Output Size to design the ATCN network architecture. In the rest, we explain each of the ATCN Model Builder inputs in detail.
Input Channels: a vector with one entry per layer (block), where the entry for a layer decides the input channel size of that layer. The values in the vector can be 1) ascending, to map the input to a higher dimension, 2) descending, to map the input to a lower dimension, or 3) descending-ascending (Auto-Encoder architecture). Each of these architectures can be chosen based on the problem complexity and the targeted task.
Kernel Sizes: a vector that defines the kernel size for each layer. Based on the discussion in Section 4.2.2, it is advisable to decrease the kernel size with depth to minimize both the model size and the required computation.
Dilation Rates: a vector that defines the dilation rate for each layer. In contrast to the kernel sizes, the dilation rates need to increase with depth to achieve a larger receptive field.
Input Ratios: a vector that defines the input ratio for each layer. For a ratio of 2, the ATCN Model Builder selects an STCB block with a max-pooling unit. If the first ratio is 2, the first layer will be an RCB with an activated max-pooling unit; otherwise, the input and output of the RCB will have the same size. In this paper, each ratio can only be 1 or 2. For other ratios, the ATCN Model Builder can be modified to change the stride of the max-pooling to satisfy the targeted ratio.
Output Size: a scalar value that defines the output channel size and depends on the problem. For instance, for the digit classification problem, it should be set to ten. In this case, the ATCN Model Builder adds an adaptive average pooling layer to downsample the features of each batch, followed by a linear layer whose weight maps the pooled features to the output size. The final output of the entire network for the batch is:
where is downsampling block with stride . An example of ATCN Model Builder output can be seen in Fig. 6. The illustrated network consists of five layers, one average pooling, and one linear block to map the final output to an arbitrary size. In the next section, we used the ATCN Model Builder to generate the model for three different use cases.
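The block-selection logic of the Model Builder can be sketched as a small planning function. This is an illustration under stated assumptions rather than the authors' implementation: the first layer is always an RCB, a layer whose input ratio is 2 becomes an STCB with max-pooling, every other layer becomes an LCB, and an LCB receives a skip connection only when its input and output channel sizes match. The per-layer channel vector is interpreted here as output channel sizes, and all names (`BlockSpec`, `plan_blocks`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockSpec:
    kind: str        # "RCB", "STCB", or "LCB"
    c_in: int
    c_out: int
    kernel: int
    dilation: int
    max_pool: bool   # True when the block halves the sequence length
    residual: bool   # True when a skip connection is added


def plan_blocks(input_channels: int, out_channels: List[int], kernels: List[int],
                dilations: List[int], ratios: List[int]) -> List[BlockSpec]:
    """Turn per-layer hyper-parameter vectors into a list of block specs."""
    n = len(out_channels)
    if not (len(kernels) == len(dilations) == len(ratios) == n):
        raise ValueError("all per-layer vectors must have the same length")
    blocks = []
    c_in = input_channels
    for layer in range(n):
        ratio = ratios[layer]
        if ratio not in (1, 2):
            raise ValueError("only input ratios 1 and 2 are supported")
        pool = ratio == 2
        c_out = out_channels[layer]
        if layer == 0:
            kind, residual = "RCB", False          # network starts with an RCB
        elif pool:
            kind, residual = "STCB", False         # downsampling layer
        else:
            kind, residual = "LCB", c_in == c_out  # skip line needs equal channels
        blocks.append(BlockSpec(kind, c_in, c_out, kernels[layer],
                                dilations[layer], pool, residual))
        c_in = c_out
    return blocks


plan = plan_blocks(1, [16, 16, 32, 32], [7, 5, 5, 3], [1, 1, 2, 2], [1, 1, 2, 1])
for spec in plan:
    print(spec)
```

Keeping the plan as plain data separates the hyper-parameter search from the framework-specific code that instantiates the layers.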
In this section, we targeted three cases to show off the capabilities of ATCN. Two instances are classification problems, and the other one is regression. For classification, we selected MNIST and heartbeat classification of ECG signals, and for regression, ATCN is set up to predict the MOSFET transistor remaining useful life. In the rest of this section, we explain the setup for each of these examples in detail.
For the first case, we select MNIST, as it is frequently used to test the capability of remembering the distant past [mnist_long_distance]. In this case, ATCN is configured to classify the MNIST dataset. Table 1 presents the final ATCN configuration. The seven layers of the network consist of one regular convolution, two STCBs, and four LCBs. The is set to for all STCBs. We also set the activation function to Swish for all the blocks. A dropout rate of 0.2 was used to randomly "ignore" 20% of the layer outputs as a way of regularizing the network and preventing the model from over-fitting [overfitting, dropout].
| Block type | Kernel size | Dilation size | Output channel size |
Dataset. For the first case, we selected the MNIST dataset to examine the ability of ATCN to recall information from the distant past [jing2017tunable, krueger2016zoneout]. The task is to classify input images of gray-scale hand-written digits. Each image is 28×28 pixels and is presented to ATCN as a 1×784 sequence.
Training. We used Adam optimizer with the initialized learning rate of . The is reduced automatically when the validation loss is not improved over epochs with a ratio of . We also set the total epochs to 50.
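The plateau-based learning-rate reduction described in the training setup can be sketched without any framework. The initial rate, reduction factor, and patience used below are placeholder values chosen only for illustration, and the class name is hypothetical.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` when the validation loss has
    not improved for more than `patience` consecutive epochs."""

    def __init__(self, lr: float, factor: float, patience: int, min_lr: float = 0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


# Placeholder values, not the paper's settings.
scheduler = ReduceOnPlateau(lr=1.0, factor=0.5, patience=2)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.95, 0.95]]
print(history)
```

The rate stays unchanged while the loss keeps improving and drops only after the patience window is exhausted, so the schedule adapts to how training actually progresses.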
Results. We summarize the performance of ATCN against GTCN in Table 2. As we can see, ATCN reduces the number of model parameters and MAC operations by 7.9× and 16.56×, respectively, while the accuracy is the same as GTCN. The results indicate that ATCN can match the performance of GTCN while significantly reducing both the model size and the number of operations.
As one of the critical embedded applications, we configured ATCN to classify heartbeats in ECG signals. We configured the ATCN structure based on the dilated convolution developed for DeepECG [goodfellow2018towards]. The final architecture of ATCN is summarized in Table 3. The network consists of 13 layers of regular convolution, Linear Convolution Block (LCB), and Spectral-Temporal Convolution Block (STCB). Similar to the MNIST case, we also set the to for all STCBs. We also set the activation function to ReLU for all the blocks. A dropout rate of 0.3 [dropout] was also used to regularize the network and prevent the model from over-fitting.
Dataset. For the sake of comparison with DeepECG, we also used the 2017 PhysioNet challenge dataset. It contains 12,186 ECG waveforms, each labeled as Normal Sinus Rhythm (N), Atrial Fibrillation (A), Other rhythms (O), or noise, as shown in Fig. 7. The training set (8,528 waveforms) is publicly available, and the test set (3,658 waveforms) is hidden for system evaluation. We applied a 70%/30% split to the training set to obtain training and validation subsets. We also merged the noisy signals and other rhythms into a single class, as DeepECG does. For the validation dataset, we uniformly selected signals from each of the three classes to make sure it covers all classes.
| Block type | Kernel size | Dilation size | Output channel size |
Similarly to the DeepECG, we passed both the training and validation signals through a bandpass filter with lower and upper limits of 3Hz and 45Hz, respectively. The input size for both ATCN and DeepECG is 18000 samples, equivalent to 60 seconds at 300Hz sampling frequency. We also change the polarity of signals randomly with a probability of 50% during the training phase. We used Adam optimizer with the initialized learning rate of. We used a step-wise scheduler to decay the every 70 epochs, with a ratio of for a total of 200 epochs.
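The random polarity change used during training amounts to negating a whole waveform with probability 0.5. A minimal sketch follows; the function name is hypothetical, and passing an explicit random generator is a design choice made here so that runs can be reproduced.

```python
import random
from typing import List, Sequence


def random_polarity_flip(signal: Sequence[float], rng: random.Random, p: float = 0.5) -> List[float]:
    """Return the signal negated with probability p, otherwise an unchanged copy."""
    if rng.random() < p:
        return [-v for v in signal]
    return list(signal)


rng = random.Random(0)
beat = [0.0, 0.1, 0.9, -0.3, 0.05]
augmented = random_polarity_flip(beat, rng)
print(augmented)
```

Because the flip is applied only during training, validation signals keep their recorded polarity.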
Results. We compared the performance of ATCN and DeepECG and summarized the results in Table 4. ATCN reduces the model size from 5.39 million parameters to 1.8 million, a 2.99× reduction. Similarly, we observed a 3.09× reduction in the number of MAC operations, from 45.59 GMACs to 14.7 GMACs. The F1 score also improved by 0.02 for the (N) and (O) classes, while it decreased for the (A) class; overall, ATCN improved the average F1 by 0.04. ATCN also increased the top-1 accuracy by 3% overall.
Another application frequently used in embedded systems is device health monitoring. For this case, we configured and trained ATCN for MOSFET transistor reliability prediction, and we compared the results against Deep RACE [Deep_race]. Based on the most widely accepted industry standards for device qualification, such as AEC-Q101 [StressTestQualification], and state-of-the-art research, $I_{DSS}$, $T_j$, $V_{th}$, $R_{th}$, and $R_{ds(on)}$ are the most common aging parameters for tracking device degradation and are essential for predicting the remaining useful life (RUL) of MOSFET devices [conditionmonitoring, uncertainty, progandhealth, failuremech]. $I_{DSS}$, which refers to the drain current at zero bias, can be used for early detection of die-level failures; $T_j$ is the device junction temperature and corresponds to thermal runaway failures; $V_{th}$ captures the shift of the gate threshold voltage; and $R_{th}$ is the thermal resistance of the device and represents device overheating, mostly at the package level. Finally, $R_{ds(on)}$ is the device drain-source resistance, which represents device degradation at both the die and package levels and inherently reflects the device's internal loss.
Figure 8 illustrates the deviation of $R_{ds(on)}$ from its pristine condition over time for eleven different MOSFET transistors (IRF520NPbf) extracted from the data set provided by NASA [NASA]. As the sample rate was not uniform, data points were re-sampled and then filtered by averaging samples over a one-minute window. As shown in Figure 8, these eleven transistors may seem to share a similar degradation pattern at first glance; however, the deterioration pace and individual behaviors vary significantly across devices with the same underlying physics. This is primarily due to diverse workloads, different environmental conditions, and varying manufacturing processes. Due to these unique unit-to-unit conditions and the complexity of power electronics systems, reliability predictions based on theoretical approaches, such as physics-of-failure, result in significant errors in real-world use. Deep RACE [Deep_race] shows that an LSTM can predict the $\Delta R_{ds(on)}$ trajectory more accurately than classical approaches such as the Kalman Filter [dusmez2016remaining_KF] and the Particle Filter [celaya2011prognostics_PF]. Similarly, we trained ATCN to predict $\Delta R_{ds(on)}$. The structure of ATCN is summarized in Table 5. The network consists of four blocks of RCB and LCB. Like the LSTM used in Deep RACE, we set the network input to a 1×21 sequence of $\Delta R_{ds(on)}$. We deactivated the max-pooling unit in the RCB by setting its input ratio to 1, as we found that halving the first input degrades the ATCN performance. We also set the activation function to Swish and used a dropout rate of 0.2 [dropout].
Dataset. The experimental data sets for both training and testing come from the eleven power MOSFET (IRF520NPbf) transistors introduced above. The network forecasts a transistor's degradation behavior based on knowledge acquired during the training phase, without any prior knowledge of the test transistors.
Training. For all of the experiments, the input size was set to 21 samples (13 seconds) to predict the next 104 samples (62 seconds). For the application with higher window resolutions (i.e., higher output sequence), the network input sequence should also be increased to minimize the prediction error. The input data is normalized to [-1, 1]. This allows for easier and faster training. We used Adam optimizer to reduce the Mean Squared Error (MSE) between measured , and the output of both networks. The weights of the ATCN were initialized using the Xavier algorithm [xavier] and underwent 4,000 epochs of training with a starting learning rate of . The decreased by a factor of when the validation loss function is not improved over the last 200 epochs.
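The input preparation described above, scaling to [-1, 1] and pairing a 21-sample input with the following 104 samples, can be sketched as follows. The text does not state how the windows were cut or which statistics were used for scaling, so the sliding stride of one sample and the per-series minimum and maximum are assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def scale_to_unit_range(values: Sequence[float]) -> List[float]:
    """Min-max scale a series to the interval [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series carries no range
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def make_windows(series: Sequence[float], n_in: int = 21,
                 n_out: int = 104) -> List[Tuple[List[float], List[float]]]:
    """Slide over the series and pair each n_in-sample input with the
    n_out samples that follow it."""
    pairs = []
    for start in range(len(series) - n_in - n_out + 1):
        inputs = list(series[start:start + n_in])
        targets = list(series[start + n_in:start + n_in + n_out])
        pairs.append((inputs, targets))
    return pairs


series = scale_to_unit_range([0.01 * t for t in range(130)])  # synthetic ramp
pairs = make_windows(series)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

A series must contain at least 125 samples to yield a single training pair with these window lengths.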
| Block type | Kernel size | Dilation size | Output channel size |
Result. For validation and testing, four of the eleven devices (8, 9, 14, 36) were chosen as individual test sets. When a single test device is chosen, e.g., device 8, the remaining ten devices become the training set. This ensures that the model is tested on a device it has not seen before. The results in Table 6 compare the LSTM (Deep RACE) against ATCN in both model size and complexity, as well as in the metric used for regression accuracy, log(MSE). For log(MSE), the lower the value, the better the accuracy. As we can see, ATCN decreases the model size by about 10.87×, while the number of computations increases by 33.77%; however, it decreases the reliability prediction error by 29.94%. We visualize the qualitative results of both Deep RACE (LSTM) and ATCN in Fig. 9.
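The evaluation protocol above, in which one device is held out and the remaining ten are used for training, can be written as a small generator. Only the device IDs 8, 9, 14, and 36 come from the text; the other seven IDs below are placeholders, and the function name is hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_device_out(devices: Sequence[int],
                         test_ids: Sequence[int]) -> Iterator[Tuple[List[int], int]]:
    """Yield (training devices, held-out test device) for each test ID."""
    for test_id in test_ids:
        if test_id not in devices:
            raise ValueError(f"unknown device {test_id}")
        train = [d for d in devices if d != test_id]
        yield train, test_id


# IDs 8, 9, 14 and 36 are from the text; the remaining seven are placeholders.
devices = [8, 9, 14, 36, 101, 102, 103, 104, 105, 106, 107]
splits = list(leave_one_device_out(devices, [8, 9, 14, 36]))
for train, test in splits:
    print(f"test device {test}: {len(train)} training devices")
```

Each split trains a fresh model, so no measurement from the held-out device can influence the weights that are evaluated on it.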
| Input size | Model size (K) | Operations (K) |
We compare the execution time of ATCN against LSTM (Deep RACE) in Table 7. The results are based on running both models for 1000 iterations. We set the batch size to one in both cases to measure the latency of both models. As we can see, ATCN improves the average time and the best time by 3.02× and 5.20×, respectively. The results indicate that ATCN utilizes the embedded GPU better than LSTM due to its much higher level of parallelism.
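A latency measurement of this kind, repeated single-sample runs reporting the average and the best time, can be sketched with a small timing harness. The warm-up phase is an assumption added here and is not mentioned in the text, and `infer` is a stand-in callable rather than either of the evaluated models. On a GPU, the callable would also have to block until the device has finished, otherwise only the launch time is measured.

```python
import time
from statistics import mean
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], iterations: int = 1000,
              warmup: int = 10) -> Dict[str, float]:
    """Time repeated calls of fn and report average and best latency in ms."""
    for _ in range(warmup):  # untimed runs to exclude one-off start-up costs
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return {"avg_ms": mean(samples), "best_ms": min(samples)}


def infer() -> int:
    """Stand-in workload for one batch-size-one forward pass."""
    return sum(i * i for i in range(1000))


print(benchmark(infer, iterations=1000))
```

Reporting the best time alongside the average helps separate the cost of the model itself from scheduling noise on the device.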
| Model | Execution performance (ms) |
This paper proposes a novel extension of TCN called ATCN for light-weight real-time processing of time series on embedded and edge devices. We introduced three main computational blocks to decrease the number of MAC operations. These three blocks can be sequenced in different configurations to build a scalable ATCN capable of meeting design constraints. We also presented a framework, called ATCN Model Builder, to generate ATCN models. The result of the ATCN Model Builder is a family of compact networks with formalized hyper-parameters that allow the model architecture to be configured and adjusted based on the application requirements. We demonstrated the performance of ATCN through three embedded application cases. In the heartbeat classification of ECG signals, selected from the health care domain, ATCN improves accuracy and the F1 metric by 3% and 4%, respectively. It also shrinks the model size and MAC operations by 2.99× and 3.09×, respectively. ATCN demonstrated its ability to remember the distant past by classifying digits from the MNIST dataset. For digit classification, ATCN decreases the model size and MAC operations by 7.9× and 16.56× relative to GTCN, respectively, while maintaining the same accuracy. We also trained ATCN to predict MOSFET transistor degradation for device health monitoring and compared it against LSTM. ATCN decreases the model size and the prediction error by 4.27× and about 30%, respectively.