Earthquake forecasting and prediction have long been of interest because of the obvious practical and societal implications. While research has waxed and waned with many failed directions, recent work on early warning systems, hazard assessment, and precursors has spurred renewed interest (Allen and Stogaitis, 2022; Kohler et al., 2020; Ben‐Zion and Lyakhovsky, 2001; Kong et al., 2018; Pritchard et al., 2020; Beroza et al., 2021). Laboratory work has fueled this interest via: 1) the discovery that machine learning can predict several aspects of lab earthquakes (Rouet-Leduc et al., 2017, 2018; Johnson et al., 2021) and 2) recent work on the mechanisms of precursors to labquakes that adds to a strong group of existing studies (Shreedharan et al., 2021; Bolton et al., 2021; Dieterich, 1978; Scholz, 1968; Dresen et al., 2020; Scuderi et al., 2016; Hedayat et al., 2014; Acosta et al., 2019; Johnson et al., 2013; Main et al., 1989, 1992; Thompson et al., 2005; Passelègue et al., 2017).
Recent works show that acoustic emissions can be used to predict labquake failure time, fault zone shear stress, and labquake stress drop (Rouet-Leduc et al., 2017, 2018; Lubbers et al., 2018; Hulbert et al., 2018; Bolton et al., 2020). Existing works show that seismic radiation from lab faults scales with time to failure and that its recent history can be used to predict the current fault zone stress state (Bolton et al., 2019; Corbi et al., 2019; Jasperson et al., 2021) with both active and passive seismic monitoring of the fault zone (Shreedharan et al., 2021; Shokouhi et al., 2021). These studies show that both passive signals coming from the fault zone and active acoustic signals passing through the fault can be used to predict failure. The active source signals record changes in fault properties prior to failure and thus offer the possibility of incorporating physics-based models (i.e., of asperity contact mechanics) in the ML/DL algorithm. Other studies show that ML can be used to connect lab AE events with fault zone microstructure (Chaipornkaew et al., 2022; Trugman et al., 2020; Ma et al., 2021) and that ML methods can be augmented by directly incorporating physics into the prediction models (Raissi et al., 2019). Here, we build on these works to further the basis for ML/DL work in earthquake physics and forecasting.
Recent lab studies have also identified reliable precursors to failure in connection with labquake prediction. These include strong relations between acoustic transmissivity, elastic properties and fault strength (Schubnel et al., 2006; Nagata et al., 2008; Scuderi et al., 2016; Tinti et al., 2016; Shreedharan et al., 2020, 2021; Bolton et al., 2021; Rivière et al., 2018). Other studies provide insight on labquake nucleation processes and how the evolution of frequency-magnitude (b-value) statistics during the seismic cycle might inform failure forecasts (Dresen et al., 2020; Rivière et al., 2018; Scholz, 2015; Goebel et al., 2017; Kwiatek et al., 2014; Latour et al., 2013; Shreedharan et al., 2020; Johnson et al., 2021; Ma et al., 2021; McBeck et al., 2020; McLaskey and Lockner, 2014). These studies provide a framework for understanding how machine learning can predict labquakes using statistics of the continuous AE signal emanating from lab faults (Hulbert et al., 2018). Similar ML approaches have been applied to predict geodetic signals of episodic tremor and slip (ETS) on tectonic faults (Hulbert et al., 2020; Johnson et al., 2020; Wang et al., 2021; McBrearty et al., 2019; Mousavi et al., 2020; Michelini et al., 2021; Ross et al., 2018).
The majority of existing ML studies of labquakes use decision-tree-based algorithms and a gradient boosting framework (e.g., XGBoost model). In such studies models are built and errors are minimized by gradient descent methods using multiple learning algorithms that iteratively improve predictive performance beyond that of a single classical learning algorithm. These studies show that it is possible to successfully predict labquakes with reasonable accuracy. The success of the original approaches was based on continuous records of AE events broadcast from the lab fault zones. These AE represent a form of lab foreshock; they reflect the critical stress conditions that produce precursors to lab earthquakes.
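The gradient-boosting approach can be sketched as follows. This is a minimal illustration using scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost; the synthetic AE features and their relation to time to failure are invented for the example and are not taken from the cited studies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for windowed AE statistics: in the lab data the
# signal variance grows as failure approaches, so here it is made a
# noisy, decreasing function of time to failure (TTF).
n_windows = 2000
ttf = rng.uniform(0.0, 10.0, n_windows)                     # target (s)
ae_variance = 1.0 / (ttf + 0.5) + 0.02 * rng.standard_normal(n_windows)
ae_kurtosis = 3.0 + 2.0 / (ttf + 0.5) + 0.1 * rng.standard_normal(n_windows)
X = np.column_stack([ae_variance, ae_kurtosis])

# Hold out the last 20% of windows (no shuffling, mimicking a causal protocol).
split = int(0.8 * n_windows)
model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  learning_rate=0.05)
model.fit(X[:split], ttf[:split])
r2 = model.score(X[split:], ttf[split:])    # R^2 on held-out windows
```

The ensemble of shallow trees is fit stage-wise, each new tree reducing the residual error of the previous ones, which is the "iteratively improve beyond a single learner" idea described above.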
Despite the dramatic recent advances in lab earthquake prediction, our understanding of how and why these methods work is far from complete. Simple questions remain, such as how the AE signal scales with labquake timing and magnitude, and how AE signal characteristics encode fault zone shear stress even during aperiodic failure. Here, we address these questions by exploring a wider range of ML/DL methods. We also address the open discussion regarding the predictability of earthquakes and whether faults slip deterministically or stochastically.
We test two different methods and three families of Neural Network (NN) architectures. The two methods correspond to two different tasks: prediction and forecasting. The former has been addressed in previous works (Hulbert et al., 2018; Rouet-Leduc et al., 2017), where the main goal is to predict the present value of target variables (e.g., shear stress) given past information and some memory of the sequence evolution. With the second method we introduce a novel autoregressive (AR) forecasting approach to predict not only the present value of the target variable, but also its step-by-step future evolution.
The three families of NN architectures adopted in this paper are the Long Short-Term Memory (LSTM) network, the Temporal Convolutional Network (TCN), and the Transformer Network (TF). LSTM is a Recurrent Neural Network (RNN) designed to capture long-range data dependencies in time series and sequential data (Hochreiter and Schmidhuber, 1997). RNNs are Deep Neural Networks (DNNs) that recursively update an internal state (which may represent the current state of the fault) from which predictions are made. This approach provides the capability of modeling sequences of events. LSTM is designed to solve the vanishing-gradient problem of RNNs for long-term prediction, and it has shown great potential for modeling nonlinear structural seismic responses (Johnson et al., 2021). TCN, one of the simplest NN architectures for sequence modeling, is a Convolutional Neural Network (CNN) with adjustments that allow longer sequences by addressing the relation between network depth and the length of the considered input sequence. Because it is convolutional, TCN may be applied to sequences of variable length. The TF architecture is the most recent and has shown promising results for earthquake detection (Mousavi et al., 2020). It is based on self-attention mechanisms and is effective in capturing long-term dependencies in a variety of sequence modeling tasks (Vaswani et al., 2017). One notable weakness of TF is its complexity and the fact that it requires significantly larger datasets than other architectures.
2 Laboratory Earthquake Experiments
We use data from experiments conducted in the double-direct shear (DDS) configuration with a biaxial deformation machine (see Figure 1). The DDS experiments consist of two identical fault zones sheared simultaneously between three loading blocks. Two stationary side blocks are supported from below and a central block is driven between them to impose shear.
Our experiments were performed following a standard procedure developed to obtain highly reproducible results (Karner and Marone, 1998; Bolton et al., 2020). First, fault normal stress was imposed with a servo-controlled hydraulic system and maintained at a controlled value throughout shear. Then, fault zone shear was imposed by driving the central block of the DDS configuration at a constant displacement rate, thus imposing a constant shear strain rate within the fault zone (Table 1). Fault zones were composed of granular materials that began as 3-mm-thick layers with a nominal contact area of 10 × 10. We use data from two types of lab fault zones: 1) glass beads (experiment p4581) and 2) quartz powder (experiments p4679 and p5198). Table 1 provides experiment details. The glass beads range in size from 100-150 µm, and for these experiments the normal stress was held below 10 MPa so that grain fracturing is minimal (e.g., Mair et al., 2002). The quartz powder has a mean particle size of 10 µm (Bolton et al., 2020) and a power-law particle size distribution that does not change significantly with shear at the stresses of our experiments. The fault normal stress was 7 MPa for experiment p4679; it ranged from 6 to 11 MPa in experiment p5198 and from 2 to 8 MPa in p4581. These experiments exhibit a range of stick-slip behaviors from fast to slow (Leeman et al., 2016) and have seismic cycles that range from highly periodic to complex and aperiodic.
We study a range of frictional sliding behaviors from stable sliding to highly unstable slip. In the framework of rate-and-state friction, this range of behaviors is understood in terms of the ratio of the loading stiffness to a critical rate of fault weakening with slip, which scales with normal stress (Leeman et al., 2016). Each experiment includes hundreds of lab earthquakes (Figure S1). By adjusting the loading stiffness and/or the fault normal stress, the same fault gouge can host slow-slip events, fast lab earthquakes, or complex, quasi-periodic events (Leeman et al., 2016; Scuderi et al., 2017; Mele Veedu et al., 2020). Mechanical data of stress and strain are recorded at 1 kHz and acoustic signals are recorded at 4 MHz using lead-zirconate-titanate piezoceramic sensors embedded in the DDS (Figure 1) (Rivière et al., 2018; Bolton et al., 2020). A high-precision timing signal is used to synchronize the mechanical and acoustic data.
We study data from three experiments chosen to represent a range of lab earthquake behaviours. In some cases the events are quasi-periodic, as for experiment p5198 (Figure 2a), and in others, as for experiment p4679 (Figure 2d), they are aperiodic with alternating slower and faster events. Experiments like p4679 often have complex stress evolution throughout the lab seismic cycle. Some of this complexity can be seen at the scale of the whole experiment (Figure S1) and other aspects of it are visible only at a larger scale (Figure 2). In the case of p4581, the pre-seismic phase of the seismic cycle is characterized by almost no pre-slip prior to failure. The acoustic emission record for this type of experiment is distinct because the signal variance is quite low until just prior to the labquake (Figure S2). Nonetheless, there is still AE activity before the main labquake. In contrast, experiment p5198 shows significant pre-seismic slip before the peak stress and failure (see Figure 2). Both of these experiments show somewhat regular, quasi-periodic seismic cycles (Figures 2 and S2). The third experiment (p4679) shows aperiodic seismic cycles that are much more challenging to predict. For each experiment we focus on a data segment of 30 to 50 events shown in the red boxes in Figure S1.
3 Prediction and forecasting models
We use DNNs and three of the most highly-accredited sequence modelling approaches: LSTM (Long short-term memory) (Hochreiter and Schmidhuber, 1997), TCN (Temporal Convolutional Network) (Bai et al., 2018) and TF (Transformer Network) (Vaswani et al., 2017). Figures 3 and 4 and Appendix 9.1 provide a summary of our models.
3.1 Input, output and model performance
We use measurements of shear stress and radiated elastic energy as ML features. Typically, model input consists of statistical measures of the raw data (for the prediction task) or the raw time series data (for forecasting). The models output estimates that can be compared to ground truth. Our input data are time series with a fixed sampling rate. Thus, unlike in other ML applications, it is inappropriate to shuffle the data, because of causality in the processes that connect stress/strain data with AE and vice versa.
As in Rouet-Leduc et al. (2017), we consider the variance of the continuous acoustic signal as the most important ML feature for the prediction task. An example of raw data for the acoustic signal and shear stress is shown in Figure 1 while the variance is shown in Figure 2.
We focus on fault zone shear stress as a primary ML/DL target (Figures 1 and S1) because it provides a comprehensive summary of the lab seismic cycle. Three stages of the lab seismic cycle can be identified: an initial –interseismic– stage of stress increase, a stage with pre-seismic slip and then a last stage with a co-seismic stress drop (Figures 1 and 2). The interseismic period of the lab seismic cycle is identified from the initial, linear-elastic portion of the curve, where load increases in proportion to stiffness. Pre-seismic slip is marked by a deviation from elastic loading. This typically corresponds to a relatively short time interval preceding failure and seismic energy release. Pre-seismic slip is part of the labquake nucleation process.
For the ML prediction task, our models predict the time to failure (TTF) defined as the time remaining before the next labquake. We also predict both the time of start of failure (TTsF) and end of failure (TTeF). The former derives from the time of maximum shear stress preceding labquakes (Figure 2, orange line), whereas the latter is derived from the minimum shear stress after events (Figure 2, red line). We show TTF along with shear stress and variance of the continuous AE record for several lab seismic cycles in Figure 2. This highlights the stepwise nature of TTsF and TTeF as linear functions that our models are designed to predict. Note the differences in the seismic cycles for our experiments: p5198 shows a somewhat periodic sequence of labquakes with lower acoustic energy compared to p4679, which has complex seismic cycles and greater acoustic energy release (Figure 2 and S2).
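One simple way to construct such labels from a stress record is sketched below. This is an illustration on a synthetic sawtooth stress signal; the sign-change peak picking is a placeholder for however failure times are actually identified, and all names are local to the example.

```python
import numpy as np

def ttf_labels(stress, dt):
    """Time to the next local stress maximum (a proxy for TTsF) and the
    next local minimum (a proxy for TTeF) for a 1-D shear-stress record
    sampled every `dt` seconds."""
    d = np.diff(stress)
    maxima = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1   # peak indices
    minima = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1   # trough indices

    def time_to_next(events, n):
        ttf = np.full(n, np.nan)
        for e in events[::-1]:          # walk backwards, filling time-to-event
            ttf[: e + 1] = (e - np.arange(e + 1)) * dt
        return ttf

    n = len(stress)
    return time_to_next(maxima, n), time_to_next(minima, n)

# Sawtooth "seismic cycles": slow linear loading, sudden stress drop.
t = np.arange(0, 4, 0.01)
stress = t % 1.0
ttsf, ttef = ttf_labels(stress, 0.01)
```

The resulting labels decrease linearly within each cycle and reset at each event, which is the stepwise sawtooth shape the models are trained to reproduce.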
To calculate AE statistics and determine features for our ML algorithms, we use a moving window on the continuous records made at 4 MHz (Table S1). We adjust this window for each experiment based on the seismic cycle duration (Figure 2). This preserves all of the shear stress evolution while not losing small differences between cycles (in the case of experiments p4581 and p5198), and allows us to distinguish fast from slow events in experiment p4679. In particular, we adopt window lengths similar to the literature (e.g., Rouet-Leduc et al., 2018; Hulbert et al., 2018) to allow a reliable comparison in terms of performance.
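A minimal sketch of such windowed feature extraction follows; the window length and the synthetic AE signal are illustrative only, not the values used in the experiments.

```python
import numpy as np

def windowed_stats(signal, window):
    """Non-overlapping moving-window statistics of a continuous AE record.

    Returns the variance and excess kurtosis of each window; these
    higher-order moments are typical AE features for labquake prediction.
    """
    n = len(signal) // window
    w = signal[: n * window].reshape(n, window)
    mean = w.mean(axis=1, keepdims=True)
    var = w.var(axis=1)
    m4 = ((w - mean) ** 4).mean(axis=1)
    # Guard against zero variance in perfectly quiet windows.
    kurt = np.where(var > 0, m4 / np.maximum(var, 1e-30) ** 2 - 3.0, 0.0)
    return var, kurt

rng = np.random.default_rng(1)
ae = rng.standard_normal(400_000)      # stand-in for a 4 MHz AE stream
var, kurt = windowed_stats(ae, 4000)   # e.g., 1 ms windows at 4 MHz
```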
For our forecasting approach we also smooth the signal using a running-average window. Then we reduce the resolution to make the length of the sequences suitable for our ML models. In particular, we sub-sample the signal by a factor of 100 (after smoothing, to limit aliasing). A typical ML protocol requires at least two subsets of data: one for training, to learn the model parameters, and one for testing, to measure the generalizability of the results to unseen data (Figures 2 and S2). We also use a validation dataset to evaluate and optimize the model fit during training and to avoid overfitting. Validation is done with a small portion of the data used neither for training nor for testing. The split among the three segments preserves the statistical properties of the dataset.
To evaluate candidate models and build the ML algorithm with appropriate weights requires a loss function. We use a root mean squared error (RMSE) loss function comparing the output prediction and the ground truth data. From this comparison, we iteratively adjust model parameters (weights and biases) to minimize the loss. Features and target are generally normalized because their ranges typically vary widely and because normalization helps DL algorithms converge with better results.
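For concreteness, the loss and a typical normalization can be written out. The exact normalization scheme is not specified here, so z-scoring is shown as one common choice; both helpers are local to the example.

```python
import numpy as np

def zscore(x, mean=None, std=None):
    """Standardize a feature or target.  The statistics should come from
    the training split only and then be reused on validation/test data."""
    mean = x.mean() if mean is None else mean
    std = x.std() if std is None else std
    return (x - mean) / std, mean, std

def rmse(pred, truth):
    """Root mean squared error between model output and ground truth."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

y = np.array([2.0, 4.0, 6.0])
y_n, mu, sd = zscore(y)
```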
For evaluation metrics we use both the coefficient of determination (R²) and the RMSE. We use RMSE as the loss, following common practice in ML, but also as a test metric representing the discrepancy between model output and ground truth. We also use the R² metric to compare our ML predictions with null models; that is, models that always predict the average value of the label. Moreover, we quote R² because it has been used commonly in previous works and is thus useful for comparing our results with the state of the art.
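The null-model comparison amounts to the following; `r2_score` is a local helper transcribing the standard definition, not the scikit-learn function.

```python
import numpy as np

def r2_score(pred, truth):
    """Coefficient of determination: 1 minus the ratio of residual
    variance to the variance of the ground truth.  A null model that
    always predicts the mean of the label scores exactly 0."""
    ss_res = np.sum((truth - pred) ** 2)
    ss_tot = np.sum((truth - truth.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

truth = np.array([1.0, 2.0, 3.0, 4.0])
null = np.full_like(truth, truth.mean())   # mean-only null model
```

Any model with R² > 0 therefore carries more information than always predicting the label's average.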
3.2.1 Problem definition
Our goal is to predict the present value of shear stress or TTF, as target variables (e.g., y(t), where t is the present time), given current information about the AE, as an input variable (e.g., x(t)), and its recent history (e.g., from x(t−n) to x(t−1)). Previous works have observed that AE statistics (in particular signal amplitude variance and higher-order moments) are highly effective at predicting laboratory earthquakes (Rouet-Leduc et al., 2017). Thus we begin by following this approach. However, whereas previous works focused on data from only one acoustic sensor, we use data from two sensors, one on either side of the DDS assembly (Figure 1). We note that the temporal evolution of the AE signal during the lab seismic cycle differs between our experiments, in particular during the creep phase preceding labquakes (Figures 2 and S2).
For the ML process we consider two channels of AE variance as input, from time t−n to time t, and we predict the target (shear stress or TTF) at time t using the input variable and its recent history. We want to learn the ideal function that maps our input into the desired output y. With ML we approximate that function with a model f, leveraging the input data, so that the estimator makes a prediction ŷ(t) = f(x(t−n), …, x(t)) for each sample t. The quantity n determines the length of the input sequence and represents the LSTM's memory in the internal state. We chose the hyperparameter n using an ad hoc iterative approach (Section 4.1.1).
3.2.2 Deep learning model
We use a DL model based on the implementation of Levinson et al. (https://www.kaggle.com/c/LANL-Earthquake-Prediction/overview). This model, from Team "The Zoo," won the Kaggle competition for lab earthquake prediction (Johnson et al., 2021). We optimized the model for our purposes, varying hyperparameters and making other minor changes. The architecture, shown in Figure 3, is a combination of an LSTM layer, which scans the input sequence and produces an embedding, followed by three stacked convolution layers. This architecture extracts the patterns of the sequence and predicts the target. The model has a total of 421,277 parameters. We implement the model in Keras (https://keras.io/) on TensorFlow (https://www.tensorflow.org/).
3.3.1 Problem definition
Autoregressive forecasting methods are distinct from existing predictive models. They predict not only the current state of the target (e.g., y(t), where t is the present time), but also future steps (e.g., from y(t+1) to y(t+N), where N is the length of the sequence to predict) given past information (e.g., from y(t−n) to y(t)). In particular, in the forecasting problem the input and output variables need to be of the same nature; that is, we can input shear stress data from time t−n to t and predict shear stress in the future from time t+1 to t+N. This type of DL model uses an autoregressive (AR) technique in which regression-based training occurs on the input variable itself. Autoregressive models predict their own input features one step at a time and keep predicting longer-term future horizons by using the previous output as input for the next step (Figure 4). The estimator makes its first prediction ŷ(t+1) from measurements of the target y(t−n), …, y(t) and continues predicting autoregressively using the previous outputs ŷ(t+1), …, ŷ(t+i−1). This proceeds iteratively as:

ŷ(t+i) = f(y(t−n+i), …, y(t), ŷ(t+1), …, ŷ(t+i−1)),

where 1 ≤ i ≤ N. During AR training we apply the teacher forcing technique. Teacher forcing is known to introduce a so-called exposure bias (He et al., 2019), since the training procedure focuses on predicting the next step, while at inference the model predicts from history and from its own (autoregressive) predictions. However, it has been shown that the exposure bias can effectively be neglected in most cases (He et al., 2019).

With our use of teacher forcing, the prediction at time t+i uses the ground truth y(t+1), …, y(t+i−1) rather than the predictions, and proceeds iteratively as:

ŷ(t+i) = f(y(t−n+i), …, y(t), y(t+1), …, y(t+i−1)).
Thus, we introduce and accept differences between the training and test protocols (i.e., applying teacher forcing during training), but we still leverage the full data set. This training is more efficient because ground-truth data are more informative than predicted values. Model validation and testing are done without teacher forcing, because in those cases we lack ground truth. The AR approach has several advantages. First, it is self-supervised and does not require labelling. This also extends to TTF: TTF is computed from the actual sequence of stress values during (self-supervised) training and from the forecast stress values at test time. Since TTF is a manually constructed quantity, it may be prone to ambiguity introduced by pre-processing. Also, AR has the potential to predict beyond the TTF estimates, by predicting multiple cycles into the future.
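The contrast between teacher-forced training targets and free-running autoregressive inference can be illustrated with a toy linear one-step predictor standing in for the networks; everything below is a schematic, not the paper's implementation.

```python
import numpy as np

def one_step(history, w):
    """Toy linear one-step-ahead predictor standing in for the DNN."""
    return float(np.dot(w, history))

def forecast_autoregressive(series, w, n_in, n_out):
    """Free-running inference: each new prediction is fed back as input."""
    window = list(series[-n_in:])
    preds = []
    for _ in range(n_out):
        y = one_step(np.array(window), w)
        preds.append(y)
        window = window[1:] + [y]       # previous output becomes input
    return np.array(preds)

def teacher_forced_pairs(series, n_in):
    """Training pairs: the *ground-truth* history is always the input,
    and the next observed value is the target (teacher forcing)."""
    X = np.lib.stride_tricks.sliding_window_view(series[:-1], n_in)
    y = series[n_in:]
    return X, y

# A sine is exactly predictable by a 2-tap linear recurrence, so the
# toy model can be "trained" in closed form with least squares.
t = np.arange(300)
series = np.sin(2 * np.pi * t / 50)

n_in = 2
X, y = teacher_forced_pairs(series, n_in)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
preds = forecast_autoregressive(series[:250], w, n_in, 50)  # 50 free-run steps
```

Training always sees ground-truth histories, while the forecast loop consumes its own outputs, which is exactly the train/inference mismatch (exposure bias) discussed above.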
For AR training we set the time interval for the input history (from t−n to t) with the "steps in" variable n. Then, to establish the prediction time (from t+1 to t+N), we use the length N as the "steps out" variable. The values of n and N used for training are also used for validation and testing. While it is possible to increase N and extend the forecast horizon, we expect a deterioration in performance for values greater than the one used in training. Our forecasting models are implemented in PyTorch (https://pytorch.org/). Additional details of the procedure are provided in the Supplement.
3.3.2 Forecasting with LSTM
A key component of an LSTM (Figure 3(a)) is the memory cell that regulates information flow using three gates (i.e., the Forget gate, the Input gate, and the Output gate). The Forget gate deletes useless information by not releasing it to the next stage. The Input gate regulates new information into the network. The Output gate decides which part of the memory to output from the cell. During the training process, inputs and model weights pass to all gates, which in turn are connected to the self-recurrent memory cell (Figure 3(a)). Our model has 3 stacked layers of size 300, for a total of 1,808,701 parameters (see Supplement 9.1.2 for details).
3.3.3 Forecasting with Temporal Convolutional Networks (TCN)
Temporal Convolutional Networks (TCN) (Figure 3(b)) are Convolutional Neural Networks (CNN) that have been adopted for sequence modeling because of their performance (Dessì and Baroni, 2019). A TCN consists of causal and dilated 1D convolutional layers with the same input and output lengths (Bai et al., 2018). Here, the term causal refers to the fact that convolution is applied only to the present and past elements, never the future. The term dilation, in the context of a convolutional layer, refers to the distance between the elements of the input sequence that are used to compute one entry of the output sequence: dilation increases the receptive field in each layer, making it possible to model long temporal patterns. In sequence modelling, TCN can be viewed as sliding a 1D window over the sequence to predict a portion of the sequence with the length of the receptive field, using the chosen kernel. Such predictions are passed to the subsequent layers and the procedure continues until the receptive field reaches the size of the input sequence. The convolutions in TCN can be parallelized at training time, because the filter that extracts the features in each layer can be computed jointly. RNNs instead need to be unrolled during backpropagation and cannot be parallelized.
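The causal, dilated convolution described above can be sketched directly. This is a naive numpy version for clarity (real TCNs use optimized library primitives), and the kernel and dilation values are illustrative.

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """1-D causal, dilated convolution with left zero-padding.

    Output length equals input length, and y[t] depends only on
    x[t], x[t-d], x[t-2d], ... -- never on future samples.
    """
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        taps = xp[t : t + pad + 1 : dilation]   # k samples, spaced by dilation
        y[t] = np.dot(kernel[::-1], taps)
    return y

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated causal layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

x = np.zeros(16)
x[8] = 1.0                                      # unit impulse
y = causal_dilated_conv(x, np.array([1.0, 1.0]), dilation=4)
```

The impulse response appears only at the impulse time and `dilation` steps later, confirming causality; stacking layers with doubling dilations grows the receptive field exponentially with depth.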
We adopt a model composed of three 1D convolutional layers that scan the sequence in a causal fashion, with dilation = 1. Our model has a hidden size of 64 in the first layer and 256 in the second layer. The last layer has the dimension of the output, which is 1 in this case (because the model forecasts one step at a time). The model has a total of 29,761 parameters, roughly two orders of magnitude fewer than our LSTM.
3.3.4 Forecasting with Transformer Network
Our Transformer Network (TF) consists of a modular architecture with an encoder and a decoder that contain three types of blocks: attention modules (self-attention and encoder-decoder attention), a feed-forward fully-connected module, and residual connections (Figure 3(c)). The network's capability to capture sequence non-linearities lies mainly in the attention module. TF maintains the encoding output (memory) separate from the decoded sequence, which means that training is parallelizable. TF is also parallelizable because it has no internal state (unlike LSTMs), so interactions between input and output are direct and do not require recursion during backpropagation.
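The core attention computation can be written compactly. The sketch below is single-head scaled dot-product attention without masking or residuals, with random illustrative weights; it is not the configuration used in our models.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention (Vaswani et al., 2017).

    Every position of the sequence attends to every other position, so
    long-range dependencies are captured in one step rather than
    through a recurrent internal state.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # (seq, seq) affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True) # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(2)
seq_len, d = 6, 4
X = rng.standard_normal((seq_len, d))             # toy input sequence
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Because all positions are processed by one matrix product, the whole sequence can be handled in parallel, which is the parallelizability noted above.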
The model we adopt is based on the work of Giuliari et al. (2020). It has an embedding size of 128, 2 layers, and 4 attention heads, for a total of 663,938 parameters. We use an RMSE loss and train the network via backpropagation with the NoamOpt optimizer and a dropout value of 0.1 (Supplementary Material 9.1.3).
Although TF is often favored for its high performance, a weakness is its need for very large training datasets. This explains the poor performance of TF for our experiments. Thus, to improve TF we pre-trained it using a sine function as input, which allows the model to learn the oscillatory behavior of the experiments. We used a pre-training dataset of 1e7 rows describing a long set of sine waves with the same resolution as our lab data. The sine function roughly matches our lab data, with an amplitude equal to one, a frequency of 0.1 Hz, and a sample rate of 1000 Hz. With this approach, 1 or 2 epochs are enough to pre-train (Section 9.1.4). After pre-training, we fine-tune the model using training data for each experiment.
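Generating such a pre-training signal is straightforward. The amplitude, frequency, and sample rate below follow the text; the function name and the shortened length are for illustration only.

```python
import numpy as np

def sine_pretraining_data(n_rows, freq=0.1, fs=1000.0, amplitude=1.0):
    """Synthetic pre-training signal: a long sine wave whose amplitude,
    frequency and sample rate roughly match the lab stress data."""
    t = np.arange(n_rows) / fs
    return amplitude * np.sin(2 * np.pi * freq * t)

# A shorter draw for illustration; the paper's pre-training set is ~1e7 rows.
sig = sine_pretraining_data(n_rows=100_000)
```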
Shear stress and time to failure are reasonably well predicted by the LSTM+CNN architecture (Figure 3). The predictions match ground truth with R² > 0.9 (Table 1). Figure 5 shows model predictions and indicates that the long-term memory length is a key hyperparameter.
4.1.1 Best length for the past memory of LSTM
We investigate the LSTM long-term memory duration using both RMSE and R² to assess performance. The optimum length for n is about one seismic cycle (Figure S3); the red lines in Figure S3 mark the optimum n. For experiments p4581 and p5198 we adopt a lower resolution because of event periodicity and because we compute the variance every 0.1 seconds (see Section 3.1 and Table S1). Thus, one seismic cycle corresponds to about 100 data points, and the best observation length is n ≈ 100, or about 10 seconds. For p4679 we adopt a higher resolution, 0.003 s, to describe both fast and slow events. Here, one seismic cycle corresponds to about 1700 data points, which is near the optimum value of n (about 5 seconds) for predicting shear stress (Figure S3). The optimum value of n for predicting TTF is also shown in Figure S3.
We found high R² values for all three experiments using an observation length of about one seismic cycle. This suggests that performance saturates at one seismic cycle, which may be sufficient to characterize the signal. Beyond one cycle performance decreases, due to experiment aperiodicity and LSTM memory limitations.
LSTMs struggle to learn long-term trends. The presence of slow and fast events in p4679 requires higher resolution to calculate AE variance. Compared to sequences of quasiperiodic events, p4679 requires about a factor of 10 more data. Long seismic cycles with rapid stress drops represent challenging conditions for LSTM. The problem arises because of observation lengths and the fact that Forget gates tend to remove too much information.
4.1.2 Prediction dataset split
We train and validate with the first portion of the data and test with the remainder (Figure 2). Validation data were chosen randomly from the training portion and removed from the training set, yielding the final division into training, validation, and testing sets. Our data were normalized prior to training (Section 3.1).
4.1.3 Prediction results
Figure 5 and Table 1 summarize the prediction results. Black lines show measurements and the colored lines (green, yellow and red) are predictions for shear stress, TTsF and TTeF, respectively. Note that the predictions are quite accurate. The model predicts shear stress with R² > 0.9. TTF predictions are also accurate (R² up to 90%), although noisy in a few cases. We observe a general trend of better performance for TTeF than for TTsF, and better performance for experiments p5198 and p4679 than for p4581. Figure 2 shows that TTeF is maximum where shear stress is minimum and variance is maximum. The peak variance scales directly with stress drop amplitude, with smaller values of peak variance preceding labquakes with smaller stress drops. As a consequence of a smaller stress drop, the time to reach the critical failure stress is shorter at constant loading rate. Indeed, the maximum shear stress and the slope of the restrengthening phase are quite similar in all the seismic cycles. In contrast, the values of TTsF derived from the maximum shear stress show greater variability, possibly because of creep prior to the stress drop.
We used only the AE variance as model input to predict shear stress and TTF, and found the same result for data from each AE sensor. Our model performance for each experiment exceeds the state of the art (Table S2) as defined by existing works (Rouet-Leduc et al., 2017, 2018). While this suggests that we could further improve model performance by using more complex features, we did not pursue this direction and instead focused on new DL models.
4.2 Experiments on Forecasting
Autoregressive forecasting models provide a method to predict shear stress variations based on past observations. We test AR models using our three DNN architectures: LSTM, TCN and TF. We analyzed all experiments but focus here on p4679 because of its complex behavior and aperiodic seismic cycles. For this task, we decimated the shear stress data to dt = 0.1 s. Although this reduces the original data resolution, downsampling is necessary to limit the computational load and the size of the steps-in hyperparameter, which is particularly important for representing multiple seismic cycles. To establish the AR models we use n = 200 (20 seconds), a value that both describes the seismic cycles and fits GPU memory limitations.
Our data include many seismic cycles but this number is actually quite small for training DL models. Thus, we forecast using overlapping windows. To compare performance of our DL architectures (LSTM, TCN and TF) for each experiment, we set up the model to predict 100 data points (10 seconds), which corresponds to 1-2 seismic cycles depending on the experiment. See Figure 6 for a sketch of these overlapped sequences. Shear stress forecasting is possible, although prediction accuracy varied between the DL models.
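The overlapping-window construction can be sketched as follows; the stride and the synthetic stress record are illustrative, while the steps-in/steps-out lengths (200 and 100 points) follow the text.

```python
import numpy as np

def overlapping_windows(series, steps_in, steps_out, stride):
    """Build overlapping (input, target) pairs for AR forecasting.

    Overlap (stride < steps_in + steps_out) multiplies the number of
    training sequences extracted from a limited number of seismic cycles.
    """
    X, Y = [], []
    span = steps_in + steps_out
    for start in range(0, len(series) - span + 1, stride):
        X.append(series[start : start + steps_in])
        Y.append(series[start + steps_in : start + span])
    return np.array(X), np.array(Y)

stress = np.sin(np.linspace(0, 20 * np.pi, 3000))   # stand-in stress record
X, Y = overlapping_windows(stress, steps_in=200, steps_out=100, stride=50)
```

Each row of `X` is a 200-point history and the matching row of `Y` the following 100-point forecast target; consecutive rows share most of their samples, which is the overlap that enlarges the training set.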
4.2.1 Forecasting dataset split
Our AR work used the same normalization as the prediction model, described above. The training set consists of the first 70% of the data and the testing set of the last 20%; model validation was done with the remaining 10%. Model performance was evaluated for future predictions in each window (Figure 6), and we also average among windows to measure overall performance (Table 2). Different data segments result in different performances; thus, in Figure 6 we show predictions for a range of different starting locations within the data stream. Note that the AR models perform well for a range of starting times and data segments.
4.2.2 LSTM forecasting results
LSTM models generally produced the poorest predictions of future stress states (Figure 7). Our AR forecast results are reasonable only for p5198, which has the least complex seismic cycles. For the other experiments LSTM did not provide accurate predictions. While LSTM is generally good for time series forecasting, because of its flexibility and optimization, it has trouble with data from our experiments because of the data density, even after significant decimation (dt = 0.1 s), and the seismic cycle length (of order 10 s). LSTM works better with fewer data points: for example, Giuliari et al. (2020) use 8 data points as input and 12 as output, whereas we have 200 in input and 100 in output. In essence, for our experiments the LSTM network "forgets" the past too quickly and does not accurately predict the future signal.
4.2.3 Temporal Convolution Network forecasting results
While TCN is the simplest of our models, it produced some of the best AR results. The TCN models provide the best compromise between training complexity and the number of parameters required: TCN needs only one tenth of the parameters required by LSTM and TF. Table 2 and Figure 6 show the TCN's forecasting capability. Results for experiment p5198 are quite good, as are those for the more complex case of p4679. For p4581, TCN is the best of the tested models based on average forecast accuracy. However, this experiment proved challenging for all of the AR models, as none of the forecasts were very good (Figure 6), possibly due to the lack of appreciable preseismic creep in this experiment.
4.2.4 Transformer Network forecasting results
TF is the second best model, after TCN. It is not the best on average for any experiment; however, its flexible attention mechanism captures higher-order variations in the data. For the same reason, it does not forecast the average behaviour of the signal as well as a simpler model like TCN does. TF is the most complex of the tested models to optimize and train, and it is also the most flexible because of its dense connections. It requires a large amount of data to start working, and for this reason we designed a pretraining stage in which we fed the model sine waves with a frequency similar to our data, as described in Section 3.3.4.
In several segments of p4581, TF is capable of detecting irregularities in the seismic cycle (Figure 6). However, in other places TF tries to predict irregularities that are noise rather than signal; one example occurs close to the local maxima in Figure 6 for p4581. Noise features like this caused trouble for TF. On the other hand, TF did well in forecasting the aperiodicity of p4679 (Figure 6). For p5198, TF also did well, possibly because the behavior is quite periodic. This suggests that data overfitting during training may be a problem. We note that the improvement obtained from simple pretraining with a periodic signal points to an interesting future direction.
4.2.5 Forecasting results: model comparison and analysis
A key comparison among our AR models is how they perform with respect to the longer-term temporal variations of the seismic cycle. Thus, we evaluate results for several different data segments within the test set (Figure 7); note that the performance values in Table 2 are averaged over the entire test set. In Figure 7 we plot results for windows starting at several positions within the data. All of the models show high variability depending on the tested window; hence, a model's ability to predict the signal depends on the shape and degree of irregularity of the input-output window. As expected from our summary of general results, LSTM produced the worst AR forecasts. The other models are in general able to forecast the signal with quite low variability. We note also that for p4581 performance decreases with time in some areas simply because the signal becomes more challenging, for example where there are irregularities before peak stresses, as seen on the left in Figure 7.
An important question for AR forecasting is how far into the future these models can predict and, in particular, whether they can predict beyond a single seismic cycle. Figure 8 summarizes such testing. Here we plot the average of all performance values for each forecast time in the future. Our models are trained to forecast 100 data points into the future (white section); however, we extended them to forecast 200 data points into the future (gray section). LSTM simply gets worse as the forecast time is extended. TCN decreases in performance slowly, whereas TF shows a peculiar behaviour: for forecast times from 0 to 100 points, TF is similar to TCN, but beyond that it suddenly becomes worse. This is because in the training phase TF forecasts from 0 to 100 points, and the behavior is somewhat different in the next seismic cycle. This is perhaps surprising given that TF is renowned for generalization; here, it is possibly the result of data complexity. This is clearly an important question for future work.
The first part of our work was devoted to evaluating DL models based on a combination of LSTM and CNN. Compared with the existing state of the art, our DL models perform well. As suggested by previous works, we used the variance of the continuous acoustic emissions emanating from the laboratory faults. The DL models work reasonably well for all tested experiments; however, results are better when creep occurs before the mainshock (as in p5198 and p4679). This is perhaps not surprising, given that creep represents fault slip, which can produce microfracture and breakage of frictional asperity contacts. Moreover, the shear stress curves reflect creep and AE via the gradual reduction in the rate of stress increase prior to reaching a peak at the onset of a labquake (Figure 2). In the absence of creep, such as in p4581 (Figure S2), the acoustic variance is initially very low and increases suddenly at the onset of failure. As a result, the DL models struggle to identify the exact time of the drop in shear stress. This appears to be part of the reason why the Time To Start of Failure (TTsF) is always more challenging to predict than the Time To End of Failure (TTeF): the former is zero when the variance is just beginning to grow, while the latter is zero just after the variance has increased and then decreased.
A limitation of TTF is that it represents the time remaining before just one event. Moreover, this quantity is not recorded directly during the laboratory experiment; therefore, we manually labelled a suitable dataset before the model training phase. The autoregressive models solve this problem nicely because they are designed to forecast.
Our autoregressive forecasting procedure allows one to predict future values of shear stress during the lab seismic cycle. To the best of our knowledge, this is the first time such an AR approach has been used for lab earthquake data. Here, the innovation is that we can forecast indefinitely into the future without the need for labelling, because we use the same feature (shear stress) for both model input and output.
We tested several DL networks for the AR forecasting, and all of the models work at some level. The LSTM network produced the poorest results. LSTM is one of the most common models for time series forecasting because of its flexibility, but in our application it suffers from memory problems due to the considerable length of the sequences. TCN is the simplest model in terms of parameter count and structure, and it performed better, perhaps because the target (shear stress) is somewhat periodic during the lab seismic cycle. TF is also flexible, but it is the most complex to optimize and requires a lot of data to produce a reasonable model because of the huge number of parameters to be trained. We addressed that problem by pretraining the TF models with sine waves, although that adds another step to the processing.
Of the experiments we evaluated, p5198 was fit best by the DL models. Experiment p4679 was challenging because of its aperiodicity and the presence of both slow and fast events; however, our TCN and TF models were able to forecast it reasonably well. The forecasts for p4581 were more challenging, perhaps because of the lack of appreciable fault creep and/or because of noise in the data. Further work is needed to resolve these issues.
One good way to advance our work would be to apply the AR method to real fault data. This would require measurements of shear stress, which are not generally available; however, it is possible that one could use seismograms directly, prior to and during an earthquake. The idea here would be to use the AR prediction model to infer the shear stress from the seismogram, which would allow forecasts of the shear stress for future time steps. Another approach would be to simply predict the temporal evolution of the seismogram based on the initial part of the signal. A key question here is whether lab AE signals are uniquely relatable to fault shear zone stress and, if so, whether that relation would transfer to tectonic faults. Yet another approach would be transfer learning from lab earthquakes to real fault data.
We used deep neural networks to predict and forecast laboratory earthquakes and lab measurements of fault zone shear stress based on seismic signals emanating from lab faults. Previous work showed that, using the variance of lab seismic signals (from fault zone acoustic emissions), it is possible to predict fault shear stress and the time to failure. We systematically tested a range of DL models on a variety of lab faults and found that our models significantly outperform the state of the art. Moreover, we showed that it is possible to forecast future fault zone shear stress based on the previous history of stress values. This result has significant potential because shear stress is representative of the state of the fault, and forecasting its evolution implicitly forecasts the timing of future failure states. To our knowledge, this is the first application of a forecasting procedure that autoregressively infers future shear stress from past values of the stress itself.
Data and resources
The code is public and available at:
L.L., E.T., and C.M. were supported by European Research Council Advanced Grant 835012 (TECTONIC). C.M. acknowledges support from US Department of Energy grants DE-SC0020512 and DE-EE0008763.
CRediT authorship contribution statement
Laura Laurenti: Conceptualization, Data curation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing. Elisa Tinti: Conceptualization, Investigation, Writing – original draft, Writing – review & editing. Fabio Galasso: Methods, Data curation, Validation, Supervision, Writing – review & editing. Luca Franco: Methods, Validation, Writing – review & editing. Chris Marone: Conceptualization, Data collection, Investigation, Methods, Supervision, Resources, Validation, Software, Writing – review & editing. All authors approved the submitted article.
- Can precursory moment release scale with earthquake magnitude? A view from the laboratory. Geophysical Research Letters 46 (22), pp. 12927–12937.
- Global growth of earthquake early warning. Science 375 (6582), pp. 717–718.
- An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR abs/1803.01271.
- Accelerated seismic release and related aspects of seismicity patterns on earthquake faults. Pure and Applied Geophysics 159, pp. 2385–2412.
- Machine learning and earthquake forecasting—next steps. Nature Communications 12.
- Characterizing acoustic signals and searching for precursors during the laboratory seismic cycle using unsupervised machine learning. Seismological Research Letters 90 (3), pp. 1088–1098.
- Acoustic energy release during the laboratory seismic cycle: insights on laboratory earthquake precursors and prediction. Journal of Geophysical Research: Solid Earth 125 (8), e2019JB018975.
- Frequency-magnitude statistics of laboratory foreshocks vary with shear velocity, fault slip rate, and shear stress. Journal of Geophysical Research: Solid Earth 126 (11), e2021JB022175.
- Predicting off-fault deformation from experimental strike-slip fault images using convolutional neural networks. Geophysical Research Letters 49 (2), e2021GL096854.
- Machine learning can predict the timing and size of analog earthquakes. Geophysical Research Letters 46 (3), pp. 1303–1311.
- CNNs found to jump around more skillfully than RNNs: compositional generalization in seq2seq convolutional networks.
- Preseismic fault slip and earthquake prediction. Journal of Geophysical Research: Solid Earth 83 (B8), pp. 3940–3948.
- Seismic and aseismic preparatory processes before large stick–slip failure. Pure and Applied Geophysics 177, pp. 5741–5760.
- Transformer networks for trajectory forecasting.
- What allows seismic events to grow big? Insights from b-value and fault roughness analysis in laboratory stick-slip experiments. Geology 45, pp. 815–818.
- Quantifying exposure bias for neural language generation. arXiv abs/1905.10617.
- Precursors to the shear failure of rock discontinuities. Geophysical Research Letters 41.
- Long short-term memory. Neural Computation 9, pp. 1735–1780.
- An exponential build-up in seismic energy suggests a months-long nucleation of slow slip in Cascadia. Nature Communications 11, pp. 4139.
- Similarity of fast and slow earthquakes illuminated by machine learning. Nature Geoscience 12, pp. 69–74.
- Attention network forecasts time-to-failure in laboratory shear experiments.
- Laboratory earthquake forecasting: a machine learning competition. Proceedings of the National Academy of Sciences 118, e2011362118.
- Acoustic emission and microslip precursors to stick–slip failure in sheared granular material. Geophysical Research Letters 40.
- The effect of shear load on frictional healing in simulated fault gouge. Geophysical Research Letters 25, pp. 4561–4564.
- Earthquake early warning ShakeAlert 2.0: public rollout. Seismological Research Letters 91.
- Machine learning in seismology: turning data into insights. Seismological Research Letters 90 (1), pp. 3–14.
- ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems 25.
- Seismic moment tensor and b-value variations over successive seismic cycles in laboratory stick-slip experiments. Geophysical Research Letters 41, pp. 5838–5846.
- Characterization of nucleation during laboratory earthquakes. Geophysical Research Letters 40 (19), pp. 5064–5069.
- Laboratory observations of slow earthquakes and the spectrum of tectonic fault slip modes. Nature Communications 7, pp. 11104.
- Earthquake catalog-based machine learning identification of laboratory fault states and the effects of magnitude of completeness. Geophysical Research Letters 45.
- Machine learning bridges microslips and slip avalanches of sheared granular gouge. Earth and Space Science Open Archive, pp. 13.
- A reinterpretation of the precursory seismic b-value anomaly from fracture mechanics. Geophysical Journal International 96, pp. 131–138.
- Temporal variations in seismic event rate and b-values from stress-corrosion constitutive laws. Tectonophysics 211 (1–4), pp. 233–246.
- Influence of grain characteristics on the friction of granular shear zones. Journal of Geophysical Research 107, pp. 2219.
- The mixology of precursory strain partitioning approaching brittle failure in rocks. Geophysical Journal International 221, pp. 1856–1872.
- Pairwise association of seismic arrivals with convolutional neural networks. Seismological Research Letters 90.
- Preslip and cascade processes initiating laboratory stick slip. Journal of Geophysical Research: Solid Earth 119 (8), pp. 6323–6336.
- Bifurcations at the stability transition of earthquake faulting. Geophysical Research Letters 47 (19), e2020GL087985.
- Earthquake transformer—an attentive deep-learning model for simultaneous earthquake detection and phase picking. Nature Communications 11, pp. 3952.
- Monitoring frictional strength with acoustic wave transmission. Geophysical Research Letters 35.
- Influence of fault strength on precursory processes during laboratory earthquakes. In Fault Zone Dynamic Processes, pp. 229–242.
- New opportunities to study earthquake precursors. Seismological Research Letters 91 (5), pp. 2444–2447.
- Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378, pp. 686–707.
- Evolution of b-value during the seismic cycle: insights from laboratory experiments on simulated faults. Earth and Planetary Science Letters 482, pp. 407–413.
- Generalized seismic phase detection with deep learning. Bulletin of the Seismological Society of America 108 (5A), pp. 2894–2901.
- Estimating fault friction from seismic signals in the laboratory. Geophysical Research Letters 45 (3), pp. 1321–1329.
- Machine learning predicts laboratory earthquakes. Geophysical Research Letters 44 (18), pp. 9276–9282.
- The frequency-magnitude relation of microfracturing in rock and its relation to earthquakes. 58.
- On the stress dependence of the earthquake b-value. Geophysical Research Letters 42.
- Quantifying damage, saturation and anisotropy in cracked rocks by inverting elastic wave velocities. Pure and Applied Geophysics 163, pp. 947–973.
- Frictional stability and earthquake triggering during fluid pressure stimulation of an experimental fault. Earth and Planetary Science Letters 477, pp. 84–96.
- Precursory changes in seismic velocity for the spectrum of earthquake failure modes. Nature Geoscience 9.
- Deep learning can predict laboratory quakes from active source seismic data. Geophysical Research Letters 48 (12), e2021GL093187.
- Machine learning predicts the timing and shear stress evolution of lab earthquakes using active seismic monitoring of fault zone processes. Earth and Space Science Open Archive, pp. 48.
- Competition between preslip and deviatoric stress modulates precursors for laboratory earthquakes. Earth and Planetary Science Letters 553.
- Observations of premonitory acoustic emission and slip nucleation during a stick slip experiment in smooth faulted Westerly granite. Geophysical Research Letters 32, pp. 10304.
- Slip heterogeneity and directivity of the ML 6.0, 2016, Amatrice earthquake estimated with rapid finite-fault inversion: rupture process of the 2016 Amatrice event. Geophysical Research Letters 43.
- The spatiotemporal evolution of granular microslip precursors to laboratory earthquakes. Geophysical Research Letters 47.
- Attention is all you need.
- Predicting fault slip via transfer learning. Nature Communications 12 (1).
| |Exp p4581|Exp p5198|Exp p4679|
|Time To Start Of Failures|0.6317|0.9313|0.8229|
|Time To End Of Failures|0.8721|0.9697|0.9200|
9 Supplementary material
9.1 Adopted Deep Neural Network architectures
We formulate sequence modelling as a regression task, i.e., we learn to minimize the Euclidean distance between the true and the predicted values. Specifically, the sequence is a time series and the model needs to respect causality: in order to predict the output at some time t, we can only use those inputs observed at times before t. The prediction problem (discussed in 3.2) is supervised because, given the input to the model, we know the output it should produce; we can therefore compute gradients and train the network to produce the right output. The forecasting explored in 3.3 is also a regression problem, but it is learned in a self-supervised way, since we aim to predict the future from the past: the model must learn data representations that solve the task.
Here we briefly present the theory behind the architectures adopted in this work.
9.1.1 Convolutional Neural Network
A convolutional neural network (CNN) is a class of deep, supervised models introduced by LeCun et al. in 1998 for processing data with a grid-like topology (e.g., images); this first CNN was applied to digit recognition using the MNIST dataset. CNNs became the dominant class of deep learning methods after the ImageNet image-recognition competition (Krizhevsky et al., 2012) and have since become popular in a wide variety of applications. A typical 2D convolutional network consists of a stack of layers, each containing several filters for detecting various features in the input image. The model performs convolutions with the chosen kernels and applies activation functions (i.e., strong activation when a feature is matched), producing the so-called feature maps.
With a 1D-CNN we can do the same for a one-dimensional input, e.g., a time series: the input is a 1-D array, and convolution with 1-D kernels extracts features, as in the 2D case. This is indeed the main difference between 1D and 2D CNNs: 1-D arrays replace 2-D matrices for both kernels and feature maps. The result is a lower computational complexity: O(k·n·d²) for 1D CNNs, compared with O(k²·n²·d²) for 2D CNNs, where k is the kernel size of the convolution, d is the representation (embedding) dimension, and n is the sequence length.
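As a minimal illustration of the 1-D case, a single-channel convolution reduces to a sliding dot product (pure-Python sketch; the kernel values are illustrative, not learned weights from the paper's models):

```python
# Minimal single-channel 1-D convolution ("valid" mode, no padding), the
# basic operation behind 1D CNNs and TCNs; kernel values are illustrative.
def conv1d(x, kernel):
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

signal = [0, 0, 1, 1, 1, 0, 0]   # toy step-like series
kernel = [1, -1]                 # first-difference kernel
print(conv1d(signal, kernel))    # → [0, -1, 0, 0, 1, 0]
```

The output is nonzero only where the signal changes, which is how small learned kernels come to act as local feature detectors.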
9.1.2 Long Short-Term Memory network
In the past years, the Long Short-Term Memory network (LSTM; Hochreiter and Schmidhuber, 1997) has been successfully applied to a number of sequence modelling tasks, e.g., speech recognition, language modeling and translation, image captioning, and trajectory forecasting. In this work we apply it to geophysical problems.
LSTM is a type of Recurrent Neural Network (RNN). RNNs are deep learning models that iteratively combine past information with the present to make it persist. Indeed, they have an "internal state" (hidden state) that can be seen as the memory: it is updated as the sequence is processed, by applying a recurrence formula at every time step that combines the past information with the current input. Figure S4 (left) shows the RNN with its loop unrolled; Figure S4 (right) shows one iteration of the RNN:

h_t = f_W(h_{t-1}, x_t),

where f_W is the non-linear function, W are the parameters, h_t and h_{t-1} are the hidden states at times t and t-1, and x_t is the input at time t.
For many tasks, LSTM works much better than the standard RNN. It was introduced by Hochreiter and Schmidhuber (1997) and was improved and popularized in many subsequent works.
The main problem of standard RNNs is the difficulty of accessing information from many steps back. LSTM is explicitly designed to avoid this long-term dependency problem: it is capable of learning long-term dependencies thanks to internal mechanisms, called gates, that regulate the flow of information. These gates learn which data in a sequence are important to keep or throw away. Another important feature of the LSTM is the cell state, which improves both the forward pass (a direct connection with past elements) and the backward pass (easier backpropagation that avoids the vanishing-gradient problem common to other RNNs).
Here there are two internal states, c_t and h_t, that proceed in parallel and represent the long- and short-term memory, respectively. There is a complex mechanism to manage memory, using four gates:
Input gate (i): whether to write to cell
Forget gate (f): whether to erase cell
Output gate (o): how much to reveal cell
Gate gate (g): how much to write to cell
Figure 3(a) shows one iteration of the LSTM. Details are provided by Equation 1:

i_t = σ(W_i x_t + U_i h_{t-1})
f_t = σ(W_f x_t + U_f h_{t-1})
o_t = σ(W_o x_t + U_o h_{t-1})
g_t = tanh(W_g x_t + U_g h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)     (1)

where σ and tanh are the non-linear functions, the W and U matrices are the parameters, h_t and h_{t-1} are the hidden states at times t and t-1, c_t and c_{t-1} are the cell states at times t and t-1, and x_t is the input at time t. The forget gate f_t determines how much of the previous cell information c_{t-1} (the memory of the system) should be forgotten; once what to forget has been decided, the input gate i_t modulates how much of the current input x_t to memorize.
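A single LSTM iteration can be sketched in pure Python as follows (scalar states for readability; the weight names are illustrative and bias terms are omitted relative to the standard formulation):

```python
import math

# One iteration of the LSTM cell following the gate equations above.
# Scalar states; weights illustrative; bias terms omitted for brevity.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    i = sigmoid(W["i_x"] * x + W["i_h"] * h_prev)    # input gate
    f = sigmoid(W["f_x"] * x + W["f_h"] * h_prev)    # forget gate
    o = sigmoid(W["o_x"] * x + W["o_h"] * h_prev)    # output gate
    g = math.tanh(W["g_x"] * x + W["g_h"] * h_prev)  # candidate cell update
    c = f * c_prev + i * g      # long-term memory (cell state)
    h = o * math.tanh(c)        # short-term memory (hidden state)
    return h, c
```

The cell state c carries information forward additively, which is why gradients survive many more steps than in a plain RNN.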
To better understand the behavior of the memory, assume we are at time t; the LSTM then explicitly considers all the information from time t-k to t, i.e., x_{t-k}, …, x_t. The best length k of the long-term memory is not known a priori; we analyse it further in the dedicated Section 4.1.1.
Here the computational complexity is O(n·d²), where d is the representation (embedding) dimension and n is the sequence length.
9.1.3 Transformer Network
This model was introduced in 2017 by Vaswani et al. (2017), originally for natural language processing, but nowadays it is successfully used in a variety of sequence modeling tasks (e.g., video, audio, and so on). It has an encoder-decoder structure: the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n), and the decoder uses z to generate an output sequence (y_1, …, y_m) of symbols autoregressively. The Transformer network (TF) is implicitly autoregressive, in that the predicted output is fed back as input at the next step ("auto" means that it feeds on its own predictions). In particular, to let the transformer handle the input, the input is embedded into a higher, D′-dimensional space using a linear projection with a matrix of weights. In the same way, the output is a D″-dimensional vector prediction, which is projected back to the original 1-D space.
Unlike RNNs, which receive one input at a time, TF receives all inputs in one shot. The TF uses a "positional encoding" to encode the time of each past and future instant t. Positional encoding is necessary to give an ordered context to the non-recurrent multi-head attention architecture, because without it the model is permutation invariant. Sine/cosine functions are used to define the positional encoding vector; that is, we represent time in a sine/cosine basis.
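The sine/cosine positional encoding can be sketched as follows (following the standard formulation of Vaswani et al., 2017; the function name is ours):

```python
import math

# Sine/cosine positional encoding: even dimensions use sin, odd use cos,
# with geometrically spaced wavelengths (standard transformer formulation).
def positional_encoding(pos, d_model):
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

print(positional_encoding(0, 4))  # → [0.0, 1.0, 0.0, 1.0]
```

Each position receives a distinct vector, and nearby positions receive similar vectors, which lets attention reason about order and relative distance.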
The Transformer has three fundamental modules: attention, fully connected layers, and residual connections. There are two attention modules: self-attention and encoder-decoder attention. The encoder (Figure 3(c), left) has six identical layers, where each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. All the sub-layer outputs have dimension d_model. The decoder (Figure 3(c), right) also has six identical layers, with a third sub-layer in addition to the two already described for the encoder: it performs multi-head attention over the output of the encoder stack. The decoder uses both self-attention and encoder-decoder attention, but its self-attention sub-layer uses a masking mechanism to prevent positions from attending to subsequent positions. This ensures that the prediction for position i can depend only on the known outputs at positions before i. To start forecasting, the decoder uses a special token that indicates the start of the sequence, shown as <S> in Figure 3(c). The "Add & Norm" layers in Figure 3(c) refer to residual connections (which sum each layer's output with its input to avoid the vanishing-gradient problem) and layer normalization.
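The decoder's masking mechanism amounts to a lower-triangular look-ahead mask (a minimal sketch; the function name is ours):

```python
# Causal (look-ahead) mask used in decoder self-attention: position i may
# attend only to positions j <= i (1 = visible, 0 = masked out).
def causal_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

print(causal_mask(3))  # → [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```

In practice the masked positions have their attention scores set to negative infinity before the softmax, so they receive zero weight.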
The attention function maps a query and a set of key-value pairs to an output, where the queries Q (dimension n×d_k, with n the number of elements in the sequence and d_k the latent dimension), keys K (dimension n×d_k), values V (dimension n×d_v), and outputs are all vectors. Q relates to what we encode (it can be the output of an encoder or decoder layer); K relates to what we use as input for the output; V relates to the input and is a learned vector. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
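Scaled dot-product attention for a single head can be written out directly as the weighted sum just described (pure-Python sketch; Q, K, V are lists of d-dimensional vectors):

```python
import math

# Scaled dot-product attention for one head: softmax(Q·K^T / sqrt(d)) · V.
def softmax(scores):
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:
        # compatibility of the query with each key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        # weighted sum of the values
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With identical keys the weights are uniform, so the output is simply the mean of the values; training shapes the keys so that relevant positions dominate the sum.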
Instead of performing a single attention function, the model linearly projects the queries, keys, and values h times with different learned linear projections to d_k, d_k, and d_v dimensions, respectively, and then performs the attention function in parallel for each projected query, key, and value. This allows the model to jointly attend to information from different representation subspaces at different positions: these are the heads, and we need more than one because each captures a specific characteristic of the features.
For TF the computational complexity is O(n²·d), where d is the representation (embedding) dimension and n is the sequence length.
9.1.4 Transformer Network not pretrained forecasting results
As explained in 4.2.4, TF is good at learning aperiodicity and singularities; however, the common feature of all the experiments is the oscillatory behaviour of the signal. Without pretraining on the sine wave, TF cannot properly predict the target.
Moreover, TF is the most complex of the tested models to optimize and train, and it requires a lot of data and computing to start working. Our datasets are quite small and are not sufficient to train the model properly. As shown in Table S3, the results of TF without pretraining are always worse than those of the pretrained TF. Example windows for the three experiments are shown in Figure S5.
9.2 Supplementary tables
| |Exp p4581|Exp p5198|Exp p4679|
|Time To Start Of Failures (LSTM+CNN)|0.6317|0.9313|0.8229|
|Time To End Of Failures (LSTM+CNN)|0.8721|0.9697|0.9200|
|TF not pretrained|0.3410|0.6376|0.6061|