Infrequent adverse event prediction in low carbon energy production using machine learning

01/19/2020 ∙ by Stefano Coniglio, et al.

Machine learning is one of the fastest growing fields in academia. Many industries are aiming to incorporate machine learning tools into their day-to-day operation. However, the keystone of doing so is recognising when you have a problem which can be solved using machine learning. Adverse event prediction is one such problem. There is a wide range of methods for the production of sustainable energy, in many of which adverse events can occur that impede energy production and even damage equipment. The two examples of adverse event prediction in sustainable energy production we examine in this paper are foam formation in anaerobic digestion and condenser fouling in steam turbines as used in nuclear power stations. In this paper we propose a framework for formalising a classification problem based around adverse event prediction, building predictive maintenance models capable of predicting these events before they occur, and testing the reliability of these models.

1 Introduction

In this paper we consider the problem of predicting adverse events which occur infrequently in low carbon power generation. Our aim is to present a framework for building predictive maintenance models using machine learning techniques capable of predicting when such events are likely to occur. “Predictive maintenance is a condition-driven preventative maintenance program.” [mobley2002introduction] In our case this means building models which take operating conditions as an input and provide an indication of whether the adverse event is likely to occur as an output. The energy production methods we focus on are bio-energy and civil nuclear power production, aiming to predict the adverse events of foaming and condenser fouling respectively. This work is done in collaboration with DAS Ltd, a consultancy group based in Bristol, UK with expertise in engineering and energy.

1.1 Anaerobic Digestion

In the context of bio-energy we consider the production of biogas via anaerobic digestion (AD), in which feed sludge is fed into a digester where it is broken down by micro-organisms to release biogas, which is then collected from the top of the digester and burned to produce energy. This feed sludge consists of organic matter such as food waste, crop feed or agricultural waste. Under normal operation gas bubbles rise from the feed sludge as it is digested, then collapse, releasing biogas. This biogas is then burned in order to power a turbine, producing electricity. Under certain conditions these gas bubbles may take longer to collapse than it takes for new bubbles to form, resulting in the formation of a foam. Foaming is a considerable concern in the use of AD for biogas production, as it can block the gas outlet, resulting in the digester having to be shut down for cleaning. This cleaning can take days. During this time the digester is not producing biogas, which has a considerable impact on energy output. In extreme cases where the gas outlet and pressure release valve become blocked, pressure can build until the roof is blown off the digester. [ganidi2009anaerobic]

Foaming can be treated using an anti-foaming agent which, upon introduction into the digester, reduces the surface tension of the bubbles in the digester, allowing them to collapse more easily. This, however, requires foaming to be detected before it can cause serious problems. In this work, our aim is to create a predictive maintenance model able to reliably predict foaming with enough warning for the plant operators to administer anti-foaming agent and subdue the foaming before it can damage the digester.

Foaming in AD is a well researched area; however, many research papers approach foaming from a chemical analysis or engineering perspective [kanu2015understanding] [kanu2018biological]. There have been some attempts to model foaming using data science methods. Most models for predicting foaming are knowledge based systems (KBSs), meaning they require in-depth knowledge of the specific digester in question, and often also make use of knowledge of the chemical composition of the feed stock, also known as feed sludge characteristics. Even models which take advantage of all of this additional information are unable to predict foaming before it occurs: “… the usefulness for prediction is limited: it only provides a warning when the problem is already in a developed stage.” [dalmau2010model]. Machine learning models have been used with some success for state estimation in AD [gaida2012state], such as using neural networks to predict methane production [Fernandes2014ANNforAD]. Machine learning models have also been used to try to gain insight into which variables could be the best indicators of foaming [dalmau2010selecting]. Utilising feed sludge characteristics has also proven to be more effective than simply analysing operating characteristics; however, it is not always feasible to monitor feed sludge characteristics. Andigestion, a client of DAS Ltd, has provided us with hourly readings of various operating characteristics over a two year period in which there were multiple foaming events. We do not have sufficient insight into the configuration and running of the digesters to implement one of the more popular KBSs, we do not have the feed stock characteristics required to emulate some of the other works cited above, and we also wish to predict foaming before it occurs. These constraints have required us to develop our own method for predicting foaming in AD.

A popular method for avoiding foaming is to regularly inject anti-foaming agent into the digester. This prevents foam from building up but can be very expensive. A model capable of predicting when foaming is likely to occur would enable plant operators to only inject anti-foaming agent during these high risk periods, resulting in a significant reduction in the amount of anti-foaming agent required to avoid foaming and thus a considerable cost reduction.

1.2 Condenser Fouling

In the context of civil nuclear power generation, we consider the operation of the steam turbine to convert thermal energy from steam to electrical power. We define load as the amount of power being generated. In the context of a plant this means the amount of power being generated by all of the plant’s steam turbines, referred to as the set. In the context of a single steam turbine it is simply how much power it is generating. In normal operation the steam generated in the boilers passes through the turbine rotors, causing them to rotate and generate power. A turbine is on-load if it is rotating and hence producing power; similarly, it is off-load if it is not rotating and hence not producing power. The steam is then passed through the condenser, where it is cooled and returned to the liquid water phase to be recirculated. The condenser consists of thousands of titanium tubes containing sea water, which acts as the primary coolant. Under normal operating conditions the sea water circuit remains isolated from the steam circuit; however, it is possible for a leak to form in one of the condenser tubes, causing sea water to contaminate the steam circuit. In the event of a leak in one of the tubes in the condenser the steam turbine is automatically tripped, shutting it down, and remedial action is taken to fix the leak off-load. This is extremely costly, as an unplanned trip means electricity has to be generated or acquired elsewhere at a greatly inflated price.

If warning of a tube leak can be provided prior to its advent, the set can be reduced in power and the turbine in question can be isolated and repaired on-load. The set can then be brought back to full power. This ability to provide warning that the load will be reducing means a great reduction in the cost for compensating that loss of power. This warning also enables plant operators to fix the leak much more quickly and efficiently whilst remaining on-load, resulting in much less lost generation. Typically it takes 2-3 days to completely repair a condenser tube leak. It is estimated that this predictive functionality would yield a cost saving of £700,000 per event.

1.3 Contributions of the Paper and Outline

This work has been conducted in collaboration with DAS Ltd. The energy departments at DAS Ltd provided the data sets that we study in this paper.

In this paper we will present a framework for building and testing predictive maintenance models to detect infrequent adverse events from time series data. We will outline the utilised data preparation techniques, formulate adverse event prediction into a classification problem and explain the methods we will use for analyzing and comparing the classifiers used to create predictive maintenance models.

Section 2 will outline the framework used to build and test predictive maintenance models. Section 3 will show the framework described in Section 2 in use on both our AD data set and our nuclear energy production data set. It will also show the predictive maintenance models which are advisable for use on each of the two data sets as well as which additional machine learning techniques best optimise their performance.

2 Formalising The Classification Problem

2.1 Problem Description

In this section we aim to formalise the problem of adverse event prediction from time series data such that this problem can be solved using machine learning algorithms. We begin by considering the data we are provided with: we are given a set of time series data $X$, each data point $x_t$ consisting of observations of $n$ variables at time $t$, and a set of labels $Y$, each label $y_t$ denoting if the event is occurring at time $t$, as shown in the following equation:

$$X = \{x_t \in \mathbb{R}^n : t \in T\}, \qquad Y = \{y_t \in \{0, 1\} : t \in T\} \tag{1}$$

The data and labels are recorded over a time period and each data point or label is associated with a time index $t$ denoting how many time intervals into the recording the data point or label was recorded. For example, with a time interval of one hour the data point $x_{1000}$ was observed 1000 hours into the recording. The set of all time indexes is denoted by $T$. We employ a common trick in time series forecasting often referred to as leading or lag inclusion. We define $\hat{X}$ as

$$\hat{X} = \{\hat{x}_t = (x_t, x_{t-1}, \dots, x_{t-l}) : t \in T\} \tag{2}$$

where $l$ is the number of lags we wish to include. We will refer to each $\hat{x}_t$ as a pattern. The problem of adverse event prediction from time series data can now be formalised as finding a function $f$ which for a given pattern $\hat{x}_t$ provides the associated label $y_t$. This problem is defined as follows:

$$f : \hat{X} \to \{0, 1\}, \qquad f(\hat{x}_t) = y_t \quad \forall t \in T \tag{3}$$

However, solving this problem would result in a tool capable of predicting the event only when it is already occurring. Such a tool is of little use, as once an adverse event is occurring it is often too late to prevent or minimise the damage it will cause. Our aim in this paper is to predict adverse events before they occur, with enough warning that maintenance can be carried out to prevent or minimise the damage caused by the event. To do this we will create a new set of labels $\hat{Y}$, which we define by

$$\hat{Y} = \{\hat{y}_t : t \in T\}, \qquad \hat{y}_t = \max\{y_t, y_{t+1}, \dots, y_{t+a}\} \tag{4}$$

where $\hat{y}_t$ equals $1$ if the event occurs within the time period $t$ to $t + a$ and equals $0$ otherwise. We will refer to $a$ as the event association. Selecting an appropriate event association is a compromise between being sufficiently large to allow time for the required maintenance and small enough to still be closely associated to the event. Our new problem of predicting an adverse event before it occurs from time series data is defined:

$$f : \hat{X} \to \{0, 1\}, \qquad f(\hat{x}_t) = \hat{y}_t \quad \forall t \in T \tag{5}$$

Intuitively we aim to find a function which, given a pattern $\hat{x}_t$ comprised of $x_t$ combined with the previous $l$ data points, can provide a label indicating if an event will occur within the following $a$ time intervals.

This is a binary classification problem. In classification we assume there is a labeling function $f : \hat{X} \to \{0, 1\}$ which correctly maps each data point to its label. Our learning algorithms are required to output a prediction rule $h : \hat{X} \to \{0, 1\}$. This function is also commonly referred to as a predictor, a hypothesis or a classifier. The goal of our learning algorithms is to find an $h$ such that the probability of $h(\hat{x}_t) \neq f(\hat{x}_t)$ is as low as possible [shalev2014understanding]. In the next section we will outline the methodology we will use in our experiments to optimise and validate the accuracy of this prediction rule.
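The construction of lagged patterns and forward-looking labels described in this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `make_patterns` and its arguments `n_lags` (the number of lags) and `assoc` (the event association) are hypothetical, and the sketch simply drops the first `n_lags` time steps, which have no complete history.

```python
import numpy as np

def make_patterns(X, y, n_lags, assoc):
    """Build lagged patterns and event-association labels.

    X: (T, n) array of observations; y: (T,) array of 0/1 event flags.
    Returns the patterns, the forward-looking labels and the time
    indexes that were kept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    # The first n_lags points lack a full history, so they are skipped.
    times = np.arange(n_lags, len(X))
    # Pattern at time t is (x_t, x_{t-1}, ..., x_{t-n_lags}), flattened.
    patterns = np.hstack([X[times - k] for k in range(n_lags + 1)])
    # Label is 1 if the event occurs anywhere in the window [t, t + assoc].
    # Near the end of the series the window is truncated.
    labels = np.array([y[t:t + assoc + 1].max() for t in times])
    return patterns, labels, times
```

With hourly data, `make_patterns(X, y, n_lags=4, assoc=24)` would pair each reading with the four preceding readings and mark it positive if an event occurs within the following 24 hours.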

2.2 Methodology

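The standard approach to approximating this prediction rule would be to split the data into training and testing sets, $(X_{train}, Y_{train})$ and $(X_{test}, Y_{test})$, where $X_{train}$ is a subset of the pattern set $\hat{X}$, created by randomly sampling a proportion $p$ of the objects in $\hat{X}$ without replacement. The remaining proportion $1 - p$ constitutes the testing set, as shown in Equation 6. The parameter $p$ determines the ratio of training to testing data and typically ranges from 0.7 to 0.9.

$$X_{train} \subset \hat{X}, \qquad |X_{train}| = p\,|\hat{X}|, \qquad X_{test} = \hat{X} \setminus X_{train} \tag{6}$$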

Once we have sampled a training and testing set, we would build a classifier, a machine learning algorithm for approximating the prediction rule in a classification problem. This classifier would find the prediction rule $h$, as defined in Equation 5, which best maps the training data $X_{train}$ to their associated labels $Y_{train}$. There are multiple ways to evaluate a prediction rule. Some popular metrics include mean absolute prediction error (MAPE) and mean squared prediction error (MSPE). In the case of binary classification, these two metrics coincide. Equation 7 shows the MAPE of a given prediction rule $h$ on a training set, where $\hat{y}_t$ is the label of the pattern $\hat{x}_t$. Fitting a classifier to a training set means finding the function $h$ which minimises the function in Equation 7.

$$L(h, X_{train}) = \frac{1}{|X_{train}|} \sum_{\hat{x}_t \in X_{train}} \left| h(\hat{x}_t) - \hat{y}_t \right| \tag{7}$$
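The statement that MAPE and MSPE coincide for binary classification follows in one step. The short derivation below writes $h(\hat{x}_t)$ for a hard 0/1 prediction and $\hat{y}_t$ for the 0/1 label of a pattern; this notation is assumed here for illustration. The equality relies on predictions being hard labels and would not hold for probabilistic outputs strictly between 0 and 1.

```latex
h(\hat{x}_t),\ \hat{y}_t \in \{0, 1\}
\;\Longrightarrow\;
\left| h(\hat{x}_t) - \hat{y}_t \right| \in \{0, 1\}
\;\Longrightarrow\;
\left( h(\hat{x}_t) - \hat{y}_t \right)^2 = \left| h(\hat{x}_t) - \hat{y}_t \right|

\text{MSPE}
= \frac{1}{N} \sum_{t} \left( h(\hat{x}_t) - \hat{y}_t \right)^2
= \frac{1}{N} \sum_{t} \left| h(\hat{x}_t) - \hat{y}_t \right|
= \text{MAPE}
```

Both quantities therefore equal the misclassification rate, the fraction of the $N$ patterns that are labelled incorrectly.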

Different classifiers make different assumptions about the prediction rule in order to simplify this optimisation problem, resulting in each classifier finding a different optimal prediction rule. In order to assess which classifier produced the best prediction rule, we would test the accuracy of each classifier’s prediction rule by using it to predict the labels of the testing data $X_{test}$ and comparing these predicted labels to the true labels $Y_{test}$. The classifier which performs best at predicting labels for this unseen data is assumed to have best located the features of the training data which indicate its labels.

To ensure this result is reliable we could then re-sample the training and testing sets a number of times and repeat the test. This process is called K-fold cross validation (CV). To prepare our data for K-fold CV we must first randomly sample from our data, without replacement, into $K$ equally sized folds. We define $F_k$ as the $k$-th fold for $k = 1, \dots, K$ and $F$ as the set of all folds, as shown in Equation 8.

$$F = \{F_1, \dots, F_K\}, \qquad \bigcup_{k=1}^{K} F_k = \hat{X}, \qquad F_j \cap F_k = \emptyset \ \ (j \neq k), \qquad |F_k| = \frac{|\hat{X}|}{K} \tag{8}$$

In K-fold CV we iteratively select one of our folds $F_k$ to serve as a testing set. The remaining $K - 1$ folds are used to train a classification model, which as we have seen means finding the prediction rule which minimises the loss function $L$ shown in Equation 7. Equation 9 shows the optimal prediction rule $h_k$ for labeling the data points from all of the folds excluding $F_k$.

$$h_k = \arg\min_{h} L\Bigl(h, \bigcup_{j \neq k} F_j\Bigr) \tag{9}$$

We can then measure the accuracy of this prediction rule on the remaining fold, which was reserved for testing. This process is continued for all $K$ iterations, leaving us with a MAPE for each testing set. The MAPE given by K-fold CV, shown in Equation 10, is a metric ranging from 0 to 1 and is the average of the MAPE for each of the testing sets; here $L$ is the MAPE of Equation 7 and $h_k$ is the prediction rule of Equation 9, trained on all folds except $F_k$. For an example of this form of K-fold CV used to compare the performance of classifiers see [zhang1999artificial].

$$CV(F) = \frac{1}{K} \sum_{k=1}^{K} L(h_k, F_k) \tag{10}$$

As our data is time series it is not advisable to shuffle our data. There are two reasons why it is important to preserve chronology in our data sets.

Firstly, for ease of interpretability. Once a model is trained we intend to test it on an unseen data set. If that data set is in chronological order we would expect to see the predicted probability of the event occurring increasing as we approach the event and to stay high during it. If the data was in a random order it would be more difficult to determine if our classifier performs well as a predictive maintenance model.

Secondly, if the data set is randomly sampled into training and testing then for a given observation $\hat{x}_t$ in your testing set it is likely that either $\hat{x}_{t-1}$ or $\hat{x}_{t+1}$ is in your training set. In many systems variables do not vary drastically over a single time interval. Because of this the observation $\hat{x}_t$ will likely be very similar to the observation $\hat{x}_{t-1}$. This means that if data is randomly sampled into training and testing then your testing set is not truly unseen when training the classifier, as for each point in the testing set there is likely a very similar point in the training set. This will result in a classifier with very high predictive accuracy on the testing set but which would be of very little use as a predictive maintenance model.

K-fold CV can also be performed without shuffling our data. This time we equally partition our data, without shuffling it, into $K$ folds $\bar{F}_1, \dots, \bar{F}_K$ of consecutive time indexes as shown in Equation 11. We then proceed to perform K-fold CV as shown in Equation 10, using $\bar{F}$ in place of $F$. This method keeps the mixing of neighbouring observations between training and testing sets to a minimum. It is inevitable that, for the data point at the beginning of the block reserved for testing, the point preceding it will be in the training set. However, this is the only point for which this can occur, a vast improvement on random sampling.

$$\bar{F}_k = \Bigl\{\hat{x}_t : \tfrac{(k-1)\,|T|}{K} < t \le \tfrac{k\,|T|}{K}\Bigr\}, \qquad k = 1, \dots, K \tag{11}$$
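The difference between shuffled and unshuffled folds can be made concrete with a small sketch. The helper names below are hypothetical, and time indexes are taken to start at 0. `leaked_neighbours` counts how many test points have their immediate predecessor in the training set, which is the leakage discussed above.

```python
import numpy as np

def contiguous_folds(n_points, k):
    """Partition time indexes 0..n_points-1 into k consecutive blocks."""
    return np.array_split(np.arange(n_points), k)

def shuffled_folds(n_points, k, seed=0):
    """Partition the same indexes into k folds after a random shuffle."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_points), k)

def leaked_neighbours(folds, test_fold):
    """Count test points whose immediate predecessor is a training point."""
    test = set(folds[test_fold].tolist())
    train = set(np.concatenate(
        [f for i, f in enumerate(folds) if i != test_fold]).tolist())
    return sum((t - 1) in train for t in test)
```

For a contiguous test fold, at most one test point (the first in the block) has its predecessor in the training set, whereas with shuffled folds most test points do.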

There are problems with partitioning our entire data set for using K-fold CV:

Firstly, due to the fact that the events we are trying to predict occur very infrequently, our classification problem has a large class imbalance. In classification the set of data points or in our case patterns associated with each label is referred to as a class. In our classification problem we have two classes, the patterns associated with the event and the patterns which are not. Because the events are very infrequent, the vast majority of the patterns are not associated to the event. A large class imbalance often results in models which predict labels for the majority class much better than for the minority class. In scenarios such as ours the majority class can constitute over 99% of the total data.

Secondly, due to this class imbalance it is likely when splitting the data set into folds for K-fold CV that some folds will not contain any events. This means that when performing CV some of the testing sets will only contain one class of data: patterns which are not associated to the event. Classifiers that accurately predict when the event will not occur will therefore score well on these rounds of CV, regardless of how well they predict the event itself. This introduces a bias towards predicting when the event will not occur.

A final drawback to using all of the data provided is the inclusion of patterns immediately following an event. In the time period immediately following an event you would not expect the system to have returned to normal operation. This could be a result of changes in the system directly caused by the event, or of changes caused by the maintenance required to remedy the damage caused by the event. Patterns during this time are labeled as unrelated to the event. While these patterns are not typically indicators of the event, neither are they indicators that the system is in a stable state, as their label suggests. It is advisable not to include these patterns in the data we use to conduct experiments.

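In order to avoid the issues mentioned previously we sample our data into a number of blocks equal to the number of events that occur in our data set. Each block is of an equal size $b$ and consists of the end of the event together with the preceding time intervals, as defined in Equation 12, where $e_k$ denotes the time index at which the $k$-th event ends and $K$ is the number of events.

$$B_k = \{\hat{x}_t : e_k - b < t \le e_k\}, \quad k = 1, \dots, K, \qquad B = \{B_1, \dots, B_K\} \tag{12}$$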

These blocks constitute our new data set, on which we can perform K-fold CV. This sampling avoids all of the problems mentioned previously: it reduces the class imbalance drastically, as for a reasonable value of the block size $b$ we leave out a large portion of the data points where the event does not occur; it guarantees that there is an event in each fold, meaning there is no additional bias towards predicting when an event will not occur; and patterns which follow an event are left out.

$$h_k = \arg\min_{h} L\Bigl(h, \bigcup_{j \neq k} B_j\Bigr) \quad \text{for each block } B_k \text{ selected iteratively} \tag{13}$$

In Section 3 of this paper we will use K-fold CV as defined in Equation 10 using the blocks defined in Equation 12 in order to compare the performance of classifiers.
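The block sampling around events can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: `event_blocks` and `block_len` are hypothetical names, an event is taken to end at the last time step of a run of positive labels, and the merging of events that occur shortly after one another is not handled.

```python
import numpy as np

def event_blocks(y, block_len):
    """Return one index array per event: the block_len time steps
    that end at the last time step of that event."""
    y = np.asarray(y, dtype=int)
    # An event ends at t when y[t] == 1 and the next label is 0
    # (or the series ends).
    following = np.append(y[1:], 0)
    event_ends = np.flatnonzero((y == 1) & (following == 0))
    blocks = []
    for end in event_ends:
        start = max(0, end - block_len + 1)  # clip at the start of the record
        blocks.append(np.arange(start, end + 1))
    return blocks
```

Each returned block contains exactly one event at its end, so every CV fold built from these blocks is guaranteed to contain both classes as long as the block is longer than the event.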

Most classifiers have hyperparameters, parameters which must be selected before the algorithm begins and which remain unchanged while the algorithm runs. This means that the prediction rules of these classifiers are a function of the hyperparameters $\lambda$ as well as the data point $\hat{x}_t$. It is common to use preselected values for these hyperparameters. These values are chosen to give good results across a wide range of data sets. Tuning these hyperparameters can result in a significant increase in the performance of these classifiers.

A common method for tuning these hyperparameters is using gridsearch together with K-fold CV. Whereas previously we used K-fold CV to select the best classifier, here we use it to select the best set of hyperparameters for optimising the classifier’s performance. This means that when we wish to compare the performance of multiple classifiers whilst tuning their hyperparameters, we again use the K-fold CV method shown in Equation 10. However, instead of simply training the model on the remaining K-1 folds, as shown in Equation 9, we perform an additional round of CV to find the best set of hyperparameters, as shown in Algorithm 1.

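In Algorithm 1, $B$ is as defined in Equation 12. $\Lambda$ is the hyperparameter grid, the grid of all combinations of hyperparameter selections you want to test. $\lambda^*_k$ is the hyperparameter selection which gives the best results over the data set $B \setminus B_k$, which is all folds except for $B_k$. Each $\lambda$ could be a vector of selections of multiple hyperparameters. $\epsilon^*_k$ is the MAPE associated with $\lambda^*_k$. $E$ is the MAPE given by the overarching K-fold CV with hyperparameter selection.

Result: K-Fold CV MAPE $E$ with hyperparameter tuning using gridsearch
for $B_k \in B$ do
      $\epsilon^*_k \leftarrow \infty$
      for $\lambda \in \Lambda$ do
            $\epsilon \leftarrow$ K-fold CV MAPE (Equation 10) on $B \setminus B_k$ with hyperparameters $\lambda$
            if $\epsilon < \epsilon^*_k$ then
                  $\epsilon^*_k \leftarrow \epsilon$
                  $\lambda^*_k \leftarrow \lambda$
            end if
      end for
      $E_k \leftarrow$ MAPE on $B_k$ of the classifier trained on $B \setminus B_k$ with hyperparameters $\lambda^*_k$
end for
$E \leftarrow \frac{1}{K} \sum_{k=1}^{K} E_k$
Algorithm 1 Nested K-Fold CV For Model Selection Using Gridsearch For Hyperparameter Selection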
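A nested cross validation loop of this kind can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `nested_cv`, `make_model` and `grid` are hypothetical names, the inner round of CV is assumed to be leave-one-block-out over the remaining blocks, and the model only needs `fit(X, y)` (returning the fitted object) and `predict(X)`, an interface that scikit-learn estimators satisfy. The removal of in-event points from the testing folds is omitted for brevity.

```python
from itertools import product
import numpy as np

def mape(model, X, y):
    """Mean absolute prediction error of hard 0/1 predictions."""
    return float(np.mean(np.abs(model.predict(X) - y)))

def nested_cv(blocks, make_model, grid):
    """blocks: list of (X, y) pairs, one per event block (at least three).
    make_model(**params) returns an object with fit and predict.
    grid: dict mapping hyperparameter name -> list of candidate values.
    Returns the outer-CV MAPE and the hyperparameters chosen per fold."""
    def stack(parts):
        return (np.vstack([X for X, _ in parts]),
                np.concatenate([y for _, y in parts]))

    def cv_error(parts, params):
        # Inner CV: hold out each remaining block in turn.
        errs = []
        for i, (X_val, y_val) in enumerate(parts):
            X_tr, y_tr = stack(parts[:i] + parts[i + 1:])
            model = make_model(**params).fit(X_tr, y_tr)
            errs.append(mape(model, X_val, y_val))
        return float(np.mean(errs))

    candidates = [dict(zip(grid, values))
                  for values in product(*grid.values())]
    outer_errs, chosen = [], []
    for k, (X_test, y_test) in enumerate(blocks):
        inner = blocks[:k] + blocks[k + 1:]
        # Grid search: keep the candidate with the lowest inner-CV error.
        best = min(candidates, key=lambda p: cv_error(inner, p))
        X_tr, y_tr = stack(inner)
        model = make_model(**best).fit(X_tr, y_tr)
        outer_errs.append(mape(model, X_test, y_test))
        chosen.append(best)
    return float(np.mean(outer_errs)), chosen
```

Because the hyperparameters are chosen using only the blocks outside the current testing block, the outer error is not biased by the tuning step. The cost is one inner CV run per grid point per outer fold, which grows quickly with the grid size.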

Our aim is to build models capable of predicting if an event is likely to occur in the next $a$ time intervals, where $a$ is the event association defined in Equation 4. We are not interested in whether our models can accurately predict if an event is currently occurring. For this reason we remove all data points whose original label $y_t$ is 1 from our CV folds when we use them for testing. Equation 14 defines $\tilde{B}_k$, the CV fold $B_k$ with the data points recorded during the event removed.

$$\tilde{B}_k = \{\hat{x}_t \in B_k : y_t = 0\} \tag{14}$$

We may also choose to include or exclude the events from our training folds. In excluding the event from all folds we are performing our K-fold CV on the folds $\tilde{B}_k$ with the event removed, as opposed to using the blocks $B_k$ defined in Equation 12. By including the event we perform our K-fold CV on the blocks $B_k$ and at each iteration remove the event only from the fold reserved for testing. Including the event in the folds used for training makes the assumption that the behavior of the system during the event is indicative of its behavior leading up to the event, so including these data points increases the number of data points associated to the event, helping to reduce our large class imbalance and improving the strength of our classifiers. Excluding the event makes the assumption that the behavior of the system during the event is independent of its behavior leading up to the event, meaning including this data would only serve to add noise to our minority class, resulting in weaker classifiers. For this reason, choosing whether to include or exclude the event from the training sets is case specific. Preliminary experimentation or in-depth knowledge of the system and event is required to make this decision.

In our experiments we will not use MAPE for our final comparison of the classifiers’ performance on the various testing sets. This is due to the fact that when a data set is highly imbalanced MAPE has a strong bias towards models which predict the majority class well. In our case this would lead to a bias towards models which accurately predict when the event is not going to happen over models which accurately predict when it will. To avoid this bias we will score our models’ performance on the testing set in each iteration of K-fold CV using area under the receiver operating characteristic curve (AUROC) scoring. “This is a measure of discrimination power without regard to class distribution or misclassification cost.” [brown2012experimental] AUROC scoring does not penalise models for being conservative in predicting the minority class as long as they are very accurate in predicting the majority class.
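The bias of MAPE on imbalanced data, and the way AUROC avoids it, can be illustrated with a short pure-Python sketch. The `auroc` function below computes the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half; this is the same quantity that `sklearn.metrics.roc_auc_score` reports. The helper names and the toy data are assumptions made for illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs both classes in the test set")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mape(predictions, labels):
    """Mean absolute prediction error of hard 0/1 predictions."""
    return sum(abs(p - l) for p, l in zip(predictions, labels)) / len(labels)

# 99 normal readings and a single pre-event reading.
labels = [0] * 99 + [1]
always_negative = [0] * 100          # never predicts the event
print(mape(always_negative, labels))   # error of only 1%
print(auroc(always_negative, labels))  # 0.5, no better than chance
```

A model that never predicts the event looks excellent under MAPE but scores 0.5 under AUROC. The `ValueError` branch also shows why every testing fold must contain both classes: AUROC is undefined on a fold without events.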

As we are not trying to predict the event itself, it may seem to make little sense to include the events in the training of our model. Including the events in the training makes the assumption that the behavior of the variables during the event is very similar to their behavior just before the event. If this is the case then training on the events in the training set may help classify data points lying shortly before the event in the testing set. For this reason, while the event should not be used for testing, in some instances it may be advisable to use the events for training.

3 Experiments

Both of our data sets have been provided by DAS Ltd, a consultancy group based in Bristol with expertise in engineering and energy.

Andigestion is a UK based anaerobic digestion company producing renewable energy for the national grid through the combustion of biogas obtained from natural waste. DAS have been working with Andigestion with the aim of finding a solution to the problem of foam formation in anaerobic digestion. A time series set of sensor data has been provided by an Andigestion AD plant based in Holsworthy, UK. The data set consists of 14617 hourly readings of 9 numeric variables. This amounts to 20 months of runtime from December 1st 2015 to July 31st 2017. Over this period there were 5 distinct foaming events. This data set can be analysed using the framework laid out in Section 2.2. As detailed in Section 2.2 we break this data set up into 5 blocks, one for each event, each of 1000 readings comprising data from before the event and ending with the event itself. In real time each of these blocks represents slightly over a month of sensor readings before an event followed by the sensor data for the event itself.

The data set for condenser fouling was provided by DAS and was collected from a UK civil nuclear plant. The water used to turn the turbine is cooled in a sea water condenser. Our aim is to predict when a defect is beginning to form in the condenser before the sea water is able to enter the system, or to catch the defect in its infancy. This data set consists of 30664 readings of 14 numeric variables taken at 3 hour intervals. This amounts to 10 years of data from 2009 to 2019. Over this period there were 10 recorded fouling events. However, only 6 of these events can be considered independent, as the remaining 4 occurred shortly after another fouling event. This data can also be analysed using the framework laid out in Section 2.2. As detailed in Section 2.2 we break this data set up into 6 blocks, one for each event, each of 250 readings comprising data from before the event and ending with the event itself. In real time each of these blocks represents slightly over a month of sensor readings before an event followed by the sensor data for the event itself.

The structure of these two problems is strikingly similar: we have a large amount of time series data consisting of regular readings of operating conditions, and we have a negative event occurring at unpredictable time intervals. Figure 1 shows two time lines, the top time line showing incidents of condenser fouling from 2009 to 2019 and the bottom time line showing incidents of foaming between December 2015 and July 2017. The two expanded sections show the behavior of the variables in the run up to an event. Attempts made by DAS to interpret the behavior of the variables in the time period preceding an event have been unsuccessful, as there are no clear trends or patterns which commonly precede an event.

Figure 1: Scaled Condenser Variables (Top) and Anaerobic Digester Variables (Bottom)

These experiments were run using Python 3.7 on Iridis 4, a high performance computing cluster at the University of Southampton. In these experiments we evaluate 10 popular classifiers using the framework laid out in Section 2.2. The following classifiers will be tested: Radial Basis Function Kernel Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron (NN), Logistic Regression (LR), AdaBoost (AB), K-Nearest Neighbours (KNN), Decision Tree (DT), Gaussian Naive Bayes (GNB), Quadratic Discriminant Analysis (QDA) and Gradient Boosting (GB). Each of them is implemented using the Python package Sklearn. Due to the need to sometimes include the event in the training of the models but exclude it from the testing, it is not straightforward to use the Sklearn function cross_val_score for k-fold CV or GridSearchCV for hyperparameter tuning. Therefore we instead use the method laid out in Algorithm 1 in Section 2.2.

The scaling of variables can be important for the good performance of many classifiers. Min-max scaling is commonly employed to restrict each variable to a range between 0 and 1. The scaler was fit on the training set and then used to scale the data from both the training and testing sets. This was implemented using the function MinMaxScaler from the Python package sklearn.
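The point of fitting the scaler on the training set only can be shown with a small NumPy sketch that mirrors what a fit-then-transform min-max scaler does. The function names here are hypothetical; the treatment of constant columns is an assumption added to avoid division by zero.

```python
import numpy as np

def fit_min_max(X_train):
    """Learn per-variable minimum and range from the training set only."""
    X_train = np.asarray(X_train, dtype=float)
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return lo, span

def apply_min_max(X, lo, span):
    """Scale any data set with the statistics learnt from the training set."""
    return (np.asarray(X, dtype=float) - lo) / span

lo, span = fit_min_max([[0, 10], [5, 20], [10, 30]])
print(apply_min_max([[20, 5]], lo, span))  # test values may fall outside [0, 1]
```

Scaled test values can fall outside the range 0 to 1 when the testing data exceed the extremes seen in training. This is expected: refitting the scaler on the testing data would leak information about the unseen data into the model.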

Sampling can be used to address the class imbalance in our data sets. In our experiments we will test the effects of under and over sampling our training sets before constructing classifiers and testing them on the un-sampled testing sets. When under sampling we sample a subset of our majority class of the same cardinality as our minority class. When over sampling we sample, with replacement, a set from our minority class that has the same cardinality as our majority class. This is implemented using the functions RandomUnderSampler and RandomOverSampler from the Python package imblearn.
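The two sampling schemes can be sketched directly in NumPy. The sketch below follows the description in the text rather than reproducing the internals of imblearn; the function name `random_resample` and its `how` and `seed` arguments are hypothetical.

```python
import numpy as np

def random_resample(X, y, how, seed=0):
    """Balance a binary training set by random under or over sampling."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    minority, majority = sorted((idx0, idx1), key=len)
    if how == "under":
        # Keep a random subset of the majority class, without replacement.
        majority = rng.choice(majority, size=len(minority), replace=False)
    elif how == "over":
        # Draw minority points with replacement up to the majority size.
        minority = rng.choice(minority, size=len(majority), replace=True)
    else:
        raise ValueError("how must be 'under' or 'over'")
    keep = np.sort(np.concatenate([minority, majority]))
    return X[keep], y[keep]
```

Only the training folds should be passed through such a function. The testing folds keep their original class balance so that the reported scores reflect the conditions the model would face in operation.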

Figures 4 and 5 in Section 5 show the process of the experiments we conduct. This includes the use of scaling and sampling techniques.

For each data set we will evaluate each classifier using the k-fold CV method laid out in Section 2.2. We will then attempt to use gridsearch for hyperparameter tuning, using a grid of 100 hyperparameter combinations, for the three best performing classifiers. We will also evaluate sampling techniques using the Python package Imblearn. We will evaluate the performances of the classifiers with the following event associations: 12 hours, 24 hours, 36 hours, 48 hours and 96 hours. Choosing the most appropriate event association will depend firstly on how much warning you require to perform maintenance to prevent the event from occurring, and secondly on which event association yields the most accurate predictions. The three best performing classifiers will be determined by their average rank among the other classifiers for each event association size.

3.1 AD Foaming

It was determined through preliminary experiments that including the event in the training of classifiers increased the predictive accuracy of these models. We decided to include in each data point lags of the observations from the previous 4 hours. We then proceeded to test each of the 10 classifiers using 5-fold CV where the event was included in the training set but excluded from the testing set, as outlined in Figure 5. For these experiments preselected values for hyperparameters were used, as determined by Sklearn’s default hyperparameter selection. The results of these experiments are shown in Table 1, where each classifier’s average AUC score over the 5 folds is given for varying event association sizes.

Assoc. Size SVM RF NN LR AB KNN DT GNB QDA GB
12 Hours 0.988 0.978 0.992 0.990 0.843 0.711 0.738 0.981 0.792 0.937
24 Hours 0.971 0.903 0.969 0.967 0.852 0.689 0.735 0.946 0.661 0.870
36 Hours 0.906 0.882 0.900 0.916 0.831 0.668 0.635 0.905 0.682 0.844
48 Hours 0.831 0.766 0.808 0.844 0.638 0.664 0.630 0.843 0.596 0.708
96 Hours 0.602 0.750 0.729 0.536 0.486 0.650 0.586 0.690 0.663 0.667
Table 1: Classifier Comparison
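The per-fold AUC averaging behind a table like this can be sketched as follows. This is an illustrative outline only: the data are synthetic, the classifier is a single default-hyperparameter Gaussian Naive Bayes model, and scikit-learn's LeaveOneGroupOut is used as an assumed stand-in for the block-wise CV of Section 2.2. The special handling of the event itself in training and testing is not modelled.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: 5 blocks of 60 observations, the last 12 of each positive.
rng = np.random.default_rng(0)
n_blocks, block_len, n_pos = 5, 60, 12
groups = np.repeat(np.arange(n_blocks), block_len)
y = np.tile(np.r_[np.zeros(block_len - n_pos), np.ones(n_pos)], n_blocks).astype(int)
X = rng.normal(size=(n_blocks * block_len, 4))
X[:, 0] += 2.0 * y  # make the first feature informative

fold_auc = []
for train, test in LeaveOneGroupOut().split(X, y, groups):
    # Train on four blocks, score the held-out block.
    model = GaussianNB().fit(X[train], y[train])
    scores = model.predict_proba(X[test])[:, 1]
    fold_auc.append(roc_auc_score(y[test], scores))

mean_auc = float(np.mean(fold_auc))
```

Each entry of `fold_auc` plays the role of one fold, and `mean_auc` corresponds to a single cell of the table.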

A general theme in Table 1 is that foaming becomes more difficult to predict as the event association size grows. In other words, the more warning you require, the less accurately the model can provide it. The best performing classifier is the Multilayer Perceptron, which had an average rank of 2.6 over the 5 event association sizes. Radial Basis Function Kernel Support Vector Machine, Logistic Regression and Gaussian Naive Bayes all performed equally well, each having an average rank of 3.2. The Gaussian Naive Bayes classifier only outperformed the other classifiers on the larger event association sizes, where all classifiers performed poorly. For this reason, the classifiers whose hyperparameters we tune and on which we test sampling techniques are Multilayer Perceptron, Radial Basis Function Kernel Support Vector Machine and Logistic Regression.
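The average ranks quoted above can be checked with a few lines of Python. The AUC values below are transcribed from Table 1; the helper function is a hypothetical illustration and assumes there are no tied scores within a row, which holds for this table.

```python
classifiers = ["SVM", "RF", "NN", "LR", "AB", "KNN", "DT", "GNB", "QDA", "GB"]
auc = {
    "12 Hours": [0.988, 0.978, 0.992, 0.990, 0.843, 0.711, 0.738, 0.981, 0.792, 0.937],
    "24 Hours": [0.971, 0.903, 0.969, 0.967, 0.852, 0.689, 0.735, 0.946, 0.661, 0.870],
    "36 Hours": [0.906, 0.882, 0.900, 0.916, 0.831, 0.668, 0.635, 0.905, 0.682, 0.844],
    "48 Hours": [0.831, 0.766, 0.808, 0.844, 0.638, 0.664, 0.630, 0.843, 0.596, 0.708],
    "96 Hours": [0.602, 0.750, 0.729, 0.536, 0.486, 0.650, 0.586, 0.690, 0.663, 0.667],
}

def average_ranks(names, table):
    """Rank classifiers within each row (1 = highest AUC) and average over rows."""
    totals = {name: 0 for name in names}
    for scores in table.values():
        order = sorted(range(len(names)), key=lambda i: scores[i], reverse=True)
        for rank, idx in enumerate(order, start=1):
            totals[names[idx]] += rank
    return {name: totals[name] / len(table) for name in names}

avg_rank = average_ranks(classifiers, auc)
best = min(avg_rank, key=avg_rank.get)
```

With these inputs the Multilayer Perceptron (NN) has the lowest average rank, 2.6, while SVM, LR and GNB each average 3.2.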

We implemented hyperparameter tuning using the 5-fold CV with hyperparameter tuning method described in Section 2.2. We tune each classifier using grid search over a grid of 100 hyperparameter combinations. Table 2 shows the hyperparameter values used in this grid search, and Table 3 shows the results of hyperparameter tuning on our best performing classifiers.

SVM for
and for
NN Layer Size 10, 25, 50, 100
Number of Layers 1, 2, 3
Learning Rate constant, adaptive
for
LR log range of 50 values from to
Penalty l1, l2
Table 2: Hyperparameter Values

Assoc. Size SVM SVM* NN NN* LR LR*
12 Hours 0.988 0.963 0.992 0.962 0.990 0.962
24 Hours 0.971 0.954 0.969 0.962 0.967 0.907
36 Hours 0.906 0.743 0.900 0.944 0.916 0.844
48 Hours 0.831 0.735 0.808 0.821 0.844 0.795
96 Hours 0.602 0.618 0.729 0.610 0.536 0.599
Table 3: Tuned Classifiers Compared to Un-tuned Classifiers
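The tuning procedure can be sketched for the Logistic Regression row of Table 2, where 50 log-spaced values combined with the two penalties give 100 combinations. This is a sketch under stated assumptions rather than the authors' code: the endpoints of the log range (1e-3 and 1e3), the parameter being the regularisation strength C, the liblinear solver, the synthetic data and the use of LeaveOneGroupOut for both the outer and inner splits are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut, ParameterGrid

# 50 log-spaced values x 2 penalties = 100 combinations.
# The endpoints 1e-3 and 1e3 are assumed for illustration.
param_grid = {"C": np.logspace(-3, 3, 50), "penalty": ["l1", "l2"]}
n_combinations = len(ParameterGrid(param_grid))

# Synthetic stand-in: 5 blocks of 40 observations, the last 10 of each positive.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(5), 40)
y = np.tile(np.r_[np.zeros(30), np.ones(10)], 5).astype(int)
X = rng.normal(size=(200, 3))
X[:, 0] += 1.5 * y

tuned_auc = []
for train, test in LeaveOneGroupOut().split(X, y, groups):
    # Inner search sees only the four training blocks.
    search = GridSearchCV(
        LogisticRegression(solver="liblinear"),  # liblinear supports l1 and l2
        param_grid,
        scoring="roc_auc",
        cv=LeaveOneGroupOut(),
    )
    search.fit(X[train], y[train], groups=groups[train])
    # The refitted best model is scored on the held-out block.
    scores = search.predict_proba(X[test])[:, 1]
    tuned_auc.append(roc_auc_score(y[test], scores))
```

Because the hyperparameters are chosen using only the training blocks, a selection that looks best there can still perform worse on the held-out block, which is the effect examined in Table 3.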

In most cases, tuning hyperparameters using grid search did not result in higher performance, and where tuning did improve performance it was not by a significant margin. This suggests that the tuned hyperparameters were overfit to the training set: the hyperparameter selection that gave the best predictive accuracy on 4 of the 5 foaming events will not necessarily give good predictive accuracy on the 5th.

Table 4 shows the effects of under- and over-sampling on our best performing classifiers.

Assoc. Size SVM-Total SVM-Under SVM-Over NN-Total NN-Under NN-Over LR-Total LR-Under LR-Over
12 Hours 0.988 0.970 0.966 0.992 0.978 0.975 0.990 0.980 0.987
24 Hours 0.971 0.925 0.950 0.969 0.942 0.892 0.967 0.954 0.970
36 Hours 0.906 0.891 0.913 0.900 0.896 0.803 0.916 0.907 0.908
48 Hours 0.831 0.846 0.838 0.808 0.784 0.743 0.844 0.821 0.796
96 Hours 0.602 0.650 0.527 0.729 0.590 0.733 0.536 0.498 0.504
Table 4: Sampling Technique Comparison

It is clear that, for almost all event associations, sampling techniques do not improve the performance of the Multilayer Perceptron classifier or the Logistic Regression classifier, and where they do improve performance it is by a very slim margin. For event associations larger than 24 hours, the performance of the Support Vector Machine classifier is marginally improved by the use of sampling techniques. However, the SVM classifier’s performance is better without sampling techniques for event associations of 12 and 24 hours, and its performance seems to be quite poor for event associations larger than 24 hours despite the slight improvements gained from sampling.

In light of these experiments, an event association of 24 hours results in the most accurate models whilst giving sufficient warning to carry out the maintenance necessary to prevent foaming. At the event association of 24 hours, the best performing model is the SVM classifier, which has an average AUROC score of 0.971 over the 5 folds of our CV testing. Figure 2 shows the testing of the various SVM classifiers trained as part of our CV experiment. Each graph shows a model’s predictions when testing on one of the blocks, where the other 4 blocks have been used for training. Beneath each plot is the associated ROC plot for those predictions.

Figure 2: Best performing predictive maintenance model’s predictions on unseen foaming data blocks with their associated ROC plots below.

3.2 Condenser Fouling

It was determined through preliminary experiments that excluding the event from the training of classifiers increased the predictive accuracy of these models. There are a number of missing values in this data set. These were replaced with the mean value of that variable within that block. We decided to include in each data point the lag of the previous observation, which corresponds to including the values of the variables 3 hours before the time in question. We then tested each of the 10 classifiers using 6-fold CV, where the event itself was excluded from both training and testing, as outlined in Figure 5. For these experiments, hyperparameters were left at Sklearn’s default values. The results are shown in Table 5, where each classifier’s average AUC score over the 6 folds is given for varying event association sizes.

Assoc. Size SVM RF NN LR AB KNN DT GNB QDA GB
12 Hours (4 Obs.) 0.733 0.601 0.945 0.925 0.626 0.728 0.420 0.809 0.500 0.556
24 Hours (8 Obs.) 0.827 0.629 0.889 0.878 0.725 0.682 0.473 0.790 0.532 0.658
36 Hours (12 Obs.) 0.788 0.651 0.849 0.846 0.717 0.636 0.456 0.749 0.515 0.588
48 Hours (16 Obs.) 0.586 0.630 0.637 0.636 0.748 0.583 0.591 0.812 0.641 0.665
96 Hours (32 Obs.) 0.452 0.520 0.477 0.560 0.676 0.565 0.536 0.701 0.516 0.573
Table 5: Classifier Comparison
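The preprocessing described before Table 5 (block-wise mean imputation followed by a one-observation lag) might look like the following pandas sketch. The column names `block` and `temp`, the helper name, and the choices to compute lags within each block and to drop the first row of every block are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def impute_and_lag(df, feature_cols, block_col="block", n_lags=1):
    """Fill gaps with the per-block mean, then append lagged copies of each feature."""
    out = df.copy()
    # Replace missing values with the mean of that variable within its block.
    out[feature_cols] = out.groupby(block_col)[feature_cols].transform(
        lambda s: s.fillna(s.mean())
    )
    # Lag within each block so values never leak across block boundaries.
    for lag in range(1, n_lags + 1):
        lagged = out.groupby(block_col)[feature_cols].shift(lag)
        out = out.join(lagged.add_suffix(f"_lag{lag}"))
    # Rows without a full lag history are dropped.
    return out.dropna().reset_index(drop=True)

# Hypothetical data: two blocks of three 3-hourly observations each.
df = pd.DataFrame({
    "block": [0, 0, 0, 1, 1, 1],
    "temp": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})
result = impute_and_lag(df, ["temp"])
```

With 3-hourly observations, `n_lags=1` gives each row the values from 3 hours earlier; a larger `n_lags` would extend the history in the same way.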

As in the AD data set, condenser fouling becomes more difficult to predict as the event association size grows. The best performing classifiers are Gaussian Naive Bayes, Logistic Regression and Multilayer Perceptron, with average rankings of 2.6, 3 and 3.2 respectively.

We implemented hyperparameter tuning using the 6-fold CV with hyperparameter tuning method described in Section 2.2. We tune each classifier using grid search over a grid of 100 hyperparameter combinations. The hyperparameter values used for tuning the Logistic Regression and the Multilayer Perceptron classifiers in this grid search can be found in Table 2. For the Gaussian Naive Bayes classifier we only tune a single hyperparameter, variance smoothing, searching over a logarithmic range of 100 values. Table 6 shows the results of hyperparameter tuning on our best performing classifiers.

Assoc. Size NN NN* LR LR* GNB GNB*
12 Hours (4 Obs.) 0.945 0.766 0.925 0.939 0.809 0.734
24 Hours (8 Obs.) 0.889 0.831 0.878 0.633 0.790 0.730
36 Hours (12 Obs.) 0.849 0.794 0.846 0.786 0.749 0.765
48 Hours (16 Obs.) 0.637 0.748 0.636 0.651 0.812 0.768
96 Hours (32 Obs.) 0.477 0.693 0.560 0.680 0.701 0.689
Table 6: Tuned Classifiers Compared to Un-tuned Classifiers
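For the Gaussian Naive Bayes classifier, the single-parameter search over variance smoothing could be set up as in the sketch below. The endpoints of the logarithmic range (1e-9 and 1) are assumed, the data are synthetic, and LeaveOneGroupOut is an assumed stand-in for the block-wise splits; only the count of 100 log-spaced candidates comes from the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.naive_bayes import GaussianNB

# 100 log-spaced candidates; the endpoints 1e-9 and 1 are assumed.
param_grid = {"var_smoothing": np.logspace(-9, 0, 100)}

# Synthetic stand-in: 6 blocks of 30 observations, the last 8 of each positive.
rng = np.random.default_rng(1)
groups = np.repeat(np.arange(6), 30)
y = np.tile(np.r_[np.zeros(22), np.ones(8)], 6).astype(int)
X = rng.normal(size=(180, 3))
X[:, 0] += 1.5 * y

search = GridSearchCV(
    GaussianNB(), param_grid, scoring="roc_auc", cv=LeaveOneGroupOut()
)
search.fit(X, y, groups=groups)
best_smoothing = search.best_params_["var_smoothing"]
```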

As was the case for the AD data set, in most cases tuning the hyperparameters using grid search did not result in higher performance, and where tuning did improve performance it was not by a significant margin. This suggests that the tuned hyperparameters were overfit to the training set: the hyperparameter selection that gave the best predictive accuracy on 5 of the 6 fouling events will not necessarily give good predictive accuracy on the 6th.

We then evaluated the effect of under- and over-sampling on our three best classifiers. The Gaussian Naive Bayes classifier shows no majority class bias, so its performance would not be improved by sampling techniques. Table 7 shows the results.

Assoc. Size NN-Total NN-Under NN-Over LR-Total LR-Under LR-Over GNB
12 Hours (4 Obs.) 0.945 0.925 0.726 0.925 0.811 0.900 0.734
24 Hours (8 Obs.) 0.889 0.798 0.768 0.878 0.858 0.765 0.730
36 Hours (12 Obs.) 0.849 0.765 0.694 0.846 0.781 0.766 0.765
48 Hours (16 Obs.) 0.637 0.612 0.586 0.636 0.673 0.608 0.768
96 Hours (32 Obs.) 0.477 0.475 0.490 0.560 0.514 0.545 0.689
Table 7: Sampling Technique Comparison

It is clear that, for almost all event association sizes, sampling techniques do not improve the performance of the Multilayer Perceptron classifier or the Logistic Regression classifier, and where they do improve performance it is by a very slim margin. In light of these experiments, as was the case for the AD foam prediction problem, an event association of 24 hours results in the most accurate models whilst giving a reasonable amount of time for plant operators to notify the grid that they will be reducing load. At the event association of 24 hours, the best performing model is the MLP classifier, which has an average AUROC score of 0.889 over the 6 folds of our CV testing. Figure 3 shows the testing of the various MLP classifiers trained as part of our CV experiment. Each graph shows a model’s predictions when testing on one of the blocks, where the other 5 blocks have been used for training. Beneath each plot is the associated ROC plot for those predictions.

Figure 3: Best performing predictive maintenance model’s predictions on unseen condenser fouling data blocks with their associated ROC plots below.

4 Conclusion

In this paper we have proposed a framework for formulating a classification problem based on adverse event prediction. In our experiments we have shown that this framework translates well to our real world applications (predicting foaming in anaerobic digesters and predicting condenser fouling in nuclear power plants).

Our results have shown that it is possible to predict both foaming in AD and condenser fouling in civil nuclear power production with reasonable accuracy. This suggests that the framework laid out in Section 2.2 is a useful tool for testing and comparing a variety of popular machine learning techniques in order to create a reliable predictive maintenance model capable of predicting adverse events from time series data. If implemented, the predictive maintenance models derived in this paper could drastically reduce expenditure on anti-foaming agent in the case of AD, and dramatically reduce the cost of lost power generation in the case of condenser fouling in the production of nuclear energy.

Our experiments have shown that hyperparameter tuning did not reliably increase the performance of the models tested on either the condenser fouling data set or the foaming data set. Due to our methodology, we can conclude that tuning hyperparameters on all but one of the events does not guarantee that a model trained with that hyperparameter selection will show improved performance on the remaining event. This drop in performance due to hyperparameter tuning suggests that the selected hyperparameters are overfitting the training set. We also observed that sampling techniques such as over- and under-sampling show little to no improvement in the performance of models on either data set.

4.1 Summary

We were able to create classification models capable of predicting foaming very well. In our CV experiments, our best classifier was able to achieve an average AUROC score of 0.971 over the 5 foaming events. This meant being able to predict each foaming event with a false positive rate of only 3.4%. This suggests that the implementation of this sort of predictive maintenance model could reduce the amount of anti-foaming agent required by 96.6%. The predictions of the best performing classifier over the 5-fold CV as well as their associated ROC graphs can be seen in Figure 2.

We were also able to predict condenser fouling reasonably well, with the best performing classifier achieving an average AUROC score of 0.889 over the 6 fouling events. This corresponds to finding models capable of predicting 5 of the 6 events with good accuracy, with the 6th still being predicted but with a higher rate of false positives. The predictions of this model can be seen in Figure 3.

5 Appendix

Figure 4: Gridsearch Hyperparameter Tuning Process
Figure 5: Experiment Process